SemPubFlow: a novel Scientific Publishing Workflow using Knowledge Graphs, Wikidata and LLMs – the CEUR-WS use case

Tracking #: 3657-4871

Authors: 
Wolfgang Fahl
Tim Holzheim
Christoph Lange
Stefan Decker

Responsible editor: 
Guest Editors KG Gen from Text 2023

Submission type: 
Full Paper
Abstract: 
The CEUR Workshop Proceedings (CEUR-WS) platform has been pivotal in disseminating scientific workshop and conference proceedings since 1995. This paper introduces a paradigm shift towards a semantified, consistent, and FAIR (Findable, Accessible, Interoperable, and Reusable) knowledge graph, emphasizing the critical role of Single Source of Truth (SSoT) and Single Point of Truth (SPoT) in scholarly publishing and reducing the data quality responsibility burden on CEUR-WS editors. Our SemPubFlow approach modernizes the legacy pipeline of manual HTML and PDF content curation by expecting the metadata to be supplied first. It enables the public open source collection of necessary data for event series, events, proceedings, papers, editors, authors, and affiliated institutions directly by the stakeholders of a scientific event as early as possible. The traditional Extract, Transform, Load (ETL) processes that convert existing artifacts into a comprehensive knowledge graph are only needed during the transition to this workflow. The novel approach leverages Large Language Models (LLMs) and the Wikidata knowledge graph, generating the SPoT representing CEUR-WS as the SSoT. This way our methodology not only streamlines the recreation of legacy artifacts but also addresses the “long tail” problem inherent in CEUR-WS's diverse and evolving data. This paper outlines the transition strategy, avoiding a “big bang” approach, to ensure the continuity and integrity of scholarly communication. The resulting solution is efficient in attaining the necessary level of coverage, accuracy and scalability. Data protection issues can easily be overcome in this context since even the personal data is intended to be public. The advancements presented promise to enhance publication processes across various contexts, offering a blueprint for future scholarly publishing infrastructures.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
Anonymous submitted on 02/Apr/2024
Suggestion:
Major Revision
Review Comment:

The paper presents a metadata-first knowledge graph-based approach for the semantification of scientific publications. The authors also investigate the utilisation of LLMs. Overall, as a reader, I see the added benefit of the approach in mitigating data (or scientific paper) obsolescence and supporting FAIR data. However, I find the presented approach difficult to follow. The paper lacks a logical structure, which affects how the content is perceived, and has numerous formatting issues. The language used is also at times informal, and overall the paper reads as a draft version. I think that the paper needs significant revision to be considered for publication. Specific comments:

The introduction presents some relevant information on the topic but reads as disconnected and more like a report. The information does not flow into a full story that presents the motivation for the work. Section 1.3 is quite useful, so it should better explain what each challenge means and why it is a challenge in the first place. This can also be represented as a table. Annotations in Figure 2, such as numbering of the workflow steps, could be beneficial. Currently it is not clear where to start in the graph. Most of the contributions in section 1.4 are clear apart from contributions 1 and 2. These two should be better explained/written. It is not clear what the “15 aspects” on line 49 are. If they are already online, provide a reference; if not, they should be in the paper.

The related work section (and most of the other sections) is missing one or two introductory sentences on what will be presented next. Section 2 needs restructuring, as it currently presents subsections on quite different things (from a semantic publishing challenge to technology such as persistent identifiers). References to and an explanation of Tasks 2 and 3 are missing in section 2.1. An explanation of the first principle of the Den Haag Manifesto is also missing. The paper should be accessible (in terms of understanding) to a wide audience, so such things need to be explained briefly. Each subsection in the related work presents information on a topic, but what do the authors derive or learn from this? Papers are just referenced, and a summary of the papers’ limitations etc. is missing. An overall summary of the related work section is missing as well.

Section 3 starts well and each subsection links to a specific phase in the semantification process. Here again, I am missing the numbering of the stages in Figure 4. Figure 5 presents an overview of the CEUR publishing process, however, it is first presented in section 4. The authors might want to present all relevant information earlier on to help the reader better comprehend their approach. Missing introductions to sections and subsections. Table 3 and its description in the text should be presented earlier on (maybe even in the introduction) as motivation for the work. For some of the formats in the table, the challenges are missing. Not clear where in the paper the description of the authors’ approach begins. This calls for restructuring and better naming of sections and subsections.

Figure 6 could be improved. What does the “agreement” refer to? Agreement to publish, the decision to accept or not, publisher negotiations and agreements etc. Is it the same as “decision”? The publication stages (e.g. camera ready etc.) can be better visualised and annotated as well.
In section 4.4, the notion of “exotic details” reads strangely for a scientific publication and is generally not clear. The same applies to “big bang”. I encourage the authors to use more formal scientific language in their paper. There is also a lot of mixing of future and present tense and a lot of “shall be, might be, will be”, which is better avoided as it makes the reader think the authors are unsure of their work. In section 4.5, when mentioning the legal basis for data processing, the GDPR should be mentioned. After all, these are GDPR bases for lawful data processing and not from the CPPA or any other law.

Section 5 is informative and it is good that the authors provide open access to their implementation. In section 5.2, the figure should be below the text introducing it. The content of section 5.5 should be presented much earlier on in the section as it is after all one of the main contributions of the authors. Unclear what the quality metrics mentioned in the title of section 6 are.

The conclusions are clear, however, the limitations of the work have not been discussed. The authors have provided a detailed plan for future work which showcases their motivation and knowledge of the current state of the developed approach. This should be used to improve the start of section 7. Section 7.2 is unnecessary as the content seems to be more suitable for an appendix in the paper.
More specific comments:
• The paper title is a bit confusing
• Missing references to SSoT, SPoT when first mentioned
• Inconsistent use of abbreviations throughout the paper
• Multiple paragraphs of just 1 sentence – this is not a common good practice
• Inconsistent formatting of the names of technologies and websites (sometimes capitalised, other times not)
• Missing reference to tasks 1 and 2 of the semantic web challenge in section 2.1
• Incorrect referencing format on page 6, line 34
• “ask queries” should be “make queries” or just “query”
• Page 18, line 41 – “is implemented here”- where?
• Missing references to endpoints mentioned on page 19
• Missing reference to Open AI and ChatGPT
• Page 21, line 47 can be on a new page
• Page 22, line 26 - what does “temperature of 0.0” mean?
• Missing reference/footnote to pyLodStorage in section 6.3
• Page 23, line 26 – “where” should be “were”?
• Language Model System -> LLMs? Names of technology etc. should be used consistently in the paper
• Inconsistent formatting of references (see [19][44])

Review #2
By Víctor Rodríguez-Doncel submitted on 08/May/2024
Suggestion:
Major Revision
Review Comment:

This paper presents SemPubFlow, an academic publishing workflow that replaces the manual curation in the publication of the current "CEUR proceedings" with an automated tool-based approach. The workflow leverages Large Language Model Systems and Wikidata to enhance the FAIRness of CEUR by focusing on all relevant core entities early in the process. The paper emphasizes the benefits of having high-quality metadata.

There is not much originality in this work, for any person skilled in the subject matter would have come up with similar solutions. However, the paper is very well explained, and the case has some beauty in its simplicity. Provided that the demo works, the software is better documented, and the insufficiencies of Section 4.5 are overcome, I would suggest that the paper be accepted. Another factor is that the paper is presented by researchers from RWTH Aachen University, the same institution that hosts the CEUR proceedings, so we can presume the model is correct.

* The major problem I have found is that, as of May 2024, the demo at https://cvb.wikidata.dbis.rwth-aachen.de/ was not running for me ("no such table: Proceedings", database error?). This also hampers the evaluation of the work. The paper cannot be accepted until this is fixed!

* The source code was open for both the CEUR semantification and the single-point-of-truth metadata handling (Apache license), and I could easily build the projects, although there were no clear instructions. The README points to an external Wiki with insufficient information. The README.md should describe what the program does, the prerequisites etc.

* The demo is processing personal data, please add a privacy policy similar to https://ceur-ws.org/dsvgo.txt.

* The choice of Wikidata to store information is a bold one, although I see some risks: what if vandalism happens? (e.g. others edit the Wikidata information). The design choice is perhaps acceptable, but contingency measures should be described.

* The effort made now misses some other improvements made in the scientific publishing world. For example, there seems to be no means to publish "executable papers" (scientific publications that combine text, raw data, and code used for analysis in a dynamic and interactive way). Research objects (associated demos and datasets) might be better connected to the research papers.

* The section "4.5. Legal aspects of Publishing Personal Data of Scholars" is bloated and maybe incorrect. GDPR (only mentioned once) describes only six legal bases for processing personal data, not eight as in the text. If German legislation is relevant, it should be mentioned, too. And explaining legal grounds such as "vital interest" is unnecessary in this case. The reference to Art. 89 is good, if better explained; a reference to Art. 14 would also help the reader to better understand other legal angles. Even if data has not been obtained from the data subject because it was public (typically a co-author's), the data subjects have rights. This must be emphasized!

Editorial improvements
---
The paper is well written, discussing the SemPubFlow project in a clear and structured manner. However, I have spotted a couple of errors:

* Repeated "the". Page 3: "The the entities at the core of this work"
* The title in the headers of the pages is far too large. Use \shorttitle.
* Strange punctuation. Page 16 "authority. E.g., when "

Review #3
By Jose Emilio Labra Gayo submitted on 17/Nov/2024
Suggestion:
Reject
Review Comment:

The paper presents SemPubFlow, a proposal for a new workflow to publish CEUR-WS workshop proceedings leveraging Wikidata as a Knowledge Graph infrastructure. Although the paper has several positive points, I think it is not ready for publication at this stage. One of the problems is that the paper has been submitted as a Full paper and I think it doesn’t fulfill the criteria required for this kind of paper.
(1) originality. In my opinion the workflow presented is not quite original and is based on several aspects that have been applied in other projects that try to convert semistructured data or PDFs into semantic content. I think the related work section lacks a proper organization: it should start by reviewing general projects that have attempted this conversion from HTML/PDF to a semantic data portal and then review projects that have attempted to do this in the scholarly/proceedings publishing context. There have also been other projects that use Wikidata as the target for their data, like the GenWiki project. Although I think that the project as a whole is original in its combination of Wikidata + Workshop proceedings domain + semantification + recovering previous data + LLMs + other aspects, those aspects individually are not original, and the authors should at least review and comment on them in the related work section.
(2) significance of the results. This is the aspect that is least clear to me. The results seem quite preliminary and it is not clear how those results could be reproduced or generalized to other contexts. In my opinion, the authors present a workflow which can have some potential, but they don’t indicate the drawbacks or the pros and cons of some of the decisions taken, or the alternatives. One aspect that is partially discussed is the potential problem of vandalism if the data is mainly exported to Wikidata. Even if there is no vandalism, the authors don’t seem to have a contingency plan for cases where some external person or bot edits the Wikidata contents. Although I like the idea of exporting the data to Wikidata and using it as a target, I think the paper doesn’t discuss the possible problems that can arise, either from vandalism or from inconsistencies in the data. Another question that I had when reading the paper is this: given that the authors create the semantification of the content, why don’t they store a local copy of that single source of truth in a semantic format like RDF? The authors suggest the use of JSON for it, but they don’t justify why they do not use a proper semantic format, which could have a proper ontology and be based on semantic web standards. I think this point can be important in a semantic web journal. Another aspect that is not clear in the paper is that there seem to be two different processes involved, which could be treated differently. One process is the semantification of existing or past proceedings, which would require some of those techniques for NER and NEL. A different process could be set up for the next proceedings, where the system could ask the authors and editors to add their metadata in a proper way, even linking directly to their Wikidata ID, or generating one if they don’t have it. When I read the paper, I wondered why the authors don’t do that for future proceedings instead of relying on a semantification process that can create erroneous data.
(3) quality of writing. The paper contains several typos and the structure of the paper is not very clear. As an example, I found the conclusions section quite long, and it includes a subsection for future work which also presents a future framework. Sometimes the paper includes statements that in my opinion are not scientific (I will indicate some of those later), the related work section lacks a clear thread or structure, and some of the decisions are taken for granted without justification.
In my opinion the paper seems more like a resource than a full research paper, so I would suggest that the authors consider resubmitting it in that category.
The authors provide links to source code on GitHub. One minor problem is that they indicate that if someone wants to try the system, they have to ask the authors. I wonder if they could set up a test server so that reviewers could see the system running without having to ask for it.
The experiments provided using LLMs seem quite preliminary to me, and if the paper’s focus is on using LLMs to extract the metadata, maybe the paper should be structured in a different way, giving more emphasis to that aspect. In my opinion, adding LLMs to the process can also raise problems with hallucinations, which the authors don’t mention. If such a system goes to production, I think it could be problematic, as those hallucinations could be difficult to spot.
More detailed comments:
- In the abstract and in the introduction, the authors include the acronyms SSoT and SPoT, which are later redefined. Although I agree with the authors on the importance of these two concepts, it was not clear to me where that single point of truth is, especially because in section 7.1 the authors say that the SPoT will be stored in JSON. Does that mean that they are using a single JSON file? Or is it a JSON-based database like MongoDB? Why not use, in that case, a proper RDF triplestore defining or reusing a proper ontology, which could later be translated to Wikidata?
- Figure 1 seems like the entities of an ontology. However, it seems the authors are not using ontologies at all? Why not? Why not follow some of the practices that have been promoted by the semantic web community in the past years? In my opinion, although I understand that the authors could argue that they export the data to Wikidata, which offers a SPARQL endpoint and RDF serialization, that RDF is different from a truly semantic data model which could contain concepts from one or more OWL ontologies.
- Another aspect that the authors don’t mention is how they represent the provenance of the statements that they publish in Wikidata…I would suggest the use of qualifiers and references which could make the data less exposed to vandalism as those references could be used to verify the provenance.
- I think the authors could define a proper data model of the entities depicted in figure 1 using entity schemas which could help validate the data and could also be used to document the data model presented in that figure.
- Sometimes the authors use dblp in lowercase and other times DBLP
- Page 3, line 43 “The the”
- Page 4, is it necessary to include the FAIR principles at such length? Maybe simplify them, as they have already been presented in a lot of other papers?
- Notice that line 38 says that the metadata should use an established ontology…do the authors follow that principle?
- Page 5, “all 15 aspects”, I think some of the aspects shouldn’t be counted, for example, aspect A1 is subdivided in A1.1, A1.2 so the count shouldn’t include it.
- In my opinion the related work should be rewritten and should include more references from similar publishing institutions as well as references from similar semantification efforts. Some of the entries in the related work are just a list of papers without indicating the relationship with the current paper.
- Page 7. “Part of this work has been reused and extended to a fully fledged parser in the work we are reporting here.”. I would be careful about saying that something is a “fully fledged parser”; maybe a “more complete parser”.
- Section 3.1 starts directly after section 3, which looks a bit strange. I would suggest adding a small introductory paragraph.
- Page 8, “The semantified Metadata Records are now available and may be stored in the format we see fit. JSON and YAML are candidate formats see table 3.”, why JSON or YAML? Why not RDF? Looking at that table, RDF seems to me the best candidate.
- Page 9, “The BeautifulSoup4 Python library is used for lenient HTML parsing as a basis.” This statement and others like it look like decisions that have been taken without any consideration of alternatives. Why that library and not others? Why a Python library?
- Page 10, line 41 contains a closed parenthesis which hasn’t been opened
- Page 12, “Wikidata [41] is a knowledge graph based on an RDF triple store that…” is wrong, Wikidata is not based on RDF. It exports RDF and provides a SPARQL endpoint, but internally it is not based on RDF.
- Page 12, line 22, “to to handle”
- Table 3, “such as OCR for text extraction for content extraction.”
- Page 13, line 21 contains a strange newline for footnote 37.
- The size of figure 5 is very large compared with other figures
- Looking at Figure 6, who uploads the metadata or how is it uploaded to wikidata?
- Page 16 “any more to further effort”
- Page 16, why do you include the “vital aspects” if they are not necessary for this use case?
- Page 17, line 34, “The example shown is for CEUR-WS Volume 3262 Wikidata Workshop 2022” I think it is a different workshop
- Page 18, “and allows to both the …”
- Figure 11, the size is too small compared with other figures.
- Page 20, first paragraph of section 6.1, the numbers are not very well explained. I think the authors could probably present those numbers in a table or something that could help the reader understand them.
- In my opinion the whole LLM experiment seems like a simple experiment that works, but the results seem a bit preliminary to be included in a production setting. Some of the statements in that section look like they would be quite different in a few years…for example the sentence about the cost: “The total cost of the API usage was US$1.07 or 0.4 cents per homepage” looks quite fragile.
- Page 23 “since the documentation of DNB URN-NBN check digit generation is obscure…” I think that statement is not proper for an academic paper. How can the authors qualify the documentation as obscure?
- Page 23, first paragraph of section 6.1, how are those numbers obtained?
- Page 25 “project hat”
- Page 25 “The LLM approach already proven…”