Review Comment:
The paper presents SepPubFlow, a proposal for a new workflow to publish CEUR-WS workshop proceedings leveraging Wikidata as a Knowledge Graph infrastructure. Although the paper has several positive points, I think it is not ready for publication at this stage. One problem is that the paper has been submitted as a Full paper, and I think it does not fulfill the criteria required for that category.
(1) originality. In my opinion the workflow presented is not particularly original and builds on several techniques that have been applied in other projects that convert semi-structured data or PDFs into semantic content. The related work section lacks a proper organization: it should start by reviewing general projects that have attempted this conversion from HTML/PDF to a semantic data portal, and then review projects that have attempted it in the scholarly/proceedings publishing context. There have also been other projects that use Wikidata as the target for their data, such as the GenWiki project. The project as a whole is original in the sense that it combines Wikidata, the workshop proceedings domain, semantification, recovery of previous data, LLMs and other aspects, but those aspects are not original individually, and the authors should at least review and comment on them in the related work section.
(2) significance of the results. This is the aspect that is least clear to me. The results seem quite preliminary, and it is not clear how they could be reproduced or generalized to other contexts. The authors present a workflow which may have some potential, but they do not discuss the drawbacks or the pros and cons of some of the decisions taken, nor the alternatives.

One aspect that is partially discussed is the potential problem of vandalism if the data is mainly exported to Wikidata. Even if there is no vandalism, the authors do not seem to have a contingency plan for cases where some external person or bot edits the Wikidata content. Although I like the idea of exporting the data to Wikidata and using it as a target, the paper does not discuss the possible problems that can arise, either from vandalism or from inconsistencies in the data.

Another question I had when reading the paper: given that the authors create the semantification of the content, why do they not store a local copy of that single source of truth in a semantic format such as RDF? The authors suggest the use of JSON for it, but they do not justify why they do not use a proper semantics-based format, which could have a proper ontology and be based on Semantic Web standards. I think this point can be important in a Semantic Web journal.

Finally, I think there are two different processes involved which could receive different treatment. One process is the semantification of existing or past proceedings, which would require techniques such as NER and NEL; a different process could be set up for future proceedings, where the system could ask authors and editors to add their metadata in a proper way, even linking directly to their Wikidata IDs, or creating one if they do not have it. When I read the paper, I wondered why the authors do not do that for future proceedings instead of relying on a semantification process that can create erroneous data.
(3) quality of writing. The paper contains several typos and its structure is not very clear. As an example, I found the conclusions section quite long; it includes a future work subsection that also presents a future framework. The paper sometimes includes statements that in my opinion are not scientific (I indicate some of them below), the related work section lacks a clear line or structure, and some decisions are taken for granted without justification.
In my opinion the paper reads more like a resource paper than a full research paper, so I would suggest that the authors consider re-submitting it in that category.
The authors provide links to the source code on GitHub. One minor problem is that they indicate that anyone who wants to try the system has to ask the authors. I wonder if they could set up a test server so that reviewers could see the system running without having to ask for access.
The experiments provided using LLMs seem quite preliminary to me, and if the paper's focus is on combining LLMs to extract the metadata, maybe the paper should be structured differently, giving more emphasis to that aspect. Adding LLMs to the process can also raise problems with hallucinations that the authors do not mention; if such a system goes into production, this could be problematic, as those hallucinations could be difficult to spot.
More detailed comments:
- In the abstract and in the introduction, the authors include the acronyms SSoT and SPoT, which are later re-defined. Although I agree with the authors on the importance of these two concepts, it was not clear to me where that single point of truth is, especially because in section 7.1 the authors say that the SPoT will be stored in JSON. Does that mean that they are using a single JSON file? Or is it a JSON-based database like MongoDB? Why not use, in that case, a proper RDF triple store, defining or reusing a suitable ontology that could later be translated to Wikidata? (See the first sketch at the end of this comment list for the kind of RDF-based representation I have in mind.)
- Figure 1 looks like the entities of an ontology; however, it seems the authors are not using ontologies at all. Why not? Why not follow some of the practices that the Semantic Web community has promoted over the past years? Although I understand that the authors could argue that they export the data to Wikidata, which offers a SPARQL endpoint and RDF serialization, that RDF is different from a truly semantic data model containing concepts from one or more OWL ontologies.
- Another aspect the authors do not mention is how they represent the provenance of the statements they publish in Wikidata. I would suggest the use of qualifiers and references, which could make the data less exposed to vandalism, as those references could be used to verify the provenance (see the second sketch at the end of this comment list).
- I think the authors could define a proper data model for the entities depicted in figure 1 using entity schemas, which could help validate the data and could also be used to document the data model presented in that figure (see the third sketch at the end of this comment list).
- The authors sometimes write dblp in lowercase and other times DBLP.
- Page 3, line 43 “The the”
- Page 4, is it necessary to include the FAIR principles at such length? Maybe simplify them, as they have already been presented in many other papers.
- Notice that line 38 says that the metadata should use an established ontology. Do the authors follow that principle?
- Page 5, “all 15 aspects”: I think some of the aspects should not be counted; for example, aspect A1 is subdivided into A1.1 and A1.2, so the count should not include it.
- In my opinion the related work should be rewritten and should include more references from similar publishing institutions as well as from similar semantification efforts. Some of the entries in the related work are just a list of papers without indicating their relationship to the current paper.
- Page 7, “Part of this work has been reused and extended to a fully fledged parser in the work we are reporting here.” I would be careful about claiming that something is a “fully fledged parser”; maybe “a more complete parser”.
- Section 3.1 starts directly after the section 3 heading, which looks a bit strange; I would suggest adding a short introductory paragraph.
- Page 8, “The semantified Metadata Records are now available and may be stored in the format we see fit. JSON and YAML are candidate formats see table 3.” Why JSON or YAML? Why not RDF? Looking at that table, RDF seems to me the best candidate.
- Page 9, “The BeautifulSoup4 Python library is used for lenient HTML parsing as a basis.” This statement, and others like it, reads like a decision taken without any consideration of alternatives. Why that library and not others? Why a Python library?
- Page 10, line 41 contains a closing parenthesis that has not been opened.
- Page 12, “Wikidata [41] is a knowledge graph based on an RDF triple store that…” is wrong: Wikidata is not based on RDF. It exports RDF and provides a SPARQL endpoint, but internally it is not based on RDF.
- Page 12, line 22, “to to handle”
- Table 3, “such as OCR for text extraction for content extraction”.
- Page 13, line 21 contains a spurious line break before footnote 37.
- Figure 5 is very large compared with the other figures.
- Looking at Figure 6, who uploads the metadata, or how is it uploaded to Wikidata?
- Page 16, “any more to further effort”.
- Page 16, why do you include the “vital aspects” if they are not necessary for this use case?
- Page 17, line 34, “The example shown is for CEUR-WS Volume 3262 Wikidata Workshop 2022”: I think this volume corresponds to a different workshop.
- Page 18, “and allows to both the …”
- Figure 11 is too small compared with the other figures.
- Page 20, first paragraph of section 6.1: the numbers are not very well explained. The authors could present them in a table or another form that helps the reader understand them.
- In my opinion the LLM experiment is a simple experiment that works, but the results seem too preliminary for a production setting. Some of the statements in that section will likely read quite differently in a few years; for example, the sentence about cost, “The total cost of the API usage was US$1.07 or 0.4 cents per homepage”, looks quite fragile.
- Page 23, “since the documentation of DNB URN-NBN check digit generation is obscure…”: I do not think that statement is appropriate for an academic paper. How can the authors qualify the documentation as obscure?
- Page 23, first paragraph of section 6.1, how are those numbers obtained?
- Page 25 “project hat”
- Page 25 “The LLM approach already proven…”
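First sketch (referenced in my comment on the SPoT/JSON choice): to make that comment more concrete, the same metadata record the authors keep in JSON could be serialized as RDF, for instance with rdflib, so that it can be backed by an ontology and loaded into a triple store. The namespaces, property choices and the example record below are my own illustration, not taken from the paper.

```python
# Illustrative sketch only: turning a JSON metadata record into RDF with rdflib.
# Namespaces, properties and the sample record are hypothetical.
import json
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, RDF, XSD

CEUR = Namespace("https://ceur-ws.org/")   # hypothetical namespace choice
SCHEMA = Namespace("http://schema.org/")

record = json.loads("""{
  "volume": "Vol-0000",
  "title": "Proceedings of an Example Workshop",
  "published": "2022-01-01"
}""")

g = Graph()
vol = CEUR[record["volume"]]
g.add((vol, RDF.type, SCHEMA.PublicationVolume))
g.add((vol, DCTERMS.title, Literal(record["title"])))
g.add((vol, DCTERMS.issued, Literal(record["published"], datatype=XSD.date)))

# Such a graph could be validated against an ontology and pushed to a triple
# store, and still be mapped to Wikidata afterwards.
print(g.serialize(format="turtle"))
```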
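Second sketch (referenced in my comment on provenance): statements exported to Wikidata can carry references (e.g. reference URL P854), and those references can later be queried to spot edits that lack the expected provenance. This is only a sketch of the kind of contingency check I am missing in the paper; the use of P1433 (“published in”) and the idea of flagging unreferenced statements are my own illustration.

```python
# Illustrative sketch only: flag "published in" (P1433) statements that point to a
# given proceedings item but carry no reference URL (P854) in Wikidata.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY_TEMPLATE = """
SELECT ?paper ?statement ?refURL WHERE {{
  ?paper p:P1433 ?statement .              # "published in" statement node
  ?statement ps:P1433 wd:{proceedings} .   # the proceedings item to check
  OPTIONAL {{
    ?statement prov:wasDerivedFrom ?ref .
    ?ref pr:P854 ?refURL .                 # reference URL attached to the statement
  }}
}}
"""

def unreferenced_statements(proceedings_qid: str) -> list[str]:
    sparql = SPARQLWrapper(ENDPOINT, agent="ceur-provenance-check/0.1 (review sketch)")
    sparql.setQuery(QUERY_TEMPLATE.format(proceedings=proceedings_qid))
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    # Statements without a reference URL are candidates for manual review.
    return [row["statement"]["value"] for row in rows if "refURL" not in row]

if __name__ == "__main__":
    # Replace "Q1" with the QID of the proceedings item to be checked.
    for statement in unreferenced_statements("Q1"):
        print("unreferenced statement:", statement)
```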
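Third sketch (referenced in my comment on entity schemas): a small ShEx shape, of the kind that could back a Wikidata EntitySchema for the entities in Figure 1, can both document the data model and validate records before upload. The shape, namespaces and sample data below are purely illustrative, and the snippet assumes the PyShEx evaluator API.

```python
# Illustrative sketch only: validating an RDF metadata record against a ShEx shape.
# Shape, namespaces and sample data are hypothetical; assumes PyShEx's ShExEvaluator.
from pyshex import ShExEvaluator

SHEX = """
PREFIX ceur: <https://ceur-ws.org/ns#>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

ceur:VolumeShape {
  dct:title   xsd:string ;   # exactly one title
  dct:issued  xsd:date ;     # exactly one publication date
  ceur:editor IRI+           # at least one editor, given as an IRI
}
"""

DATA = """
PREFIX ceur: <https://ceur-ws.org/ns#>
PREFIX dct:  <http://purl.org/dc/terms/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>

<https://ceur-ws.org/Vol-0000/> dct:title "Proceedings of an Example Workshop" ;
    dct:issued "2022-01-01"^^xsd:date ;
    ceur:editor <https://example.org/editor/1> .
"""

results = ShExEvaluator(
    rdf=DATA,
    schema=SHEX,
    focus="https://ceur-ws.org/Vol-0000/",
    start="https://ceur-ws.org/ns#VolumeShape",
).evaluate()

for r in results:
    print(r.focus, "conforms" if r.result else f"fails: {r.reason}")
```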