Review Comment:
This paper presents an LLM-based pipeline for generating Schema.org markup, together with a pipeline for comparing the generated markup against markup produced by humans.
The described processes and pipelines are in general reasonable, and several evaluations are provided. The paper is well organized; nonetheless, some parts should be better explained.
It is claimed that GPT-4 outperforms human annotations, something that can be measured with the proposed metric. My main concern here is about the reasons for that: could you provide any explanation? In particular, is there any possibility that GPT-4 was trained on the same data used in the experiments, considering that WebDataCommons was released in 2022 and GPT-4's training data extends to 2023, while GPT-3.5's training cutoff is 2021?
Abstract
It is not clear which outputs from the LLMs and from humans are being compared after filtering out errors. Are errors filtered out of both the human-generated and the LLM-generated markup?
Regarding "Our study identifies that 40-50% of the markup produced by GPT-3.5 and GPT-
4 are either invalid, non-factual, or non-compliant with the Schema.org ontology": is there any suggestion about why this happens?
1. Introduction
line 13: "a real ontology..." what do you mean by real ontology? how is it defined? Does it refer to being broadly adopted?
line 22: Is the subset of the WebDataCommons data a scientific contribution of the paper or a side resource generated as part of the process? I agree that it might be useful for other experiments and has value worth recognizing, but it seems to be more an output of an engineering process than a scientific contribution.
lines 31 "an ideal merged markup": It is not completely clear to me how this ideal merged markup is defined, is it the union of the human and LLMs annotations? It that case it might not be complete, considering that you have the SHACL shapes for each entity, would it be possible to have an ideal markup not only considering the existing annotations but also the possible properties from the shapes? Please explain, whether this is possible or has also drawbacks or would not be feasible or sound. This comment applies also to section 3.5.
2. Background and motivations
"JSON-LD is a compact representation of an RDF graph using the types and properties defined in the Schema.org ontology": review this definition, JSON-LD is an RDF serialization independent of schema.org.
Definition 2: "Given an RDF graph G, a markup entity, denoted by its subject s,": this seems to reduce the definition of a graph to the set of triples that share the same subject. Please refine or review this definition.
In general, the terms "correctness", "completeness", and "accuracy" are used inconsistently when referring to the annotations and to the goal of the pipelines. Please review the terminology and use it consistently.
3. LLM4Schema.org Overview
Is the human markup (Mh) selected from WebDataCommons? If so, is there a chance that GPT-4 has seen it during training, while GPT-3.5 may not have (at least for a subset of it)?
Page 7 "the value must approximately match..." how is this approximation define and measure?
Regarding the SHACL shapes, are they validated somehow to make sure they are correct and complete?
Why was the prompt presented on page 8 selected? Were other options considered? Would the output change if the definition of the property were provided in the prompt (see the illustration after the next comment)?
The same comment applies to the rest of the prompts defined.
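To illustrate the suggestion about including property definitions (the wording below is hypothetical, not taken from the paper, and the quoted definition paraphrases the schema.org documentation for recipeIngredient):

    The schema.org property "recipeIngredient" is defined as: "A single
    ingredient used in the recipe, e.g. sugar, flour or garlic."
    Given the following markup, does the value of "recipeIngredient"
    comply with this definition? Answer yes or no and justify.

An ablation comparing prompts with and without such definitions would help justify the chosen design.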
"In our experiments, we assess the qualitative perspective of the MIMR metric using human evaluations.": it would be nice having a pointer to the section.
Others:
The strikeout colour in Figure 4 is not distinguishable when printed in black and white.
4. Experiments
On page 11, when describing the negative tests for the intrinsic principles, the values of some pairs are exchanged. Is any additional verification done to avoid pairs that remain valid after swapping values? Is it guaranteed that no valid values can be produced by the swap? Are only the indexes compared, or also the actual values? See the example below.
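To illustrate the concern (an example of my own, not from the paper): swapping the values of two properties with compatible ranges can produce a pair that is still perfectly valid, e.g.

    before swap:  "prepTime": "PT20M",  "cookTime": "PT45M"
    after swap:   "prepTime": "PT45M",  "cookTime": "PT20M"

Both swapped values remain valid ISO 8601 durations for their respective properties, so such a pair would not be a reliable negative test unless the actual values are also checked.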
For the extrinsic principle, it is checked that M1 and M2 are disjoint, but they could share common properties if their types belong to the same hierarchy and inherit properties from upper levels (see the example below).
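For example (illustrative only), schema:Recipe and schema:Review both inherit name and description from schema:Thing, so two entities of different types can legitimately share these properties:

    { "@type": "Recipe", "name": "Pancakes", "description": "..." }
    { "@type": "Review", "name": "Pancakes", "description": "..." }

Please clarify how such inherited, shared properties are handled in the disjointness check.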
The information in Tables 1 and 3 does not contribute more than the numbers already given in the text.
Page 14: "Equation 4.3" is not defined as such in the paper or I missed it.
"Table 5 shows some examples of C-sets from WebDataCommons 2023." but table 5 says 2022.
On page 15, please detail a bit more how C-sets are defined. Shouldn't the one for Figure 5b include recipeIngredients...?
Section 4.4: again, please explain and consider whether GPT-4 could have seen the human markup during training and how that would affect the interpretation of the results.
"We observe the same pattern in both features: the MIMR metric is higher for Humans in the High quantiles, while it is higher for GPT-4 in the Low quantiles": is there any conclusion about this observation or explanation about why?
Please explain the comparison in Table 9 better: it is unclear whether it assesses how close the LLM-generated markup is to the human markup, or which of the two (human or LLM) is better. What do the numbers for Human and MIMR mean? Why is the comparison made for each LLM version separately?
Section 4.5: Are the cases validated by humans the 18 in the annex, i.e. 10% of the 180 pages? This validation might not be very representative, and with only 180 pages in total, the percentage of validated markup could easily be increased. Also, it is mentioned that there were 7 participants, but there is no information about their profile, prior knowledge, relation to the project, how many pages each one validated, how many participants were assigned to each page, etc. This experiment could be significantly improved.
5. Related Work
The first sentence summarizes the work again, as in the introduction and conclusions. I would suggest limiting this section to the comparison with existing works.
In the last paragraph it is claimed that "In this light, LLM4Schema.org enhances scalability and practicality for analyzing thousands of real-world web pages": were any scalability tests or experiments performed?
Consider comparing to: Bengtson, Jason. "Testing the Feasibility of Schema.org Metadata Refinement Through the Use of a Large Language Model." Journal of Library Metadata 24.4 (2024): 275-290.
Minor
line 13 "... markup wins the match" sounds informal
Footnotes 2 and 3 are identical.