LLM4Schema.org: Generating Schema.org Markups with Large Language Models

Tracking #: 3716-4930

Authors: 
Minh-Hoang Dang
Thi Hoang Thi Pham
Pascal Molli
Hala Skaf-Molli
Alban Gaignard

Responsible editor: 
Guest Editors KG Construction 2024

Submission type: 
Full Paper

Abstract: 
The integration of Schema.org markup in web pages has resulted in the creation of billions of RDF triples, yet approximately 75% of all web pages still lack these essential markups. Large Language Models (LLMs) offer a potential solution by automatically generating the missing Schema.org markups. However, the accuracy and reliability of LLM-generated markups compared to human annotators remain uncertain. This paper introduces LLM4Schema.org, a novel approach to evaluate the performance of LLMs in generating Schema.org markups. Our study identifies that 40-50% of the markup produced by GPT-3.5 and GPT-4 are either invalid, non-factual, or non-compliant with the Schema.org ontology. We show that these errors can be identified and removed using specialized agents powered by LLMs. Once errors are filtered out, GPT-4 outperforms human annotators in generating accurate and comprehensive Schema.org markups. Both GPT-3.5 and GPT-4 are capable of making improvements in areas where human annotators fall short. LLM4Schema.org highlights the potential and challenges of using LLMs for semantic annotation, emphasizing the importance of curation to achieve reliable results.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 28/Aug/2024
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

1. Originality
The manuscript presents a novel approach to generating Schema.org markups using large language models (LLMs), particularly focusing on GPT-3.5 and GPT-4. The originality lies in the combination of LLMs with a structured pipeline for validating and comparing machine-generated markup with human-generated markup. The study advances the field by proposing a practical and innovative solution to improve the accuracy and completeness of web page annotations, which is essential for enhancing web content discoverability.

2. Significance of the Results
The results presented in the manuscript are significant, especially in demonstrating that, with proper curation, LLMs can generate Schema.org markups that are more accurate and comprehensive than those produced by human annotators. The study provides valuable insights into the current capabilities and limitations of LLMs in semantic web tasks, highlighting the potential for these models to be integrated into web content management systems for automated markup generation. The findings could have substantial implications for improving search engine optimization (SEO) and the overall quality of web data. However, there are several points that could be improved:

Complexity of the evaluation pipeline: The evaluation pipeline is thorough but complex, which may make it difficult for others to replicate or extend the study. The reliance on multiple agents and a complex scoring system could benefit from clearer explanations and possibly simplifications to enhance accessibility.

Limited discussion on limitations: The paper could benefit from a more detailed discussion of the limitations of the proposed approach. For example, the potential biases introduced by the training data of the LLMs or the limitations of the chosen validation tools (like SHACL shapes) are not explored.

Generalization to other ontologies: While the focus on Schema.org is appropriate, the paper does not discuss the applicability of the proposed methods to other, potentially more complex or less structured ontologies.

Scalability: The process of chunking text to fit within the LLM's context window and subsequently merging the results raises concerns about scalability and the consistency of the final markup. The paper could explore additional strategies to handle large texts more efficiently, such as summarization techniques, hierarchical processing, sliding windows, or context-aware chunking, to improve scalability.
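As an illustration of one of these suggested strategies, a sliding window with a fixed overlap could be sketched as follows. This is a minimal Python sketch with a character-based approximation of the context window and hypothetical parameters, not the authors' implementation:

```python
def chunk_text(text: str, chunk_size: int = 8000, overlap_ratio: float = 0.1) -> list[str]:
    """Split text into overlapping chunks that fit a model's context window.

    chunk_size approximates the token budget in characters; overlap_ratio
    (e.g. 10%) repeats the tail of each chunk at the start of the next one
    so that entities spanning a chunk boundary are not lost.
    """
    step = int(chunk_size * (1 - overlap_ratio))
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```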

3. Quality of Writing
The manuscript is generally well-written. However, the complexity of the evaluation pipeline might be challenging for readers who are not deeply familiar with the technical details of LLMs and semantic web technologies. Some sections could benefit from additional explanations or simplifications to make the content more accessible. Furthermore, the discussion of limitations could be expanded to provide a more balanced view of the work.

Assessment of the Data File
(A) Organization and README File
The data file provided at the specified GitHub repository is well-organized. It includes a README file that provides an overview of the repository's contents, instructions for setting up the environment, and a guide to replicating the experiments. The README is comprehensive and serves as a helpful entry point for users attempting to use the data and code. It includes all the necessary commands and dependencies required to run the experiments.

(B) Completeness of Resources for Replication
The resources provided in the repository appear to be complete for the replication of experiments. The repository includes the source code, data files, and detailed instructions needed to reproduce the results presented in the paper. The presence of example scripts and detailed explanations in the README file further supports the ease of replication.

(C) Appropriateness of Repository for Long-term Discoverability
GitHub is an appropriate choice. The authors might consider integrating the GitHub repository with Zenodo for DOI generation to ensure enhanced archival quality and long-term accessibility of the data.

(D) Completeness of Data Artifacts
The data artifacts provided are complete in terms of what is required to replicate the experiments. The repository includes the datasets, scripts, and detailed instructions. The organization of the repository is logical, with clear distinctions between code, data, and documentation.

Review #2
Anonymous submitted on 02/Sep/2024
Suggestion:
Minor Revision
Review Comment:

## Summary
The paper "Generating Schema.org Markups with Large Language Models" effectively explores using LLMs to generate Schema.org markups from web page text.
It is well-aligned with the particular journal issue on automatic KG construction.
The associated code is publicly available, well-structured, and thoroughly documented for reproducibility.

### Introduction
The introduction is good and establishes the study's importance, identifying key gaps and contributions.

### Background
The background section explains using the JSON-LD format for embedding Schema.org markups. However, the paper would benefit from discussing other formats like microdata, which offer a higher semantic level (annotating hash URIs/DOM objects) and play a significant role in Linked Data and the Semantic Web. A brief discussion would clarify the difference in complexity when targeting JSON-LD (and its limitations) instead of embedded Schema.org markup.

### Methodology
The methodology is well-detailed, with Figure 4 effectively illustrating the pipeline. The paper describes three validation phases for the generated markups:

- Validity Agent: Ensures syntactical correctness using SHACL.
- Factual Agent: Verifies the accuracy and relevance of the markup (succinctness).
- Compliance Agent: Checks the validity of values according to Schema.org properties.
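
To make the role of the Validity Agent concrete, the following is a minimal sketch of a SHACL-based syntactic check using rdflib and pySHACL; the Recipe markup and the shape are hypothetical examples, not the authors' implementation:

```python
from rdflib import Graph
from pyshacl import validate

# Hypothetical JSON-LD markup for a Recipe entity (illustrative only).
markup_jsonld = """
{
  "@context": {"schema": "https://schema.org/"},
  "@type": "schema:Recipe",
  "schema:name": "Apple Pie",
  "schema:recipeIngredient": ["apples", "flour", "sugar"]
}
"""

# Hypothetical SHACL shape requiring a name on every schema:Recipe.
shapes_ttl = """
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix schema: <https://schema.org/> .

schema:RecipeShape a sh:NodeShape ;
    sh:targetClass schema:Recipe ;
    sh:property [ sh:path schema:name ; sh:minCount 1 ] .
"""

data_graph = Graph().parse(data=markup_jsonld, format="json-ld")
shapes_graph = Graph().parse(data=shapes_ttl, format="turtle")

conforms, _, report = validate(data_graph, shacl_graph=shapes_graph)
print(conforms)  # True if the markup satisfies the shape
print(report)    # Human-readable validation report otherwise
```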

The process of prompting Schema.org types and properties remains unclear and needs clarification, especially regarding the decision to limit the vocabulary. Are only the C-set types used in the final evaluation? This would significantly impact the evaluation, as it simplifies the “matching” process for the LLM compared to offering types that are not expected in the final markups.

Knowing the final number of chunks would also be useful for future research and estimating this method's cost-effectiveness.

A novel metric, Markup Ideal Match Ratio (MIMR), estimates the contribution of human and LLM-generated markups to an ideal merged markup.
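
Based purely on this description, and assuming MIMR boils down to the share of the ideal merged markup's validated properties contributed by each source (an assumption, not the paper's exact formula), a rough sketch might look like:

```python
def mimr(source_props: set[str], ideal_props: set[str]) -> float:
    """Fraction of the ideal merged markup's properties covered by one source."""
    if not ideal_props:
        return 0.0
    return len(source_props & ideal_props) / len(ideal_props)

# Hypothetical property sets kept by the three agents for one web page.
human_props = {"name", "recipeIngredient", "cookTime"}
llm_props = {"name", "recipeIngredient", "recipeInstructions", "author"}
ideal_props = human_props | llm_props  # ideal merged markup

print(mimr(human_props, ideal_props))  # 0.6
print(mimr(llm_props, ideal_props))    # 0.8
```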

### Evaluation
The evaluation assesses factuality, compliance, and the MIMR score, concluding with a robust analysis of the LLM4Schema.org pipeline results. Testing the two agents against a dataset from Schema.org documentation shows high accuracy, but a comparison with OpenAI models would strengthen the analysis. For instance, only the errors described in Appendix A and B could be tested.

Clarification is needed on whether benchmark samples from the WDC were pre-validated using the agents (maybe it was not clear to me).

### Related Work
The related work appears appropriate given the massive current development of the LLM+KG research area.

### Conclusion
Additional supporting numbers, such as the RR and MIMR/human kappa score, would strengthen the conclusion.

### Future Work
Future work could explore testing other Schema.org markup annotations, like microdata.

## Strengths
- Effective running examples and visualization (prompts/figures/diagrams)
- Thorough explanation of the methodology
- In-depth evaluation of the pipeline and methods used

## Weaknesses
- Brief background on different Schema.org methods, particularly the limitations of the JSON-LD markup compared to embedded annotations
- Some unclear sections, particularly regarding prompt chunking with the Schema.org vocabulary of over 800 types
- Overall, captions could be clearer or more descriptive (e.g., Fig 1: Apple HTML -> HTML and JSON-LD about an apple pie recipe with its ingredients)
- Occasional delays in explanations (e.g., the caption and text explanation of Fig. 1 regarding the “Eiffel_Tower” value; the 0 to 1 output of both agents in Section 4.1/4.2)

## Overall
Despite minor issues, the paper makes a valuable contribution and provides a solid scientific evaluation of the problem. After addressing these minor points, I recommend accepting the paper.

Review #3
Anonymous submitted on 30/Sep/2024
Suggestion:
Minor Revision
Review Comment:

The paper presents LLM4Schema.org, a method for evaluating large language models in generating Schema.org markups for webpages. LLMs were tested to generate markup; however, 40-50% of their markup is invalid, non-factual, or non-compliant with standards. To resolve these issues, the authors developed specialized agents for syntax, factuality, and compliance validation. After error filtering, GPT-4 outperforms human annotators, demonstrating the need for curated LLM-generated markup.
First, the authors created a corpus of webpages with Schema.org markup from WebDataCommons, selecting a subset from an initial 877 million pages. The selection was based on groups of properties shared by entities across webpages, revealing common patterns in how entities are described on the web.
Additionally, they developed three agents for validating and curating the markup: one for syntax validation using shape constraints, one to verify the factual accuracy of extracted tags with an LLM, and another to ensure compliance with Schema.org standards, also using an LLM. Both the factuality and compliance agents were validated against ground truth data, achieving strong performance with F1-scores above 90%.
They also introduced the MIMR metric to quantitatively compare human and LLM performance in annotating web pages, measuring markup completeness. This metric identifies the winning markup as the one with more properties deemed correct by the three agents. Since MIMR is purely quantitative, human evaluations were used to complement the analysis. These evaluations showed that MIMR results closely aligned with human assessments, validating the metric.
The work effectively highlights the gap in previous research and clearly demonstrates its uniqueness and originality. The results showcase the potential of using LLMs to generate Schema.org markups. By applying filters to reduce hallucinations, the system is able to produce high-quality, accurate markups.
The paper is well-structured and easy to follow. The simplicity of the language enhances clarity and ensures that the content is accessible to a wide audience.
The data file is well-organized, particularly with the inclusion of a README file, making it straightforward to assess and understand the dataset.
The provided resources appear to be sufficient for replicating the experiments. Additionally, the GitHub repository is suitable for long-term discoverability.

I have a few remarks that I would like to see addressed in the revised version of the paper: the background of the annotators is not specified, which raises questions about the reliability of their evaluations. Additionally, the paper does not thoroughly address the 10% overlap between chunks, leaving this aspect insufficiently discussed.

Review #4
Anonymous submitted on 02/Nov/2024
Suggestion:
Major Revision
Review Comment:

This paper presents a pipeline for generating Schema.org markup based on LLMs and a pipeline to compare such markups with other markups generated by humans.

The described processes and pipeline are in general reasonable, and some evaluations are provided. The paper is well organized; nonetheless, some parts should be better explained.

It is claimed that GPT-4 outperforms human annotations, something that can be measured with the proposed metric. My main concern here is about the reasons for that. Could you provide any explanations for this? Is there any possibility of GPT-4 having been trained on the same data used for the experiments, considering that WebDataCommons was released in 2022 and GPT-4's training data extends to 2023, while GPT-3.5 was trained on data up to 2021?

Abstract

It is not clear which outputs from LLMs and from humans are being compared after filtering out errors. Do you filter errors from both humans and LLMs?

Regarding "Our study identifies that 40-50% of the markup produced by GPT-3.5 and GPT-
4 are either invalid, non-factual, or non-compliant with the Schema.org ontology": is there any suggestion about why this happens?

1. Introduction

line 13: "a real ontology...": What do you mean by "real ontology"? How is it defined? Does it refer to being broadly adopted?

line 22: Is the subset of the WebDataCommons data a scientific contribution of the paper or a side resource generated as part of the process? I agree that it might be useful for other experiments and has value worth recognizing, but it seems to be an output of an engineering process more than a scientific contribution.

line 31, "an ideal merged markup": It is not completely clear to me how this ideal merged markup is defined; is it the union of the human and LLM annotations? In that case it might not be complete. Considering that you have the SHACL shapes for each entity, would it be possible to have an ideal markup that considers not only the existing annotations but also the possible properties from the shapes? Please explain whether this is possible, has drawbacks, or would not be feasible or sound. This comment also applies to Section 3.5.

2. Background and motivations

"JSON-LD is a compact representation of an RDF graph using the types and properties defined in the Schema.org ontology": review this definition, JSON-LD is an RDF serialization independent of schema.org.

Definition 2: "Given an RDF graph G, a markup entity, denoted by its subject s,": this seems to reduce the definition of a graph to the set of triples that share the same subject. Please refine or review this definition.

In general, there is a mix in the use of "correctness", "completeness", "accuracy" of the annotations and the goal of the pipelines. Review and be consistent with the terminology.

3. LLM4Schema.org Overview

Is the human markup (Mh) selected from WebDataCommons? If so, is there a chance that GPT-4 has read it while GPT-3.5 may not have, for a subset of it?

Page 7 "the value must approximately match..." how is this approximation define and measure?

Regarding the SHACL shapes, are they validated somehow to make sure they are correct and complete?

Why was the prompt presented on page 8 selected? Were other options considered? Would the output change if the definition of the property were provided in the prompt? The same comment applies to the rest of the prompts defined.

"In our experiments, we assess the qualitative perspective of the MIMR metric using human evaluations.": it would be nice having a pointer to the section.

Others:
The strikeout colour in Figure 4 is not distinguishable when printed in black and white.

4. Experiments

On page 11, when describing the negative tests for intrinsic principles, the values of some pairs are exchanged. Is any additional verification done to avoid still-valid pairs after swapping values? Is it guaranteed that no valid values could be generated after swapping? Are only the indexes compared, or also the actual values?

For the extrinsic principle, it is checked that M1 and M2 are disjoint, but they could share common properties if they belong to the same hierarchy and inherit properties from upper levels.

The information in Tables 1 and 3 does not contribute more than the numbers already given in the text.

Page 14: "Equation 4.3" is not defined as such in the paper, or I missed it.

"Table 5 shows some examples of C-sets from WebDataCommons 2023." but table 5 says 2022.

On page 15, please detail a bit more how C-sets are defined. Shouldn't the one for Figure 5b include recipeIngredients...?

Section 4.4: again, please explain and consider whether GPT-4 could have seen the human markup and how that could be interpreted in the results.

"We observe the same pattern in both features: the MIMR metric is higher for Humans in the High quantiles, while it is higher for GPT-4 in the Low quantiles": is there any conclusion about this observation or explanation about why?

Please explain better the comparison in Table 9; it is unclear whether the comparison assesses how close the LLM-generated markup is to the human markup, or whether the human or the LLM markup is better. What do those numbers for Human and MIMR mean? Why is it compared for each LLM version?

Section 4.5: Are the cases validated by humans the 18 in the annex, that is, 10% of 180? This validation might not be very representative; considering only 180 pages, the percentage of validated markup could easily be increased. Also, it is mentioned that there were 7 participants, but there is no information about their profile, previous knowledge, relation to the project, how many pages each one validated, how many participants were assigned to each page, etc. This experiment could be significantly improved.

5. Related Work

The first sentence summarizes again the work described, as in the introduction and conclusions. I would suggest keeping this section for comparison with existing works.

In the last paragraph it is claimed that "In this light, LLM4Schema.org enhances scalability and practicality for analyzing thousands of real-world web pages": were any scalability tests or experiments done?

Consider comparing to: Bengtson, Jason. "Testing the Feasibility of Schema.org Metadata Refinement Through the Use of a Large Language Model." Journal of Library Metadata 24.4 (2024): 275-290.

Minor

line 13 "... markup wins the match" sounds informal

Footnotes 2 and 3 are the same.