Transformer-Based Architectures versus Large Language Models in Semantic Event Extraction: Evaluating Strengths and Limitations

Tracking #: 3673-4887

Authors: 
Tin Kuculo
Sara Abdollahi
Simon Gottschalk

Responsible editor: 
Guest Editors KG Gen from Text 2023

Submission type: 
Full Paper
Abstract: 
Understanding complex societal events reported on the Web, such as military conflicts and political elections, is crucial in digital humanities, computational social science, and news analysis. While event extraction is a well-studied problem in Natural Language Processing, there remains a gap in semantic event extraction methods that leverage event ontologies for capturing multifaceted events in knowledge graphs, since existing methods often fall short in semantic depth or lack the flexibility required for comprehensive event extraction. In this article, we compare two paradigms for addressing the task of semantic event extraction: the fine-tuning of traditional transformer-based models versus the use of Large Language Models (LLMs). We exemplify these paradigms with two newly developed approaches: T-SEE for transformer-based and L-SEE for LLM-based semantic event extraction. We present and evaluate these two approaches and discuss their complementary strengths and shortcomings to understand the needs and solutions required for semantic event extraction. For comparison, both approaches employ the same dual-stage architecture; the first stage focuses on multilabel event classification and the second on relation extraction. While our first approach utilises a span prediction transformer model, our second approach prompts an LLM for event classification and relation extraction, providing the potential event classes and properties. For evaluation, we first assess the performance of T-SEE and L-SEE on two novel datasets sourced from Wikipedia, Wikidata, and DBpedia, containing over 80,000 sentences and semantic event representations. Then, we perform an extensive analysis of the different types of errors made by these two approaches to discuss a set of phenomena relevant to semantic event extraction. 
Our work makes substantial contributions to (i) the integration of Semantic Web technologies and NLP, particularly in the underexplored domain of semantic event extraction, and (ii) the understanding of how LLMs can further enhance semantic event extraction and what challenges need to be considered in comparison to traditional approaches.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 24/Jul/2024
Suggestion:
Minor Revision
Review Comment:

Transformer-Based Architectures versus Large Language Models in Semantic Event Extraction is an interesting and timely topic. This is a very nice and interesting piece of work with a well-founded methodology, and the experiments are reproducible. In general, the paper is very good; I have just one minor remark: LLMs may generate different answers for the same input or prompt, because the model does not always produce exactly the same output even when presented with the same input. If an LLM is used in an experimental study, it is important to consider this variability. How do multiple executions of the same prompt impact the results?
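Such a robustness check could be as simple as re-running each prompt several times and reporting the majority answer together with an agreement rate. A minimal sketch follows; the run outputs are illustrative stand-ins for real LLM responses, not actual model output:

```python
from collections import Counter

def agreement_stats(outputs):
    """Summarise variability across repeated executions of one prompt.

    Returns the majority answer and the fraction of runs that produced it.
    """
    counts = Counter(outputs)
    answer, n = counts.most_common(1)[0]
    return answer, n / len(outputs)

# Hypothetical outputs from five executions of the same classification prompt:
runs = ["Election", "Election", "Election", "Conflict", "Election"]
majority, rate = agreement_stats(runs)
# majority == "Election", rate == 0.8
```

Reporting such an agreement rate alongside the main scores would make the LLM-based results easier to interpret.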

Review #2
Anonymous submitted on 25/Oct/2024
Suggestion:
Minor Revision
Review Comment:

The paper focuses on applying transformer-based architectures and large language models (LLMs) to the task of event relation extraction. The authors make three significant contributions:
1. Dataset Construction: They build two datasets from Wikidata and Wikipedia, comprising event classes and their relations.
2. Methodology and Experiments: The authors evaluate transformer-based models (primarily BERT) for event extraction and relation extraction, and they also experiment with LLMs using prompting techniques.
3. Comparison of Transformer-Based Models and LLMs: The authors compare their proposed transformer-based approach with prior works, such as Text2Event and EventGraph, as well as with methods that leverage LLMs.

The proposed pipeline for event relation extraction consists of three key steps:
1. Ontology Creation: The authors extract an ontology from Wikipedia and Wikidata, identifying event classes and their corresponding properties.
2. Event Extraction: They extract event classes from the text, matching them to the predefined classes from their ontology.
3. Relation Classification: After identifying event classes, the next step matches these events to appropriate properties (those that co-occur with the event class at least 50 times). The authors then train a classifier to predict, from the text, the relations between events, the relation type, and its associated value.
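The three steps above can be sketched as a minimal pipeline; the ontology, trigger keywords, and co-occurrence counts below are toy assumptions for illustration, not the authors' actual data or models:

```python
MIN_COOCCURRENCE = 50  # property kept only if seen >= 50 times with the class

# Toy ontology: event class -> {property: co-occurrence count}
ONTOLOGY = {
    "Election": {"country": 120, "winner": 80, "moon_phase": 3},
    "Conflict": {"country": 200, "participant": 95},
}

def allowed_properties(event_class):
    """Step 1/3 precondition: keep properties above the co-occurrence threshold."""
    counts = ONTOLOGY.get(event_class, {})
    return {p for p, n in counts.items() if n >= MIN_COOCCURRENCE}

def classify_events(text):
    """Step 2 placeholder: match event classes by trigger keywords
    (standing in for the trained event classifier)."""
    triggers = {"Election": "election", "Conflict": "war"}
    return {c for c, kw in triggers.items() if kw in text.lower()}

def relation_queries(text):
    """Step 3: build (class, property) queries for the relation classifier."""
    return {(c, p) for c in classify_events(text) for p in allowed_properties(c)}

queries = relation_queries("The 2015 election was held in Poland.")
# {('Election', 'country'), ('Election', 'winner')}
```

The 50-occurrence threshold thus acts as a filter on which (class, property) queries are ever posed, which is why its choice deserves more justification (see the limitations below).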

Strengths:
- The paper is generally easy to follow.
- It provides an appropriate level of detail for a deep understanding of the work.
- The comparison between LLMs and transformer-based models is thorough, supported by statistical results and manual assessments. It addresses various aspects such as granularity of extraction, extraction inaccuracies, and annotation discrepancies. These comparisons were conducted using a subset of the constructed DBpedia dataset (100 examples).
- Novelty: The paper’s novelty lies both in the creation of new datasets, whose superiority in terms of the number of event classes and properties the authors demonstrate through comparisons with existing event relation extraction datasets such as ACE05, and in the proposed event relation extraction method. This method showcases innovation by bridging the gap between NLP and the Semantic Web, allowing event extraction to adapt to different ontologies and combining the strengths of both fields.
- Reproducibility: The datasets and code are made available online, ensuring the work is reproducible.
- Soundness: The paper presents a new state-of-the-art method for event relation extraction by integrating ontology information with NLP in transformer-based architectures, demonstrating a solid, well-grounded approach.

Limitations:
- The authors provide a custom definition of an event for this work rather than referring to established ontologies or definitions, such as that of the Simple Event Ontology.
- The choice of using a threshold of 50 occurrences for properties is not well explained.
- References are missing for key elements like the attention mechanism, distance label generation, and GPT-3.5-turbo-1106.
- The discussion of the dataset would be better placed before the models section, to help readers understand the availability of gold labels for different subtasks.
- It would be helpful to know if the authors faced difficulties fitting event-property pairs into the problem, and what solutions were considered.
- The reasoning for not aligning the Wikidata and DBpedia datasets is insufficiently explained.
- The model was trained for only 30 epochs, while previous state-of-the-art models were trained for 40 epochs; this discrepancy is not justified.
- Table 6: It would be clearer to present “before” and “after” results in separate columns rather than using arrows between values.
- Table 4: Some columns are misnamed. The datasets should be referred to as “DBpedia-EE” and “Wikidata-EE” rather than “Wikidata” and “DBpedia”, as they are distinct datasets.

Review #3
By Daniel Hernandez submitted on 27/Nov/2024
Suggestion:
Major Revision
Review Comment:

This paper compares Transformer-Based Architectures and Large Language Models regarding the Semantic Event Extraction task. To this end, the authors develop two methods, T-SEE and L-SEE, which represent the Transformer-Based architectures and the Large Language Models, respectively. The design of these two models assumes the separation of the Semantic Event Extraction task into two tasks: the first classifies events and the second extracts relations. Additionally, the authors include two baselines: Text2Event and EventGraph.

Regarding the reproducibility of the paper, the authors provide access to the code and a long-term preserved dataset published on Zenodo.

In general, I found this an interesting paper because comparing different approaches can provide new insight into the applicability of these methods. However, I have some questions that lead me to recommend the paper for a major revision:

Q1. Why do you define two datasets if you consider three sources? On page 1, line 32, you state that the sources are Wikipedia, Wikidata, and DBpedia.

Q2. On page 3, line 19, you state that the pair (country, Poland) is a relation. However, this does not follow the general notion of what a relation is (from mathematics and database theory), in this case a binary relation. The pair contains the information to state a relationship, labeled country, from the event to Poland. A relation is usually a set of relationships (or a set of tuples). You also do this on page 4, line 29, where you say that relations and edges are the same, instead of stating that a relation is defined by a set of edges.

Q3. I would not call Event Ontology to what is stated in Definition 1. More than an ontology, according to Definition 1, an event ontology consists of a pair of two sets. According to this definition, the pairs ({1}, {2,3}) and ({1,2}, {2,3}) can be called event ontologies. It seems that what you intend to define here is the vocabulary of the ontology.

Q4. On page 4, line 24, you should write ⊆ instead of = because you don't want to define R as the set of all possible relationships.

Q5. Suggestion: I do not recommend using $p_{type}$ to denote the predicate that defines the types of elements, because subscripts are less readable and people may think that p is another predicate. It would be simpler to write (e, type, C). Furthermore, your current definition allows writing (e, type, d) where d is not in C. To fix the definition, you can say that triples with the property type only allow elements of C in the third component. Alternatively, you can avoid introducing such a restriction by encoding the types of entities as another relation T ⊆ E × C. Do you want to define types only for events or for any element in E?
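The alternative encoding of types as a relation, as suggested above, could be written for instance as:

```latex
% Types as a separate relation instead of special triples:
% (e, c) \in T means that element e has class c.
\[
  T \subseteq E \times C,
  \qquad
  \mathrm{type}(e) = \{\, c \in C \mid (e, c) \in T \,\}
\]
```

This avoids any restriction on the third component of triples in R, and the choice of the domain of T makes explicit whether types are defined for events only or for all elements of E.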

Q6. Definition 3 is not clear on what is expected for the extracted relations. I can imagine the goal is to extract triples of the form (e_t, p, o) with p in P and o in E ∪ L. However, this is not explicit. One may also extract triples whose subject is not an event.

Q7. The problem statement should indicate what assumptions are made, and what data is provided to the models to learn the task.

Q8. In Figures 1 and 2, you refer to queries as pairs. You should make explicit what these queries mean, because the word query has a specific meaning in graph database systems, and I do not understand what your queries do.

Q9. You have an extra parenthesis in Algorithm 1, line 9.

Q10. In Algorithm 1, line 9, I do not understand why the method ECM.classifyEvents(t, O) returns a set of events. The name suggests it should return a set of pairs (e, C) or a set of triples (e, type, C) --see my question Q5-- where e is an extracted event and C is the class of the extracted event. However, in line 13, you introduce the notation e.c, which implicitly says you already assume that the class is an attribute of the events generated in line 9. This notation is confusing because in Definition 2 you use V as a set of elements, which is a subset of E, whereas in Algorithm 1 the elements of V are objects that have attributes like c. That is, c is not a class but the name of an attribute of the object e, which contains the class of e. You should stick with the notation that was already introduced.

Note that this is especially confusing on page 8, line 24, where you write e_t.c = c. The symbol c is used with two different meanings in this identity.

Q11. On page 7, line 4, you write, "[...] additional constraints can be applied to remove queries from Q." Can be applied or are applied?

Q12. On page 8, line 1, you write, "traditional multilabel classification approaches [...]." The word traditional is subjective. Instead, you should cite the methods you want to refer to. You also use the adjective "traditional" in a vague way on page 9, line 37.

Q13. On page 8, line 25, in the definition of the set of queries, you should add space around the symbol | denoting "such that." You can use the latex symbol \mid.

Q14. I think that Figure 5 does not clearly indicate where the language model is accessed. For example, the event classification shows two arrows that end in the classified elements. Does this mean that the arrows represent the processing of the prompts by the LLM, and the separated results are combined into the classified events, or that the prompts are combined, and then the LLM returns the classified events? Are all the arrows calls to the LLM? What about the last arrow? It appears that there is no prompt associated with that arrow.

Q15. Sections 4.1 and 4.2 are very short. I would appreciate a more detailed description of the prompts used.

Q16. What LLM did you use in your evaluation?