N-ary Relation Extraction for Joint T-Box and A-Box Knowledge Base Augmentation

Tracking #: 1270-2482

Marco Fossati
Emilio Dorigatti
Claudio Giuliano

Responsible editor: Philipp Cimiano

Submission type: Full Paper

The Web has evolved into a huge mine of knowledge carved in different forms, the predominant one still being the free-text document. This motivates the need for Intelligent Web-reading Agents: hypothetically, they would skim through disparate Web source corpora and generate meaningful structured assertions to fuel Knowledge Bases (KBs). Ultimately, comprehensive KBs, like Wikidata and DBpedia, play a fundamental role in coping with the issue of information overload. In line with this vision, this paper describes the Fact Extractor, a complete Natural Language Processing (NLP) pipeline which reads an input textual corpus and produces machine-readable statements. Each statement is supplied with a confidence score and undergoes a disambiguation step via entity linking, thus allowing the assignment of KB-compliant URIs. The system implements four research contributions: it (1) executes N-ary relation extraction by applying the Frame Semantics linguistic theory, as opposed to binary techniques; (2) jointly populates both the T-Box and the A-Box of the target KB; (3) relies on lightweight NLP machinery, namely part-of-speech tagging only; (4) enables a completely supervised yet reasonably priced machine learning environment through a crowdsourcing strategy. We assess our approach by setting the target KB to DBpedia and by considering a use case of 52,000 Italian Wikipedia soccer player articles. From these, we produce a dataset of more than 213,000 triples with a 78.5% F1. We corroborate the evaluation via (i) a performance comparison with a baseline system, as well as (ii) an analysis of the T-Box and A-Box augmentation capabilities. The outcomes are incorporated into the Italian DBpedia chapter, can be queried through its SPARQL endpoint, and downloaded as standalone data dumps. The codebase is released as free software and is publicly available in the DBpedia Association repository.

Major Revision

Solicited Reviews:
Review #1
By Roman Klinger submitted on 26/Feb/2016
Minor Revision
Review Comment:

The paper “N-ary Relation Extraction for Joint T-Box and A-Box Knowledge Base Augmentation” describes a pipeline for database enrichments. It builds on top of frame semantics.

The introduction comprehensively explains the need for and usefulness of knowledge-base completion and enumerates different efforts for making knowledge publicly available in a structured manner. It makes very clear what the main contributions of this paper are: it is a whole framework/pipeline from text to database entries. The actual information extraction technology is not very sophisticated, which the authors present as an advantage of the approach (and I agree, provided the performance is sufficient, of course).

The paper explains and partially focuses on technical details. It mentions the use of XML/Wiki syntax and actually explains how these documents are preprocessed. It describes the technology in terms of the CPU and RAM used to process the data. I am not fully convinced that this level of detail is helpful in this article. The authors might consider moving such detailed descriptions to the homepage where the system can be downloaded, contributing to a more concise description.

The mathematically motivated methods, like the TF-IDF/standard-deviation-based ranking of entities, could be introduced more formally.
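To illustrate the kind of formalisation I have in mind (my own sketch: the paper does not specify how TF-IDF and the standard deviation are combined, so a product of the mean per-document score and its standard deviation is assumed here), the ranking could look like this:

```python
import math

def tfidf_stddev_rank(docs, terms):
    """Rank candidate lexical units by combining TF-IDF with the
    standard deviation of their per-document TF-IDF scores.
    The combination (mean * stddev) is a hypothetical choice."""
    n = len(docs)
    score = {}
    for term in terms:
        df = sum(1 for d in docs if term in d)  # document frequency
        idf = math.log(n / df) if df else 0.0
        tfidfs = [d.count(term) / len(d) * idf for d in docs]
        mean = sum(tfidfs) / n
        std = math.sqrt(sum((s - mean) ** 2 for s in tfidfs) / n)
        score[term] = mean * std
    return sorted(terms, key=score.get, reverse=True)

# Toy corpus of tokenised sentences (invented for illustration)
docs = [["vincere", "partita"], ["giocare", "vincere"], ["perdere"]]
print(tfidf_stddev_rank(docs, ["vincere", "giocare", "perdere"]))
```

Spelling out a definition of this kind would also make explicit how ties and zero-frequency terms are handled.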

The evaluation seems to be sound and is interesting.

This paper could be improved by adding discussions of the different design decisions made when developing the pipeline. For instance, it remains unclear why a POS tagger was chosen, but no chunking and no dependency parsing. If such steps do not contribute to the overall performance, or if the pipeline would be too slow, that's fine. But I would prefer to read about these trade-offs and the possible impact of other decisions.

The work is built on top of a manually annotated data set. I am wondering how that would scale to future applications of the pipeline. How is this approach working when new roles or entities are added? Do you expect reannotation to be necessary?

The confidence score calculation is very important when it comes to KB completion; however, Section 9.1 only describes that the pipeline outputs such scores. A more formal description of the properties and distributions of these scores would be interesting.

The related work section does not mention distributional methods and matrix factorization-based approaches for relation extraction. A short discussion of their advantages and disadvantages would be interesting.

Finally, the developed system shares properties with semantic role labeling (SRL) systems. How does your development compare to existing systems? Could an SRL system contribute a strong baseline?

* T-Box and A-Box are not terms which are well known in all communities which might be interested in this work. These terms should be introduced early in the paper.
* Figure 1 is very text focused. I understand that this is a screenshot (or print) from an actual system. However, the layout makes it quite difficult to understand just from the depiction what is the purpose. Perhaps a graphical annotation might help here.
* The use of English is not perfect and could be improved, however, this does not lead to any problems with understanding the content of the paper.
* Figure 3 could additionally be provided in a translated version, so that non-Italian readers could understand it.
* Captions on Figures (e.g. 8) are too small.
* Figure 8 is in general hard to interpret, as lines are hard to distinguish.

Review #2
By Matthias Hartung submitted on 29/Feb/2016
Major Revision
Review Comment:

This article presents work on "N-ary Relation Extraction for Joint T-Box and A-Box Knowledge Base Augmentation". The authors propose the FactExtractor system, i.e., a workflow that runs unstructured natural language text through an NLP pipeline in order to generate machine-readable statements that can be used to extend an existing knowledge base. Their approach capitalizes on Frame Semantics as a theoretical backbone from linguistic theory that serves as an interface between an ontology or data model and natural language. The authors demonstrate the capabilities of FactExtractor in a use case based on Italian Wikipedia text (a snapshot of 52,000 articles about soccer players) and DBpedia as the target knowledge base to be enriched. The mapping between the DBPO data model and the natural language extractions is achieved by manually defined frames, which provide event classes and expressive roles participating in these events, both of which can be readily transformed into RDF statements in order to populate the KB. For the given use case, the authors had to define a total of six frames and 15 roles which are particularly tailored to the domain at hand. As such, the proposed method provides an interesting complement to KB population from semi-structured sources such as Wikipedia infoboxes, which is the commonly used approach in the DBpedia community. Therefore, and due to its novel linguistic underpinnings, I consider this work highly original.

The paper is generally well structured, the line of argumentation mostly clear and comprehensible, with some qualifications however:

* From my perspective, the aspect of joint T-Box and A-Box population is somewhat overstated. Certainly, FactExtractor is capable of populating both T-Box and A-Box _simultaneously_, i.e., relying on one and the same pipeline of analysis. However, I cannot see any aspect of the system that indicates a genuinely _joint_ approach in the sense that T-Box and A-Box knowledge acquisition is closely intertwined in order to exploit mutual dependencies between the two (which would correspond to the common use of the term in the machine learning or NLP literature). I would suggest changing the terminology here.
* The approach is claimed to be based on "supervised, yet reasonably priced" machine learning methods. However, this comes at the cost of a highly demanding crowdsourcing step that somehow questions the generalizability of the approach: I can barely imagine a crowd of laymen annotating natural language text according to a large-scale, generalized frame inventory. From a more long-term perspective, such a generalization step to (a) less restricted domains and (b) beyond Wikipedia text would be clearly necessary at some point, if the authors take their own argument seriously that KB content should be validated against third-party (i.e., non Wikimedia) resources.
* In this context, I also do not completely understand the "anatomy" (weird term) of the crowdsourcing task: The description in Section 7.2.1 and Figure 3 suggest that the sentence to be annotated is presented to the workers together with the frame label. How can this be determined in advance? I suspect that this is done by assuming a fixed mapping between lexical units and a frame, which obviously neglects potential lexical ambiguity at the level of lexical units. This aspect needs clarification, and it should be quantified to what extent such ambiguities really occur and pose a problem to the system.

Given the considerable amount of substantial work that underlies the paper, it is a bit unfortunate that the significance of the results suffers from issues in the experimental settings and the evaluation:

* The evaluation of classification performance (Section 11.1) is conducted in a rather lenient fashion only, as full credit is given to partial overlap of predicted and correct chunks of text. At least for comparison, I would like to see a more strict setting relying on complete overlap (or a discussion why this is not feasible). What is more, it seems to me that chunks that are labeled with "0" in the gold standard (i.e., should not be labeled by the system) are excluded from the evaluation in the first place. Figure 4 suggests, however, that there is a considerable proportion of cases where "0" chunks are erroneously assigned an FE label by the system. This clearly leads to an illegitimate boost of precision. The final version must at least include an additional setting where these cases are correctly evaluated as false positives.
* In purely quantitative terms, the relative gains obtained from A-Box and T-Box augmentation as reported in Tables 5 and 6 are very impressive. However, it would also be interesting to assess the correctness of the additional statements. Given the reservations mentioned in the previous point, I could imagine that there might be a considerable proportion of noise in the extractions. Please provide a snapshot evaluation, e.g., by manually annotating a random sample of extracted assertions.
* The experimental settings include a rather simple strategy for seed selection (for both training the frame/FE classifiers and selecting the sentences to be used for extracting assertions in the first place), viz., sentence filtering according to a maximum length in words. First, for the sake of exactness and replicability of the results, this threshold should be explicitly stated. Second, I am a bit concerned that this strategy might introduce a bias towards shorter sentences with a relatively simple syntactic structure, which might explain why Named Entity Linking serves well as a surrogate for syntactic parsing. If so, this clearly calls the scalability of the approach into question. In any case, I would like to see a more comprehensive discussion of these aspects.
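To make the point about "0" chunks concrete, a stricter chunk-level evaluation could count every "0" chunk that receives a frame element label as a false positive, along these lines (my own sketch with invented labels, not the authors' evaluation code):

```python
def evaluate(gold, pred):
    """Chunk-level precision/recall where gold "0" chunks (no frame
    element) that receive an FE label count as false positives."""
    tp = fp = fn = 0
    for (_, g), (_, p) in zip(gold, pred):
        if g == "0":
            if p != "0":
                fp += 1           # spurious FE label on a "0" chunk
        elif p == g:
            tp += 1
        elif p == "0":
            fn += 1               # missed FE
        else:
            fp += 1
            fn += 1               # wrong FE label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold = [("Germany", "LOSER"), ("0-2", "SCORE"), ("in 1992", "0")]
pred = [("Germany", "LOSER"), ("0-2", "0"), ("in 1992", "TIME")]
print(evaluate(gold, pred))  # the mislabeled "0" chunk now hurts precision
```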

Further comments:

* The article should be rendered more self-contained by making less extensive use of references to the authors' own previous work (Fossati et al., 2013) without giving any substantial details about the approach taken there.
* While it is certainly fair to say that the workflow as proposed in the paper makes use of a "lightweight NLP machinery" only, the NLP pipeline still requires a lot of manual effort due to the construction of domain-specific FrameNets and the manual annotation work that is needed in order to train classifiers for frame and frame element detection. These modules being core parts of the pipeline, it is certainly not adequate to claim that there be "no need for ... semantic role labeling" in FactExtractor.
* Section 5.2: Why is lexical units selection framed as a ranking problem (rather than a filtering/classification problem), and how are the two scores (TF/IDF and standard deviation) combined?
* In Section 9, the formulation "...which we call reification" is misleading, as reification is certainly not a new term that is introduced here.
* Table 2: What are "gold units", what are "untrusted judgments"? Please explain.
* In Section 8, I was surprised to see that low agreement among the annotators on numerical FEs can be recovered from by using rule-based heuristics. What was the source of the low agreement then?
* Section 11.1.1: "Due to an error in the training set crowdsourcing step, we lack of VITTORIA and PARTITA samples": This issue should be corrected in the final version.
* Table 4 mentions "frequency %" in the heading of column 1; the corresponding description in Section 11.2 talks about "absolute occurrence frequencies". Please harmonize.
* Figure 8 definitely needs a better resolution. In the current version, the curves are barely distinguishable, the legend hardly readable.
* Section 13.1: RE (in the authors' use of the term) and OIE are certainly not "two principal fields in Information Extraction", but rather refer to two different paradigms in relation extraction (which is itself a subtask of information extraction).
* p. 13: "lack of ontology property usage in 4 out of 7 classes" --> 3 out of 6?

Review #3
By Andrea Giovanni Nuzzolese submitted on 01/Mar/2016
Major Revision
Review Comment:

The paper presents Fact Extractor, an NLP pipeline that generates machine-readable statements based on the extraction of n-ary relations from a textual corpus.
The pipeline shows its potential when applied to KB enrichment by exploiting textual corpora (e.g., Wikipedia articles for DBpedia).

The overall quality of the writing is good. However, some issues need clarification; furthermore, a few minor issues such as typos require changes (see below). The structure of the paper is very clear and consistent with the hypotheses and contributions listed in the abstract/introduction. Nevertheless, some sections could be merged (e.g., Section 2 into 1, Section 8 into 7, etc.).

Some claims should be toned down. For example, in Section 3 (Use case), when the authors motivate their choice of the soccer domain for the use case, they argue that 5% of the whole English Wikipedia is a significant portion of the whole chapter. However, 5% can hardly be considered a significant portion of a dataset/sample. Moreover, this ratio is given with respect to the English Wikipedia, while in the rest of the paper and in the evaluation the authors use the Italian Wikipedia. Hence, the ratio should be provided with respect to the Italian Wikipedia, or at least the authors should justify this inconsistency.
In Section 5 (Corpus Analysis) the authors "argue that the loss of information is not significant and can be neglected despite the recall costs". Again, the term "significant" should be used with more care, or it should be better justified by providing supporting data and proper analysis.

The authors should provide more details for the following parts:
* Section 6: provide ranks and more context about the selection of the LUs esordire, giocare, perdere, rimanere and vincere.
* Section 8: how many rules were defined for the normalisation of numerical expressions? Are these rules available for consultation?
* Section 11.1: no information about inter-rater reliability is provided. This information is needed in order to demonstrate the value of the gold standard built for the evaluation.
* Section 11.2: can the data about property statistics be made available?
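As an illustration of the kind of rule in question for Section 8 (my own sketch; the authors' actual normalisation rules are not shown in the paper), a rule for score expressions might be:

```python
import re

# Hypothetical rule mapping score expressions such as "0-2" or
# "2 a 0" to a structured pair of integers; illustrative only.
SCORE = re.compile(r"(\d+)\s*(?:-|a)\s*(\d+)")

def normalise_score(text):
    m = SCORE.search(text)
    return (int(m.group(1)), int(m.group(2))) if m else None

print(normalise_score("persero 0-2 contro la Danimarca"))
```

Counting and publishing such rules would make the normalisation step reproducible.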

The representation of generated statements should be better clarified. In fact, in Section 2 the authors provide an example that does not reflect the semantics of the example in Section 9 (page 10). Namely, in the first example the authors state that, from the sentence
"In Euro 1992, Germany reached the final, but lost 0-2 to Denmark", the pipeline generates:
* a new statement "Germany defeat Denmark" where defeat is the frame
* a set of facts about the new property by using FEs.
Instead, in the second example, for the same sentence, the pipeline generates:
* a new statement ":Germany defeat :Defeat01"
* a set of facts about the object :Defeat01
The semantics of the first example leads to extensional issues when adding new facts to a property/frame. These issues seem to be resolved by the second example. However, the statement is completely different, as no explicit binary relation is generated between Germany and Denmark. Rather, a more coherent (with respect to the first example) and correct formalisation could be the following:

:Germany :defeat_01 :Denmark .
:defeat_01 rdfs:subPropertyOf :defeat .
:defeat_01 :winner :Denmark ;
    :loser :Germany ;
    :competition :Euro_1992 ;
    :score "0-2" .

The state of the art is focused on (open) information extraction and semantification. Nevertheless, it misses the comparison with a large slice of relevant works in the areas of machine reading, n-ary relation extraction, and existing frame-related KBs and theories in the Semantic Web. These works include FRED [1] (see also the paper submitted to the SWJ [2]), FrameBase [3] (see also the paper submitted to the SWJ [4]), [5], [6] and [7].
The comparison with Legalo is unfair (cf. [8] for more details). In fact, Legalo:
* relies on FRED (and its frame-based representation of natural language sentences) for generating binary relations;
* works either on Wikipedia articles or generic free text inputs;
* can be used for KB enrichment (cf. property matcher in the architecture of Legalo).

* page 1: "DECIPHERING its meaning". If you do not use any cryptographic techniques, the term "understanding" would be more appropriate.
* page 2: "to CREDIBLE (thus high-quality)" -> "to RELIABLE (thus high-quality)"
* page 2: "from raw text and produces e.g, " -> "from raw text and produces, e.g., "
* page 6: "The selected LUs comply to" -> "The selected LUs comply WITH"
* page 7: "We alleviate this through EL techniques". What does EL stand for?
* page 9: "which we call reification". I would say "called reification"
* page 13/14: "In average, most raw properties" -> "On average, most raw properties"

[1] "Knowledge Extraction Based on Discourse Representation Theory and Linguistic Frames". Valentina Presutti, Francesco Draicchio, Aldo Gangemi. EKAW 2012: 114-129
[2] http://semantic-web-journal.org/system/files/swj1297.pdf
[3] "FrameBase: Representing N-Ary Relations Using Semantic Frames". Jacobo Rouces, Gerard de Melo, Katja Hose. ESWC 2015: 505-521
[4] http://semantic-web-journal.org/system/files/swj1239.pdf
[5] "Gathering lexical linked data and knowledge patterns from FrameNet". Andrea Giovanni Nuzzolese, Aldo Gangemi, Valentina Presutti. K-Cap 2011: 41-48
[6] "Frame Detection over the Semantic Web". Bonaventura Coppola, Aldo Gangemi, Alfio Gliozzo, Davide Picca, Valentina Presutti. ESWC 2009: 126-142
[7] "Towards a Pattern Science for the Semantic Web". Aldo Gangemi and Valentina Presutti. Semantic Web Journal 1.1, 2 (2010): 61-68.
[8] "From hyperlinks to Semantic Web properties using Open Knowledge Extraction". Valentina Presutti, Andrea Giovanni Nuzzolese, Sergio Consoli, Aldo Gangemi, and Diego Reforgiato. Accepted for publication in the Semantic Web Journal (http://semantic-web-journal.org/system/files/swj1195.pdf).


I would like to inform the readers that the paper "From Freebase to Wikidata: The Great Migration" [1] - accepted at the WWW 2016 industry track [2] - mentions the FBK StrepHit soccer dataset in Section 4.4.
The dataset is generated from the baseline classifier output described in this SWJ submission (cf. Sections 1, 10 and 14).

[1] http://static.googleusercontent.com/media/research.google.com/en//pubs/a...
[2] http://www2016.ca/program/industry-track.html#from-freebase-to-wikidata-...