Rule-driven inconsistency resolution for knowledge graph generation rules

Tracking #: 1947-3160

Pieter Heyvaert
Ben De Meester
Anastasia Dimou
Ruben Verborgh

Responsible editor: Guest Editors Knowledge Graphs 2018

Submission type: Full Paper
Knowledge graphs contain annotated descriptions of entities and their interrelations, and are often generated based on rules that state how certain data sources are semantically annotated. Inconsistencies are introduced in these graphs when ontology terms are (re)used without adhering to the restrictions defined by the ontologies, affecting the quality of the graphs. Rules and ontologies are two possible root causes for these inconsistencies. Methodologies and tools were proposed to detect and resolve these inconsistencies. However, they either require the complete knowledge graph, which is not always available in a time-constrained situation; or assume that only the rules can be refined and not the ontologies. In the past, we proposed a rule-driven methodology to detect and resolve inconsistencies without requiring the complete knowledge graph, but it only allows applying a predefined set of refinements to the rules. Therefore, we propose with this paper a rule-driven methodology, extending our previous work, that considers refinements for both rules and ontologies. In this work, we provide (i) a detailed description of our methodology and its implementation; and (ii) our findings when applying the methodology to two real-life use cases: DBpedia and DBLP. The use cases show that our methodology provides valuable insights when determining which refinements should be applied to the rules and ontologies, such as the entities that need to most attention when applying refinements, and the specific ontology terms and definitions that are involved in a lot of inconsistencies and that therefore might be problematic.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 09/Sep/2018
Major Revision
Review Comment:

Summary: The paper considers the problem of inconsistencies in knowledge graphs that are constructed from data sources, an underlying ontology, and RML rules (or mappings, as they are often called) that state how certain data sources are semantically annotated. More specifically, the authors assume that the inconsistencies are caused by the rules and the ontology, while the data sources are error-free. The paper presents a methodology for ranking rules and ontology terms in the order in which they should (as the authors suggest) be inspected by users, to assist them in the manual inconsistency resolution process. The ranking is based on scores assigned to every rule and ontology term, which depend on the number of inconsistencies they are involved in.
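To make the summarized ranking idea concrete, the following is a minimal sketch that assumes the score of a rule or ontology term is simply the number of inconsistencies it is involved in. The paper's actual scoring functions are not reproduced here; the function name `rank_by_involvement` and the identifiers in the sample data are hypothetical.

```python
from collections import Counter


def rank_by_involvement(inconsistencies):
    """Rank items (rules or ontology terms) by how many inconsistencies involve them.

    Assumption: the score is a plain count, not the paper's scoring function.
    """
    scores = Counter()
    for involved in inconsistencies:
        # Count each item at most once per inconsistency.
        scores.update(set(involved))
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical data: each inconsistency lists the rule/term identifiers involved.
inconsistencies = [
    ["rule3", "rule4", "ex:Person"],
    ["rule4", "ex:Furniture"],
    ["rule4", "ex:Person"],
]
ranking = rank_by_involvement(inconsistencies)
```

Under this assumption, the item at the head of `ranking` is the one a user would inspect first.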

Importance and relevance: The problem of resolving inconsistencies in data sources enriched with ontologies and rules is certainly important, and relevant for the Semantic Web Journal.

Originality: In a nutshell, the novelty of the work is supposed to be the ranking function and the methodology for inconsistency resolution depicted in Figure 3. Clearly, these can only be evaluated empirically. Strangely, however, the experimental section (called "Findings") contains example rules from concrete benchmarks but no general evaluation of how well the proposed methodology performs in practice, i.e., how much faster the rules can be repaired when ordered by the developed ranking functions than when ordered randomly. That is, the research question stated on p. 3 of Section 1 does not seem to be addressed; or, if it is addressed indirectly, this is not made apparent. Moreover, an experimental comparison of the proposed method with, e.g., the one from [1] is missing, and it is unclear why.
[1] Heiko Paulheim: Data-Driven Joint Debugging of the DBpedia Mappings and Ontology - Towards Addressing the Causes Instead of the Symptoms of Data Quality in DBpedia. ESWC (1) 2017: 404-418

Significance of the results: The rather poor presentation of the material prevents the reader from fully appreciating the depth of its technical content. Indeed, the methodology is presented at a very high level, without explaining why certain steps are proposed. On the other hand, the extensive implementation section seems too lengthy, and it discusses engineering details (e.g., syntactic forms of rules in Listing 5) rather than conceptual solutions. The ranking functions are inaccurately presented (e.g., capital C and T in the text should be lowercase).

Quality of writing: The general impression is that the paper has been put together in a rush. Indeed, numerous typos (see some examples below), draft leftovers (a crossed-out sentence on p. 10), out-of-order figures (Figure 3 is mentioned in the text before Figure 2), small fonts in figures, and missing figure descriptions (e.g., Figure 1) distract the reader from the technical content. Above all, the paper needs to be proofread by a native speaker.

(Non-exhaustive) list of typos:
- in the abstract: "that need to most attention..."
- p. 5 "the refined version of the rules are used..."
- p. 5 "More, it..."
- p. 6 "Furthermore, the alignment between the knowledge graph and which rules generated them...".
The sentence seems broken?
- p. 7 " order..."
- Examples are not enumerated
- p. 12 "...such those..."
- p. 15 "...the these..."
- p. 15 "...need to most..."
- p. 15 "...were we..."

Review #2
By Robert Andrei Buchmann submitted on 14/Sep/2018
Major Revision
Review Comment:

The paper presents a methodology for managing knowledge graph inconsistencies that originate at the rule/ontology level. A DBpedia case and a DBLP case are presented as applications.

The results are significant for this journal (and its special issue). The presentation is nicely elaborated and built around easy-to-follow examples. In the following, I will focus on the perceived shortcomings, mostly with respect to how the problem is motivated in the earlier sections of the paper:

Issue 1:
Methods for lifting RDF graphs out of tables are well-known, even aiming for standardization. See the RDB2RDF method or the D2RQ tool. Their rules are generic for any kind of relational table-based structures, so the reason for having rules such as those in Listing 3 is not clear to the reader. The argument of the paper seems to be that rules 3 and 4 are conflicting while producing triples 1 and 2, but a typical table-to-graph conversion method would not employ such explicit rules that are prone to errors. They would just assign the type based on the entity represented by the table. According to such generic rules, the ID 0 is only used in the people data source, so it would never be typed as furniture.
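The generic, table-driven typing described above can be sketched as follows. This is an illustration only, not the R2RML or D2RQ API: the namespace, the function name `table_to_type_triples`, and the sample rows are hypothetical. The point is that the subject IRI and the class both derive from the table, so a row of the people table cannot be typed as furniture.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/"  # hypothetical namespace


def table_to_type_triples(table_name, class_iri, rows, key="id"):
    """Emit one rdf:type triple per row; the class is fixed by the table."""
    triples = []
    for row in rows:
        # The subject IRI is scoped by the table name, so keys from
        # different tables never collide.
        subject = f"{BASE}{table_name}/{row[key]}"
        triples.append((subject, RDF_TYPE, class_iri))
    return triples


# Hypothetical data sources: ID 0 occurs only in the people table.
people = [{"id": 0}]
furniture = [{"id": 1}]

triples = (
    table_to_type_triples("people", BASE + "Person", people)
    + table_to_type_triples("furniture", BASE + "Furniture", furniture)
)
```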

Consequently, the example employed as motivation makes the problem feel artificial. The authors must improve this rationale: why would someone write those explicit, inconsistent rules when the existing generic transformations work fine? A more general and powerful example should support the motivation, perhaps one of the "more than 2000 inconsistencies" in DBpedia, with data sources that are not tables. The initial rule examples could be given in the same rule language that is employed later in the paper; currently there is a drastic gap between the example that motivates the work and the examples that later illustrate the contribution.

Issue 2:
Also, it is not straightforwardly convincing to state that rules and ontologies are sources of inconsistencies (they are the means for detecting inconsistencies). In my practical experience, the vast majority of inconsistencies come from data sources; misalignments in ontology axioms or rules are rather short-lived and temporary (in some intermediate, in-progress ontology draft, before it is released in production, i.e., before it has anything to do with the generation of knowledge graphs). Knowledge graphs are typically produced only after a certain level of quality and stability is reached for their vocabulary.

The authors make some distinction between "root causes" under Table 2, but it is brief and fails to introduce a credible scenario where rules/axioms are to blame. Again, references to DBpedia inconsistencies could help with the credibility of this argument, but in the current form the reader is left wondering about the plausibility of the problem statement (in the first half of the paper).

Issue 3:
The rule clustering approach seems to be central to this paper's proposal; however, Section 3.2.1 is one of the briefest in the entire paper (and 4.3.1 does not add much to it). How automated is this clustering? What does "the entity to which a rules relate" mean? Since the examples in Listing 3 are given in natural language, it cannot be assessed how this clustering really works or how it detects the relevant entities. Is the term "entity" used in the traditional sense of the Entity-Relationship model? Are they classes, instances, properties, or any of these? How formal are those clustering rules?

To conclude, the authors make several assumptions that the casual journal readers will not necessarily assume and clarifications are still necessary to motivate the work in a generalized context. Additionally, more detailed explanations on the clustering approach should be given.

Review #3
Anonymous submitted on 06/Oct/2018
Major Revision
Review Comment:

This work presents an approach for ranking rules and ontology terms in order to facilitate their inspection and thereby improve the manual resolution of inconsistencies.

First of all, I think that the work addresses an interesting challenge. In addition, the manuscript is written in a very clear way. The state-of-the-art is complete enough. The technical contribution is sound, and it is also positive that two real use cases have been included.

However, there are some issues that should be addressed in order to improve the understandability of the work. These issues are:

- I think the Introduction section is a little overloaded with 3 Listings and 2 tables. This is because the authors go into detail too soon, when they should actually use the Introduction to present the problem in a less exhaustive way.

- The first sentences of the Abstract and the Introduction are identical. One of them should be reformulated.

- The authors mention "Compared to the solutions that work directly on the knowledge graph, these find the inconsistencies in less time..." This claim should be supported with additional information or, at least, a reference that gives deeper insights.

- The comparison with previous work [12] should be clearer. In its current form, it is the reader who has to seek out and analyze the incremental contribution of this work over the previous one. This should be done by the authors (preferably in the Introduction, just before the current contribution is described).

- I am not very convinced by the organization of Sections 4 and 5. They have many subsections, and some of them are so small that they do not really deserve a subsection of their own.

- After reading the manuscript, it is not clear to me whether the proposed methodology is the only one that facilitates the inspection of inconsistencies. I assume so (mainly because no comparisons have been made with other works), but this should be explicitly indicated.

- At the same time, the major aim of the work, i.e., that this methodology facilitates the manual resolution of inconsistencies, is not proved, just suggested. Could anything be done about it? Maybe a comparison of the time it takes an expert to solve the problem with and without your approach?

Minor issues:

- Listing 2 is referenced in the text before Listing 1
- Resource Description framework (RDF) -> please capitalize the F