ABECTO: Assessing Accuracy and Completeness of RDF Knowledge Graphs

Tracking #: 3234-4448

Authors: 
Jan Martin Keil

Responsible editor: 
Guest Editors Tools Systems 2022

Submission type: 
Tool/System Report
Abstract: 
Accuracy and completeness of RDF knowledge graphs are crucial quality criteria for their fitness for use. However, assessing accuracy and completeness of knowledge graphs requires a basis for comparison. Unfortunately, in general, a gold standard to compare with does not exist. As an alternative, we propose the comparison with other, overlapping RDF knowledge graphs of arbitrary quality. We present ABECTO, a command line tool that implements a pipeline-based framework for the comparison of multiple RDF knowledge graphs. For these knowledge graphs, it provides quality annotations like value deviations and quality measurements like completeness. This enables knowledge graph curators to monitor the quality and potential users to select an appropriate knowledge graph for their purpose. With two example applications, we demonstrate the usefulness of ABECTO for the improvement of knowledge graphs.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Edgard Marx submitted on 25/Dec/2022
Suggestion:
Accept
Review Comment:

In this work, the author proposes to assess the accuracy and completeness of knowledge graphs by comparing them with one another, using an approach very similar to link discovery.
The text is very easy to follow, and the author provides all necessary justification for the design choices.
I have a few comments regarding the evaluation; however, I must recognize that the evaluation of the tool is not among the criteria for accepting work as a Tools & Systems Paper, and many papers published in this track do not provide any evaluation. I therefore see the author's effort to evaluate the tool in two different scenarios (although it can be improved) as a further argument for accepting the paper.

#Related work

There are a few publications closely related to your work that I would like to see discussed there, along with an explanation of how they differ from your work.

Valdestilhas, A., Soru, T., Saleem, M., Marx, E., Beek, W., Stadler, C., ... & Riechert, T. Identifying, Querying, and Relating Large Heterogeneous RDF Sources.

Ngomo, A. C. N., & Auer, S. (2011, June). LIMES—a time-efficient approach for large-scale link discovery on the web of data. In Twenty-Second International Joint Conference on Artificial Intelligence.

Volz, J., Bizer, C., Gaedke, M., & Kobilarov, G. (2009, January). Silk – A Link Discovery Framework for the Web of Data. In LDOW.

#Requirements

“it is necessary to enable the exclusion of known deviations” -> This might be the only thing that the user would like to know. Therefore, if you are looking for completeness, the number of deviations might be exactly what you are looking for. I do not see this as a requirement but as a feature.

#Approach

The author discusses a Jaro-Winkler processor, but what about other metrics such as Levenshtein, cosine, and Jaccard?
Does it only use Jaro-Winkler? Why?
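To illustrate, other string measures could be offered alongside Jaro-Winkler with little effort; a rough plain-Python sketch (standard library only, not ABECTO code, helper names made up):

    from difflib import SequenceMatcher

    def edit_based_similarity(a: str, b: str) -> float:
        # Not Levenshtein proper, but difflib gives a cheap edit-based ratio.
        return SequenceMatcher(None, a, b).ratio()

    def jaccard_similarity(a: str, b: str) -> float:
        # Word-level Jaccard similarity of the two labels.
        ta, tb = set(a.lower().split()), set(b.lower().split())
        return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0

    for x, y in [("metre", "meter"), ("newton metre", "newton meter")]:
        print(x, "|", y,
              round(edit_based_similarity(x, y), 2),
              round(jaccard_similarity(x, y), 2))

A short discussion of why Jaro-Winkler was chosen over such alternatives would strengthen the paper.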

You should also highlight the limitation that ABECTO does not provide a way of comparing two different schemas: since the variables in the SPARQL queries flatten the result, the user has to know the schemas of both target knowledge bases.
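To make this concrete: as I understand it, the user has to hand-write one selection query per knowledge graph, both projecting onto the same flat variable names, which presupposes knowledge of both schemas (the property IRIs below are made up for illustration, they are not from the paper):

    # One hypothetical selection query per knowledge graph; both must bind the
    # same flat variables (?key, ?label, ?symbol) to become comparable.
    QUERY_KG_A = """
    SELECT ?key ?label ?symbol WHERE {
      ?key a <http://example.org/kgA/Unit> ;
           <http://www.w3.org/2000/01/rdf-schema#label> ?label ;
           <http://example.org/kgA/symbol> ?symbol .
    }
    """

    QUERY_KG_B = """
    SELECT ?key ?label ?symbol WHERE {
      ?key a <http://example.org/kgB/UnitOfMeasure> ;
           <http://example.org/kgB/unitName> ?label ;
           <http://example.org/kgB/unitSymbol> ?symbol .
    }
    """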

It is also important to highlight that the selection and comparison approach is very similar to link discovery frameworks such as LIMES, which also work through SPARQL queries. This leads me to the following question:

Why did the author not compare existing link discovery frameworks with his approach? I understand that the metrics used for comparison are different, but they can easily be derived from the linking result.

This should all be addressed in the work.

#Results

The results in Table 1 are confusing, e.g., QUDT, OEM, and SWEET together represent around 56% of the Wikidata unit count. How is it possible that the completeness of Wikidata is 55% while the others are less than 20%? What am I missing? A text explaining the measurements in each column would be helpful. The same applies to quantity kinds, where the summation is over 100%.
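My guess, and it is only a guess that the paper should confirm, is that the completeness values are mark-and-recapture style estimates relative to an estimated total population, e.g., in a Lincoln-Petersen-like form

    \hat{N} = \frac{n_1 \, n_2}{m_{12}}, \qquad \mathrm{completeness}(\mathrm{KG}_i) = \frac{n_i}{\hat{N}},

where n_i is the number of resources in KG_i and m_{12} is the number of mapped resource pairs between two graphs. Because the graphs overlap, such per-graph ratios relative to an estimated total population need not sum to 100%; if this is the intended reading, it should be stated explicitly in the table caption.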

Table 2 does not offer much information, e.g.:
- How many of the instances in one knowledge graph correspond to another, and vice versa?
- Which properties are missing?
- How many of these resources overlap?
- What is the parameter setup used in the experiments?
- What are the most common errors?

Where are the results reported using the ontology described in the previous sections?
Please provide the result reports together with the setup and link them to the paper in the respective sections.

#Minors

The definitions of pairwise overlap and population size in the completeness processor seem to be wrong.

“comply to” -> comply with

“containing these resource as subject or object” -> containing this resource as subject or object

“will considered to correspond” -> will be considered to correspond

“a OWL” -> an OWL

Review #2
Anonymous submitted on 03/Jan/2023
Suggestion:
Reject
Review Comment:

This paper presents a CLI tool for comparing knowledge graphs in terms of accuracy and completeness. The idea presented is very interesting, as it avoids the necessity of a gold standard to compare KGs. However, there are important issues with respect to how the resource is presented and also with respect to its impact. As the paper is submitted as a system/tool report, I will use the recommendations of the journal to review it.

(1) Quality, importance, and impact of the described application
The impact of the application is motivated through two different projects (Comparison of Units of Measurement Data from Four Knowledge Graphs, and Comparison of Space Flight Data in Wikidata and DBpedia). These two cases seem to be created ad hoc for the paper, as there is neither a detailed explanation of each of them nor a clear motivation of their necessity in the community. Additionally, there is no demonstration of the adoption of the tool by the community. Although I think the tool covers an important aspect of evaluating knowledge graphs (during their construction, or regarding their current quality), the impact of the resource is very limited at this moment.

(2) Clarity and readability of the describing paper
- This point is the weak part of the paper and the main reason for my final decision. The paper reads more like a technical report (like a wiki or a how-to) than a journal paper. For example, section 4 presents the tool description, as a software resource, but I would like to know which algorithms are implemented and how the tool deals with the potential issues of the addressed problem (e.g., scalability). Although the presented ABECTO vocabulary seems interesting, is it necessary to present all the links to the used vocabularies? See other papers on how to present (and represent) vocabularies [1][2]. (There are many examples throughout the paper of information that could be moved to a GitHub wiki and is not very relevant for a paper.)

- The related work is more a background section than a proper related work section, and the introduction is missing entirely, leaving the paper without a clear motivation.

- The requirements presented are pretty interesting, but they should be compared with what is already presented in the literature (it is not valid to say that they are extracted based on experience).

- There is no experimental evaluation of the engine, so it is very difficult to validate what is presented. I am happy to see that the source code on GitHub is linked to Zenodo and that a DOI is generated for each release.

For all these reasons I recommend rejection, but I would like to encourage the authors to prepare an improved version of the paper and to include the tool in projects or other investigations that would increase its impact. I would also recommend that the authors take a look at papers published in this track of the journal, but also in other venues with a similar track (resources), such as ESWC or ISWC.

[1] Ruckhaus, E., Anton-Bravo, A., Scrocca, M., & Corcho, O. (2017). Applying the LOT Methodology to a Public Bus Transport Ontology aligned with Transmodel: Challenges and Results. Semantic Web, (Preprint), 1-19.
[2] Chávez-Feria, S., García-Castro, R., & Poveda-Villalón, M. (2022). Chowlk: from UML-based ontology conceptualizations to OWL. In European Semantic Web Conference (pp. 338-352). Springer, Cham.

Review #3
Anonymous submitted on 04/Feb/2023
Suggestion:
Major Revision
Review Comment:

The paper describes ABECTO, a command line tool to assess the quality of RDF knowledge graphs in terms of completeness and accuracy. As described by the authors, quality assessment is usually performed against a gold standard, which is not possible in the case of knowledge graphs, as such a reference dataset does not exist. Therefore, the authors propose to compare the quality of the portions of knowledge graphs that describe the same things, for instance, to compare musician entities from different knowledge graphs.

Overall, I had a good impression of the work, although I think the idea is limited since it only provides possibly missing data. Each dataset has its own characteristics that may be guided by a particular use case; therefore, in terms of completeness, one dataset can be more complete than another, but this is not wrong if the aim of building such a dataset is not to make it more complete than others. The problem would be more interesting with respect to the consistency of the same information. In this case, the author may identify whether a fact expressed in one dataset is consistent with respect to the others.

Your work may have potential impact since it is a follow-up of a previously impactful contribution. However, this is usually insufficient unless you can make a very convincing case that impact beyond your own range of influence will be had very soon, e.g., by your tool being used in other research groups or in other collaborative projects.

I have more detailed section-by-section comments below.

Sec 1
In the Motivation section, which should be renamed Introduction, I think you should expand and give a little background and some statistics regarding the importance of assessing the quality of knowledge graphs and explain why it is important (thus including the motivation part). What are the challenges, and how are some of them solved by state-of-the-art approaches (mention the most important ones)? What are the open issues? This is the part where you start introducing ABECTO. A reference to the tool would be necessary from the beginning, such as a GitHub repository where the tool can be accessed immediately, without waiting to read the other sections.

Sec 2
A lot of other works are discussed, and the focus has been given to works proposing a tool. Although the discussion is interesting because it highlights the tools for assessing and improving the quality of knowledge graphs, which are not so many, I still think that other relevant works in the same direction are missing. For instance, the work of Nandana et al. considers more quality metrics, and the comparison is made between different versions of the same knowledge graph over time [A quality assessment approach for evolving knowledge bases].

But the comparison with the state-of-the-art should pay more attention to other works as well, such as:
- Jeremy Debattista et al. 2020: Evaluating the quality of the LOD cloud: An empirical investigation (in particular, here you will find a very wide range of metrics applied in the LOD cloud)
- Knowledge Graph Completeness: A Systematic Literature Review (in this work there is a very thorough discussion about the completeness quality dimension which is very relevant for your work)
- Zaveri et al. 2016: Quality assessment for Linked Data: A Survey

Sec 3
Regarding the requirements list: in R4, the integration of quality results on the same dataset can be made possible by aligning the schema elements, but I don't see how this will be possible for quality results extracted from other tools. In order to measure the same quality dimension, we should be able to check whether the same quality metrics are applied and whether the comparison is made with the same reference dataset. Integrating quality results from different tools is not that straightforward.

An additional point for the requirements is to provide a user interface with a dashboard where people can visualize and check the quality issues. Second, scalability is another relevant requirement, especially when you want to compare several datasets with each other.

Sec 4

"Further resources are provided for the use inside the metadata graphs. The properties av:relevantResource, and av:correspondsToResource, av:correspondsNotToResource are available for the representation of the belonging of a resource to an aspect and for mapping results."
-> Can you explain this better? What is a relevantResource? I am trying to guess while reading this sentence. I think you should provide all the explanations needed to understand the details. (I sketch my guess below.)
You need to explain all the new terminology so that it can be easily understood, e.g., predefined metadata graphs, default graphs, or key variables.
Why use "could" instead of "can"? Aren't the variables used by the processors?
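My best guess of how these properties are meant to be used is sketched below (rdflib; the av: namespace IRI, the example resources, and the direction of the statements are my own assumptions, which is exactly why an explanation in the paper is needed):

    from rdflib import Graph, Namespace

    AV = Namespace("http://example.org/abecto-vocabulary#")  # placeholder IRI
    EX = Namespace("http://example.org/")                     # placeholder resources

    meta = Graph()
    # belonging of resources to an aspect (direction assumed)
    meta.add((EX["aspectUnit"], AV.relevantResource, EX["kgA/metre"]))
    meta.add((EX["aspectUnit"], AV.relevantResource, EX["kgB/meter"]))
    # mapping results between resources of different knowledge graphs
    meta.add((EX["kgA/metre"], AV.correspondsToResource, EX["kgB/meter"]))
    meta.add((EX["kgA/metre"], AV.correspondsNotToResource, EX["kgB/mile"]))
    print(meta.serialize(format="turtle"))

If my reading is wrong, that only reinforces the need for a worked example in the paper.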

I think that section 4.1 should be reorganized to say that you are considering several parts of the vocabulary, i.e., that you are proposing a modularization of the vocabulary, and then explain each module. In this way, it will be easier to follow. What is the connection with Figures 4, 5, and 6? It is not just an example; it is the whole vocabulary. I would suggest keeping this in a separate section.

Are the parameters in section 4.2.1 all mandatory?

Regarding the Mapping Processor in section 4.2.3: do we need to know a priori what should be in the graph? It seems that we need to construct all the possible mappings.

Sec 5
This section could be integrated into the previous section or merged into a discussion section. Usually, a section contains more than one paragraph.

Sec 6
Why should we consider this workflow to be cyclic? Aren't we supposed to assess the quality of our knowledge graphs and then improve them? Why do we need to go in a cycle?

* What does it mean that the plan execution is triggered automatically? Does "provides data for the further process" mean data for the next steps?
* What does it mean to extend the knowledge graph?
* "own" knowledge graph -> is that the default, or do you mean something else? -> rephrase
* change or addition of a value in another knowledge graph -> how is it possible to make changes in another knowledge graph that we do not own? Additional information is needed to better understand what you mean by this.
* You should extend and better explain all the cases under the results analysis.
* Knowledge graph refinement: how does this change with respect to the previous steps? Extend and explain all the necessary details.

Sec 7
Rename the section to something like "Use Case Applications".

* Rephrase: "For a correct interpretation please note, that in contrast to the other knowledge graphs".
* The Jaro-Winkler similarity metric alone seems to be limited. What about a metric for number similarity? What about other metrics?

* Table 1 is not clear: from the columns, it is not clear what the overlap between these knowledge graphs is, and it is not clear how the comparison is done. I think that for completeness and accuracy we need to know the details of the formulas applied. What is the quantity kinds count?

* do not use phrases such as "hopefully" in scientific work.

Sec 8

I think that there are also other metrics that do not need a gold standard, such as the metrics proposed by Debattista et al. Even in this work you still need something to compare against, but I don't think we can use exact metrics for the comparison, since the knowledge graphs do not present the same reality. Here we should consider the problem of the open-world assumption vs. the closed-world assumption. Something that is missing is not necessarily wrong, since the objective is different for different providers.