Evaluating the Quality of the LOD Cloud: An Empirical Investigation

Tracking #: 1757-2969

Jeremy Debattista
Christoph Lange
Sören Auer
Dominic Cortis

Responsible editor: 
Ruben Verborgh

Submission type: 
Survey Article

The increasing adoption of the Linked Data principles brought with it an unprecedented dimension to the Web, transforming the traditional Web of Documents into a vibrant information ecosystem, also known as the Web of Data. This transformation, however, does not come without pain points. Similar to the Web of Documents, the Web of Data is heterogeneous in terms of the various domains it covers. The diversity of the Web of Data is also reflected in its quality. Data quality impacts the fitness for use of the data for the application at hand, and choosing the right dataset is often a challenge for data consumers. In this quantitative empirical survey, we analyse 130 datasets (~3.7 billion quads), extracted from the latest Linked Open Data Cloud, using 27 Linked Data quality metrics, and provide insights into the current quality conformance. Furthermore, we publish the quality metadata for each assessed dataset as Linked Data, using the Dataset Quality Vocabulary (daQ). This metadata can then be used by data consumers to search for and filter candidate datasets based on different quality criteria. Thereafter, based on our empirical study, we present an aggregated view of Linked Data quality in general. Finally, using the results obtained from the quality assessment, we apply Principal Component Analysis (PCA) to identify the key quality indicators that can give us sufficient information about a dataset's quality.
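The PCA step described in the abstract can be sketched with a minimal numpy-only implementation. The metric matrix below is synthetic stand-in data (the actual 130 × 27 metric scores are not part of this page); the shape and the interpretation of the loadings follow the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the study's data: 130 datasets x 27 metric scores
X = rng.random((130, 27))

# Centre the data, then compute PCA via SVD
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)  # variance explained per component (descending)
loadings = Vt                    # rows: components; columns: metrics

# Metrics with the largest absolute loading on the first few components
# are the candidate "key quality indicators"
key_metrics = np.argsort(-np.abs(loadings[0]))[:5]
```

The `key_metrics` selection here is a simplification: in practice one would inspect several leading components and their explained-variance shares before nominating indicators.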
Solicited Reviews:
Review #1
By Anisa Rula submitted on 10/Nov/2017
Review Comment:

The article has been significantly revised. My reviewer comments have been taken into account. I only have one suggestion for the authors. I would like to see in the final version of the paper a discussion of the efficiency of your approach. How long did it take to run the 27 metrics on 130 datasets? Was it all run at once, or did you run it separately for each dataset? How much manual effort did it take? I would really appreciate it if you put this in the discussion in terms of limitations or advantages of your work.

Review #2
By Amrapali Zaveri submitted on 26/Nov/2017
Review Comment:

The authors have addressed all the issues raised satisfactorily. The only comment I have is to either fix the link https://w3id.org/lodquator ASAP or remove it from the paper and only point to the main page: http://jerdeb.github.io/lodqa/ and update the link to the service there.

Review #3
By Heiko Paulheim submitted on 07/Dec/2017
Minor Revision
Review Comment:

I appreciate that most of my comments have been addressed.

Just a few minor issues:

(1) the research question on p. 2 is too broad, as it refers to "quality of existing data on the Web" - this should be narrowed down to "Linked Data"
(2) on p. 15, the authors state that they consider a term from another vocabulary "if a property or a class refers to an existing term in another vocabulary" - this might be a bit picky, but as long as you do not attempt to derefer the term, you should omit the word "existing" ;-)
(3) I am still not fully convinced by CS9. Many datasets mix terms from different vocabularies. For example, if someone assigns a foaf:Person as a dc:creator of a swrc:Publication, this will be a violation of the metric as defined by the authors. Hence, I do not think that the approach of simply checking for supertypes is a suitable proxy for detecting incorrect domain and range types, and will likely underestimate the quality of a dataset.
(4) In section 6, the authors should also show correlations.

While (1)+(2) are fairly easy to fix, (3) might be a little tricky. One option would be to load the statement, the subject's and object's types, and the vocabularies of the subject, object, and property, plus the transitive set of imports, and then check for consistency (using a very minimalistic A-box instead of the entire A-box, as we did, e.g., in [1]). Since the metric is computed based on a sample only, this should be fairly feasible. Even in a very pessimistic scenario where checking a single statement takes one minute (it should actually be less than that for most vocabularies, which are fairly small), this would not take more than a week.

As the authors mention correlations in section 6, it would be good to also see a correlation matrix for the metrics, e.g., in the form of a heatmap visualization. This would answer the question of whether metrics are correlated or not in the most straightforward way.
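A correlation heatmap of the kind the reviewer suggests could be produced along the following lines. The scores matrix is synthetic (the real per-dataset metric values are not reproduced here), and matplotlib is assumed to be available:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
# Hypothetical metric scores: 130 datasets x 27 quality metrics
scores = rng.random((130, 27))

# Pairwise Pearson correlations between the 27 metrics
corr = np.corrcoef(scores, rowvar=False)

fig, ax = plt.subplots(figsize=(8, 7))
im = ax.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xlabel("metric index")
ax.set_ylabel("metric index")
fig.colorbar(im, ax=ax, label="Pearson r")
fig.savefig("metric_correlations.png", dpi=150)
```

With real data, strongly correlated metric pairs would show up as off-diagonal blocks of saturated colour, directly answering whether some of the 27 metrics are redundant.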

[1] Paulheim and Stuckenschmidt (2016): Fast Approximate A-Box Consistency Checking Using Machine Learning