Evaluating the Quality of the LOD Cloud: An Empirical Investigation

Submitted by Jeremy Debattista on 11/01/2017 - 05:22

Tracking #: 1757-2969

Authors:

Jeremy Debattista

Christoph Lange

Sören Auer

Dominic Cortis

Responsible editor:

Ruben Verborgh

Submission type:

Survey Article

Abstract:

The increasing adoption of the Linked Data principles brought with it an unprecedented dimension to the Web, transforming the traditional Web of Documents to a vibrant information ecosystem, also known as the Web of Data. This transformation, however, does not come without any pain points. Similar to the Web of Documents, the Web of Data is heterogeneous in terms of the various domains it covers. The diversity of the Web of Data is also reflected in its quality. Data quality impacts the fitness for use of the data for the application at hand, and choosing the right dataset is often a challenge for data consumers. In this quantitative empirical survey, we analyse 130 datasets (~3.7 billion quads), extracted from the latest Linked Open Data Cloud using 27 Linked Data quality metrics, and provide insights into the current quality conformance. Furthermore, we publish the quality metadata for each assessed dataset as Linked Data, using the Dataset Quality Vocabulary (daQ). This metadata is then used by data consumers to search and filter possible datasets based on different quality criteria. Thereafter, based on our empirical study, we present an aggregated view of the Linked Data quality in general. Finally, using the results obtained from the quality assessment empirical study, we use the Principal Component Analysis (PCA) test in order to identify the key quality indicators that can give us sufficient information about a dataset's quality.

Full PDF Version:

swj1757.pdf

Previous Version:

Evaluating the Quality of the LOD Cloud: An Empirical Investigation

Tags:

Reviewed

Decision/Status:

Solicited Reviews:


Review #1

By Anisa Rula submitted on 10/Nov/2017

Suggestion:
Accept

Review Comment:

The article has been significantly revised, and my reviewer comments have been taken into account. I have only one suggestion for the authors: in the final version of the paper, I would like to see a discussion of the efficiency of the approach. How long did it take to run the 27 metrics on 130 datasets? Was it all run at once, or did you run it separately for each dataset? How much manual effort did it take? I would really appreciate it if this were covered in the discussion, framed as limitations or advantages of your work.

Review #2

By Amrapali Zaveri submitted on 26/Nov/2017

Suggestion:
Accept

Review Comment:

The authors have addressed all the issues raised satisfactorily. The only comment I have is to either fix the link https://w3id.org/lodquator ASAP or remove it from the paper and only point to the main page: http://jerdeb.github.io/lodqa/ and update the link to the service there.

Review #3

By Heiko Paulheim submitted on 07/Dec/2017

Suggestion:
Minor Revision

Review Comment:

I appreciate that most of my comments have been addressed.

Just a few minor issues:

(1) the research question on p. 2 is too broad, as it refers to "quality of existing data on the Web" - this should be narrowed down to "Linked Data"
(2) on p. 15, the authors state that they consider a term from another vocabulary "if a property or a class refers to an existing term in another vocabulary" - this might be a bit picky, but as long as you do not attempt to dereference the term, you should omit the word "existing" ;-)
(3) I am still not fully convinced by CS9. Many datasets mix terms from different vocabularies. For example, if someone assigns a foaf:Person as a dc:creator of a swrc:Publication, this will be a violation of the metric as defined by the authors. Hence, I do not think that the approach of simply checking for supertypes is a suitable proxy for detecting incorrect domain and range types, and will likely underestimate the quality of a dataset.
(4) In section 6, the authors should also show correlations.
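
To make the concern in (3) concrete, here is a minimal sketch of a supertype-based range check. It is not the authors' implementation of CS9; the vocabulary terms (`vocA:`, `vocB:`), the subclass axioms, and the helper `range_ok` are hypothetical and chosen only for illustration. It shows how a check that sees the subclass axioms of a single vocabulary can report a violation that disappears once axioms connecting the vocabularies are also loaded.

```python
# Toy illustration only: all vocabulary terms and axioms below are hypothetical.

def superclasses(cls, subclass_of):
    """Return cls plus every class reachable through subclass-of edges."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(subclass_of.get(current, ()))
    return seen


def range_ok(object_types, declared_range, subclass_of):
    """True if any asserted type of the object is the declared range or a subclass of it."""
    return any(declared_range in superclasses(t, subclass_of) for t in object_types)


# Statement: <pub1> vocB:creator <alice>, where <alice> is typed vocA:Person
# and vocB:creator is assumed to declare the range vocB:Agent.
object_types = {"vocA:Person"}
declared_range = "vocB:Agent"

# Axioms visible when only vocabulary A is consulted.
local_axioms = {"vocA:Person": {"vocA:Agent"}}

# Axioms visible after also loading an assumed cross-vocabulary mapping.
merged_axioms = {
    "vocA:Person": {"vocA:Agent"},
    "vocA:Agent": {"vocB:Agent"},
}

flagged_locally = not range_ok(object_types, declared_range, local_axioms)
flagged_merged = not range_ok(object_types, declared_range, merged_axioms)
print(flagged_locally, flagged_merged)
```

In this toy setting the same statement is flagged under the local axioms and accepted under the merged ones, which is the kind of underestimation of quality the comment warns about.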

While (1)+(2) are fairly easy to fix, (3) might be a little tricky. One option would be to load the statement, the subject's and object's types, and the vocabularies of the subject, object, and property, plus the transitive set of imports, and then check for consistency (using a very minimalistic A-box instead of the entire A-box, as we did, e.g., in [1]). Since the metric is computed based on a sample only, this should be quite feasible. Even in a very pessimistic scenario where checking a single statement takes one minute (it should actually be less than that for most vocabularies, which are fairly small), this would not take more than a week.
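
The time estimate follows from simple arithmetic. As a sketch, assuming the pessimistic one minute per statement and strictly sequential checking (the sample size itself is not stated in this review, so only the implied upper bound is computed):

```python
MINUTES_PER_STATEMENT = 1        # pessimistic per-statement cost named in the review
MINUTES_PER_WEEK = 7 * 24 * 60   # minutes in one week

# Largest sample that can be checked sequentially within one week
# under the assumptions above.
max_statements_per_week = MINUTES_PER_WEEK // MINUTES_PER_STATEMENT
print(max_statements_per_week)
```

Under these assumptions, "not more than a week" holds for any sample of up to roughly ten thousand statements.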

As the authors mention correlations in section 6, it would be good to also see a correlation matrix for the metrics, e.g., in the form of a heatmap visualization. This would answer the question of whether metrics are correlated or not in the most straightforward way.
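
A correlation matrix of this kind could be computed along the following lines. This is a minimal sketch in plain Python using Pearson correlation; the metric names and values are made up for illustration and are not results from the paper, and the helpers `pearson` and `correlation_matrix` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (both must have non-zero variance)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(scores):
    """Return the metric names and the square matrix of pairwise correlations."""
    names = list(scores)
    matrix = [[pearson(scores[a], scores[b]) for b in names] for a in names]
    return names, matrix


# Made-up metric values for four datasets, NOT results from the paper.
scores = {
    "metric_a": [0.2, 0.4, 0.6, 0.8],
    "metric_b": [0.1, 0.3, 0.5, 0.7],  # rises together with metric_a
    "metric_c": [0.9, 0.7, 0.5, 0.3],  # falls as metric_a rises
}

names, matrix = correlation_matrix(scores)
for name, row in zip(names, matrix):
    print(name, [round(value, 2) for value in row])
```

The resulting matrix can be handed to a plotting library (for example, matplotlib's `imshow`) to render the heatmap the comment asks for.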

[1] Paulheim and Stuckenschmidt (2016): Fast Approximate A-Box Consistency Checking Using Machine Learning
