Quality Assessment for Linked Data: A Survey

Tracking #: 773-1983

Authors: 
Amrapali Zaveri
Anisa Rula
Andrea Maurino
Ricardo Pietrobon
Jens Lehmann
Sören Auer

Responsible editor: 
Pascal Hitzler

Submission type: 
Survey Article
Abstract: 
The development and standardization of semantic web technologies have resulted in an unprecedented volume of data being published on the Web as Linked Data (LD). However, we observe widely varying data quality ranging from extensively curated datasets to crowdsourced and extracted data of relatively low quality. In this article, we present the results of a systematic review of approaches for assessing the quality of LD. We gather existing approaches and analyze them qualitatively. In particular, we unify and formalize commonly used terminologies across papers related to data quality and provide a comprehensive list of 18 quality dimensions and 69 metrics. Additionally, we qualitatively analyze the 30 core approaches and 12 tools using a set of attributes. The aim of this article is to provide researchers and data curators with a comprehensive understanding of existing work, thereby encouraging further experimentation and development of new approaches focused on data quality, specifically for LD.

Decision/Status: 
Accept

Solicited Reviews:
Review #1
By Aba-Sah Dadzie submitted on 27/Nov/2014
Suggestion:
Accept
Review Comment:

The paper is, overall, now fairly easy to follow, and should provide a good guide to the reader looking for information on data quality in the field. The update to 2014 also increases currency and breadth.

A few additional comments:

* The introduction states that the exclusion criteria include non-peer-reviewed publications. This therefore excludes theses - they are examined, not peer-reviewed. The section on the selection of approaches and referencing should either state this exception or explain why theses are considered to be peer-reviewed. Note that this comment is based not on the content or value of theses, but simply on the way in which academic literature is classified.

* The concluding sections of the paper mention, for the first (and only) time, one of the references on which the selection of data quality criteria was based (actually stated as the one that provided the initial list). This should be brought forward to the introduction, where the methodology followed is described. While this is for the authors to decide, it would be useful to provide a bit more detail on its contribution to the analysis/paper.

* The intra-relations in 4.2.6 are a bit confusing, at least partly because they are written differently from all the others. At the end of this section I am still not sure what it is saying - what I read is that this set may or may not give a measure of quality or correctness. I would have thought that using them in concert SHOULD resolve the issues highlighted?

* If demos are classed as not available (S5.4), surely LinkQA must be considered as such - if the user can select no input and all other features/test criteria are set automatically, it is a demo.

* Check that all the criteria in the tables match what is in the text - they are mostly identical or nearly so; however, e.g.,
- I1 includes "(ii) via crowdsourcing [1,64]" in Table 2 but this is not discussed in the text.
- S2 also in Table 2 - "verifying authenticity of the dataset based on a provenance vocabulary such as the author and his contributors," - author and contributors are NOT provenance vocabulary - in this case the text actually states this correctly.
- CS6 in Table 3 - "...such that reasoning over data using those external terms is affected ..." - the matching text says the opposite - "such that reasoning over data using those external terms is [not] affected" - which of these is correct? - I would suspect the latter?

* I am not convinced that either conclusion in footnote 20 is substantiated. Importantly, how does large size generate "semantically incorrect rules"?

* End of S5 - "DaCura, TripleCheck-Mate, LiQuate (2013) and RDFUnit (2014) and [?] are currently being maintained." - is there a tool missing where '[?]' is inserted, or is this a typo?

* There are a number of grammatical errors/typos that need to be fixed. There is also a noticeable change in writing style in S5.4. A proofread should probably resolve both.

* on structure:

- A few tables are placed a couple of pages after where they are first mentioned. Bringing them to the first position in the text where there is enough room to place them would make it easier to do the heavy cross-referencing needed to read them.

- Why is a bit.ly link used for the Mendeley URL in footnote 4? Space isn't an issue, and the full URL is more meaningful.

- There is a bit of inconsistency in citation and referencing, e.g.:
* Fürber et al. [20] are referred to several times as "he" (as opposed to "they")
* a few references in the text are not accompanied by reference numbers
* the same conference series appears with differences in full form and/or acronym.
* reference 19 is in German - since the target readership is English, it would be useful to provide a translation of the title with it. Actually, this IS done in Table 1, but it took double-checking to confirm it was the same reference - it needs to be done in the references section as well.
* not a deal breaker, but, e.g., 51 is a collection - this is not obvious either in the way it is cited or in the references (using eds. would help) - in the latter it appears to be an incomplete reference.