Quality Assessment Methodologies for Linked Data: A Survey

Tracking #: 682-1892

Authors: 
Amrapali Zaveri
Anisa Rula
Andrea Maurino
Ricardo Pietrobon
Jens Lehmann
Sören Auer

Responsible editor: 
Pascal Hitzler

Submission type: 
Survey Article
Abstract: 
The development and standardization of Semantic Web technologies have resulted in an unprecedented volume of data being published on the Web as Linked Data (LD). However, we observe widely varying data quality, ranging from extensively curated datasets to crowdsourced and extracted data of relatively low quality. In this article, we present the results of a systematic review of approaches for assessing the quality of LD. We gather existing approaches and analyze them qualitatively. In particular, we unify and formalize commonly used terminologies across papers related to data quality and provide a comprehensive list of 18 quality dimensions and metrics. Additionally, we qualitatively analyze the approaches and tools using a set of attributes. The aim of this article is to provide researchers and data curators with a comprehensive understanding of existing work, thereby encouraging further experimentation and the development of new approaches focused on data quality, specifically for LD.

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Aidan Hogan submitted on 13/Jun/2014
Suggestion:
Accept
Review Comment:

Compared with the previous version, the authors have done a good job in streamlining the conceptualisation of Linked Data quality into something more concise and understandable (which was my main concern in the previous review). They have also addressed various other issues that my review raised. As a result, I am recommending an accept.

I list some final minor comments with respect to phrasing and typos that the authors should look into for the camera-ready version (some of the flagged items are not strictly incorrect, but are merely distracting):

MINOR COMMENTS:

<> = replace
[] = add
{} = remove

THROUGHOUT:
* Watch capitalisation of the Web
* I would suggest to put commas before "e.g." and "i.e." in most places (and possibly even after, but this is a style preference)
* Keep singular/plural of data consistent (e.g., "those data" vs. "the data is")

INTRODUCTION:
* "comprises [of ]close to"
* "government, etc.\footnote{2}{.}"
* "based on [the ]results of"
* "there are more concrete"
* "Despite the quality LD being ..."
* "for assessing the quality LD ..."
* "Providing semantics links is another ..." This sentence is a bit fluffy.
* "metrics for each of the dimensions along with ..." Awkward sentence. Rephrase.

Section 2:
* "tackles, in particular, {the} problems (i)--(iii) in that{,} it"
* "to justify the need conducting ..."
* " are the data quality ..."
* "to shortlist potential[ly] relevant ..."
* "focus trust related ..."

SECTION 3:

* "which are prone to the non-exploitations of those data" Awkward phrasing.
* "rely on quality indicator[s] that the assessment"
* "(i) Accessibility (ii) Intrinsic (iii) Contextual and (iv) Representational groups." A nit-pick but the latter three are adjectives and the former a noun. It's a little distracting.

SECTION 4:
* Footnote 6: "[http://]linkedgeodata.org"
* Table 2: "a[n] RDF dump"
* "derefer[e]ncability" or possibly even "derefer[e]nc[e]ability"
* "from both {the }datasets"
* "as the mean[s] a customer"
* Footnote 10: "excluding {the }blank nodes"
* "degree of using digital signatures ..." Entire phrase is awkward. Rephrase.
* "prevent users with the competitor"
* "or search engine, not on the dataset itself"
* "the performance criterion comprises [of ]aspects of ..." I appreciate this is a direct quote so maybe you can leave it as it is.
* "completely represents {the }real world data" or maybe even better: "completely represents the real world{ data}"
* Table 3: "deviation{s}-based and distribution based method[s]"
* "using functional dependenc rules" or "using functional dependencies{ rules}"
* "no. of {the }instances contained in the semantic metadata set"
* "deviation{s}-based"
* Footnotes 13--20. It's not really consistent how capitalisation and full-stops (aka. periods) are used.
* Footnote 20: "and the semantically incorrect rules be generated."
* "including clear definitions [of ]what inconsistency means"
* "Moroever" -> "Moreover"
* Footnote 23: "unique name{s} assumption".
* "no. of {the }instances ..."
* "It should be noted{,} that in this case"
* "flight related information{,} simplifies"
* Table 4: "counting the occurrence[s] of"
* Table 4: "trustworthiness of [the ]information provider"
* "scale of 1 -- 9" -> "scale of 1--9"
* Footnote 26 talks about "context trust" but the text talks about "content trust"
* "information when it is provided"
* "can be trusted{,} when it is"
* "Data related to Boston in the integrated dataset{,} for the required flight is represented as follows:" The example doesn't really show data, just labels?
* "because it reflects a too old state of the real world for a specific usage" Poorly phrased.
* "where [a ]score"
* "outdated [and ]thus unacceptable"
* "between [the ]last modified time of the data source and [the ]last modified time of the dataset"
* "from {a }city A to {a }city B"
* "This is the only article that describes this dimension (from the core set of articles included in this survey)." But in each such case, you present metrics under that dimension from other papers, so why say this? Do you mean that that paper is the only one that explicitly defines the metric?
* "whereas in the other case, [the ]date is represented"
* "as the more versatile forms" What is a versatile form? Maybe just a form?
* "Another dimension in the representational group, versatility is related ..." Poorly phrased.
* " set of non-exhaustive examples of inter-relations between the dimensions{,} belonging to different groups{,} indicate[s] the ..."

SECTION 5:
* "In this section, we compare the 21 selected approaches ... (Comparison perspective of selected approaches)." Poorly phrased.
* I would suggest: "RDF triple[ processes]", "RDF graph[ processes]", "Dataset[ processes]" in the itemised list following "three types of data". Otherwise it reads like you're talking about actual RDF triples/graphs/datasets (which you are not in the text snippets).
* Table 7: why is the caption sideways and the table not? Also, "Occurrence[s]".
* "dataset{s} assessment"
* "assess data [on ]both [a ]triple and graph level"
* "propagated a higher level such as [the ]graph or dataset level"
* "The Trellis user interface allows several users to express ... The tool allows several users ..." Distracting repetition of "several users".
* Table 8: Hogan et al., 2010 has a tool URL but no tool tick?
* "For example, in TrustBot, [which is] an IRC bot that makes ..." Otherwise easy to misread as "an IRC bot in TrustBot makes ..." as opposed to "Trustbot is an IRC bot that makes..."

Review #2
By Aba-Sah Dadzie submitted on 12/Jul/2014
Suggestion:
Major Revision
Review Comment:

The paper is much easier to read. However, I'm not completely sure this isn't because I've now read it several times. There are still sections that are a bit difficult to interpret. The inter-relations sections in general state what is easily inferred or highlight fairly thin connections. S4.5 especially is a bit difficult to follow, and I'm not too sure it really contributes much to the discussion. Most of S5.2, and especially the definitions of quantitative vs. qualitative, are probably not required; even for anyone who might not know what they are, they are easy enough to interpret within the context.
The point of the examples is not always easy to decipher, and they sometimes actually raise questions rather than answer them. E.g., the flight code in 4.2.2 & 4.2.3 left me wondering if the error could be correctly raised - maybe this is a case where leaving it to an inter-relations section would work. The authors could then illustrate more clearly where, even if one criterion did not recognise a quality issue, another used in conjunction with it would.
In 4.1.4 the example doesn't make sense - a spoof competitor with HIGHER prices won't entice people away from the real one. 4.4.2 - the date example is a bit far-fetched: differences in formatting for dates are a known issue, the example given is actually one of the simpler ones encountered, and any basic application should be able to handle it and should probably anticipate such cases. Also, what exactly is meant by using LD principles to provide data (wrt time/dates) - which principles, and to correct what, using XSD or the time ontology? Does either XSD or the ontology violate LD principles?
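To make the point about date handling concrete, here is a minimal sketch (not taken from the paper under review; the namespace, resource, predicate and lexical forms are hypothetical) of how a basic consumer could reconcile two representations of the same day:

from datetime import datetime

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")  # hypothetical namespace for this illustration
g = Graph()

flight = URIRef(EX["flight/XYZ123"])  # hypothetical resource
# Source A publishes the departure date as a typed xsd:date literal (ISO 8601).
g.add((flight, EX.departureDate, Literal("2014-07-12", datatype=XSD.date)))
# Source B publishes the same day as an untyped string in a day/month/year format.
g.add((flight, EX.departureDate, Literal("12/07/2014")))

def normalise(lit):
    """Map both representations onto a single Python date value."""
    if lit.datatype == XSD.date:
        return lit.toPython()  # rdflib parses xsd:date literals into datetime.date
    return datetime.strptime(str(lit), "%d/%m/%Y").date()

dates = {normalise(o) for o in g.objects(flight, EX.departureDate)}
print(dates)  # {datetime.date(2014, 7, 12)} - both literals denote the same day

Whether such normalisation counts as "using LD principles" is exactly the open question above; the sketch only illustrates that the format mismatch itself is mechanically easy to handle.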

Importantly, my key reservation still remains - a survey should cover a broad range of existing work. I'm not completely convinced the infancy of LD is the only reason for the small number of articles found. It may be that:
- 1. - there isn't enough distinction between LD and other structured data, especially as stored in databases, from this point of view, to justify new research specifically for LD
which leads to:
- 2. - there simply isn't enough yet to warrant a survey for the field. Note that I still believe the paper is timely, but it may be that at this point what is possible is a review of nascent work or a proposal to guide research in this area, rather than a complete survey.
I'm actually quite surprised that the authors didn't look for new papers in the interim for the revised version, i.e., 2012+. Granted, it is more work and moves the goalposts. But this would only be positive and would widen the scope, even if only by a few more papers. And it would make the paper more current, which is important considering we all agree the field is nascent. Further, a good survey may be considered seminal work; it needs sufficient breadth to guide new research and further work.

Related to the point above, while more of the 21 are referenced in the quality descriptions/definitions, they still mostly start with "Bizer [6] adopted the definition of [QualityCriterionX] from [AuthorsY] …". My original point has not been addressed. I'll try to illustrate more fully why this is an issue. Note, as in my initial review, I am in no way discounting how much work goes into a PhD thesis. However, it is, if I must be pedantic, not peer-reviewed in the normal sense; it is examined. Also, theses are not normally classified as regular publications, unless they have been independently reviewed after the fact and published. So the point that a survey should not rely predominantly on a thesis still holds. The way in which it is mostly cited actually strengthens my point - if it can only be cited indirectly, and needs to be substantiated by whoever else its author was citing, then the authors of this paper are actually saying that it cannot stand on its own. Also, large sections of the paper start to read a bit like a review of the thesis.
Further, citing in this way isn't simply unusual but introduces unnecessary complexity. If I were to cite this paper, do I then write "[AuthorsZ] adopt the definition of [AuthorsA], who adopt the definition of …", ad infinitum? Of course, a review of someone else's book on, say, Newton's law of gravity WOULD talk about how AuthorX analyses Newton's work. But this is not the case here.
Finally, the fact that the authors actually cite Pipino et al. - one of the examples of indirect references - on its own (S4.3.1) indicates that it could have been used without prefacing it with the thesis in all the other sections.
Strangely, the Master's thesis is actually cited in and of itself!

The availability of (detailed) documentation is not a good measure of usability; if a tool is usable, you should not need the documentation for any but rarely used functions or especially complex tasks. Ease of finding help (within the documentation) may be a better measure (but still not a particularly good one). With regard to the response to this point, at least state that this is what you did. Simply reading documentation written by someone else is not enough. An expert review - where the expert is an HCI or usability specialist, and has specific training and experience to carry out a heuristic evaluation using documentation and a tool's UI - will still have to follow a set of heuristics in doing so. But such results are still always presented with that caveat.

(Still) A large number of basic grammatical errors that an automatic check and a proofread would pick up. A handful of contradictory statements, e.g., S5.4 on licensing says all tools have a license, then ends by listing those that don't.

************** Additional questions raised by the authors' response

R: "We would like to point the reviewer to the comprehensive survey done by Batini et. al.(ref [2]), which already focuses on data quality measures for other structured data types. Since there is no similar survey specifically for LOD, we undertook this study. ..."

*** So clearly state this in the introduction, with the reference, simply because, while the paper IS timely, the (range/coverage of) supporting literature leaves a lot to be desired (but see also above).

A: ... This information has been added to the Introduction (2nd last paragraph). In fact, this point also answers the reviewer's issue about our inclusion of very few articles.

*** Actually, no, it doesn't. Stating this in the introduction DOES help, but it doesn't change the fact that the number of papers referenced is quite low. The bigger problem really is the coverage/range - that even out of the 21, the focus is still on just two, and those two are not regular publications.
In fact, the response saying that Batini et al. already have a detailed survey for structured data, and explaining why most of the review predominantly cites the PhD and Master's theses, actually highlights the issue - that the coverage is too low.

"It would be useful to indicate in Table 8 which of the three groups described in S5.1 each dimension belongs to - mapping the list of numbers in the text to the columns is unnecessarily tedious. "

*** The response doesn't address the question - in the text there is a list of reference numbers, while in the table author names are used (in fact, the same is done in Tables 7 & 9). Simply placing the (ref) number after the author name (in the tables) would resolve the issue. As it is, the reader has to go back and forth between the text, the table and the references to match the numbers to authors and the information in the tables.

"What criteria were used to select the initial set of articles...
R: "The inclusion and exclusion criteria specified ... are detailed in Section 2. A reference has been added to the Introduction.
There is no reference where this is first introduced."

*** The point is that YOUR response to the question (initial review) stated you had included a reference in the intro. I could not find the reference at the point where this is first introduced.

A: We agree that some publications provide a list of standardized keywords. However, if a keyword is in a list of standardized keywords, there is a very high chance that the keyword is also present in the abstract and will thus show up even with a keyword-based search. Besides, the ACM Digital Library is just ONE example which provides such lists. We, however, used five other search engines/digital libraries and four journals where there is no such list provided. Thus, in order to standardize our search criteria over all the search engines and journals, we used the same search strategy.
Additionally, we are sure that we have included *all* the relevant articles for our survey.

*** ACM was one example out of the lot. Out of your own list, the majority, if not all, have a requirement to include keywords, and some of these also have a predefined list of standard keywords. The argument about keywords appearing in the abstract only holds sometimes. Lowering quality, which is what this paper aims to countermand, is probably not the best solution in this case. While it is probably too late to do it here, it may have been preferable to use two separate searches if required, narrowing where necessary for the web search. This, from the paper, was a supplementary search, so narrowing at this point would not be a massive issue. Narrowing right from the start is.
The claim that you are sure to have found ALL papers is a bit grand: with the number of journals, conferences and especially workshops out there today, this is not absolute even with more general criteria, and with the search criteria used it is even more unlikely. It is not unusual for authors of existing papers to contact the authors of a new one to identify similarities between their work, especially for surveys.

A: As mentioned in the previous answer, since we wanted to include all the tools that the core articles in our survey propose, we also include the ones that are not available. Determining whether a tool is customizable was mainly done based on the documentation of the tools.

*** Then state this.

*** Re Maintenance/last update - then state clearly to the reader how they are expected to interpret this. It is ambiguous at best.

*** I don't understand the point of LinkQA as an example here; if you cannot even choose the datasets, what is it evaluating? How will it help me follow the quality criteria/guidelines proposed?

*** The authors highlight where only one of their references defines a specific metric; this starts to get repetitive. It is obvious where done, and it should be stated at most once. Also, e.g., in 4.1.3, rather than repeating the same reference 5 times for the sub-points, it could simply be placed at the top of the list.

*** Fig. 1 - for consistency and also readability - I suggest moving the final two numbers outside the boxes, as is done for all the others; it took me a while to finally locate them, even knowing they were supposed to be there.