Quality Assessment for Linked Open Data: A Survey

Tracking #: 556-1762

Authors: 
Amrapali Zaveri
Anisa Rula
Andrea Maurino
Ricardo Pietrobon
Jens Lehmann
Sören Auer

Responsible editor: 
Pascal Hitzler

Submission type: 
Survey Article
Abstract: 
The development and standardization of semantic web technologies has resulted in an unprecedented volume of data being published on the Web as Linking Open Data (LOD). However, we observe widely varying data quality ranging from extensively curated datasets to crowdsourced and extracted data of relatively low quality. Data quality is commonly conceived as fitness for use. In this article, we present the results of a systematic review of approaches for assessing the quality of LOD. We gather existing approaches and compare and group them under a common classification scheme. In particular, we unify and formalize commonly used terminologies across papers related to data quality and provide a comprehensive list of the dimensions and metrics. Additionally, we qualitatively analyze the approaches and tools using a set of attributes. The aim of this article is to provide researchers and data curators a comprehensive understanding of existing work, thereby encouraging further experimentation and development of new approaches focused toward s data quality, specifically for LOD.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Aba-Sah Dadzie submitted on 12/Nov/2013
Suggestion:
Minor Revision
Review Comment:

The paper is very much improved, and the discussion on dimensions is more clearly presented. However, a few key concerns, as well as some other points, have not been (properly) addressed.

The authors state in their response that they intentionally focus on papers dealing specifically with LOD. However, this does not resolve the issue I raised - that a survey needs to look at a broad range of existing literature. The discussion is still driven by the same PhD thesis - most of the dimensions in subsections 4.* start with "Bizer [4] adopted the definition of …" in paper X … the survey now reads like a summary of the thesis. Also, this is an unconventional method of citing - why not simply cite the original paper?

Another reference, probably the second most highly cited, [14], is neither published nor peer-reviewed - I'm not sure it even counts as a technical report - it's a proposal on a wiki. This, by the way, is different from the original reference, which was a Master's thesis.

These contradict the first exclusion criterion - "Studies that were not peer-reviewed or published". Further, the authors in section 5.1 clearly state that the quality dimensions were derived first from [4], then [14]. Finally, the last 3 dimensions introduced in 5.1 need to be either linked to existing work, or backed by anecdotal evidence uncovered by the authors.

"Finally, we can conclude that none of the approaches covers all data quality dimensions that are relevant for LOD and most of the dimensions are discussed in two articles." - does this mean two specific articles discuss almost all dimensions - in which case they should be cited, or each dimension is discussed on average by 2 articles?

It would be useful to indicate in Table 8 which of the three groups described in S5.1 each dimension belongs to - mapping the list of numbers in the text to the columns is unnecessarily tedious. The issue here is not the sentence (which the response says was rephrased), but the fact that the split in the table itself is not obvious to the reader.
Also, is order (of dimensions) significant?
From the point of view of formatting - which is more important, the dimensions or the articles referenced in the table? English reads from left->right - the table requires reading from right->left. This may be more aesthetically pleasing, but at the end of a long paper it simply puts more load on the reader.

The conclusions state (among others) "As our literature review reveals, the number of publications published in the span of 10 years (i.e. 21) is rather low. This can be attributed to the infancy of this research area. "
This is open to debate - a number of other reasons could be given; a simple example is that such work would simply be replicating existing, validated research on assessing the quality of structured or semi-structured data.

********

"What criteria were used to select the initial set of articles…"
R: "The inclusion and exclusion criteria specified … are detailed in Section 2. A reference has been added to the Introduction."
There is no reference where this is first introduced.

"One key piece of information is missing in section 2 - the criteria applied for "Extracting data for quantitative and qualitative analysis" - which pruned down from 68 to 21.
R: Added description of step 5 in text.
Where - the paragraph is identical to that in the original version. I still don't understand what was done in this step.

R: "We would like to clarify that in our inclusion criteria, the articles have to satisfy the first criterion and one of the others to be included in our study…"
I would suggest the first criterion be presented separately and the other four as a batch - this would make this clearer.

R: "We would like to point the reviewer to the comprehensive survey done by Batini et. al.(ref [2]), which already focuses on data quality measures for other structured data types. Since there is no similar survey specifically for LOD, we undertook this study. …"
So clearly state this in the introduction, with the reference - because while the survey IS timely, the (range/coverage of the) supporting literature leaves a lot to be desired.

R: "Article keyword lists are usually not standardized in terms of terminology also some of the publication databases do not allow to search in keywords specificially. However, it is very likely that terms mentioned as keywords are included in the abstract of the article, which we considered in our search."
I am not convinced by this response - some publishers (e.g., ACM) provide a list of standardised keywords, and even where free-text keywords are used, for them to be useful authors will include key terms based on their target and field; otherwise the whole purpose of providing keywords is defeated. The second half of the argument may be true, but if you can search the abstract, surely searching the much shorter set of (delimited) keywords is simpler?

R: "The irrelevance of the superfluous data was particularly aimed at illustrating this quality aspect: LinkedGeoData contains airports as well as trees, post offices etc. Including all in a flight search application means including a large amount of irrelevant data. We rephrased the example to make this more clear."
This is more a case of inventing a non-issue in order to justify the dimension. The author of a tool should implement basic filtering to remove obviously superfluous data. Surely, a much better argument or example for (ir)relevance could be found.

Under verifiability -
R: "The third party is indeed human." … state this in the text

"Based on the definitions and examples given for conciseness, I am confused by the definitions of intensional and extensional conciseness …" - the response says this has been changed. Yet the example points to the same (contradictory) conclusion - if values are duplicated that leads to low conciseness, not high.

OTHER POINTS

"For example, data extracted from semi-structured or even unstructured sources, such as DBpedia"
- in the intro, this implies that DBpedia is created from unstructured sources; however, in the discussion of the medical application on the following page it is described as derived from semi-structured data. Further on, the paper states: "Our survey is different since it focuses only on structured data and on approaches that aim at assessing the quality of LOD."
Which of the three is being addressed?

The introduction of ontologies (at the start) only to dismiss their relevance is confusing - why bring it up at all, then?

"Retrieving further potential articles. In order to ensure that all relevant articles were included, an additional strategy was applied such as:" - were the criteria listed used or not - "such as" implies something similar was used - what WAS used?

Table 2
"permissions to use the dataset" is redundant - this would be in either of the first two in licensing

An example of what the authors mean by domain in footnote 14 would be useful - the definition doesn't seem to quite cover the expected definition (in general and in the SW).

Why is OWL2 EL the standard for measuring consistency?

S4.2.4 - does not provide any useful information, but simply, especially in its closing sentence, states the obvious.

S4.7 - removing redundant data does NOT increase the AMOUNT of relevant data, only the PROPORTION of relevant data.

Table 10 - why these 8? - this question is still unanswered: "Also, what were the criteria used to select them?"
Also, some of the tools are highlighted as not available for quality assessment - why are they listed? Further, some of these are stated to be customisable - how was this determined, if the tools aren't available?
"TrustBot and Flemming's tool are [reportedly] not scalable for large datasets." - this is a scientific paper - this statement needs to be backed by evidence.
User documentation is not enough to measure the usability of a tool. Also, users typically refer to documentation for complex tasks or rarely used features, or when they are unable to figure out how to carry out tasks - simple or complex - which is an indication of low usability.
Maintenance/last update - the conclusions here are open - a tool may not be updated simply because there is no need to do so - this is not necessarily a negative. On the other hand, a tool may be updated frequently to deal with bugs.

S5.2 "which measures the occurrence of observed instances out of the occurrence of the desired instances, where by instances we mean properties or classes [42]" - for a SW paper, referring to properties and classes as instances is simply introducing an element of confusion or ambiguity, esp since it is also used in the paper (correctly) to mean an instance of a class.

CITATIONS & REFERENCES

Convention and simply making the reader's life easier - order numbered citations, e.g., [53,40] -> [40,53]

PRESENTATION & LANGUAGE

The use of conjunctions in a few places alters what the authors are trying to say - a few examples from the start:

p.1 "For example, data extracted from semi-structured or even unstructured sources, such as DBpedia ..."
suggest replace "or even" with "and especially" - unstructured pose more issues than semi-structured

Conversely, (p.2) "It should be noted that even the traditional, document-oriented Web has content of varying quality and is still perceived to be extremely useful"
"and" -> "but"

And ... "Such data quality metrics include correctness of facts, adequacy of semantic representation or degree of coverage."
"representation or degree" -> "representation and/or degree"

"wrt" is an informal abbrev/acronym and should not be used in a publication, at least, not without definition.

cf. means "compare (with)", not "see"

Grammar check and proof-read needed

Review #2
By Peter Haase submitted on 24/Nov/2013
Suggestion:
Accept
Review Comment:

The article has been significantly revised. My reviewer comments have been taken into account.

In particular:
- The title and overall motivation have been adjusted to emphasize quality dimensions and criteria (rather than methodology)
- Definitions have been revised.
- The overall structure and presentation is now more coherent and consistent.

In summary, I can now recommend to accept this article.

There are some very small details to be fixed:
- In the new reference 44: Lewn -> Lewen
- references should be consistent, in particular in abbreviations of names (see e.g. references 35, 36, 49, 50 which are inconsistent)

Review #3
By Aidan Hogan submitted on 04/Dec/2013
Suggestion:
Major Revision
Review Comment:

Though the revised version of the paper is improved, and many of the detailed comments have been addressed, I still think more work is needed since the broader comments still hold. I'll try to give a more detailed idea of my remaining concerns in this review than I did in my previous review.

As before, I am still concerned that the paper does not meet the following two criteria from the CfP:

1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

Again, although the paper certainly contains all of the raw material for a very good survey paper on a very important topic, it is still, for me, simply too difficult to read and too confusing to understand in its current form. Given that I'm already familiar with issues of Linked Data quality, and given how much I genuinely struggled to make sense of the paper, I cannot say it is yet suitable as an introductory text for researchers getting started on the topic. I hope the following comments can help the authors rethink parts of the paper.

Primarily the problems with readability are due to the classification, which forms the long core of the paper. I am glad that the authors chose to partially simplify the classification but I'm disappointed to see that, in my opinion, they didn't simplify it enough. I think the paper still draws unintuitive distinctions between quality dimensions that are not justified, which really hurts the readability and understandability of the paper. Given that the descriptions of dimensions are not concise or formal, trying to mentally distinguish them leads to major confusion. The discussion sometimes (not always) succeeds at saying how the dimensions are *different*, but not why they are interesting to consider as separate dimensions. I'm not really sure why the authors made the classification so complicated! I guess it might have to do with aligning with previous papers or with following something like the taxonomy by Bizer in his thesis. But again, a simpler classification with a more natural and intuitive explanation in a Linked Data setting is all that's needed! The metrics of the papers should then fit into this classification (otherwise the metric is not related to LOD quality). A simpler classification would also make the section much easier to read and to write (and to review!).

To ground this criticism, let me try to go into detail on three dimensions that are currently considered separate and how I tried to understand them as I was reading. I emphasise that this is just an example; it would take too much effort for me to do this for all the dimensions that seem redundant to me (these will be summarised again later; they were also mentioned in the last review and were addressed by adding new discussion rather than by simplifying the classification structure; unfortunately, I did not find the new discussion all that helpful).

#### Reputation:
#(R1) "Gil et al. ... proposed the tracking of reputation either through a centralized authority or via decentralized voting"
#(R2) "... a judgement made by a user to determine the integrity of a data source."
#(R3) "Reputation is usually a score ..." -- vs. (R2), a score suggests that reputation is a computed metric whereas the definition says that reputation is a user judgement
#(R4) "The (semi-)automated approach uses external links or page ranks to determine the reputation of a dataset." -- vs. (R2), likewise
#(R5) "Reputation is a social notion of trust"
#(R6) "It should be noted that credibility can be used as a synonym for reputation."
#### Then under Believability (which is an actual synonym for credibility, vs. (R6)):
#(B1) "Jacobi et al. termed believability as 'trustworthiness'" -- vs. (R5), also about trust.
#(B2) "they referred to believability as a subjective measure of a user's belief that the data is 'true'" -- vs. (R2), also dependent on the user's context.
#(B3) "Believability is measured by checking whether the contributor is contained in a list of trusted providers" -- vs. (R1), also about sources, could involve a centralised list of trusted sources; vs. (B2), subjective or not?
#(B4) "In our flight search engine use case, if the flight information is provided by trusted and well-known flight companies such as Lufthansa, British Airways, etc. then the user believes the information provided by their websites. She does not need to assess their credibility since these are well-known international flight companies." -- vs. (R2, R5), has she not already judged the credibility/integrity of these flight companies as being well-known and reputable, based on a "social notion of trust"? If this example does not refer to "Reputation", then I have no idea what "Reputation" is any more.
#### Then under "Objectivity":
#(O1) "The extent to which information is unbiased, unprejudiced and impartial." -- I cannot understand how this is not covered by the previous two? This is just a reason *why* a source might not be credible/a dataset believable. To be consistent, you would then have to list other dimensions as fine-grained as objectivity (i.e., reasons *why* a dataset is not reputable/believable or why a user might judge it not to have "integrity"), such as "Expertise" (how much the providers know about the topic, are only experts allowed to edit), "Verification" (is the dataset verified/curated/corrected by some quality-control process), etc.

Some other examples of statements that show confusion:

# "Low response time hinders the usability ..."

# "this is the same information is stored in different ways, this leads to high extensional conciseness ..."

# Section 4.2.4 doesn't mention Conciseness at all.

# "Reputation affects believability but the vice-versa does not hold true." ??

# "By fixing amount-of-data, completeness becomes a function of relevancy." If we substitute in the definitions of the terms, the following statement is made: This is an example of why I feel the "Intra-relations" sections often don't really help with clarifying the dimensions.

# Another example: "Timeliness measures how up-to-date data is, relative to a specific task" ... "Although timeliness is part of the dataset dynamicity group, it can be also considered as part of intrinsic quality dimensions because it is indepenent of the users context" A contradiction: how can it be intrinsic and relative to a specific task?

# The example of Interpretability talks about human readable labels being missing for URIs, which is precisely what Understandability was just talking about: "Understandability is measured by detecting whether human-readable labels for classes, properties and entities ..."

# "Most web applications prefer timeliness as opposed to accurate, complete or consistent data" ??

# "a list of courses published on a university website must be timely, although there could be accuracy or consistency errors ..." ??

Here's a summary of my own thoughts on the dimensions on a high-level:
* Availability is fine
* Licensing is fine
* Interlinking is fine
* Security is thoroughly ambiguous in a *LOD* context. A LOD dataset behind a security firewall is not a LOD dataset. I would remove or otherwise just focus on signed content, not access control.
* Accuracy seems too broad in that its definition covers most of the dimensions that follow it. If the authors focused on something like "Syntactic Validity" here or some equivalent, and stay away from the semantic interpretation, I think it would make more sense.
* Consistency is fine.
* Conciseness is intuitively fine, but the definition/discussion is not great since it does not explain what the redundancy is.
* Reputation, Believability and Objectivity should be consolidated and simplified into one dimension, with the text drastically shortened from the sum of the parts. The metrics can be categorised under one dimension.
* Verifiability is fine.
* I think Currency, Volatility and Timeliness should be consolidated into one dimension. Again, volatility has nothing to do with Linked Data quality for me. If I have a dataset with the capitals of all the countries in the world, a low volatility says nothing about quality. What is important for quality is that data are up-to-date. In that case, I would call the dimension "Timeliness", which indicates the amount of time between changes in what is described by the data and changes in the data itself. Currency is not distinct from this. All of the metrics associated with the three dimensions can fit under one. I think the new Timeliness dimension could then go into Section 4.2 under intrinsic dimensions. (On that, I don't like the name of that section since quite a few of the dimensions outside of that section are intrinsic.)
* I still find Completeness, Amount-of-Data and Relevancy confusing. In the simplest case, I think only Relevancy is needed: the dataset has the content the user needs. Conciseness is already covered. Completeness could be folded into Relevancy.
* Representational-conciseness is fine
* Representational-consistency is useful, but could be folded into the previous dimension or perhaps renamed? Something like "Interoperability"? Consistency is a loaded term.
* Interpretability and Understandability could be compressed into one. Otherwise, it should be made much more consistently clear that one is to do with a human user understanding the data, and the other is to do with machines being able to process the data.
* Versatility is fine.

I think that ideally, there should be about 13-15 dimensions. Whether or not the authors choose to simplify the classification that much is up to them, but in its current form, Section 4 needs to be written much more clearly and sharply to be a good introductory text! And I think greatly reducing the number of dimensions and simplifying/shortening the discussion is the easiest way to achieve this.

Finally, I still do not understand how the subjective/objective distinction is applied in the tables. The detailed explanation comes far too late in Section 5 and, still, I don't know why some metrics are considered one or the other. To take an example from Table 2, for the "detection of the existence and usage of external URIs and owl:sameAs links", I have no idea why this metric is subjective. Again, I noticed that some of the papers are misattributed; for example, for the first and second entries in Table 2, reference [26] does not check SPARQL endpoints or data dumps. Again, I urge the authors to double-check that all of the referenced works are correctly attributed! In the tables that list metrics, it would also be good to know whether the metric indicates good or bad quality; for example, "no usage of slash-URIs": is this indicating high Linked Data quality or low Linked Data quality? The easiest way would be to label the metrics in such a way that they always indicate good Linked Data quality.

(3) Readability and clarity of the presentation.

Section 5 is improved, though I would ask the authors to break up long paragraphs into logical chunks.

Although my minor comments with respect to the writing have been addressed (as I had noted, it was an incomplete list), parts of the paper are well written but other parts are still poorly written. I will try to outline more minor comments at the end to address these problems but again, this can only be considered an incomplete list: *please* proof-read the paper more carefully before submission. It is time-consuming for me as a reviewer to draw attention to issues that I am sure could easily be fixed by the authors themselves (especially given that parts of the paper are well-written and typo-free!).

In summary, I appreciate that the authors have worked hard to collect a comprehensive list of literature in the area and that the paper has the raw material for an excellent survey paper on an important subject. However, I again strongly encourage the authors to improve the writing throughout and to greatly simplify the classification until it is sufficiently intuitive and readable to serve as a good introductory text for a researcher new to the area (as per the criteria in the CfP). I hope this second batch of detailed comments will help in that direction.

MINOR COMMENTS: (Incomplete!!)

<> = delete
{} = add

Throughout:
* I said before that LOD refers to the "Linking Open Data" project. But when you say "published on the Web as Linking Open Data", this doesn't make sense since Linking Open Data is a project. You could simply say "Linked Data" but if preferred "Linked Open Data" could also be used since it's also used in the original Linked Data Design Issues document by Berners-Lee (sorry for that; I was incorrect to bring this up before).
* Sometimes RDF terms like owl:sameAs are given a \tt format and sometimes they're not. Please make consistent.
* The end of examples is not clearly marked. The next paragraph reads like it is still part of the example.

Abstract:
* "toward s data"

Section 1:
* "focus {on the} quality"
* "Thus, adopting existing approaches"
* "and {the} unbound{ed} dynamic"
* "focus {on}" again

Section 2:
* "as well as {identifying} open
* "What kind{s} of tools"
* "The majority of the papers {were} published {in an} even distribution between ..."

Section 3:
* "'fitness for use' [31]."
* "The semantic metadata, for example ..." Not sure what the "semantic metadata" are here.
* "is used {a} quality indicator"
* "with the user's quality ..."

Section 4:
* "It obtains ..." What does?
* "There are five dimensions {that are} part of"
* " {I}nterlinking is"
* "between entities<,> {are} user or software agents able to"
* "as well as {the} accessibility"
* "is represent{ed} as A231"
* " {fewer} inconsistencies"
* "one of the dimensions<, which> that"
* owl:DatatypeProperty, owl:ObjectProperty, owl:DeprecatedProperty (no '-'), owl:InverseFunctionalProperty (no '-')
* "to the degree {to} which"
* "from malicious websites<.>{:} for instance, if a website"
* "is measured as "
* "a prior{i}"
* "states {that} information"
* "up-to-data data"
* "user{'}s context"
* "comprises {of} the following aspects"
* "enough data"
* data is `complete'
* The HDT guys would probably appreciate a formal reference to one of their papers if the work is to be discussed (as well as the footnote). For example, "Javier D. Fernández, Miguel A. Martínez-Prieto, Claudio Gutiérrez, Axel Polleres, Mario Arias: Binary RDF representation for publication and exchange (HDT). J. Web Sem. 19: 22-41 (2013)".
* "blank nodes {where} the blank node"
* Figure 2: Understand{a}bility

Section 5:
* "metrics belonging to dimensions such as objectivity" ... objectivity is a confusing example of a dimension when talking about the division of Objective and Subjective categories. Also, again, break up that paragraph a few times.

References:
* Reference [14] needs a proper "thesis" entry like [4] has. Looks like a web-page in current form.

And more besides! Please don't rely on just these comments but thoroughly proof-read the whole paper!

(Finally, on a side note, please find a better format for the response letter! I could not print or read the spreadsheet without setting word-wrap and resizing each column and row size individually. Please keep it simple and just do quotes and inline comments in plain text.)

