Literally Better: Analyzing and Improving the Quality of Literals

Tracking #: 1579-2791

Authors: 
Wouter Beek
Filip Ilievski
Jeremy Debattista
Stefan Schlobach
Jan Wielemaker

Responsible editor: 
Guest Editors, Quality Management of Semantic Web Assets

Submission type: 
Full Paper
Abstract: 
Quality is a complicated and multifarious topic in contemporary Linked Data research. The aspect of literal quality in particular has not yet been rigorously studied. Nevertheless, analyzing and improving the quality of literals is important since literals form a substantial (one in seven statements) and crucial part of the Semantic Web. Specifically, literals allow infinite value spaces to be expressed and they provide the linguistic entry point to the LOD Cloud. We present a toolchain that builds on the LOD Laundromat data cleaning and republishing infrastructure and that allows us to analyze the quality of literals on a very large scale, using a collection of quality criteria we specify in a systematic way. We illustrate the viability of our approach by highlighting two particular aspects in which the current LOD Cloud can be immediately improved by automated means: value canonization and language tagging. Since not all quality aspects can be addressed algorithmically, we also give an overview of other problems that can be used to guide future endeavors in tooling, training, and best practice formulation.
Tags: 
Reviewed

Decision/Status: 
Accept

Solicited Reviews:
Review #1
Anonymous submitted on 13/Mar/2017
Suggestion:
Minor Revision
Review Comment:

The authors have addressed most of the comments included in the previous review and have provided explanations. The taxonomy is improved by the clear separation of the inherent, data-specific quality aspects from the RDF-processor-related aspects such as “Unimplemented”. The readability of most sections of the paper is improved in the new version, and the new examples included in the paper make its content easier to understand.

Comments:

* The DQV work has been completed, so the corresponding text can be updated accordingly.
* For completeness and consistency, it might be better to add descriptions for the categories “well-specified” and “canonical” for datatyped literals, and “consistent” and “well-specified” for language-tagged strings. For instance, these could be one-phrase descriptions similar to those of the “valid” or “registered” categories.
* Because “unsupported” is not a category in the taxonomy, perhaps it can be removed from the text where it says “Invalid or unsupported”, for clarity.
* According to Section 4.1, the example “semantics”^^xsd:string does not fit the “underspecified” category of the language-tagged string tree, does it? Doesn’t it belong to the datatyped literals tree?
* The last paragraph of Section 6.3 is a bit unclear.
* When giving statistics about strings on page 14, it would be interesting to include what percentage of the 2.26 billion language-tagged strings had the optional language tag.
* In general, the sections about quality assessment (e.g., Sections 6.2 and 6.3) still have a lot of room for improvement with respect to clarity and detail.

Previous comments which are still relevant:
* For single-word literals, aren’t there more efficient ways of doing dictionary look-ups than looking for an rdfs:seeAlso property in the resources returned by the Lexvo API? For instance, if I look up a word such as http://www.lexvo.org/page/term/eng/Canonicalization, your approach will give a false negative (a sketch of this style of lookup is given below). Do you know the precision and recall of this approach? It also seems strange that the ALD libraries described in the paper as having high precision are not used here. It would be good to motivate why the current approach was chosen.
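
To make the concern concrete, the following is a minimal sketch of the kind of Lexvo-based lookup described above. It is an approximation under stated assumptions, not the authors' actual implementation: it assumes the Lexvo term URI dereferences to RDF, treats the presence of an rdfs:seeAlso triple as evidence that the word is known, and uses a hypothetical helper name (lexvo_knows).

```python
# Minimal sketch of a Lexvo-based dictionary lookup (illustrative only).
# Assumes the Lexvo term URI dereferences to RDF; lexvo_knows is a
# hypothetical helper, not part of the paper's toolchain.
from rdflib import Graph, URIRef
from rdflib.namespace import RDFS

def lexvo_knows(word: str, lang: str = "eng") -> bool:
    term = URIRef(f"http://lexvo.org/id/term/{lang}/{word}")
    g = Graph()
    try:
        g.parse(str(term))  # dereference the term URI, expecting RDF
    except Exception:
        return False
    # Treat an rdfs:seeAlso link as evidence that the word exists.
    return (term, RDFS.seeAlso, None) in g

# A valid English word may still come back as unknown if Lexvo lists no
# rdfs:seeAlso links for it -- exactly the false negative noted above.
print(lexvo_knows("Canonicalization"))
```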

Minor issues:
* AbstractQuality -> Abstract Quality (page 1)
* calculating the ration -> calculating the ratio (page 12)
* a these are supported -> these are supported (page 15)

Review #2
By Heiko Paulheim submitted on 23/Mar/2017
Suggestion:
Minor Revision
Review Comment:

I appreciate the thorough review made by the authors, and the answer provided.

My main remaining concern is the identification of datasets. I can see that this is a tricky issue which cannot be fully resolved, but some heuristics could be applied, e.g.,
1) using the pay-level domain as a proxy for a dataset (see the sketch after this list);
2) using the datasets as provided in the LOD cloud dataset, and falling back to (1) if there are none.
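
As an illustration of heuristic (1), here is a rough sketch that groups IRIs by their pay-level domain. The helper name pay_level_domain and the use of the tldextract library are my own assumptions, not part of the paper's toolchain.

```python
# Rough sketch of heuristic (1): group IRIs by pay-level domain as a proxy
# for the dataset they come from. tldextract handles the public-suffix
# logic; pay_level_domain is an illustrative helper.
from collections import Counter
import tldextract

def pay_level_domain(iri: str) -> str:
    ext = tldextract.extract(iri)
    return ext.registered_domain  # e.g. "dbpedia.org"

iris = [
    "http://dbpedia.org/resource/Berlin",
    "http://nl.dbpedia.org/resource/Berlijn",
    "http://data.example.org/dataset/x",
]
print(Counter(pay_level_domain(i) for i in iris))
# Counter({'dbpedia.org': 2, 'example.org': 1})
```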

The reason why I am so picky about this is my original remark about the statement made in Section 7.1 (i.e., "defining the DBpedia type IRIs would solve the vast majority undefined datatype IRIs"). Without a clearer breakdown of the findings, it is unclear whether this finding stems from (a) a heavy bias of the evaluation set towards DBpedia, or (b) the fact that this problem only occurs in DBpedia.

Further comments from my initial review that remain unaddressed:
* a justification of why documents with a score of less than 40% were chosen for manual inspection in Section 6.3
* a statement on whether the distribution of omitted language tags is the same as, or different from, the distribution of explicitly given language tags

However, I am confident that those issues can be fixed.