Literally Better: Analyzing and Improving the Quality of Literals

Tracking #: 1397-2609

Wouter Beek
Filip Ilievski
Jeremy Debattista
Stefan Schlobach
Jan Wielemaker

Responsible editor: 
Guest Editors, Quality Management of Semantic Web Assets

Submission type: 
Full Paper

Abstract: 
Quality is a complicated and multifarious topic in contemporary Linked Data research. The aspect of literal quality in particular has not yet been rigorously studied. Nevertheless, analyzing and improving the quality of literals is important since literals form a substantial (one in seven statements) and crucial part of the Semantic Web. Specifically, literals allow infinite value spaces to be expressed and they provide the linguistic entry point to the LOD Cloud. We present a toolchain that builds on the LOD Laundromat data cleaning and republishing infrastructure and that allows us to analyze the quality of literals on a very large scale, using a collection of quality criteria we specify in a systematic way. We illustrate the viability of our approach by lifting out two particular aspects in which the current LOD Cloud can be immediately improved by automated means: value canonization and language tagging. Since not all quality aspects can be addressed algorithmically, we also give an overview of other problems that can be used to guide future endeavors in tooling, training, and best practice formulation.
Decision: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 28/Jun/2016
Review Comment:

The paper has improved impressively since the first version. In particular, the related work, motivation, and theoretical framework sections are significantly better now; the RDF literal quality taxonomy is a particularly interesting contribution; and the evaluation and analysis have been enhanced and are now more systematic and clear. Overall, the paper is very well written, with clear explanations illustrated with examples where necessary.

Minor comments:
* the whole "toolchain" point is still a bit unclear; perhaps stating exactly what its parts/steps are would help
* page 3: "informed information"
* page 6: "strings a specified"
* page 7: "color"en-US - the @ is missing (should be "color"@en-US)
* page 11: "muli-word"
* tables 1-4: I still think an additional column with % would help
* page 13: "70% of RDF literals in LOD Laundromat have" - not in LOD Laundromat, but in your 470 document sample
* page 13: "only 39% of RDF literals have a compatible datatype" - do you mean consistent?
* table 5: what is "No library"?

Review #2
Anonymous submitted on 13/Jul/2016
Minor Revision
Review Comment:

The authors have addressed most of the comments from the previous review and provided some explanations. The taxonomy of literal quality has improved significantly.

However, several comments from the initial review were not addressed or explained in the cover letter, and there are some additional comments. Both are listed below:

Previous comments:
A1. The “Unimplemented” category of the taxonomy is still a bit confusing. Is it really a property of the literal’s quality, or rather of the quality of the RDF processor? This might be confusing. Looking a bit more closely, I guess the taxonomy covers the quality of the literal itself, of the datatype, and of the RDF processor.
A2. For single-word literals, aren’t there more efficient ways of doing dictionary look-ups than looking for an rdfs:seeAlso property in the resources returned by the Lexvo API? For instance, if I look for a word such as your approach will give a false negative. Do you know the precision and recall of this approach? I think this is somewhat justified, as you define this metric as an estimate, but I wonder whether there is a more efficient way to do this than an API call, an HTTP dereference, and RDF processing and querying.
A3. In Table 1, please add a footnote or a note somewhere to say that the prefix dt is used for
A4. In Section 6.1, undefined literals are discussed in detail with three different examples. It would be valuable to have a similar discussion on the causes of invalid literals rather than just saying “as soon as the data gets more complicated, the percentage of invalid occurrences go up”. The subsection on “non-canonical” data could be improved with more examples as well.
A5. It would be useful to have a summary of the top datatypes used in the literals overall, similar, for instance, to Table 4. The authors mention that 79% of them were XSD strings, but it would be interesting to know, for example, the top 10 datatypes used and their percentages. That would give some perspective when reading Tables 1-3.
A6. What was the reason to use only 470 data documents from the LOD Laundromat with Luzzu? As the paper talks about web-scale analysis, isn’t this number relatively small? Was some type of sampling used for selecting those 470 documents? According to the toolchain description, it is assumed that these tasks were fully automated.
A7. The conclusions section seems a bit weak and can probably be improved by providing insights about the challenges of evaluation, and potential.
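The distinction raised in A4 between invalid and non-canonical literals can be made concrete with a small sketch (an illustration only, not the authors' implementation), here for xsd:integer, whose canonical lexical form has no plus sign and no leading zeros:

```python
import re

# Lexical space of xsd:integer: an optional sign followed by digits.
INTEGER_LEXICAL = re.compile(r'^[+-]?[0-9]+$')

def classify_xsd_integer(lexical):
    """Classify a lexical form as 'invalid', 'non-canonical', or 'canonical'."""
    if not INTEGER_LEXICAL.match(lexical):
        return 'invalid'           # e.g. "12.5" or "twelve"
    canonical = str(int(lexical))  # drops '+', leading zeros, and '-0'
    return 'canonical' if lexical == canonical else 'non-canonical'
```

For example, "+007" is a valid but non-canonical lexical form of the integer 7, whereas "12.5" lies outside the lexical space of xsd:integer altogether.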

Additional comments:
B1. In the LOD Laundromat paper, the authors mention bad encoding sequences as a quality issue. In practice, I have seen this as a very common problem in many datasets, especially when characters with accents are used. I wonder why they were not identified as a quality issue here. I still see them in LOD Laundromat, for example,
B2. It will be useful for the community if the complete analysis results were published in figshare or some other place in an understandable manner in addition to the top 5 values provided in Table 1, 2 and 3.
B3. As “Unimplemented” is one category of literal quality taxonomy, an analysis of the support of popular RDF processor for the most commonly used datatypes in the LOD cloud would be useful.
B4. Compared to Section 6.1, Section 6.2 does not analyze the quality aspects. It rather analyzes the distribution of language tags in Linked Data (which of course is interesting but is not the main focus of this paper). If an analysis of different categories of LTS (such as Malformed, Well-formed, Unregistered, Registered, Inconsistent, and Consistent) is presented similar to Section 6.1, it would add value to the paper.
B5. Section 6.3 refers to a metric named “compatible datatype metric” in Section 4.5, but Section 4.5 defines the “defined literal quality metric”. I assume they both refer to the same metric, but the names have to be consistent.
B6. Section 6.3 introduces 4 categories of quality problems in the sample the authors analyzed. Shouldn’t those categories be aligned with the categories in the quality taxonomy? The second and third categories in Section 6.3 don’t seem to have a good match in the taxonomy.

Minor comments:

C1. The paper still contains several typos and grammatical errors. It needs to be proof-read and checked for these errors. E.g., page 8: “literal quality much use” -> “literal quality must use”, page 11: “This section present three analysis” -> “three analyses?”
C2. There are some styling issues. E.g., page 3 the date goes beyond the right margin, page 6 the same.

Review #3
By Heiko Paulheim submitted on 02/Aug/2016
Major Revision
Review Comment:

The paper introduces a study of the quality of literals in Semantic Web data, and proposes a number of measures and tools to improve that quality, e.g., by canonicalizing and inferring data types.

My general impression is that the paper, although conveying many interesting ideas, still lacks coherence and clarity in many places. I will elaborate on those below.

Section 3 describes some benefits of improving the quality of literals, and section 4 defines quality criteria. In my opinion, the order should be reversed. Section 3 describes the benefits of improving literals along the criteria in section 4, and other criteria might lead to other benefits (for example, eliminating outliers in numerical literals may lead to better consumption in data mining tool chains, eliminating redundant literals may lead to more efficient storage and transmission, etc.).

Section 4.3 mixes quality criteria that lie in the data itself (e.g., using undefined datatype IRIs) with quality criteria depending on tools (e.g., using a datatype not implemented by some tools). The authors should tell them apart more critically (although 4.4 seems like a bit of a distinction in that direction). In the same section, the notion of "underspecified" is, well, underspecified. As XSD datatypes form a hierarchy, the authors should define a level in the hierarchy which they deem specific enough, and justify that decision. A similar thing holds for language tags. Is "de-DE" really better than "de", and, in particular, would you consider it better to repeat a literal that is the same in German, Austrian, and Swiss German with three language tags? One could probably argue in both directions here.
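The malformed/registered distinction for language tags that the taxonomy draws (and that the "de-DE" vs. "de" question touches on) can be sketched roughly as follows; the well-formedness pattern is a simplification of BCP 47, and the registered set is a toy sample rather than the full IANA subtag registry:

```python
import re

# Simplified well-formedness check for language tags (roughly BCP 47 shape):
# a 2-8 letter primary subtag, optionally followed by alphanumeric subtags.
WELL_FORMED = re.compile(r'^[A-Za-z]{2,8}(-[A-Za-z0-9]{1,8})*$')

# Toy sample of registered primary subtags; a real check would consult
# the IANA language subtag registry.
REGISTERED_PRIMARY = {'de', 'en', 'fr', 'nl', 'zh'}

def classify_language_tag(tag):
    """Classify a tag as 'malformed', 'unregistered', or 'registered'."""
    if not WELL_FORMED.match(tag):
        return 'malformed'         # e.g. "de_DE": underscore is not allowed
    primary = tag.split('-')[0].lower()
    return 'registered' if primary in REGISTERED_PRIMARY else 'unregistered'
```

Note that under such a check "de" and "de-DE" are equally well-formed and registered; preferring one level of specificity over the other is a convention, which underlines the point that the choice should be justified rather than assumed.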

One thing I wonder about missing or incoherent language tags are proper names, e.g., of persons. Should a literal like "Albert Einstein" really have a language tag? If yes, what would be the proper one?

For the sake of coherence, the wording of section 4 and the following sections should be harmonized. For example, there is the definition of "valid literals (category 'Valid') T_{correctLiterals}" - why not simply call it T_{valid}? There are quite a few such cases where different terms are used in sections 4 and 5, which complicates the readability and reduces the coherence of the paper. At the same time, section 4.5 defines metrics, but those are not used later on in section 6 to report numbers. So why define them then?

In definition 2, would it not make sense to divide T_{correctLiterals} by the number of datatyped literals, instead of all literals (including, e.g., language-tagged strings)?
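In symbols, the suggested change to the denominator would amount to something like the following (apart from T_{correctLiterals}, the names are assumed for illustration and are not the paper's notation):

```latex
% Current reading of Definition 2: the denominator ranges over all
% literals, including language-tagged strings.
\[ m = \frac{|T_{correctLiterals}|}{|T_{literals}|} \]
% Suggested: restrict the denominator to datatyped literals only.
\[ m = \frac{|T_{correctLiterals}|}{|T_{datatypedLiterals}|} \]
```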

In general, the numbers reported in section 6.1 are quite shallow. For each datatype, I would also like to see the number of datasets this datatype originates from (e.g., is it a commonly made mistake, or a mistake made by a dominant data provider such as DBpedia?). In general, some more statistics about the dataset at hand would be interesting, as well as some statements about the representativeness of the sample compared to, e.g., the Billion Triples Challenge dataset or the LOD Cloud dataset. I would also appreciate a distribution of the datasets the triples/documents come from, to reveal a potential bias or skewness towards some large/major datasets. Furthermore, not only absolute but also relative numbers would be appropriate for tables 2+3, i.e., what portion of xsd:int is invalid/non-canonical. The authors should show the top datatypes both by absolute and by relative numbers.

Sections 6.2 and 6.3 are generally weak. First, it should be defined what it means that a "lexical form contains natural language expressions" (the same for "linguistic content" in section 6.3). Furthermore, it is hard to see any relation to quality metrics in section 6.2; it only reports the distribution of language tags. In both sections, I miss quantitative results. In section 6.3, it is not clear why some of the problems (e.g., an underscore) should actually be quality problems. Furthermore, the "flow cytometer sorter" opens up a whole new field of typos and grammar mistakes, which I assume would be a paper of its own. However, it seems to be addressed only because correctly guessing a language tag from such grammatically incorrect strings (the term "syntax" should be avoided here for the sake of clarity) seems to be problematic for the tools used, not so much because it is a quality issue in itself.

The selection in 6.3 is a bit arbitrary. The authors state that they use a sample of documents with a score of less than 40% to identify quality issues that are not detectable by the tool at hand. Why? This somehow implies the assumption that the distributions of detectable and non-detectable errors are similar, but no evidence whatsoever is given.

Section 7.1 states that fixing the DBpedia datatype IRIs would resolve the majority of the problems. As stated above, without a clear profile of the dataset at hand, it is impossible to decide between two hypotheses: (1) most other datasets do not suffer from that problem, or (2) the evaluation data collection has a heavy bias towards DBpedia.

In section 7.2, while it is intuitively clear that the language of a longer literal is easier to determine than that of a shorter one, it is not clear why the F1 score decreases for very long literals. I would appreciate a discussion here. In line with the motivation of a data-driven quality study, it would also be interesting to report which language tags are most often inferred for untyped literals, i.e., which language tags are the most often omitted ones. Is there any notable deviation from the overall distribution?

In summary, the paper addresses an interesting field, but lacks clarity and coherence in too many spots for me to recommend acceptance.

Minor issues:
* p.1: "Abstract Quality" (in the abstract) is a strange term. Rather use "quality in general" or something like that
* p.1: "First, we create a toolchain" - I expect a "second" in the subsequent text, which never comes
* p.1: "a toolchain that allows billions of literals to be analyzed" - actually, any toolchain can do that, given enough computing power, memory, and time. You should specify some constraints here.
* p.2: missing comma after "Unique Names Assumption"
* p.2: missing comma after "domain violations"
* section 4.1: add example for language tagged string for the sake of completeness
* Fig. 1 can be improved. Inheritance (specialization) arrows usually run from the specific to the general class (not vice versa), and use a non-filled arrow head. The semantics of dashed and solid rectangles should be explained in the caption. Plus, from my understanding, "Well-formed" probably should be solid, not dashed.
* p.8: "For the first metric Luzzu taken..." - sentence is awkward
* p.9: don't use [0], [1] etc. to refer to lines in the listing, as it is easily confused with references.
* p.11: "This section present" -> "presents"
* p.15: the example does not make sense, since zh-cn and zn-tw differ in the first two characters anyways