Literally Better: Analyzing and Improving the Quality of Literals

Tracking #: 1259-2471

Authors: 
Wouter Beek
Filip Ilievski
Jeremy Debattista
Stefan Schlobach

Responsible editor: 
Guest Editors Quality Management of Semantic Web Assets

Submission type: 
Full Paper
Abstract: 
Quality is a complicated and multifarious topic in contemporary Linked Data research. The aspect of literal quality in particular has not yet been rigorously studied. Nevertheless, analyzing and improving the quality of literals is important since literals form a substantial (one in seven statements) and crucial part of the Semantic Web. Specifically, literals allow infinite value spaces to be expressed and they provide the linguistic entry point to the LOD Cloud. We provide a toolchain that builds on the LOD Laundromat data cleaning and republishing infrastructure. This toolchain allows us to analyze the quality of literals on a very large scale, using a collection of quality criteria we systematically specify. We illustrate the viability of our approach by lifting out two particular aspects in which the current LOD Cloud can be immediately improved by automated means. Since not all quality aspects can be addressed algorithmically, we also give an overview of problem areas that may steer future endeavors in tooling, training, and best practices.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Arthur Ryman submitted on 14/Dec/2015
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

(1) The paper makes an original contribution in the area of the definition and practical evaluation of quality metrics for literals in RDF data. The authors define a good set of literal quality metrics and then measure them on a very large data set. The results show that there is a lot of scope for quality improvement.

(2) The paper is moderately significant. Although it is intuitively clear that good quality is desirable and the authors provide several motivations, it is not clear what the concrete benefit of improved quality would be in any real-world application. It would be more convincing if the authors cited a real-world application that was performing poorly with the current quality and performed better with improved quality, e.g. some semantic search application that had quantifiably better precision and recall after the quality improvement.

(3) The paper could be improved in several ways. First, I found it disappointing that the paper omitted any reference to the data cleansing techniques used in traditional Business Intelligence and data warehousing. This narrow focus on the Semantic Web is artificial. Data is data. True, RDF introduces its own idiosyncrasies, but I am sure that the broader issues of data quality transcend the syntax used to encode data. Second, the paper omits any metrics on the computational resources used to evaluate and improve the quality, e.g. CPU time, memory, etc. This makes it unclear how practical the toolchain is. Finally, the authors should correct the numerous typos. I found it incongruous that even though this paper was making a case for improved quality, the authors did not even run it through a spell-checker. The most ironic typo was their misspelling of "quality" as "quanlity"!

Review #2
Anonymous submitted on 23/Jan/2016
Suggestion:
Reject
Review Comment:

The paper focuses on the problem of quality of literals in the LOD cloud. The authors present a toolchain for the quality analysis, and its application to the LOD Laundromat data collection. In addition, the problem of filling missing language tags is approached.

Comments:

* Paper is very inaccurately written: many typos, unfinished passages, terms are explained later than mentioned, etc.

* Novelty is not very clear, stated in a very general way: "Our approach is different from existing work ... because we analyze and improve quality aspects on a very large scale and we integrate our results into existing tooling" - it reads as if in all other approaches analysis is small-scale, and the results are never put into practice.
The paper contains a number of interesting pieces that contribute to the issue, but a solid, well-worked-through, and evaluated approach is not yet there, or at least the presentation (see further comments) gives that impression.

* Related work: comparison of the presented approach to the existing ones is completely missing.

* The formalization is unclear, both in itself and in its purpose
** section 4.2: the concepts each datatype defines (enumerated 1 to 4) very much need examples!

* Analysis, section 6: really difficult to follow
** when I follow the link http://wouterbeek.github.io/quality/ I don't find any explanation of the data format. What do the numbers in the columns there mean?
** "We see that the vast majority of literals..." - without knowing the format AND having some summarizing table in the paper, it's impossible to see it
** "79% ... of literals ... have very few syntactic strictures" - what does it mean to have very few syntactic structures?
** "none of the DBpedia datatype IRIs is currently defined" - at this point it is not clear what it means

* Section 7.3, improving language tags: your F1 numbers are very low. What is totally missing is a discussion of why (I understand that short strings are the main issue), and any suggestions on how to approach it. The section is called "improvement", which it hardly is, given the precision numbers.
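
For reference, F1 is the harmonic mean of precision and recall, so a low precision pulls F1 down even when recall is high. The following is a minimal sketch of that relationship; the function name and the numbers in the usage lines are illustrative assumptions, not results from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both expected in [0, 1])."""
    if precision + recall == 0:
        # Convention: F1 is defined as 0 when both inputs are 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative values only (not taken from the paper):
# equal precision and recall give back the same value,
# while an imbalance is penalized relative to the arithmetic mean.
balanced = f1_score(0.5, 0.5)
imbalanced = f1_score(0.2, 0.8)
print(balanced, imbalanced)
```

With these made-up inputs the imbalanced case scores below the arithmetic mean of its precision and recall, which is why reporting precision and recall separately, per string-length bucket, would make the cause of a low F1 easier to diagnose.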

Smaller comments:

* Abstract: "literals form a substantial (one in seven statements) and crucial part of the Semantic Web" - it would be interesting to have a source for this number.
* p.3, beg. of 4.1: IRI as a set of all IRIs is misleading as a first component of L. Why not IRI_D (introduced later in 4.2)?
* p.4, 4.2: is I(.) an interpretation function?
* p.4, 4.2: excluding rdf:langString from the IRI set needs additional explanations
* p.4, c.1, last paragraph: do you mean L when you write LIT?
* p.5, figure 1: what is the meaning of % and corresponding numbers?
* p.5, c.1: definition of a canonical literal is unclear (as the definition of v2l canonical mapping is unclear), also example would help
* p.6, c.1: the paragraph starting from "The absence of well-formed literals..." is actually very vaguely relevant to the paper's message.
* p.6, c.2: until section 7.3 it is totally unclear where these tools are used (I was expecting them to show up in section 5.3.2); perhaps a forward reference would help, or better, why not describe them in section 7.3?
* p.7, c.1, and later in the paper: really minor comment, but I find the usage (and spelling) of "vis-a-vie" a bit strange. Do you mean versus?
* p.9, tables 2-3: adding % column would help greatly to read the results
* p.10, c.1: "we lift out two quality improvements that can be automated" - I understand "lift out" as "remove"; are you not considering these to-be-automated quality aspects, or vice versa?
* p.11: "Canonical literals provide a significant computational benefit" - till this point not yet explained what a canonical literal is, with respect to what is it canonical?
* p.12: define "string size bucket" the first time you use it (number of tokens?)

Typos:

* Abstract: space after the word "Abstract" missing
* p.2, c.1: "in our this chracterization" - 2 typos
* p.2, c.2: "fromation"
* p.2, c.2: "Secondly ... indicates what are the current problems" - "are" is misplaced
* p.4, c.2: "RDF processor that recognized" - recognizes
* p.5, c.1: "As is apparent" - as it is apparent
* p.8, c.1: "strictures"
* p.8, c.1: "grammar pointed to in the..."
* p.8, c.2: "1457568017" - what does this hanging number mean??
* p.10, c.2: "easier for other to add" - for others
* p.11: "{ToDo Wouter} literals" - seriously?
* p. 15: many references (e.g. [4] to [10]) do not have a venue

Review #3
Anonymous submitted on 27/Jan/2016
Suggestion:
Minor Revision
Review Comment:

The paper describes a set of quality criteria for assessing the quality of RDF literals and presents a toolchain for automatic analysis of such literals at web scale. The quality criteria take into account the syntax and the semantics of the RDF literals and define a set of measures for grouping them into a set of predefined categories. The toolchain is mainly based on LOD Laundromat with some integration with Luzzu. Further, based on the assessment several improvements are proposed.

The topic of the paper is relevant to the theme of the special issue and the paper is well written with an easy-to-read structure. Though there are several metrics defined in the literature for evaluating the quality of RDF literals, the authors' claim is reasonable that there are no studies focused completely on RDF literal quality at web scale, which makes the contributions novel.

Detailed Comments:

Section 2:
There are several metrics in the existing literature that are related to the quality of RDF literals. For example, the ones mentioned under syntactic validity, consistency, interoperability etc in [1]. An analysis of those metrics would enrich the related work section.

Section 4:
Since the authors formally represent the definition of literals from Section 3.3 (Literals) of the RDF 1.1 specification, I would suggest adding a reference to that section of the specification. It might help the reader to understand why rdf:langString has to be considered a special IRI, among other details.

Similarly, Section 4.2 (semantics of literals) seems to be based on “Section 2.3 The Lexical Space and Lexical Mapping” of the XSD 1.1 Part 2: Datatypes specification. It might be helpful for the reader if that specification is referenced.

Figure 1 - I assume that each level should sum up to 100% (e.g., Supported 60%, Unsupported 40%) rather than having 10% in all nodes. Is that correct?

I wonder whether the introduction of the unimplemented category in the quality categories mixes different things. In the unimplemented case, are we assessing the quality of the literal or actually some aspect of the RDF processor?

“Canonical literals are of higher quality than non-canonical ones because they allow identity to be assessed more efficiently”. It seems that canonical mappings are not easy to derive and not very useful in some situations. It would help the reader if this statement is further elaborated or an example is given.
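
As a hedged illustration of the kind of example requested here (it is not taken from the paper): in XSD, several lexical forms can denote the same value, e.g. "01" and "1" for xsd:integer, or "1" and "true" for xsd:boolean. Once each literal is rewritten to its canonical lexical form, value identity reduces to plain string comparison. The sketch below covers only these two datatypes and assumes whitespace has already been normalized; the function name canonical_lexical is hypothetical.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# xsd:boolean has four lexical forms but only two canonical ones.
_BOOLEAN_CANONICAL = {"true": "true", "1": "true", "false": "false", "0": "false"}


def canonical_lexical(lexical: str, datatype: str) -> str:
    """Return the canonical lexical form for a small subset of XSD datatypes.

    Raises ValueError for ill-typed input and NotImplementedError for
    datatypes this sketch does not cover.
    """
    if datatype == XSD + "integer":
        # Optional sign followed by ASCII digits; rejects e.g. "1_000" or "1.0".
        if not re.fullmatch(r"[+-]?[0-9]+", lexical):
            raise ValueError(f"not a valid xsd:integer: {lexical!r}")
        # int() drops the plus sign and leading zeros; "-0" becomes "0".
        return str(int(lexical))
    if datatype == XSD + "boolean":
        if lexical not in _BOOLEAN_CANONICAL:
            raise ValueError(f"not a valid xsd:boolean: {lexical!r}")
        return _BOOLEAN_CANONICAL[lexical]
    raise NotImplementedError(datatype)


# Two different lexical forms that denote the same integer value.
a = ("+007", XSD + "integer")
b = ("7", XSD + "integer")
print(canonical_lexical(*a) == canonical_lexical(*b))  # True
```

If a dataset stores only canonical forms, the comparison in the last line needs no parsing at all, which is the efficiency argument the quoted sentence appears to make.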

Can the multiple quality criteria for LTS be related to the existing quality categories such that a unified set of categories is presented?

Section 5:

The tool chain section contains descriptions of each individual tool that has been used for the quality assessment of the RDF literals but it lacks the information about how they were used together as a chain to produce the results presented in the paper.

I quite like how the quality metrics are defined under 5.3.1 ~ 5.3.3. However, I would suggest moving those definitions to “Section 4.3. Measures for literal quality” and integrating them with the content in that section. For instance, 5.3.1. Assessing the Datatype’s Compatibility is quite related to whether the lexical expression belongs to the lexical scope of the datatype.

Further, the measures defined in 4.3 can also define a set of metrics similar to the ones in 5.3.

For single-word literals, aren’t there more efficient ways of doing dictionary lookups than looking for an rdfs:seeAlso property in the resources returned by the Lexvo API? For instance, if I look for a word such as http://www.lexvo.org/page/term/eng/Canonicalization your approach will give a false negative. Do you know the precision and recall of this approach? I think this is somewhat justified as you define this metric as an estimate, but I wonder whether there is a more efficient way to do this than an API call, an HTTP dereference, and RDF processing and querying.

Section 6:

In Table 1, add a footnote or a note somewhere to say that the prefix dt is used for http://dbpedia.org/datatype/.

Similar to examples given in partially defined and non-canonical cases, it would be valuable to have a small discussion on the causes of invalid literals.

I assume with the results you have, Figure 1 can be reproduced in this section with the relevant numbers.

It would be useful to have a summary of the top datatypes used in the literals overall, for instance, similar to Table 4. That will give some perspective when reading Tables 1 ~ 3.

What was the reason for using only 470 data documents from the LOD Laundromat with Luzzu? As the paper talks about web-scale analysis, isn’t this number relatively small? According to the toolchain description, I assume these tasks were fully automated.

Figure 2 doesn’t provide much valuable information on the reasons behind low or high compatible datatype metric values. It needs further discussion, and maybe Figures 2 and 3 can be omitted if they don’t provide much information.

Section 7:
Does “4. Datatype IRI are regularly not resolved with respect to their RDF prefixes” mean that there are a lot of undeclared prefixes in RDF documents? That statement is a bit ambiguous.

Outdated language tags are first mentioned in section 7. I assume they should be introduced and discussed in Section 5.3.2 and 6.2.

The analysis of different language processing libraries (e.g., “How accurate are the language detection libraries?”) deviates a bit from the main focus of this paper, I think. I see this analysis more as “homework” for deciding the most suitable approach for detecting the correctness of language tags in RDF literals. Once it is done, the focus should be on automatically deriving those metrics on a large number of documents rather than on a small sample. A large portion of the paper is devoted to the analysis of the tools, with a lot of tables and figures, but it contains less information about the quality aspects of the language-tagged strings.

Section 8:
Throughout the paper it is not clear how LOD Laundromat and Luzzu are integrated in this work. It rather seems like two separate lines of work with minimal integration.

The conclusions seem to be a bit weak and can probably be improved by providing insights about the overall process of evaluation, challenges, etc.

General:

There are some formatting issues to be fixed, e.g., 1457568017 on page 8 and {ToDo Wouter} literals on page 11.

[1] Zaveri, Amrapali, et al. "Quality assessment for linked open data: A survey." Submitted to the Semantic Web Journal (2015).

