A comprehensive quality model for Linked Data

Tracking #: 1247-2459

Filip Radulovic
Nandana Mihindukulasooriya
Raúl García-Castro
Asunción Gómez-Pérez

Responsible editor: 
Guest Editors Quality Management of Semantic Web Assets

Submission type: 
Full Paper
With the increased amount of Linked Data published on the Web, the community has recognised the importance of the quality of such data and a number of initiatives have been undertaken to specify and evaluate Linked Data quality. However, these initiatives are characterised by a high diversity in terms of the quality aspects that they address and measure. This leads to difficulties in comparing and benchmarking evaluation results, as well as in selecting the right data source according to certain quality needs. This paper presents a quality model for Linked Data, which provides a unique terminology and reference for Linked Data quality specification and evaluation. The mentioned quality model specifies a set of quality characteristics and quality measures related to Linked Data, together with formulas for the calculation of measures. Furthermore, this paper also presents an extension of the W3C Data Quality Vocabulary that can be used to capture quality information specific to Linked Data, a Linked Data representation of the Linked Data quality model, and a use case in which the benefits of the quality model proposed in this paper are presented in a tool for Linked Data evaluation.

Major Revision

Solicited Reviews:
Review #1
By Jeremy Debattista submitted on 26/Jan/2016
Major Revision
Review Comment:

In this paper, the authors present a quality model for Linked Data based on the ISO 25012 data quality model, and formalise a classification of different quality measures. They also extend the W3C Data Quality Vocabulary (DQV) in order to serve the proposed quality model better. The proposed model was implemented in a tool used to evaluate Linked Data. On the whole the paper is well written, with some small typos which can easily be fixed.

=== Section 2 ===
In the related work section, I would have expected more research about current quality models such as the W3C Data Quality Vocabulary, daQ or Fürber’s work [1] - all of which were mentioned in Section 4.2. I would rename the section to Preliminaries, rather than Related Work, as what was described helps the reader to understand the rest of the paper better. I was a bit confused with the phrase “To the best of our knowledge, there is no clearly defined quality model for Linked Data”. I disagree with this statement as quality models (or better meta-models) for Linked Data have been described in DQV and daQ. If the authors’ idea of quality model refers to the idea of having a taxonomy with a number of quality measures that can be reused, then I do not agree about the “clearly defined” part, as quality measures can be recommended but others can have different perspectives of the same measures. Therefore, my understanding of “quality model” is a conceptual meta-model that enables the description of quality related information regarding some aspect of Linked Data.

=== Section 3 ===
In this section, the authors present a quality model for Linked Data, adopting the ISO terminology. The separation of the Linked Data aspects is interesting, though this seems to be similar to the categories defined in Zaveri et al. [2]. The difference between the two categorisations is that in this paper the authors distinguish between data quality characteristics and infrastructure quality characteristics. I don’t consider the serialisation aspect to be part of the inherent group; it is better suited to the infrastructure group. The serialisation per se does not really affect the data characteristics, but it does affect other issues related to infrastructure, for example lack of interoperability, syntactic errors, etc. The base and derived measures were well explained, though it would have been better if the authors had explicitly identified their real contributions (against the referenced work in [2]). This section seems to be an extended version of a number of metrics defined in [2] with more details, and with an additional mapping to the ISO quality model.

=== Section 4 ===
In this section, the authors present a conceptual hierarchical model for the LD quality model and extensions to the DQV model. In the conceptual model, the authors show how quality should be represented. This looks a lot like DQV and daQ. On closer inspection of the proposed model, I found that some of the introduced concepts and properties are unnecessary. For example, why is a ranking function required in a metric? Ranking should be separated from the metric itself, as it is ultimately the consumers (or whoever wants to explore quality models) who decide how to rank different LD aspects. The “Granularity” concept is also unnecessary, as dqv:QualityMeasure is equivalent to daq:Observation, which has the “computedOn” property that seems to cover any assessed resource. On the other hand, extensions related to the semantics of a metric, such as the automation level and the expected duration (although the latter can differ between machines and depending on what is being assessed), are really useful.
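The separation argued for above can be illustrated with a minimal sketch, in which metric values are stored without any ranking information and each consumer supplies its own scoring function. The dataset names, metric names, values, and weights are all hypothetical and are not taken from the paper, DQV, or daQ.

```python
from typing import Callable, Dict, List

# dataset name -> metric name -> measured value (all values are made up)
Measurements = Dict[str, Dict[str, float]]

measurements: Measurements = {
    "dataset_a": {"dereferenceability": 0.9, "conciseness": 0.4},
    "dataset_b": {"dereferenceability": 0.6, "conciseness": 0.8},
}


def rank(data: Measurements, score: Callable[[Dict[str, float]], float]) -> List[str]:
    """Order datasets from best to worst according to a consumer-supplied score."""
    return sorted(data, key=lambda name: score(data[name]), reverse=True)


# Consumer 1 only cares about dereferenceability.
by_access = rank(measurements, lambda m: m["dereferenceability"])

# Consumer 2 weights conciseness more heavily (the weights are an assumption).
by_weighted = rank(
    measurements,
    lambda m: 0.2 * m["dereferenceability"] + 0.8 * m["conciseness"],
)

print(by_access)
print(by_weighted)
```

The same measurements yield different orderings for the two consumers, which is the sense in which ranking need not be fixed inside the metric definition.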

=== Section 5 ===
It is always a plus to implement such a model in a use case; the problem is that the tool did not work for me. I tried to assess a DBpedia resource, and the results did not appear within 5 minutes. Also, I think users should be free to choose which metrics are assessed; after all, quality is commonly defined as “fitness for use”.

=== Section 6 ===
I acknowledge the authors’ attempt to evaluate the quality model in the discussion section, although a thorough evaluation is required. For example, how practical is the model? To what extent could it be used? Are there any applications (apart from LD Sniffer) using both the extension of the model and the defined taxonomy?

=== Final Remarks ===

Although this paper has some interesting aspects, in my opinion this work lacks originality. Whilst understandably different, there are already a number of quality taxonomies available for Linked Data, e.g. [1] and [3]. Also, reading parts of this paper felt like reading [2] and its references. I suggest that the authors focus on new quality measures, rather than re-explaining what was described in [2] (and its references).

The conceptual model is very similar to that described in DQV and daQ. I am not sure whether the LD community needs another conceptual ontology (or extensions) with different terminologies and thus suggest that unless necessary, the authors should stick to the existing terminology described in the standard DQV. The contribution here seems to be the small extension made to the DQV ontology. My question is, how will these new extensions fit in existing quality assessment frameworks? Also, were there any problems in describing these quality measures in DQV or daQ? This kind of exercise would be really interesting in such a paper as then the reader would really understand the importance of the proposed extensions. These extensions lack supporting evidence for why they are required. Regarding the extension, I wholeheartedly agree with the introduction of the “Assessment Technique” concept, and the new sub-types of dqv:Metric (given that eventually they are supported in existing quality assessment frameworks), but on the other hand the introduction of QMO and Eval duplicates the efforts in DQV and daQ.

Minor Comment:
I don’t know what kind of referencing system the authors used, but generally I think references should be in alphabetical order.

[1] Christian Fürber and Martin Hepp. 2011. Towards a Vocabulary for Data Quality Management in Semantic Web Architectures. In Proceedings of the 1st International Workshop on Linked Web Data Management (LDWM). ACM, New York, NY, USA, 1–8. DOI:http://dx.doi.org/10.1145/1966901.1966903

[2] Amrapali Zaveri, Anisa Rula, Andrea Maurino, Ricardo Pietrobon, Jens Lehmann, and Sören Auer. 2014. Quality assessment for linked data: A survey. Semantic Web – Interoperability, Usability, Applicability.

[3] Jeremy Debattista, Christoph Lange, and Sören Auer. 2014. Representing Dataset Quality Metadata using Multi-Dimensional Views. In SEMANTiCS.

Review #2
Anonymous submitted on 26/Jan/2016
Major Revision
Review Comment:

The paper presents a quality model for linked data. The model is based on existing literature with several new measurements and indicators.
The topic is definitely relevant, and after some years of empirical studies and partial proposals, this is a systematic proposal. As the authors state, some organisations have already published quality models for Linked Data. In particular, the concrete differences between the model presented in the paper and the DQV must be clarified in order to show the originality of the proposed approach. Moreover, it is not clear how to use the model to assess the quality of a given dataset.
Concerning the model, some points need a better discussion. Concerning the aspects of LD quality, the difference between interlinking, domain data, and the RDF model is not clear. In the paper it is written: “The quality of Linked Data interlinking can be measured with respect to quality characteristics such as accessibility or representational conciseness.” However, conciseness is already described as one of the quality characteristics of the RDF model. Moreover, “in principle” it is possible to have a self-described dataset without the need for external links; in such a case, linkability is related to the “quantity and quality” of the data described, which belongs to the domain data aspect.

The blurred border between domain data and interlinking also shows at the end of page 6, where it is written that the number of interlinked subjects is related to domain data, while it seems to be a typical interlinking aspect.

If there is no clear distinction among the aspects, their role in the proposed quality model is unclear.

In the identification of derived measures, some definitions are “weak”. For example, it is written: “In order to calculate [disjoint class] this derived measure, an ontology is needed as an input in the evaluation.” More information is needed in order to really use such an approach.

The definition of domain consistency is reported as “Whether the type of a subject in a specific triple is consistent with the domain of a property of a triple.” But when is a triple consistent with the domain? Moreover, this seems to be a purely syntactic consistency (very close to the correctness of the RDF model), whereas semantic consistency (e.g. Madrid isCapitalOf USA) is not considered.
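One possible reading of the quoted definition can be sketched as follows, purely for illustration: a triple counts as domain-consistent when its subject is typed with the property's declared domain or with a subclass of it. This treats rdfs:domain as a constraint rather than an inference rule, which is an assumption; the paper's actual definition is not reproduced in this review. All `ex:` identifiers, the helper names, and the sample data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
RDF_TYPE = "rdf:type"


def types_of(subject: str, triples: List[Triple]) -> Set[str]:
    """Collect the classes explicitly asserted for a subject."""
    return {o for s, p, o in triples if s == subject and p == RDF_TYPE}


def with_superclasses(classes: Set[str], subclass_of: Dict[str, Set[str]]) -> Set[str]:
    """Return the given classes plus all of their transitive superclasses."""
    seen: Set[str] = set()
    stack = list(classes)
    while stack:
        cls = stack.pop()
        if cls not in seen:
            seen.add(cls)
            stack.extend(subclass_of.get(cls, ()))
    return seen


def is_domain_consistent(triple: Triple, triples: List[Triple],
                         domains: Dict[str, str],
                         subclass_of: Dict[str, Set[str]]) -> bool:
    """Check the subject's types against the declared domain of the property."""
    subject, prop, _ = triple
    if prop not in domains:
        return True  # assumption: no declared domain means nothing to violate
    return domains[prop] in with_superclasses(types_of(subject, triples), subclass_of)


# Hypothetical ontology fragment and data.
domains = {"ex:isCapitalOf": "ex:City"}
subclass_of = {"ex:CapitalCity": {"ex:City"}}
triples = [
    ("ex:Madrid", RDF_TYPE, "ex:CapitalCity"),
    ("ex:Madrid", "ex:isCapitalOf", "ex:USA"),    # factually wrong
    ("ex:Spain", RDF_TYPE, "ex:Country"),
    ("ex:Spain", "ex:isCapitalOf", "ex:Madrid"),  # subject has the wrong type
]

madrid_ok = is_domain_consistent(triples[1], triples, domains, subclass_of)
spain_ok = is_domain_consistent(triples[3], triples, domains, subclass_of)
print(madrid_ok, spain_ok)
```

Under this reading, the factually wrong triple about Madrid passes the check because its subject has an acceptable type, while only the mistyped subject is flagged. This is the gap between syntactic and semantic consistency raised in the comment above.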

Section 3.6 tries to connect quality characteristics with indicators. IMHO, the indicators seem to be proxies for the quality characteristics, but other important indicators must be considered in the paper.
For example, “[accessibility] can be measured using Average IRI dereferenceability and Average subject dereferenceability.” But other relevant indicators related to the infrastructure are not considered. Does the model intend to include all (or a reasonable subset of) the indicators related to a given characteristic such as accessibility? If not, how can the model help in moving from characteristics to indicators?
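As an illustration of how such an indicator might be derived from a base measure, the sketch below averages a 0/1 dereferenceability check over a set of IRIs. The paper's formula is not reproduced in this review, so the averaging, the treatment of an empty input, and the stubbed checker (which stands in for an HTTP lookup) are all assumptions.

```python
from typing import Callable, Iterable


def average_dereferenceability(iris: Iterable[str],
                               is_dereferenceable: Callable[[str], bool]) -> float:
    """Fraction of IRIs for which the supplied check succeeds."""
    iri_list = list(iris)
    if not iri_list:
        return 0.0  # assumption: an empty input scores 0 rather than raising
    hits = sum(1 for iri in iri_list if is_dereferenceable(iri))
    return hits / len(iri_list)


# Stub standing in for a real HTTP lookup; the IRIs are hypothetical.
reachable = {"http://example.org/Madrid", "http://example.org/Spain"}

score = average_dereferenceability(
    ["http://example.org/Madrid", "http://example.org/Spain", "http://example.org/missing"],
    lambda iri: iri in reachable,
)
print(score)
```

A ratio of this kind only captures whether lookups succeed. Infrastructure-related aspects such as availability over time or response latency would need separate indicators, which is the concern raised above.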

In the conceptual model of Figure 4, some relations among base measures --> derived measures --> indicators could be added.

I tried to use LD Sniffer, but it does not seem to work; I used the Madrid example. Thus, it is difficult to evaluate the quality model “in action”.

Finally, concerning the wiki pages, I suggest paying more attention in future development. For example, on the credibility page http://delicias.dia.fi.upm.es/LDQM/index.php/Credibility the base measures are “Triple fact trust (interval, triple) - Trust of a triple.” or “Many-path trust (interval, IRI) - Many-path trust of a an IRI.”. Trust is a subjective relationship (the paper does not consider the difference between subjective and objective measures), and it is quite difficult to say that, because X trusts a given triple fact, this can be a credibility measure for Y, who does not trust X.

Review #3
By Gavin Mendel-Gleason submitted on 26/Jan/2016
Major Revision
Review Comment:

(1 - originality) The authors of this paper have created a quality model for linked data which is an extension of the ISO 25012 data quality model, altered to include aspects which are unique to linked data quality. The work is based on the Zaveri et al. data quality work, which is now enjoying use in the evaluation of linked data quality. They extend this with some additional measures. The main contribution of the paper appears to be in segregating base measures from derived measures. It also presents a conceptual model which can itself be expressed as linked data to describe data quality measures. The creation of such data quality models specified in linked data ontologies is doubtless of some importance for linked data quality. However, the authors did not suitably motivate the reasoning for their measures or why the particular model should be useful to practitioners.

(2 - significance of results) The significance of the paper is difficult to evaluate because the model is not applied to any concrete, existing linked data. The authors' arguments for their particular conceptual model would have been significantly strengthened by showing the application of the model to some particular data-sets. Without doing so, it is difficult to evaluate how useful the approach would be in practice, and it remains quite abstract.

(3 - quality of writing) The quality of the writing is high. I observed only one typo on page 8 in Section 3.5 equation (1) lenght, and one grammatical error on page 9 Section 3.6 "a large number of measures described in the survey *is* ...".