Introducing the Data Quality Vocabulary (DQV)

Tracking #: 2079-3292

Authors: 
Riccardo Albertoni
Antoine Isaac

Responsible editor: 
Eero Hyvönen

Submission type: 
Ontology Description
Abstract: 
The Data Quality Vocabulary (DQV) provides a metadata model for expressing data quality. DQV was developed by the Data on the Web Best Practices (DWBP) Working Group of the World Wide Web Consortium (W3C) between 2013 and 2017. This paper aims to provide a deeper understanding of DQV. It introduces its key design principles, main components, and the main discussion points that were raised in the process of designing it. The paper compares DQV with previous quality documentation vocabularies and demonstrates the early uptake of DQV by collecting tools, papers, and projects that have exploited and extended it.

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Christoph Lange submitted on 29/Apr/2019
Suggestion:
Minor Revision
Review Comment:

This paper presents the Data Quality Vocabulary (DQV), which has been standardised as a W3C Working Group Note, covering its methodology and design principles (Section 2), its architecture and terms (Section 3), and community uptake (Section 5).

The ontology is clearly of great relevance: while the paper is a bit short on motivating the relevance of exchanging information about data quality, it does clearly point out that DQV fills this gap in a better way than related works (Section 4): "(1) being the result of a community effort […] (2) easing interoperability adopting design principles such as minimal ontological commitment and the reuse of best-of-breed W3C vocabularies; (3) covering a wide spectrum of quality requirements including the representation of metrics, quality measurements, certificates, and quality annotations." The ontology is also of good quality. Not only would it not have survived the W3C process (1) otherwise, but the explanation of design principles confirms that (2) is indeed the case, and the comprehensive overview given in Section 3 also confirms (3).
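For readers new to DQV, a minimal Turtle sketch of what point (3) covers, with a metric, its dimension, and a measurement; the names :myDataset, :downloadAvailability and :availability are hypothetical, not taken from the paper:

@prefix dqv: <http://www.w3.org/ns/dqv#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix :    <http://example.org/> .

:downloadAvailability a dqv:Metric ;
    dqv:expectedDataType xsd:boolean ;
    dqv:inDimension :availability .

:availability a dqv:Dimension .

:myDataset dqv:hasQualityMeasurement :m1 .

:m1 a dqv:QualityMeasurement ;
    dqv:isMeasurementOf :downloadAvailability ;
    dqv:computedOn :myDataset ;
    dqv:value "true"^^xsd:boolean .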

With a good balance of explanations of the general principles of the ontology, examples for its practical application, and figures visualising the coverage of terms and the available implementations, the paper clearly conveys to the reader the key aspects of the ontology.

I clearly recommend acceptance, provided that a number of minor shortcomings are fixed (see details in the annotated PDF at https://www.dropbox.com/s/4pqcg0l921ptu9s/swj2079.pdf?dl=0):

Content:
* In Section 2, it is not clear what you mean by "130 […] actions", as opposed to mailing list messages and issues (which you count separately).
* The idea of "metrics depending on parameters" is introduced as a side note in Section 3.2, but should rather be featured more prominently.
* Regarding the review of the daQ related work (of which I am a co-author), you should also mention the QPRO ontology from the daQ family, which supports "quality reports" similar to what you refer to as "listing errors and inconsistencies found assessing the quality metrics"; see https://arxiv.org/pdf/1412.3750 and http://theme-e.adaptcentre.ie/qpro/qpro.html
* It might make sense to discuss that some aspects of definitions of metrics cannot be modelled explicitly using DQV, e.g., the definition that (Example 10) "A dataset is available if at least one of its distribution is available"; a hedged sketch of this situation follows this list.
* In Fig. 2 the number of publications per year that cite DQV is going down. Can you explain why?
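A hedged Turtle sketch of the Example 10 situation above: DQV records per-distribution and per-dataset availability measurements, but the rule connecting them stays implicit, e.g. in the metric's textual definition. All names except the standard vocabulary terms are hypothetical:

@prefix dcat: <http://www.w3.org/ns/dcat#> .
@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix :     <http://example.org/> .

:myDataset a dcat:Dataset ;
    dcat:distribution :dist1 ;
    dqv:hasQualityMeasurement :mDataset .

:dist1 a dcat:Distribution ;
    dqv:hasQualityMeasurement :mDist1 .

:mDist1 a dqv:QualityMeasurement ;
    dqv:isMeasurementOf :distributionAvailability ;
    dqv:value "true"^^xsd:boolean .

# "Available if at least one distribution is available" cannot be stated
# in DQV itself; only the resulting measurement is recorded:
:mDataset a dqv:QualityMeasurement ;
    dqv:isMeasurementOf :datasetAvailability ;
    dqv:value "true"^^xsd:boolean .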

Presentation:
* lots of minor grammar, punctuation and typography issues

Review #2
By Jose Emilio Labra Gayo submitted on 25/Jul/2019
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Ontology Description' and should be reviewed along the following dimensions:
(1) Quality and relevance of the described ontology (convincing evidence must be provided). (2) Illustration, clarity and readability of the describing paper, which shall convey to the reader the key aspects of the described ontology.
In my opinion, although the data quality vocabulary is a very simple ontology, it contains some classes and properties which can be applied to assess the quality of datasets and distributions and which can be reused in a lot of contexts.
It follows two patterns or best practices:
- Minimal ontological commitment
- Reuse of existing vocabularies
which influence its minimalistic design. The described ontology has been designed as part of a W3C working group, so its design has been influenced by the issues raised inside the WG. In my opinion, although it is a simple ontology with only 10 classes, it attempts to cover a generic use case and reuses classes from other ontologies following best practices to increase interoperability.
On the other hand, the paper is readable and contains a good description of the motivation and rationale for the different aspects of the ontology.
Some detailed comments follow:
Page 1. In the enumeration of quality documentation vocabularies, shouldn't there be an "and" before the last one? : "…the Data Quality Management Vocabulary (DQM) [16], the Quality Model Ontology (QMO) [27] <> Evaluation Result ontology (EVAL) [28], <> the Dataset Quality Ontology (daQ) [11])"
Page 1. When enumerating quality assessment vocabularies, I miss EARL [https://www.w3.org/TR/EARL10-Schema/], although I know that it is not exactly about assessing the quality of data, it has been used to assess the quality of implementations for W3C recommendations and it contains several classes/properties related, like: earl:TestResult, earl:Assertor, earl:outcome, etc which could have been reused.
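For comparison, a sketch of how such an assessment is typically shaped in EARL (all non-EARL names are hypothetical):

@prefix earl: <http://www.w3.org/ns/earl#> .
@prefix :     <http://example.org/> .

:assertion1 a earl:Assertion ;
    earl:assertedBy :myQualityChecker ;    # an earl:Assertor
    earl:subject    :myDataset ;           # an earl:TestSubject
    earl:test       :availabilityCheck ;   # an earl:TestCriterion
    earl:result     [ a earl:TestResult ;
                      earl:outcome earl:passed ] .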
Page 2. <> and future work summarize the contributions…
Page 2. "near weakly teleconferences" -> "near weekly teleconferences"
Page 2. "to avoid systematics use of URIs" -> "to avoid systematic use of URIs"
Page 3. Figure 1 doesn't contain "dqv:computeOn", "dqv:expectedDatatype" or "dqv:value", is there a reason for it?
Page 3. The link: "prov:wasDerivedFrom, prov:wasAttributedTo, prov:wasGeneratedBy" points to "prov:Entity Activity or Agent (resp.)" but the order is wrong, it should be: "prov:Entity, Agent or Activity (resp.)", I mean, the range of prov:wasGeneratedBy is prov:Activity while in the figure it seems to be prov:Agent.
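In other words, assuming the standard PROV-O ranges, the intended pairing reads as follows (hypothetical resource names):

@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix :     <http://example.org/> .

:qualityMetadata
    prov:wasDerivedFrom  :sourceEntity ;        # range prov:Entity
    prov:wasAttributedTo :someAgent ;           # range prov:Agent
    prov:wasGeneratedBy  :assessmentActivity .  # range prov:Activity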
Page 3. The figure contains a dashed line which says that there is a "containment" relation. It is not clear what it means; maybe remove it to simplify the figure, or explain it better.
Page 3. The paper contains the sentence "SHACL is suggested to express cardinality constraints" but there is no further explanation for that. ShEx can also express cardinality constraints and has other features, like recursion, that could accommodate this use case well. I have been playing around with a possible ShEx schema for the data quality vocabulary and I think ShEx can also be used for such a task. If the authors are interested, here is a first draft: https://github.com/labra/dqv-shapes/blob/master/dqv.shex
In my opinion, both ShEx and SHACL can accomplish the definition of DQV integrity constraints, so I would suggest that the authors change the sentence, saying that it may be possible to define integrity constraints in the future with either ShEx [1] or SHACL.
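For illustration, a hypothetical SHACL shapes graph (in Turtle) expressing one such cardinality constraint; this is a sketch, not a shape published by the WG:

@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix dqv: <http://www.w3.org/ns/dqv#> .
@prefix :    <http://example.org/shapes#> .

# Every quality measurement points to exactly one metric.
:QualityMeasurementShape a sh:NodeShape ;
    sh:targetClass dqv:QualityMeasurement ;
    sh:property [
        sh:path dqv:isMeasurementOf ;
        sh:class dqv:Metric ;
        sh:minCount 1 ;
        sh:maxCount 1
    ] .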
Page 3. The sentence "Conformance to standards is modeled with DCTERMS borrowing a pattern from DCAT-AP (Issue 202)." is not very easy to understand. I had to go to Issue 202 to see the resolution, and if I understand it right, it just means that it is suggested to use the dcterms:conformsTo property. Simplifying the sentence would make the paper more readable and spare readers from having to read the issue resolution.
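If that reading is right, the pattern boils down to a single triple, e.g. (hypothetical resources):

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix :        <http://example.org/> .

:myDataset dcterms:conformsTo <http://example.org/some-quality-standard> .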
Page 3. The authors say that the namespace defines one instance, dqv:qualityAssessment, but looking at the vocabulary, it seems that there is also the dqv:precision instance. Has it been removed?
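For context, dqv:qualityAssessment is the oa:Motivation instance used in quality annotations, e.g. (hypothetical names):

@prefix dqv: <http://www.w3.org/ns/dqv#> .
@prefix oa:  <http://www.w3.org/ns/oa#> .
@prefix :    <http://example.org/> .

:myFeedback a dqv:QualityAnnotation ;
    oa:hasTarget   :myDataset ;
    oa:hasBody     :feedbackNote ;
    oa:motivatedBy dqv:qualityAssessment .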
Page 4. "…where each dimension represents <> quality related characteristic…"
Page 5. "… and metrics emerged from the use case analysis in the form <> the R-QualityMetrics requirement:"
Page 5: "…so that implementers <> encouraged to represent all measurements…"
Page 6: Add a reference or footnote to CubeViz
Page 6:
":measurement1 qb:dataSet :linksetQualityMeasurements .
:measurement2 qb:dataSet :linksetQualityMeasurements ."
The first appearance of "qb:dataSet" is in bold, while the second is not.
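For readers skimming, the excerpt's context can be reconstructed roughly as follows; a sketch using the paper's example names plus standard DQV and Data Cube typing:

@prefix qb:  <http://purl.org/linked-data/cube#> .
@prefix dqv: <http://www.w3.org/ns/dqv#> .
@prefix :    <http://example.org/> .

:linksetQualityMeasurements a qb:DataSet .

:measurement1 a dqv:QualityMeasurement ;   # also a qb:Observation by subclassing
    qb:dataSet :linksetQualityMeasurements .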
Page 10: "Quality statement<> expressed in DQV qualify as…"
Page 10. The section about quality provenance talks about a reification model based on named graphs. Does the approach described also fit other reification models, like the one employed in Wikidata? If it does, a statement about it would be worth adding.
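For instance, a hypothetical TriG sketch of the named-graph approach the section describes:

@prefix dqv:  <http://www.w3.org/ns/dqv#> .
@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix :     <http://example.org/> .

# The quality statements live in a named graph...
:qualityGraph {
    :myDataset dqv:hasQualityMeasurement :measurement1 .
}

# ...and provenance is attached to the graph name.
:qualityGraph prov:wasAttributedTo :myQualityChecker ;
    prov:generatedAtTime "2019-08-01T00:00:00Z"^^xsd:dateTime .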
Page 10. Before Example 9, "by the <> tool (:myQualityChecker)" (remove the "a").
Page 11. "It is possible to use<> PROV-O's …"
Page 11, "…but it does not <> other DQV quality components…"
Page 11. "…they represent<> metrics and the results…"
Page 11. "…vocabularies such as SKOS to represent<> quality metrics…"
Page 11. "…reuse of best-of-breed W3C vocabulary<>…"
Page 13. "…and data engineering at web-<>cale
Page 13. "…dqv:QualityAnnotation to document<> quality certificate<>."
Page 13. "…in order to document <> its results…"
Page 13. "…has lead us to…" -> "…has led us to…"
Page 13. "…to avoid <> domain restrictions…"
Page 14. The "best-of-breed" wording is used in 8 places in all the paper…maybe replace some of those usages by a different wording? This is a style suggestion, but when I read the paper, I got a bit tired of some many "best-of-breed"s
Page 14. "from best-of-breed W3C vocabularies <> minimized the number of…"
Page 14. "…define a default SHACL profile to help adopters to understand the (few) constraints that apply to DQV data by default…"
As I commented earlier, I think it could also be done with a ShEx schema, which could handle the possible cycle between the properties dqv:computeOn and dqv:hasQualityMeasurement. On the other hand, the notion of a SHACL profile has not been established yet in the community; maybe the authors are talking about a SHACL shapes graph?
Anyway, in case the authors accept suggestions for future work, I would like them to also consider defining a ShEx schema like the one started at https://github.com/labra/dqv-shapes/blob/master/dqv.shex, which can provide the right level of abstraction and human-readability.
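For reference, the cycle mentioned above arises because the two properties are declared as inverses (assuming the property meant is dqv:computedOn, as spelled in the DQV namespace); a minimal sketch:

@prefix dqv: <http://www.w3.org/ns/dqv#> .
@prefix :    <http://example.org/> .

:myDataset dqv:hasQualityMeasurement :m1 .
:m1 dqv:computedOn :myDataset .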
Page 14. "The discussion<> in the group are expected to be <
> source for requirements…"
Page 14. I was intrigued by the last sentence "…which could be as well proposed for (partial) mapping into Schema.org.", why the "partial"? Some explanation about it would be worthwhile… although if the authors prefer to keep it as is, I am fine with it.

Review #3
Anonymous submitted on 03/Aug/2019
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Ontology Description' and should be reviewed along the following dimensions: (1) Quality and relevance of the described ontology (convincing evidence must be provided). (2) Illustration, clarity and readability of the describing paper, which shall convey to the reader the key aspects of the described ontology.

The metadata model Data Quality Vocabulary (DQV) for representing data quality is presented, a result of a W3C Working Group active between 2013 and 2017, extending the data catalog model DCAT.

The novelty of the vocabulary is argued to lie in combining characteristics of earlier related vocabularies (DQM, QMO, EVAL, daQ) in a useful way. The paper extends an earlier W3C Working Group Note by detailing the process, methodology, model, and uptake of DQV.

The methodology and process of developing DQV (Section 2), starting from the requirements, is well documented. In a more general setting, the paper sheds light on how W3C working groups work. The paper makes reference to the WG "Issues" in a tracker available on the Web. From a readability point of view, more explanation of the issues would be useful in some cases, as the reader is not likely to dig out the issues while reading. I also wonder how persistent the issue tracker is as a reference.

The model is described in Section 3 with an illustrative figure. Color coding is used to show the namespaces of the imported classes, which may be a problem in black-and-white print. Namespaces could therefore be mentioned explicitly for clarity in the legend boxes on the bottom line.

The model is compared and related quite well to various related data models (Section 4).

As for the uptake, the authors maintain a list of implementations on the Web. In 2016, there are over 30 implementations, but in 2018 only 15. According to this, the usage of the model seems to be declining, which is a concern.

To sum up the paper along the evaluation criteria in this category of papers:

(1) Quality and relevance of the described ontology (convincing evidence must be provided).

The quality looks very good, and the work is a result of careful considerations and group work. As for the relevance, quality is a big concern in Linked Data. However, the authors could explain and motivate more in the paper when and why machine readable quality representations are needed. For example, why is the model used and important in the implementations of Table 1? The usage data should be updated, too. Is the usage really declining and why (Fig. 2), or are the tables just not updated? The version on the web is still the same as in the paper.

(2) Illustration, clarity, and readability of the describing paper, which shall convey to the reader the key aspects of the described ontology.

The presentation is detailed, well illustrated, and clear. Some readability issues were noted above. The presentation focuses on documenting the model, and I recommend adding more motivation for the key modeling choices where possible, not just describing the final model.

Minor typos:

Use camel-back notation in the headings 3.6 and 4.

possible to uses -> possible to use