Open Data Quality from a Developers' Perspective

Tracking #: 1245-2457

Antonio Vetrò
Ertugrul Bircan Copur
Marco Torchiano

Responsible editor: 
Guest Editors Quality Management of Semantic Web Assets

Submission type: 
Full Paper
Context: Developers are among the most active consumers of Open Data, building new services and applications upon them. However, often data quality problems limit the potential for this type of Open Data reuse. Objective: We aim at understanding if a metric-based evaluation of the quality of Open Data is able to predict the problems experienced by developers building applications that use Open Data. Method: We collected from developers the negative and positive aspects of a sample of datasets they used to develop applications, and compared them with the evaluation provided by a set of metrics. Results: The main gap between the developers' feedback and the adopted metric-based evaluations was the inability to compare the entities in the datasets to real life references and to detect format problems. We observed a few agreements between developers' perception in Accuracy and Understandability. In addition, from a higher perspective, developers lamented the lack of feedback channels between users and publishers and lack of search mechanisms. Conclusions: Although the small sample of datasets and participants used in this study cannot lead to any generalisation, these first results give proper indications on the tuning of the measurement framework to better address developers' issues.


Solicited Reviews:
Review #1
By Magnus Knuth submitted on 07/Jan/2016
Major Revision
Review Comment:

The authors target a very interesting topic in a forward-looking way. The importance of data quality metrics has been highlighted frequently and several metrics have been developed, while the actual meaning of these metrics for the end users has often been waived or regarded from a theoretical (non-practitioner) perspective. Therefore, the authors compare the metrics' measures to the developers' expectations.

==Significance of the results==
The study performed has only a very limited extent: it included only 4 interviewees and 5 datasets. Furthermore, all datasets were published by the same publisher ( None of the datasets is available as RDF, but all are available in an open format (3 stars open data!). It should be discussed whether or how the applied quality metrics can be applied to RDF data (simply to explain relevance to this journal issue).

The discussion is insufficient and some questions remain:
"interviewees had also in mind how complete the datasets were in comparison to the real world entities": The interviewees had a different concept of completeness; is this a misunderstanding between interviewees and authors? It should be explained which metric is affected by missing rows; it could also affect Accuracy (dataset vs. real world) or Currentness (possibly outdated data).
"a second discrepancy between participants’ answers and metrics for Accuracy in dataset 4" was mentioned but not discussed at all.

It is questionable whether RQ1 can be answered in that way from the given study. The study is simply too small and the datasets too unbalanced to identify a general answer.
"In addition, a common problem ___" what?

The described difficulties in finding appropriate datasets because of "a lack of search mechanisms" sound like a lack of dataset descriptions, and again are biased by the single publisher considered. Therefore, RQ2 cannot be fully answered that way either (at least this has been acknowledged in Sec 7).

==Quality of writing==
The paper is well written and easy to understand. There are some typos (especially in Sec 3): "... up to three datase_s_ts.", "... then compared ____ the positive ...", "Every Developer ...", "The same dataset__ are ...", "For the dataset the ____ were ...", "... real world entities reference_s_ ...".

At some points I recognized inconsistencies when the quality characteristics are listed:
* in Sec 3 Phase II you list "Expiration" (which would make it seven characteristics)
* "Expiration" does not occur in Sec 4
* in Tab 2 only five are listed ("Traceability" missing)

I really would like to see such work published since it brings the open data movement forward, but I don't see a sufficient level of significance in the current state of this work for a journal publication. The study should be evaluated on a broader basis, at least with more varied datasets from diverse data publishers.
Some more related work should be considered.

Review #2
By Maribel Acosta submitted on 03/Feb/2016
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

(1) originality
This work presents an empirical evaluation of quality metrics proposed by the authors in previous work. The novelty and research contributions of this work are limited.

(2) significance of the results
The sample sizes used in the conducted experiments are rather small, therefore it is very questionable how representative the results can be.

(3) quality of writing
The manuscript is not self-contained. Formal definitions of the metrics are not presented and the reader is pointed to previous work for important details of the approach.

This manuscript presents an empirical evaluation to assess quality metrics that were previously proposed by the authors. The experiments consist of three successive phases. First, an interview with four developers that have worked with Open Data sets is conducted; interviewees are asked about positive and negative features of the datasets. Then, the set of proposed metrics is computed for all the datasets described in the first phase; metric values are compared to the answers of the developers. Lastly, discrepancies between developers’ answers and computed metrics are discussed with the developers to gather further explanations about the observed outcome.

The main strong point of this manuscript is that the motivation of this work is very clear. However, the description of the related work, proposed metrics, and experiments should be improved to make the paper clear and self-contained (see more details below).

In the Background and Motivation section, the authors refer to the “Five Star Linked Open Data” as “Five Star Open Data”. It is important to note that Open Data is not necessarily *Linked* Open Data, so these two terms should not be used interchangeably. Furthermore, the authors enumerate Open Data characteristics that have been identified by the Open Government Group; however, these characteristics are not aligned with the ones studied in this work.

The description of the metrics presented in Table is not very precise. Each metric is defined with either redundant or ambiguous terminology, therefore the following questions should be addressed in order to provide a more comprehensive definition of the metrics.
Q1 Why are these metrics tailored for Open Data and not other types of data?
Q2 Why were these specific characteristics chosen?
Q3 How is each metric computed exactly?
Q4 What is the range of each metric?
Q5 What does a “current value” signify in this context?
Q6 What is the “period of time referred by the dataset”?
Q7 What is a “meaningful value”? Is a value that is incorrect but coherent with the domain still considered meaningful?
Q8 What specific standards are taken into consideration to measure compliance?
Q9 How is the degree measured to which a dataset follows a standard?

Regarding the evaluation, the description of the experimental settings is not sufficient to allow for reproducibility. The design of the questionnaires, the process to conduct the interview, description of the interviewees and the datasets should be provided.
Q10 Besides the dataset characteristics, were there further instructions about the type of answers that should be provided by the interviewees?
Q11 How was traceability assessed by the interviewees? (Traceability does not appear in Table 2)
Q12 What is the level of experience of the interviewees with the Open Data sets?
Q13 How many of the interviewed developers worked with each individual dataset?
Q14 How many rows and attributes does each dataset contain?
Q15 Are the datasets used in other applications?

For the outcome of the three stages, the authors present a coarse-grained analysis of the results. The normalization and aggregation of the metrics is not well justified. I would recommend that the authors address the following questions and include further details of the obtained results.
Q16 What are the values obtained for each metric?
Q17 Why were the specific ranges <0.4, 0.4-0.6, >0.6 chosen?
Q18 What function was used to aggregate the metrics in each characteristic?
Q19 Was there agreement among the interviewees regarding the negative/positive characteristics of each dataset? How much?
Q20 Why could the “Currentness” metrics not be computed (according to Table 2)?

In addition, it seems that interviewees were not clearly instructed how to assess each of the dimensions of the quality issues (questions P1-Q2, P1-Q4, P1-Q6 in Table 2). As indicated in Section 6, the interviewees had a different definition of “Completeness” than the one presented in Table 1. Therefore, the outcome of the interview cannot be directly compared with the outcome of the metrics, since they seem to measure different things.

The outcome of the empirical study provides interesting insights about developers’ experience when dealing with Open Data sets. However, as confirmed by the authors, these results are very premature and no generalizations can be obtained from this study.

In summary, the presented work tackles the interesting problem of quality assessment in Open Data. Unfortunately, even if the authors implement all the comments raised in the review, I consider that the research contributions of this work are not enough to be considered a journal publication. As a final remark, this manuscript seems not to fit in the topics of the special issue on “Quality Management of Semantic Web Assets” since the presented approach is not related to Semantic Web technologies but to Open Data.

Review #3
By Anisa Rula submitted on 29/Feb/2016
Major Revision
Review Comment:

This paper provides an exploratory study on the data quality assessment topic. The paper introduces an empirical method to understand how good the quality metrics employed for the assessment of quality are with respect to the quality point of view of data providers. As in an empirical work, the authors formulate two hypotheses and test them based on a small number of datasets.

I'll continue with the formal criteria for reviewing the paper as per the CfP.

(1) originality

The paper is clearly on topic and tackles a very important and non-trivial issue for the Open Data community. The work is, to the best of my knowledge, novel in its breadth and depth. I do, however, have a few concerns about the paper and in particular with a lack of clarity in how it conveys its ideas.

The general observation is that the paper reads more like a report.

(2) significance of the results

On which theoretical basis is the evaluation built? The explicit hypotheses formulated should be tested based on standardized statistical methods (such as regression, t-test, chi-square, etc.) in order to determine the validity of the empirical research. However, the major weakness lies in the Results section. The authors have tested their hypotheses on only four data providers, which is too small a sample to be statistically significant.

Regarding the datasets, the authors do not consider aspects such as the size and the structure of each dataset.

Based on what criteria did you select the sample of Open Datasets?

(3) quality of writing

The paper is well written in most parts, but not in others: there are various awkwardly constructed paragraphs and confusing statements throughout the middle of the paper, which hinder its comprehension. I try to provide an *incomplete* list of minor comments along these lines at the end of the review. As well as addressing these, I strongly encourage the authors to go through the paper again and try to sharpen the writing throughout.



Section 1
* The Open Knowledge Foundation… -> use a reference when you use others’ text
Section 2
* has risen -> has grown
* the work in Batini et al. is not only on web portal quality but mainly on relational data quality
* revise “consuming both time and effort”
* the background and motivation section is focused mainly on reports rather than conference or journal papers

Section 3
* what do you mean by positive and negative aspects? You have never mentioned/defined them before
* this section could have been generalized. You could propose each phase in a methodological way such that it can be reproduced by others considering further datasets.
* what was the knowledge of developers on data quality? People who have not studied data quality are only sensitive to two dimensions: accuracy and completeness. They usually refer to quality as inaccuracy issues.
* is the list of questions comprehensive? based on what criteria did you select the questions? why so many questions? are they enough to gather relevant information for the study? How do you filter the questions of the data providers? How do you avoid noisy information?
* what is the difference between measure and metric in Fig. 1?

Section 4
* instead of using “we applied the following methodology” -> you may say that you used a threshold in order to be able to distinguish between high and low quality data
Section 5
* table 3, the colours are not visible when printed
Section 6
* unfinished sentence “in addition, a common problem”

* This list is very much incomplete! *