Measuring Quality of Evolution in Diachronic Web Vocabularies Using Inferred Optimal Change Models

Tracking #: 1244-2456

Albert Meroño
Christophe Guéret
Stefan Schlobach

Responsible editor: 
Guest Editors Quality Management of Semantic Web Assets

Submission type: 
Full Paper
The Semantic Web uses various commonly agreed vocabularies to enable data from various sources to be effectively integrated and exchanged among applications. In this design, a critical point is the arbitrariness with which these vocabularies can change in subsequent versions. New vocabulary versions reflect changes in the domain, meet new user requirements, and address pitfalls. However, these new versions have an impact on the workflow of publishers of Linked Open Data (LOD), who need to sync their datasets with the new vocabulary releases to avoid ramifications. Predictability of changes in diachronic Web vocabularies is thus highly desired. How predictable are these vocabulary changes in practice? In the longer term, how can we measure the quality of evolving Web vocabularies, and discern between those that "evolve conveniently", and those that change on an arbitrary, even harmful, basis? In this paper, we propose a metric to automatically measure the quality of the evolution of Web vocabularies, based on the performance of inferred optimal change models from past vocabulary versions using well understood evolution predictors. We apply this metric to 139 vocabulary chains from various Semantic Web sources, finding that 39.80% of them evolve in a highly predictable manner.

Major Revision

Solicited Reviews:
Review #1
By Giorgos Flouris submitted on 14/Jan/2016
Review Comment:

This paper is about measuring the evolution quality of a dataset. The evolution quality is defined as "predictability of evolution". In other words, if the evolution of a dataset can be "predicted" using data from previous evolutions, then the quality of evolution is "good". I have several concerns about this paper, the most important of which are related to the approach itself.

First of all, I am not sure whether "quality of evolution" should *only* be determined by its "predictability". I understand that predictability gives some kind of comfort that the ontology will not change in arbitrary ways, but why is this "quality"? I find it hard to think of a way to prove/evaluate this hypothesis and the paper only provides intuitive arguments. Nevertheless, I am willing to accept (in a rather "axiomatic" way) that evolution quality is, by definition, equal to predictability.

Predictability is not such a useful concept either. Being able to predict the evolution of a dataset/ontology O1 is surely a good thing at first glance, as a curator with an ontology O2 that depends on the evolving dataset can adapt/adjust O2 accordingly. But predictors are never 100% accurate, so, no matter how good the predictor is, it cannot tell us exactly how O1 will change; moreover, it cannot tell us exactly *when* O1 will change. Therefore, the curator still has to probe O1 periodically and ask for the changes. A good quality (=predictable) evolution only guarantees that, on average, the predictor will be "more correct". So what?

Then there is this issue about training and improving the predictors. Much of the paper deals with training the predictor to improve it and identify the most adequate features to serve as predictors. So, this training improves the predictions, essentially (by the accepted definition above) improving the quality of evolution. This sounds strange: by improving a learning algorithm one improves the quality of something unrelated to the algorithm.

Further, note that evolution patterns for a given ontology may change over time, essentially making the predictor less accurate. Does that mean that the evolution quality has degraded? Perhaps it has improved (i.e., a new optimized predictor could have achieved better results, by the authors' definition), but the old predictor is not suitable any more. There is no mention of this problem in the paper.

More detailed comments appear below.

I noticed that the related work includes papers on change detection. Change detection is indeed partially relevant to the problem. Since the area is reviewed, I would suggest including the canonical references to the area, namely [1] (best paper award ISWC-07), [2] (best student paper award ISWC-09), and/or the extended (journal) version of [2], namely [3].

Also, I would advise the authors to have a look at a recent FP7 IP called "DIACHRON", where some discussion on the quality of evolving datasets appears (an entire WP is devoted to that). Some of the metrics proposed there are classic (applicable also for static ontologies), but some (e.g., volatility) are relevant for quality assessment of evolving ontologies and may be useful.

There are several presentation problems, especially related to Section 3, where several notions (e.g., "change model", "predictors", "optimal change models", "rigid and non-rigid properties", ...) are not explained in their first appearance (or at all). Most of them become clear later, especially in Section 5, but that's too late for the reader.

How are the similarity functions (sim_int, sim_ext, sim_label) defined?

Section 3.2: children are usually defined through the rdfs:subClassOf relation ONLY. The authors seem to allow other properties as well, without specifying which ones. Does any property specify a "children" relation in their model?

For data-driven and usage-driven features, the presented "features" are not really features but examples of data-driven/usage-driven changes.

The Identity Aggregator is essentially an instance matching algorithm, right? Which one is used?

ROC should be explained. It is not obvious to non-experts.
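
For readers unfamiliar with the term, a minimal sketch may help. A ROC curve plots the true positive rate of a classifier against its false positive rate as the decision threshold varies, and the area under it (AUC) equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The labels and scores below are invented for illustration (1 could stand for "concept changes in the next version"); this is not the authors' implementation.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs that the scores rank correctly.
    Ties count as half a correct pair."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical data: 1 = changed, 0 = unchanged; scores are predicted
# probabilities of change from some classifier.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = roc_auc(labels, scores)  # 8 of 9 pairs ranked correctly
```

An AUC of 1.0 means every changed concept is ranked above every unchanged one, while 0.5 corresponds to random guessing.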

The appendix "", mentioned at several points, is not provided.


[1] Dimitris Zeginis, Yannis Tzitzikas, Vassilis Christophides. On the Foundations of Computing Deltas Between RDF Models. In Proceedings of the 6th International Semantic Web Conference (ISWC-07), 2007.

[2] Vicky Papavassiliou, Giorgos Flouris, Irini Fundulaki, Dimitris Kotzinos, Vassilis Christophides. On Detecting High-Level Changes in RDF/S KBs. In Proceedings of the 8th International Semantic Web Conference (ISWC-09), 2009.

[3] Vicky Papavasileiou, Giorgos Flouris, Irini Fundulaki, Dimitris Kotzinos, Vassilis Christophides. High-Level Change Detection in RDF(S) KBs. Transactions on Database Systems (TODS), 38(1), 2013.

Review #2
By Jose Emilio Labra Gayo submitted on 18/Jan/2016
Minor Revision
Review Comment:

The paper presents a systematic study of the evolution of web vocabularies to discern between those that evolve in a predictable way and those whose evolution is not so well behaved. The authors propose a metric to measure the quality of the evolution based on machine learning concepts and apply it to 139 vocabulary chains. The authors show that the quality of evolution is good (greater than 0.9) for 39.80% of the vocabularies, and not good for 25.10%.

The approach presented in the paper is very interesting and as far as I can tell, it is the first time that this kind of research has been done for general semantic web vocabularies.

The introduction motivates the problem and convinces the reader that taking into account the evolution of semantic web vocabularies is an important aspect of linked data quality.

Section 3 describes the change models and the quality of evolution metric. This section could be improved with some more explanations for non-experts in machine learning concepts. For example, the authors mention the ROC concept without any further explanation or citation. Although I understand that the authors didn't want to extend the paper length, I would suggest extending that section with some explanations or justifications of the decisions taken to propose that metric.

The first paragraph of section 4.3 could also be improved with some further explanation. The authors also talk about 10-fold CV without further explanation; here too I would suggest a brief explanation or a citation.
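
As a brief illustration of what 10-fold cross-validation (CV) means: the data is split into 10 disjoint folds, each fold is held out once for testing while a model is fitted on the other nine, and the 10 test scores are averaged. The sketch below uses a trivial majority-class predictor as a stand-in for a real classifier and invented labels; both are assumptions for illustration, not the setup used in the paper.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and deal them into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(labels, k=10):
    """Mean held-out accuracy of a majority-class predictor over k folds."""
    folds = k_fold_indices(len(labels), k)
    accuracies = []
    for held_out in folds:
        held_out_set = set(held_out)
        # "Train": find the most frequent label outside the held-out fold.
        train = [labels[i] for i in range(len(labels)) if i not in held_out_set]
        majority = max(set(train), key=train.count)
        # "Test": score the majority prediction on the held-out fold only.
        correct = sum(1 for i in held_out if labels[i] == majority)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Hypothetical data: 70 unchanged concepts (0) and 30 changed ones (1).
labels = [0] * 70 + [1] * 30
mean_accuracy = cross_validate(labels, k=10)
```

Because every item is tested exactly once and never by a model that saw it during fitting, the averaged score is a less optimistic estimate than accuracy on the training data.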

Section 4.3 also contains several references to , which are hidden for the reviewer. Is there any reason why they can't include a real URI?

The paper contains several statistical graphs which are not well explained and are difficult to understand. At the very least, the graphs should include a legend explaining the different values. For example, the X axis in Figure 3 should include an explanation of the QoE values (I guess they are 0.0, 0.1, ... 1.0).

The discussion and conclusion sections are quite interesting, I appreciate that the authors included some examples to explain the different evolution behaviors with real data.

The authors include "" as an example of web vocabulary that is evolving, however, I think they didn't include it in the web vocabularies that are evaluated. Maybe, they could include it also (I would be interested to know how it behaved).

Review #3
By Oscar Corcho submitted on 16/May/2016
Major Revision
Review Comment:

This paper describes the work made by the authors on measuring how ontologies (vocabularies, using the term used by the authors) used for the generation and publication of ontology-based data on the Web (commonly in the form of RDF) evolve.

The work done is well justified (such changes in existing ontologies normally require data publishers to take a look at their already published data so as to determine whether their data is still valid or requires changes to be made), and the topic is very relevant in the context of Semantic Web research. The insight that this work provides on how ontologies evolve and their impact on existing ontology-based datasets is really powerful.

The research methodology behind this paper is well executed. A clear research hypothesis and set of research questions are proposed, the metric for quality of the evolution is then well defined, and the hypothesis is adequately evaluated with the experimental setup that has been proposed. Therefore, there are no major issues regarding how the research has been executed.

However, there are some important concerns that I think should be addressed before the paper can be accepted:
- First of all, one of my main concerns relates to the feature set that has been selected. The selection of the definition of change based on [20] looks sensible to me, and the choice of features based on the work from Stojanovic also makes sense. However, I am missing a good discussion on why these features are considered instead of other options that may have been extracted from the state of the art. I am thinking, for instance, of one of my own PhD students' work on ontology evolution (, which actually provides a very fine-grained approach for detecting and storing changes between ontology versions, and hence may allow defining even more properties than those considered in this work. Why are only direct children, children at depth X, direct partners, and siblings considered? What if we add more properties based on changes in properties, or other types of changes that may be considered in an ontology? I understand that determining the right set of features to consider is not an easy task, but I would have liked to see a deeper discussion in this respect.
- In fact, and related to the topic above, it is not clear to me how the changes are determined between two versions of an ontology. Is there any piece of code that would be reusable for others to use afterwards? I suggest that if such code is available, this is made available as part of the supplemental material (and together with all the data used for evaluations).
- I don't like two comments made in section 3.4 about the quality metric. First, the authors comment that by design the quality metric is perfect. Is that possible? I would expect a clear description of why it is considered perfect, and what perfect means in terms of a metric. The same applies to the comment on usefulness, which is not defined anywhere, and to the "excellent results" in section 4. The authors should revise section 3.4 especially to avoid such comments, which are difficult to sustain.
- My main concern, to some extent, is the fact that the authors comment on the usage of several classification algorithms, but they do not provide a very clear listing of the actual algorithms used, and why these are selected. Some of them appear throughout the text: relief, Naïve Bayes, multilayer perceptron, etc., but there is not a clear list of which ones are actually used and why, and which kind of cleaning and preprocessing each one needed. From such a lack of description, it appears to me as if the authors were blankly applying different algorithms from WEKA and then selecting the best, but probably without doing a proper assignment of parameters to the different ones used. Additionally, it would be good to have a sort of table with some numbers describing clearly the performance of the different algorithms and configurations.
- In the characterisation of quality version chains, the authors comment that they use number of inserted new statements. Why not concepts?
- in the discussion section, I find it nice to have a couple of good examples, but I would have also liked to see examples of changes that are not easy to explain. Furthermore, and this is the most important comment apart from the one on algorithms applied, the discussions on the results of algorithms lack a bit of a "why is this actually happening?" Note that this is a very typical comment that I usually make to all papers that are related to the application of data mining algorithms, since authors normally forget about the fact that readers can actually read the numbers, and understand them, but what we would always like from the authors is to understand much better why these numbers are appearing indeed. And I hope that the authors have a clear view on that.
- Another related example is the discussion on the fact that 25.10% of these vocabularies score low in the metric and cause more arduous work for LOD publishers. How can this finding actually be used by a LOD publisher?
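
To make concrete the earlier point about how changes between two versions of an ontology are determined, the simplest possible baseline is a syntactic diff over the two sets of triples. The sketch below is only that baseline, with hypothetical `ex:` terms; it ignores blank nodes and inference, and it is not claimed to be the procedure used by the authors.

```python
RDFS_SUBCLASS = "rdfs:subClassOf"


def diff_versions(old_triples, new_triples):
    """Naive low-level delta: triples added to and removed from a version."""
    old, new = set(old_triples), set(new_triples)
    return {"added": new - old, "removed": old - new}


def changed_subjects(delta):
    """Subjects that appear in at least one added or removed triple."""
    return {s for triples in delta.values() for (s, _, _) in triples}


# Two hypothetical versions of a tiny vocabulary, as (s, p, o) tuples.
v1 = {
    ("ex:Car", RDFS_SUBCLASS, "ex:Vehicle"),
    ("ex:Bike", RDFS_SUBCLASS, "ex:Vehicle"),
    ("ex:Car", "rdfs:label", "car"),
}
v2 = {
    ("ex:Car", RDFS_SUBCLASS, "ex:Vehicle"),
    ("ex:Bike", RDFS_SUBCLASS, "ex:Vehicle"),
    ("ex:Car", "rdfs:label", "automobile"),
    ("ex:Truck", RDFS_SUBCLASS, "ex:Vehicle"),
}

delta = diff_versions(v1, v2)
changed = changed_subjects(delta)  # ex:Car (relabelled) and ex:Truck (new)
```

Publishing even a small script of this kind alongside the data would let others reproduce which concepts were labelled as changed in each version pair.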

Minor comments:
- In section 3.4, the formula on roc(C_k) is wrong. It should be "greater than or equal to", and not just "greater than".
- In section 4.2, you comment on a two-fold evaluation. I would recommend rephrasing this, since for some time I was puzzled when trying to understand whether this was a k-fold evaluation as commonly used in data mining.
- The concept of "performant models" used in section 4.3 should be clearly defined.
- Figures 4 and 5 are difficult to understand, and their meaning is a bit unclear. I suggest redoing them and focusing on how to explain them properly in the text.

So, in summary, this is a very valuable contribution to the state of the art in ontology evolution, where I would find it very useful to have a very clear set of supplementary material with all the material that has been generated, and where I would like a bit more detail on how classifiers are selected and used, and explanations of the results obtained with them.