Assessing completeness when complementing SKOS thesauri: two quality measures on skos:exactMatch linksets

Tracking #: 1250-2462

Riccardo Albertoni
Monica De Martino
Paola Podestà

Responsible editor: 
Guest Editors, Quality Management of Semantic Web Assets

Submission type: 
Full Paper
Quality is one of the major challenges when consuming Linked Data. Measures for the quality of Linked Data datasets have been proposed, mainly by adapting concepts defined in the research field of information systems. However, very limited attention has been dedicated to the quality of linksets, which might be as important as dataset quality when consuming data coming from distinct sources. In this paper, we address linkset quality by proposing two measures, reachability and importing, to assess the completeness of linkset-complemented SKOS thesauri. In particular, reachability and importing estimate the ability of a linkset to enrich a thesaurus with new concepts and their properties, respectively. We validate the proposed measures with an in-house developed synthetic benchmark, and we show an example of their exploitation on real linksets in the context of the EU project eENVplus.

Solicited Reviews:
Review #1
By Christian Mader submitted on 02/Jan/2016
Minor Revision
Review Comment:

The paper introduces two novel measures for SKOS linkset quality, reachability and importing, which are evaluated using a synthetic benchmark within a validation framework. The authors conclude that both measures are suitable to predict the quality of a linkset in terms of information gain when enriching a source thesaurus.

Overall, the paper is well-written (with minor syntactical and grammatical glitches as discussed below), well structured with a clear statement of the contributions and discussion of validation results. In the following I describe some weak points of the paper.

--- Section 1, Introduction
* “Linked Data will evolve the current web data into a Global Data Space": Since this is in quotes, consider adding a footnote giving its origin.

--- Section 2, Basic Concepts
The basic concepts of the paper are well and clearly defined.

* It would be helpful for the reader to provide a footnote on the semantics of skos:exactMatch when this property is first mentioned. It is provided later in the paper, so it may be moved from there.
* W is sometimes referred to as array, sometimes as vector (Definition 11); be consistent here
* Definition 3: "F(A)_{i,j} is equal to n if there are n distinct object properties connecting the resources represented by vertices i and j". But this is only true if the weight in W of each of these properties is 1, right? If so, please state this in the definition.
* Definition 5: s' is never used or defined in T_s^L_k

--- Section 3: Linkset Quality

* "Thesauri are special kind of datasets, and, in the following we will refer only to thesauri": Please clarify in what way thesauri are "special kind of datasets" and in what way this is important for the paper. As there is no strict consensus on the kind of relations that a dataset utilizes in order to be considered a thesaurus, the focus of the paper on SKOS thesauri seems rather vague. I understand that in your "definition", a SKOS thesaurus must contain (i) mappings (skos:exactMatch), (ii) labels (skos:prefLabel, skos:altLabel) and (iii) hierarchical relations (skos:broader, skos:narrower). Consider making this clearer in the text in order to be clear about the nomenclature, maybe in the paper preface.
* "Otherwise, our measures might take into account duplicated information and the final evaluation might differ too much from the real one, leading to misleading conclusions": This sentence is unclear to me. What is the "real" evaluation, and what could these "misleading conclusions" then be?
* Definition 7: The "set of RDF Terms" is denoted Z in the definition and E in the paragraph above. Why is this different?
* Example 2: Make clearer that [{Dog@en}]_L and [{x_2, x_5}]_L actually are two distinct examples.
* Example 3: From Fig.1 it is not immediately clear that also skos:narrower relations are "materialized" in the datasets. It is stated later in the paper though (Example 5), consider moving it forward.
* Definition 10: I understand the asterisks are used as "wildcards" (as in Definitions 12 and 13). This may not be clear to everyone. "The reachability assumes that the linkset is correct and complete": this sentence appears in similar form in the second paragraph of 3.2 and the first paragraph of 3.

--- Section 4: Validation Framework

* "Creation of paths with length...": "we consider two percentages: 10% and 40%": As a reader I ask myself why these particular values were chosen; likewise for "Deletion of concepts[...]" and "Deletion of links[...]".

--- Others

The paper "Semantics for mapping relations in SKOS" by Mika Cohen also deals with SKOS mapping using exactMatch. Consider discussing it in the related work section.

There are some language related glitches throughout the paper that should be corrected, e.g.,

"scoring functions on different set of parameters"
"Let be ln be an ISO language tag"
"the linkset does not provides any importings"
"Then, this modifiers"
"showing that an high number of links"
"become exactly equal the gold standard"
"there is no a gold standard"
"They reviews quality dimensions"
"complementated thesaurus"
"there not exist benchmarks"

Review #2
By Aidan Hogan submitted on 05/Jan/2016
Review Comment:

Given a pair of SKOS taxonomies, this paper introduces two metrics for assessing the additional information that can be imported through a linkset from one taxonomy into the other. In particular, the authors assume that a complete mapping is provided, and look at how well one taxonomy can complement the other when importing information through skos:exactMatch links. The two metrics in question are "reachability" and "importing". The former metric takes as input a set of "relevant" properties, a source taxonomy, a target taxonomy, and a skos:exactMatch linkset, and measures the ratio of objects in the target taxonomy that are reachable by (a) following the skos:exactMatch links and thereafter (b) following a given number of hops through the properties selected as relevant. The latter metric takes as input a property (e.g., skos:altLabel), a language tag (or wildcard), a linkset, and a source and target taxonomy, and defines the average ratio of increase in the number of values with that property and language tag for each node in the source taxonomy versus when complemented by the target taxonomy; for example, a score of 0.8 for the property skos:altLabel and language tag "en" would, in my understanding, mean that the number of skos:altLabel values found for nodes in the source taxonomy (including only those for which a skos:exactMatch link is present) increases on average by a factor of 5 after importing the values from the target nodes of the links. The authors motivate and define these metrics and then present a "validation framework" that aims to evaluate the ability of these measures to indicate the "completeness" of the complemented taxonomy.
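To make my reading of the two metrics concrete, here is a minimal sketch of them as I understand them from the definitions; the function names, data shapes, and the 1 - |a|/|a ∪ b| formula follow my reading of Definitions 9 and 10, not the paper's actual implementation.

```python
# Sketch of the two metrics as summarised above. All names are hypothetical;
# the paper defines them over RDF graphs and weighted adjacency matrices.
from collections import deque

def reachability(target_edges, links, hops):
    """Ratio of target-taxonomy nodes reachable by crossing a
    skos:exactMatch link and then at most `hops` steps over the
    selected 'relevant' properties. target_edges: node -> neighbours."""
    seen = set(links.values())                # nodes reached via the linkset
    frontier = deque((n, 0) for n in seen)
    while frontier:                           # bounded-depth BFS
        node, depth = frontier.popleft()
        if depth == hops:
            continue
        for nxt in target_edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    all_nodes = set(target_edges) | {m for ns in target_edges.values() for m in ns}
    return len(seen) / len(all_nodes) if all_nodes else 0.0

def importing(source_values, target_values, links):
    """Average, over linked source nodes, of 1 - |a| / |a U b|, where a and b
    are the value sets (e.g. skos:altLabel@en) of a source node and its
    exactMatch-linked target node."""
    scores = []
    for s, t in links.items():
        a, b = source_values.get(s, set()), target_values.get(t, set())
        if a | b:
            scores.append(1 - len(a) / len(a | b))
    return sum(scores) / len(scores) if scores else 0.0
```

For instance, a source node with the single label {"dog"} whose linked target node contributes {"hound", "canine"} scores 1 - 1/3 ≈ 0.67 under this reading of the importing metric.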
This validation framework takes an existing taxonomy (in this case GEMET), makes a copy of it (changing the namespace), generates a skos:exactMatch linkset between the corresponding nodes, and then applies a variety of modifiers that (i) delete nodes and links in the source thesaurus to create paths, (ii) delete concepts in either thesaurus, (iii) delete links from the linkset. These modifiers can be combined and their parameters varied to create an array of test-cases, where each contains a source taxonomy, a target taxonomy, and an associated linkset. The original complete taxonomy then serves as a "gold standard" that can be used to gauge various measures of incompleteness for the test-case in question. These measures of incompleteness are compared with the metrics that the authors propose. The authors then present various results relating to how the operations for generating the test-cases and the metrics they propose correlate. The authors then present some results applying their measures to some real-world datasets before presenting related works and concluding.
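For clarity, the benchmark procedure as I have summarised it amounts to something like the following sketch: clone a taxonomy, build an identity linkset, then apply "modifiers" that randomly delete concepts or links. The parameter names and defaults here are illustrative, not taken from the paper.

```python
# Hypothetical sketch of test-case generation in the validation framework.
import random

def make_test_case(taxonomy, delete_concepts=0.1, delete_links=0.1, seed=0):
    """taxonomy: dict node -> set of related nodes (the 'gold standard').
    Returns a degraded target copy and a degraded exactMatch linkset."""
    rnd = random.Random(seed)
    target = {n: set(ns) for n, ns in taxonomy.items()}  # namespace-changed copy
    links = {n: n for n in taxonomy}                     # identity linkset
    # modifier (ii): delete a fraction of concepts from the target thesaurus
    for n in list(target):
        if rnd.random() < delete_concepts:
            del target[n]
            for ns in target.values():
                ns.discard(n)
    # modifier (iii): delete a fraction of links from the linkset
    links = {s: t for s, t in links.items()
             if t in target and rnd.random() >= delete_links}
    return target, links
```

Comparing the proposed metrics on such degraded test-cases against the intact gold standard is, as far as I can tell, the whole experiment of Sections 4 and 5.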

The paper deals (perhaps somewhat indirectly) with the important issue of Linked Data quality, and thus I believe it to be of relevance to the Special Issue. The authors motivate their work by stating that a lot of effort has been invested to interlink thesauri such as GEMET, EARTh, AGROVOC, EUROVOC, UNESCO, RAMEAU, TheSoz, but that it is unclear what precisely is the value of these linksets; it is not clear, for example, how these thesauri complement each other. I think this is a solid motivation for looking into metrics to assess the value of a linkset. Indeed, I quite liked the results presented in Figure 9 (and to a lesser extent Figure 10), which provides an interesting overview of the value of the linkset for how the thesauri complement each other with terms in different languages. There is certainly some practical merit to this line of work.

Overall, however, I am afraid I must recommend a reject, for the following main reasons.

First of all, I found key parts of the paper very difficult to read. Just to mention beforehand, though the English is not perfect, it's quite okay and not the main problem (though it contributes in parts to the difficulty). In the following, I want to give an idea of the experience for me of reading the paper, which I hope would give the authors a better impression of the problem from my perspective:

* To start with, in the introduction, it was entirely unclear for me by the end of the section what the paper was about. I mean I can see it's about two metrics, but I have little or no idea what the intuition behind these metrics is or what *problem* they aim to address. One thing I did get from the introduction was the motivation (to evaluate the quality of a linkset) but the list of contributions that follows is poorly written: e.g., "a metric ... which checks the linkset complementation potential for any SKOS property" -- I could not understand this at all the first time I read the paper. So by the end of the introduction, I have only a vague idea of what the paper is about.

* Having, in my opinion, failed to give a concrete idea of what metrics are introduced, the paper begins with dense preliminaries in Section 2. First of all, trying to read through this section, I do not understand why these concepts are necessary and have no intuition as to what they will be used for. Having read the paper, I still feel they are unnecessarily complicated and messy. For example, Definition 1, "multi-relational network from an RDF triple-set", is (for me) already an extremely awkward representation of an RDF graph, and then we get into various matrix operations for reasons that were lost on me at the time but that are then ultimately used to define reachability in an RDF graph in a specific number of hops traversing only certain properties (something that could be defined and explained a lot more simply, directly, intuitively ...). Struggling to keep all the messy notation and their intuitive meaning together in my head, Section 3 again builds upon that notation with more messy notation. The one really valuable part of Section 3 is that we finally get to see an example using the metric, and the examples in general are appropriate and easy to follow even if the metrics they describe are not.

* The validation framework starts with two research questions, but I do not really understand intuitively what these research questions mean or why they are important. The authors then go into detail on the potential problems of using a synthetic benchmark, but it comes across as defensive and in any case not comprehensive. I think I understand why the authors present this discussion: (i) to highlight design choices in their benchmark, and (ii) to rebut possible criticism of the benchmark. But at this stage I still have no idea what the benchmark is supposed to do. To be clear, there is no problem with a benchmark being synthetic per se. The problem is interpreting results of synthetic benchmarks too broadly or not testing the claims of the paper (like any benchmark). So this whole discussion is just strange to me. Eventually it becomes clear to me why that discussion is there: the benchmark design is not particularly clean, with various modifiers and parameters selected without any real justification that I could follow. Ultimately, the metrics for completeness and the corresponding results presented ... I really could not follow these at all, how they answer the research questions, etc. I'm really conceptually lost at this stage.

Second of all, relating to the previous point, while I now understand more or less the actual metrics proposed, I have little understanding of what Sections 4 and 5 intend to show, or more importantly, what the idea behind these sections is. The authors have two metrics and they show how they vary when the linkset/taxonomies vary in completeness. I really don't understand why this is interesting or useful and I could not follow the results presented in Figures 4 and 5. With apologies to the authors, I sort of gave up trying to understand the results of Section 5 in detail: I had no idea what the results were trying to show or what I should be looking for, and interpreting the graphs is difficult when there's all these different tests with all of these different modifiers and parameters and so forth. I feel many readers would do likewise.

Third, I feel that the assumption that the linkset is correct and complete is an impractical solution in many scenarios, which would limit the applicability of these metrics.

Perhaps to summarise, I believe the paper does have good motivation and the metrics/tools developed do have practical merit (as suggested in Section 6). However, I feel that the preliminaries are unnecessarily dense and I fail to understand the value of the experimental framework and results in Sections 4 & 5. Removing and simplifying the overly long or otherwise unclear parts of the paper, the remaining contributions feel quite minor to me (not at the level of a journal paper): two metrics to measure the amount of data imported by a SKOS linkset. For this reason, I am selecting a reject rather than a major revision.

In terms of comments to improve the paper, I think the authors could:

1) add a motivating example to the introduction already to give the idea of the metrics (e.g., using Figure 1),
2) clean up and simplify the preliminaries section and the definitions of the metrics,
3) I really don't know what to suggest for Section 4 and 5 because I did not get the idea at all; maybe just remove it all? Otherwise these sections need a lot of work.

I think part of the problem may stem from the fact that the authors are trying to build a research paper from something that, at its very core, does not have much technical depth. The authors could maybe instead consider developing their tool further and presenting it as a tool paper, or perhaps doing some empirical analysis of real-world linksets and presenting those results.

Some minor comments:

* "importing": This word does not feel right. "Importation" feels like a better noun to apply.
* "an RDF ..." (multiple ... the letter R has a vowel *sound*, like the word or, hence should use "an")
* "set of RDF triple[s]" (multiple)
* "the importing", "the reachabilty", "the importing and reachabilty" ... you should not have "the" here unless you also say something like "the importing and reachability *metrics*"
* cross-walking -> traversing?
* verteces -> vertices (or vertexes perhaps, keep consistent)
* "complement" Sometimes this can be used in a confusing manner since it can also mean set complement.
* spell-check

* "In particular, {the} reachability and importing estimate"

* EU Governs -> EU Governments
* "most interesting promise{s} that Linked Data makes is [that] "Linked Data ...". Provide a reference for the quote.
* "in {the} Linked Data"
* List of contributions, particularly second item, not clear
* "Section 3 formalizes {the} importing and reachability"

* "A[n] RDF triple"
* Do you ever use the distinction between RDFProp and OBJProp in the paper? Is it necessary?
* "Such type of linksets binds" -> "Such types of linkset bind"
* "[and] power matrix"
* Definition 1: very awkward. Also an RDF triple set is most commonly called an RDF graph.
* Definition 2: E_q is a set of pairs of vertices, so not sure how it can contain z.
* "For each object propert*y* z"
* Definition 3: "weigh[t]ed adjacency matrix"
* Definition 5: the superscript k in S^k makes it seem like a power.
* "length minor or equal" -> "length less than or equal"
* "to T_o we define:" -> "to T_o. We define:" ... the first part that follows assumes k >= 1?

* "as good as" -> "as good if"
* "are special kind[s] of datasets"
* "which give indication[s] about"
* "user-specified metric[s]"
* "We *also* assume completeness"
* Definition 7, z is not quantified (\exists z?).
* "percentages normalized between 0 and 1." Percentages are values like 97%. Maybe ratio?
* Figure 1: where is y2, y4, x4, etc.? It's a little confusing when going through the examples.
* Definition 9: I realise the value would be different, but rather than do 1 - (|a|/|a U b|), I was wondering why not simply do (|b|/|a U b|). Seems more intuitive to me.
* Definition 9: took me a while to realise that "den" refers to the denominator, not the whole equation
* "but are not direct object[s] of the links"
* "and the set of vertexes" The formula just before has a dangling ')'
* Example 5 ... there are no skos:narrower links. Need to discuss earlier that these are implied.

* "The validation [framework] aims at ..."
* "Definitions 10 and 11[,] to evaluate"
* "when {this}[it] is complemented"
* "We want to demonstrate{, the} importing as a good predictor for" ... better to take a neutral stance and say you wish to investigate if it is a good predictor (in any case, it, by definition, measure multilingual gain, so again I'm not sure what's the idea here).
* "we consider {the these} two set[s] of .."
* "in term[s] of completeness"
* "of {the} our measures"
* "created [by] altering"
* "a varied kind of" -> "a variety of"
* "Thus, our ground truth{,}"
* "affecting synthetic benchmark[s]"
* "since they are not enough difficult" -> "since they are not difficult enough"
* "correctness and complet[e]ness for [the] linkset"
* "Th*ese* assumptions seem reasonable{,} since{,}"
* "enabling {in} a ..."
* "does not provide{s}"
* "th*ese* modifiers"
* "[by] developing"
* "with the aim of fully cover{ing}"
* "alterators"? "alterer" or "alternator" or simply "modifier" perhaps.
* "The Test Sets Generator module performs two ... First ... Second ..." The latter two sentences are not full sentences. Make it a list if it's a list.
* "on [the] subject thesaurus (test set 1), on [the] object thesaurus (test set 2)"
* "All [of] the importing modifiers"
* "The we {really} construct"
* "10% and 40%" Why these values? This question extends to other values in the section.
* "related each others" -> "related to each other"

[at this point, apologies but I stopped noting minor corrections]

Review #3
By Christophe Guéret submitted on 08/Jan/2016
Minor Revision
Review Comment:

This manuscript deals with the issue of assessing the usefulness and information gain of linksets. The main contribution of the work is the formal introduction of two metrics and their evaluation on a synthetic test dataset.

# Originality and significance
Much research has been done so far on establishing good linksets but less attention has been paid to what is really gained from this work of interlinking entities. This is however an important issue and being able to put a value on this extra work would probably encourage data publishers to link their data. So this work is original and relevant. It is also significant as two metrics are proposed to assess the information gain in terms of importation (new properties) and reachability (new resources accessible via link paths).

This significance is however slightly hampered by the limitations of the metrics. Namely:
* Both metrics assume that all the links in the link set are correct and that this link set is complete
* Both metrics only work for SKOS thesauri using the exactMatch predicate to link concepts
* "Importing" only works for pref and alt labels
* "Reachability" only works if the link set is complete
* "Reachability" is constrained to broader, narrower and related relations
Although these points are clearly stated in the paper at several places it would be good to recall them all together in the introduction so that they are even clearer to the reader. It would also be interesting to discuss what happens when those constraints are not met (see how robust the metrics are) and how they could be removed (in order to have a more generalisable approach). In particular, could the metrics be adapted to also work for non-SKOS datasets using sameAs links? Say, e.g., to assess the value of interlinking a dataset to DBpedia?

Furthermore, the introduction stresses that Linked Data gains value from linksets but forgets about resource re-use. Instead of minting their own URIs for every concept and then trying to link those, data publishers can also import identifiers from external sources into their own knowledge base. This should be recalled along with a description of how the proposed approach can (or cannot) be used in this scenario.

# Writing
The paper is well organised and easy to read. I would only suggest the authors consider a histogram instead of the line chart for Figures 4 to 8, as there is no meaning in connecting the dots. The captions of those figures should also be enriched with a description of the axes, otherwise the reader has to hunt for this information in the text.

# Details
* Using a DBpedia resource as an example in Section 2, whereas the paper deals only with SKOS datasets, is confusing.
* In the related work, the impression given of LINK-QA is slightly inaccurate. LINK-QA aims at finding links that generate a statistically significant difference in how they affect a particular resource. For instance, it raises an alarm if one particular node gets connected to 10 concepts whilst all the other nodes get connected to 1 other node on average. In this framework, the "descriptive richness" has the same intent as the "importing" metric, though the goal of LINK-QA is not to report this number. In fact, an interesting line of further work would be to drop the original metric of LINK-QA and use the "importing" metric instead to see how the results differ.

Review #4
Anonymous submitted on 21/Jan/2016
Review Comment:

This manuscript describes some metrics to describe the usefulness (complementarity, reachability etc) of linksets for SKOS thesauri. The authors introduce two metrics, "reachability" and "importing", which are aimed at assessing how many additional (non-redundant) nodes can be reached through a linkset (reachability) and how well new values (eg labels) can be imported through a linkset to a target thesaurus. The authors provide a fairly comprehensive set of formalisations and notations to define their metrics and apply them to an artificially created dataset.

In general, the use case seems to be somewhat narrow and limited, comparing two very specific quantitative aspects about the level of enrichment/complementarity a SKOS vocabulary adds to another through a given linkset (even limited to skos:exactMatch links), where the introduced metrics do not in any way investigate actual qualitative aspects of the added data (eg correctness of the links). For a mere quantitative analysis of the complementarity of two SKOS vocabularies, or more precisely, the amount of complementary information added through a linkset, established graph/network measures would already provide a fairly decent picture (at the very least for the "reachability" metric). In addition, the introduced metrics do not differentiate between linkset and target SKOS thesauri in the sense that they do not shed light on the question if "low reachability" for instance is caused by a poor linkset or a poor SKOS thesaurus. For these reasons, both the practical value and the added contribution of this work seem fairly limited, despite the effort and thought put into the manuscript.

To this end, also the assumptions of the work (Section 3) seem very limiting (eg that each coreference is linked through an exactMatch link). The experiments in Section 5 only underline these issues as they are conducted on some artificially created test sets rather than data from the wild.

Data: are the linksets from Sections 5 and 6 available? The paper didn't provide a clear reference. Also, some descriptive statistics about the used linksets/thesauri would be much needed.

The authors state in Section 4 that their framework would be available online; the given link returned 404s on some requests and a Google Code page (but no actual content about the framework) on other occasions.

Also, the paper appears to contain a large number of typos. A brief set of examples below.

- "verteces" (used throughout) should be either "vertices" or "vertexes"
- "THE European Environment..."
- "for each object properties" (property)
- "is RDF triple"
- "consider the these"
- "demonstrate, the importing" (no comma)
- "LACT" => "LATC"