RODI: Benchmarking Relational-to-Ontology Mapping Generation Quality

Tracking #: 1367-2579

Christoph Pinkel
Carsten Binnig
Ernesto Jimenez-Ruiz
Evgeny Kharlamov
Wolfgang May
Andriy Nikolov
Martin G. Skjaeveland
Alessandro Solimando
Mohsen Taheriyan
Christian Heupel
Ian Horrocks

Responsible editor: Guest Editors Quality Management of Semantic Web Assets

Submission type: Full Paper

Accessing and utilizing enterprise or Web data that is scattered across multiple data sources is an important task for both applications and users. Ontology-based data integration, where an ontology mediates between the raw data and its consumers, is a promising approach to facilitate such scenarios. This approach crucially relies on useful mappings to relate the ontology and the data, the latter being typically stored in relational databases. A number of systems to support the construction of such mappings have recently been developed. A generic and effective benchmark for reliable and comparable evaluation of the practical utility of such systems would make an important contribution to the development of ontology-based data integration systems and their application in practice. We have proposed such a benchmark, called RODI. In this paper, we present a new version of RODI, which significantly extends our previous benchmark, and we evaluate various systems with it. RODI includes test scenarios from the domains of scientific conferences, geographical data, and oil and gas exploration. Scenarios are constituted of databases, ontologies, and queries to test expected results. Systems that compute relational-to-ontology mappings can be evaluated using RODI by checking how well they can handle various features of relational schemas and ontologies, and how well the computed mappings work for query answering. Using RODI, we conducted a comprehensive evaluation of seven systems.

Minor Revision

Solicited Reviews:
Review #1
By Patrick Westphal submitted on 17/May/2016
Minor Revision
Review Comment:

The authors have addressed most of my comments. The readability has improved, there is now a pre-packaged RODI release, and the major issues have been resolved. Since the passages copied over from the previous paper do not seem to be a problem, I think the submission is acceptable once the following (very) minor issues are fixed:

> how good the mappings can
> translate between various particularities of relational
> schemata and ontologies

--> 'how well'

> This is of particular importance in large-scale industrial
> projects where support from (semi-)automatic systems
> is vital (e.g., [17]), In order to help ontology-based
> data integration finding its way into mainstream

--> 'is vital (e.g., [17]). In order to help ontology-based'

> - Systematic analyses of challenges and existing ap-
> proaches in relational-to-ontology mapping gener-
> ation: these support and explain the types of

- IMO the sentence after the colon should start with a capital letter. The same holds for the following bullet items in this section.

> The newly
> added scenarios focus on features that are impor-
> tant to test mapping quality under real-world chal-
> lenges

- Sounds a bit odd to me. (Would have expected a passive like 'important to be tested under real-world challenges'.)

> For the RODI benchmark design we consider dif-
> ferent forms of input by means of RODI's composi-
> tion, offering database, ontology, data and queries, parts
> of which can be used as additional input to mapping
> generators.

- Phrasing; sentence is hard to read

> The reference query is directly evaluated by
> RODI against the SQL database.

--> '\emph{RODI}'

> we have a set of tuple pairs from the two result
> set that are candidates

--> 'the two result sets'

> Table 11 shows those the scores for three conference
> domain scenarios

--> 'shows those scores'

Review #2
By Christoph Lange submitted on 25/May/2016
Minor Revision
Review Comment:

The current revision has the following obvious improvements over the initial submission:

* With D2RQ, one additional system has been evaluated; also, additional related work has been taken into account.
* The paper now defines its scope more clearly: what problems are addressed, what ones are not, and why.
* Many aspects of the research are explained in more detail and with a deeper reflection; this holds in particular for the discussion of the coverage of the challenge, and for the geodata scenario.
* The writing has been improved for comprehensibility. This includes copying material from your earlier paper where it helps understanding (e.g. Definition 1).
* The increment in contribution over your earlier paper is now much clearer.

In the context of semantic heterogeneity, I would still like to see a short statement on inferencing approaches in relational databases (views, triggers, etc.) vs. ontological reasoning.

It is a bit annoying that some points I already asked to be fixed in the first submission are still open. I am requesting these fixes once more, but they are minor.

At please find an annotated PDF with detailed comments.

Review #3
By Anastasia Dimou submitted on 06/Jun/2016
Minor Revision
Review Comment:

This is the second submission of the extended description of the RODI benchmark. RODI aims to provide a benchmark for relational-to-ontology mapping generation scenarios. In more detail, it aims to assess the capacity of different systems to generate mappings that make it possible to produce an RDF dataset from data originally stored in a relational database in order to answer a given query.

Based on this summary of the paper, and thus of the benchmark's function, I hesitate to accept that mapping quality is actually assessed w.r.t. the query workload posed, as only in this version of the benchmark description was it clarified which notion of mapping quality the paper considers. As far as I can tell from the paper, it is the (extent of the) systems' capacity to generate mappings that is under evaluation, not the mappings per se. As mentioned in the paper, different mappings may generate the same RDF results. Two such mappings may differ with respect to some mapping quality dimension, but this does not affect the assessed results as long as the systems generate results (RDF triples) that conform to the benchmark’s gold standard. So, I am still not convinced that the benchmark enables assessing or increasing mapping quality, especially considering that it even allows systems to (blindly) generate their RDF results and feed these results into the benchmark for evaluation.

However, what still concerns me most is how balanced the benchmark is. As mentioned in my previous comments, from the text description I cannot find out how many scenarios, test cases and queries exist per mapping challenge, per category and per domain. Reading the evaluation results, I do have the impression that the authors know which sets of test cases examine which aspects, but this is not communicated in the text. There is still (i) no consistency in the descriptions of the input data (databases), schemas, output data (RDF results) and queries, (ii) no consistent description of the (number of) queries/test cases per domain and/or mapping challenge, (iii) no grouping or incremental extension of test case coverage as domains of increasing size are added (or at least it is not clearly described) and, most importantly, (iv) no clear definition of the measures taken into consideration.

To be more precise, there are three domains: conferences, geodata and oil & gas. For the first, no information about the input database is given. How big is it? Is this of interest to know? If yes, why is it not mentioned as for the other domains, and what purpose does the large or small size serve? If no, why is it mentioned for the other domains? There are three ontologies used, but based on Table 2, for instance, there is no evidence why the SIGKDD ontology needs to be considered, given that it covers the same scenarios as CMT and CONFERENCE. In the end it is summarized that 23 classes and 66 properties were taken into consideration. The total number of classes and properties used is not given for all domains. Moreover, why is this of interest? How is it related to the input dataset, the queries and the output?
Then it is also mentioned that “we only generate facts for the subset of classes and properties that have an equivalent in the relational schema in question”. But how often is there a correspondence? How many queries and test cases are generated? Why are multiple test cases necessary for each challenge or category tag? How many test cases are generated per tag/challenge? In this respect, I do not find any evidence in the text of how balanced the generated test cases are. Namely, it may be that more test cases are generated for a certain aspect, which makes a tool appear weak on a certain measure even though that is not actually the case.
This brings me to my final and most important remark: there is no clear definition of the benchmark measures. Turning to the evaluation, the first dimension examined in Table 5 is adjusted naming. Fine, but which category tags are examined to draw conclusions about it? Which mapping challenge is covered by or associated with it? Which is the relevant mapping challenge? Then restructured hierarchies are said to be mainly related to the “n:1 mapping challenge”. First of all, n:1 matching appears as a category in Table 3 and not as a mapping challenge in Table 1. I appreciate Table 3 and its alignment with Table 2, but why is it limited to the conference domain? From Table 3, I learn that restructured hierarchies are related to sub-challenges 6, 9, 10, and 12, which in turn are related to denormalization, class hierarchies and key conflicts. So what is meant to be the measure here? What are meant to be the measures overall? When is one in a position to conclude that a system addresses the normalization challenge? Or any of its subtypes? Or are the categories the benchmark measures used to compare the systems?
In the end, the evaluation is carried out using scores for measures that appear neither in the category lists nor in the mapping challenges table. Why should we care about the results per domain? For instance, why is it of interest to have the scores for cross-matching scenarios per domain? I assume that one mainly cares about which categories and which challenges a tool addresses. So, I fail to find in the text which test cases are added, or which different test cases are covered in each domain/ontology scenario, such that the domain-level evaluation becomes of interest and I can conclude what, e.g., B.OX covers and what IncMap covers. The text gives the impression that one set of challenges and categories is introduced (which I take to be the comparison measures), while different scores, not aligned with those originally introduced challenges and categories, are ultimately used to evaluate the tools.

In a nutshell, I miss a mapping of test cases to the categories and/or challenges they address. In other benchmarks, it is normally clearly defined, e.g., that the challenge is speed and thus time is the measure. We agree that benchmarking mappings is not as straightforward as benchmarking performance, but I would invite the authors to read other benchmark descriptions, clearly define what they consider to be their measures, and frame the evaluation against those measures. I am not a benchmark expert myself, but in order to complete this review I looked into other benchmarks, and in most of them the dataset under consideration and the exact measures being evaluated are explicitly stated, and the tools under evaluation are assessed with respect to these measures. See, for example, [1], or [2], which might be even more comparable (though it also appears to be of smaller scale).

There is no doubt about the contribution of this work, especially considering the extended evaluation and the clarification of the current contributions compared to the previous paper. However, there are still some vague points which are of crucial importance.

Heyvaert et al. [21] covers the two mapping generation perspectives introduced by some of us [9], so I would suggest that this is clearly mentioned in the text.
In the same context, “mapping by example” is what [9] considers “result-driven”; please clarify if this is not the case.

“Geodata domain has been designed as a medium-sized case” → What does medium-sized mean?
In the same context: “For the Mondial scenarios, we use a query workload that mainly approximates real-world explorative queries on the data, although limited to queries of low or medium complexity” and “Those queries are highly complex compared to the ones in other scenarios and require a significant number of schema elements to be correctly mapped at the same time to bear any results” → What are queries of low, medium or high complexity?

“To keep the number of tested scenarios at bay, we do not consider those additional synthetic variants as part of the default benchmark. Instead, we recommend these as optional tests to dig deeper into specific patterns” → Along the same lines as my main concern, no concrete total number of tested scenarios (test cases?) is given. And what are the optional tests?
Following on from that, I would suggest that the authors pay attention to keeping the terminology consistent throughout the text.

"different modeling variants of the class hierarchy" → Which are?

There is no description of the set up used to run the benchmark (not that it matters a lot).

“Neither of these papers, however, address the issue of systematically measuring mapping quality.” → Which papers exactly do you mean? I think that [51] and [54] at least propose quality measures, whereas [7] does systematically measure a quality dimension, albeit a different one.

[1] Voigt et al. Yet Another Triple Store Benchmark? Practical Experiences with Real-World Data
[2] Rivero et al. Benchmarking the Performance of Linked Data Translation Systems