RODI: Benchmarking Relational-to-Ontology Mapping Generation Quality

Tracking #: 1257-2469

Authors: 
Christoph Pinkel
Carsten Binnig
Ernesto Jimenez Ruiz
Evgeny Kharlamov
Wolfgang May
Andriy Nikolov
Martin G. Skjaeveland
Alessandro Solimando
Mohsen Taheriyan
Christian Heupel
Ian Horrocks

Responsible editor: 
Guest Editors Quality Management of Semantic Web Assets

Submission type: 
Full Paper
Abstract: 
Accessing and utilizing enterprise or Web data that is scattered across multiple data sources is an important task for both applications and users. Ontology-based data integration, where an ontology mediates between the raw data and its consumers, is a promising approach to facilitate such scenarios. This approach crucially relies on high-quality mappings to relate the ontology and the data, the latter being typically stored in relational databases. A number of systems to support mapping construction have recently been developed. A generic and effective benchmark for reliable and comparable evaluation of mapping quality would make an important contribution to the development of ontology-based integration systems and their application in practice. We propose such a benchmark, called RODI, and evaluate various systems with it. It offers test scenarios from the conference, geographical, and oil and gas domains. Scenarios consist of databases, ontologies, mappings, and queries to test expected results. Systems that compute relational-to-ontology mappings can be evaluated using RODI by checking how well they can handle various features of relational schemas and ontologies, and how well the computed mappings work for query answering. Using RODI, we conducted a comprehensive evaluation of six systems.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Christoph Lange submitted on 20/Jan/2016
Suggestion:
Minor Revision
Review Comment:

This paper presents a comprehensive framework for benchmarking tools that map from relational database schemas to OWL ontologies. The paper extends earlier works by the same author to a sufficient extent to justify a new publication. The paper …

* motivates relational-to-ontology mapping and the need for benchmarking mapping tools reasonably well,
* provides a well-structured, detailed discussion of the challenges that such mappings face,
* reviews existing mapping approaches on an abstract level while pointing out concrete traits of concrete tools where appropriate,
* presents the RODI benchmark suite: its data (multiple complementary relational database schemas and target ontologies from various application domains) and its implementation,
* presents the results of applying the benchmark to state-of-the-art mapping tools and discusses their performance in depth,
* points out tool-specific improvements but also further directions for relational-to-ontology mapping approaches in general.

It thus gives a complete account of a research effort in its field, and it is well structured and generally maintains a high quality. I recommend acceptance with minor revisions.

The most frequent shortcomings include figure readability issues and a lack of elaboration of certain specific points, which makes them hard to understand; these elaborations, however, should be straightforward to add.

Also, the scope of this paper could be defined more clearly. In the paper, the difference that you draw between databases and ontologies sometimes appears stricter than it actually is in practice. E.g., in Section 2.1 "Naming Conflicts" I believe that some of the challenges you claim to be unique to the relational-to-ontology context (singular vs. plural names, different tokenization) do actually also exist in other contexts such as ontology alignment. Also, Section 2.3 "Semantic Heterogeneity" says that databases don't do logical inference, but often they do, e.g., by way of views or triggers. If you don't consider such databases here, it would help to say so explicitly. Similarly, like database schemas, ontologies may also be optimised for performance and not strictly give preference to faithfully modelling the domain (as you state in Section 2.2.1 "Type Conflicts"). With regard to mapping use cases, I think there are, in addition to the scenarios introduced in Section 3.1 "Differences in Availability and Relevance of Input", also the scenario where you first have an ontology (= conceptual domain model) and then a database schema evolves (being an efficient implementation of a data store), and the scenario of ontology/schema co-evolution.

Even though some of this work has been presented earlier, I would recommend making this current paper more self-contained. E.g., it would make sense to re-state the formal definition of structural result equivalence in Section 4.7 "Evaluation Criteria – Scoring Function". There is no doubt that your current submission adds a lot of content over your older papers.

Some detailed comments (more in the PDF linked below):

* 2.1 "Naming Conflicts":
* I'm not sure it's sufficient to identify matches by _name_ of, say, database column vs. ontology property. I could imagine many practical settings where these names are completely different.

* 2.2.3 "Dependency Conflicts":
* Figure 2: Rather than numbering the different options I'd distinguish them by _name_.

* Table 1: It is not clear to me how something as unspecific as an instance of owl:Class can provide _guidance_ to a mapping creator.

* 4.1 "Overview":
* Table 2 largely reproduces Table 1 and thus wastes space.

At http://www.iai.uni-bonn.de/~langec/exchange/swj1257.pdf please find an annotated PDF with detailed comments.

Review #2
By Patrick Westphal submitted on 22/Jan/2016
Suggestion:
Major Revision
Review Comment:

In the paper 'RODI: Benchmarking Relational-to-Ontology Mapping Generation Quality' the authors present RODI, a framework for benchmarking relational data-to-ontology mappers. This framework is equipped with several evaluation scenarios covering three application domains. These are used to evaluate the generated relational-to-ontology mappings of six mapping systems.

After motivating and introducing their approach, the authors discuss the main challenges of mapping relational data to ontologies. Afterwards, the aspects in which mapping systems differ are shown, along with the kinds of challenges this poses to a benchmarking tool. In the following section, the RODI suite is introduced along with its evaluation scenarios covering different domains and the actual approach to score generated mappings. After describing some details of the framework, the evaluation is presented and results are discussed. The authors close with related work and conclusions.

I think the approach is relevant since it provides means and a measure to detect deficiencies in relational-to-ontology mapping systems. The authors were also able to demonstrate this in an evaluation of real tools. Moreover, the findings could be used to improve the tools under assessment.
On the other hand, even though the source code of the assessment framework is provided, it is not trivial to re-run the evaluation since there is no build mechanism for the framework and required libraries are not provided (and, in the worst case, are not downloadable at all but would have to be built from other projects in turn).

The ideas presented seem to be valid as far as I can tell. (I'm not familiar with the tools and don't have much practical experience in ontology-based data integration.) However, some things I would have been interested in were not discussed. Since the authors use the mean over many different queries to represent an overall quality score (e.g. Tbl. 5), it would have been interesting to get an idea of the distribution of the queries' main assessment targets. E.g. when 99 of 100 queries target a certain normalization issue, a tool failing on that would perform really badly. So, in other words: are certain integration challenges overrepresented in the scenarios?
I think the same holds for the distribution of the actual data. If a table used to test denormalization issues is far smaller than other tables, one wrong entry in this table has a far bigger impact on the score than a wrong entry elsewhere: missing one entry in three drops the accuracy by a third, while missing one in 1,000 barely changes it.
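The arithmetic behind this concern can be sketched quickly (the numbers are illustrative and my own, not taken from the benchmark):

```python
# Illustrative sketch: the cost of a single wrong entry depends heavily on
# table size, so per-table accuracy reacts very unevenly to isolated errors.
def accuracy(correct, total):
    return correct / total

small_table = accuracy(2, 3)       # one of 3 entries wrong
large_table = accuracy(999, 1000)  # one of 1000 entries wrong
print(round(small_table, 3), round(large_table, 3))  # 0.667 vs. 0.999
```

Averaging such per-query scores without weighting then lets the small table dominate the perceived quality difference.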

In my opinion the related work covered is sufficient. The only paper that came to my mind which was not mentioned but suits the topic might be "Measuring the Quality of Relational-to-RDF Mappings" from Tarasowa, Lange & Auer.
Besides this, the phrasing "Their proposals do not include a comparable scoring measure", referring to bib entry [54], was not really clear to me. The aim of [54] was to provide a formal framework and define metrics formally. Of course you cannot compare the score of a completeness metric with a metric covering e.g. syntactic validity, but each metric should be comparable amongst different mappings.

I think the main issue of this paper is its originality. Text-wise (counted in quarter columns), the sentences either directly copied from the previous paper or re-phrased while keeping the same content account for a bit less than half of the paper. But I think content-wise the authors are pretty close to the required minimum of 30% improvement. (Even though this is hard to measure, of course.) For me the improvement comprises Sec. 3 'Analysis of Mapping Approaches' (which I think could be compressed a lot), more detailed descriptions of the scenarios, framework implementation details, and re-running the evaluation on additional tools with the newly introduced semi-automatic scenarios. IMHO only Sec. 3 and the ability to run semi-automatic scenarios account for improvements with respect to the call and submission type.

The paper is mostly well written, but quite a few newly introduced paragraphs hamper the reading flow. One particular example is the introduction of the concepts of 1:n/n:1 matches, which needs further explanation IMHO. Moreover, the attempt to rephrase every sentence in the first sections makes them a bit hard to read. Another severe issue is that some literature references seem to be broken.

Due to my concerns stated above, I don't think this paper is publishable as is. Besides the question of whether it meets the requirement of 30% improvement at all, I would like some of the newly introduced parts to be improved to fit better into the whole paper.

Minor comments and concrete issues in chronological order:

1. Introduction

1.1. Motivation

> Mapping development has, however, received much
> less attention. Moreover, existing mappings are typi-
> cally tailored to relate generic ontologies to specific
> database schemata.

This sounds a bit like an over-generalized claim. (Though I can't prove it's actually wrong, since I don't have much experience in ontology-based data integration.)

> This is a complex and time consum-
> ing process that calls for automatic or semi-automatic
> support, i.e., systems that (semi-) automatically con-
> struct mappings of good quality, and in order to ad-
> dress this challenge, a number of systems that gener-
> ate relational-to-ontology mappings have recently been
> developed [8,40,14,53,3,46,28].

> The quality of such generated relational-to-ontology
> mappings is usually evaluated using self-designed and
> therefore potentially biased benchmarks, which makes
> it difficult to compare results across systems, and does
> not provide enough evidence to select an adequate map-
> ping generation system in ontology-based data integra-
> tion projects.

Hard to understand. Maybe this could be split into multiple sentences.

> Thus, in order to ensure that
> ontology-based data integration can find its way into
> mainstream practice, there is a need for a generic and
> effective benchmark that can be used for the reliable
> evaluation of the quality of computed mappings w.r.t.
> their utility under actual query workloads.

For me this reads like ontology-based data integration would never become relevant in practice unless there is an effective benchmark which I think is a bit exaggerated.

1.3. Contributions

> - The RODI framework: the RODI software pack-
> age, including all scenarios, has been implemented
> and made available for public download under an
> open source license.

I would not count this as a contribution of a scientific paper.

2. Integration challenges

2.2.1. Type Conflicts

> With this variant, map-
> ping systems have to resolve n:1 matches, i.e.,
> they need to filter from one single table to extract
> information about different classes.

> In this variant, mapping systems
> need to resolve 1:n matches, i.e., build a union of
> information from several tables to retrieve entities
> for a single class.

IMHO the 1:n/n:1 concept could be made clearer.

2.2.3. Dependency Conflicts

I think Table 1 needs more explanation.

3. Analysis of Mapping Approaches

As far as I understand the main content of Sec. 3 is that (conceptually) a mapping task might differ in
- the information it can use (schema, RDB data, TBox, ABox)
- whether it runs fully or semi-automatically
- the way a semi-automatic approach is guided with feedback (actual mapping results, definitions in mapping source/target, kind of user input)

First, I think this content could be shortened a bit. On the other hand, it isn't really discussed how these issues were reflected in the framework design. All that is said is that mapping systems might stick to one or the other variant and thus an end-to-end approach was chosen. In my opinion, a bit more could be said here.

4. RODI Benchmark Suite

> Multi-source integration can be tested as a sequence
> of different scenarios that share the same target ontol-
> ogy. We include specialized scenarios for such testing
> with the conference domain.

I don't really agree that this would cover all issues of multi-source integration. Things like multiple resources referring to the same real-world individuals, or violations of disjoint class axioms could be introduced without being noticed that way.

4.1. Overview

Table 2: Not clear what 'misleading axioms' actually are.

4.2. Data Sources and Scenarios

This should not be a subsection on its own but should rather contain 4.3, 4.4, 4.5, and 4.6 as subsubsections.

4.3. Conference Scenarios

4.3.2. Relational Schemata

> In particular, the previous version did cover only one
> out of the above-mentioned three design patterns

did cover --> covered (?)

> The choice
> of design pattern in each case is algorithmically
> determined on a "best fit" approach considering
> the number of specific and shared (inherited) at-
> tributes for each of the classes.

Not clear to me what that means.

4.3.4. Data

> Transfor-
> mation of data follows the same process as translating
> the T-Box.

Not clear to me what this means.

4.3.5. Queries

> All scenarios draw on the same
> pool of 56 query pairs, accordingly translated for each
> ontology and schema.

I only found 50 queries at max in the GitHub repository. Are the 56 queries reported somewhere? If some of them could not be translated, are they all useful?

4.4. Geodata Domain -- Mondial Scenarios

> The degree of difficulty in Mondial scenarios is
> therefore generally higher than [...]

phrasing

4.6. Extension Scenarios

Relatively short sub section.

> Our benchmark suite is designed to be extensible

Just out of interest: Are there any guiding/supporting tools (e.g. generators for synthetic data) available to set up own scenarios?

4.7. Evaluation Criteria -- Scoring Function

> (e.g., the overall number of conferences is much smaller
> than the number of paper submission dates, yet are at
> least as important in a query about the same papers)

not clear to me

> meaningful subset of information needs

not clear to me what such subsets are

> F-measures for query results contain-
> ing IRIs are therefore w.r.t. the degree to which they
> satisfy structural equivalence with a reference result.

seems a verb missing

> Structural equivalence effectively
> means that if same-as links were established appropri-
> ately, then both results would be semantically identical.

This explanation needs to be more precise IMHO.
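One possible formalization, which is my tentative reading rather than the authors' definition: two results are structurally equivalent if some one-to-one renaming of IRIs turns one into the other. For result rows pairing a unique literal key with an IRI, a minimal sketch:

```python
def structurally_equivalent(rows_a, rows_b):
    """Tentative sketch (my interpretation): rows are (literal, iri) pairs
    with unique literals. Literals must match exactly; IRIs only need a
    consistent one-to-one correspondence, i.e. appropriate same-as links
    would make the two results semantically identical."""
    if len(rows_a) != len(rows_b):
        return False
    fwd, bwd = {}, {}  # IRI renaming, tracked in both directions
    for (lit_a, iri_a), (lit_b, iri_b) in zip(sorted(rows_a), sorted(rows_b)):
        if lit_a != lit_b:
            return False
        if fwd.setdefault(iri_a, iri_b) != iri_b:  # iri_a already mapped elsewhere
            return False
        if bwd.setdefault(iri_b, iri_a) != iri_a:  # iri_b already taken
            return False
    return True
```

Under this reading, results that mint different IRIs for the same entities pass, while results that conflate two distinct entities into one IRI fail; spelling out something along these lines in the paper would make the criterion precise.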

5. Framework Implementation

5.1. Architecture of the Benchmarking Suite

Just as a note: In Fig. 4 the blue and gray parts are hard to distinguish when printed monochrome. Tools like ColorBrewer (http://colorbrewer2.org/) can suggest colors that are also well distinguishable on monochrome printouts.

> via the Sesame API or using SPARQL

footnote with URL would be nice

> More
> generally, mapping tools that cannot comply with the
> assisted benchmark workflow can always trigger indi-
> vidual aspects of initialization of evaluation separately.

This is a bit vague.

> Next, we build a corresponding index for
> keys in the reference set. For both sets we determine
> binding dependencies across tuples (i.e., re-occurrences
> of the same IRI or key in different tuples). As a next
> step, we narrow down match candidates to tuples where
> all corresponding literal values are exact matches. Fi-
> nally, we match complete result tuples with reference
> tuples, i.e., we also check for viable correspondences
> between keys and IRIs.

For me it is not really clear what is done here in detail.

> This last step corresponds to identi-
> fying a maximal common subgraph (MCS) between
> the dependency graphs of tuples on both sides, i.e., it
> corresponds to the MCS-isomorphism problem. For ef-
> ficiency reasons, we approximate the MCS if depen-
> dency graphs contain transitive dependencies, breaking
> them down to fully connected subgraphs. However, it
> is usually possible to formulate query results to not
> contain any such transitive dependencies by avoiding
> inter-dependent IRIs in SPARQL SELECT results in
> favor of a set of significant literals describing them. All
> queries shipped with this benchmark are free of transi-
> tive dependencies, hence the algorithm is accurate for
> all delivered scenarios.

This is not clear to me.
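For what it is worth, my best guess at the intended matching step, sketched with made-up structures (the names and the greedy strategy are my own, not the authors'): tuples whose literal parts match exactly become candidates, and a candidate is accepted only if its key-to-IRI correspondence stays consistent with all tuples matched so far.

```python
from collections import defaultdict

def count_matches(actual, reference):
    """Greedy sketch: each row is (literals, identifier), where literals is
    a tuple of literal values and identifier is an IRI (actual side) or a
    key (reference side). Reference rows are indexed by their literals; an
    actual row is accepted only if its IRI binds to an unused reference key
    or re-occurs with the key it was already bound to."""
    index = defaultdict(list)
    for lits, key in reference:
        index[lits].append(key)
    iri_to_key, used_keys = {}, set()
    matched = 0
    for lits, iri in actual:
        for key in index.get(lits, []):
            if iri_to_key.get(iri) == key:        # consistent re-occurrence
                matched += 1
                break
            if iri not in iri_to_key and key not in used_keys:
                iri_to_key[iri] = key             # new, consistent binding
                used_keys.add(key)
                matched += 1
                break
    return matched
```

If something like this is what the paper means, a short worked example of this kind would make the paragraph much easier to follow.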

6. Benchmark Results

6.1. Evaluated Systems

There are some systems, like Ultrawrap or D2RQ, that (AFAIK) also allow automatic mapping generation but were not used. Is there a reason for that? On the other hand, COMA++ is used, which (for me) does not really fit in here.

6.3. Default Scenarios: Overall Results

> Good news is that some of the
> most actively developed current systems, BootOX and
> IncMap, could improve

> the two most specialized
> and actively developed systems, BootOX and IncMap,
> are leading the field

This sounds a bit biased (esp. when this is written by the tools' developers and IMHO not the whole state-of-the-art was evaluated).

6.4. Default Scenarios: Drill-down

> both systems benchmarked earlier this year

It's 2016, now. ;)

Review #3
By Anastasia Dimou submitted on 22/Jan/2016
Suggestion:
Major Revision
Review Comment:

This paper is introduced as a new benchmark for relational-to-ontology mapping generation. However, it is more an extension of a previous paper on the same subject [1].

Given that the paper is an extension of prior work, claiming that this paper proposes a benchmark [abstract and p.2] is somewhat inaccurate, and this is only clarified on page 3.
I would suggest making it clear from the very beginning that this is an extension of an existing and already presented benchmark, and that the contributions compared to the prior paper are clearly introduced and supported.

Moreover, the benchmark is positioned with respect to relational-to-ontology mapping generation quality. However, quality is perceived and often cited as fitness for use. In this respect, what is perceived as quality in the case of this paper/benchmark? What is the exact use that is discussed? Which quality dimension is covered with respect to Linked Data?
I would suggest that this is clarified. To do so, I would advise to consult [2].

In the same context, in the abstract it is mentioned that "this approach crucially relies on high-quality mappings". What are high-quality mappings? When is a mapping of high quality? I would suggest that this is supported with arguments or with a citation that defines high-quality mappings, or else dropped. From my point of view, I would expect that this approach aims to define targets that enable the generation of high-quality mappings (with quality clarified), rather than relying on high-quality mappings.

Similarly, on page 2, it is mentioned that “Ontology-based data integration crucially depends on the quality of ontologies and mappings” and “Many of these ontologies are of good quality”. Is there any evidence? I would again ask the authors to bring valid arguments or citations that support these statements and to consider defining the notion of ontology quality. But more importantly, how do these arguments help in supporting the paper? If the argument cannot be proven or supported by a citation and it is not crucial for the remainder of the paper (which is the case in my opinion), I would suggest that the comment be removed.

My two major remarks regarding the overall paper are related to (i) the high-level presentation of the benchmark and (ii) the delta between the already published benchmark and this paper’s contribution. To be more precise, in reverse order:

To start with, the first contribution of the paper (as outlined in section 3.1) is the systematic analysis of challenges in relational-to-ontology mapping generation. However, the now-called “integration challenges”, as outlined, offer no additional information compared to the so-called “mapping challenges” outlined in the previous paper that initially introduced the benchmark. To be more accurate, the entire section 2 (apart from three sentences on structural heterogeneity) is an exact copy-paste of the previous paper.
I would suggest that this section is briefly summarized and the previous paper is cited for more details. The emphasis should be put on the aspects of the challenges which actually form the contribution of this paper.
Still, it would be interesting to clarify why they are now called “integration challenges” instead of “mapping challenges”, as they were originally introduced. From my point of view, both are applicable but in different cases: a mapping is not limited to integration cases, but integration is a certain type of mapping. I would suggest that the authors clarify what they had in mind.

The third contribution of the paper (as outlined in section 1.3) is claimed to be the RODI benchmark. However, the RODI benchmark was proposed in the previous paper [1], as explained above. The extension is the contribution of this paper, which is accomplished thanks to the new evaluation scenarios (second contribution in section 1.3) and the extended system evaluation (fourth contribution in section 1.3). The RODI benchmark, among others, was extended to also cover semi-automatic mapping generation. I would suggest that this (and every other concrete differentiation) is explicitly mentioned, with emphasis put on how these were achieved.

Focusing on the second contribution of section 1.3, the major contribution of this paper is the extension of the evaluation scenarios. It is mentioned that 9 new evaluation scenarios are introduced, but it is not clear what these new evaluation scenarios offer in comparison to the existing ones. To be more precise, into which category of challenges do the 9 new evaluation scenarios fall? I would expect new challenges, or particular details of the outlined challenges, to be associated with these 9 new scenarios.
And this leads to my first remark, namely the high-level presentation of the benchmark. It never becomes clear what the evaluation scenarios are, what the exact test cases are, and which aspect of each challenge category each of them addresses.
On page 11, it is mentioned that query pairs are tagged with categories, but which categories? It only becomes clear later on what these categories are (class instances, datatype/object properties, if I am not mistaken). But additional categories are mentioned too. How are all these categories aligned with the mapping/integration challenges? I would suggest that these query pair/test case/challenge alignments are summarized in an additional table (or at least clusters of them) so that an overall idea of the benchmark becomes clear. It is also not clear how many scenarios, query pairs, and test cases are defined. Additionally, I would suggest clearly indicating which scenarios already existed and which were newly introduced.

More, in the paper it is mentioned that ”queries are highly complex compared to the ones in other scenarios and require a significant number of schema elements to be correctly mapped at the same time to bear any results.” I would suggest clarifying why this is important and what additional information it offers. I assume that if the simple cases fail, then the more complex ones would also fail. Why, then, are such complex cases necessary? Are there any identified cases where the simple queries generate the triples properly but the complex ones fail? If so, I would suggest that this is elaborated, including examples. This way it becomes clearer how incrementally complicated test cases are added to the different domains and what purpose they serve.

In the same context, the descriptions of the different domains provide fragmented information. For instance, it is mentioned that the database size in the case of the oil and gas domain is approximately 40MB. But what is the size in the case of the other domains? The same occurs in the case of ontologies (classes and properties). I would suggest that the same information is provided for all domains.

Minors:
There is some overall problem with the citation enumeration (from [7] on, I think).

[p.11] Query pairs are manually curated → What do you mean by manually?

[p. 11] “We have collected 17 queries in scenario npd_user_tests” → Where does this come from? If the scenarios are outlined, I would suggest to point to the corresponding one. Else, I would suggest that this is not mentioned.

[p.12] Table 4 looks more like a figure.

Table 1 and Table 2 differ only by 1 column, I would suggest that the two tables are merged.

[1] Christoph Pinkel, Carsten Binnig, Ernesto Jiménez-Ruiz, Wolfgang May, Dominique Ritze, Martin G. Skjæveland, Alessandro Solimando, Evgeny Kharlamov. RODI: A Benchmark for Automatic Mapping Generation in Relational-to-Ontology Data Integration. In Proceedings of the 12th Extended Semantic Web Conference, 2015.
[2] Quality Assessment for Linked Data: A survey. http://www.semantic-web-journal.net/content/quality-assessment-linked-da...