Ontop: Answering SPARQL Queries over Relational Databases

Tracking #: 1004-2215

Diego Calvanese
Benjamin Cogrel
Sarah Komla-Ebri
Roman Kontchakov
Davide Lanti
Martin Rezk
Mariano Rodriguez-Muro
Guohui Xiao

Responsible editor: 
Oscar Corcho

Submission type: 
Tool/System Report
In this paper we present Ontop, an open-source Ontology-Based Data Access (OBDA) system that allows for querying relational data sources through a conceptual representation of the domain of interest, provided in terms of an ontology, to which the data sources are mapped. Key features of Ontop are its solid theoretical foundations, a virtual approach to OBDA that avoids materializing triples and that is implemented through query rewriting techniques, extensive optimizations exploiting all elements of the OBDA architecture, its compliance with all relevant W3C recommendations (including SPARQL queries, R2RML mappings, and OWL 2 QL and RDFS ontologies), and its support for all major relational databases.

Major Revision

Solicited Reviews:
Review #1
By Jean Paul Calbimonte submitted on 21/Apr/2015
Review Comment:

This paper is a summary of the features of the Ontop system for querying relational databases using SPARQL queries.
The system described is based on well-known and well-studied query rewriting techniques, relying on mappings that bridge the relational and ontological models. Ontop is not a new system; it has already been presented in other papers, as noted in Section 6. Different pieces and previous versions have been described in several papers at different conferences in the past.
All main features described here (e.g. SPARQL query answering, OWL 2 QL support, or the Protégé plugin) have been described in those previous works. Therefore this paper is mostly a summary of these previous efforts, and I see no major new contribution or novelty, apart from the reworked examples, the migration to GitHub, and other minor details.
The Ontop framework is certainly one of the leading tools in OBDA, and I would have expected a new submission on Ontop (even as a software paper) to include new features such as the ones discussed in the future work section, e.g. NoSQL systems, or even support for very common operators such as aggregates, which are essential in many applications. Something that is also not evident in this paper is how important OBDA is in the Semantic Web world, which seems to be dominated by Linked Data and ‘normal’ RDF and SPARQL endpoints. Is OBDA still somewhat marginal in terms of adoption compared with classical bulk translation of *Anything* to RDF? This is perhaps too broad a question, but nowadays RDBMSs are no longer the only widely adopted storage options. In some cases RDBMSs are seen (sometimes unjustifiably) as legacy systems. This is why it might also be important to indicate when OBDA might be the right way to go. This will also give the readers an idea of the importance and impact of Ontop, even beyond the OBDA world.
Other than that, Ontop relies heavily on the existence of mappings. However, the task of producing reliable mappings is painfully hard in practice. In data integration systems, creating, validating, and maintaining mappings are very costly tasks that require expensive curation from experts who have to master both the database world and the ontology world. This issue is apparently not covered (yet) by Ontop, but it seems an essential piece of the puzzle. Although there exist approaches like direct mappings (and refining mappings using schema matchers), none of this yet seems to be in the scope of this open source project.
Concerning adoption, the efforts made to make the project more appealing to the community are much appreciated, and it seems to be starting to gather some attention. The examples cited seem to be more along the lines of descriptions of EU project use cases. This is OK with me, but they seem out of place as they (at least in the way they are presented) do not add any insight into why Ontop is key to solving the problems that these use cases have. This oddity is more evident in the Siemens use case, which looks like a CEP or event processing problem rather than an OBDA problem, and Ontop is not (yet) ready for event or stream processing, as far as this paper is concerned.
The summary in Section 5 is quite useful, but the comparison is a bit shallow. How does Ontop compare to GraphDB or RDFox in terms of query evaluation performance (at least in the scenarios where they are comparable)? Or how about Ultrawrap? It seems that a more comprehensive comparative evaluation is needed at this stage (Table 1 does not provide much information), so that readers can have a better idea of which system to use under which conditions. There are hints of this in the manuscript (e.g. the comparison with Stardog about rewritings resulting in large SQL), but perhaps a more systematic comparison is needed, given that there are several aspects to consider. The authors have already worked on similar comparison efforts for the query rewriting part, with nice results for the community (e.g. Ontop at work, NPD benchmark).
In summary, Ontop is a leading system in OBDA. However, the contents of this paper do not offer many new features or much substantial content w.r.t. previous Ontop papers.

Review #2
By José Luis Ambite submitted on 14/Jul/2015
Minor Revision
Review Comment:

This manuscript was submitted as 'Tools and Systems Report' and should
be reviewed along the following dimensions:

(1) Quality, importance, and impact of the described tool or system
(convincing evidence must be provided).

The paper presents a capability overview and historical retrospective
on Ontop, a system for Ontology-Based Data Access. Ontop builds on the
experience with DL-Lite query rewriting and previous systems for OBDA
such as Mastro. Ontop is currently the best example of implemented OBDA
systems. Ontop is available under the Apache Open Source license, which
makes it particularly appealing for both academia and industry.

(2) Clarity, illustration, and readability of the describing paper,
which shall convey to the reader both the capabilities and the
limitations of the tool.

The paper is well written, and does a great job of presenting the
major characteristics and software APIs of Ontop.

Some additional discussion would make the paper more self-contained
and valuable:

1. The mappings in the examples are essentially GAV. Skolem symbols
(URIs) are conveniently generated by using the values provided by the
data sources (e.g., :db1/{pid} in Example 2.2). However, when
integrating multiple sources, such conveniently shared ids may not
exist. What is the Ontop approach to deal with more general schema
mappings (i.e., LAV, GLAV mappings, with existential variables in the
mapping consequent)? What happens when there are no shared ids across
sources (e.g., in one source employees are identified by employee_id
and in another by SSN)?

2. How these more general mapping rules interact with the compiled

A full discussion of 1 and 2 may be beyond the scope of the
paper. However, the authors should include a few sentences describing
precisely the limitations of their mapping language and
algorithms. (Without the reader needing to go to the references (e.g.,

3. The industrial applications of Section 4 seem to be in their very
initial stages, but do raise some questions about the applicability of
the approach. In particular, both involve large sources. Can you
discuss your approach to generate the needed schema mappings? Would
you model all the source tables? Is there some (semi-)automatic help to
generate the schema mappings? Based on previous experience, can you
provide an estimate of the effort (manual and/or software-aided) needed to
model large integration use cases like those in Section 4?

Minor comments:
- In the mappings of Example 2.2 and on page 6, the stage attribute is
not needed in the SQL queries. It is not used in the mapping rule

Review #3
Anonymous submitted on 17/Aug/2015
Minor Revision
Review Comment:

This paper presents Ontop, an approach and tool for answering SPARQL queries over relational databases.
Its importance lies in its use of well-known query rewriting techniques: it takes the R2RML mappings (or the native Ontop mappings) to produce SQL queries, applies query optimization techniques to both mappings and queries, and also provides OWL inference capability in order to complete the initial set of mappings. The paper also presents two industrial applications that give evidence of its applicability to real-world scenarios.

The paper is very clear and well written.
I have concerns about some points of this work that I believe should be expanded. These are the following:
- In Section 1, Introduction, when motivating the work, it would be interesting to mention a federated system and give an example of one. I believe a federation of sources is the context that would most benefit from this work. In fact, Section 2 mentions the possibility of using Ontop with federated databases, and I believe that its use on multiple data sources is very important.
- In Section 2, Architecture, there is mention of support for two mapping languages: it is not clear whether the native mapping language is complete with respect to the R2RML standard or whether it is only syntactically different for ease of use.
- Also, in Section 2 there is mention of using the platform to query streaming data. This is currently an important issue and should be expanded: how does this architecture fit queries over streaming data?
- I believe that Section 3 in some parts should be more self-contained. For example, it should give brief explanations for how the T-mappings are constructed.
- If SQO is applied in the online stage, and is expensive (as stated previously for the offline stage), what is the impact of this stage on the cost of query answering?
- The more relevant experimental results mentioned in section 3.2.4 should be presented. Also, the experimental results on the newer benchmark, NPD, should be presented.

- Some typos: page 8 says [38,21], which should be [21,38]; page 10, second column, first paragraph, says "chaininqg".