The Mastro System for Ontology-based Data Access

Paper Title: 
The Mastro System for Ontology-based Data Access
Authors: 
Diego Calvanese, Giuseppe De Giacomo, Domenico Lembo, Maurizio Lenzerini, Antonella Poggi, Mariano Rodriguez-Muro, Riccardo Rosati, Marco Ruzzi and Domenico Fabio Savo
Abstract: 
In this paper we present Mastro, a Java tool for ontology-based data access (OBDA) developed at the University of Rome "La Sapienza" and at the Free University of Bozen-Bolzano. Mastro manages OBDA systems in which the ontology is specified in DL-Lite_{A,id}, a logic of the DL-Lite family of tractable Description Logics specifically tailored to ontology-based data access, and is connected to external JDBC-enabled data management systems through semantic mappings that associate SQL queries over the external data to the elements of the ontology. Advanced forms of integrity constraints, which turned out to be very useful in practical applications, are also enabled over the ontologies. Optimized algorithms for answering expressive queries are provided, as well as features for intensional reasoning and consistency checking. Mastro provides a proprietary API, an OWL API-compatible interface, and a plugin for the Protégé 4 ontology editor. It has been successfully used in several projects carried out in collaboration with important organizations, on which we briefly comment in this paper.
Submission type: 
Tool/System Report
Responsible editor: 
Thomas Lukasiewicz
Decision/Status: 
Accept
Reviews: 

This is the camera-ready version of the paper. The reviews below are for previously submitted versions.

Solicited review by Roman Kontchakov:

The revised version is much clearer and more consistent. Some paragraphs, however, might do better with a second reading (e.g., second par, right col, p 7, starting from "A challenge...").

typos:
abstract: Java with capital J
p 4, left col, l 1: equate_d_
p 4, right col, l -1: "(seen as a database)" is irrelevant here
p 6, right col, l -20: being OWL 2 a very expressive -> OWL 2 being a very expressive
p 7, left col, l 5: Company_, which_ is
p 7, left col, l 6: air traffic control system?
p 7, left col, l -5: "domain interested by our experimentation" -> domain
p 7, right col, l -6: calculus -> estimation
p 9, right col, l -22: "makes it difficulty applicable" -> makes it difficult to apply

Solicited review by anonymous Reviewer:

The changes implemented by the authors in response to the first round of reviews have improved the quality of the paper. The coverage of the related literature is significantly better now. Various claims are more precise and thus more convincing, too. I recommend accepting this version.

One remaining comment for the camera-ready version: I notice that the mentioned DL reasoners are somewhat outdated. Racer is now RacerPro, FaCT is FaCT++ (used in the text, but the paper cited is still about FaCT) and the HermiT reasoner was forgotten (with only 4 "living" OWL DL reasoners why mention just 3 of them?). I trust that the authors will update this without requiring a further reviewing round.

This is a revised submission. The reviews below are for the original version.

Solicited review by Carsten Lutz:

This is a system description of the Mastro system for ontology-based
data access (OBDA). It contains

- a brief introduction to OBDA in general and the DL-Lite approach in
particular, along with many references to the relevant literature

- a description of the Mastro system including its general architecture,
functionality, main components, and interfaces

- a brief discussion of three concrete applications / use cases in data
integration, and

- a discussion of related work.

The paper is well-written and pleasant to read. As is appropriate for
a system description, it does not contain many technicalities or novel
theoretical results. It also does not contain major new ideas of
system implementation. However, the paper is a very readable and
gentle introduction to OBDA in general, and to using the powerful
Mastro tool in particular. It can also serve as a (brief) survey of
the area. Therefore, I believe that this paper will provide a useful
and accessible starting point for practitioners who want to learn
about OBDA and put the Mastro system to work. I recommend that the
paper be accepted.

General comments:

1) I am missing a discussion of the arity of relations. You propose
the integration of existing relational database systems, which
typically involve relations of arity greater than 2, using a
description logic that is restricted to relations of arity at most 2.
Especially from a practitioner's viewpoint, one would expect an
explicit discussion of this gap, and preferably a systematic way to
bridge it. Maybe this is supposed to be implicit in your discussion of
mapping assertions and the impedance mismatch (reification?), but
that does not become very clear.

2) The three use cases are too uninformative, as their description
mainly says "we have applied Mastro with company X in application
area Y, the constructed ontology had size Z; and it all went very
well". What one would _really_ want to hear is a careful and balanced
discussion of the strengths and challenges of the approach in each of
the applications. What went particularly well ("everything" is not a
very convincing answer)? What is (currently) difficult? Where do you
see the limitations? How long did query processing take? Etc. Questions
like these should be at the heart of a system description.

Specific comments:

p3c1l2: "the aims it" -> "the aim is"

p3c2: it is a bit surprising to first read that your mappings are GAV
because of the lack of existential variables in the body, and then that
there are function symbols in the body used to generate new objects.
Some clarification would be good.

p4c1l-18: first use of the term "epistemic". Up to here, you have called
it "semantic approximation". It might be good to say that these are
the same thing (you say that, but too late).

p4c2l17+: it is not quite clear here what "expressed over a TBox T" means.
Exploiting the constraints in T? Or only formulated in the language of T?
It seems that in line 17 you mean the former, and in line 19 the latter.

p5c1: I found the description of unfolding (points (i)-(v)) very hard to
understand; any improvement welcome.

p5c1: parallelization: it did not become clear to me whether your parallelization
means to (a) perform in parallel queries over _different_ databases or
(b) also to perform parallel queries over the same database. A clarification
would be welcome.

Section 4: I am not a native speaker, but some phrases in this section sounded
ungrammatical to me (e.g. "experimenting the system"). You might want to
check.

Solicited review by Roman Kontchakov:

The submission describes MASTRO, a system for ontology-based data access. In 11 pages the authors present the basic principles and architecture of the system as well as three real-life applications where it is being used.

The submission is clearly relevant to the journal and the three cases of using MASTRO in an industrial setting show the maturity of the system.
The experimental evidence provided by the authors is convincing but somewhat incomplete (see detailed comments).
In Section 5 the authors provide a detailed analysis of alternative approaches and tools. The analysis, however, raises the following important question. In the case of MPS the data is updated infrequently but queried often (in fact, it is updated "on a daily basis"). Pure query rewriting is known to produce very large queries. On the other hand, ABox extension (mentioned on p 9) allows one to avoid the exponential query blowup; it, however, has to be run, e.g., "on a daily basis". So why would one use the expensive pure query rewriting and not the ABox extension if the data is not updated frequently?

The paper is quite patchy: some paragraphs are good, others rather poorly written, with lots of heavy sentences and unnecessary words. There is also a mixture of terminology taken from different traditions (e.g., classes and properties versus concepts and roles). The authors' desire to appeal to different communities is, of course, a positive thing but such a mess of names makes it difficult to read.

On the whole, the MASTRO system description is a relevant and significant contribution to the journal. However, the submission is not quite ready for publication in its present form.

Detailed comments:

p 2, left column, line 2: what does `etc.' mean here?
p 2, left column, par 2: "DL-Lite_{A,id} captures all basic constructs of the languages for ontologies" is misleading; obviously, the authors did not mean that DL-Lite_{A,id} contains the whole of OWL
p 2, left column, par 2: too vague to be understood: it seems the authors start with the core part of DL-Lite, then mention identification constraints and then EQL constraints (as `other general forms of constraints'); but why semantic approximation then? and what are `all such constraints' that are very useful?
p 3, left column, par 2: DL-Lite is a family of lightweight DLs; why do the authors associate lightweight DLs with the low data complexity of inference? EL is also a family of lightweight DLs, but they were designed for efficient classification; on the other hand, inference (for example, instance checking) can be in AC^0 for bigger languages than DL-Lite_{A,id}.
p 4, right column, par 1: what are `design-time' and `run-time' tasks?
p 5, left column, par 1: why is query unfolding more complicated in Mastro if, from a very abstract point of view, it is just one of the GAV data integration systems? is it only because of the `object identifiers'? then it should be said so.
p 5, right column, par 3: the "Roughly speaking, " sentence leaves the reader to wonder: UCQ are `unions of select-project-join SQL queries', so why would anyone call an `SQL query over ... UCQs' something other than `an SQL query'? SQL queries are closed under substitution after all.
p 6, right column, end of par 2: what are "syntactic approximations"?
p 7, right column, last par: the data is updated "on a daily basis"
p 7, right column, par 1: how many axioms?
p 8, left column, par 1: how many attributes?
p 8, left column, par 4: "around 100" sounds a bit uncertain compared to "112 concepts"; how many precisely?

Heavy sentences:

p 1, left column, par 2: split the first sentence
p 2, right column, line 6-8: `(that is in AC^0 with respect to ..., i.e., w.r.t. ...)' is extremely heavy on its own (contains _two_ w.r.t.); moreover, this part in brackets breaks the flow of the outer sentence and makes it difficult to read
p 2, right column, line 21: "possibly specified in non-relational form" -> "possibly in a non-relational form"
p 2, right column, line 24: "represents them as if they were ..." --> "represents them as"
p 4, left column, line 10: "modules that constitute the system, which are shown in Figure 2": which reads as relating to `the system', why plural then?
p 4, left column, par 2: the two sentences at the end (`We start ...') do not help the reader at all
p 4, right column, par 2: "as, for example, the use of" -> "for example, "
p 4, right column, par 3: two "but rather" in a single sentence
p 5, left column, par 1: (i)--(v) are nicely written but they are way too hand-wavy to be understood without reading, say, [31]
p 5, left column, line -7: `This feature is a key feature that allows' -> `This feature is key'
p 5, right column, line 5: "the answer is forwarded to a result set wrapper": what is `a result set wrapper'? is it a bit of software? (but then how can it be `returned to the client'?) or is it a data structure? (but then how can something be forwarded to it?) or is it both?
p 6, left column, par 2: "the Consistency Checker allows one to verify consistency of very expressive constraints": what are those "very expressive constraints"? do they include anything apart from "identification and EQL constraints"? if not, why not write "identification and EQL constraints" instead of "very expressive constraints (e.g., ...)"?
p 6, left column, par 2: "whose answers return data that give rise" sounds wrong
p 7, left column, par 3: "managed by Websphere" reads as "15 attributes are managed by Websphere"
p 7, left column, last par: two "that" in a sentence is quite hard to read
p 8, left column, par 2: "namely, identification constraints" (without "the possibility to specify")

Terminology:

p 1, left column, par 2, line 7: `instance level v intensional level,' whereas on p 3, left column, par 1, it is `extensional v intensional knowledge'
p 3, left column, par 2: axioms are used in the text elsewhere but are not defined here (on p 3 they are called `logical assertions' instead)
p 3, left column, line -14: why are negated concepts mentioned here at all?
p 3, right column, last line of par 1: what are "instances of an ontology"? did the authors mean "object terms, which serve as object identifiers for individuals in the ontology"?
p 3, right column, line -25: concepts v classes; gets rather nasty in "class satisfiability, i.e., ... a given concept"
p 3, right column, line -22: why not stick to the standard "concept satisfiability"?
p 4, right column, par 2: does `ontology' equal `TBox'? if not, elsewhere that would cause lots of misunderstanding, e.g., 7 lines below we read "ontology (T,A)" but in most other contexts the ontology is just T
p 4, right column, par 2: why is Q_r a set of CQs, and not a UCQ?
p 4, right column, line -11: `concepts, roles and attributes' would be more suitable for a TBox than `classes and properties'
p 6, left column, par 2: `exclusion dependencies' were called disjointness constraints on p 3

Typos:

p 1, left column, line 4: Bolzan_o_
p 2, left column, line 8: _the_ expressive power
p 2, right column, line -3: experimented -> trialed
p 3, left column, line 1: "the aims it" -> "the aim is"
p 3, left column, line 12: "is traditionally constituted" -> "consists of"
p 3, left column, line -9: "An ABox instead," -> "An ABox"
p 3, right column, line -5: would "tuples that are answers to Q in every model" be a better option?
p 3, right column, line -2: `this latter' -> `the latter'
p 4, left column, line 12: `recall' -> `mention'
p 4, right column par 2: _an_ UCQ
p 4, right column, par 2: `natively deal' -> `deal natively'
p 4, right column, line -10: `alphabet' -> `vocabulary'
p 4, right column, line -6: definition_s_
p 4, right column, line -3: quite _a_ straightforward
p 6, left column, par 2: "proprietary API is used to integrate" (without "is the one that")
p 6, left column, par 2: "This API is also used to implement specific procedures"
p 7, left column, par 2: is _a_ world leader
p 7, left column, par 2: "In such a case study" is not needed
p 7, left column, par 2: significant, not significantive
p 8, left column, par 1 (and elsewhere): are these `experimentations' and not `experiments'?
p 8, left column, par 2: incompleteness and inconsistency (in singular)
p 8, right column, last par: "several optimization_s_ have been _implemented_"
p 8, right column, last par: "to deal with very large ABoxes" ("with ontologies" is not needed)
p 8, left column, par 1: "In [21] an alternative approach to query answering is presented"
p 9, left column, par 1: "given by first experiments are encouraging"
p 9, left column, par 2: "supports none of the advanced features" (neither any -> none)
p 9, right column, line 1: "_a_ mechanism"
p 9, right column, line -20: "form" (not form_a_)
p 9, right column, line -15: "whose basic task is" (no s, no "consists in")
p 10, right column, (iv): capital F in "Finally"

Solicited review by anonymous reviewer:

This paper describes the Mastro system that allows the construction of views on
relational databases with mapping rules and ontological axioms (in the formalism
DL-Lite_{A,id}). Unions of conjunctive queries that use the ontology's vocabulary
can be executed by translating them first to sets of conjunctive queries, then
to logic programming rules (to incorporate the RDBMS mappings) and then to a set
of SQL queries that can be evaluated. The general approach of formulating
queries with ontological terms is called "ontology-based data access" by the
authors.

The paper gives a very high-level view of the system and the practical
experiences. For the details of the semantics and algorithms that are used in
Mastro the reader is referred to existing publications. The paper contains no
evaluations or proofs. The practical experiences that are mainly described are a
use of Mastro in the military domain and another use in the finance domain. Some
figures are provided (number of axioms, number of mapping axioms).

I think that the paper fits the call for papers of this special issue. The
main research works on the topic have been published elsewhere but the tool that
is described here could be interesting to some. The paper is too long (11 pages
but the call says 10 are the maximum). So the paper should be shortened for the
final version if it is accepted.

The other issue with the submission is in the related work section. It mentions
some approaches that are not so closely related and omits others that are more
similar. The authors discuss that OWL DL reasoners are not able to scale as well
as Mastro but they do not mention that there are different OWL profiles that are
also scalable. E.g. the OWL RL tools (OWLIM, Oracle, ...) today scale to very
large databases. But also the OWL EL tools are much more scalable than the OWL
DL tools that the authors discuss. This could be misleading to a reader. Another
important aspect is that all other OWL systems that are mentioned allow users to
write data and Mastro is "read only". This is an important difference that must
be mentioned.

The paper also says that Mastro is the only system that offers reasoning and
mapping to external data sources together. I don't think that this is true. Two
other examples that I found on the web are Virtuoso from OpenLink and OntoBroker
from OntoPrise. I do not know what exactly they do but both claim to support
data integration on their web sites. Both tools support query answering and some
forms of reasoning. Other tools that have supported data from RDBMSs are KAON2
and SHER. It is important that the related work section does not give the
impression that Mastro is the only tool with this goal or that OWL QL is the
only scalable OWL fragment. A reader could be made to believe this since the
authors say that Mastro provides the "best expressive power allowed" for
obtaining a tractable system.

I think it is not difficult to fix this problem so I think the paper can still
be accepted if the related work section is expanded as suggested.

Some other comments:
- The first sentence in section 2 seems to be malformed.
- On page 6 it is said that EQL queries allow negation and comparison. This
seems to be related to the SPARQL query language (it also has closed world
negation and it is supported by some OWL tools that do reasoning). This should
be explained.
- on page 8 it says "several optimization"
- page 9 contains a word "forma"
- the word "expressive" occurs twice on page 10
