Module Extraction for Efficient Object Query over Ontologies with Large ABoxes

Tracking #: 619-1827

Jia Xu
Patrick Shironoshita
Ubbo Visser
Nigel John
Mansur Kabuka

Responsible editor: 
Bernardo Cuenca Grau

Submission type: 
Full Paper
The extraction of logically-independent fragments out of an ontology ABox can be useful for solving the tractability problem of querying ontologies with large ABoxes. In this paper, we propose a formal definition of an ABox module, such that it guarantees complete preservation of facts about a given set of individuals, and thus can be reasoned independently w.r.t. the ontology TBox. With ABox modules of this type, isolated or distributed (parallel) ABox reasoning becomes feasible, and more efficient data retrieval from ontology ABoxes can be attained. To compute such an ABox module, we present a theoretical approach and also an approximation for SHIQ ontologies. Evaluation of the module approximation on different types of ontologies shows that, on average, extracted ABox modules are significantly smaller than the entire ABox, and the time for ontology reasoning based on ABox modules can be improved significantly.
Full PDF Version: 


Solicited Reviews:
Click to Expand/Collapse
Review #1
Anonymous submitted on 10/Mar/2014
Review Comment:

The paper tries to address a very interesting issue in reasoning with ontologies: the modularisation of ABoxes into fragments to improve the time performance of important reasoning tasks as Instance Checking (IC) and Instance Retrieval (IR). The paper can be split into two main parts: a first part, that should lay the theoretical foundation for the correctness of the approach, and a second one, aiming at evaluating the improvement in terms of time to perform IC and IR.

I have several concerns about both parts, and another, important concern about the connection between these two parts.

First part (sections 3 - 5):
The results heavily depend on results from other papers (in particular, Horrocks et at., 2000) that are not explained well enough. The tableau algorithm which the authors refer to many times is not even sketched. This is a journal paper, so I cannot even take into account the limitation of space as a reason for that. I find this a crucial problem since it is impossible to check whether the results described actually hold. Additionally, they refer to definitions "as-you-go", for example, the notion of a concept in negation normal form by Baader et al., 2007, is just mentioned (and not defined) in Definition 4.1.
There is a general lack of conceptualisation in how the results are presented: for example, Proposition 4.2 should rather be presented as an algorithm, of which soundness and completeness should then be proved; additionally, this proposition makes use of notions that are defined afterwards (property-preserved modules and classification-preserved modules). I find this way of exposing a topic rather confusing. Other examples of lack of clarity include:
- Figure 1 and Figure 2, that should be algorithms instead;
- Definition 4.1, which is not a definition after all.
Other specific issues:
- In Remark 3.2 the authors say that the investigation can be restricted to atomic classes since one can "always assign a new name for a […] (possibly complex) class C by adding axiom C \equiv A into T". Since there are infinitely many complex classes, this means that in principle the ontology could become infinite.
- On page 13, property (E1) starts with "T^\prime \cup {a}". This is a type mismatch since {a} is not an axiom. It is also unclear what they mean with this.
- The proof for Proposition 5.2 is only sketched; moreover, it is claimed to be proved by induction, but the authors do not mention on which integer.

Second part (section 7):
The experimental part contains two main experiments: the first experiment investigates on the size of the notion of ABox modules over a small corpus consisting of 9 ontologies with large ABoxes; the second experiment investigates the speed-up obtained by the reasoner HermiT for the performance of IC and IR over these ontologies.
The experimental design lacks of motivations: first of all, how is the corpus selected? Can the results obtained for these ontologies be generalised, and if so, why? In absence of any discussion this approach at selecting a corpus can be said to be cherry-picking. Additionally, there is no experimental hypothesis to prove / disprove, no expected outcome, so it is hard to evaluate the significance of the results. And indeed, the experimental analysis is a mere list of frequencies, there is no proper evaluation of how this notion of modularity performs. This is made clear in the presentation of the results: the intervals are chosen to "have a relatively simple while detailed view of distributions of small, medium, and large modules". However, it is unclear what "small", "medium", and "large" refer to. To give an example, one could say that a module is "small" if humans can understand its content, and claim that this happens for modules whose signature is smaller than 10 individuals. However, no discussion like this one is carried out.
The second experiment is performed over different modules from those discussed in the first 21 pages of the paper: the reasoning tasks investigated, indeed, require modules to contain at least one justification for an entailment to hold. Hence, the focus shifts towards a refinement of these modules.
I find two main concerns about this second experiment. First of all, the authors pick 10 classes at random. Why? Is the outcome statistically significant compared to the TBoxes sizes, of which we are given no information? Second concern: against which instances is the task IC performed? Every possible? Or another random selection?

Lack of connection between the first part and the second one: on page 7, the authors say "the objective of this paper is to extract a precise ABox module […] so that the resulting module ensures completeness of entailments while keeping a relatively small size by excluding unrelated assertions". That is, a precise ABox module contains all the justification for an entailed assertion to hold. For the evaluation (at the end of page page 21), instead, they consider "smaller modules […] for more efficient reasoning and query answering". For a reader is quite confusing to understand the rationale behind a lengthy discussion of one notion of module that leads to the investigation of another notion.

Another annoying remark I have to make is the stretch to find a parallel between TBox modularity as defined by Cuenca Grau et al in 2007, and the approach presented in this paper: the authors say "Analogous to modularity defined for the ontology TBox, the notion of modularity for the ABox proposed in this paper is also based on the semantics and implications of ontologies". However, there is no further discussion of this analogy, and indeed I fail to recognise any relationship at all. I may be wrong, but I read it as an attempt at flattering the responsible editor.

Other minor remarks:
- in Section 6 (page 17) the authors comment that "both approaches […] failed to impose any logical restrictions on a single partition". What does this sentence mean?
- in Section 8, second paragraph, the authors say "all ABox assertions that are semantically "non-local" to a given signature". What does "non-local" mean here? It sounds completely out of context, especially because it is the first time that this expression occurs in this paper. This expression is used in locality-based modules, which are (roughly speaking) TBox modules.
- the authors use interchangeably the terms "concept" and "class" (similarly, they mix the terms "role" and "property"); in general, though, the first term is used in papers about Description Logic ontologies, whilst the second one is used for OWL ontologies. I suggest to keep the
- please, check the grammar, and the use of commas, with a native English speaker.

Review #2
By Pavel Klinov submitted on 08/Apr/2014
Major Revision
Review Comment:

In this paper the authors define the notion of ABox modules, i.e. fragments of the ABox which capture entailments for the given individual w.r.t. the concept names and roles occurring in the ontology. In addition to the generic definition, the authors propose a specialization, called Exact Abox module, and proceed to investigating how (approximations of) such modules can be computed efficiently for SHIQ ontologies. The paper ends with a fairly extensive evaluation on several hand-picked ontologies showing that most of ABox modules are small and thus reasoning over them is much faster than over the entire ABox.

===Significance and novelty===

I believe the paper addresses an important topic since the existing notions of logic-based modules are defined w.r.t. a fixed signature whereas it is often important to capture a certain class of entailments over the signature which is not known in advance (e.g., all class and property assertions for an individual). However, it must be noted that the proposed modules would only preserve atomic entailments, e.g., atomic concept assertions, and do not guarantee that answering complex queries, e.g., DL queries for arbitrary concept expressions, over the module will return the same results. It is also not clear how (or if) the proposed modularization will help evaluating conjunctive queries which may return individuals from multiple modules. Remark 3.2 says (correctly) that answering DL queries can be reduced to instance retrieval for a fresh concept name but this would probably require to recompute existing modules (since the TBox has changed). This is obviously undesirable.

I believe the paper adequately discusses related work.

===Contributions and technical quality===

I'll comment separately on each individual contribution.

The notion of ABox module. This notion simply provides conditions which a fragment of the ABox must meet in order to be called "ABox module". The authors correctly note that it does not prevent the module from containing superfluous parts.

The notion of exact ABox module. This is where things get interesting, this notion basically says that the ABox module is a union over all justifications of atomic ABox facts about the individual. On the one hand, this makes sense because it trivially implies that any exact module is a module. On the other hand, this definition shouldn't be directly used for extracting modules because just one justification per assertion would be enough. BTW, the proposed definition guarantees the property which is called "depletedness" for locality-based modules: the ontology minus the module entails nothing about the individual. It'd be good to mention it.

It is unfortunate that the authors decided to jump directly to the extraction aspects and did not spend any time discussing the properties of their modules, e.g. whether there're counterparts of such properties of locality-based modules as robustness under signature restrictions, self-containedness, etc. (see [1] and [2]). It'd be good to fully understand what the modules are before starting to extract them.

Extracting exact ABox modules. This is the most problematic section of the paper, which suffers from imprecise notions, statements, and lack of proofs. Here are the main issues:

1) Page 11: the whole notion of "class term behind a's assertions..." is totally undefined and very confusing. It's rather unfortunate because the authors seem to build the extraction methodology on it (which culminates with the condition (3) which is then referenced in many other places). Both this notion and the condition (3) must be made proper (formal) definitions. Then the statement that the condition (3) is necessary and sufficient for entailing a concept assertion should be formulated as a proposition.
2) Page 12: Nothing is proved (or even formulated) about the extraction procedure shown at the top of the page. This is actually one of the central contributions of the paper: the algorithm for extracting exact ABox modules. It has to be shown that 1) it is correct, i.e. it selects all and only module-essential assertions, 2) it terminates (this is rather easy), and 3) what it's complexity is (apparently as hard as reasoning for Exptime-logics).

Extracting approximate ABox modules. The authors propose a syntactic check to decide if an axiom can be potentially relevant for individual entailments. Unfortunately it's rather hard to understand until (3) is made clear (because the syntactic form essentially approximates (3)).

Evaluation. The performed evaluation is pretty strong and shows several important results, e.g., i) approximate ABox module extraction is fast (Table 1) and ii) approximate ABox modules are generally small (Table 2). But some issues need to be fixed:

1) It's not described how the TBox of DBPedia ontologies was generated (page 19). Apparently some complex class expression have been auto-generated but the methodology isn't explained.
2) According to which principles were these ontologies selected? What makes them representative or interesting (other than large ABoxes which are sometimes synthetic)? There some others ontologies with large ABoxes, e.g., the IMDB ontology.
3) I wonder if any of the ontologies contain transitive roles (and assertions for them). My understanding is that transitive roles could be one of the main difficulties because they can blow the property module for "a" (by including role assertions for other individuals). If not, this is a weakness.
4) It's unclear how the time spent on module extraction for a single individual was measured, e.g., were all extractions done independently or was the whole ABox modularized in one go?


While most of the prose is OK, the paper suffers from various imprecise statements and confusing/incoherent use of terminology. Here are some of the issues:

* p2: "... up to exponential worst-case complexity...", actually it's N2ExpTime for SROIQ.
* p2: ".. a setting of semantic webs" -> "the Semantic Web setting".
* p2: One has to be precise when talking about the closure of logical implications (e.g., does it include concept assertions for complex concepts of arbitrary length?).
* Definition 2.3 isn't quite a definition. T and A have to be defined precisely as sets of axioms of a specific form.
* Definition 2.5: "Logic Entailment" -> "Logical Entailment"?
* p6, top: The statement that all reasoners implement (hyper)-tableau isn't quite true. Even for expressive DLs, e.g. Horn-SHIQ, there're other methods such as consequence-based reasoning.
* Definition 2.9: "to be" is missing
* Definition 3.4: need to make clear that Just(alpha, K) here means *some* justification, not any specific justification of alpha.
* p8 and elsewhere: It'd be considerably better to define equality-free ontologies syntactically. I.e., if I have an ontology, how do I know which extraction procedure should I run, the one which accounts for individual equality or the simpler one?
* p11: what is meant by "decidable R-neighbors"? The same goes for " its subsumer is undecidable." on page 13.
* Proposition 4.3: it'd be better if this fact was proved without explicitly referring to the tableau's completion rules. It's a fact about the logic, not any particular calculus.
* p14: Essentially the same comment applies to the Module Extraction with Equality section. Explicitly referring to particular tableau rules brings nothing but trouble. Also the statement that equality requires reasoning to detect it is strange since reasoning is required anyway to compute exact ABox modules.
* Proposition 5.1: are we talking about asserted or inferred R-successors? Again, what it "equality-free ABox" exactly?
* Definition 5.1 seems to define potential equivalents in terms of potential equivalents. Until it's fixed I find it nearly impossible to understand Proposition 5.2, which is the key statement about module extraction from ABoxes with equality. Perhaps it would help to prove the simpler Proposition 5.1 first.
* Table 1 and 2: better to explain columns in the captions rather than in text on some other page.
* p19, end: what is "entities" here? Definition 2.7 says that signature is always a set of individuals. Can entities refer to something else?
* p22: It seems that this optimization will lead to the modules not being exact ABox modules any more (since not all module-essential axioms will be included). Better to say it explicitly.


In general, this is a potentially useful paper which presents module notions and extraction methods which can prove useful for applications dealing with instance checking w.r.t. large ABoxes. Eventually I'd like it to be published but I believe it needs another round of reviewing after some key notions and conditions (condition (3), most prominently) are made precise. Also, it can't be published until the correctness of the main extraction procedure (for exact modules) has been formally proved and reviewed (since all subsequent results, e.g. approximations, are based on it).

[1] Bernardo Cuenca Grau, Ian Horrocks, Yevgeny Kazakov, Ulrike Sattler: Modular Reuse of Ontologies: Theory and Practice. J. Artif. Intell. Res. (JAIR) 31: 273-318 (2008)
[2] Ulrike Sattler, Thomas Schneider, Michael Zakharyaschev: Which Kind of Module Should I Extract? Description Logics 2009

Review #3
Anonymous submitted on 03/May/2014
Review Comment:

The submission addresses the problem of partitioning the assertional part (ABox) of ontologies into smaller fragments that contain all the information necessary for reasoning. The problem has been studied before and is certainly relevant to the journal.

The presented approach, however, appears not to be entirely correct.

1. The definition of an equality-free ontology requires that K \not\models a \approx b for all **individuals in K**. The authors then claim that "deduction of a property assertion between different named individuals in A should not be affected by their class assertions except via individual equality". Taking into account that the ontology language has number restrictions, the claim (if I interpret it correctly) is untrue: consider

R <= S
A <= \exists P.\top
P <= S
A <= \leq 1 S.\top

It then only follows that S(a,b). If, however, we add A(a) to the ABox then we will also derive P(a,b) simply because both R and P are subroles of functional S.

Propositions 4.2 and 4.3 are based on the aforementioned claim and so, appear to be incorrect (tableau algorithms make no distinction between ABox individuals and special individuals introduced to satisfy the existential quantifiers).

2. Condition (3) cannot be sufficient because of the following example: let the TBox contain

A <= \forall R.A

and consider an ABox of the form

R(a_1,a_2), R(a_2,a_3),\dots, R(a_{n-1}, a_n).

Then none of the ABox individuals is an instance of A (in every model). If, however, you add A(a_1) to the ABox, then **all** a_i will be instances of A in every model. Thus, the instance checking problem is not "local" (which also explains its P-hardness in data complexity even in tractable languages like EL).

To sum up, I cannot see how the presented approach can work for SHIQ. It is no coincidence that the cited work (Wandelt and Moeller, 2012) deals only with SHI (no number restrictions resolves item 1 above) and looks at paths of ABox individuals (rather than the star-like configurations of in the submission, item 2). On the other hand, the presented approach might work perfectly for DL-Lite.

So, the submission cannot be accepted in the present form.


-- The use of "class" and "role is somewhat inconsistent: it is concepts and roles in DLs, and classes and properties in OWL.

-- The claim that a \approx b is not supported in SHIQ is strange -- it can be added without any change in the complexity (simply rename all b's into a's and remove the equality). The explanation (nominals are illegal in SHIQ) is stranger still (and perhaps it is related to item 1 above).

-- The role of justified entailment is not entirely clear in the submission.

Some typos and minor remarks are listed below.

abstract: solving the tractability problem sounds very much like giving an affirmative answer to P v NP

p 1, par 1: terminologies are not "concepts and roles" --- the sentence needs rephrasing

p 2, par 1: "knowledge data" is a strange term

p 2, par 2: "up to exponential worst-case complexity" is imprecise because OWL 2 is based on SROIQ, which is N2ExpTime-complete

p 2, par 3: "where although" is ungrammatical

p 3, line 2: there are three options in the preceding sentence and it is not quite clear what "the latter" refers to

p 3, par 2: "a closure of all facts" is a new term, which is not defined

p 3, item 1: "sound and complete facts" also needs clarification

p 4, par 1: "SHIQ is extended from ALC" sounds strange

p 4: "\sqsusbeteq is reflexive and transitive" is a strong assumption

p 4, Def 2.3: tuple -> pair

p 5, Def 2.4: \Delta is non-empty and is referred to as the domain (not "referred to as a non-empty domain")

p 5, above Def 2.5: "at least one contradiction" needs clarifying

p 5, after Def 2.6: something else is usually understood by the classification problem (computing the complete lattice of concepts and roles)

p 6, Def 2.9: is said to be in simple form

p 7, after Def 3.2: a criter*ion* (and a criterion is by definition a necessary and sufficient condition, so there is no need to write that)

p 9, footnote 1: the operation \circ is not allowed in TBox axioms (according to Definition 2.1 and below the sentences below it)

p 11, line -2: "the set of decidable and distinct R-neighbors" --- what exactly is a decidable neighbor?

p 16, line -1: author*s*

p 17, Def 6.1: font in (T,A) and , instead of \cup in line 4

p 19: Java *h*eap and \mathcal in ALF

p 20, last par: *more* than 90%

p 24, Table 4: brackets around h and s should be removed

p 24: the claim that modular resigning can change the complexity "from exponential to (approximately) polynomial" is rather bold