HyS: Fast Atomic Decomposition and Module Extraction of OWL-EL ontologies

Tracking #: 1009-2220

Francisco Martin-Recuerda
Dirk Walther

Responsible editor: 
Michel Dumontier

Submission type: 
Tool/System Report
In this paper, we present HyS, an application for fast atomic decomposition and module extraction of OWL-EL ontologies. HyS computes a hypergraph representation of the modular structure of an ontology. This hypergraph representation contains the atomic decomposition of the input ontology, and it allows modules to be extracted for a given signature. We provide an experimental evaluation of HyS on a selection of large and prominent biomedical ontologies, most of which are available in the NCBO BioPortal.
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 27/Mar/2015
Minor Revision
Review Comment:

The paper presents a tool called HyS, a library for decomposing OWL EL ontologies according to the existing notion of Atomic Decomposition (AD). That notion is known to help with extracting locality-based modules and performing reasoning tasks.

I believe the paper sufficiently meets all requirements for a system/tool paper. It demonstrates that the graph-based ontology representation methods are useful for computing AD and module extraction (ME), and that the tool outperforms previously known algorithms. The paper is a bit weak on the impact requirement: it does not go beyond saying that HyS could be used for AD and ME. It sounds like the authors presume (and that's not incorrect) that the utility of AD and ME has already been well-established and is beyond the paper's scope. However, the paper could be made stronger if, for example, the authors discuss how the tool could be used inside the modular OWL reasoners, e.g., Chainsaw.

If this paper were a research paper submission, I would complain about the lack of a formal description of the AD algorithm and the resulting lack of a correctness proof. I understand that this is probably not a requirement for a system/tool paper. However, I still find Section 3.1 and Figure 4 rather disappointing. A few things are not clear (including correctness). For example, it looks like the partial order over the set of atoms corresponds to the set of edges in the collapsed ADH graph, but this is not said explicitly. It is also unclear whether the algorithm performs the transitive reduction of the partial order relation, as it should. If yes, was this accounted for in the experiments?
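For concreteness, the transitive reduction referred to here can be sketched generically as follows (a minimal plain-Python sketch over an edge-set representation, not HyS's actual implementation; the function name and representation are illustrative assumptions):

```python
def transitive_reduction(edges):
    """Drop every edge (u, v) for which v is still reachable from u
    via a longer path; for a DAG this yields the unique minimal relation
    with the same transitive closure, i.e. the Hasse diagram one would
    draw for the partial order over the atoms."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
        succ.setdefault(v, set())
    reduced = set()
    for u, v in edges:
        # DFS from u that is not allowed to take the direct step to v.
        stack = [w for w in succ[u] if w != v]
        seen = set(stack)
        indirect = False
        while stack:
            x = stack.pop()
            if x == v:
                indirect = True
                break
            for y in succ[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        if not indirect:
            reduced.add((u, v))
    return reduced
```

On the three-edge chain-plus-shortcut `{(a,b), (b,c), (a,c)}`, the shortcut `(a,c)` is removed because `c` remains reachable via `b`.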

I suggest taking another pass over 3.1 before submitting the final version (if accepted). Also, if there is another (research) paper with formal description of the key algorithms, a link should be included. If not, I suggest either writing a technical report or presenting the algorithms here.

A technical remark regarding the graph representation: the authors say correctly that the number of hyperedges in ADH could be exponential in the size of O and suggest using a labelled graph G_H instead, where nodes are labelled with their MLSes. I don't see how this improves the situation because in principle there could be an exponential number of sub-signatures that make the given axiom non-local. I suspect the authors apply some syntactic transformation to the axioms to prevent this, but it's not explained.

Another technical remark is that it's unclear why the authors restrict themselves to OWL EL. I don't see any reason other than not implementing the locality check for non EL axioms (which is not hard to do at all). And there is no hyphen between OWL and EL.

Presentation of HyS as a tool with a set of interfaces is adequate.

Evaluation is OK and convincing but there is room for improvement. In particular:
* How much RAM was allocated to the JVM (as opposed to the total RAM)?
* How and why did the authors settle on the number 500 for the signature size? What was the ratio between the number of classes and properties? Selecting random signatures for ME evaluation isn't trivial, see [1,2]
* Section 3 says that nodes in ADH correspond to axioms in O but the numbers in the #nodes column in Table 2 differ from the corresponding columns in Table 1. For example, CHEBI contains >= 85347 axioms (summing subsumptions and role axioms) while Table 2 shows only 85344 nodes. I'm not clear what's going on here.
* The paper compares HyS with FaCT++ and OWLAPI-AD on the decomposition task and with FaCT++ and the OWL API on the extraction task. However, I see that in [3] the extraction task is also sped up using the computed AD so it seems natural to compare with that rather than the pure OWL API.

[1] Chiara Del Vescovo, Pavel Klinov, Bijan Parsia, Ulrike Sattler, Thomas Schneider, Dmitry Tsarkov: Empirical Study of Logic-Based Modules: Cheap Is Cheerful. International Semantic Web Conference (1) 2013: 84-100
[2] Boris Konev, Carsten Lutz, Dirk Walther, Frank Wolter: Model-theoretic inseparability and modularity of description logic ontologies. Artif. Intell. 203: 66-103 (2013)
[3] Chiara Del Vescovo, Damian Gessler, Pavel Klinov, Bijan Parsia, Ulrike Sattler, Thomas Schneider, Andrew Winget: Decomposition and Modular Structure of BioPortal Ontologies. International Semantic Web Conference (1) 2011: 130-145

Review #2
By Thomas Schneider submitted on 25/Apr/2015
Major Revision
Review Comment:

This submission describes HyS, a system that computes the atomic decomposition (AD) of an OWL ontology. AD reflects the modular structure of the ontology by compactly representing all its modules, for a certain logic-based module notion. HyS computes AD with modules based on syntactic locality as the underlying module notion. The approach uses a hypergraph representation of an ontology, which neatly generalizes the notion of reachability-based modules, originally devised for lightweight ontology languages. The hypergraph representation additionally allows HyS to compute single modules faster than current implementations. The submission describes the underlying approach, the architecture, and an evaluation of HyS on nine biomedical ontologies.

In a nutshell, the approach is elegant, the performance of HyS is impressive, and the problem solved by HyS is a relevant one. The two downsides of the submission are that (1) HyS works for EL ontologies only and it is not said how easy or difficult the planned extension to SROIQ would be; and (2) the presentation lacks clarity in some places, particularly in the technical part. I therefore recommend accepting the submission under the condition that the presentation is improved.

Details according to the evaluation criteria

* Quality

The approach taken is elegant because it is based on an intuitive hypergraph representation of the syntactic dependencies between axioms, and those dependencies directly affect the formation of a locality-based module. Computing the AD and extracting a module neatly boil down to computing (strongly) connected components in the hypergraph. According to the evaluation provided, both tasks can be performed by HyS significantly faster, often by 1-2 orders of magnitude, than with (the few) existing tools.
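The reduction to (strongly) connected components highlighted here can be illustrated with a generic Kosaraju-style sketch on a plain directed graph (an illustration only; HyS operates on hypergraphs and its actual algorithm may differ):

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Kosaraju's algorithm: a post-order DFS pass over the graph,
    then a second DFS pass over the reversed graph; each tree of the
    second pass is one strongly connected component."""
    graph, rev, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        graph[u].append(v)
        rev[v].append(u)
        nodes.update((u, v))
    # First pass: record nodes in order of DFS completion.
    order, visited = [], set()
    for s in nodes:
        if s in visited:
            continue
        visited.add(s)
        stack = [(s, iter(graph[s]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()
    # Second pass: peel off components on the reversed graph.
    assigned, sccs = set(), []
    for s in reversed(order):
        if s in assigned:
            continue
        assigned.add(s)
        comp, stack = [], [s]
        while stack:
            n = stack.pop()
            comp.append(n)
            for m in rev[n]:
                if m not in assigned:
                    assigned.add(m)
                    stack.append(m)
        sccs.append(sorted(comp))
    return sccs
```

For the cycle `1 -> 2 -> 1` with an extra edge `2 -> 3`, this yields the components `{1, 2}` and `{3}`.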

* Importance

Module extraction and modularization of ontologies have been widely recognized as important reasoning tasks for ontology development, reuse, and comprehension. HyS provides infrastructure for solving these tasks automatically. However, the importance of HyS has two restrictions. First, HyS only applies to the EL fragment of OWL. I do not see this as a problem per se; I am just concerned that this tool report is a bit premature because the authors write that they plan an extension to SROIQ (OWL), and it is not made clear how easy or hard this extension would be to realize. The description of the technical approach suggests that SROIQ is covered; so if an extension is straightforward, then I think that it is more appropriate to publish a tool report on "full" HyS.

Second, both AD and logic-based modules are still difficult to use in practice because (a) the connection between the modular structure of an ontology as reflected by AD and its _logical_ structure in a more intuitive sense is still not fully understood, and (b) it is still hard to extract the "right" module because specifying a seed signature is non-trivial, and current tools (including HyS, as I understand it) do not provide support for this initial task.

* Impact

Provided that problems (a) and (b) are at least partially solved by the community, I expect HyS to be successful in application areas that involve modularization and module extraction -- even with its restriction to OWL-EL because many biomedical ontologies are written in OWL-EL.

* Clarity, illustration, readability of the submission

The submission consists of three parts: a condensed version of a previous conference paper by the authors describing the technical approach that underlies HyS, a description of the HyS architecture, and an experimental evaluation. The second and third part are new and make the submission fit into the category 'Tools and Systems Report'. They are mostly clear, but the research questions need to be formulated more explicitly and some design choices require discussion (see detailed comments below). Furthermore, the first part is quite high-level. On the one hand, I appreciate the efforts undertaken by the authors to make the text readable for the general SWJ audience, and the main ideas are all there. On the other hand, I have the impression that some parts of the technical sections are now rather vague and confusing. I will point them out and make suggestions below. Altogether, clarity and readability of the paper will be without objection if the authors manage to find a better tradeoff between the summary character and understandability of the technical part.

Detailed comments

* The current paper should be clearly delineated from the authors' previous work. Although the WoMO'13 and ISWC'14 papers are cited, it should be made clear which parts of the current paper repeat or summarize their contents. Furthermore, the module extraction algorithm (Fig. 1) is just a slight variation of the one in [3] and should therefore be attributed to [3].

* Section 2.1: it is ok to mention safety, but besides safety, locality guarantees that the module encapsulates knowledge about Sigma, and that seems to be equally important.

* Sec. 2.1: You pick two out of "several syntactic locality notions". Why these two? Why can the others not be handled (or why are they perhaps unimportant)?

* Sec. 2.2: The first sentence is vague. It is important to make the scope of "modules" clear: for a _fixed_ module notion x, one considers the modules for all _seed signatures_ \Sigma. Then two axioms occur in the same atom if they co-occur in all these modules.

* Sec. 2.2, first par: The claim that the atoms partition the ontology relies on two assumptions: (a) the ontology contains only logical axioms, and (b) there are no axioms that are non-local w.r.t. the empty seed signature. These assumptions should be stated in a central position.

* Sec. 2.2, "poset" is jargon, in particular for the mathematically less adept readers. The curly symbol \succeq needs to be linked to the notion "depends". I suggest dropping "poset" and the bulky notation (Atoms^x_O, curly symbol) and instead saying that the set of atoms together with the dependency relation constitutes the AD. In this spirit, the previous use of "Mod_O^x(sig(alpha))" can be replaced by a more generally understandable precise verbal explanation. Furthermore, dependency between atoms has a similarly important meaning as co-occurrence within the same atom; so if you explain one, I think that you should explain the other too.
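For concreteness, the two notions asked about here can be written as follows (a hedged sketch of one common formulation from the AD literature, for a fixed module notion x; the direction of \succeq varies between papers):

```latex
% Two axioms lie in the same atom iff they co-occur in every module:
\alpha \sim \beta \quad\text{iff}\quad
  \forall \Sigma \subseteq \mathrm{sig}(\mathcal{O}):\;
  \alpha \in \mathrm{Mod}^x_{\mathcal{O}}(\Sigma)
  \Longleftrightarrow
  \beta \in \mathrm{Mod}^x_{\mathcal{O}}(\Sigma)

% One atom depends on another when the latter accompanies it everywhere:
a \succeq b \quad\text{iff}\quad
  \text{every module containing } a \text{ also contains } b
```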

* Sec. 2.3: "some of the most prominent tools" sounds as if there were lots of tools for extracting locality-based modules and computing the AD.

* Sec. 2.3: The OWL API should be summarized in a couple of sentences. I do not see the use of the version numbers of the tools in this section.

* Sec. 3, first par: Is there a specific reason why the current HyS version supports only bottom-locality? Is the planned extension to top-locality easy or hard? The conclusion says that an extension from EL to SROIQ is planned. Would this be challenging or straightforward?

* Sec. 3: hypergraphs are not really a standard notion, and many SWJ readers may not be familiar with them. I think they should be introduced informally near the beginning of the paper. Furthermore, it is not obvious how to define the notion of a (strongly) connected component of a hypergraph; so this should be explained before it is first mentioned.

* The examples in Section 3 seem to be overly simple: they do not even involve any existential quantifiers. Also, I suggest to use one running example to demonstrate the ADH as well as its SCCs later. That running example could also accompany the description of the pcADH. Additionally, labels should be provided for the dashed circles "scc_i" in Fig. 6+7 -- these would help precisely with the difficult parts of the preceding explanations.

* Section 3.1 is called "Atomic Decomposition", but what it describes is the computation of a (p)cADH. The central statement that the two coincide is missing.

* Section 3.1: is the notation size(G) and size(H) ever used again?

* Section 3.1: the functionality of collapse_SCCs(.,.) is described three times on pages 5 and 6.

* Section 3.2: The second sentence says that a locality-based module can be computed by computing "the" connected component in an ADH. But which component needs to be taken? Somehow this needs to be made precise.

* Section 3.2: I think that some words on the complexity of the algorithm in Figure 8 are in order. At first glance, it looks as if extracting a module involves computing a connected component in an exponentially sized graph, while the standard algorithm for syntactic locality-based modules runs in polynomial time. Without further explanation, the algorithm looks suboptimal, and one would not expect the good performance reported later.

* Section 3.2: As a general remark, I like the idea of using ADHs to extract LBMs because the ADH seems to be the right "data structure" to extract an LBM in a more goal-directed way than the original algorithm in [3], which makes a lot of unnecessary locality checks. I am also aware that Dmitry Tsarkov was able to reduce the number of locality checks in a more efficient implementation of the module extractor in FaCT++ a while ago. It would be interesting to contact him and find out whether his optimizations can be explained by your approach (but I cannot rule out that they are based on different intuitions).

* Section 4.2: What difference does it make whether HyS uses the ADH or the (p)cADH? Is it a choice that should be left to the user?

* Section 4.2, just a suggestion for the command-line interface: it would be helpful to have the option to save the AD/modules into a folder different from the one where the signatures are stored.

* Section 5.1: The mentioning of normalization raises the following question: if you split axioms via normalization and then compute a module, how do you make sure that the module is still a subset of the previous non-normalized ontology? For example, an axiom may be normalized into two axioms, and only one of those may end up in the module. If you just "de-normalize" and extend the module by the second axiom, you have increased the module signature. Can you describe more precisely how (de-)normalization and your module extraction algorithm interact?
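The concern about normalization can be made concrete with a standard EL normalization step (a generic textbook example, not necessarily the transformation HyS applies):

```latex
% A conjunctive right-hand side is split during normalization:
A \sqsubseteq B \sqcap \exists r.C
\quad\rightsquigarrow\quad
\{\; A \sqsubseteq B,\;\; A \sqsubseteq \exists r.C \;\}

% A module for the seed signature \{A, B\} may contain only
% A \sqsubseteq B; "de-normalizing" back to the original axiom
% would (re-)introduce r and C into the module's signature.
```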

* Section 5.2: Why is it important to know that nodes and symbols are represented using integers? And what is gained by this choice? I would expect that the integers get wrapped into a containing object anyway?

* Section 6: The first sentence starts with "For the evaluation of HyS ...". What exactly do you want to evaluate? Performance? Scalability? Comparison with other tools? Using pcADH versus cADH? Please state the research questions explicitly. For example, it is said later that it is "interesting to measure the impact of using two different programming languages" -- is this one of your research questions? If so, what insights have you gained from the experiments?

* Section 6: Some design choices of the experiment require discussion: How/why did you choose the nine ontologies? What makes you confident that this sample is appropriate to yield reliable answers to your research questions? Furthermore, I don't find it obvious to exclude CEL on the grounds that it hasn't been maintained for five years. After all, to my knowledge, CEL computes _reachability-based_ modules which are much closer to the approach reported here, and CEL may indeed perform better than the other (general-purpose) module extractors.

* Section 6, page 10, last paragraph, 14th line from below: this sentence says that using cADH doesn't achieve a significant speedup compared to using pcADH, "despite" the fact that cADH is slightly smaller than pcADH. But two sentences before, it says that the differences in computation time correspond to (are proportional to?) the size differences between cADH and pcADH. It seems to me that both sentences convey the same information, and that the "despite" is misplaced.