Discovery of Reused Ontology Fragments and Design Patterns Using Tree Mining

Tracking #: 1243-2455

Agnieszka Lawrynowicz
Jedrzej Potoniec
Michal Robaczyk
Tania Tudorache

Responsible editor: 
Rinke Hoekstra

Submission type: 
Full Paper
The research goal of this work is to investigate ontology fragments that are recurring in ontologies. Such reused fragments may originate from certain design solutions, and may possibly form emerging ontology design patterns. We describe a method based on tree mining, and a transformation of ontology axioms into a tree form, to discover reused ontology axiom fragments. We, then, use association analysis to mine co-occurring axiom fragment sets in order to extract emerging design patterns. Using these methods, we conduct an experimental study on a set of 331 ontologies from the BioPortal ontology repository. We show that recurring axiom fragments appear across all individual ontologies, as well as across the whole set. In individual ontologies, we find frequent and non-trivial reused fragments, some having more than 3,000 occurrences. The longest reused fragment discovered from the whole ontology set is formed of 12 elements, and it appears in 14 ontologies. To the best of our knowledge, this is the first method for automatic discovery of emerging ontology design patterns. Finally, we demonstrate that we are able to automatically reconstruct fragments of ontology design patterns described in the literature. Since our method is not specific to particular ontologies, we conclude that we should be able to discover new design patterns for arbitrary ontology sets. We envisage that the reused ontology fragments and patterns, that we discovered in our work, will be helpful in performing quality assurance for ontologies, as well as in creating custom user interfaces to support the authoring of modeling constructs specific for an ontology.

Major Revision

Solicited Reviews:
Review #1
By Vojtěch Svátek submitted on 11/Feb/2016
Major Revision
Review Comment:

The paper is devoted to discovery of design pattern fragments in ontologies. The empirical analysis has been carried out on BioPortal.
It is a well-focused and potentially impactful piece of work. As far as I can judge, the approach taken is original in many aspects. The only real issue to me is the quality of writing. I list below several cases of what I see as sloppy formalization.
It is the content of Table 8 that I find the most interesting ‚nuggets‘ of the whole experimental part. Yet, I would love to see some expert evaluation of it.
The dividing line between design patterns and ‚empirical patterns‘ as recurring fragments should be clearer in the paper. A design pattern is usually not just the structure of entities per se, but also accompanying textual explanations, diagrams, usage examples, pre- and post-conditions, etc.
The literature survey is reasonably comprehensive. Maybe the older study by Tempich & Volz [1] would also be worth mentioning.
As regards English, overuse of commas (e.g., before ‘that’, which is perhaps sometimes meant as a non-restrictive clause requiring ‘which’) is ubiquitous, and so is improper use of singular/plural. Copy-checking by a person with better language skills (if not a native speaker) is a must.
Detailed comments:
Section 3.1
- „E \subseteq N \times N is a set of edges“ – this is a minor detail, but since the graph is acyclic, certainly E = N \times N cannot hold, so it should always be a *proper* subset.
- „A forest is a set of rooted, labeled trees“ – I would expect a forest to be a set of trees and a labeled forest to be a set of labeled trees?
- Technically, you say that a labeled tree is a tree, but a labeled tree is a 3-tuple while a tree is a 2-tuple. I would say that for every labeled tree there is a corresponding unlabeled tree - but they are not the same thing.
- When defining an embedded subtree, you speak about the „path from the root of T“, where T is a labeled tree. A labeled tree is however not explicitly declared to be rooted, thus it is not guaranteed that it only has one root. Something should be fixed here.
- The notion of ‚frequent‘ fragment/subtree is not explicitly introduced. You only informally mention that „the aim is to enumerate all subtrees of a given forest, such that their support is greater than a given threshold“. You probably mean that such subtrees are called frequent?
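As an illustration of the two subtree notions these comments rely on, the following minimal sketch contrasts them on a single edge: an induced occurrence requires a direct parent-child link, while an embedded occurrence only requires an ancestor-descendant link. The toy tree, the function names, and the assumption that node labels are unique are all hypothetical; this is not the authors' mining algorithm.

```python
# Hypothetical toy tree stored as child -> parent; "A" is the root.
# Assumption: labels are unique, so a label identifies a node.
PARENT = {"B": "A", "C": "B", "D": "A"}


def is_child(parent_of, ancestor, node):
    """True if `node` is a direct child of `ancestor`."""
    return parent_of.get(node) == ancestor


def is_descendant(parent_of, ancestor, node):
    """True if `ancestor` lies on the path from `node` up to the root."""
    current = parent_of.get(node)
    while current is not None:
        if current == ancestor:
            return True
        current = parent_of.get(current)
    return False


def edge_occurs(parent_of, edge, embedded):
    """Check one pattern edge under the induced or the embedded reading."""
    ancestor, node = edge
    check = is_descendant if embedded else is_child
    return check(parent_of, ancestor, node)
```

In this toy tree the edge (A, C) occurs only under the embedded reading, because B sits between A and C, whereas the edge (A, B) occurs under both readings.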
Section 3.2
- „The signature of an OWL ontology O, denoted sig(O), is the set of entities that appear in that ontology.“ Why is there no connection to the vocabulary of the ontology?
- „By N_CC we will denote a set of class constructors“ – it might not be completely clear *what* the class constructors are, from the previously introduced grammar.
- The OWL notation used throughout the paper should be explained. It looks like symbolic expressions with Manchester syntax keywords instead of DL symbols.
- „each variable ?classexpr \in V_CC…“ – you should say clearly right here that when multiple variables of the same type appear in a single fragment, they would be extended with consecutive natural numbers (?classexpr1 etc.)
- „An axiom fragment of an OWL axiom...“ - I don’t find the term ‘fragment’ fully adequate here, as the first operation used (replacement of term with a variable) rather has to do with abstraction/generalization of the axiom. The other operation producing the ‘fragment’ is “cutting a part of…”, which is however not properly explained. (And the ‘fragment’ is not obtained *from* a vocabulary but rather *given* it, or so?)
- Strictly speaking, N_O \cup V is not a vocabulary, as variables are not assumed to be part of a vocabulary.
- “A frequent axiom fragment is obtained after decoding a subtree S, and the possible addition of variables (see Section 4.5).” It is not fortunate to make such a forward reference: without looking ahead the sentence is unclear. Actually, you speak separately about axioms and trees, and only in Section 4 (set aside the abstract) the reader learns that one would be the representation for the other. It would help readability to point at this principle prior to the whole ‘preliminaries’.
- “SubClassOf(X; g(C))” – while previously you spoke about g(.) as of axioms, now they are treated as concept expressions in subclass relationship. It should then probably, formally, mean entailment of the two instantiation axioms?
Section 4.2
- “every distinct pair (type and name) is assigned an unique integer value.” This means that the type alone cannot be used as feature?
Section 4.3
- Fig. 3a caption: „T1 rooted in C^1 and T2 rooted in A^1“ – technically, the superscript only refers to the traversal index and not to the tree itself. Rooting is a feature of the tree itself. Superscript thus should not be used here.
- In tree T1 one node label (E) appears twice. This could however mean different things in terms of OWL structure: just same constructor, literal or cardinality value… or the same ontological entity. It seems you do not distinguish between the two? Is it related to your note in Conclusions „It might happen that the variables appearing in a class frame fragment (as part of different axiom fragments) refer to the same entity“ ? However, here it would be even within the same axiom?
- Fig. 3c: I would say that there should be T2[1,3] in row 2, column 1.
- Fig. 3d refers to „some more complex subtrees“, however, without seeing the trees the table is not so instructive and the caption is quite long and complex.
Section 4.4
- I am not sure if I understand the first paragraph correctly. Do you say that deriving the presence of the generic ‚subclass axiom pattern‘ from the presence of subclass axioms with restrictions in RHS is ‚not really meaningful‘? There is not much analytical value in it, but is it truly ‚meaningless‘?
- „we filter discovered embedded subtrees, and keep, and extend only those that are also frequent induced subtrees.“ I am afraid I don’t understand this sentence, both in terms of content and syntax/grammar.
- „the number of forests containing at least one tree containing the given subtree“ – wouldn’t it be better to define it with ‚at least k trees‘ first and only then refine that for easy computation k is set to 1? The criterion on ontologies (forests) having a certain kind of axiom structure (tree) in them might also be set differently, with less strong emphasis on cross-ontology co-occurrence of the axioms?
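The suggestion in the last comment can be sketched as follows: support is first defined with a general parameter k (a forest counts only if at least k of its trees contain the subtree) and then specialised to k = 1. This is a minimal sketch under stated assumptions: each tree is abstracted as the set of subtree patterns it contains, and the names `support`, `is_frequent`, and `min_support` are hypothetical rather than taken from the paper.

```python
def support(forests, pattern, k=1):
    """Number of forests in which at least k trees contain `pattern`.

    Assumption: each forest is a list of trees, and each tree is abstracted
    as the set of subtree patterns it contains.
    """
    return sum(
        1
        for forest in forests
        if sum(1 for tree in forest if pattern in tree) >= k
    )


def is_frequent(forests, pattern, min_support, k=1):
    """A pattern is frequent if its support is greater than the threshold."""
    return support(forests, pattern, k) > min_support


# Hypothetical data: three ontologies (forests) with toy patterns "p" and "q".
FORESTS = [
    [{"p", "q"}, {"p"}],  # two trees contain "p"
    [{"p"}, {"q"}],       # one tree contains "p"
    [{"q"}],              # no tree contains "p"
]
```

With this data, `support(FORESTS, "p")` counts two forests for k = 1 but only one for k = 2, which shows how raising k shifts the emphasis from cross-ontology co-occurrence to repeated use within an ontology.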
Section 4.5
- „which can be translated into an axiom–centric representation as…“ The tree2axiom ‚decoding‘ algorithm might be intuitive but completely omitting it is not a good thing though.
- „We favor object properties and class constructors over datatype properties and datatypes, e.g., in rare cases, when there are no children for a node labeled some, we add variables for an object property, and a class constructor, instead of variables for a datatype property and a datatype.“ This seems to be a distortion of the original data. Wouldn’t it be possible to generalize to ‚property‘ and ‚type‘ then?
Section 4.6
- Again, the class frame mining algorithm is not explicitly introduced.
- Formulas should be numbered.
Section 5.1
- „One of patterns of size 43 in the NEMO ontology…“ It is not clear to me if this paragraph is an ad hoc chosen example or if it brings some more general lesson learned.
- „We looked mainly at top–level and middle–level ontologies.“ How did you recognize them?
- Fig. 12 precedes Fig. 11. It is also not clear why it is inserted as screenshot and not just as text.
- You should mention if you also got specific classes as LHS as you explained in 4.6: „In the simplest case, there might be axiom fragments that have a named class on their left–hand side“. You do not show such a case in Table 8.
Section 5.3
- The fourth fragment contains the expression „(?classexpr or ?classexpr ...)“ – should it rather be „(?classexpr1 or ?classexpr2 ...)“, to keep notation coherent with other tables?
- „The pattern presented in Table 8 describes a class that only has a numeric identifier“ I would say that it rather describes the common features of *subclasses* of this class?
- „cardinal restrictions“ – probably „cardinality restrictions“?
- „Another interesting result is discovering several class frame fragments, which contain a part that is not included in the original ODP“ – but how do we know that this part is a plausible addition to the ODP?
- „For example, we were able to automatically mine the ’assay’ design pattern“ – if I understand right, you only reconstruct a part of it. How do we know that it is a crucial part?
Section 6
- … is called Discussion, but a large part of it is rather a continuation or summary of the experiments than a discussion with broader focus.
- „Do such reused ontology fragments exist in a set of ontologies?“ Maybe „…appear across“ would be clearer.
- „We call them emerging content design patterns, in contrast, to structural design patterns from the ODP Portal, which we can not discover automatically, as they are merely ideas, and do not use any concrete vocabulary.“ Do you speak about ODP Portal LODPs or CODPs? CODPs clearly do use specific vocabulary.
- Fig. 13 caption: „The selected corresponding class–frame fragments, …“ You should perhaps attempt to quantify the degree of such correspondence (a kind of ‚ground truth reconstruction‘), in graph terms.
- „We note that BioPortal hosts a relatively well–described set of ontologies.“ You mean, well described in terms of associated papers or structured metadata?
I suggest that the tables not be relegated to the appendix – it would be more practical to have them closer to the associated text.
For tables in general: you should emphasize before referring to the first one that you display labels instead of IRIs where possible. This might be common practice in the biomed ontology world but not all over the semantic web.
The biblio has the usual problem – bibtex-decapitalization of some terms (obi, protege, owl, webprotege, neon). There is also some issue with items 8 and 32: they both contain another reference to [8], which does not make sense.
To summarize along the main review axes:
- Originality: I find the problem addressed (mining sensible patterns from ontologies, well balancing the degree of domain entity generalization) not yet sufficiently tackled, and the approach taken is novel in many respects.
- Significance of the results: The output of the method could be practically useful for ontology designers, tool developers and users. Yet, the paper could be improved by involving expert users in evaluation of the mined fragments. Comparison with the results by Mikroyannidi et al. (on a sample) would also be a nice addition: if I understand right, the difference is in your two-step approach (mining itemsets of previously abstracted patterns) vs. their more direct but selective generalization of concrete entities to patterns. I wonder what this would yield!
- Quality of writing: significant improvement is required.

My major revision requirement is mainly due to the last point. However, I hope some improvement for the second can be expected, too.

[1] Christoph Tempich, Raphael Volz: Towards a benchmark for Semantic Web reasoners - an analysis of the DAML ontology library. EON 2003.

Review #2
By Eva Blomqvist submitted on 23/Mar/2016
Major Revision
Review Comment:

The paper describes a method for discovering patterns in ontologies, based on discovering frequent axiom sets and what is called class frames. The authors then attempt to connect the discovered patterns to the notion of Ontology Design Patterns (ODPs). The paper is very interesting to read, although a bit unclear and confusing in some parts, and the method is definitely original, although others have made (imho considerably weaker) attempts to do similar things before. I am happy to read that these authors have really taken the task seriously and indeed found a way to do this! The problem addressed is an important one, and a solution like this has the potential to greatly impact ODP research in the years to come. The main problem of the paper however is the quality of the presentation, and some issues concerning when the authors cross the line from mere "patterns" (according to their definition) to ODPs. In summary, I really like the idea and the presented solution, it is novel and highly valuable, but the authors need to improve the quality of the presentation and discussion in order for the paper to be published, further details given below.

The only quality issue I see that also partly relates to the work itself, and not only to presentation, is the issue of what is a "pattern" and what is an ODP? The authors move too easily between these notions without going into detail on their relation and in what way they differ. In my opinion, a pattern (according to the authors' definition) may be nothing at all in terms of ODPs, i.e., it may be a frequently occurring class frame with some variables, but which is completely meaningless when it comes to being a coherent design solution for a small modelling problem. The authors themselves discuss that some frequent axiom fragments without variables are meaningless in terms of ODPs, they are just artefacts of the design process, and give an example in Figure 12. However, I would expect that the same could be true also for sets with variables, i.e., patterns - what is the guarantee that these will be any more meaningful, or useful, as ODPs? I would not expect the authors to present an automated solution for distinguishing such "meaningless patterns" from ODPs in the paper, but I would expect much more attention to be put on this issue in the paper, e.g. by including:
- A discussion on the difference between their pattern notion and the ODP notion already in the introduction and also when their notion of "pattern" is defined, as well as making this clearer when the contributions are listed in section 1.
- Explanations in the experiment setup sections that clarify how you (probably manually) manage to decide if a "pattern" is an ODP or not, which is used later on in your analysis.
- References back to the two points above in the discussion section, when answering the research questions, where you should also be more stringent when saying what was actually detected by the method. Plus more discussion on what this means for the answers to your questions.
- Include this issue in the limitations section.
- Reduce the claim in the second paragraph of the conclusions, first sentence, to something that is actually based on the results of the study, i.e. that you are able to detect (emerging) patterns, that may or may not be ODPs, but that you confirmed manually that some of them were indeed ODPs.

Apart from this issue, I only see minor issues in the paper, but there are quite a number of them, which warrants my suggestion for a major revision:

- The term elements in the abstract is not clear, nor defined later in the paper. This would not be problematic if the sentence was less exact, but here you say that you count the elements, then the reader should know what you are counting.

- I have a bit of a problem with the term "reconstruct" which is used already in the abstract, and then throughout the paper. To me reconstruct indicates that you are rebuilding something that was once there but is not there any more, i.e., that you are constructing something, whereas, as I interpret it, what you are actually doing is detecting something that has been there all along. I am not sure I have a good alternative term, but maybe "detect", "discover" or "map out" or something like that.

- The last sentence of the abstract presents two future uses of these results, however, these are not stressed and explained very much in the paper. They appear, a bit hidden, at the end of section 6.1 but I think they deserve a more prominent place, e.g., a separate section discussing the potential uses of the results and future work that is needed for that.

- The paper would greatly benefit from a "running example", in terms of a small ontology that can be used to exemplify everything from the definitions in section 3 to the encoding and methods in section 4. As it is now the first example appears on page 5, and then various different example axioms are used throughout section 4. I would suggest to use one larger ontology (just a few more axioms), set it in a domain with "real" class names rather than having class names like A, B, C, and then use the same ontology as an example throughout the whole paper.

- What do you mean with "fragments of axioms" on page 2? After the comma you say "sets of axioms", do you mean that that is a synonym? I would interpret a fragment of an axiom to be a part of one axiom, rather than a set of axioms.

- Similarly, I feel there is a bit of confusion around several of the other terms used throughout the paper, some of them are not very intuitive, others need to be properly defined or they could potentially be reformulated if you are just using synonyms of things you already define. These are some examples, but please go through the flora of terms used and try to also be a bit more consistent in their use:
-- What is an "ontology fragment"? The term is somewhat implicitly defined in section 3.2, by defining what a "frequent ontology fragment" is, but on its own it is not defined. Also, there are quite a few mentions of this term before section 3.2, so I would suggest to at least give an intuitive feeling to the reader of what it is already in the introduction. Further, the term is rarely used in a large part of the paper (e.g. not even once on pages 5-8, where you actually describe the methods you use), and then reappears when discussing the experiments, which means the reader has then potentially forgotten it and has to return to the definition again.
-- "fragments in ontologies" on page 2 is not so clear either, what are they fragments of that in turn reside in the ontologies, do you mean axiom fragments in ontologies?
-- You define "axiom fragment" as either an axiom where something is replaced by a variable, or where something was completely removed from the original axiom. In the second case I intuitively associate to the term "fragment" but in the former case I instead associate to the term "pattern" (e.g. as in "triple pattern" in SPARQL) or "template". It is a bit confusing to the reader that these two things are bundled together under the term "axiom fragment".

- When you specify the research questions on page 2, they are not numbered, but later when you revisit them they are, and actually then several of them are combined into one. It would be better to number them also on page 2 and then use the same numbers and grouping of questions later on. One thing to consider is also the formulation of the questions and use of terminology in them. As it is now, they use all the terms defined later in section 3, which means that the questions are not very clear to the reader when they appear in the paper. Can they be expressed more generally in more intuitive terms, or at least add references to definitions later on?

- Question 5: What do you mean by "ODPs proposed for them"? ODPs are not usually proposed for a specific ontology only, so I guess you mean something like "claimed to be used by them"?

- There are two minor (and very preliminary) works [1,2] that are not cited in the related work section, which the authors may consider citing simply for completeness, but it is certainly not necessary for the paper to include them. These were both by master students of mine who explored the potential of various existing methods to detect patterns in existing ontologies, but both achieved mostly negative results, i.e., they tested "obvious methods" that did not work very well.

- Section 3 is the section of the paper that I think needs the most work. A few comments, questions and suggestions:
-- Number all definitions so that they can be referenced from other parts of the paper.
-- It is not entirely clear why some things are defined, but others are assumed to be known, e.g., "tree" is given a definition/explanation, but "root node" and "path" are not defined.
-- Is the definition of "forest" really a standard one? I understand that you work with labeled trees, but it seems strange to me that a forest cannot be formed by any kind of trees, but only labeled trees.
-- I am not sure I understand the difference between an induced and an embedded subtree based on the definition/descriptions given, please make it a bit more clear. Maybe an example?
-- Some letters are used several times for slightly different things, e.g. in section 3.1 N is a set of nodes, while in 3.2 N_c is a set of names (node labels I guess?), and F is a forest in 3.1 but a set of frequent axiom fragments in 3.2.
-- I find the whole section 3.2 a bit confusing with respect to the ontology as entities (which is explained as the vocabulary of the ontology) plus class expressions vs. the view of axioms as graphs from 3.1. N_c is explained as the set of class names in the ontology, i.e., a set of strings or node labels I guess, rather than a set of nodes, which is the feeling I get later in the paper. This could be clarified.
-- What is a well-formed literal?
-- Right below the grammar listing you say that A is in N_c and then that a is in V_I, is that correct or should it be N_I? You also say that lit is a literal, but shouldn't it then be part of N_lit as defined on the previous page?
-- I think that N_CC deserves a bit more explanation. You define everything else very carefully, but then assume that the reader is familiar with what a class constructor is, and what this set then contains.
-- You mention non-logical axioms, I have not heard this term before, is it a standard term to denote some OWL expressions?
-- You mention the term "subset-maximal" but it is not defined or even explained. I assume that it is the same thing as "maximal ..." that appears later in section 4.5.
-- In the definition of axiom fragment I am not sure why you need the union of the ontology vocabulary and all variables, what does this give you?
-- The definition of FREQUENT axiom fragment seems quite strange to me - in what way is the axiom fragment frequent just because it is decoded and some variables are added? Doesn't it have anything to do with the frequency of occurrence? Similar objections apply to also frequent class frame fragment and frequent ontology fragment.
-- Some things are introduced but never used, such as the letter Q for denoting a pattern.

- Section 4.1 seems to belong to the experiment setup, rather than the generic methods developed - move to section 5?

- Why do you only use BioPortal ontologies for your experiments? What could this mean in terms of skewed results or other consequences? Is this suitable to discuss as a limitation, or do you only intend to talk about limitations of the method as such and not the experiments in that section?

- What does "rendered out" mean?

- You used the OWL API to extract axioms "relevant for this work", which are those exactly?

- Fig 1 does not really tell me much when it uses a logarithmic scale. If it is impossible to fit with a standard scale then consider removing the figure.

- Towards the end of section 4.2 you again refer forward in the paper, this time to section 4.3. Overall, you should reconsider the order in which you present things; there are many references back and forth in the paper. I would suggest at least trying to minimise the need for referring to later sections, as such references usually indicate that the order of presentation may not be the right one.

- I find section 4.3 quite hard to understand. Is the algorithm too complex/lengthy to present in pseudocode? If not, please include it. Otherwise please try to clarify the steps that are mentioned in the text, and use the running example I suggested earlier.

- The last sentence of the first paragraph on page 7 (section 4.4) does not make sense to me. There you say that "all induced subtrees of a frequent induced subtree must also be frequent" but in the sentence before you talk about embedded subtrees. Potentially this has to do with the fact that I did not really get the difference of these two notions earlier?

- In the first sentence of 4.5 you mention "frequent subtrees", I assume you mean it does not matter if they are induced or embedded in this context?

- Towards the end of the first paragraph of 4.5 you give an example and say "subtrees of a real pattern" - but is this a pattern, since there is no variable until you rewrite it, or did I miss something here and this is how all patterns are constructed?

- 4.5, second paragraph, first sentence: in the definition of "frequent axiom fragment" there is no mention of subClassOf and EquivalentTo, these do not appear until you define the class frame fragment. Is there an error here?

- Figure 9: what is propositionalisation? It is mentioned also in the text but to me it is not obvious what it means. Additionally, why are the transactions denoted by c in the figure while t in the text? Overall, this figure would benefit from some more explanation in the text.

- On page 8, when you define Z, this is the first time in the paper that you mention what it means for something to be "frequent", I would expect this to appear already in the preliminaries, when you discuss frequent axiom fragments etc.

- It is unclear to me if you handle class frame fragments with lhs? on the left hand side or not? In the first paragraph of 4.6 it sounds like you do, but then later in the fourth paragraph it sounds as if there has to be a class A, i.e. known class, on the left hand side. This confusion also carries on a bit later in the paper, and in section 5.1 you again claim to have variables on the left hand side since you measure the fraction that has that.

- There are several things in the experiment section (section 5) that should be better motivated and explained. For instance, why do you choose 1% as the support threshold? What is the rationale, what are the consequences and what could this be in absolute numbers? Why do you decide to divide the set of ontologies by size into a couple of groups? What made you think that these groups would be interesting to look at? Do they represent typical kinds of ontologies as well? What does "popular" mean in your experiment setting? Certain number of uses/users? Why do you want/need to select ontologies based on popularity? What are the consequences of doing that?

- I assume that you mean that the size of an ontology in your experiment is the sum of the number of each of the two axiom types ("and" is a bit ambiguous there)?

- There is not much analysis done on the results in some parts of section 5. For instance, you say that it is interesting that frequent axiom size increases with the size of the ontology, but do you have any idea why this may be the case, since you find it interesting?

- Page 9, sixth paragraph: I am not sure what you mean with "concrete left-hand side construct" here.

- Page 9, paragraph 8: So what does this mean? That they made some strange design decision to duplicate information? So this would be a case of discovering a bad modelling choice actually?

- Page 9, paragraph 9: Why is it interesting to know how many contain a variable on the lhs? Also, I am not sure how interesting the numbers are, considering the +/- 32% and +/-17%.

- Top-level and middle-level ontologies are never defined. There may be more general agreement on what a top-level ontology is, but I am fairly sure there is no general agreement on what a middle-level ontology is, so you should define this, unless there is some definition specific to BioPortal?

- Figure 10 is a bit hard to read. Sizes on the x-axis are hard to read when the paper is printed, and could be made bigger. Scales on the y-axis should be more detailed, especially for diagram c, where it is not so easy to determine what approximate values the boxes refer to.

- Figures 12 and 13 differ quite a lot in notation/presentation, while 12 looks like a screenshot of something, 13 is much more readable.

- Page 11, first paragraph: "can be"? Do you mean that this is an upper bound?

- Page 11, second paragraph: "The top pattern... turned out not to be a real pattern" Is it a pattern or not? If it is not a pattern, i.e. no variable, then I guess it cannot be the top pattern in the first place.

- In the complete section 5 it would be very good if you could refer back to definitions in your preliminaries and experiment setup, once you have numbered those, so that the readers can more easily remind themselves of what the terms mean.

- Table 1: "selected" - how? Why? By what criteria?

- Am I correct in thinking that you only look at class frame fragments recurring within one ontology, not across ontologies? Or am I misunderstanding section 5.3 and Table 8?

- Page 12: It sounds like you are saying that [1] lists all the patterns in CCO, but you found one that is not included, is that the case (I did not look at [1])? Or is it that [1] happens to describe some example pattern, and you found another example? There is a slight difference in what your achievement means in these two cases.

- Page 13, first paragraph: decided to present, where?

- Page 13: Second part of the 5th paragraph seems more related to the 4th paragraph than the first part of the 5th - move up? The last sentence of the 4th paragraph is however not so clear - which class frame fragments do you mean?

- 6.1, first question: is it important to talk about those that were not converted properly here? Why? In that case how many are they? "As with all our results" in the next paragraph seems to be a strange thing to say. Last paragraph: so what about the middle-sized ones?

- 6.1, third question, last paragraph: this whole paragraph needs to be rewritten in response to the discussion on patterns vs. ODPs, i.e. my major point at the beginning of this review. Also, since your patterns may contain mostly variable in some cases, I guess they could be very similar to structural, i.e. in particular logical, ODPs, so I am not sure how you come to the conclusion in the last sentence.

- 6.1, fourth question: I guess you are not detecting the fact that there are specialisations automatically, but rather this is your manual analysis, right? This could be made clearer. Throughout this discussion it is also not so clear if you are talking about just one example, or if you have observed some general trends in your data. Further, do you have any estimate of how many ODPs you were able to find, as opposed to just patterns?

- 6.2: can you give an example of the limitation in the first paragraph? What are the consequences of the second limitation?

- Conclusions: What do you mean by "reused" in the first sentence? It may not be the case that they were reused from somewhere, but rather "invented again". Further, the second paragraph contains claims that are too strong. Next, the link is duplicated, you already provided it in section 1. Additionally, I am not sure why you claim that your method is data-driven? It still relies only on the ontology and not on data, right? This is also one of the few places where you talk about "ontology modules" - either incorporate that notion throughout the paper or remove it from the conclusions.

- Tables 6 and 8 have the caption below the table, while all others have it above.

Language issues:
- Avoid using contractions like "hasn't" in formal text.
- Page 4, section 4.1: containing -> contains
- Page 9: one of patterns -> one of the patterns
- 5.3, 3rd paragraph: shown -> show
- Page 12, first para: this -> these


Review #3
By Heiko Paulheim submitted on 21/Apr/2016
Major Revision
Review Comment:

The paper introduces an approach for automatically discovering design patterns in a collection of ontologies. The authors propose a combination of tree mining and frequent pattern mining to discover class frames with variables that recur across ontologies. They evaluate their approach on the BioPortal ontology collection, showing both that there are a number of recurring patterns, as well as that ontology design patterns from the literature can be reconstructed.

Overall, the paper is well-written and very interesting to read, in particular since illustrative examples are used throughout the whole paper. The methods used are properly grounded in the literature. Furthermore, I deeply appreciate the approach of backing classical ontology design research works with an empirical, data-driven analysis, even more so since the empirical results show that there is a measurable adoption of the research work on ODPs.

The authors mention the Blomqvist et al. paper for referring to different types of ODPs. Additionally, the article "Ontology Design Patterns" by Gangemi and Presutti [1] also distinguishes six types of ontology design patterns, i.e., structural, correspondence, content, reasoning, presentation, and lexico-syntactic ODPs. All of those types (from both papers) have different characteristics, and I suppose that the approach is not capable of discovering each of those, at least not equally well. In a revised version, I would appreciate a more thorough discussion of which of those types of ODPs the approach is capable of discovering and which it is not, and for which it could, at least in theory, be adapted.

In section 5, the authors present break-down results by ontology size, which is very interesting. It might be interesting to analyze other break-downs as well. A break-down by ontology language (OWL, OBO, etc.) would be straightforward. I am not an expert in BioPortal, but maybe a topical/domain-based break-down (e.g., using tags given to the ontologies) might be interesting. It would also be interesting to see if the size clusters correspond to some other break-down, e.g., whether the largest ones are always from the same topical domain. Speaking of a language-based break-down, it would be interesting to see if fragments/patterns recur across languages or mainly in ontologies encoded in the same language.

For the final research question (can ODPs proposed in the literature be rediscovered in actual ontologies), I would have appreciated a more quantitative answer, e.g.: from catalogue X / paper Y, n out of N patterns have been rediscovered in the BioPortal collection.

The fragments presented in the tables are interesting to see, but not all of them seem to be genuine ODPs. E.g., in Table 2, the fragments shown for UBERON and NIFSUBCELL are not what I would consider to be a design pattern. Here, a more thorough analysis would be appreciated, both of the results and of ideas for future work on telling a reused fragment from an actual design pattern. Furthermore, I would expect a more in-depth consideration of other possible reasons for which fragments may appear frequently (copying from ontologies has been mentioned in the paper, but is this the only one?).

Another suggestion for future work: the book by Allemang and Hendler [2] contains a list of anti-patterns. It would be interesting to see whether those can be discovered in the ontology collection as well.

Minor issues:
* in 3.1, there is a labeling function for nodes, but not for edges, so it looks as if there is only one type of edge. This is contradictory to the general notion of RDF/OWL, where edges also come with labels. If this is a mistake, it should be corrected; if it is incidental, it should be explained.
* in Fig. 3a+b, the translation of T1 to a string encoding is only unambiguous if the subtrees are ordered, but for trees derived from OWL ontologies, they usually are not. Maybe I missed it in some subordinate sentence, but imho, to make the approach work, a specific ordering of the subtrees is required. This is not a show stopper (a simple lexicographic ordering would do), but it should be discussed in the text.
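
To illustrate the ordering point, here is a minimal sketch of how a lexicographic ordering of subtrees could make a string encoding independent of sibling order. It is not taken from the paper: the tuple representation `(label, children)`, the function name `encode`, and the `"-1"` backtrack marker (a convention common in tree-mining encodings) are all assumptions made for this example.

```python
# Hypothetical backtrack marker; the paper's actual encoding may use another symbol.
BACKTRACK = "-1"


def encode(tree):
    """Return a canonical token list for an unordered labeled tree.

    A tree is a pair (label, children), where children is a list of trees.
    """
    label, children = tree
    # Encode every subtree first, then sort the encodings lexicographically,
    # so that two trees differing only in sibling order get the same result.
    encoded_children = sorted(encode(child) for child in children)
    tokens = [label]
    for child_tokens in encoded_children:
        tokens.extend(child_tokens)
        tokens.append(BACKTRACK)  # step back up to the parent node
    return tokens


# Two trees that differ only in the order of their subtrees.
t1 = ("SubClassOf", [("A", []), ("ObjectSomeValuesFrom", [("p", []), ("B", [])])])
t2 = ("SubClassOf", [("ObjectSomeValuesFrom", [("B", []), ("p", [])]), ("A", [])])

print(" ".join(encode(t1)))
print(encode(t1) == encode(t2))
```

With such a normalisation, both example trees map to the same token sequence, so a tree miner working on the string encoding would count them as occurrences of the same fragment.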

Summarizing: I really like the idea of the paper, the direction taken, and the presentation of the approach and results. However, some improvements are required before publication.

[1] Gangemi and Presutti: "Ontology Design Patterns". In: Handbook on ontologies, 2009
[2] Allemang and Hendler: "Semantic Web for the Working Ontologist", 2008