Review Comment:
The paper describes a method for discovering patterns in ontologies, based on discovering frequent axiom sets and what are called class frames. The authors then attempt to connect the discovered patterns to the notion of Ontology Design Patterns (ODPs). The paper is very interesting to read, although a bit unclear and confusing in some parts, and the method is definitely original, although others have made (imho considerably weaker) attempts to do similar things before. I am happy to read that these authors have really taken the task seriously and indeed found a way to do this! The problem addressed is an important one, and a solution like this has the potential to greatly impact ODP research in the years to come. The main problem of the paper, however, is the quality of the presentation, together with some issues concerning where the authors cross the line from mere "patterns" (according to their definition) to ODPs. In summary, I really like the idea and the presented solution; it is novel and highly valuable, but the authors need to improve the quality of the presentation and discussion in order for the paper to be published. Further details are given below.
The only quality issue I see that also partly relates to the work itself, and not only to the presentation, is the question of what is a "pattern" and what is an ODP. The authors move too easily between these notions without going into detail on how they relate and in what way they differ. In my opinion, a pattern (according to the authors' definition) may be nothing at all in terms of ODPs, i.e., it may be a frequently occurring class frame with some variables that is nevertheless completely meaningless as a coherent design solution for a small modelling problem. The authors themselves discuss that some frequent axiom fragments without variables are meaningless in terms of ODPs, being just artefacts of the design process, and give an example in Figure 12. However, I would expect that the same could be true also for sets with variables, i.e., patterns - what is the guarantee that these will be any more meaningful, or useful, as ODPs? I would not expect the authors to present an automated solution for distinguishing such "meaningless patterns" from ODPs in the paper, but I would expect much more attention to be paid to this issue, e.g. by including:
- A discussion on the difference between their pattern notion and the ODP notion already in the introduction and also when their notion of "pattern" is defined, as well as making this clearer when the contributions are listed in section 1.
- Explanations in the experiment setup sections that clarify how you (probably manually) manage to decide if a "pattern" is an ODP or not, which is used later on in your analysis.
- References back to the two points above in the discussion section, when answering the research questions, and also more stringency there when stating what was actually detected by the method, plus more discussion on what this means for the answers to your questions.
- Include this issue in the limitations section.
- Reduce the claim in the second paragraph of the conclusions, first sentence, to something that is actually based on the results of the study, i.e. that you are able to detect (emerging) patterns, that may or may not be ODPs, but that you confirmed manually that some of them were indeed ODPs.
Apart from this issue, I only see minor issues in the paper, but there are quite a number of them, which warrants my suggestion for a major revision:
- The term "elements" in the abstract is not clear, nor is it defined later in the paper. This would not be problematic if the sentence were less exact, but since you say that you count the elements, the reader should know what you are counting.
- I have a bit of a problem with the term "reconstruct", which is used already in the abstract and then throughout the paper. To me, "reconstruct" indicates that you are rebuilding something that was once there but is not there any more, i.e., that you are constructing something, whereas what you are actually doing, as I interpret it, is detecting something that has been there all along. I am not sure I have a good alternative term, but maybe "detect", "discover", "map out" or something like that.
- The last sentence of the abstract presents two future uses of these results, however, these are not stressed and explained very much in the paper. They appear, a bit hidden, at the end of section 6.1 but I think they deserve a more prominent place, e.g., a separate section discussing the potential uses of the results and future work that is needed for that.
- The paper would greatly benefit from a "running example", in terms of a small ontology that can be used to exemplify everything from the definitions in section 3 to the encoding and methods in section 4. As it is now, the first example appears on page 5, and then various different example axioms are used throughout section 4. I would suggest using one larger ontology (just a few more axioms), setting it in a domain with "real" class names rather than class names like A, B, C, and then using the same ontology as an example throughout the whole paper.
- What do you mean with "fragments of axioms" on page 2? After the comma you say "sets of axioms", do you mean that that is a synonym? I would interpret a fragment of an axiom to be a part of one axiom, rather than a set of axioms.
- Similarly, I feel there is a bit of confusion around several of the other terms used throughout the paper; some of them are not very intuitive, others need to be properly defined, or they could potentially be reformulated if they are just synonyms of things you already define. These are some examples, but please go through the flora of terms used and try to also be a bit more consistent in their use:
-- What is an "ontology fragment"? The term is somewhat implicitly defined in section 3.2, by defining what a "frequent ontology fragment" is, but on its own it is not defined. Also, there are quite a few mentions of this term before section 3.2, so I would suggest to at least give an intuitive feeling to the reader of what it is already in the introduction. Further, the term is rarely used in a large part of the paper (e.g. not even once on pages 5-8, where you actually describe the methods you use), and then reappears when discussing the experiments, which means the reader has then potentially forgotten it and has to return to the definition again.
-- "fragments in ontologies" on page 2 is not so clear either, what are they fragments of that in turn reside in the ontologies, do you mean axiom fragments in ontologies?
-- You define "axiom fragment" as either an axiom where something is replaced by a variable, or where something was completely removed from the original axiom. In the second case I intuitively associate to the term "fragment" but in the former case I instead associate to the term "pattern" (e.g. as in "triple pattern" in SPARQL) or "template". It is a bit confusing to the reader that these two things are bundled together under the term "axiom fragment".
- When you specify the research questions on page 2, they are not numbered, but later when you revisit them they are, and then several of them are actually combined into one. It would be better to number them also on page 2 and then use the same numbers and grouping of questions later on. One thing to consider is also the formulation of the questions and the use of terminology in them. As it is now, they use all the terms defined later in section 3, which means that the questions are not very clear to the reader when they first appear. Can they be expressed more generally in more intuitive terms, or at least with references to the later definitions added?
- Question 5: What do you mean by "ODPs proposed for them"? ODPs are not usually proposed for a specific ontology only, so I guess you mean something like "claimed to be used by them"?
- There are two minor (and very preliminary) works [1,2] that are not cited in the related work section, which the authors may consider citing simply for completeness, but it is certainly not necessary for the paper to include them. These were both master students of mine who explored the potential of various existing methods to detect patterns in existing ontologies, but both achieved mostly negative results, i.e., they tested "obvious methods" that did not work very well.
- Section 3 is the section of the paper that I think needs the most work. A few comments, questions and suggestions:
-- Number all definitions so that they can be referenced from other parts of the paper.
-- It is not entirely clear why some things are defined, but others are assumed to be known, e.g., "tree" is given a definition/explanation, but "root node" and "path" are not defined.
-- Is the definition of "forest" really a standard one? I understand that you work with labeled trees, but it seems strange to me that a forest cannot be formed by any kind of trees, but only labeled trees.
-- I am not sure I understand the difference between an induced and an embedded subtree based on the definition/descriptions given, please make it a bit more clear. Maybe an example?
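To illustrate the kind of example I have in mind (the encoding and function names here are my own sketch, not the paper's): in a chain a -> b -> c, the node pair (a, b) forms an induced subtree because the parent-child edge is preserved, while (a, c) is only an embedded subtree, since c is a descendant but not a child of a.

```python
# A labelled tree encoded as {node: parent}; the root has parent None.
# Example tree: a -> b -> c (c is a grandchild of a).
tree = {"a": None, "b": "a", "c": "b"}

def is_child(tree, parent, node):
    """True if `node` is a direct child of `parent` (induced-subtree edge)."""
    return tree.get(node) == parent

def is_descendant(tree, ancestor, node):
    """True if `ancestor` lies on the path from `node` up to the root
    (embedded-subtree edge)."""
    while node is not None:
        node = tree.get(node)
        if node == ancestor:
            return True
    return False

# (a, b): parent-child edge preserved -> induced (and hence also embedded).
print(is_child(tree, "a", "b"))       # True

# (a, c): only the ancestor relation is preserved -> embedded but not induced.
print(is_child(tree, "a", "c"))       # False
print(is_descendant(tree, "a", "c"))  # True
```

Something along these lines, adapted to the paper's own notation, would make the distinction immediately clear.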
-- Some letters are used several times for slightly different things, e.g. in section 3.1 N is a set of nodes, while in 3.2 N_c is a set of names (node labels I guess?), and F is a forest in 3.1 but a set of frequent axiom fragments in 3.2.
-- I find the whole of section 3.2 a bit confusing with respect to the view of the ontology as entities (explained as the vocabulary of the ontology) plus class expressions vs. the view of axioms as graphs from 3.1. N_c is explained as the set of class names in the ontology, i.e., a set of strings or node labels I guess, rather than a set of nodes, which is the feeling I get later in the paper. This could be clarified.
-- What is a well-formed literal?
-- Right below the grammar listing you say that A is in N_c and then that a is in V_I, is that correct or should it be N_I? You also say that lit is a literal, but shouldn't it then be part of N_lit as defined on the previous page?
-- I think that N_CC deserves a bit more explanation. You define everything else very carefully, but then assume that the reader is familiar with what a class constructor is, and what this set then contains.
-- You mention non-logical axioms, I have not heard this term before, is it a standard term to denote some OWL expressions?
-- You mention the term "subset-maximal" but it is not defined or even explained. I assume that it is the same thing as the "maximal ..." that appears later in section 4.5.
-- In the definition of axiom fragment I am not sure why you need the union of the ontology vocabulary and all variables, what does this give you?
-- The definition of FREQUENT axiom fragment seems quite strange to me - in what way is the axiom fragment frequent just because it is decoded and some variables are added? Doesn't it have anything to do with the frequency of occurrence? Similar objections apply also to frequent class frame fragment and frequent ontology fragment.
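What I would expect "frequent" to mean, in line with standard frequent pattern mining, is something like the following sketch (the axiom strings and the generalise() placeholder are my own invention, only meant to show the role of a support threshold):

```python
from collections import Counter

def generalise(axiom):
    """Hypothetical fragment extraction: replace the subject class
    name of an axiom with a variable ?X."""
    subject, rest = axiom.split(" ", 1)
    return "?X " + rest

axioms = [
    "A SubClassOf partOf some B",
    "C SubClassOf partOf some B",
    "D SubClassOf hasRole some E",
]

# Support = the number of axioms a fragment occurs in; a fragment is
# frequent only if its support reaches a minimum threshold.
support = Counter(generalise(ax) for ax in axioms)
min_support = 2
frequent = {frag for frag, n in support.items() if n >= min_support}
print(frequent)  # {'?X SubClassOf partOf some B'}
```

If this is indeed what is intended, the definition should say so explicitly; if not, the term "frequent" is misleading.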
-- Some things are introduced but never used, such as the letter Q for denoting a pattern.
- Section 4.1 seems to belong to the experiment setup, rather than the generic methods developed - move to section 5?
- Why do you only use BioPortal ontologies for your experiments? What could this mean in terms of skewed results or other consequences? Is this suitable to discuss as a limitation, or do you only intend to talk about limitations of the method as such and not the experiments in that section?
- What does "rendered out" mean?
- You used the OWL API to extract axioms "relevant for this work", which are those exactly?
- Fig 1 does not really tell me much when it uses a logarithmic scale. If it is impossible to fit with a standard scale, then consider removing the figure.
- Towards the end of section 4.2 you again refer forward in the paper, this time to section 4.3. Overall you should reconsider the order in which you present things; there are many, many references back and forth in the paper. I would suggest at least trying to minimise the need for referring to later sections, as such references usually indicate that the order of presentation may not be the right one.
- I find section 4.3 quite hard to understand. Is the algorithm too complex/lengthy to present in pseudocode? If not, please include it. Otherwise please try to clarify the steps that are mentioned in the text, and use the running example I suggested earlier.
- The last sentence of the first paragraph on page 7 (section 4.4) does not make sense to me. There you say that "all induced subtrees of a frequent induced subtree must also be frequent" but in the sentence before you talk about embedded subtrees. Potentially this has to do with the fact that I did not really get the difference of these two notions earlier?
- In the first sentence of 4.5 you mention "frequent subtrees", I assume you mean it does not matter if they are induced or embedded in this context?
- Towards the end of the first paragraph of 4.5 you give an example and say "subtrees of a real pattern" - but is this a pattern, since there is no variable until you rewrite it, or did I miss something here and this is how all patterns are constructed?
- 4.5, second paragraph, first sentence: in the definition of "frequent axiom fragment" there is no mention of subClassOf and EquivalentTo; these do not appear until you define the class frame fragment. Is there an error here?
- Figure 9: what is propositionalisation? It is mentioned also in the text but to me it is not obvious what it means. Additionally, why are the transactions denoted by c in the figure while t in the text? Overall, this figure would benefit from some more explanation in the text.
- On page 8, when you define Z, this is the first time in the paper that you mention what it means for something to be "frequent", I would expect this to appear already in the preliminaries, when you discuss frequent axiom fragments etc.
- It is unclear to me whether you handle class frame fragments with lhs? on the left-hand side or not. In the first paragraph of 4.6 it sounds like you do, but later, in the fourth paragraph, it sounds as if there has to be a class A, i.e. a known class, on the left-hand side. This confusion also carries on a bit later in the paper, and in section 5.1 you again claim to have variables on the left-hand side, since you measure the fraction that has that.
- There are several things in the experiment section (section 5) that should be better motivated and explained. For instance, why do you choose 1% as the support threshold? What is the rationale, what are the consequences and what could this be in absolute numbers? Why do you decide to divide the set of ontologies by size into a couple of groups? What made you think that these groups would be interesting to look at? Do they represent typical kinds of ontologies as well? What does "popular" mean in your experiment setting? Certain number of uses/users? Why do you want/need to select ontologies based on popularity? What are the consequences of doing that?
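On the 1% threshold in particular, a back-of-the-envelope calculation would already help the reader; the ontology sizes below are hypothetical, just to show how differently a relative threshold bites for small vs. large ontologies:

```python
import math

# What a 1% relative support threshold means in absolute counts
# for ontologies of different (hypothetical) sizes.
for n_axioms in (200, 5_000, 100_000):
    abs_support = math.ceil(0.01 * n_axioms)
    print(f"{n_axioms:>7} axioms -> fragment must occur >= {abs_support} times")
# 200 axioms -> fragment must occur >= 2 times
# 5000 axioms -> fragment must occur >= 50 times
# 100000 axioms -> fragment must occur >= 1000 times
```

In other words, in a small ontology two occurrences already count as "frequent", while a large one requires hundreds; stating the resulting absolute numbers for your actual size groups would make the choice much easier to assess.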
- I assume that you mean that the size of an ontology in your experiment is the sum of the number of each of the two axiom types ("and" is a bit ambiguous there)?
- There is not much analysis done on the results in some parts of section 5. For instance, you say that it is interesting that frequent axiom size increases with the size of the ontology, but do you have any idea why this may be the case, since you find it interesting?
- Page 9, sixth paragraph: I am not sure what you mean with "concrete left-hand side construct" here.
- Page 9, paragraph 8: So what does this mean? That they made some strange design decision to duplicate information? So this would be a case of discovering a bad modelling choice actually?
- Page 9, paragraph 9: Why is it interesting to know how many contains a variable on the lhs? Also, I am not sure how interesting the numbers are, considering the +/- 32% and +/-17%.
- Top-level and middle-level ontologies are never defined. There may be a bit more agreement in general on what a top-level ontology is, but for "middle level" I am pretty sure there is no general agreement, so you should define this. Unless there is some definition of it specifically for BioPortal?
- Figure 10 is a bit hard to read. Sizes on the x-axis are hard to read when the paper is printed, and could be made bigger. Scales on the y-axis should be more detailed, especially for diagram c, where it is not so easy to determine what approximate values the boxes refer to.
- Figures 12 and 13 differ quite a lot in notation/presentation, while 12 looks like a screenshot of something, 13 is much more readable.
- Page 11, first paragraph: "can be"? Do you mean that this is an upper bound?
- Page 11, second paragraph: "The top pattern... turned out not to be a real pattern" Is it a pattern or not? If it is not a pattern, i.e. no variable, then I guess it cannot be the top pattern in the first place.
- Throughout section 5 it would be very good if you could refer back to the definitions in your preliminaries and experiment setup, once you have numbered them, so that readers can more easily remind themselves of what the terms mean.
- Table 1: "selected" - how? Why? By what criteria?
- Am I correct in thinking that you only look at class frame fragments recurring within one ontology, not across ontologies? Or am I misunderstanding section 5.3 and Table 8?
- Page 12: It sounds like you are saying that [1] lists all the patterns in CCO, but you found one that is not included, is that the case (I did not look at [1])? Or is it that [1] happens to describe some example pattern, and you found another example? There is a slight difference in what your achievement means in these two cases.
- Page 13, first paragraph: decided to present, where?
- Page 13: Second part of the 5th paragraph seems more related to the 4th paragraph than the first part of the 5th - move up? The last sentence of the 4th paragraph is however not so clear - which class frame fragments do you mean?
- 6.1, first question: is it important to talk about those that were not converted properly here? Why? In that case, how many are there? "As with all our results" in the next paragraph seems a strange thing to say. Last paragraph: so what about the middle-sized ones?
- 6.1, third question, last paragraph: this whole paragraph needs to be rewritten in response to the discussion on patterns vs. ODPs, i.e. my major point at the beginning of this review. Also, since your patterns may in some cases contain mostly variables, I guess they could be very similar to structural, in particular logical, ODPs, so I am not sure how you arrive at the conclusion in the last sentence.
- 6.1, fourth question: I guess you are not detecting the fact that there are specialisations automatically, but rather this is your manual analysis, right? This could be made more clear. Throughout this discussion it is also not so clear if you are talking about just one example, or if you have observed some general trends in your data. Further, do you have any estimate of how many ODPs you were able to find, as opposed to just patterns?
- 6.2: can you give an example of the limitation in the first paragraph? What are the consequences of the second limitation?
- Conclusions: What do you mean by "reused" in the first sentence? It may not be the case that they were reused from somewhere, but rather "invented again". Further, the second paragraph contains too strong claims. Next, the link is duplicated; you already provided it in section 1. Additionally, I am not sure why you claim that your method is data-driven? It still relies only on the ontology and not on data, right? This is also one of the few places where you talk about "ontology modules" - either incorporate that notion throughout the paper or remove it from the conclusions.
- Tables 6 and 8 have the caption below the table, while all others have it above.
Language issues:
- Avoid using contractions like "hasn't" in formal text.
- Page 4, section 4.1: containing -> contains
- Page 9: one of patterns -> one of the patterns
- 5.3, 3rd paragraph: shown -> show
- Page 12, first para: this -> these
[1] http://dx.doi.org/10.1109/CIMCA.2005.1631261
[2] https://www.thinkmind.org/download.php?articleid=semapro_2010_1_40_50071