Approaches, methods, metrics, measures, and subjectivity in ontology evaluation: A survey

Tracking #: 657-1867

Authors: 
Hlomani Hlomani
Deborah A. Stacey

Responsible editor: 
Michel Dumontier

Submission type: 
Survey Article
Abstract: 
Ontology evaluation is concerned with ascertain two important aspects of ontologies: quality and correctness. The distinction between the two is attempted in this survey as a way to better approach ontology evaluation. The role that ontologies play on the semantic web at large has been has been argued to have catalyzed the proliferation of ontologies in existence. This has also presented the challenge of deciding one the suitability of an given ontology to one's purposes as compared with another ontology in a similar domain. This survey intends to analyze the state of the art in ontology evaluate spanning such topics as the approaches to ontology evaluation, the metrics and measures used. Particular interest is given to Data-driven ontology evaluation with special emphasis on the notion of bias and it's relevance to evaluation results. Chief among the outputs of this survey is the gap analysis on the topic of ontology evaluation.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Amrapali Zaveri submitted on 10/Aug/2014
Suggestion:
Major Revision
Review Comment:

This article “Approaches, methods, metrics, measures, and subjectivity in ontology evaluation: A survey“ provides a survey of the literature on the topic of ontology evaluation. The paper attempts to survey approaches, metrics and subjectivity in ontology evaluation. Unfortunately, in its current state, this paper does not read like a survey at all but just a collection of a few papers randomly put together, thus lacking in breadth and depth. I provide my critique on each of the four criteria, based on which this paper was reviewed.

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.
This paper is not at all suitable as an introductory text to get started on the covered topic. There is no clear overview of the concepts regarding ontology evaluation, and several concepts are introduced but none are studied in detail, such as ontology evaluation, ontology reuse, and ontology integration. First of all, I think a brief description of the basic concepts of ontologies should be provided, such as what is an ontology, what is ontology evaluation, etc. This would help anyone new to the field at least gain enough background knowledge to go through the rest of the paper. Then a consensus on the concepts should be provided, since currently there is a lot of overlap, which is causing confusion. For example, sometimes the authors call it “metrics”, other times “measures” or “criteria” or “term”. Is there any difference between them? Another example is the use of the words “method” and “approaches” - what is the difference? This also points to the title of this paper. I would keep the terminology consistent.

More importantly, I do not regard this paper as a “survey” at all because a survey is usually conducted for several reasons [1] such as: (i) the summarization and comparison, in terms of advantages and disadvantages, of various approaches in a field; (ii) the identification of open problems; (iii) the contribution of a joint conceptualization comprising the various approaches developed in a field; or (iv) the synthesis of a new idea to cover the emphasized problems. This paper does not summarize nor compare the various approaches - they are just randomly mentioned throughout the paper or very superficially referenced and clumped together. Although the authors attempt to identify “limitations” (or “open problems”, per se) in the current approaches, these are also not clearly explained and are scattered all over the place, which makes it difficult to identify them clearly. Table 3 does attempt to provide an overview of the approaches, but only one paper is cited, which itself is a survey of ontology evaluation techniques done in 2005. What about all the other papers/approaches that are newly proposed in the literature? There is no real attempt to unify or summarize these findings. Only in the “limitations” section are there some more references, which should actually be in the main part of the paper. Even there, it is only mentioned how many metrics there are, rather than a detailed discussion of which metric is applicable in which situation and how it is measured. Thus, I do not see much scientific contribution in this paper. Right now, it looks like a related work section of several papers put loosely together.

I provide a detailed list of my critique of specific parts:
Title and Conclusion
- The title is “Approaches, methods, metrics, measures, and subjectivity in ontology evaluation”. As I already mentioned, what is the difference between “approaches” and “methods” and between “metrics” and “measures”?
- Then in the conclusion, you only mention “approach, methods and metrics”. It is again confusing here that you don’t mention subjectivity. This is only loosely mentioned in the last paragraph. Is this a contribution or not?
- Moreover, there is a lot of repetition in the conclusion section - you mention ontology quality and correctness, and the focus of this paper, twice each.
- Also, why provide an explanation of what is “correctness” only in the conclusion?
- Please just be consistent about what the contributions of this paper are.
Abstract
- What do you mean by gap analysis? Did you do this? Right now it is quite unclear.
Section 1
- In general, it is rather weakly motivated as to why ontology evaluation is important. I suggest providing a stronger argument.
Section 2
- “An interesting definition of ontology evaluation has be given by Gómez-Pérez et al. [17] and later echoed by Vrandecic et al. [40].” -> It would be helpful to the reader to actually provide the definition here rather than her having to look up the references for it.
- I would strongly suggest to even provide the definitions/explanations for each of the concepts here, such as what is an ontology, what is ontology evaluation, what is ontology reuse, etc., so as to make this paper stand by itself as an introductory text for anyone new to this field. Provide examples and/or references too for each.
- After ontology verification in Section 2.1, you suddenly talk about ontology integration and merging? What is the connection? Then you introduce ontology reuse. I find it hard to follow the structure of this section.
- Is the difference between ontology integration and merging introduced by the authors? If not, please add a reference to where you got them. Also, it is not entirely clear to me what the difference is. Maybe adding an example (perhaps visually) would help. For me, ontology integration sounds more like ontology expansion.
Section 3
- Again, either clarify the difference between metrics and measures or use a single term throughout the paper.
- All the papers are mixed up here - clearly adding bullets and also adding an overview table would help.
- Also, the section 3.1 heading mentions “State of the art”, isn’t this whole paper supposed to be a “state of the art”? Also, details of only a couple of papers are provided. Are all covered? This point is covered in my critique for (2).
- Even the three types of metrics of [14] should be further clarified or explained with examples.
- In Section 3.2, why does your work “perceive” ontology evaluation as these perspectives? Isn’t this supposed to be a survey of existing literature? Aren’t there any more already defined in the literature? Please provide references.
- For ontology quality, is there only one paper that discusses it?
- Provide examples for internal and external attributes.
- Figure 1 seems rather vague for the depiction of “ontology evaluation” but probably only tries to show what an ontology is. I would remove this or significantly improve it so that it clearly depicts what are the detailed steps for ontology evaluation.
- This section leads me to wonder what exactly the difference is between “quality” and “correctness”. I would think “quality” is the general overall concept and “correctness” is one of the data quality dimensions.
- Also, it is mentioned “These scenarios necessitate the need for separation of these concerns and advocates for separate determination of each.” I would expect the authors to do this sort of analysis in this paper!
- In Table 1, I do not see clearly the four layers of the suite. I only see two evaluation perspectives and seven metrics. Even in that, there is an overlap in the measures such as “coverage”. Also, is there no measure for “Conciseness”?
- First you mention “This metric suite is reminiscent of Button-Jones et al. [6]’ metric suite for ontology auditing.” and later you say “This metric suite has been largely based on Vrandecic [41]’s evaluation criteria and Button-Jones [6]’s metric suite.” Which one is it? Then you mention the eight criteria proposed by [41]. Then you say it is not an exhaustive list so I am not sure what exactly is your contribution here. Why don’t you provide a list of all those that you have actually found in those papers? Then if you add more of your own, clearly mark which ones are from previous literature and which ones are newly introduced.
- I would put Table 2 before Table 1 - first provide the definitions of the metrics and then how they can be measured. Also, would be helpful to map the terms/metrics in Table 2 to the “evaluation perspective” (from Table 1).
- Also, the definitions in Table 2 do not seem to be very clear; for example, for Accuracy, is the second sentence part of the definition? Same for cohesion.
- The terms/metrics provided in Table 2 also seem to overlap - such as “coupling” and “cohesion”, “coverage” and “completeness”. Probably merge them or show how they are inter-related.
- Also, only a few of the definitions have references, are the others introduced by the authors?
- Ideally, each of the terms/metrics in Table 2 should also be provided with a measure. For example, how is Completeness measured under an OWA and a CWA perspective?
Section 4
- The caption of Table 3 is “An overview of approaches to ontology evaluation as related to the aspects (layers) of an ontology.” but in the text it is mentioned that the table is only a comparison between two classifications and that is also taken from one paper. In fact the table is a replica of the table in [2] so I do not see the reason to reproduce it here.
- Provide more details about the classifications - add examples, add more references - name/highlight the types.
- In Sections 4.1 to 4.3, why are only the different types of the third classification discussed? What about the other classifications?
- I appreciate that you try to find limitations in each of those classifications discussed in the last paragraphs of each section, but then there are limitations in Section 5 too. I would just provide a separate section at the end.
- Doesn’t user-based evaluation also involve crowdsourcing-based tasks/evaluation? I provide some references later on, which are missing and should also be considered as part of this survey.
Section 5
- Isn’t subjectivity an important core contribution of your paper? Why does it appear only in the “Limitations” section? Is “subjectivity” the “only” limitation in ontology evaluation? I would make “Subjectivity in ontology evaluation” a section by itself and then add all the limitations in another separate section (including the ones from Section 5) and call it “Open Problems”.
- I did not understand 5.2 Subjectivity in thresholds entirely. It reads rather vaguely; providing an example might help.
- What about those tools which automatically assess the quality of ontologies? Where does “subjectivity” come in the picture then?

[1] Barbara Kitchenham. Procedures for Performing Systematic Reviews. Department of Computer Science, Keele University, (2004)

(2) How comprehensive and how balanced is the presentation and coverage.
Even though the authors cite several papers that propose a methodology for ontology evaluation, it is unclear how many are actually surveyed. After reading this paper, I was left confused as to which and how many ontology evaluation methods are actually surveyed and how many and which methodologies are exactly available for ontology evaluation. Thus, it is difficult at this point for me to judge whether the paper is comprehensive enough unless I look up these papers and then compare the references individually myself. Sometimes it seems that this survey is based on only two papers - Burton-Jones et al. and Denny’s thesis, which are repeatedly mentioned in most of the paper, whereas the others are just mentioned in the limitations section. In this regard, this paper is rather imbalanced. Also, having written a survey paper myself, I know that citing a thesis is strongly criticized, so I would recommend referring to the papers that the thesis refers to rather than the thesis itself, because a thesis is not peer-reviewed work. In any case, I strongly recommend the authors to dive deep into each of the papers they cite and provide a summarization and comparison of the various approaches. In the current state, the paper reads like random musings on some papers that the authors seem to have chanced upon, as opposed to having undertaken an actual survey.

I am aware of Denny’s thesis, published in 2010, which already provides a number of frameworks for ontology evaluation. The authors also cite another survey done on the same topic in 2005. I also found this book chapter [2] published in 2010, which compares different ontology evaluation techniques and also provides a tool. This raises several questions: Did you include all of the papers that they cited? How many more papers have been newly published since the publication of this thesis? Was it a significant number of papers? Did you include all of those too? What methodology did you follow to search for the papers? What inclusion/exclusion criteria did you use to include the papers that you mention? Please provide a table with the list of all papers related to ontology evaluation included in this survey. Providing such a table would make it much easier for a reader or a researcher, PhD student or practitioner to get started on the covered topic.

I am also surprised that the authors do not mention any ontology evaluation tools, which should be an important part of this survey considering there are several tools out there to perform this assessment. Please provide an overview of the tools and also a comparison of them. I already identified tools for ontology evaluation and provide references later in my review. But the authors should make an effort to look for all the ones currently available if they plan to include them in their survey.

[2] Samir Tartir, I. Budak Arpinar, Amit P. Sheth. Ontological Evaluation and Validation. Theory and Applications of Ontology: Computer Applications 2010, pp 115-130

(3) Readability and clarity of the presentation.
The paper is poorly structured and quite difficult to follow. The sentences are unnecessarily complex and repetitive. Also, there are several facts which are unaccompanied by references, which leaves the reader wondering whether they are really true or speculations by the authors. I have provided an incomplete list of some of the formal errors that I encountered at the end of my review. However, I strongly recommend the authors to proofread the paper and have a native speaker or third person also read the paper to ensure clarity in the presentation.

It is extremely important for a survey paper to provide a general overview of the existing approaches (be it in the form of a figure/tables) so that the reader can easily refer to the paper to choose any one for their purpose. As of now, it is all very chaotically put together into one paper. I think the authors should really ensure that the text is clearer and tighter in the next iteration.

(4) Importance of the covered material to the broader Semantic Web community.
The topic of ontology quality/ontology evaluation is indeed an important topic to the Semantic Web community. However, this paper, in its current state, is far from providing an overview of all the existing approaches proposed so far in the literature. I think the authors should make an effort to investigate further into the existing approaches. I think in the paper they do have several approaches already referenced, but they are all over the place, so I would suggest that they look into each one of them in detail and show the reader clearly which paper to refer to in order to evaluate which specific metric(s).

Here are my recommendations to the authors:
- Unify the concepts regarding ontology evaluation - definitions, formulae, etc.
- On a bigger picture, provide an explanation on the importance of ontologies, ontology quality, ontology evaluation etc.
- Clearly identify which research questions you aim to answer with this survey (set the boundary) - ontology evaluation, ontology integration/matching, ontology reuse. For each of these, then there are several papers that tackle each area.
- Use consistent terminology
- Provide a quantitative and qualitative overview
- Provide an overview table of which approach covers which category, which metrics it uses, and which tools it provides
- Extract the (common) steps and metrics involved in each of the approaches and/or compare the different approaches qualitatively
- Look at each criterion/metric in detail and discuss how it can be measured, and then provide references to which papers measure which metric
- Provide an overview of the tools - even perhaps compare the tools based on certain criteria (e.g. https://en.wikipedia.org/wiki/Non-functional_requirement) or actually apply them to an example ontology and evaluate it
- I think even some important and interesting aspects of ontologies need to be explored such as “evolution of ontologies and their evaluation”, “domain specific ontology evaluation”, “crowdsourcing ontology verification” (I provide a few references related to these)
- It is great that you provide limitations, but I would rather see them as challenges and provide them only as bullet points or clear paragraphs without any references, so that it paves the way for new research in this topic. Right now it is all clumped together with the existing literature, so it is difficult to clearly identify these challenges. This could be a significant contribution.
I think even the addition of a few of these recommendations can significantly help improve the quality of this paper and its contributions.

Here is a short list of papers that should be looked into either to gather additional references or to consider adding the papers themselves:
- Aruna, T.; Saranya, K.; Bhandari, C., "A Survey on Ontology Evaluation Tools," Process Automation, Control and Computing (PACC), 2011 International Conference on, pp. 1-5, 20-22 July 2011
- Sara García-Ramos, Abraham Otero, Mariano Fernández-López. OntologyTest: A Tool to Evaluate Ontologies through Tests Defined by the User. Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living. Lecture Notes in Computer Science Volume 5518, 2009, pp 91-98
- Peroni, Silvio, David Shotton and Fabio Vitali. "Tools for the Automatic Generation of Ontology Documentation: A Task-Based Evaluation." IJSWIS 9.1 (2013): 21-44. Web. 1 Aug. 2014. doi:10.4018/jswis.2013010102
- Fernandez, Miriam; Cantador, Iván and Castells, Pablo (2006). CORE: a tool for collaborative ontology reuse and evaluation. In: 4th International Workshop on Evaluation of Ontologies for the Web (EON 2006) , 23 - 26 May 2006, Edinburgh, UK.
- Jonathan M. Mortensen, Paul R. Alexander, Mark A. Musen , and Natalya F. Noy. Crowdsourcing Ontology Verification. The Semantic Web – ISWC 2013. Lecture Notes in Computer Science Volume 8219, 2013, pp 448-455
- Cristina Sarasua, Elena Simperl, and Natalya F. Noy. CROWDMAP: Crowdsourcing Ontology Alignment with Microtasks. ISWC 2012
- Xi Deng, Volker Haarslev, and Nematollaah Shiri. Measuring Inconsistencies in Ontologies. ESWC 2007
- Samir Tartir, I. Budak Arpinar, Michael Moore, Amit P. Sheth, Boanerges Aleman-Meza. OntoQA: Metric-Based Ontology Quality Analysis. IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources 2005
- Liping Zhou. Dealing with Inconsistencies inDL-Lite Ontologies. The Semantic Web: Research and Applications. Lecture Notes in Computer Science Volume 5554, 2009, pp 954-958
- Peter Plessers, Olga De Troyer. Resolving Inconsistencies in Evolving Ontologies. ESWC 2006
- Survey on vocabulary and ontology tools. A Semicolon project deliverable. Version 1.0. http://www.semicolon.no/wp-content/uploads/2013/09/Semicolon_Vocabulary-...
- D1.2.3 Methods for ontology evaluation. Knowledge Web. http://www.starlab.vub.ac.be/research/projects/knowledgeweb/KWeb-Del-1.2...

As a side note, I would strongly recommend the authors to look at other accepted surveys in the same journal or even elsewhere to get an idea of what a survey should essentially contain.

Here is an incomplete list of formal errors that I encountered:
Abstract
- ascertain -> ascertaining
- web at large has been has been -> web at large has been
- deciding one the suitability -> rephrase
1 Introduction
- is a emerging field -> is an emerging field
- “It first gives a context to ontology evaluation by defining the notion of ontology evaluation (Section 2.1) and discussing ontology evaluation in the context of ontology reuse as an example scenario for the role of ontology evaluation (Section 2.2).” -> too much use of “ontology evaluation” which makes this sentence difficult to interpret
- Why is Section 4 mentioned before Section 3 in the last paragraph?
- You forgot to mention Section 5
2.1 A definition
- I would rename this section
- determining which in a collection of ontologies would -> determining which, in a collection of ontologies, would
3.1 Ontology evaluation metrics: State of the art
- ( are also -> (are also
3.2 Ontology evaluation measures: Perspective, criteria, metrics
- Ontology quality perspective -> Ontology quality perspective.
- Ontology correctness perspective -> Ontology correctness perspective.
- ( the model) -> (the model)
- and get a high score -> and gets a high score
Table 1
- precision -> Precision
Table 2
- determining is the asserted -> determining if the asserted
- if its its classes -> if its classes
3.3
Verendicic -> Vrandecic
4.1 Gold standard-based evaluation
- ( the target -> (the target
4.2 Application or task-based evaluation
- set of concept, -> set of concepts,
4.3 User-based evaluation
- two type of -> two types of
- will be give a weighted value depend on -> will be given a weighted value depending on
4.4 Data-driven evaluation
- how appropriate -> how appropriately
5.1 Subjectivity in the criteria for evaluation
- Keep the heading consistent with that mentioned in the previous paragraph - (i) subjectivity in the selection of the criteria for evaluation
- Ontology evaluation can be regarded over several different decision criteria. -> rephrase
5.1.1 Inductive approach to criteria selection
- inductive in that a -> inductive, in that, a
- (i) Number of Root Classes (NoR) - Number of root classes explicitly defined in an ontology and (ii) Number of Leaf Classes (NoL) - Number of Leaf classes explicitly defined in an ontology - remove the repetition
- Burton-Jones et al. [6]’s -> I find this way of citation strange. Either say “In [6], the authors...” or “Burton-Jones et al. [6] propose…”. Also keep it consistent, sometimes it is “The work of [17]...” or “Button-Jones et al. [6]’ metric” - use one form of citation. This applies to all such references.
- Also there are several sentences which are not backed by any reference and/or need more details such as:
-- While this is attractive, it presents a challenge in deciding which ontology to reuse (as they are reusable knowledge artefacts) and hence the topic of ontology evaluation. -> the sentence structure is very weak and needs references.
-- For example, given an algorithms ontology, one may introduce another class or type of algorithm… -> add reference
-- A considerable number of ontologies have been created. -> add some examples
-- The relevance of ontologies as engineering artifacts has been justified in literature. -> add reference
-- Most research on ontology evaluation considers ontology evaluation from the ontology quality perspective. -> add reference
-- In Table 2, only some of the definitions have a reference and others do not. Are the ones that don’t have a reference contributed by the authors?
- Also check the capitalization of certain words in the references e.g. oops! -> OOPS!

Review #2
By Jesualdo Tomás Fernández-Breis submitted on 20/Aug/2014
Suggestion:
Reject
Review Comment:

Summary:
The current paper is a survey in the area of ontology evaluation, considering approaches, methods, metrics, measures and subjectivity. Ontology evaluation is a very important topic in ontology engineering research. The paper is clearly in the scope of this journal.
The paper is structured in six sections: introduction, context for ontology evaluation, metrics for ontology evaluation, approaches to ontology evaluation, overall limitations and conclusions. The structure of the paper is appropriate and the paper is easy to read and follow.

Overall evaluation:
I think that the current manuscript neither provides a good review of the field nor is a good introductory text. On the one hand, relevant related literature and community efforts are missing from this survey. On the other hand, the analysis of the different approaches and metrics is too shallow; the paper would need to be extended to include appropriate explanations and discussion of the different approaches and perspectives, as well as to provide some additional core definitions. I find the content of some sections not very appropriate for this paper.

Detailed comments:
The introduction is quite short and does not provide a convincing motivation. Basically, it describes the structure and content of the paper. Why is this survey needed besides the fact that ontologies are basic for the semantic web? There is only a brief mention of choosing which ontology to reuse.

The methodological approach followed for selecting the works included in this survey is not clear. Last year, there was a large community-driven activity on ontology evaluation, the Ontology Summit 2013: Ontology Evaluation Across the Ontology Lifecycle, which has not been mentioned in the survey:
http://ontolog.cim3.net/cgi-bin/wiki.pl?OntologySummit2013.
Neuhaus, F. et al. (2013). Towards ontology evaluation across the life cycle: The Ontology Summit 2013. Applied Ontology, 8(3), 179-194.

In section 2, some definitions are provided. They mention that ontology evaluation is related to verification and validation. This made me think that metrics, approaches, etc. in further sections would be analyzed along such dimensions, but that was not the case. The role of section 2.2 is not clear to me. Ontologies are not only reused for integration and merging. The metrics and approaches are not analyzed with ontology integration and merging as context. They are not mentioned in the limitations or discussion sections.

Metrics are dealt with in section 3. On the one hand, I miss some relevant traditional ontology metrics like the ones provided by OntoQA. Mentioning internal and external metrics reminds me of software engineering standards and their adaptations to ontology quality, such as OQuaRE, which is not included in this study. There are also quality-in-use metrics, not considered in this study, that could include aspects like user reviews and ratings, which are available in some repositories of ontologies. Recent works related to ontology quality assurance have defined metrics based on axiom regularity (e.g., Mikroyannidi's work on RIO). Here, evaluation is approached from two perspectives: correctness and quality. This distinction is also made in the conclusions. Whereas a definition is provided for correctness, I am afraid I have not been able to find any definition in the paper for ontology quality. Section 3.3 is somewhat confusing because the paper intends to be a survey and yet here the authors are proposing a metrics suite. However, there are current approaches, not included in the paper, which contain more than those metrics, and some discussion and comparison should be provided. In any case, I think that proposing a suite is not the goal of a survey.

Tartir, S. et al (2005, November). OntoQA: Metric-based ontology quality analysis. In IEEE Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources (Vol. 9).
Duque-Ramos, A. et al. (2013). Evaluation of the OQuaRE framework for ontology quality. Expert Systems with Applications, 40(7), 2696-2703.
Mikroyannidi, E. (2013). Detection of syntactic and semantic regularities in ontologies (Doctoral dissertation, University of Manchester).

Section 4 surveys approaches: gold-standard, application/task-based, and data-driven. It must be mentioned that there is very limited experience with ontology evaluation exercises. Concerning gold-standard evaluation, an experiment similar to Keet et al. [23] (comparison of trained and non-trained students against a gold standard) was recently performed by the GoodOD project:
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0061425.
Boeker, M. et al (2013). Effects of Guideline-Based Training on the Quality of Formal Ontologies: A Randomized Controlled Trial. PloS one, 8(5), e61425.

Pitfalls and design patterns are indeed important aspects related to ontology evaluation. Competency questions have also been used to evaluate ontologies, but they are not mentioned in the paper. Recent approaches also account for cognitive aspects of quality, which are not included in the paper; see for instance:
Joerg Evermann, Jennifer Fang: Evaluating ontologies: Towards a cognitive measure of quality. Inf. Syst. 35(4): 391-403 (2010).

Section 5 describes the main limitations, focused on the subjectivity of criteria, thresholds and the overall value. I understand the authors to mean that the criteria and metrics are not subjective in themselves, but that their inclusion in or exclusion from a concrete framework is. In my opinion, the diversity of approaches and criteria may mean that what properties a "quality" ontology must have is not clear to the community, and I think this was one of the reasons and problems identified in the Ontology Summit 2013. Besides, the intended context of use of an ontology may lead to different criteria for different contexts, which has not been discussed in the paper, but this may not be caused by subjectivity but by expertise.

I agree on the problem of subjectivity in thresholds, which I think should also probably be the result of a community agreement based on experiences and best practices. Moreover, it is not clear to me why ontology evaluation frameworks should give an overall value; they might just provide indicators, in the form of metrics, that describe strengths and weaknesses of the ontologies, since quality evaluation does not necessarily mean ranking, which is one of the types of evaluation approaches. The authors do not clearly justify or demonstrate these points with examples in which concrete methods are unfair to concrete ontologies because of subjectivity.

An aspect that is not really covered by the paper is the evaluation of the evaluation methods, and the practical and combined use of quality evaluation methods with ontology construction methods. I think these are important limitations of the state of the art due to the lack of experiences.

Review #3
By Karin Verspoor submitted on 24/Aug/2014
Suggestion:
Reject
Review Comment:

This paper aims to survey current approaches and methods in ontology evaluation.

Regarding the specific criteria for survey papers:

(1) Suitability as introductory text.
I believe that a newcomer to the topic of ontology evaluation would struggle with this survey.

Many concepts are referenced without clear definitions. For instance, the distinction between verification and validation, defined as "building an ontology correctly" and "building the correct ontology", respectively, was not clear to me. Notions of "coupling" and "cohesion" are mentioned without explanation (until Table 1, again with no discussion; and no attempt to relate the elements of Table 1 to those of Table 2). The example in 2.2 doesn't really help to clarify the notion of ontology "integration" -- it seems more about adding a new class to an existing ontology, i.e. extending the ontology. How is that "integration"? The authors have listed various approaches without effectively synthesising them, e.g. in 3.1 a list of metrics for reference [6] and another list for reference [14] are mentioned, but they aren't adequately compared/contrasted with each other or other proposals. While a distinction between approaches and methods is made in the Conclusion, this comes very late in the manuscript and does not help to frame the discussion.

(2) Comprehensiveness of coverage.
While the authors mention task-based evaluation, there seems to be a substantial literature in that area that is not introduced or discussed. See for instance [1]. Also relevant is the Ontology Alignment Evaluation Initiative (http://oaei.ontologymatching.org/) which has run a number of workshops and evaluation events that do not appear to be included in the references (e.g. [2]). Also missing is mention of some more context-specific examinations of ontology quality, such as the lexical issues explored in [3], and domain-specific issues highlighted in [4]. Given that biomedicine is one of the most active areas in which ontologies are being used in the semantic web, these would be worthwhile to include.

The authors allude to but do not fully address the relationship between ontology construction/induction and ontology evaluation. There seems to be a dependency there that would be worthwhile to explore in more detail. For instance, the idea of a posteriori ontology engineering [5] is closely tied to measurement.

[1] Marta Sabou, Jorge Gracia, Sofia Angeletou, Mathieu d’Aquin, Enrico Motta (2007) Evaluating the Semantic Web: A Task-Based Approach.
In "The Semantic Web". Lecture Notes in Computer Science Volume 4825, 2007, pp 423-437
[2] Euzenat et al 2009 http://hal.inria.fr/hal-00794918/
[3] Verspoor et al 2009 http://bioinformatics.oxfordjournals.org/content/25/12/i77.long
[4] Hoehndorf et al 2012 http://bib.oxfordjournals.org/content/14/6/696
[5] Gessler et al 2013 http://www.crcnetbase.com/doi/abs/10.1201/b14935-14

(3) Readability and clarity of the presentation.
I found this paper difficult to read. I felt that the paper is not well-structured; it is very difficult to differentiate between the review elements and the authors' own analysis of the prior work. Proposals appear in various places; e.g., Section 3.2 identifies two perspectives, 3.3 introduces a suite of metrics, and then Section 4 sets out to establish a classification of ontology evaluation, at the same time as mentioning three prior classifications.

In addition, the language is awkward in many places and impedes understanding. For instance, the last sentence of the first paragraph ends with "and hence the topic of ontology evaluation" but it isn't really clear what the relationship is between evaluation and the challenge of deciding which ontology to reuse. I can work it out, but rephrasing that portion as a new sentence, e.g., "Ontology evaluation is required to make an informed choice between multiple ontologies.", would help the reader a lot more.
The second sentence of the Intro states "The relevance and importance of ontology evaluation is evident in the role the[y] play in the semantic web ..." -- ontologies clearly have a direct role in the semantic web, but it isn't so obvious that ontology evaluation does. Similarly in Section 2, clarify the relationship between the increase in the number of ontologies, and the need for evaluating the ontologies. The second sentence of 3.1 doesn't seem to follow from the first, despite the use of "Hence ...". It isn't obvious that evaluation is needed prior to use. In 3.2, what does it mean for "evaluation to be done in the view of a perspective"? How does the perspective influence the evaluation? Do you mean "we have identified two complementary perspectives that influence how ontology evaluation is approached"? (and why are they complementary?) At the end of section 5, how does simply having a subjectivity scale "provide a means to account for the influences of bias"?

Some other issues:
* Use of "this work" is often confusing -- is it referring to the authors' current paper, or the literature being surveyed? The authors seem afraid to use the academic "We" but this would help avoid this confusion.
* The phrase "metric suite" is ambiguous and would be better as "suite of metrics".
* What's a naive interpretation?

(4) Importance of the covered material to the broader Semantic Web community.
The topic of ontology evaluation is clearly important for the entire Semantic Web community, though the authors could more strongly argue that point.

Review #4
By Janna Hastings submitted on 26/Aug/2014
Suggestion:
Major Revision
Review Comment:

Summary:

This manuscript aims to offer a survey of the field of ontology evaluation. It is reasonably comprehensive in terms of mentioning the different components of the field and in terms of references, and includes a few useful summary tables. However, the weighting of the treatments of different aspects of the field seems unbalanced, in that some aspects of the field, such as ontology evaluation in use for a particular purpose or the different philosophical or structural paradigms for ontology evaluation, were barely mentioned despite the large body of literature they have amassed. Furthermore, the quality of the written text is not quite up to the expected standard -- the manuscript should be seen by an English language editor.

Specific points as requested by the review template:

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.

-- I do not believe the manuscript in the present form is suitable as an introductory text on the topic of ontology evaluation. This is largely due to the problem with the balance of the presentation mentioned above. Also, the topic of ontology usage in the context of the Semantic Web in particular is not well introduced, which I think would be a necessary backdrop to a survey paper on ontology evaluation for this journal.

(2) How comprehensive and how balanced is the presentation and coverage.

-- See above. The coverage approaches comprehensiveness but is not sufficiently balanced.

(3) Readability and clarity of the presentation.

-- The paper needs to be seen by an English language editor before it can be considered for publication. Language errors are pervasive in the current draft.
I have listed a few examples below for illustration.

Abstract:
* is concerned with ascertain two -> is concerned with ascertaining two
* challenge of deciding one the suitability -> the challenge of deciding the suitability
* state of the art in ontology evaluate -> state of the art in ontology evaluation
* it's relevance -> its relevance

Introduction:
* Ontology evaluation is a emerging field -> Ontology evaluation is an emerging field
* is evident in the role the play -> is evident in the role ontologies play
* from both academic and industrial domain -> from both the academic and industrial domains

Also, some of the content in the paper is too weakly argued. For example, the following lines begin section 2.2.: "A considerable number of ontologies have been created. It follows, therefore, that the time has come for these ontologies to be reused." However, there is no justification for the claim that just because many ontologies have been created that now the "time has come" for them to be reused. Many ontologies are not designed for reuse (i.e. they are designed for a single application). Consider rephrasing this paragraph and watch out for similarly unjustified content elsewhere.

(4) Importance of the covered material to the broader Semantic Web community.

The topic is very relevant to the Semantic Web community. Unfortunately, the manuscript does not do sufficient justice to those aspects of the topic that are of the most relevance to this community. For example, in my opinion the paper does not do enough justice to the notion of "purpose" or application context in ontology evaluation. Ontologies are developed to be used for different application needs, and correctness or quality metrics will differ depending on the intended purpose of the ontology including the broader context of that usage. Given the Semantic Web focus of the target journal, perhaps more space could be devoted to the specific use of ontologies within that context, and ontology evaluation for use within the Semantic Web in particular.

Additional specific comments:

I do not agree that "Ontology design best practices can also be considered to fall within the realm of gold-standard." Gold standards are usually other ontologies or textual corpora to which the evaluated ontology can be compared, while best practices are guidelines to follow in creating ontology content. These seem to me to lend themselves to rather different modes of ontology evaluation.

Despite the fact that humans are the objects of research in most ontology evaluation experiments, I believe that we are not doomed to subjectivity as a direct result, as we possess sophisticated statistical tools that are able to determine allowable generalization of results when given sufficient sample sizes.