A Systematic Analysis of Term Reuse and Term Overlap across Biomedical Ontologies

Tracking #: 1198-2410

Authors: 
Maulik R. Kamdar
Tania Tudorache
Mark A. Musen

Responsible editor: 
GQ Zhang

Submission type: 
Full Paper
Abstract: 
Reusing ontologies and their terms is a principle and best practice that most ontology development methodologies strongly encourage. Reuse comes with the promise to support semantic interoperability and to reduce engineering costs. In this paper, we present a descriptive study of the current extent of term reuse and overlap among biomedical ontologies. We use the corpus of biomedical ontologies stored in the BioPortal repository, and analyze different types of reuse and overlap constructs. While we find an approximate term overlap of 25–31%, term reuse is only <9%, with most ontologies reusing fewer than 5% of their terms from a small set of popular ontologies. Clustering analysis shows that the terms reused by a common set of ontologies have >90% semantic similarity, hinting that ontology developers tend to reuse terms that are sibling or parent–child nodes. We validate this finding by analyzing the logs generated from a Protégé plugin that enables developers to reuse terms from BioPortal. We find that most reuse constructs were 2-level subtrees on the higher levels of the class hierarchy. We developed a Web application that visualizes reuse dependencies and overlap among ontologies, and that proposes similar terms from BioPortal for a term of interest. We also identified a set of error patterns indicating that ontology developers did intend to reuse terms from other ontologies, but that they were using different and sometimes incorrect representations. Our results point to the need for semi-automated tools that augment term reuse in the ontology engineering process through personalized recommendations.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Tao Cui submitted on 30/Oct/2015
Suggestion:
Minor Revision
Review Comment:

Kamdar et al. extend previous research in “Investigating Term Reuse and Overlap in Biomedical Ontologies”. The reuse of terms is particularly important for the interoperability and engineering of ontologies. Both papers conclude that term reuse (where class-level terms are shared among two or more ontologies) and term overlap (where two or more class-level terms are related) are minimal among 377 biomedical ontologies, due either to a lack of guidelines or to a lack of tools for reusing terms correctly. The metrics for term reuse and overlap use IRI, xref, and CUI references to identify specific terms. In this paper, the authors extend the previous study with a better metric (Equation 1) for calculating term reuse and term overlap, which overcomes some of the previous limitations. The authors also introduce clustering analysis and log data from the BioPortal Import Protégé plugin to validate users' term reuse intentions.

In this study, Kamdar et al. report that term reuse across a corpus of NCBO biomedical ontologies amounted to less than 9% of their total terms (6% for IRI, 5% for xref, and 8% for CUI reuse). This is slightly higher than the 5% reported in the previous study, but it is nonetheless a low percentage. For term overlap, using the equation and five overlap modules, the authors report 25%-30% term overlap, which is also higher than the 18% reported in the previous study. Overall, the new metric is said to produce more precise figures but leads to the same conclusion as before: despite some term overlap, term reuse is minimal among biomedical ontologies. Furthermore, the authors employed clustering and BioPortal Import Plugin logs (drawn from 90 countries and 3538 unique IP addresses) to determine users' term reuse intentions. The authors correlated the log data with the clustering analyses to support the finding that ontology engineers reuse single terms and hierarchical subtrees, as well as terms from parent-child structures located in the higher and upper-middle levels of an ontology's hierarchy. The visualization tools and published results can be found at http://onto-apps.stanford.edu

While the paper highlights some new contributions and improvements over the previous study, I recommend adding a subsection that discusses the previous study to give the current paper more context and clarity. This subsection could be placed near the Related Studies section, for example. In addition, Section 3.4 needs more conceptual explanation. Aside from the clustering analyses and log data used to validate users' intent to reuse terms, the other findings are more or less the same as in the previous study, with more precise figures supporting the same claim. Overall, the study reveals some new findings on term usage in biomedical ontologies and presents innovative methods for analyzing term usage.

Review #2
By Licong Cui submitted on 04/Nov/2015
Suggestion:
Minor Revision
Review Comment:

This paper presents a comprehensive and systematic analysis of term reuse and overlap among biomedical ontologies (377 distinct ontologies from the BioPortal repository are studied). The authors propose a new method that uses composite mappings to measure term reuse and overlap across different ontologies. This composite-mapping-based approach overcomes the limitations of the pure string-matching method. In addition, the authors leverage a clustering method based on semantic similarity to identify term reuse patterns. The result obtained is further validated through the web logs of the BioPortal Import Plugin. The analysis shows 25.31%-30.18% term overlap but less than 9% term reuse. The authors also found strong indications that ontology engineers intended to reuse terms but represented them differently and sometimes incorrectly. This points to the need for better guidelines and tools to support term reuse by ontology engineers.

The topic covered in this paper is suitable for publishing in the Special Issue on Linked Data and Ontology Reuse. The paper is well written and easy to follow. The reviewer only has a few minor comments that need to be addressed:

- On page 5, the formulation of Equation (1) might need to be adjusted. It is somewhat confusing, although an explanation of each symbol is given. The symbol t_ij is explained, but it is not used in the equation. From a purely mathematical point of view (i.e., without looking into the meaning of each symbol), the equation seems always to equal 0, since t_j and T_j are both members of M_o and the two sums would always yield the same value. Given that this equation is critical for computing term reuse and overlap, a clear representation will help readers better understand it.

- On page 7, "In the example from Table 3, Term ..." should be "In the example from Table 3, term ..."

- Figure 7 appears before Figure 6.

- On page 13, the authors mention that "This can be seen in the GO and FMA ontologies in Figure 8". But GO is not covered in Figure 8. Please correct this.

Review #3
Anonymous submitted on 21/Nov/2015
Suggestion:
Major Revision
Review Comment:

This paper presents a comprehensive analysis of term reuse and term overlap in the biomedical ontologies in BioPortal, the largest repository of biomedical ontologies. The paper reflects the need for a systematic approach to term reuse and overlap analysis. When new ontologies are developed, new concepts are often created while existing concepts and terms are not well utilized, which hinders semantic interoperability among information systems that use different ontologies in similar domains. This paper shows that most ontologies exhibit substantial term overlap but comparatively little reuse, and that the reuse draws on a very small number of ontologies such as BFO. Moreover, the reused terms mostly come from 2-level subtrees on the higher levels of the class hierarchy. The authors further implemented a web-based interface that can search for similar and reused terms in BioPortal ontologies and visualize the reuse dependencies and overlap. Existing tools that support term reuse, such as OntoFox and MIREOT, all require prior knowledge of the ontologies to be reused. The method in this paper is empirical, based on all the existing ontologies, and can therefore be a superior supporting tool for term reuse. The paper is clearly written and well structured, with tables and figures to support the argument. It has high scientific value. The literature review is sufficient and supports the paper well. The reviewer has some comments for the authors to further improve the manuscript:

Major issues:

1. The reviewer is concerned about the definition of term reuse for UMLS terms. Terms from different UMLS source vocabularies with the same CUI may not be reused terms. They may be considered as implicit reuse at most. In order to analyze the term reuse pattern in the UMLS terminologies, the authors may need to further assess whether terms are lexically identical. If they are not lexically identical, they probably should be considered as term overlap rather than reuse.

2. The authors mentioned the issue that “ontology developer tend to reuse terms with different versions, notations, or namespaces, that are sometimes incorrect and have no explicit mappings to the original term.” As this is an important issue that may hinder semantic interoperability, the authors should give further suggestions or future plans to reconcile this issue from BioPortal’s perspective. In Section 5.5, the authors discussed a WebProtégé plugin that can inform developers of changes to terms in the original ontology. This should be further elaborated.
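
The lexical check suggested in major issue 1 could be sketched roughly as follows. This is only an illustration of the reviewer's suggestion, not the authors' method: the function names `normalize` and `classify_by_cui`, the `(source, cui, label)` input shape, the simple lowercase/whitespace normalization, and the sample rows (vocabulary names and CUIs) are all hypothetical assumptions.

```python
from collections import defaultdict


def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal.

    Assumption: this is a stand-in for whatever lexical normalization
    the authors might choose; it is not taken from the paper.
    """
    return " ".join(label.lower().split())


def classify_by_cui(terms):
    """Classify each CUI shared by two or more source vocabularies.

    terms: iterable of (source_vocabulary, cui, label) tuples (hypothetical shape).
    Returns {cui: "reuse" | "overlap"}:
      - "reuse"   if at least one normalized label occurs in every source,
      - "overlap" if the sources share the CUI but no common label.
    CUIs that appear in only one source vocabulary are omitted.
    """
    by_cui = defaultdict(dict)  # cui -> {source: set of normalized labels}
    for source, cui, label in terms:
        by_cui[cui].setdefault(source, set()).add(normalize(label))

    result = {}
    for cui, sources in by_cui.items():
        if len(sources) < 2:
            continue  # CUI appears in one vocabulary only
        shared = set.intersection(*sources.values())
        result[cui] = "reuse" if shared else "overlap"
    return result


# Hypothetical sample rows; the vocabulary names and CUIs are made up.
sample = [
    ("VOCAB_A", "C0000001", "Heart"),
    ("VOCAB_B", "C0000001", "heart"),
    ("VOCAB_A", "C0000002", "Myocardial infarction"),
    ("VOCAB_B", "C0000002", "Heart attack"),
    ("VOCAB_A", "C0000003", "Liver"),
]
print(classify_by_cui(sample))
```

Under this sketch, same-CUI terms count as reuse only when their labels match lexically after normalization; otherwise they are counted as overlap, as the reviewer proposes.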

Minor issues:

1. There is an inconsistency in Section 4.4: “Cluster 4 is shown in Figure 7”. However, Figure 7 actually shows sub-clusters in Cluster 3.

2. There is another inconsistency in Section 4.4: “This can be seen in the GO and FMA ontologies in Figure 8”. However, GO is not in Figure 8.

3. The references should be numbered in the order in which they appear in the text.