BioPortal as a Dataset of Linked Biomedical Ontologies and Terminologies in RDF

Authors: 
Manuel Salvadores, Paul R. Alexander, Mark A. Musen, Natalya F. Noy
Abstract: 
BioPortal is a repository of biomedical ontologies—the largest such repository, with more than 300 ontologies to date. This set includes ontologies that were developed in OWL, OBO and other formats, as well as a large number of medical terminologies that the US National Library of Medicine distributes in its own proprietary format. We have published the RDF version of all these ontologies at http://sparql.bioontology.org. This dataset contains 190M triples, representing both metadata and content for the 300 ontologies. We use the metadata that the ontology authors provide and simple RDFS reasoning in order to provide dataset users with uniform access to key properties of the ontologies, such as lexical properties for the class names and provenance data. The dataset also contains 9.8M cross-ontology mappings of different types, generated both manually and automatically, which come with their own metadata.
Submission type: 
Dataset Description
Decision/Status: 
Accept
Reviews: 

Submitted in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...

Revised manuscript, now accepted for publication. Previous versions received an "accept with major revisions" in the first round and an "accept with minor revisions" in the second round. Previous round reviews are below.

Solicited review by Amrapali Zaveri:

The authors have revised the paper sufficiently well to respond to the comments. The new section and the appendix definitely add more information to the paper and address all the issues that were raised. There are however a few minor things that should be fixed:

1. * the description of a use case or potential usefulness of the
ontologies should have been portrayed so as to add more meaning to
the effort of creating such a large repository. For example,
integration of several ontologies for a specific use case or the
adoption of any one ontology in a biomedical application would have
been useful.

We agree that describing a use case would help to motivate this
effort. We have added the following statement in the introduction:
"These ontologies have been used for drug surveillance, gene
annotation and to enrich and classify scientific literature; among
other things".

OK, but it would be good to provide a reference for each of these uses, too.

2. The link http://www.bioontology.org/wiki/index.php/Sparql_bioportal still does not display any content

3. In Section 2.2: "All this elements..." -> "All these elements..."

4. In Section 2.2: "Researchers from outside groups..." -> "Researchers from groups outside BioPortal..."

5. The reference or footnote for "NCBO's 4store clone" in the Appendix: Other Tools and Resources is missing.

Solicited review by anonymous reviewer:

The authors did a good job with the revision, and the extra page greatly improves the description of BioPortal. The only remaining point is that I would have liked to see more on the topic of quality (to the extent covered in the rebuttal), but I will not insist on it.

Two typos:

1) All this references

2) Two versions of the same word (look for all occurrences):
de-referencing and dereferencing.

First round reviews:

Solicited review by Jens Lehmann:

The paper "BioPortal as a Dataset of Linked Biomedical Ontologies and Terminologies in RDF" describes a repository of over 300 biomedical ontologies. The repository mainly focuses on storing ontologies, rather than instance data, in three particular formats: (1) OBO format, (2) OWL and (3) RRF (Rich Release Format). These ontologies have been submitted by several users, making it the largest collection of user generated content in the biomedical domain.

In addition, metadata information as well as mappings are also stored for each ontology. Also, the labels used by different users are unified by mapping them to the SKOS vocabulary. The ontologies are public, licensed or private. If private, the use of an API key included in the SPARQL HTTP call allows a user to access the ontology, which is definitely a plus considering biomedical data is sometimes sensitive and cannot be openly published. All the data in BioPortal, that is, the ontology, the metadata and the mappings can be queried via SPARQL.
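
To make the access pattern described above concrete, here is a minimal sketch of how a client might build a SPARQL HTTP request that carries an API key. It is an illustration only: the endpoint path `/sparql/` and the parameter name `apikey` are assumptions not stated in this review, and the key value is a placeholder. The `query` parameter follows the standard SPARQL protocol, and the query asks for labels through the SKOS predicate to which the labels are said to be mapped. The sketch only builds the URL and does not contact the endpoint.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumption: the host comes from the paper's abstract; the "/sparql/" path is a guess.
ENDPOINT = "http://sparql.bioontology.org/sparql/"

# Ask for preferred labels through the unified SKOS predicate.
QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?term ?label
WHERE { ?term skos:prefLabel ?label }
LIMIT 10
""".strip()


def build_request_url(endpoint, query, api_key=None):
    """Return a GET URL for a SPARQL query, optionally carrying an API key.

    The parameter name "apikey" is hypothetical; check the endpoint's
    documentation for the real name before relying on it.
    """
    params = {"query": query}
    if api_key is not None:
        params["apikey"] = api_key
    return endpoint + "?" + urlencode(params)


# "YOUR-API-KEY" is a placeholder, not a working credential.
url = build_request_url(ENDPOINT, QUERY, api_key="YOUR-API-KEY")

# Decode the URL again to confirm that both parameters survive encoding.
decoded = parse_qs(urlparse(url).query)
print(sorted(decoded))
```

Passing the key as a request parameter keeps the query text identical for public and private ontologies; only the request changes.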

Providing such a large repository of biomedical ontologies along with mappings within them is definitely a great effort in unifying researchers in this field. However, there are a few aspects that are to be considered and were lacking from the paper, such as:
- the description of a use case or potential usefulness of the ontologies should have been portrayed so as to add more meaning to the effort of creating such a large repository. For example, integration of several ontologies for a specific use case or the adoption of any one ontology in a biomedical application would have been useful.
- a list of the coverage of the various biomedical areas would be beneficial to show the versatility of the ontologies in BioPortal.
- since the majority of the data is user generated, an evaluation of the correctness and overall quality of the ontologies would be helpful not only for potential users of BioPortal but also in evaluating the potential of creating similar large-scale user-created repositories. It would be useful to have a quality check before a user submits an ontology to BioPortal; if such a quality check exists, it should be mentioned.
- even though there is a mention of the mapping between two terms from different ontologies, it should be explained how ambiguity between different naming conventions is resolved. Additionally, the precision of the automatic mappings should also be reported.
- an example for the RDF Dataset Creation Workflow would be helpful and would clarify the process.
- also it should be mentioned whether each ontology as well as its previous versions are downloadable.

The paper is well written and the dataset description is clear. However, there are a few obvious mistakes in the presentation that need to be fixed, such as the blank line on page 2 and the overlapping text on page 5. The URL http://www.bioontology.org/wiki/index.php/Sparql_bioportal currently shows no content on the page.

Overall, I think BioPortal is an important effort with potentially high impact. Some crucial information is missing from the article, but I recommend accepting it if those issues are resolved.

Solicited review by anonymous reviewer:

In general this is a good paper that describes work that builds on the
remarkable previous experience of this group. However, little insight
is given on certain aspects to do with design choices and
patterns, type and statistics of usage (by web users) of the available
dataset, or certain quality aspects: a fundamental one being the
quality of the links. Overall there is no critical discussion or
insight about possible problems. This is a shortcoming of the
paper. I would like to have several of these issues addressed in the paper's final version.

Below are the answers to specific points mentioned in the submission
guidelines, followed by more detailed comments (mostly typos).

1) Name, URL, version date and number, licensing, availability, etc.

This information is given except for the version date and number.

2) Topic coverage, source for the data, purpose and method of creation
and maintenance, reported usage etc.

Reported usage is not described. It would be nice to have an idea of
the number of users, for example. When was the dataset made available
with a SPARQL endpoint?

3) Metrics and statistics on external and internal connectivity, use of
established vocabularies (e.g., RDF, OWL, SKOS, FOAF), language
expressivity, growth.

The statistics provided are for internal connectivity. Apart from the
information that the ontologies have been registered at
thedatahub.org, there is not much else about the connectivity to the
exterior. The growth of the ontologies is not reported exactly, just
some details, for example the frequency at which a new version of GO
is released (daily). There is no discussion of expressivity.

4) Examples and critical discussion of typical knowledge modeling
patterns used. Known shortcomings of the dataset.

There is presentation of the models used and the links established,
whose detail is commensurate with the amount of pages used, though the
paper relies mostly on a couple of examples rather than on modeling
patterns or usage patterns. Overall there is no critical discussion or
insight about possible problems. This is a shortcoming of the paper.

5) Quality of the dataset

The quality is related to the quality of the ontologies themselves,
which are well established. The quality of the internal links that
were established is however not discussed, just the methods used to
create them.

6) Usefulness (or potential usefulness) of the dataset

Again, this measure is established by the quality of the ontologies
themselves, but usage patterns, and better examples for the SPARQL
queries would give a better idea of what can be achieved.

7) Clarity and completeness of the descriptions

Given the space available, the paper is reasonably good but could be made
better if the examples were clearer. A problem is that for some of the
pictures, the captions could be improved upon (see below).

----------------------------------------------------
Detailed comments, including typos

aspects of BioPortal's ecosystem ->
aspects of the BioPortal's ecosystem

of predicates rewrite ->
of predicates to rewrite

avoid use of "she" or "he"; for example, write:
users do not need ... and can query on the ...

Notice the line break after "occurrence of"

In "The predicate used in this case is
http://NIF-RTH.owl#core_prefLabel.", is the meaning supposed to be:

"The custom predicate used in this case is http://NIF-RTH.owl#core_prefLabel." (if so, this connects nicely with the text).

terms[17]. -> terms {space} [17].

Caption of Figure 5 (several things):

1) mapped term -> mapped terms

2) The Process info is the same for all mappings that the process
generated and all mapping records point to it. ->? The Process
information is the same for all the mappings that the process
generated and all the mapping records point to it [it is not clear
here: does it refer to the "Process information"?]

It is written:
"Figure 6, shows how to retrieve all IDs for
ontology content graphs."

But the query actually retrieves the pairs (version, graph). Which IDs
does it refer to? Graph IDs? Then the text should be changed to say
that the query returns the pair version and graph (ID).

Is it really a one-to-one relation between ontologies and named
graphs? Then, where is the redundancy? This needs to be explained.

term URI -> URI term

UMLS2RDF is a set of scripts that connect

->

UMLS2RDF is the set of scripts that connect (or) UMLS2RDF is a set of the scripts that connect

Suggested rewriting:

We process the pipeline in Figure 7 daily at midnight
PST time. Ontology changes are propagated to
the triple store overnight and updates cannot be seen
until the next day.

->

We process the pipeline in Figure 7 daily at midnight PST
time. Ontology changes are propagated to the triple store overnight
and updates can be seen the following day.

It supports de-referencing about URIs -> It supports de-referencing of URIs

endpoint gets users not only to
the ontology content, but also to the ontology content,
metadata and mappings between terms in different ontologies.

->

endpoint gets users not only to
the ontology content, but also to the
metadata and mappings between terms in the different ontologies.

analyse

->

analyze

Open Source {space} . -> Open Source.

Capitalize where appropriate;

Bechhofer. The owl api: A java api for owl
ontologies. Semantic Web, 2(1):11–21, 2011.

Use consistent bibliographic style:

N. F. Noy vs. Natalya F. Noy

URLs need to be reformatted throughout the paper.

Solicited review by Giovanni Tummarello:

The authors describe their work to SPARQL-enable the ontologies on BioPortal. The work is certainly notable for being the largest such repository. The uniform syntax (RDF) and the fact that it is accessible via SPARQL facilitate access and use by outside parties.
I think the work is sound, so I have no other major comments.
