Facilitating integrated analysis of biological data by enhancing interoperability of RDF resources: Practical Recommendations

Paper Title: 
Facilitating integrated analysis of biological data by enhancing interoperability of RDF resources: Practical Recommendations.
Aravind Venkatesan, Ward Blondé, Erick Antezana, M Scott Marshall, Andrea Splendiani, Mikel Egaña Aranguren, James Malone, Vladimir Mironov, Martin Kuiper
The Resource Description Framework (RDF) has evolved into a language of choice for knowledge exchange on the (Semantic) Web due to its robustness and queryability with SPARQL. However, RDF allows for a variety of modelling practices which, if used unchecked, would result in the production of resources that may be only partially compatible with other resources. We highlight the issues related to incomplete interoperability that exist today with the results of a limited survey of the current interoperability of public RDF resources. We propose a set of informed guidelines that can help in overcoming the problem.
Submission type: 
Full Paper
Responsible editor: 
Reject and Resubmit

Review 1 by Paul Groth
This paper describes an analysis of four existing integrated biomedical databases based on Semantic Web technologies: Bio2RDF, Neurocommons, HCLS KB, and Linked Life Data. These are not biomedical databases on their own but instead integrate a variety of existing biomedical databases using RDF conversions. The analysis consists of a brief description of each database and the execution of a simple query (with some variants) to retrieve the neighborhood of information around a protein. The results of running this query are then tabulated, showing a divergence in how these integrated databases use RDF.

Based on this analysis, the authors present a series of suggestions for making these databases better. They introduce a 7-star system inspired by the 5-star guide to linked data. The stars correspond to simple recommendations such as always providing human-readable names, using HTTP URIs, and using OWL EL.

Overall, I think the paper highlights an important problem: how does one deal with the heterogeneity in (life sciences) databases even when there is syntactic interoperability through RDF.

However, I think the paper's analysis does not go into enough depth, especially to support the various recommendations given. First, just as a basic point, it would have been good to state where the figures for each of the databases come from and how they were obtained (see Table 4). In particular, 4.1 trillion triples seems dramatic for Linked Life Data. I checked their about page and it says 3.8 billion triples. Additionally, maybe some more statistics could be given: number of datasets, properties, classes, vocabularies used, etc. Second, the focus of the analysis is on the "human readability" of the query results, or at least the ability to obtain such human readability. There is no motivation of why this is important. Third, I would have expected a larger number of queries to be performed to substantiate particular claims. Why shouldn't reification be used in biomedical RDF? What in the analysis supports this? What are the "most basic" features of RDF and why do these databases not comply? What is wrong with blank nodes and why does this analysis support that conclusion? In some sense, what the recommendations detail may be seen as good practice, but the analysis does not provide evidence for that good practice.

Finally, the authors emphasize that SPARQL queries should be simple to use, for example, by not having to use OPTIONAL and UNION, but these constructs are there for the express purpose of dealing with messy data. The fundamental tenet is that biologists should be the ones writing SPARQL queries. This may be the case, but it also means that data needs to be normalized and cleaned in a data-warehouse-type fashion. The authors suggest that the community do this by coming together. It would be interesting to make this case clearer in the paper.
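
As a rough illustration of what a user faces when naming properties are not normalized, the sketch below builds a neighbourhood query that needs one OPTIONAL block per candidate label predicate. It is only a sketch under stated assumptions: the helper name `build_label_query` and the example protein URI are hypothetical, and the three predicates (rdfs:label, dc:title, foaf:name) are simply common choices for human-readable names, not a list taken from the paper's queries.

```python
# Hypothetical helper: builds a SPARQL query that fetches the neighbours of a
# resource and tolerates stores that use different label predicates.
LABEL_PREDICATES = [
    "http://www.w3.org/2000/01/rdf-schema#label",  # rdfs:label
    "http://purl.org/dc/elements/1.1/title",       # dc:title
    "http://xmlns.com/foaf/0.1/name",              # foaf:name
]


def build_label_query(resource_uri: str) -> str:
    """Return a SPARQL SELECT query with one OPTIONAL block per label predicate."""
    # One OPTIONAL per predicate, so a neighbour without that label is still returned.
    optionals = "\n".join(
        f"  OPTIONAL {{ ?neighbour <{pred}> ?label{i} . }}"
        for i, pred in enumerate(LABEL_PREDICATES)
    )
    # COALESCE picks the first label variable that is actually bound.
    label_vars = ", ".join(f"?label{i}" for i in range(len(LABEL_PREDICATES)))
    return (
        f"SELECT ?property ?neighbour (COALESCE({label_vars}) AS ?label)\n"
        "WHERE {\n"
        f"  <{resource_uri}> ?property ?neighbour .\n"
        f"{optionals}\n"
        "}"
    )


if __name__ == "__main__":
    # Placeholder URI for illustration only; it is not taken from the paper.
    print(build_label_query("http://example.org/protein/P12345"))
```

If every store agreed on a single label property, the three OPTIONAL blocks and the COALESCE would collapse into one plain triple pattern, which is the simplification the authors seem to be asking for.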

Summing up, the paper has a good idea: use an analysis of current biomedical linked data to support recommendations for going forward. However, the analysis is too lightweight and does not provide enough evidence for the proposed recommendations.

Minor comments:
- In a number of places "the" is used oddly (e.g. "These are widely used by the biomedical scientists to make inferences about the uncharacterised entities."). What are "the biomedical scientists" and "the uncharacterised entities"? The sentence would read better as "These are widely used by biomedical scientists to make inferences about uncharacterised entities." Just something to look out for.

Review 2 by Michel Dumontier

This manuscript reports on a simple SPARQL query-based investigation and makes recommendations for greater interoperability between hosted RDF resources. The authors report on the lack of (consistent) conformance to particular metadata vocabularies (rdfs:label, dc:title, foaf:name). Unfortunately, this research is methodologically flawed, makes numerous factual errors and fails to convince the reader that their recommendations would in fact lead to greater interoperability.

Major revisions
p1. "Unlike some other scientific domains" - False - there are thousands of mathematical/quantitative biological models; a good repository is the EBI's BioModels database (there are others). Anyway, I don't see how this is pertinent to the argument.

p2. Bio-ontologies, such as the Gene Ontology, were first created not to capture knowledge, but to be used in consistent annotation across databases.

p2. "it has been observed that there are so far very few successful implementations that exploit the full capabilities of automated reasoning offered by OWL." Robert Hoehndorf and I have been at the forefront of doing so:
* Semantic integration of physiology phenotypes with an application to the Cellular Phenotype Ontology. Bioinformatics. 2012
* New approaches to the representation and analysis of phenotype knowledge in human diseases and their animal models. Brief Funct Genomics. 2011
* Integrating systems biology models and biomedical ontologies. BMC Syst Biol. 2011 Aug 11;5:124.
* Self-organizing ontology of biochemically relevant small molecules. BMC Bioinformatics. 2012 Jan 6;13:3.
* Towards pharmacogenomics knowledge discovery with the semantic web. Brief Bioinform. 2009 Mar;10(2):153-63. Epub 2009 Feb 24.
-> as a point of discussion, the approaches that we've taken aim to address the problem of semantic interoperability, which is surely a greater challenge than that being identified in this paper.

p2 - robustness; what are the criteria of robustness? We have shown that SPARQL queries involving cyclic conditions easily defeat SPARQL implementations
* Chemical Entity Semantic Specification: Knowledge representation for efficient semantic cheminformatics and facile data integration. J Cheminform. 2011 May 19;3(1):20.
-> you might want to additionally cite other work than yours (or mine!).

p3 - Bio2RDF is jointly maintained by Carleton University and CHUL. It also offers a faceted search for each endpoint accessible at http://namespace.bio2rdf.org/fct
p3 - Is Neurocommons active?
p3 - LinkedLifeData - you can download the data - ftp://ftp.ontotext.com/pub/lld/
p3 - Bio2RDF is *not* a data warehouse. There is a separate endpoint for each dataset. It also provides access to other resources by mapping to other SPARQL endpoints (e.g. PubMed, PubMed Central).

p4 - querying the UniProt endpoint for Bio2RDF gives 192 triples. Bio2RDF has implemented its own federated search, which finds all mentions / incoming links

* it should be stated that Bio2RDF uses the UniProt RDF distribution, but removes blank nodes, normalizes URIs and adds rdfs:label and dc:title. I assume that the others may (or may not) do this, so problems with this dataset should ultimately be attributed to UniProt.

* the authors should definitely look at the whole of the data that is being offered by each and draw some overall conclusions about provided data.

* moreover, since Bio2RDF is not a warehouse, it does not host all the linked to datasets in the same store. Thus, linked resources may not have rdfs:label. Unless you're prepared to execute a federated SPARQL query, the label on the linked node *must* be put in an optional clause.

* it's not clear *why* there is a difference in results. Should they be the same, or be different? The authors need to examine these differences and describe their basis and how it pertains to the interoperability message.

* it's not clear what the purpose of showing us different statements from each of the stores is. I would expect a comparison of statements, or kinds of statements provided.
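
To indicate what a comparison of the kinds of statements could look like, here is a minimal sketch that tallies which predicates appear in the results from each store and separates the shared ones from those seen in only one store. The function names, the store names `store_a` and `store_b`, and the toy triples in the usage example are all hypothetical; nothing here reproduces data from the manuscript.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object) as plain strings


def predicates_by_store(results: Dict[str, Iterable[Triple]]) -> Dict[str, Set[str]]:
    """Map each predicate to the set of stores whose results contain it."""
    usage: Dict[str, Set[str]] = defaultdict(set)
    for store, triples in results.items():
        for _subject, predicate, _object in triples:
            usage[predicate].add(store)
    return dict(usage)


def shared_and_exclusive(
    usage: Dict[str, Set[str]], stores: Iterable[str]
) -> Tuple[Set[str], Dict[str, str]]:
    """Split predicates into those used by every store and those used by exactly one."""
    all_stores = set(stores)
    shared = {pred for pred, used_by in usage.items() if used_by == all_stores}
    exclusive = {
        pred: next(iter(used_by))
        for pred, used_by in usage.items()
        if len(used_by) == 1
    }
    return shared, exclusive


if __name__ == "__main__":
    # Toy input, invented for illustration only.
    toy = {
        "store_a": [("ex:p1", "rdfs:label", "kinase"), ("ex:p1", "rdf:type", "ex:Protein")],
        "store_b": [("ex:p1", "dc:title", "kinase"), ("ex:p1", "rdf:type", "ex:Protein")],
    }
    shared, exclusive = shared_and_exclusive(predicates_by_store(toy), toy)
    print("shared:", sorted(shared))
    print("exclusive:", exclusive)
```

A table built this way over the complete query results would show directly where the stores diverge, rather than leaving the reader to compare sample statements by eye.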

p7 - since OWL2 supports meta-modelling, OWL-DL remains decidable in the presence of class/instance axioms. There is absolutely no preliminary discussion of OWL-EL nor why it is needed in this context (we transform OWL-DL ontologies into OWL-EL for faster query answering). You should just remove any reference to this, as I can't see how it pertains to the current investigation.

I have difficulty seeing how the recommendations are supported by the results.
* URIs
- are the URIs provided dereferenceable? Did you do an experiment and report the results of URI dereferenceability?
* (basic) RDF
- what elements are included in basic RDF(S)? Are the vocabulary elements (e.g. rdf:type) used appropriately in the RDF data that you examined?
* avoid other RDF
- which of the stores use these? and why is it bad?
* descriptive labels
- which datasets/stores did not use any of the descriptive properties? Did they use their own? Do the datasets provide them in the first place?

In summary, I don't find the current analysis particularly accurate or compelling, and significantly more work has to be done to analyze the use of metadata annotation, or peer into vocabulary usage. I'm not sure what best practice *is* (unless it's implicit agreement or explicit statement of conformance), but I can usually tell when there is *no* practice or *bad* practice.

* table 4 shows human readable labels for BioGateway, but not Neurocommons