eagle-i: biomedical research resource datasets

Tracking #: 456-1633

Carlo Torniai
Daniela Bourges-Waldegg
Scott Hoffmann

Responsible editor: 
Jens Lehmann

Submission type: 
Dataset Description
In this paper we present the linked data sets produced by the eagle-i project. We describe the content, the features and some of the applications currently leveraging these datasets.

Major Revision

Solicited Reviews:
Review #1
By Francois Scharffe submitted on 03/May/2013
Major Revision
Review Comment:

Each of the datasets published by the eagle-i project describes biomedical resources in academic institutions. Each of the 25 institutions has a dataset published as linked data together with a SPARQL endpoint. Each dataset follows the same structure and is described using a number of ontologies, mainly OBO ontologies. Various tools allow browsing of the datasets and ontologies, and the datasets are used in tools that are not directly part of the project. The whole constitutes an important biomedical resource.
There is however room for improvement:
URI patterns could be improved, e.g. http://ccny-cuny.eagle-i.net/i/00000137-568d-4cd7-bb24-040880000000 is a resource describing a person at the City College of New York. Use of a mixed ID/label pattern would allow easier usage of the dataset. The same applies to ontology terms: SPARQL queries are harder to build as a result. An example given in the paper:
?resource obo:ERO_0000543 obo:ERO_0000652.
?resource obo:ERO_0000021 ?person. }
This is a known issue that the authors point out in the paper, arguing that it can be overcome with proper documentation. Tooling can certainly also help, since the end user should not have to deal with SPARQL, but the issue remains.
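To illustrate how documentation or tooling could hide the opaque identifiers, here is a minimal sketch in Python. The alias names are invented placeholders (the real labels would have to be taken from the ontology), and the SELECT clause is an assumption, since the fragment quoted above does not include it. Only the two triple patterns and the OBO PURL prefix come from the paper and the OBO conventions.

```python
from string import Template

# Standard OBO PURL prefix.
OBO = "http://purl.obolibrary.org/obo/"

# Hypothetical aliases: placeholders standing in for the real term labels,
# which should be looked up in the ontology documentation.
ALIASES = {
    "term_543": "ERO_0000543",
    "term_652": "ERO_0000652",
    "term_021": "ERO_0000021",
}


def build_query(aliases=ALIASES):
    """Expand readable aliases into the opaque term IDs used in the query."""
    template = Template(
        "PREFIX obo: <$obo>\n"
        # Assumed projection: the quoted fragment omits the SELECT clause.
        "SELECT ?resource ?person WHERE {\n"
        "  ?resource obo:$term_543 obo:$term_652 .\n"
        "  ?resource obo:$term_021 ?person .\n"
        "}"
    )
    return template.substitute(obo=OBO, **aliases)


if __name__ == "__main__":
    print(build_query())
```

With such a table in place, a query author would only need to know the aliases, not the numeric identifiers.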

Some resource descriptions seem to be really shallow. For example, the person at the URI given above is described using foaf:Person, but no foaf property is used. Only metadata such as label, contributor, and created appear in the resource description.

The paper does not mention any linkage to external datasets. A little investigation shows that many links could be established, to publication datasets and to biomedical datasets; for example, the DBpedia and GeoSpecies datasets could be linked to the species described in this dataset. This is the major point of concern.
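As a rough illustration of how candidate links to an external dataset could be generated, here is a minimal sketch in Python based on label similarity. It is not the authors' method and not a substitute for a proper link discovery tool; all URIs and labels below are invented sample data, and the 0.9 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Invented sample data: URI -> label. Real labels would come from the
# eagle-i dataset and from the external dataset being linked to.
local = {
    "http://example.org/local/species/1": "Mus musculus",
    "http://example.org/local/species/2": "Homo sapiens",
}
external = {
    "http://example.org/external/mus_musculus": "mus musculus",
    "http://example.org/external/danio_rerio": "Danio rerio",
}


def candidate_links(local, external, threshold=0.9):
    """Return (local_uri, external_uri) pairs whose labels are similar."""
    links = []
    for l_uri, l_label in local.items():
        for e_uri, e_label in external.items():
            # Case-insensitive similarity ratio in [0, 1].
            score = SequenceMatcher(None, l_label.lower(), e_label.lower()).ratio()
            if score >= threshold:
                links.append((l_uri, e_uri))
    return links


def to_ntriples(links):
    """Serialize candidate links as owl:sameAs statements in N-Triples."""
    return ["<%s> <%s> <%s> ." % (s, OWL_SAME_AS, o) for s, o in links]


if __name__ == "__main__":
    for line in to_ntriples(candidate_links(local, external)):
        print(line)
```

Candidates produced this way would still need curation before being published as links.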

Every published dataset follows the same structure, but no federation mechanism is provided. Ready-to-use SPARQL federation tools exist and could be used (see for example http://www.fluidops.com/fedx/).
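Because all datasets share the same structure, a single pattern could be sent to every endpoint. The sketch below, in Python, builds one query using the standard SPARQL 1.1 SERVICE keyword; it does not use FedX, and the endpoint URLs are hypothetical (the paper's actual endpoint addresses should be substituted). The triple patterns are the ones quoted from the paper above.

```python
OBO_PREFIX = "PREFIX obo: <http://purl.obolibrary.org/obo/>"

# Triple patterns quoted from the paper's example query.
PATTERN = (
    "?resource obo:ERO_0000543 obo:ERO_0000652 . "
    "?resource obo:ERO_0000021 ?person ."
)

# Hypothetical endpoint URLs, for illustration only.
ENDPOINTS = [
    "http://institution-a.example.org/sparql",
    "http://institution-b.example.org/sparql",
]


def federated_query(endpoints, pattern=PATTERN):
    """Build a SPARQL 1.1 query that unions one SERVICE block per endpoint."""
    if not endpoints:
        raise ValueError("at least one endpoint is required")
    blocks = ["  { SERVICE <%s> { %s } }" % (url, pattern) for url in endpoints]
    body = "\n  UNION\n".join(blocks)
    return "%s\nSELECT ?resource ?person WHERE {\n%s\n}" % (OBO_PREFIX, body)


if __name__ == "__main__":
    print(federated_query(ENDPOINTS))
```

The resulting query would be submitted to any SPARQL 1.1 endpoint that permits federated queries; a dedicated federation engine would additionally handle source selection and join optimization.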

Review #2
By Boris Villazon-Terrazas submitted on 16/May/2013
Minor Revision
Review Comment:

This paper describes a set of linked data sets produced in the context of the eagle-i project. The data sets contain information related to biomedical research resources available at 25 institutions. These data sets are very interesting and the paper is in general suitable for publication in the special issue. However, from my point of view, the paper leaves out some important information related to the linking. Which tools and which link specifications were used, or did you perform just a simple string similarity matching?

Minor comments:
- It would be interesting to have a link to the ETL toolkit and SWEET
- what is the original format of the source data?
- most of the URLs presented in the paper should be footnotes
- links to other relevant datasets are missing (I do not mean schema mappings)
- Section 7: rdf:label -> rdfs:label
- would you please include a pointer to the repository in which you stored the "global" resources?
- would you please include a pointer to the web application that allows curators to clean the data?
- not all elements of the ontology are dereferenceable, e.g., http://eagle-i.org/ont/repo/1.0/Person
- regarding general concepts/classes, e.g., http://eagle-i.org/ont/repo/1.0/Person, why didn't you reuse general vocabularies like foaf?
- which triplestore are you using?

Review #3
By Amrapali Zaveri submitted on 23/Jun/2013
Major Revision
Review Comment:

The article "eagle-i: biomedical research resource datasets" describes the efforts of a two-year project of gathering biomedical research data from 25 different institutions and making them available through a semantically enabled, federated search system and as linked data sets. The article consists of all the information about the data, the source, type, modeling and availability and additionally contains three interesting use cases to portray the usefulness of the data.

The effort to, as the authors state, "make these 'invisible' research resources more discoverable" is definitely commendable considering the wide range of information that they make available from 25 different institutions. All the data is important, and gathering it in one large repository and making it available as linked data is definitely a step in the right direction. Also, there are already users of this dataset.

The authors have covered the majority of the points necessary for dataset description articles. However, the major aspects that are lacking are:
(i) Interlinks to other external data sources, including quantity, quality and purpose
(ii) Quality of the data itself: how good/accurate is the ETL process? More information about SWEET is also needed: who are the users and how is it useful? Also, a report of the quality issues that the current users may have encountered.
(iii) Description (and example screenshot) of the web-based search application mentioned in Section 1 - does it support keyword search?
(iv) Explicit licensing information preferably as a VoID description of the dataset which also includes the versioning information
(v) Related work or related initiatives such as Bio2RDF etc.
(vi) It would be interesting to know about the performance of the SPARQL endpoints considering the huge amount of data that is queried.
(vii) How often is/would the data (be) updated? Does it change often? Are the older versions available?

The paper is well written throughout and I only have a few minor comments:
- I would align the numbers according to the units or to either side in Tables 1 and 2.
- Figure 1 is a bit unclear when printed. I recommend increasing the font size a bit.
- "soft stackware": did you mean "software stack"?
- The sentence "The lack of a single SPARQL query interface to search over all of the eagle-i datasets at once, but is easily overcome using programmatic access." is incomplete.
- Instead of referring to the blog post in reference 5, I would add the link to the paper: http://www.carlotorniai.net/docs/integrated_pipeline.pdf

As a side note, I would like to point the authors to this paper: http://www.ncbi.nlm.nih.gov/pubmed/19397794.