TraitBank: Practical semantics for organism attribute data

Tracking #: 650-1860

Paper Data

Cynthia Sims Parr, Nathan Wilson, Katja S Schulz, Patrick Leary, Jennifer Hammock, Jeremy Rice, Robert J. Corrigan, Jr.
Encyclopedia of Life (EOL) has developed a new web-accessible repository, TraitBank, to better serve scientific discovery. EOL's TraitBank aggregates and manages attribute (trait) data across the tree of life including morphological descriptors, life history characteristics, habitat preferences, and interactions with other organisms. This paper describes how TraitBank uses Darwin Core and other standards to ingest and subsequently manage trait data in a Virtuoso triple store in a way that leverages EOL’s extensive existing infrastructure. We add to and improve the semantics of both data and metadata in order to improve interoperability across the domains of morphology, ecology, and genomics. The system takes a semantic approach and also emphasizes practicality and ease of use for experts and non-experts. In addition to aggregating trait data in existing literature or databases, TraitBank contributes to community-based ontologies and sets the stage for a rapid rise in annotations about attributes on specimens and citizen science observations.
Submission type: 
Tool/System Report
Full PDF Version: 
Major revisions

Solicited Reviews:
Review #1
Anonymous submitted on 02/Oct/2014
Minor revision
Review Comment:

The paper entitled 'TraitBank: Practical semantics for organism attribute data' presents TraitBank, an interface for and knowledge base of trait data about the millions of types of organisms that are listed in the Encyclopedia of Life. Trait information is attribute data that can be utilized for many different kinds of research studies, thus semantic representation of trait knowledge in a large publicly available repository has a great deal of potential value. The difficulty in semantic representation of traits is that an extremely wide variety of observable characteristics can be characterized as traits, and they can be expressed both in terms of genetic and environmental causes as well as phenotypic effects (e.g., wings, red color). The representation of trait knowledge, thus, provides an excellent illustration of the complexities in modeling semantics at the intersection of several distinct fields of the life sciences (including morphology, ecology, and genomics) that are all developing ontologies and vocabularies in parallel. The motivation for the paper is clear and it is well-written, with a concrete implementation and as such I think it is a valuable contribution to the special issue. However, there are some weak points and issues that I would like the authors to address in the paper.

The approach described in section 2 to utilize lightweight integration with little emphasis on the implications of formal reasoning is probably the most pragmatic approach to take when trying to create a trait knowledge base for 2 million organisms based on existing ontologies and standards. In effect, the existing ontologies are being interpreted as controlled vocabularies. As such, it is likely that logical inconsistencies will be found if someone does wish to do reasoning on this trait knowledge. In that case, I wonder what the advantage is of relying on the formal ontologies described, other than as a bootstrapping of sorts. Certainly, this is the case for the provisional terms introduced by EOL staff. In the introduction the authors make the point that much trait information can be found in free text, therefore it seems plausible that text mining of unstructured text to identify widely referenced traits would be a valid approach. Why not use keywords from relevant publications, or crowdsourced tags from the community?

Many of the ontologies listed as sources are from the OBO foundry, which is largely developed from the perspective of the -omics community. E.g., using ENVO as a source of habitat information seems reasonable, except that it is a rather limited taxonomy of environmental terms. Other communities such as ecologists and environmental scientists might have very different schemes for organizing habitat concepts. I am not suggesting that the authors should have implemented TraitBank differently in this version, but it would be good to get some perspective in the paper on alternative approaches and some more justification for the choices that were made in terms of which ontologies were used.

The lightweight approach advocated by the authors is highly compatible with the Linked Data initiative. The interface as described provides an access point to the data that is familiar to current users of EOL, and that part of the paper is a rather standard application description. There is very brief discussion about the possibility of setting up a SPARQL endpoint. For the community of this journal, I would like to see more discussion about possible links to relevant, existing linked data on the web and other efforts in this area, including by groups like TDWG and GBIF. How can the connections in the RDF model to elements from DarwinCore be used to link to other data sources?
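
To make the last question concrete, here is a minimal sketch of how a shared Darwin Core term could serve as a join key between two sources. The records, the `trait:` and `occ:` identifiers, and the helper `link_by_term` are hypothetical; only the Darwin Core term IRI is a real identifier, and nothing here is meant to reflect TraitBank's actual RDF model.

```python
from collections import defaultdict

# Real Darwin Core term IRI; everything else below is made-up illustration data.
DWC_SCIENTIFIC_NAME = "http://rs.tdwg.org/dwc/terms/scientificName"

# Hypothetical (subject, predicate, object) triples from two separate sources.
trait_triples = [
    ("trait:1", DWC_SCIENTIFIC_NAME, "Panthera leo"),
    ("trait:2", DWC_SCIENTIFIC_NAME, "Ursus arctos"),
]
occurrence_triples = [
    ("occ:9", DWC_SCIENTIFIC_NAME, "Panthera leo"),
    ("occ:10", DWC_SCIENTIFIC_NAME, "Canis lupus"),
]


def link_by_term(left, right, predicate):
    """Pair subjects from two triple lists that share a value for `predicate`."""
    # Index the right-hand source by the value of the shared predicate.
    index = defaultdict(list)
    for subject, pred, value in right:
        if pred == predicate:
            index[value].append(subject)
    # Look up each left-hand subject's value in that index.
    links = []
    for subject, pred, value in left:
        if pred == predicate:
            for other in index.get(value, []):
                links.append((subject, other, value))
    return links


links = link_by_term(trait_triples, occurrence_triples, DWC_SCIENTIFIC_NAME)
print(links)
```

Matching on name strings is fragile in practice, so a discussion of which Darwin Core elements carry stable identifiers suitable for linking would be the more informative part of the answer.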

Section 4 is perhaps the weakest part of the paper. Since the paper is not a full research paper but a tool/system report I can forgive the lack of a formal evaluation. The system performance description seems surprising as the amount of data is large but not so large that appropriate indexing techniques would not help, and the asynchronous downloading is not really a solution to the speed problem. This part is not really relevant to the overall paper, in my opinion, as it is more a war story about the implementation. More recent results on how TraitBank is being used (e.g., some quantitative statistics about what kinds of queries are being made) would go a long way toward strengthening this section.

Review #2
Anonymous submitted on 06/Oct/2014
Minor revision
Review Comment:

This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

This manuscript describes TraitBank, a new feature of EOL released in January 2014 that provides trait data annotations for entities across the tree of life represented in EOL. I feel some need to separate the manuscript from the product it describes, but even though there are some issues with both, my overarching sense is that this manuscript can be published with only minor revisions. I have made a number of specific (sticky notes) comments in a PDF that will be transmitted to the authors separately.

Regarding EOL TraitBank - I feel that there is an obvious tension here with the project as a whole, and with the ontological approach, rooted in the obvious tension field that EOL operates in generally, which is eternally walking the line between both the high-powered and the more enthusiast-oriented user communities. Quite clearly the OBO Foundry or other fully developed ontology solutions are not really there to help build TraitBank. On the other hand, building major parts from scratch and/or in isolation from other "standards" (such as OBO), can have unpredictable consequences. Maybe TraitBank is *better* than OBO, in a number of critical ways (but then, which are those? Do we know?). Or maybe it will perpetuate the undesirable features of descriptive taxonomic biology (every source speaks its own language) that ontologies were meant to overcome.

In either case, in creating TraitBank the authors seem to have made a number of highly pragmatic decisions, and clearly say so. Hence I would say the jury is still out on TraitBank's deeper merits, but no false promises are being made either. In the meantime, it is great to have an opportunity to populate and assess these tools to get a better picture.

Regarding the manuscript - Regardless of the above, I feel that the presentation of the manuscript is a little anemic, not very engaging, too scattered, and too much like an infomercial. Presumably there are some infrastructure requirements, ontology requirements, data format/availability requirements, etc. - that TraitBank needs to address. And all of this is rather novel in its scope and complexity. Are there *any lessons learned?* Why do all this? How well does it seem to fare with contributors and users? What questions might we be able to address *only* with TraitBank? At times, reading the manuscript, I felt like shaking my head; other times I felt like really rooting for this effort. Certainly there is a lot of development and thought that has gone into TraitBank. But where is the *assessment* of whether the current implementation is adequate to perform the kinds of inferences it is meant to perform? I do not just want to know what features are there, or will be there, and how it all works. TraitBank is an attempt to essentially add an ontology-based trait layer onto a huge biodiversity data aggregator platform. I am not sure this has been done before. What that all means, where it can go, why it might succeed or fail in certain ways - is what I'd like to hear more about. The authors are in a position to educate the community here, about TraitBank and the power of trait-based ontology approaches for biodiversity data in general.

In short, more assessment, critical higher level perspective, and conveying of enthusiasm and real (hard, soft) challenges in a revised manuscript might do the TraitBank effort more justice than the current version, which to me is too much of "a check in every square".

Review #3
By Birgitta Koenig-Ries submitted on 08/Oct/2014
Minor revision
Review Comment:

This manuscript was submitted as 'Tools and Systems Report' and should be reviewed along the following dimensions: (1) Quality, importance, and impact of the described tool or system (convincing evidence must be provided). (2) Clarity, illustration, and readability of the describing paper, which shall convey to the reader both the capabilities and the limitations of the tool.

Traits are an important topic in biodiversity research and have gained considerable interest in the last few years. Tools to ease integration of trait data are desperately needed. Thus, the paper addresses a real need of the research community.

Overall, the paper is well-written and very readable. However, in some sections, I would have liked some further explanations (see below).

1. Introduction:

The SWJ journal is read primarily by computer scientists. For those readers, providing a few examples of traits (plant height, leaf length, fur color or whatever else) might be helpful.

You mention that automated methods for trait measurement accelerate data generation. That is, of course, true - but don't they also promote standardisation?

You state that no project provides trait data across organisational groups. Is that really such a big disadvantage? Aren't traits mostly rather group specific and thus, wouldn't it suffice to have global systems for plants (like TRY) and other groups and build bridges between those? What is the advantage of having the integrated system from the start?

2. Approach:
Figure 1: Please provide an explanation of the meaning of the arrows and their direction in Figure 1. To me, for instance, an arrow from Taxon to Occurrence seems to indicate that I can move/reason/.. from a specific taxon to occurrences of that taxon, but not the other way round. Is that the case? If so, please provide the rationale behind it.

You build on quite a number of existing ontologies. Notably missing from this list are OBOE and BCO. Wouldn't one of those two have been a good starting point to model occurrences, events, and measurements?

I would have liked to see a more complete description of the data model, in particular the associated metadata, which seems to carry quite a bit of the relevant information. Please extend this description.

3. Implementation
What are "computability opportunities and requirements for future research"?
In computer science, "computability" describes whether something can, in principle, be computed effectively (and is thus a measure of problem complexity) - I don't think this is what you are referring to here, is it?

When describing the ingest process, you do not talk about data quality issues. Do you assume that data is cleaned prior to being added? For instance, are you sure that the sample column "body length" in a specific data set contains values that follow exactly one definition of body length (e.g., the head-body length)?
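
The kind of check this question has in mind can be sketched as follows: group the ingested records by column label and flag any column whose values point to more than one definition. The record layout, the `def:` identifiers and the function `columns_with_mixed_definitions` are assumptions made for illustration, not TraitBank's ingest code.

```python
from collections import defaultdict

# Hypothetical ingested records: (column label, definition identifier, value in mm).
records = [
    ("body length", "def:head-body-length", 120.0),
    ("body length", "def:head-body-length", 131.5),
    ("body length", "def:total-length", 210.0),
    ("wing length", "def:wing-chord", 74.0),
]


def columns_with_mixed_definitions(rows):
    """Return {column: sorted definitions} for columns using more than one definition."""
    seen = defaultdict(set)
    for column, definition, _value in rows:
        seen[column].add(definition)
    return {col: sorted(defs) for col, defs in seen.items() if len(defs) > 1}


mixed = columns_with_mixed_definitions(records)
print(mixed)
```

A check like this only works if each value carries a definition identifier at ingest time, which is itself part of what the description of the ingest process should clarify.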

3.5. Do you have a way to deal with changes in the definitions?

4. Evaluation
You write that the results of your informal evaluations were mostly positive. While this is good, of course, it would be interesting to also get some information on which areas could do with some improvement and what that could be.