Darwin-SW: Darwin Core-based terms for expressing biodiversity data as RDF

Tracking #: 995-2206

Authors: 
Steve Baskauf
Campbell O. Webb

Responsible editor: 
Guest Editors Semantics for Biodiversity

Submission type: 
Ontology Description
Abstract: 
Darwin-SW (DSW) is an RDF vocabulary designed to complement the Biodiversity Information Standards (TDWG) Darwin Core Standard. DSW is based on a model derived from a community consensus about the relationships among the main Darwin Core classes. DSW creates a new class to accommodate important aspects of its model that are not currently part of Darwin Core: a class of Tokens, which are forms of evidence. DSW uses Web Ontology Language (OWL) to make assertions about the classes in its model and to define object properties that are used to link instances of those classes. A goal in the creation of DSW was to facilitate consistent markup of biodiversity data so that RDF graphs created by different providers could be easily merged. Accordingly, DSW provides a mechanism for testing whether its terms are being used in a manner consistent with its model. Two transitive object properties enable the creation of simple SPARQL queries that can be used to discover new information about linked resources whose metadata are generated by different providers. The Organism class enables semantic linking of biodiversity resources to vocabularies outside of TDWG that deal with observations and ecological phenomena.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 17/Feb/2015
Suggestion:
Accept
Review Comment:

This manuscript was submitted as 'Ontology Description' and should be reviewed along the following dimensions: (1) Quality and relevance of the described ontology (convincing evidence must be provided). (2) Illustration, clarity and readability of the describing paper, which shall convey to the reader the key aspects of the described ontology.

I have read the authors' Cover Letter accompanying the resubmission and re-read the revised version. I believe one challenge here is to make progress in a moving target context (taxa, taxon concepts are just one example), and that means also recognizing suitable boundaries for any present effort to become realized. The authors have achieved this, while signaling the need for continued testing, assessment and development of the Darwin-SW. My recommendation is to "Accept" as is.

Review #2
Anonymous submitted on 23/May/2015
Suggestion:
Minor Revision
Review Comment:

The paper details several applications of OWL reasoning whose structure is likely to be familiar to many readers of SWJ but not to the community of biodiversity occurrence data users whose awareness of RDF applications is largely vague about reasoning. To this extent, the work expands the potential of SWJ as a venue for that community to modernize its tools.
The use of "normalization" (and "denormalization") is sometimes confusing in the narrative. For example, Section 3.1 disavows that it is discussing "canonical normalized graph[s]" to aid the discussion of inconsistencies in data integration, but then seems to do exactly that, especially in the associated Appendix. In any case "the canonical normalized graph sense" is an obscure usage where it is introduced in Section 3.1. Is it meant to refer to something well-known, or is it rather meant to simply mean the outcomes of the procedures and queries described in the applications?

Review #3
By Hilmar Lapp submitted on 06/Sep/2015
Suggestion:
Major Revision
Review Comment:

Baskauf and Webb, Darwin-SW: Darwin Core-based terms for expressing biodiversity data as RDF

Summary
=======

The authors present Darwin-SW (DSW), an OWL ontology for biodiversity observation data. This is a highly welcome addition to the biodiversity informatics field. Darwin Core, the existing standard metadata vocabulary for biodiversity observation data, lacks well-defined machine-readable semantics, which has led to widely divergent interpretations and applications of its terminology and supposed data model, and has made it difficult to apply DwC in contexts in which strong semantics are desired. DSW has great potential to bridge that divide in a way that others in biodiversity informatics can get on board with.

However, the manuscript suffers from a variety of issues that should be addressed. These are detailed below in the order of occurrence in the text. From an overarching view, the narrative, especially as set up in the introduction, could be more coherent and consistent, in particular by leading with the core motivating use-cases and associated challenges faced by the authors. As is, the reader’s attention is often unnecessarily diverted to chronological accounts, which distracts from understanding the why - the objectives that drove the authors' modeling and design decisions. The manuscript also - entirely unnecessarily in my opinion - presents DSW as a community consensus or standard, which it is not (a discussion on a mailing list with only a handful of active participants doesn’t substitute for a community consensus or standards process). But none of DSW’s professed merits should be (nor, in my opinion, are) contingent on its community consensus status being the sole or best evidence. Along those lines, I would recommend that instead the authors present evidence from their own work to show the strengths and value of the ontology. Finally, I think some of the model description suffers from an unfortunate conflation of relational data models, and the strengths that these have, with ontology data models. Examples are in the details below.

Details
=====

1. Introduction

"RDF's use of triples as the basic unit of information removes ambiguity about the resource with which a property is associated” - How is it triples that accomplish this? Isn’t this instead the use of global unique identifiers (which RDF triples facilitate but do not require)?

"In many cases, URIs identify real resources in the wild, although example triples composed of those URIs are not necessarily asserted there.” - I don’t understand the part following the comma.

2.1. Darwin-SW model - Design considerations

"DSW sees class instances as nodes that group related properties rather than as entities that are heavily constrained ontologically (see sections 2.2.1 through 2.2.3 for specific examples). This approach differs significantly from that taken in the development of more formal ontologies such as the Biological Collections Ontology (BCO).” - I feel this is left far too ominous for this paper (which precisely presents the design of DSW) and the audience of the journal. I suggest the authors articulate from a more formal perspective what they mean by this, and why they chose this for their design principles.

"Similarly, although DSW uses terms from OWL in its definitions, it is not an ontology designed to enable extensive reasoning based on a hierarchical class structure. Nevertheless, the structure of DSW and properties assigned to its terms facilitate a number of simple but useful reasoning tasks which can be performed using SPARQL queries.” - I suggest the authors state more formally what reasoning expressivity they are aiming for. Ideally this can be an OWL profile. As described, it suggests subsumption and closure over transitive properties is the expressivity they need, and the QL profile (http://www.w3.org/TR/owl2-profiles/#OWL_2_QL) would seem like it could be fully sufficient, but it’s not clear whether this is true.

2.2. Darwin-SW model - Classes of the Darwin-SW model

The model as presented in Figure 1 is an ERD, and as such appropriate for a relational schema. An ontology is not a relational schema. (Although an abstract data model can be put in the form of both a relational schema and an ontology, see e.g. PROV: http://www.w3.org/TR/prov-overview/. However, the authors don’t describe their effort as an abstract data model that is presented in the form of a schema and an ontology.) Hence, there is clarification missing here for how concepts depicted in ERDs are meant to map to concepts in OWL, and what the caveats are for looking at the ERD for understanding the ontology. One example is the 1:1 relationships depicted in the ERD. Schema models inherently assume a closed world, whereas OWL assumes an open world (the absence of an assertion is not an assertion of absence). Hence, there is no simple equivalent to 1:1 relationships in an ontological model, save for combining an existential with a universal property restriction in a logical class definition, which the authors don’t do in the published ontology (and which can be tricky to understand in terms of inferences they do and don’t allow). The authors also (thankfully, I would say) don’t use functional and inverse functional properties to quasi-enforce those relationships (which in OWL, unlike constraints in SQL, don’t prohibit multiple relationships of the same kind but only result in instances being inferred as the same). So while the ERD suggests a rigorous compliance of data expressed in the model with certain constraints comparable to what one could expect from data validating against a relational schema, this can’t be expected to be the case for DSW data, and it should therefore be made clear that the ERD is perhaps more of a convention than an enforced model.
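To illustrate the point about functional properties, here is a minimal Python sketch of what such an axiom entails. It is not taken from the manuscript or the ontology: the `ex:` identifiers are invented, and `ex:atEvent` is a hypothetical property assumed to be declared functional purely for the example (DSW makes no such declaration). Where a SQL uniqueness constraint would reject the second row, the OWL-style rule simply concludes that the two objects are the same individual.

```python
# Hypothetical data: one occurrence linked to two event identifiers through
# a property that we ASSUME, for illustration only, is owl:FunctionalProperty.
TRIPLES = [
    ("ex:occ1", "ex:atEvent", "ex:eventA"),
    ("ex:occ1", "ex:atEvent", "ex:eventB"),
    ("ex:occ2", "ex:atEvent", "ex:eventC"),
]


def infer_same_as(triples, functional_property):
    """Return the owl:sameAs triples entailed by a functional property.

    If one subject has several distinct objects for the property, nothing is
    rejected; every pair of those objects is inferred to be the same thing.
    """
    objects_by_subject = {}
    for s, p, o in triples:
        if p == functional_property:
            objects_by_subject.setdefault(s, set()).add(o)

    inferred = set()
    for objects in objects_by_subject.values():
        ordered = sorted(objects)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                inferred.add((first, "owl:sameAs", second))
    return inferred


same_as = infer_same_as(TRIPLES, "ex:atEvent")
```

Here `same_as` holds a single inferred triple equating `ex:eventA` and `ex:eventB`, and no error is raised, which is why such axioms cannot stand in for the cardinality constraints an ERD suggests.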

I would relegate the reference to the historical mailing list discussion to the Acknowledgements (where the credit to Dr. Pyle is missing right now!), and use this space to address the above instead.

2.2.1. Organism

It seems that there are two aims that this section has, one describing the semantics of the class, and the second explaining how it came into being (its provenance). As is, the section intermingles these two, to the detriment of both and confusion of the reader. I suggest to fully disentangle these, and to start with motivating the class and its semantics from the perspective of the biological data that this is trying to model. Then follow this in a separate paragraph with a summary of the provenance.

To put it differently, the merits of the ontology are first and foremost in how the ontology fits its motivating use-cases. The provenance establishes useful context (and perhaps some validation), but a deficient model with excellent provenance is still a deficient and therefore likely not a useful model.

Aside from the entanglement, I think the authors miss an opportunity here. Obviously, their efforts in devising DSW in an open development and community discussion process have clearly helped at least one parent effort (DwC) to remedy its own shortcomings. Ideas developed in a derived effort making their way back into the parent effort is one of the ultimate validations of their merits, and there’s no reason the paper should be shy about this.

2.2.2. Occurrence, 2.2.3. Taxon Concepts

Same comments as for 2.2.1.

2.3 Object Properties in Darwin-SW

"A reasoner can infer the alternate linking triple whose predicate is the corresponding inverse property if the provider does not assert that triple directly.” - This depends on the expressivity of the reasoner. All OWL-DL and many other OWL reasoners will be able to do this (but for example not the highly scalable OWL-EL reasoners!), whereas many triple store-builtin RDFS reasoners will not. Hence, this statement should be qualified accordingly.

"Because the primary objective of DSW is to facilitate the linking of real data, these object properties serve primarily as a means to facilitate one-to-many or many-to-many relationships among instances of the main classes.” - I’m not clear on how this sentence helps with understanding. Object properties by definition link instances by relationships. And OWL object properties don’t directly correspond to 1:n or n:n relationships in relational schemas, so invoking such a correspondence sows confusion rather than bringing clarity. What are the authors trying to convey?

2.3.1. Properties linking to Agents

"When the Darwin Core RDF guide is ratified, these properties will be deprecated in favor of terms in the dwciri: (http://rs.tdwg.org/dwc/iri/) namespace as suggested in the guide.” - This seems to have happened or be happening already?

2.3.2. Properties linking to evidence

The description of the class dsw:Token seems to be at odds with its definition in the ontology. The ontology says "A form of evidence derived from a dwc:Organism”, which is more narrow than the description given here. Also, the ontology is explicit about asserting that a dsw:Token is never an event or an occurrence, which seems in conflict with the statement "or human or machine observation.”

More generally speaking, based on the description given in the text it’s not clear how one would determine whether a given instance can or cannot be an instance of Token, and hence what crucially distinguishes the class from owl:Thing. And finally, given its characterization as a "key innovation” I’m wondering why the class isn’t discussed in its own right under section 2.2.

3.2. Detecting inconsistencies using ranges, domains, and disjoint classes in Darwin-SW

The goal that the authors articulate has much merit, but using domain and range constraints together with disjointness axioms to accomplish it is a rather crude instrument. OWL reasoners aren’t like compilers - they don’t complete the full transitive closure and then emit a report of inconsistent relationships. Instead, almost all of them simply stop as soon as the ontology is found inconsistent. So, using this mechanism to “debug” a data merge will be very time consuming and arduous, because only one inconsistency will be revealed at a time. Also, there is zero flexibility for what a data aggregator might consider inconsistent use of properties and what they might not. In principle, the same goal that the authors have can be achieved by QC reports in the form of SPARQL queries, without requiring domain and range constraints with disjointness axioms. Such QC reports would report *all* inconsistencies at once, in a form that the culprits can be quickly pinpointed (which is not to be taken for granted when an OWL reasoner croaks over an inconsistent ontology). The authors may not be aware, but how this can be powerfully employed for even very large biological databases has been well described for Uniprot:
Bolleman et al (2012) Catching inconsistencies with the semantic web: a biocuration case study. SWAT4LS 2012 conference. http://ceur-ws.org/Vol-952/paper_2.pdf

This doesn’t mean that the DSW authors have to do it this way too. However, as written the text is misleading in the suggestion that to achieve the stated goal, the authors’ choice for how to achieve it is the only or even the best way, and at a minimum a discussion of the downsides, and a reference to other methods, and why those were not chosen, should be included.
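As a rough sketch of the QC-report alternative, the following Python fragment scans a set of triples and collects every questionable use of a property in one pass. Everything in it is an assumption made for illustration: the `ex:` data are invented, and the rule table `EXPECTED_SUBJECT` is a simplified stand-in for the aggregator's own policy, not a copy of the DSW axioms. In practice the same check would more likely be written as a SPARQL query against the merged graph.

```python
# Hypothetical rdf:type assertions from the merged data.
TYPES = {
    "ex:occ1": {"dwc:Occurrence"},
    "ex:org1": {"dwc:Organism"},
    "ex:img1": {"dsw:Token"},
}

# Assumed, simplified policy: the class the aggregator expects the subject
# of each property to have. This is NOT copied from the DSW ontology.
EXPECTED_SUBJECT = {
    "dsw:atEvent": "dwc:Occurrence",
    "dsw:hasEvidence": "dwc:Occurrence",
}

TRIPLES = [
    ("ex:occ1", "dsw:atEvent", "ex:evt1"),      # consistent with the policy
    ("ex:org1", "dsw:atEvent", "ex:evt1"),      # subject typed as Organism
    ("ex:img1", "dsw:hasEvidence", "ex:occ1"),  # subject typed as Token
    ("ex:new1", "dsw:atEvent", "ex:evt1"),      # untyped subject
]


def qc_report(triples, types, expected_subject):
    """Collect all triples whose subject has declared types, none of which
    is the class the policy expects for that property.

    Untyped subjects are not flagged, in keeping with the open-world
    assumption; an aggregator could choose to report them separately.
    """
    violations = []
    for s, p, o in triples:
        expected = expected_subject.get(p)
        declared = types.get(s, set())
        if expected is not None and declared and expected not in declared:
            violations.append((s, p, o, expected))
    return violations


report = qc_report(TRIPLES, TYPES, EXPECTED_SUBJECT)
```

Unlike a reasoner that halts at the first inconsistency, `report` lists both offending triples together with the expected class, and the policy table can be loosened or tightened without touching the ontology.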

3.3.1. Linking duplicates

"If the specimen were collected from the same Organism, but at a different time or location, Provider 3 could indicate that by asserting only:
provider3:organism3 owl:sameAs provider1:organism1."

Provider 3 could do this, but the indication would be by (likely to be obscure) convention. RDF and OWL both have an open-world assumption, and hence the absence of an owl:sameAs assertion in no way implies that the two individuals are not the same. To state that they differ, one should assert an owl:differentFrom relationship between the individuals.

The Turtle examples contain assertions that are already entailed by the domain and range constraints asserted by DSW. I suggest removing these because they only add clutter:
provider1:occ1 a dwc:Occurrence. (example 1)
provider3:occ3 a dwc:Occurrence. (example 3)

3.3.2. Discovering new derived resources and modified metadata

The authors rightfully allude to the importance of documenting provenance information. Unfortunately, however, they make no reference to the W3C recommendation for this, the PROV ontology, which includes the property wasDerivedFrom:
http://www.w3.org/TR/prov-o/#wasDerivedFrom

I’d recommend the authors at least assert dsw:derivedFrom as a sub-property of prov:wasDerivedFrom.

Perhaps this may better be added to section 2.3.2?

"This creates a powerful tool for querying because once a reasoner has constructed triples for all entailed dsw:derivedFrom and dsw:hasDerivative relationships, it becomes a simple matter to conduct queries that apply to all derivatives of a particular Organism.” - In principle this is true. However, as described here and the following example it requires and thus relies on unspecified special tooling (running a reasoner that materialized the inferred transitive closure in the triple store). In contrast, using SPARQL 1.1 property paths, only standard query syntax and easily described tooling (namely, a SPARQL 1.1-supporting triple store) is needed. The latter have become more commonly available, and I recommend the authors rewrite this section and the examples to simply rely on standard SPARQL 1.1 property paths.
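To make the contrast concrete, the sketch below mimics in plain Python what a SPARQL 1.1 property path such as `?derivative dsw:derivedFrom+ ?organism` does at query time: it follows the asserted links one or more steps without any inferred triples having been materialized beforehand. The `ex:` resources are invented for illustration, and the function is only an analogue of the path semantics, not code from the manuscript or from any triple store.

```python
from collections import deque

# Hypothetical asserted triples only; no reasoner has added anything.
TRIPLES = [
    ("ex:specimen1", "dsw:derivedFrom", "ex:organism1"),
    ("ex:image1", "dsw:derivedFrom", "ex:specimen1"),
    ("ex:sequence1", "dsw:derivedFrom", "ex:specimen1"),
    ("ex:image2", "dsw:derivedFrom", "ex:organism2"),
]


def derivatives_of(resource, triples, predicate="dsw:derivedFrom"):
    """Return every subject that reaches `resource` by following
    `predicate` one or more times (the analogue of `predicate+`)."""
    # Index: object -> subjects directly derived from it.
    direct = {}
    for s, p, o in triples:
        if p == predicate:
            direct.setdefault(o, set()).add(s)

    found = set()
    queue = deque([resource])
    while queue:
        current = queue.popleft()
        for subject in direct.get(current, ()):
            if subject not in found:  # also guards against cycles
                found.add(subject)
                queue.append(subject)
    return found


all_derivatives = derivatives_of("ex:organism1", TRIPLES)
```

The result contains the specimen as well as the image and sequence derived from it, although only the single-step links were asserted. With a SPARQL 1.1 store, the same effect is obtained by the path operator alone.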

4. Linking out beyond the Darwin Core data domain

"from the PATO ontology” - The proper citation for PATO is Gkoutos, Georgios V, Eain C J Green, Ann-Marie Mallon, John M Hancock, and Duncan Davidson. 2005. “Using Ontologies to Describe Mouse Phenotypes.” Genome Biology 6 (1): R8.

"Similarly, because Darwin-SW establishes a class for Organisms, it facilitates the documentation of interactions among organisms, such as predation, parasitism, and mutualism.” - This statement should probably be adapted to the fact that this class is now a part of Darwin Core?

Comments on the ontology aside from those made above:
------------------------------------------------------------------

- hasEvidence / evidenceFor: obviously these properties are very useful for the purposes the authors describe. Unfortunately, the authors define these from scratch. I recommend that for the sake of retaining interoperability with the RO, which is in wide use in biology, these properties are asserted as sub-properties of “has evidence” and “is evidence for”, respectively, in RO:
http://purl.obolibrary.org/obo/RO_0002558
http://purl.obolibrary.org/obo/RO_0002472

- A number of classes include disjointness axioms with a class declared as deprecated, most prominently with tc:TaxonConcept. I suggest to move these axioms to the deprecated class so as to keep the currently valid classes unpolluted from references to deprecated classes (and deprecated, and therefore at some point likely no longer available ontologies).

- Why did the authors choose the slash URI pattern for their terms rather than fragment identifiers? The slash URIs don’t seem to provide any tangible benefit here - dereferencing any class or property identifier retrieves the entire ontology - but make it more difficult for consuming clients to optimize retrieval (because they cannot tell from the URIs that the document they are about to get has already been retrieved). Perhaps the authors were following some example such as Dublin Core? Whatever the reason, it should be stated somewhere.

