A Logical Model for Taxonomic Concepts for Expanding Knowledge using Linked Open Data

Tracking #: 637-1847

Authors: 
Rathachai Chawuthai
Hideaki Takeda
Vilas Wuwongse
Utsugi Jinbo

Responsible editor: 
Guest Editors Semantics for Biodiversity

Submission type: 
Full Paper
Abstract: 
The wide variety of classification systems and new discoveries by taxonomists have led to an increase in the diversity of biological information, especially taxon concepts. However, associations between taxon concepts across research institutes are very difficult to establish because there is no one single interpretation of a taxon concept’s name. Owing to this difficulty, further integration of the increase in biological knowledge is proving very complicated when it deals with multiple sources of data or depends upon different taxon concepts. This research aims to develop a framework for linking multiple related taxon concepts and their evolutionary relationship across research repositories, and to preserve the background knowledge to their changes. To achieve these goals, we propose a logical model for taxon concepts in the Resource Description Framework (RDF). In this study, we implement a prototype to demonstrate the feasibility and the performance of our approach. The results of our study show that our model can publish taxon information as Linked Data with additional benefits from the Linked Open Data Cloud.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Jouni Tuominen submitted on 28/May/2014
Suggestion:
Major Revision
Review Comment:

The paper presents an ontology model for representing changes in taxonomic concepts and linking the related temporal taxonomic concepts together. The motivation for this is to facilitate integration of taxon-related data across heterogeneous repositories. Such data contains multiple, differing names for a taxon, and the related taxonomic concepts are usually not linked to each other. Taxonomic information represented using the model helps users correctly interpret the relations between taxonomic concepts in various datasets. The model is an extension of the authors' previous work on modeling changes in digital archives (CKA) and it utilizes taxonomic terms from their work on publishing biodiversity data as Linked Data (LODAC). To test the model, the authors have created prototype software for inputting the changes into the system and examining them via a browser user interface. The paper also reports a (smallish) performance evaluation of the model w.r.t. the response times of SPARQL queries on example data.

Originality:
Though based on the authors' previous, more general work and having similarities to other approaches to modeling taxonomies as ontologies, the paper makes a contribution by providing a practical and usable framework for managing changes in taxonomic concepts. The model's support for chaining change events together (cause and effect) and rules for inferring links between concepts based on changes are novel approaches.

Significance of the results:
The model seems to be capable of handling the basic change operations in taxonomy reasonably well, and it appears to be a nice, usable solution for linking related taxonomic concepts to each other. However, there are some potential issues in the model/paper that need to be clarified, especially distinguishing between a change in a taxonomic concept and a change in a scientific name (see the major remarks concerning page 9 below).

The related work section (in Introduction) is quite brief and should be extended heavily, as many relevant references are missing.

The evaluation reported in the paper basically tests the scalability of the SPARQL engine / triple store in terms of the number of triples. It may be possible to conclude from this that the complexity of the model (number of triples) is not an issue for current SPARQL engines. It does not, however, evaluate the model in terms of its usability, added benefits, etc. The outcome of the discussions with the experts using taxonomic information in their research confirms that there is a genuine need for this kind of work. However, a more thorough and formal study would be needed for a proper evaluation of the model.

Quality of writing:
The language is mostly understandable, though a bit obscure here and there. The readability of the paper would strongly benefit from proof-reading by a native English speaker.

Major remarks
---------------------------------

Abstract (also in other sections) - The term "evolutionary relationship" is a bit dangerous and ambiguous in this context, as the term "evolution" has a specific meaning in biology. To my understanding, in this paper "evolutionary" does not refer to evolution but rather to changes in the scientific understanding of how taxa are defined (circumscription, position in taxonomy, rank). It would be helpful if this could be clarified.

Page 2, paragraph 4 - In the discussion of the TaxMeOn model: "However, the model does not support the view that an underlying knowledge of the changes is required for the correct interpretation of taxon concepts." - This claim is a bit vague and not entirely correct. TaxMeOn supports modeling the changes of taxonomic concepts (e.g., split, lump, change in classification, change in circumscription). The taxonomic concepts before and after the change event are linked to the change event instance with relations taxmeon:before and taxmeon:after. However, there is no support for linking changes together (cause and effect), though it would be possible by introducing a single new property. Please clarify this.

Figure 1 - What about a change in circumscription? Is it somehow included in other change types or is it not covered here at all?

1. Introduction, related work - Consider adding the following references:

Berendsohn WG: A taxonomic information model for botanical databases: the IOPI Model. Taxon 1997, 46:283-309.

Page RDM: Taxonomic names, metadata, and the Semantic Web. Biodiversity Informatics 2006, 3:1-15.

Jones AC, White RJ, Orme ER: Identifying and relating biological concepts in the Catalogue of Life. Journal of Biomedical Semantics 2011, 2:7.

Sarkar IN: Biodiversity informatics: organizing and linking information across the spectrum of life. Briefings in Bioinformatics 2007, 8(5):347-357.

Schulz S, Stenzhorn H, Boeker M: The ontology of biological taxa. Bioinformatics 2008, 24(13):i313-i321.

Kennedy J, Kukla R, Paterson T: Scientific Names Are Ambiguous as Identifiers for Biological Taxa: Their Context and Definition Are Required for Accurate Data Integration. In Proceedings of the 2nd International Conference on Data Integration in the Life Sciences (DILS): 20–22 July 2005; San Diego, California. Edited by Ludascher B, Raschid L, Springer-Verlag 2005:80-95.

Also, the relevant TDWG and GBIF standards should be referenced properly (currently just one property from Darwin Core is mentioned).

Pages 5-6 - Related to the discussion of the rule for linking taxon concepts (e.g., after a merge), I was wondering if a specific time point should be given as input to the function (as in the next case - a change in relationship) - or you could mention that the triples should be filtered to contain only the changes relevant to the specific time point. This way, the user would get the taxonomic information relevant in a specific time point. Consider, e.g., the example case of merging Icterus galbula and Icterus bullockii into I. galbula and again splitting it into I. galbula and I. bullockii - after executing the rules for this data, the user gets triples representing the merge and split, but is not able to examine the situation (is it merged or split) in a specific time.
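
The filtering suggested above could look roughly like the following minimal sketch. It is not the paper's rule set: the `Change` record, the `concepts_at` helper, the concept identifiers and the dates are all hypothetical placeholders, chosen only to mirror the merge-then-split case of the two orioles.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Change:
    kind: str          # "merge" or "split"
    when: date         # date the change takes effect (placeholder values below)
    before: frozenset  # concepts that stop being current
    after: frozenset   # concepts that become current


def concepts_at(initial, changes, time_point):
    """Replay only the changes dated on or before time_point."""
    current = set(initial)
    for chg in sorted(changes, key=lambda c: c.when):
        if chg.when > time_point:
            break
        current -= chg.before
        current |= chg.after
    return current


# Hypothetical concept identifiers and placeholder dates, for illustration only.
INITIAL = {"I_galbula_v1", "I_bullockii_v1"}
CHANGES = [
    Change("merge", date(2000, 1, 1),
           frozenset({"I_galbula_v1", "I_bullockii_v1"}),
           frozenset({"I_galbula_v2"})),
    Change("split", date(2010, 1, 1),
           frozenset({"I_galbula_v2"}),
           frozenset({"I_galbula_v3", "I_bullockii_v3"})),
]

print(sorted(concepts_at(INITIAL, CHANGES, date(2005, 6, 1))))
```

With a time point as input, the user sees either the merged or the split state, rather than both sets of triples at once.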

Page 9, RDF listing 1 - Why is the change of the genus of species:Nyctea_scandiaca represented as ltk:TaxonReplacement and ltk:HigherTaxonAddition, and not as ltk:HigherTaxonChange as in the similar case in Fig. 4? Is it because the species name also changes (scandiaca -> scandiacus)? In general, if only the name of a species changes, there is no need to create a new taxon concept, because the concept itself (circumscription) hasn't changed.

Page 9, paragraph 4: "Moreover, LTK provides more operations that describes the attributes of a taxon concept such as dwc:scientificName" - As dwc:scientificName should contain the full scientific name (e.g., binomial name for species) according to the spec, how do you handle the representation of a changed name? According to Fig. 4, the URI of a concept stays the same when the genus of a species changes; the species URI must then have both the new and the old name as dwc:scientificName. How does the model then keep track of which name was valid at a certain time? (I assume that the value of dwc:scientificName is a literal.)

4.1 Performance Analysis - Did you consider testing execution of multiple parallel queries to simulate multi-user scenario? Is a single data point in the graph (Fig. 11) produced by a single query execution or did you calculate a mean value from multiple query executions?

Page 15, paragraph 1: "we implemented a prototype that utilizes the proposed model in order to publish the taxonomic information to LOD Cloud" - I did not see much discussion about Linked Data publishing, e.g., about dereferenceable URIs, in the context of your prototype (apart from mentioning the SPARQL endpoint). It seems that the taxon URIs mentioned in the paper are dereferenceable, but they lead to an older service (LODAC) which does not contain the information about taxonomic changes in the way it is described in this paper.

Page 15, paragraph 1: "The result of our prototype demonstrates that our approach is feasible and suitable for satisfying the need to link the large amount of taxonomic data across repositories in order to discover a broader knowledge of biology." - This is a rather bold statement given the preliminary evaluation of the model and the fact that the model hasn't yet really been used to link data across repositories (or at least this is not reported here). The claim should be relaxed.

Appendix, ltk:SynonymLink (Example result) - Though the symmetry of the property ltk:synonym might be justified in zoology, in botany it certainly is not. In botany a synonym is a name that is not correct for the taxon, i.e., it is a synonym of a correct (valid) scientific name. The valid name is not a synonym for the incorrect name.

Minor remarks
---------------------------------

Page 2, paragraph 2: "For example, the Baltimore oriole (Icerus galbula Linnaeus, 1758) and the Bullock’s oriole (I. bullockii Swainson, 1827)." - The sentence is missing a predicate.

Page 2, paragraph 2 (2 times): "I. gulbula" -> "I. galbula"

Page 2, paragraph 2: "with time" -> "over time"

Page 2, paragraph 4: "a semantic web" -> "semantic web"

Page 3, paragraph 1: "SKOS [16] vocabularies" - If this refers to the properties of SKOS model (and not to vocabularies modeled using SKOS), singular "vocabulary" should be used.

Page 3, paragraph 4: "a change from a taxon concept" -> "a change in a taxon concept"

Page 3, paragraph 4: "Kempf" vs. "Kampf" - Check which one is correct and use it consistently.

Page 4, paragraph 2: "Flouris's theory" -> "Flouris' theory"

Page 4, paragraph 2: "Flouris's theory" - Add a reference to the theory.

Page 4, paragraph 3: "we formally propose a model" -> "we propose a formal model"

Page 5, paragraph 1: "tl:beginAtDateTime" - Namespace prefixes should be introduced when (or before) they are used for the first time. The prefix "tl" is not introduced until Table 1 in subsection 2.4.

Page 5, paragraph 2 (also in other sections): "dynamic description" and "static RDF statements" - I understand the point here (after reading further), but the terms "dynamic" and "static" are a bit vague in this context. Maybe this could be clarified somehow?

Figure 4 - What is the rdf:type of ex:theChange1? Could this be added to the figure?

Page 6 - If the notation p(c1,c2) means a triple <c1, p, c2>, then the following corrections should be made:
"subClassOf(ConceptEvolution,?opr)" -> "subClassOf(?opr,ConceptEvolution)"
"type(?opr,?chg)" -> "type(?chg,?opr)" (3 times)
"subClassOf(RelationshipEvolution,?opr)" -> "subClassOf(?opr,RelationshipEvolution)" (2 times)

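The effect of the argument order can be seen in a minimal sketch, assuming that p(c1,c2) is read as the triple (c1, p, c2). The toy graph, the instance ex:chg1, the helper names, and the assumption that ltk:TaxonMerger is a subclass of ConceptEvolution are illustrative only and are not taken from the paper's listings.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Toy graph of (subject, predicate, object) triples; ex:chg1 is hypothetical.
GRAPH = {
    ("ltk:TaxonMerger", SUBCLASS, "ConceptEvolution"),
    ("ex:chg1", RDF_TYPE, "ltk:TaxonMerger"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples that agree with every position that is not None."""
    return [t for t in graph
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def changes_corrected(graph):
    # subClassOf(?opr, ConceptEvolution), type(?chg, ?opr)
    oprs = {s for s, _, _ in match(graph, p=SUBCLASS, o="ConceptEvolution")}
    return {s for s, _, o in match(graph, p=RDF_TYPE) if o in oprs}


def changes_as_printed(graph):
    # subClassOf(ConceptEvolution, ?opr), type(?opr, ?chg)
    oprs = {o for _, _, o in match(graph, s="ConceptEvolution", p=SUBCLASS)}
    return {o for s, _, o in match(graph, p=RDF_TYPE) if s in oprs}


print(changes_corrected(GRAPH), changes_as_printed(GRAPH))
```

Under this reading, only the corrected argument order binds ?chg to the change instance; the order as printed matches nothing in the toy graph.
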
Page 6, paragraph 2: "In addition to the rule for linking taxon concepts, we also introduce a rule to transform the dynamic information into a list of static triples." - The rule for linking taxon concepts also transforms the dynamic information into a list of static triples (if I understand correctly), so this should be clarified. You could add something like "[dynamic information] of a change in relationship..." because the rules presented after the sentence are related to changes in relationships.

Page 6, paragraph 2: "Before executing the following rule, it is necessary to filter only some changes so that the input time point exists within its time range." - It is not clear what "its" refers to.

Page 6, paragraph 3: "For changes appearing before the specific time point, a relationship between a subject and an object after the change did not exist." - This sentence is hard to understand, please clarify.

Page 7, paragraph 2: "the RDF statement below" -> "the RDF statements below"

Page 7, paragraph 3: "relationships between genus:Columba and its allies" - What do you mean by allies? Please clarify.

Page 7, paragraph 4: "particular proposes" -> "particular purposes"

Page 7, paragraph 4: "are descend from" -> "are descended from"

Table 2 - "skos:relatedMatch" as superproperty of ltk:mergedInto, would "skos:broadMatch" be more accurate?

Table 2 - "skos:relatedMatch" as superproperty of ltk:splitInto, would "skos:narrowMatch" be more accurate?

Page 9, RDF listing 1 - The change type "ltk:HigherTaxonAddition" and property "cka:detail" should be introduced in the text before using them in the RDF example.

Page 9, RDF listing 2: "genus:Bubo ltk:majorMergedInto genus:Bubo_1999 ." - I don't see how this triple can be inferred from the original RDF statements describing the changes. Either change the property to ltk:mergedInto or add this "major" information to the original RDF statements.

Page 9, paragraph 4: "ltk:higerTaxon" -> "ltk:higherTaxon"

Page 9, paragraph 4: "operations that describes" -> "operations that describe"

Page 9, paragraph 5: "the formal model is described by the temporal change in taxonomic knowledge, and rules for executing the dynamic descriptions." - There is something weird in this sentence, maybe the "described by" should be changed to some more suitable verb.

Page 9, paragraph 5: "for specific purpose" -> "for a specific purpose"

Page 10, paragraph 4 (also on page 13): "XSD:DateTime" -> "xsd:dateTime"

Page 10, paragraph 5: "business layer" -> "business logic layer"

Page 12, paragraph 2: "then assign a concept" -> "then assigning a concept"

Page 12, paragraph 3 - In URL "http://rc.lodac.nii.ac.jp/ltk/concept.php?conept=http://lod.ac/species/B... -01-01T00:00:00Z", "conept" -> "concept", and remove the space after "1998"

Figure 10 - The instance of ltk:ReplaceTaxonConcept is both in the "Detail of change" and "Caused by" sections. Is this appropriate?

Page 13, paragraph 1 - In URL "http://rc.lodac.nii.ac.jp/ltk/concept.php?conept=http://lod.ac/species/B...", "conept" -> "concept"

Page 13, paragraph 2 - In URL "http://rc.lodac.nii.ac.jp/ltk-service/context/?concept=[taxonconcept]&ti...", "time" -> "date", and remove the slash "/" after "context" (otherwise the server replies HTTP 404)

Page 13, paragraph 2 - In URL "http://rc.lodac.nii.ac.jp/ltk-service/reason/?concept1=[taxonconcept1]&c...", remove the slash "/" after "reason" (otherwise the server replies HTTP 404)

Page 13, paragraph 2 - I tested the web service "reason" to get the background knowledge of the change of two concepts, but I couldn't get any sensible responses. The service always returns the same fixed set of triples regardless of the values of parameters concept1 and concept2. Please check that the service works as intended.

Appendix, ltk:TaxonMerger - In "ex:mb1 ltk:majorMergedInto ex:af1.", "ex:mb1" -> "ex:mb0", and add space after "ex:af1"

Appendix, ltk:TaxonMerger (also in ltk:TaxonSplitter) - In "ex:mb0 skos:closedMatch ex:af1.", "skos:closedMatch" -> "skos:closeMatch", and add space after "ex:af1"

Appendix, ltk:TaxonSplitter - In "cka:majorConceptBefore ex:ma0 ;", "cka:majorConceptBefore" -> "cka:majorConceptAfter"

Appendix, ltk:ChangeHIgherTaxon - In "ex:p2 skos:narrowTransitive ex:c1 .", "skos:narrowTransitive" -> "skos:narrowerTransitive"

References: "[7] Health T" -> "[7] Heath T"

URLs should not be hyphenated (as some of them are in the text and in References), please correct those.

Review #2
Anonymous submitted on 22/Jun/2014
Suggestion:
Minor Revision
Review Comment:

Chawuthai and co-authors report on a novel conceptual framework and software application that can represent taxonomic name and concept changes in RDF, and thus be integrated into the Linked Open Data Cloud. This research is novel and significant because it tackles a fundamental challenge of designating "identity" to units that require representation in the biological domain, where the names assigned to these units ("taxa") are not precise or reliable enough across treatments and environments. Of particular importance is the introduction of a temporal dimension and cause/effect relationships that characterize name/concept changes.

I regard this contribution as exploratory and promising, which is not the same as saying that the system is comprehensive and ready to handle all representation challenges.

Both the Introduction and Discussion lack reference to relevant preceding works on taxonomic change representation with ontologies and logic reasoning. These omissions should be remedied; more citations of prior and related work are needed.

Moreover, the model for taxonomic change may be too simple. In addition to merging (multiple concepts joined into a larger, single one) and splitting (inverse scenario), there can be overlapping relationships among concepts. These overlapping relationships could be broken down temporally, I suppose, and thus further divided into finer merge/split/replace events. But in practice they occur "at once". For example, an early taxonomy (T1) assigns four species concepts to a genus concept. Then a later taxonomy (T2) removes two of those earlier species concepts from the corresponding genus concept, while also adding two species concepts that were not mentioned in the earlier taxonomy. The resulting relationship among the genus-level concepts at T1 vs. T2 is overlap. How can this be represented with the current RDF/LOD framework? Thus, even though the authors' examples have realism, they remain fairly simple, and possibly many larger real-life use cases will include situations in which overlap must be represented.
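
To make the overlap case concrete, here is a minimal sketch that compares two genus-level concepts by the sets of species concepts assigned to them. Treating a concept purely as its set of members is a simplifying assumption, and the `relate` helper, its relation labels, and the species identifiers are hypothetical; none of this is part of the manuscript's model.

```python
def relate(a, b):
    """Classify how concept a relates to concept b by their member sets."""
    a, b = set(a), set(b)
    if a == b:
        return "congruent"
    if not (a & b):
        return "disjoint"
    if a < b:
        return "is included in"
    if a > b:
        return "includes"
    return "overlaps"


# Hypothetical species concepts assigned to the same genus in two taxonomies.
GENUS_T1 = {"sp1", "sp2", "sp3", "sp4"}
GENUS_T2 = {"sp1", "sp2", "sp5", "sp6"}  # two removed, two added

print(relate(GENUS_T1, GENUS_T2))
```

A pure merge or split corresponds to the "includes" and "is included in" outcomes; the T1 vs. T2 example above falls into the remaining "overlaps" case, which is the one that would need an explicit representation.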

That said I found most sections of the manuscript to be well written and flowing naturally. The prototype application is serviceable.

The authors are maybe too uncritical in their Discussion - are there no scalability issues at all? How will users be engaged? How can large bodies of data be supplied to this system? What options for editing will exist?

In summary, I would suggest that a revised version of the manuscript include: (1) more adequate references to related work, (2) a more scrutinizing evaluation of the framework's conceptual limitations (at this point, and not meaning to take away the considerable strengths), and (3) a more critical discussion of challenges for wider adoption and integration.