Presenting and Preserving the Change in Taxonomic Knowledge for Linked Data

Tracking #: 1026-2237

Rathachai Chawuthai
Hideaki Takeda
Vilas Wuwongse
Utsugi Jinbo

Responsible editor: 
Guest Editors Semantics for Biodiversity

Submission type: 
Full Paper
Taxonomic knowledge assigns scientific names to living organisms and is therefore indispensable for understanding biodiversity. However, differing perspectives on classifying organisms and changes in taxonomic knowledge have led to inconsistent classification information across databases and repositories. A precise understanding of taxonomy requires integrating relevant data across taxonomic databases, which is difficult because taxon interpretation is ambiguous. Most earlier research employed Linked Open Data (LOD) techniques to link taxa across taxonomic transitions. However, it overlooked the temporal representation of taxa and the knowledge underlying changes in taxonomy, so users find it difficult to understand how some taxon identifiers are linked. This research therefore aims to develop a model for presenting and preserving changes in taxonomic knowledge in the Resource Description Framework (RDF). Specifically, the proposed model links Internet resources that represent taxa, presents historical information about taxa, and preserves background knowledge about changes in taxonomic knowledge, in order to give a better understanding of organisms. We implement a prototype to demonstrate the feasibility and performance of our approach. The results show that the proposed model can handle various practical cases of change in taxonomic work and provides open and accurate access to linked data for biodiversity.

Minor Revision

Solicited Reviews:
Review #1
Anonymous submitted on 06/Apr/2015
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The authors have crafted a substantive revision that addresses the majority of points raised in my initial review. Many references and much new information have been added. In several instances the term "living organisms" appears where "extant and extinct taxa" is meant. Organismal *group* could be used, but "organism" seems misleading in this context of representing taxonomies and taxonomic changes. Other than that, I regard the manuscript as suitable for publication. It represents an innovative, rather comprehensively articulated perspective on this issue that is worth publishing and should receive a good amount of recognition and use.

Review #2
By Jouni Tuominen submitted on 24/Apr/2015
Review Comment:

The revised manuscript addresses all the issues raised in my previous review. The authors have improved the manuscript and extended the taxonomic model satisfactorily, providing a relevant contribution to its field.

The readability of the paper would strongly benefit from proof-reading by a native English speaker.

Below are some minor remarks for further enhancement of the quality of the paper.

Minor remarks

Page 3, paragraph 4: "likability" -> "linkability"

Page 3, paragraph 4 - If you cite a reference using author name(s) and the reference has more than one author, list them all or use "et al.": "Jones" -> "Jones et al.", "Schulz" -> "Schulz et al.", "Flouris" -> "Flouris and Meghini" (page 6)

Page 3-4 - Split long text paragraphs to improve readability.

Page 4, paragraph 1: "TaxMeOn [6] used human-readable URIs for taxonomic checklists and local identifiers." - TaxMeOn does not specify the format of the URIs used for data instances (taxa, scientific names, etc.). The URIs of the TaxMeOn schema itself (classes, properties) are human-readable, which is the standard practice in RDF schemas.

Page 4, paragraph 1: "gab" -> "gap"

Page 6, Simple Nomial Entity: "In this research, when taxonomy is accepted in a given timeframe, it is considered as a taxon concept, otherwise it is viewed as a name." - This sentence is hard to understand, please clarify. Should "taxonomy" -> "taxon"?

Page 7, RDF listing - Why do you use dct:identifier for referring to uBio LSID? Doesn't the LSID identify the uBio's conception of the taxon and not your dataset's? If so, why not use owl:sameAs as with links to other external datasets (GBIF, LODAC)? (LSID is also a URI, though not an HTTP URI.)

Page 7, symbol definitions: "(tax) is an instance of a taxon concept" - I didn't notice the use of this symbol in any of the figures. If this is the case, the definition should be removed.

Page 9, paragraph 1 & Fig. 4 - The text states: "at time t2, Buidae is merged into Audiae", but in Fig. 4 (if I interpret it correctly) the merge happens at time t1. If I've understood the model correctly, an event doesn't have an end time if the outcome of the event (its changes) is still valid. So is the end time of the event relevant in this scenario?

Page 12, subsection 3.6.1: "Reusing CKA Framwork" -> "Reusing CKA Framework"

Page 12, subsection 3.6.1 - The text of this subsection could be moved to subsection 3.6.5, as they discuss the same topic.

Page 13, RDF listing - Remove the empty line from the definition of ex:event1999 (line 6).

Page 13, RDF listing: "cka:cause" -> "cka:effect"

Page 14, paragraph 2: "skos:closeMatch" -> "skos:exactMatch"

Page 18, paragraph 1 - In URL format "http://[ltk_domain]/ltk-service/context?concept=[concept]&time=[time_point]", "time" -> "date"

Page 18, subsection 5.1: "Caligula boisduvalii falax" - Should "falax" -> "fallax"?

Page 18, subsection 5.1: "its two subspecies boisduvalii and jonasii were raised into two distinct genus" - Boisduvalii is not a subspecies, do you mean fallax?

Page 18, subsection 5.1: "its two subspecies boisduvalii and jonasii were raised into two distinct genus" - "genus" -> "species"

Page 19, section 6: "We discuss the values of our approach from four perspectives: knowledge representation, user engagement, system integration, and limitation." - "Limitation" as in "limitation of our approach" seems like a negative thing, maybe the expression could be changed to some positive/neutral one.

Page 21, subsection 6.1.4: "and the second part contains the event-centric model" - "second" -> "third"
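
To illustrate the Page 18 remark on the URL format, the following is a minimal Python sketch of building the context URL with `date` as the query parameter name. It assumes the remark concerns the parameter name rather than the placeholder; the function name `build_context_url`, the host `example.org`, and the sample concept and date values are hypothetical and are not taken from the manuscript.

```python
from urllib.parse import urlencode


def build_context_url(ltk_domain: str, concept: str, date: str) -> str:
    """Build a context-service URL that uses `date` instead of `time`.

    Hypothetical helper: the path follows the format quoted in the remark,
    but the accepted value formats for `concept` and `date` are assumptions.
    """
    # urlencode escapes reserved characters (a space becomes "+").
    query = urlencode({"concept": concept, "date": date})
    return f"http://{ltk_domain}/ltk-service/context?{query}"


if __name__ == "__main__":
    # Hypothetical host and values, for illustration only.
    print(build_context_url("example.org", "Caligula jonasii", "2000-01-01"))
```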

Review #3
By Anne Thessen submitted on 30/Apr/2015
Major Revision
Review Comment:

I really liked this paper. I think what the authors have contributed is important and fills a need. It appears to do what it was designed to do. I would say that the originality and the significance are both very high. While the actual work is sound, the English needs a lot of proofreading. That is the only reason I have chosen "major revision" instead of "minor revision". I visited LTK at the URL in the paper and it looks good.

I just have a few comments.
1. I think it is a little strange that all datetime stamps are Jan 1. That's not really how it works because it takes time for people to read a publication and then to change the way they annotate samples. Some people may disagree for a while. Acceptance happens slowly, so there is likely to be a transition period. Data collected during the transition period would be hard to automatically assign to one concept or another. It's difficult to have a hard and fast date by which a concept starts or stops. Just listing the year might be a more "honest" way of portraying the event time. Listing the day makes it more precise than it really is.
2. Taxonomists seem to always disagree. How can this model show alternate opinions?
3. I'm skeptical that users will learn RDF. I can appreciate that user experience is next on the list. That's fine. Perhaps a good way to get folks started is to batch upload some existing resources, like the synonyms in the Catalogue of Life or some other source? That might give people something to look at and modify instead of starting from scratch.

Review #4
Anonymous submitted on 25/Sep/2017
Minor Revision
Review Comment:

I appreciate the time the authors have spent revising their paper, and I am satisfied with the outcome of the first round of review. However, I have a few more comments that are somewhat related to one of my previous comments about the "novelty of quality metrics". I agree with the authors' answer, but after reading the paper and checking all formulas I think there is room for further improvement. Below I provide more detail on my additional review of the paper:

section 2
*there are a few more works to be included. You should highlight the similarities and differences with respect to your work; this is fundamental in such a study:
Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO
A comprehensive quality model for Linked Data
Literally Better: Analyzing and Improving the Quality of Literals

section 5
the main comment about the quality metrics is that they need to be better formalised. On reading this section, a new concept and a new symbol are introduced at every turn. Since this section is long and provides the main contribution of this work, it should be very easy to read and understand. My suggestion is to move most of the terminology into a separate section where you define, for instance, a triple, an entity, and other terms, and to summarise them in a table together with the symbols you are going to use in section 5. Now I will go into detail and analyse each metric:

*RC1: it is not clear from the definition what you mean by a data level-constant. It seems to me that the definition assumes the subject and the object can only be of type Uri, while in the definition of a triple a subject can be of two types (Uri or BN) and the object can be of three types (Uri, BN or Literal).
*IO1 and IN3: you are using symbols with an overbar, and I don't understand why you sometimes use symbols like this and sometimes not. Do you consider them vectors?
*V1: is the void vocabulary the only one that can express the format? What about dcat (an alternative to void)? What is the range of the property void:feature?
*V2: so far you have referred to triples, and now there is a symbol t and then t.o. What is the meaning of t.o? You should also be consistent about which formalization language you are using; if it is first-order logic, then keep it from the beginning to the end.
*P1: why should each resource have a dc:creator or a dc:publisher? I think this belongs at the dataset level, and only the resource referring to the dataset should have this information. Not clear at all.
*P2 and U1: what is the difference between the set of entities and a set of distinct subject URIs? You should clarify this in the background section. You should also clarify P2 further. What is the weighted value of the entity? You are saying that for each entity you are giving a weight of 0.5.
*U1: desc.(s,p,o): what does it mean? Is the dot a typo, or does it have a meaning in this formalization?
*CS1: types(r) should be defined in the background/preliminaries section. Triples are sometimes expressed as SxPxO and sometimes as SPO.
*CS3: now you are using {t \in D within the set, whereas so far you have just used {t in the set. What is the difference?
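
The RC1 remark about term types can be made concrete with a small Python sketch. It assumes the standard RDF rule that a subject is a URI or blank node, a predicate is a URI, and an object is a URI, blank node, or literal. The class and function names are illustrative only and do not come from the paper under review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    lexical: str


def is_valid_triple(subject, predicate, obj) -> bool:
    """Check term types against the positions allowed in an RDF triple."""
    return (
        isinstance(subject, (URI, BlankNode))          # subject: Uri or BN
        and isinstance(predicate, URI)                 # predicate: Uri only
        and isinstance(obj, (URI, BlankNode, Literal)) # object: Uri, BN or Literal
    )
```

A metric definition that ranges only over URI subjects and URI objects would silently skip triples such as `(BlankNode, URI, Literal)`, which this check accepts.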

section 6
*explain in table 7 what approx. chi-square is and how 999.81 is to be interpreted.
*explain also df and Sig.
*My understanding is that you use the quality assessment output of each dataset as input to the KMO and Bartlett's Test of Sphericity. How is it that the quality assessment output values alone allow you to reject the H0 hypothesis? Can you be more explicit on this?
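
For context on the Table 7 quantities, a minimal Python sketch of the standard Bartlett's test of sphericity statistic follows. It assumes a p x p correlation matrix R computed from n observations, with chi-square = -((n - 1) - (2p + 5)/6) * ln|R| and df = p(p - 1)/2. Sig. is then the upper-tail probability of that chi-square value at df degrees of freedom, which is not computed here because it needs a chi-square distribution function. H0 states that R is the identity matrix. The helper names and any matrix passed in are hypothetical and unrelated to the paper's data.

```python
import math


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest absolute entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
    return det


def bartlett_sphericity(corr, n_obs):
    """Return (chi_square, df) for a correlation matrix and sample size.

    Assumes `corr` is positive definite; a singular matrix makes the
    logarithm undefined and raises ValueError.
    """
    p = len(corr)
    chi_square = -((n_obs - 1) - (2 * p + 5) / 6) * math.log(determinant(corr))
    df = p * (p - 1) // 2
    return chi_square, df
```

When R is exactly the identity matrix the statistic is zero, so a large value relative to df is evidence against H0.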

Minor comments:
*re-write sentence: "PCA helps in finding the best possible characteristics to summarise the given data as well as possible." -> in particular, the part "as well as possible" is very vague
*once the acronym is introduced, use only the acronym, e.g. avoid "check whether Principal Component Analysis (PCA) "