Name-based Approach to Build a Hub for Biodiversity LOD

Tracking #: 419-1542

Authors: 
Hideaki Takeda

Responsible editor: 
Guest editors Multilingual Linked Open Data 2012

Submission type: 
Dataset Description
Abstract: 
Because of a huge variety of biological studies focused on different targets, i.e., from molecules to ecosystem, data produced and used in each field is also managed independently so that it is difficult to know the relationship among them. We aim to build a data hub with LOD to connect data in different biological fields to enhance search and use of data across the fields. We build a prototype data hub on taxonomic information on species, which is a key to retrieve data and link to databases in different fields. The core of this hub is the dataset for species and taxa. We adopted the database called “Building Dictionary for Life Science (BDLS)” that contains relationship between scientific names and common Japanese names. Based on this dataset, we integrate various datasets such as domain-specific taxonomies and specimen databases.
Tags: 
Reviewed

Decision/Status: 
Reject and Resubmit

Solicited Reviews:
Review #1
Anonymous submitted on 27/Feb/2013
Suggestion:
Major Revision
Review Comment:

Review

This paper presents a data hub for biodiversity that was constructed by integrating several biological datasets pertaining to species from the kingdoms Animalia and Plantae. The approach chosen to construct the URIs in the datasets is a name-based approach, which (as far as I understand, see comments) means that the URIs for resources are generated from their labels. As the authors clearly point out, this approach has several advantages w.r.t. data integration.

Major comments

Quality of the dataset

1) Interestingly, the dataset classifies the names of species and the species themselves. Ergo, it contains both URIs for species and URIs for the names of species. This is ontologically questionable, to say the least. For example, can a scientific name such as really have a super taxon (note that taxa are groups of organisms and not of labels for organisms; the supertaxon relation is expressed by in the dataset), given that it is a name and not the URI of the species labeled with this name? Formally, I would suspect that this is simply not the case.

2) The superior data integration achieved by turning names into URIs also seems questionable. I would have expected an ontology which reflects the biological taxonomy Kingdom - Phylum - Class etc. without the extra classes for labels. Those should simply have been set as the values of the "rdfs:label" property. That would solve the main problem I pointed out above. The URIs could have been generated by concatenating the Latin name of the species and its kingdom, thus solving the ambiguity problem.

3) Missing language tags for English labels. Species with Japanese names do not carry an English rdfs:label.
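
As an illustration of comments 2 and 3, the following Python sketch mints one URI per taxon from its kingdom and Latin name and keeps the names as language-tagged rdfs:label values. The namespace http://example.org/taxon/ and the helper names mint_taxon_uri and label_triples are hypothetical and are not taken from the reviewed dataset; the labels are only illustrative.

```python
from urllib.parse import quote

# Hypothetical namespace, used only for this illustration.
BASE = "http://example.org/taxon/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def mint_taxon_uri(kingdom: str, latin_name: str) -> str:
    """Build a taxon URI from the kingdom and the Latin name.

    Including the kingdom keeps cross-kingdom homonyms apart.
    """
    slug = "_".join(latin_name.split())  # collapse whitespace to underscores
    return BASE + quote(kingdom.strip().capitalize()) + "/" + quote(slug)


def label_triples(uri: str, labels: dict) -> list:
    """Return N-Triples lines giving `uri` one rdfs:label per language tag."""
    lines = []
    for lang, text in sorted(labels.items()):
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{uri}> <{RDFS_LABEL}> "{escaped}"@{lang} .')
    return lines


if __name__ == "__main__":
    uri = mint_taxon_uri("Animalia", "Pieris rapae")
    for line in label_triples(uri, {"en": "small white", "ja": "モンシロチョウ"}):
        print(line)
```

Under such a scheme the genus name Pieris, which is used for both a butterfly genus and a plant genus, would yield two distinct URIs, and finding both the Japanese and the English name of a species reduces to reading the labels of a single resource.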

Usefulness

Definitely given. The main problem with this dataset is that some practical tasks (such as finding both the Japanese and English name of some species) might be a bit tricky to solve.

Clarity and Completeness of the description
Major issues here as well. The authors use quite a bit of non-standard (at least amongst the Semantic Web community) terminology. Although I know that it is difficult to find a native speaker to review the text one has written, I'd strongly advise the authors to do so, as the paper is full of mistakes.

Questions
1 - How do you define that the relationship between two datasets is weak? Please be more specific and quantify.
2 - What is the Citizen Science Program? Please add a reference.
3 - Do you mean lay people by general people?
4 - What's an authorized name?

Comments to the text
Abstract
- Each field: Which field?
The relationships
- data hub with LOD: do you mean an open data hub?
- data in: data from
- enhance search: the search

Introduction
- I'm afraid the first sentence makes no sense to me: please rephrase.
- molecules to ecosystem: ecosystems
- , to ecology: no comma
- data for own field : data from its own field

Related Work
- It is a different approach from other studies: this approach differs from that of other studies
- There are two major benefits for our approach: Our approach yields two main benefits.
- since identification: since the identification
- The second is that linking to other
- to provide authorized names:

Core Dataset
- main parts, i.e., one is (remove i.e.)
- It contains 55,759 scientific names and 57,929 Japanese. Please replace the "it" with an absolute reference.
- This data has 56,590 specimen data. Do you mean data items? RDF triples? Rows? Statements?

Data Model for species information
- information is treatment on various: the treatment
- 1. A domain-specific dataset for taxon names: the Current Checklist of Japanese Butterflies (full stop missing)

Review #2
By Clement Jonquet submitted on 21/Mar/2013
Suggestion:
Major Revision
Review Comment:

Overall the paper does not address issues related to multilingual LOD. The authors mention the existence of English names in the datasets they are using (mainly in Japanese) but do not discuss or detail how they have dealt with multilingualism during the process of integrating those 3 datasets and publishing them as LOD.

The main issue with the paper is that it fails to position the approach (i.e., based on names as the central element rather than taxa) against other models/initiatives, which are nevertheless properly identified (DwC & TaxonConcept). In addition, the paper lacks examples throughout.

However, the contribution in terms of published data to the LOD is clear and valuable.

Major comments by sections:

- Abstract: You do not clearly mention your contribution in the abstract. You should explicitly mention the 3 datasets that have been integrated, the data model designed to support such integration, provide some statistics as well as a link to the available LOD. Also, you do not mention here the specificity and advantages of your approach, which is to use names as primary objects rather than taxa.
- If “data hub” is the name of your system/result, then identify it clearly and reuse that name throughout the paper. Briefly describe that term at first use in the abstract or introduction.
- S2: Related work section is a bit light. Biodiversity Informatics is not just databases about biodiversity. Final links in the section to Dwc and TaxonConcept are interesting and more discussion about these (and others) would be appreciated here. You should bring the reader to understand the limits of related work up to date and announce here the drawbacks/limits that your system addresses.
- S3: “fulfill the requirements”. You do not mention any “requirements” in S1. You gave a review of 3 diversity issues, but no requirements for a system or potential solution to have.
- S3: “fist class entities”. Your use of this expression is very ambiguous. The famous SICP book has a formal definition for this (from Strachey) but I am not sure this is what you mean.
- S3: You should clarify and provide examples which compare your work to Dwc and TaxonConcept. This aspect (considering the name as the primary element/identifier) seems to be what distinguishes your approach the most; therefore, you should spend the appropriate space on detailing it. Taxon concepts cannot always be consensual, and are less so than names; however, we all know names can be very ambiguous, and you yourself acknowledge the homonymy problem, so could you please back up your choice more thoroughly and discuss this aspect.
- S3: What do you mean by “identification (…) can be postponed”? How is it positive?
- S4: What is a “source”, examples?
- S6: Would you say this data model is a “common data model” for the 3 described resources?
- S6: Could you provide an explicit comparison of your data model presented in fig1 with alternative approaches such as Dwc or TaxonConcept? This is indeed needed to evaluate the contribution and relevance of your choices. What exactly is the node “species” compared to other nodes? Is this an owl:Class?
- S7: as previous remark on homonymy, you mention finding 1797 issues, but do not say what you did to deal with them.
- S8: “translated the data to linked data”: could you be more specific, could you refer to LOD publishing best practices (i.e., demonstrate your dataset deserves a 5-stars mug). Especially, could you specify how you generated the “out-going links” to other datasets, which ones (in addition to DBPedia) and why you chose those ones?
- S9: “can be linked to each other”: this is a feature that approaches based on taxa (rather than names) would also have enabled. You should conclude on the added value of your approach.
- References: You have a shift between your references citation and your reference list + [7] does not exist. Seems that [1] has concatenated 2 references.

Minor comments:

- Expand the LOD acronym in the title & abstract
- Abstract: “which is a key… fields” repeats the phrase just before.
- Abstract: What do you mean by “adopted”, at that time in the reading it’s not clear.
- Abstract: “relationship”S
- Abstract: “Japanese & English names”
- S1: “Biodiversity becomes a (…) problem”. I would not say “biodiversity” is the problem. This is a concept, a measure, a science. The “scientific & social” problems are actually mass extinctions, climate changes, etc. and their effects on biodiversity.
- S1: Data lacks information about relationships as well as semantically rich descriptions (i.e., with ontologies).
- S1: Expand the DDBJ and NCBI acronyms here at first mention.
- S1: “their relationships”, do you mean interconnections, interoperability?
- S1: “from each other” + also less visible to the community.
- S1: “build a data hub” + to address the three diversity issues mentioned before
- S2: You should provide a reference for “under discussion”.
- S2: You should provide a reference for TaxonConcept. When possible you should prefer scientific references rather than links.
- S3: “It provides”, what is “it”? The policy?
- S3: “It also provides”, what is “it”?
- S3: The NCBI acronym must be expanded earlier in the paper.
- S3: “The second”-> the second policy
- S3: “Processing on the network” what do you mean?
- S3, last sentence: English
- S4: “two main parts”, -> two main parts in BDLS
- S4: What is the “dictionary for terminology”
- S4: “It contains”, what is “it”? *2
- S4: “provenance” -> provenance information
- S4: “The analysis” what kind of analysis, that says what?
- S4: “Species2000”, reference?
- S5: Enumerated item and following paragraph start with same sentence/expression.
- S5: Missing ref for Bryophytes DB.
- S6: “expressed in named graph” + and formalized in OWL, no?
- S6: Footnote #5 is not clear.
- S7: LOD acronym should be defined at first use.
- S7: “but we can just a set”: English
- S7: “Though (…) cases”; English
- S8: “the whole (…) yet”: could you clarify, and throughout the paper make sure we understand what the “Data Hub” is and what the “dataset” is.
- References: bad left-side alignment

Review #3
By Paolo Ciccarese submitted on 02/Apr/2013
Suggestion:
Reject
Review Comment:

The paper presents a method for building a hub for linking taxonomic information on species. The idea revolves around integrating scientific names - considered first class entities - using the linked data approach. The authors claim the presented approach facilitates name linking, local name linking and maintenance, with a few minor drawbacks.

The authors attempt to explain the methodology and its limitations. However, the presence of just one single abstract figure is certainly the biggest limitation. Two or three concrete examples - with related figures - would have helped enormously in delivering the message.

Concrete examples should also be given to illustrate the drawbacks of the presented methodology, for instance to show how homonymic (I am not sure the used term 'homonymical' exists) names can be disambiguated using data.

The result section is vague. For instance, it talks about 'one of the potential problems'. What other problems have you found?
Also, the sentence 'It indicates that many of the relations between them are left to be added' needs more explanation and some examples and counterexamples.

The language is problematic towards the end of the paper. For example: "We can show the results by matching names but we can just a set of taxon names but not show representative names".

The bibliography is poor.