Names Are Not Good Enough: Reasoning over Taxonomic Change in the Andropogon Complex

Tracking #: 624-1834

Nico M Franz

Responsible editor: 
Guest Editors Semantics for Biodiversity

Submission type: 
Full Paper
The performance of taxonomic names and concepts to act as identifiers of changing taxonomic content is analyzed and visualized using a novel Answer Set Programming reasoning approach. The Euler/ASP toolkit is applied to a use case of eight succeeding classifications, ranging from 1889 to 2006, of a 'complex' of grasses in the genus Andropogon in the Carolinas and surrounding areas. Based on an input of 64 constituent concepts and 131 Region Connection Calculus articulations provided by an expert taxonomist, nine pairwise and logically consistent alignments are inferred and visualized as merge taxonomies that reflect the hierarchical relationships of congruent and non-congruent taxonomic concepts. The respectively valid names are integrated with these results, thus permitting quantitative assessments of name/concept resolution. Accordingly, 65.2% of 46 possible instances of congruence were realized in the alignments. Incongruent concepts were twice as common as their counterparts. Usages of different names failed to continuously identify congruent concepts in 66.7% of 30 cases, and the same names incorrectly identified incongruent concepts in 33.9% of 59 cases. Concept-level articulations can take into account both member- and property-based information when identifying similarities and differences between succeeding taxonomic perspectives. Names and nomenclatural relationships, in turn, are limited to establishing 'identity' based on subsets of ostensively designated (type) members. Using names-based identifiers to represent and reason over changing taxonomic content in the Semantic Web domain will be similarly limited.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 13/Nov/2014
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

Overall a nice paper that addresses an important problem from the
biologist's perspective and an interesting approach from the computer
science perspective. However, I am afraid that in its current form it
doesn't really satisfy either audience. I recommend accepting it with
minor revisions.

Since SWJ is essentially a CS journal, I would suggest focusing more
on readers with a background and interest in CS. For those readers, a
stronger motivation and more careful explanation of the biology
problem that you are trying to solve would be good. On the other hand,
this audience probably does not need to see the full results in the
level of detail presented right now. Discussing just the most
interesting aspects of Figures 5 - 13 might be sufficient.

Instead, devoting a bit more space to the approach and the tool used
would be of interest. How does the tool actually reach its
conclusions? How much effort (in terms of storage, cpu, ...) is
needed? How does it scale? How much preparatory human effort is
needed? Do the results reflect expert judgement?

I missed a discussion on how the results of your approach can be used
in biology - and how the approach can be adapted to other problems.

RAB is abbreviated as RAD in Fig. 1.
It might improve readability if you used the years as identifiers
(as in Figs. 5-13) throughout the paper instead of mixing years and
author names.

Review #2
Anonymous submitted on 13/Nov/2014
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

In this paper the authors present evidence that taxonomic reasoning in the semantic web will require richer conceptual specification than is provided by the standard way of establishing taxonomic identity via formal taxonomic nomenclature. The codes of nomenclature present a level of formality in the scientific process that is not always matched in many other fields, and thus taxonomic reasoning would seem to be an excellent match for the semantic web. Nevertheless, in practice using names to identify taxonomic groups presents certain shortcomings, well-established in the field, that limit the ability to reason about distinct taxa. In this paper the authors clearly present evidence that concept-level articulations based on RCC5 are much more effective at identifying congruent and incongruent concepts. The analysis is in depth and thorough and is original within the scope of current semantics for biodiversity research.

Although the paper is very well written and well-suited to the special issue, it seems to be written primarily for specialists in biodiversity, and I would suggest the following changes to the paper, in order that it reach a wider audience. Concept drift is not a problem unique to taxonomy -- it occurs in all scientific fields and thus it would be good for the authors to discuss how the methods used in this paper for this problem can be generalized to other domains, if at all. Are there analogues to the limitations of naming in taxonomy that can be found in other domains of knowledge? Can the answer set programming tool that is utilized be used for other non-taxonomic concepts? If so, the impact of this paper could be broadened. E.g., it would be good to relate to this work on concept drift in the semantic web:

Shenghui Wang, Stefan Schlobach, Michel Klein (2011) "Concept drift and how to identify it", Web Semantics: Science, Services and Agents on the World Wide Web

The Andropogon complex seems like a suitable use case, but can the authors provide some more justification for why this particular use case was chosen? In the future research directions the authors point to some of the peculiarities of this use case, and some of the other complications that might occur in other cases. The other cases are spoken of in abstract terms, can you provide some specific examples of other use cases that you plan to investigate that have different characteristics from the Andropogon case?

The Euler/ASP toolkit uses region connection calculus to align and merge taxonomies. Although some readers will be familiar with qualitative reasoning with a connection calculus, many will not, and it would be good for the authors to provide more background information on this. Because RCC5 ignores the cases in RCC8 with touching regions, it can very easily be described in set-theoretic terms (even a small Venn diagram version in a figure would be helpful). The expert-provided articulations represent a considerable (though not insurmountable) investment of time. How realistic is it to repeat this process given millions of taxonomic concepts? If, as alluded to in the discussion section, less fine-grained articulations can be incorporated into the toolkit, then are there information sources from which this information can be automatically (or semi-automatically) lifted?
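
To illustrate the set-theoretic reading suggested above, here is a minimal sketch in Python that classifies two non-empty sets into one of the five RCC5 relations. It is only an illustration of the calculus, not a description of how Euler/ASP works internally: the toolkit reasons over expert-provided articulations with Answer Set Programming rather than by comparing member lists. The names `RCC5`, `rcc5_relation`, and the specimen labels are hypothetical and chosen for this example only.

```python
from collections.abc import Set
from enum import Enum


class RCC5(Enum):
    """The five base relations of RCC5, read set-theoretically."""
    EQ = "congruent"              # A == B
    PP = "properly included in"   # A is a proper subset of B
    PPI = "properly includes"     # A is a proper superset of B
    PO = "overlaps"               # shared members, neither contains the other
    DR = "disjoint"               # no shared members


def rcc5_relation(a: Set, b: Set) -> RCC5:
    """Return the single RCC5 relation that holds between sets a and b."""
    if not a or not b:
        # RCC5 is defined over non-empty regions.
        raise ValueError("RCC5 regions must be non-empty")
    if a == b:
        return RCC5.EQ
    if a < b:
        return RCC5.PP
    if a > b:
        return RCC5.PPI
    if a & b:
        return RCC5.PO
    return RCC5.DR


if __name__ == "__main__":
    # Hypothetical member sets for two concepts that share a name.
    concept_earlier = {"specimen_1", "specimen_2", "specimen_3"}
    concept_later = {"specimen_2", "specimen_3"}
    print(rcc5_relation(concept_earlier, concept_later).value)
```

Exactly one of the five relations holds for any pair of non-empty sets, which is the property a small Venn diagram figure would convey.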

Finally, it would be good to give some concrete examples of the kinds of applications that being able to reason about congruent and non-congruent taxonomic concepts will enable. E.g., for dataset interoperability and integration, annotation of scientific publications, description of occurrence data.

Minor comment -- pg. 2 second column "The added of resource resolution..." There seems to be a missing word here.


This paper tackles the problem of congruence of changing taxonomies over time in the specific use case of the Andropogon complex in the biology domain. Authors describe the many issues biological taxonomists have when they classify specimens, especially when they deal with multiple classification principles and different (and not always consistent) proposed taxonomies. The paper proposes the Euler/ASP framework, which uses RCC5 calculus and a set of extension constraints to decide whether a pair of taxonomies are congruent, non-congruent, properly included or overlapping. This reasoning process is applied to multiple taxonomies (published from 1889 to 2006) of the Andropogon complex to study agreement between taxonomists.

The topic the paper addresses is of major interest to the Semantic Web community. Concretely, it poses a very interesting case study for (a) integration of multiple taxonomies designed with disparate criteria; and (b) reasoning on taxonomy changes over time. The authors may have devised an interesting method to study both, and I find it very valuable that they stress the need for better methods to reason over taxonomy change. However, I have major concerns about the presentation of the work in the paper, namely its structure, layout, and gap with research in the Semantic Web. Overall, the structure of the paper is very confusing to the reader, and several sections are ill-structured:

- The abstract lacks a proper motivation and problem statement, which are especially important for readers not versed in biology. The approach and results are well described and take up 95% of the abstract, although the conclusions are weak. I would suggest structuring the abstract clearly and explicitly into motivation, problem statement, approach, results and conclusions.
- The introduction contains all the ingredients needed to make sense of the paper, but it reads as too dense and too broad in scope. The first two paragraphs summarize a lot of domain-dependent knowledge that sets up the problem space and that, in my opinion, deserves a much more careful description. I would suggest explaining the problems the codes and principles pose to the field in a more descriptive and informal way in the introduction, and then being more concrete and going deeper in a Background section, clearly explaining all domain-dependent names and concepts and using real examples with figures if necessary (e.g. how taxonomic names are used as identifiers, what is a merge taxonomy, what is a congruent taxonomic concept, what is a concept-level articulation, etc.)
- The state of the art is scattered across the paper, especially in the Introduction and Discussion. The lack of a Related Work section is very confusing. I would suggest grouping all the related work in a common section, where all the related research is clearly structured and discussed.
- Although I understand it is not the goal of the paper, it is surprising that related Semantic Web areas such as ontology matching or SKOS/OWL semantics are not mentioned, and that only one paper (Tuominen 2011) is cited, in paragraphs "Several alternatives..." and "Here we use the novel..." of Page 2.
- The rest of the introduction, from paragraph "Here we use the novel..." onwards, describes the actual method followed to tackle the problem of congruency in taxonomic change. To me, this requires an extensive section describing Euler/ASP in full detail, omitting implementation particularities (i.e. under a Methodology or Proposed Approach section). I would also suggest describing the RCC5 calculus and the extension constraints extensively in the Background section, leaving only an overview in the introduction. The choices in favour of RCC5 and the constraints need to be properly justified (i.e. instead of other approaches).
- Section 3 is entirely devoted to analysing the output of the proposed approach in minute detail. The text is too long and provides too little compared with the visual outcome of the Figures. I would suggest that the authors leverage Table 3 for this section, extending it and working out interesting, relevant and summarized results with additional metrics. The reader misses a clear overview of the congruency of the taxonomy series as a whole. Moreover, no comparison to any gold standard or expert evaluation is performed, and few clues on the approach's correctness are given.
- The Discussion section is too superfluous, and it should focus on extending the analysis of section 3.2, providing insights on why coherence of the taxonomy series fluctuates the way it does, raising the limitations of the proposed approach, correctness, suitability for other domains, etc.
- A Conclusions section with clear conclusions and future work would be desirable.
- Figures and tables do not meet the journal layout, see Placing all figures and tables as an annex is very confusing to the reader. I suggest to use the \begin{table*} trick for large tables in order to place them appropriately in the text body. All figures can be colspanned too if necessary.

Other minor comments:

- Page 2: "Current solutions to overcome the taxonomic name/concept discordance usually take one of two major pathways". You immediately discuss the first one. Is the next paragraph describing the second one? (if it is, say so)
- Page 2: Use the same RCC5 symbols as literature. Describe formally the semantics of each relationship in a Background section.
- Page 3: In paragraph "The consistent column depth..." I suggest the use of the word "partition"
- Section 2.3 contains too many implementation details; refer to the software documentation in the tool repository
- Page 4: Paragraph between Fig. 3 and Fig. 4 leads to confusion; is "The Hackel (1889) and RAB (1968)..." referring still to Fig. 4, or is it pointing to Fig. 2?