Names Are Not Good Enough: Reasoning over Taxonomic Change in the Andropogon Complex

Tracking #: 1027-2238

Authors: 
Nico M Franz
Mingmin Chen
Parisa Kianmajd
Shizhuo Yu
Shawn Bowers
Alan S. Weakley
Bertram Ludäscher

Responsible editor: 
Guest Editors Semantics for Biodiversity

Submission type: 
Full Paper
Abstract: 
This contribution introduces a novel, logic-based solution to the challenge of tracking the provenance of meanings across multiple biological taxonomies. The challenge arises due to limitations inherent in using type-anchored taxonomic names as identifiers of increasingly granular semantic differences being expressed in original and revised taxonomic classifications. We address the challenge through: (1) the use of taxonomic concept labels (thereby individuating name usages according to particular authors) that permit the assembly of alternative concept taxonomies; (2) sets of user-provided Region Connection Calculus articulations (RCC-5: congruence, proper inclusion, inverse proper inclusion, overlap, exclusion) among paired concepts represented in the alternative taxonomies; and (3) the use of an Answer Set Programming-based reasoning toolkit that ingests these and other taxonomic constraints to infer and visualize consistent multi-taxonomy alignments. The feasibility and utility of this approach are demonstrated with a use case involving pairwise alignments of 11 non-congruent classifications of Eastern United States grass entities variously assigned to the Andropogon glomeratus-virginicus 'complex' over an interval of 126 years. Derivative analyses of name/meaning identity reveal that, on average, taxonomic names are reliable identifiers of taxonomic identity for approximately 60% of the 127 merge regions obtained in 12 pairwise alignments. Name/meaning cardinality over the entire use case time interval ranges from 1/6 to 4/1, with only 1/36 names attaining the semantically ideal 1/1 ratio. We discuss the significance and scalability of the RCC-5 concept alignment approach in the context of building logically tractable solutions for identifying taxonomic provenance in biodiversity data and other Semantic Web environments.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Martin Pullan submitted on 03/Jun/2015
Suggestion:
Minor Revision
Review Comment:

I consider this paper to be of sufficient quality and originality to be acceptable for publication with some minor revisions. The paper adds well-reasoned support to the argument that biological nomenclature is insufficient for tracking cryptic changes to taxon concepts arising from taxonomic revision, and therefore that taxonomic names are an inadequate basis for the analysis of biodiversity patterns over time. As such, it is a welcome contribution to the field.
However, in addition to the corrections detailed at the end of this review, I have a number of concerns regarding the arguments put forward in the article which the authors may wish to take into account before publication.
1) While the paper provides excellent evidence for the unintended complex relationships between taxonomic concepts created as a result of the standard taxonomic revision process, no evidence is provided as to how this reasoning could be applied to real-world data integration problems. My concern is that any attempt at such a real-world application would not be viable for the following reasons:
a. The vast majority of observation data do not have associated with them information as to the conceptual basis (i.e. classification) in which the name association was made. It would not be possible to know which mapping to apply.
b. Beyond the trivial cases of complete congruence or incongruence, knowledge of the concept relationships does not allow the assignment of individual observations to the appropriate cryptic congruent sets. This could only be achieved by re-identifying the observations within the appropriate concepts. Since most observation records are unvouchered, such a re-identification (i.e. applying the rules of set membership) would not be possible. Moreover, since reasoning over names alone will not allow the deduction of the rules of set membership, there will be no rule set to apply.
2) The technique as described relies in the first instance on the provision of a concept map by a taxonomic expert. There are 2 issues here:
a. The mapping created by the expert will be a matter of opinion and therefore open to subjective review by other experts. The existence of competing inter-classification mappings is therefore a very real possibility and creates the potential for even more confusion in the future.
b. No attempt has been made to capture the reasoning of the expert when creating their initial mapping. The expert will have presumably declared congruencies based on the presence of shared characteristics between sets. Inclusion of this information in the input set may enable the rules of inclusion for the cryptic sets to be determined through automatic reasoning. It would also allow for an independent test of the internal consistency of the congruencies declared by the expert.

The paper, while acceptable in its current form, is perhaps a bit long-winded and would probably benefit from some editing. The full set of output graphs is probably not required, or should be moved to an appendix, with only those relevant to the description of the results retained in the main text.

Specific corrections
Section 2
Para 2: The use of the word “valid” is inappropriate at this point, since it has a specific meaning within the context of biological nomenclature which does not apply here and is therefore confusing. I suggest the authors replace the word “valid” with the word “appropriate” or “applicable”.

Section 3
Para 2 Add a space between “inconsistent” and “input”

Section 4
Para 1: Change
“taxonomic records or entities can either be relevant merged”
To
“taxonomic records or entities can either be merged”

Section 5.1
Para 2: Reference to Figure 2 in “The tabular input alignment shown in Figure 2” should refer to Figure 2

Section 7.2 Quantification of name/meaning dissociation
Para 3: ”Two mutually exclusive concepts have identical names and therefore presumably refer to the same type.” This is not correct in the case of homonyms where the same name will have been applied to multiple types and in which the later name is invalid. This is a relatively common occurrence. As is the converse of superfluous names in which the same type has been independently associated with a name multiple times.
Para 9 Change “simultaneous” to “simultaneously”

Section 8.2
Para 2 Change “inept” to “inapt”

Review #2
Anonymous submitted on 16/Jun/2015
Suggestion:
Accept
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The revision of this paper has addressed the main comments in my original review. The authors have clearly articulated the value and need for concept-level taxonomic annotations. In addition, the authors have added a new section describing how the methods scale outside of the Andropogon use case and point to how RCC-5 concept alignment might be applied more broadly. The work presents an excellent foundation for future work on using RCC+Answer set programming in taxonomy alignment and related tasks. Based on this assessment I think the paper should be accepted for the special issue.

Review #3
Anonymous submitted on 29/Jun/2015
Suggestion:
Accept
Review Comment:

I fundamentally disagree with the approach presented by the authors, but I find it unnecessary to impose my own view on this issue, mainly because if I did so this approach would never be considered as a viable solution to reconcile taxon names and concepts. The alignments the authors talk about regard names/strings of text, not actual concepts; these are still being interpreted by humans and not machines. Therefore, there is still an element of subjectivity in the assignment of concepts to name labels, and that can actually vary from author to author and through time. I don't find this paper particularly original, given that this approach has been previously published, but again, I am not opposed to seeing this particular use case in print. The paper is written nicely and the results are well presented, although the impact on this issue, in my opinion, is ultimately not significant.

Review #4
Anonymous submitted on 10/Aug/2015
Suggestion:
Minor Revision
Review Comment:

This is a well-written and useful paper, articulating the need for more explicit and machine-readable approaches when creating or revising biological taxonomies. The specific semantic approaches the authors advocate couple Region Connection Calculus (RCC-5), with Answer-set programming. Much of the paper involves describing how their open source Euler/X toolkit (available from GitHub) can be used to automatically compute taxonomic affinities through taxonomic alignment, to identify congruences, inconsistencies, and overlaps. It is an interesting paper, and marks a significant contribution to an automated approach for determining compatibility of biological taxonomies, which will be very useful for querying of biodiversity occurrence data, as well as for testing alignments of legacy taxonomies with phylogenetically-informed taxonomic trees.

In the last paragraphs of sec. 7 the explanations became particularly hard for me to follow. Table 5 was hard to interpret, as well as the statement in the 2nd to last para of sec 7-- about "A. virginicus var abbreviatus being the only name that requires neither integration into a taxon concept label, nor RCC-5 articulation"-- is there a clearer way to explain why this is so?

The rest of sec. 7 similarly made statements that didn't follow for me-- given that the Andropogon UC was a "closed" one-- all taxonomies dealt with the same set of types-- assignment of ranks-- as species, varieties, subvarieties etc.-- would seem to be based on expert opinion, so the correspondence of "terms" across these through string-matching is highly subjective. The authors' approach requires consultation with an "expert" (e.g. Weakley) to assert the cross-taxonomic alignments-- how dependent then are the congruence outcomes on these expert opinions? How comparable would these alignments be if made by different experts?

Specific comments:
In 2nd to last sentence of sec. 3, unclear what is meant by: "However, neither set of functions is needed to properly align the Andro-UC input, which already satisfies the criteria of consistency and sufficiency".

In Fig. 2, the label under "31 sec. RAD" should read "31 sec. RAB"

In Table 1, alignment 2 should be for 1983, not 1993.

In sec. 6, the important notion of MIRs is introduced. These are supposed to be included in Appendix 2, but I could not find an Appendix 2 to review. It might not have been uploaded. In any case, a bit more explanation (1-2 sentences) in the text about what "MIRs" are would be helpful.

In sec. 7, 5th para-- ref to Fig 5C should be Fig 4C for the 1950/1948 alignment

Review #5
Anonymous submitted on 12/Sep/2015
Suggestion:
Major Revision
Review Comment:

This is a very interesting paper on an extremely important issue for the future of biodiversity research, and my recommendation is that it be published. Tables 3 and 4 in particular provide some really interesting insights into how differently a single species complex consisting of only 8 species-level names can be reinterpreted by different taxonomists, and clearly demonstrate the value of taxonomic concepts as opposed to simply using species names in biological analyses. The authors show a way forward on this front, combining expert information with computational results to derive visual and logical representations of the relationships between the taxonomic concepts that are part of this taxonomic group. I particularly liked your scheme of using '==/' to represent relationships between concepts while simultaneously using '=/!=' to represent relationships between names, and the idea of "reliable" and "unreliable" names. The authors present several different ways to represent taxonomic information, including an easy-to-read tabular view, a harder-to-read-but-more-detailed concept view, a graphical representation of relationships between concepts within a single publication and between different publications.

I would recommend major revisions before acceptance for three main reasons:

1. I found this paper hard to read: this is unsurprising, seeing as understanding it requires a good grounding in both taxonomy and logic, both difficult subjects, but I think the authors could have done a better job of:

(a) Trimming unnecessary material: were all the taxonomic visualizations in figure 4, 5 and 6 necessary? Could some be moved to supplementary materials? Also, you discuss type-bearing names at the start of section 2, but I'm not sure that that's necessary to understanding name/concept relationships.

(b) Using simple language (the second sentence of the manuscript, "The challenge arises due to limitations inherent in using taxonomic names as identifiers of increasingly granular semantic differences being expressed in original and revised classifications", could potentially be simplified to "Because of the way in which they are defined, taxonomic names are inadequate to expressing fine semantic differences between original and revised classifications", for instance).

(c) using a more traditional paper opening, in which you start with a research question and hypotheses before delving into methods and results. In particular, it wasn't clear to me what you were planning to do until I got to the methods, results and discussion; I think making that clearer right at the start of your introduction would be helpful to readers.

2. I think you should emphasize your problem statement and key findings more, either in your introduction or in a conclusion. How is Euler/X useful for taxonomists, biodiversity researchers or logicians? I can see real value in being able to identify taxonomic names that are "reliable identifiers of taxonomic identity" and "merge concept regions for which there are no unique identities and names" -- are there others I missed? I think you should emphasize how your findings here differ from previous research and what's new here.

3. I think you should talk about some of your findings a bit more in the discussion, including:
- What is the taxonomic significance of "merge concept regions for which there are no unique identities and names present in the respective input taxonomies"? Should they be named by taxonomists, or are they just a necessary crutch when relating taxonomies to each other? I found it particularly interesting that Euler/X came up with 127, while figure 2 only shows 100 -- are there concepts here that taxonomists would have found hard to discover without this tool? Is Euler/X "oversplitting" in a way that no taxonomist would?
- What are the challenges towards expanding Euler/X so that it supports all 55 possible pairwise comparisons? Is this something that will become possible in the next few years because of Moore's Law, or is it unfeasible until more sophisticated logical analysers are developed over the next decade?
- Is your research of immediate benefit to biodiversity researchers? For example, could a scientist analyzing records from GBIF use your results to determine that the name referenced by, say, _Andropogon hirsutior_ records has only had a single meaning and does not require further reconciliation, while the name referenced by _Andropogon virginicus_ records has had a great many meanings and so should not be used without being absolutely sure of the circumscription intended by the person who created the records?

Apart from that, the only other suggestion I have is that references like "The logic foundations for this particular approach were developed in [15,42,83]" can be hard to read -- I would suggest replacing this with "were developed in previous work by these authors [15, 83] and others [42]".

As described above, I was very impressed with this paper and would definitely like to see it published! However, I would strongly counsel major revisions to improve its readability before it is published -- this will ensure that this excellent paper will be of maximum benefit to the largest number of people!

