Linking Earth and Climate Science: Semantic Search Supporting Investigation of Climate Change

Line C. Pouchard, Marcia L. Branstetter, Robert B. Cook, Ranjeet Devarakonda, Jim Green, Giri Palanisamy, Paul Alexander, Natalya F. Noy
Linked Science is the practice of integrating and aggregating structured data and information in physical, chemical, biological, sociological, and other traditional fields of scientific study. Much of this data does not live in the cloud or on the Web, but rather in multi-institutional data centers that provide tools and add value through quality assurance, validation, curation, dissemination, and analysis of the data. In this paper, we focus on the data in Earth and Climate Sciences and on the use of ontologies to facilitate search and integration of this data. Mercury, developed at Oak Ridge National Laboratory, is a tool for distributed metadata harvesting, search and retrieval. Mercury currently provides uniform access to more than 100,000 metadata records; 30,000 scientists use it each month. We augmented search in Mercury with ontologies, such as the ontologies in the Semantic Web for Earth and Environmental Terminology (SWEET) collection. We use BioPortal, developed at Stanford University, as an infrastructure to store and access ontologies. We use BioPortal REST services to enable faceted search based on the structure of the ontologies, and to improve the accuracy of user queries.
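The ontology-augmented search the abstract describes can be illustrated with a minimal sketch (the function name and toy hierarchy below are hypothetical stand-ins; the actual system uses BioPortal REST services over SWEET): a search term is mapped to an ontology class, and its transitive subclasses are added as extra search keywords.

```python
# Toy subclass hierarchy standing in for a SWEET-style ontology fragment.
SUBCLASSES = {
    "precipitation": ["rain", "snow", "hail"],
    "rain": ["drizzle"],
}

def expand_query(term, hierarchy):
    """Return the term plus all of its transitive subclasses (breadth-first)."""
    expanded = [term]
    queue = list(hierarchy.get(term, []))
    while queue:
        cls = queue.pop(0)
        expanded.append(cls)
        queue.extend(hierarchy.get(cls, []))
    return expanded

print(expand_query("precipitation", SUBCLASSES))
# → ['precipitation', 'rain', 'snow', 'hail', 'drizzle']
```

Each expanded term is then submitted as an additional keyword against the harvested metadata index, which is how one query can surface records annotated only with narrower concepts.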
Submission type: 
Application Report
Decision:
Reject and Resubmit

Review 1 by Ola Ahlqvist

In general, this paper presents an interesting implementation of some key techniques and resources to address semantic web challenges. I think the described technology architecture has a lot going for it, but the paper does not provide an adequate foundation in the existing literature, and the experiment/evaluation seems limited and is poorly described. As such, the paper does not achieve the contributions listed toward the end of section 1; consequently, the scientific value of the paper is limited.

Section 1 lacks any citations to frame and situate the paper properly in the existing literature. For example, at the end of the first paragraph it would be appropriate to cite a summary article or two on linked science.

The first few sentences of the second paragraph, up until "This process...", on the scientific process can be shortened significantly; we know what that process is.

In the introduction on p. 1 it would be useful to have a more concrete example of how metadata could support "meaningful use" of data, as an illustration of what linked science could look like.

The case scenario in section 2 is very helpful but could be improved by clarifying the objective "validating model simulation trends": which model? Trends over what time frame? Is "river channel" equivalent to a watershed? In the same section, it is unclear what you mean by the sentence "Changes in the chemistry of rivers from two different causes are relevant to climate change."

BioPortal is included in Fig. 2 and featured in Fig. 3, both in section 3, but BioPortal is not explained until the end of section 4. The material needs to be better organized so there is a logical flow of information for readers not familiar with these resources and concepts.

In section 3 you also state that "This centralized repository of metadata with distributed data sources provides extremely fast search results to the user...". How fast is "extremely fast", and why is this important?

In section 4, when you say "We have applied semantic technologies—ontologies, in particular..." (a claim also made in section 3), there should be citations to relevant publications. It would also be helpful to cite work that clarifies the semantic plurality and synonymy problems, and the term "foundational ontology" should be explained.

You should explain what "a faceted search approach" means in section 5.

It is not clear what type of assessment you are interested in. In section 4 you mention accuracy, but is that the same as relevance? Recall and precision are mentioned in 5.1 and 5.2, but how are they measured? What does the Fig. 5 evaluation tell you about relevance, accuracy, recall, and precision?

Review 2 by anonymous reviewer


I read this paper with high expectations raised by the title. To put it shortly: I was expecting the authors to explain how exactly, and to what extent, semantic search would support climate change research.

Now here are my comments:

1) content of the paper

First of all, the motivation for undertaking this effort is quite nicely written. It is indeed challenging to find relevant datasets from different domains for an interdisciplinary field such as climate change research. The outcome, a typical query-expansion/faceted-search system over metadata, sounds appealing.

However, there are several problems in the paper regarding the results.

The authors claim that the system improves both recall and precision. This claim is made without any evidence. If "carbon" and "offset" return two results, that does not mean precision has improved: what if both of these results turn out to be wrongly annotated? Then the precision would be 0%. To have evidence of precision improvements you would need to carry out a proper study. Similarly for recall: getting more results does not automatically mean that recall has improved. Perhaps users do not want datasets annotated with sub-classes of the search concepts. Perhaps they do in some cases, but perhaps not. As with precision, only an evaluation could show the real outcome of this study.
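The point can be made concrete with a minimal sketch (the documents and gold standard below are entirely hypothetical): both metrics depend on relevance judgments, so an expanded result set can lower precision while leaving recall unchanged.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that are actually relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant items that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)

# Hypothetical gold standard for a query such as "carbon offset".
relevant = {"d1", "d2", "d3"}

baseline = ["d1"]            # keyword search: one correct hit
expanded = ["d1", "d9"]      # ontology expansion: extra hit d9 is wrongly annotated

print(precision(baseline, relevant), recall(baseline, relevant))  # → 1.0 0.333...
print(precision(expanded, relevant), recall(expanded, relevant))  # → 0.5 0.333...
```

Here the expansion doubles the number of results yet halves precision and leaves recall untouched, which is exactly why result counts alone prove nothing without relevance judgments.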

I am afraid that I cannot recommend this paper for publication before the precision and recall analysis of the results is properly done.

Also, I did not understand the argument in section 6 for why "structured data" in their approach could not be thought of as "documents". In many information retrieval settings a "document" is defined as whatever kind of thing is to be retrieved: photos, music, or people, to give a few examples.

2) writing style

Overall, the writing is mostly clear and well done. The authors, however, use a mixture of words like "concept", "term", "entities", "class", "keyword" and "string-based keyword" without defining what they mean by them. Perhaps this is intuitively clear to many readers, but do the authors, for example, distinguish between "keyword" and "term", or "concept" and "keyword"? In a semantic search paper I would like to see a more harmonized and motivated use of these words.

3) references

In terms of references to existing work, the writing is very sloppy. Many acronyms are mentioned (including SKOS, RDF, OWL, ...) without references, or even a footnote with a web address for more information.

The authors seem to restrict themselves to data only, but perhaps they should reconsider what they mean by Linked Science. The papers below present Linked Science as linking all scientific resources together (methods, data, tools, ...), not just data.

- Tomi Kauppinen and Giovana Mira de Espindola. Linked Open Science - Communicating, Sharing and Evaluating Data, Methods and Results for Executable Papers. In The Executable Paper Grand Challenge, Proceedings of the International Conference on Computational Science (ICCS 2011), Elsevier Procedia Computer Science, Singapore, June 2011.
- Tomi Kauppinen, Alkyoni Baglatzi and Carsten Keßler. Linked Science: Interconnecting Scientific Assets. In Terence Critchlow and Kerstin Kleese-Van Dam (Eds.): Data Intensive Science. CRC Press, USA, 2012.

Review 3 by anonymous reviewer

While I believe that this is an important contribution on a topic that becomes more relevant by the minute, and is thus surely worth publishing, I see the need for some major revisions and clarifications.

First, the notion of linked science is not really explained or motivated in the paper, especially in section 1. Most of this work has been discussed in the eScience literature before; hence, it seems that the usage of linked data is the only difference. The executable paper, for instance, was proposed and partially implemented before, and the same is true for the use of ontologies to document metadata, models, and scientific workflows. Personally, I think it is very important to motivate new terms or paradigms such as linked science to ensure that they do not end up as pure buzzwords.

With respect to section 4, semantic annotation has been implemented in systems such as Kepler before. The authors need to make a clear case for how their contribution differs, or at least situate their work in the existing research landscape. I would also be interested to read about experiences with SWEET, as this ontology turned out to be too coarse (and not well maintained) for other applications.

For section 5, I assumed that it would clearly show the added value promised in the first part of the paper. I would propose that the authors go into more detail here, especially for the evaluation provided in 5.1 and 5.2. I am skeptical that simple subsumption queries are suitable for testing precision here, as this would require some sort of comparison to the initial conceptualization of the scholar performing the search.

While section 7 provides lessons learned, I would propose adding a part about further work and challenges, e.g., to section 8.

Overall, my impression is that the second part of the paper only partially delivers on the interesting proposals of the first part. Thus, I would recommend reworking the paper, especially with respect to the line of argumentation for linked science, a clear evaluation, and a showcase of added value beyond simple query expansion.



This paper was submitted as part of the 'Big Data: Theory and Practice' call.