EARTh: an Environmental Application Reference Thesaurus in the Linked Open Data Cloud

R. Albertoni, M. De Martino, S. Di Franco, V. De Santis, P. Plini
The paper aims at providing a description of EARTh, the Environmental Application Reference Thesaurus. EARTh represents a common general terminology for the environment, which has been published as a SKOS dataset in the Linked Open Data cloud. It promises to become a core tool for indexing and discovering environmental resources by refining and extending GEMET, which is considered the de facto standard among general-purpose thesauri for the environmental domain in Europe. The paper illustrates the main key characteristics of EARTh as a guide to its usage. It clarifies (i) the methodology adopted to define the EARTh content; (ii) the design and technological choices made publishing EARTh as Linked Data; (iii) the information pertaining to its access and maintenance. Descriptions of EARTh applications and future relevance are also highlighted.
Resubmission after a "reject and resubmit" in round one and also in round two. Round one reviews are beneath the round two reviews.

Solicited review by Natasha Noy:

I commend the authors for taking care to consider and address the reviewers' comments. The submission is much stronger now. However, I think it still requires improvements and clarifications in order to be published in SWJ.

First, the links, through GEMET, to other resources are very nice and indeed make it part of the "cloud" rather than an island. However, I would have liked to see more discussion about the additional, non-GEMET, links that were discovered through SILK and verified by domain experts. How many of these are there? How good were the SILK mappings? What fraction of EARTh is now linked to other thesauri?
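
The last question above amounts to a coverage computation over the link triples. A minimal sketch, assuming the links are available as (subject, predicate, object) triples: the concept identifiers, the `gemet:` and `other:` prefixes, and the function name `link_coverage` are hypothetical placeholders for illustration, not actual EARTh data or results.

```python
# Property URI from the SKOS vocabulary.
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"


def link_coverage(concepts, triples, exclude_prefix=None):
    """Return the fraction of concepts with at least one exactMatch link.

    Links whose target starts with exclude_prefix are ignored, which
    makes it possible to count only the non-GEMET links.
    """
    linked = set()
    for subject, predicate, target in triples:
        if predicate != SKOS_EXACT_MATCH:
            continue
        if exclude_prefix and target.startswith(exclude_prefix):
            continue
        if subject in concepts:
            linked.add(subject)
    return len(linked) / len(concepts) if concepts else 0.0


# Hypothetical toy data: four concepts, four links.
concepts = {"ex:c1", "ex:c2", "ex:c3", "ex:c4"}
triples = [
    ("ex:c1", SKOS_EXACT_MATCH, "gemet:1"),
    ("ex:c1", SKOS_EXACT_MATCH, "other:a"),
    ("ex:c2", SKOS_EXACT_MATCH, "gemet:2"),
    ("ex:c3", SKOS_EXACT_MATCH, "other:b"),
]

overall = link_coverage(concepts, triples)
non_gemet = link_coverage(concepts, triples, exclude_prefix="gemet:")
```

On the toy data, three of the four concepts carry some link and two carry a non-GEMET link. Reporting both fractions for the real dataset, together with the share of SILK candidates accepted by the domain experts, would answer the questions raised here.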

The examples and the queries are nice and give a better idea of what's inside. However, for the SPARQL queries, it would be nice to give some intuition as to what these queries are supposed to find. What is specific about these datasets? For Query 3, you might want to focus on the links only, as this is what (I think) you want to show off.

My major concern remains the usage of the dataset. Section 4 mentions several projects that link to the LOD version. The only "use", however, seems to be creating the exactMatch links. How are these links used by those projects? How did having the EARTh thesaurus available in the LOD cloud help those projects? Being able to link is probably not a goal per se. What can they do now that these links are available that they couldn't do before? I think without such discussion, it is hard to argue for the usage of the dataset.

Also, the submission "teases" with some technical details but never quite provides enough information. For instance, you mention that there are a number of additional relations among the classes. What are they? Table 1 lists only the "usual" ones. There is a mention of materializing relations. Which ones? It's a bit hard to understand exactly the content of the dataset without such details. I would suggest being much more precise about these types of details so that the readers can get a very clear idea of what to expect and what types of relationships they would see.

Solicited review by Marta Sabou:

The authors have significantly improved the EARTh LOD dataset by linking it to several other datasets (DBpedia, AGROVOC, EUROVOC, UMTHES). Additionally, the paper has been extended and improved, especially in terms of examples and a clear description of where and how EARTh is used (section 4). Therefore, I suggest accepting this re-submission as is.

Solicited review by Tomi Kauppinen:

I checked the new version, and the reply to reviewers. All the points have been fully addressed, so I recommend accepting the paper.

Round one reviews:

Solicited review by Natasha Noy:

The paper describes the publication of a thesaurus of bi-lingual environmental terminology as a linked dataset.

The authors spend a fair amount of time describing the database structure, which is perhaps less relevant for this special issue. Instead, the readers might be better served if there were some examples of the terms, perhaps topics that are covered, etc. I think the paper does not contain a single environmental term other than Earth anywhere in the discussion.

I would have also liked to see some discussion of linking this dataset to some other publicly available datasets. It seems that at the moment it is an "island" and the authors never discuss what, if anything, they might gain by using the linked data. It seems that it is more of a format-publishing decision than a true linked dataset.

Finally, it would have been nice to read a bit more about the relevance and applications. It is one of the key criteria for the special issue, and the authors don't do a very good job of convincing the reader that the others have found the dataset relevant. How is it used in projects? The only reference seems to be to another project by the same authors [11]. Does anyone outside of the authors' group use or plan to use the dataset? If they do, it would have been nice to have a description.

Solicited review by Marta Sabou:

The paper presents the EARTh LOD dataset, a thesaurus in the environmental domain derived through the refinement and extension of GEMET and enabling tasks such as indexing and discovering environmental resources. Based on the criteria of the call, I judge this paper as follows.

Quality of the dataset.
High. The paper reports on exposing a thesaurus (EARTh) that has been obtained through refining GEMET from as early as 2001. EARTh is used in several projects, which testifies to its quality and usefulness for the field. The EARTh data has been exposed using the D2R server and it provides both human accessible semantic descriptions and a SPARQL endpoint. In terms of linking, currently there are 4000 links to GEMET, which is a side-effect of the refinement process that led to EARTh rather than of a link-establishing procedure. Given the broad coverage of EARTh as well as the aim to use it for indexing, linking it to multiple other LOD sources (for example, the AGROVOC data set submitted to this call) would be a major benefit. The authors conclude their paper by envisioning future work in link creation. Could they extend this part with some concrete LOD datasets that they are considering for the linking process?

Usefulness (or potential usefulness) of the dataset.
High. Based on section 4, it is evident that this dataset could play a pivotal role in the environmental domain, not just for indexing documents but also by becoming a hub for interlinking with other thesauri in this domain (in the NatureSDIplus project). Section 4 should be improved by making it more concrete. For example, the authors use the rather generic term "recognize" in relation to important institutions/projects. What is concretely meant here? Did these organizations/projects commit to using EARTh? For what purposes exactly? Such clarifications will greatly increase the quality of the paper.

Clarity and completeness of the descriptions.
Good. The authors provide enough details about the dataset and the publishing process, but the paper contains several typos and the URLs in the footnotes cannot be clicked (plus they do not print properly either, probably due to some font inclusion issue).

Minor comments and some typos:
* the references are formatted according to different styles, e.g., the publication year appears sometimes after the authors and other times at the end of the reference.
* abstract:
** "main key characteristics" => keep either main or key but not both
** "made publishing" => "made when publishing"

* section 2: revise the second sentence of the intro text; currently it does not make sense

* section 3.1:
** "has been adopted" => "have been adopted"
** "Naturals Keys" => "Natural Keys"

* section 5: "evolves as result" => "evolves as a result"

Solicited review by Tomi Kauppinen:

The authors had the goal of providing a thesaurus called EARTh online as Linked Data. The authors state that the EARTh content is accessible via HTTP dereferenceable URIs. However, the content (i.e. relations to other concepts) is only delivered as HTML and not as RDF. For example, the URI only serves an HTML version of the description of relations. The authors of course provide a SPARQL endpoint and an RDF dump, but it would be useful to get RDF directly served from the URIs as well. The authors should thus clarify this issue. Moreover, loosely speaking, a thesaurus could perhaps be considered a Linked Data dataset, but then it would be essential to have some convincing linkage to a variety of other thesauri. In other words, the concern is that EARTh seems not to be linked to other data in the Linked Open Data cloud, except via sharing the use of properties from the SKOS vocabulary. The authors mention linkage to GEMET, but it is not served as Linked Data at all when tested (e.g. Taking all this, I am not convinced that this paper should be published in the SWJ special issue, at least not before the above-mentioned issues are taken into account.
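
The HTML-versus-RDF concern can be checked with HTTP content negotiation: request the concept URI with an `Accept` header that prefers RDF and inspect the `Content-Type` of the response. A minimal sketch, assuming a placeholder URI under `example.org` (not an actual EARTh URI); the helper names `build_rdf_request` and `serves_rdf` and the list of media types are illustrative choices, not part of the reviewed system.

```python
from urllib.request import Request

# Common RDF serializations; an illustrative, non-exhaustive list.
RDF_MEDIA_TYPES = (
    "application/rdf+xml",
    "text/turtle",
    "application/n-triples",
    "application/ld+json",
)


def build_rdf_request(uri):
    """Build (but do not send) a GET request that asks for RDF first."""
    accept = "application/rdf+xml, text/turtle;q=0.9"
    return Request(uri, headers={"Accept": accept})


def serves_rdf(content_type):
    """Return True if a Content-Type header value names an RDF media type."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in RDF_MEDIA_TYPES


# Hypothetical placeholder URI, used only to show the request shape.
request = build_rdf_request("http://example.org/concept/1")
```

To run the check against a live server, the request can be passed to `urllib.request.urlopen` and the response's `Content-Type` header given to `serves_rdf`. A result of `text/html` despite the RDF-preferring `Accept` header would confirm the behaviour described in this review.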

One minor issue:

- The references list is quite messy: many references lack details, and they are formatted in a variety of different ways. Please consider polishing them.