Linked Brazilian Amazon Rainforest Data

Paper Title: 
Linked Brazilian Amazon Rainforest Data
Authors: 
Tomi Kauppinen, Giovana Mira de Espindola, Jim Jones, Alber Sánchez, Benedikt Gräler, Thomas Bartoschek
Abstract: 
The Linked Brazilian Amazon Rainforest Data contains observations about deforestation of rainforests and related things such as rivers, road networks, population, amount of cattle, and market prices of agricultural products. The Linked Data approach thus makes it possible to combine ecological, economic, and social dimensions. Our aim has been to 1) dramatically shorten the time needed to collect information for a research setting concerning the Brazilian Amazon, and 2) enable, via the linkage between datasets, novel types of transdisciplinary research for the scientific community.
Submission type: 
Dataset Description
Responsible editor: 
Pascal Hitzler
Decision/Status: 
Minor Revision
Reviews: 

Submission in response to http://www.semantic-web-journal.net/blog/semantic-web-journal-special-ca...

Solicited review by Michiel Hildebrand:

The "data field" that is covered in this paper is fascinating. The authors have access to a large number of datasets that cover different aspects related to deforestation of the Amazon rainforest. I agree with the authors that efforts to make this data machine-accessible and interlinked can be of great value to several research fields.

However, the paper as such fails to convince that the linked data that is created is of sufficient quality to be valuable. In summary, the potential usefulness of the data is high, but because the clarity and the completeness of the paper are low, it is difficult to judge the quality of the linked data that is produced. Below are details on the topics that should be clarified.

Heterogeneity of the data
==
I get the impression that the data is collected from different providers. However, the different datasets are introduced throughout different sections of the paper. It is unclear what the original datasets are and how they are related. The paper should include a section that clearly lists the original "unlinked" datasets. More details that explain the value of this data are also welcome, e.g. who provides the data, why was it created, and what is its status?

Data modeling and vocabulary (re)use
==
I have doubts about the modeling of individual datasets. The paper does not fully describe what type of information is in the original datasets and how this is modeled as linked data. Currently, the authors only provide a number of data excerpts for which the modeling choices are not motivated. For example, assuming that amazon:DEFOR_2004 is an RDF property, why did you decide to model the observation characteristics as properties of this property?

There are two vocabularies used for modeling the data: TISC and OLA. The OLA (Open Linked Amazon) vocabulary sounds like, and seems to be, developed specifically for this purpose (there is also no reference to other usage of this vocabulary). Therefore, the modeling decisions made in this vocabulary should also be explained.

The paper should provide a clearer overview of the modeling decisions made for each dataset. It is nice to have an overview table, but currently the content of the table is not explained at all.

Is the value of amazon:TIMEPERIOD for the DEFOR_2004 variable really 2007?
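
A mismatch of this kind could be caught by a simple consistency check. The following is only a minimal sketch, assuming that variable names encode a year and that the declared amazon:TIMEPERIOD value is available as a string; the dictionary layout and the second row are hypothetical and not taken from the dataset.

```python
import re

# Hypothetical excerpt: variable name -> declared amazon:TIMEPERIOD value.
# Only the DEFOR_2004 / 2007 pair comes from the paper excerpt discussed above;
# the DEFOR_2005 row is invented for illustration.
declared_periods = {
    "DEFOR_2004": "2007",
    "DEFOR_2005": "2005",
}


def year_mismatches(periods):
    """Return (variable, year_in_name, declared_period) for each disagreement."""
    found = []
    for variable, declared in sorted(periods.items()):
        # Look for a four-digit year (19xx or 20xx) inside the variable name.
        match = re.search(r"(?:19|20)\d{2}", variable)
        if match and match.group(0) != declared:
            found.append((variable, match.group(0), declared))
    return found


mismatches = year_mismatches(declared_periods)
for variable, in_name, declared in mismatches:
    print(f"{variable}: name suggests {in_name}, TIMEPERIOD says {declared}")
```

A flagged pair is not necessarily an error (the period could, for instance, refer to a publication year), but it marks a place where the paper should explain the intended meaning.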

Links
==
The table lists the numbers of links to DBpedia. It is unclear how these links are created, what percentage of the resources are linked, and what the quality is. Aren't there other datasets that are more relevant for "deforestation"?
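
To make the coverage question concrete, the figures I would like to see reported can be sketched as follows. This is an illustration only: the resource names, the `ex:linkedTo` predicate, and the link triples are hypothetical and do not come from the dataset.

```python
# Hypothetical local resources and outgoing links to an external dataset.
resources = {"ex:crop_1", "ex:crop_2", "ex:crop_3", "ex:crop_4"}
links = [
    ("ex:crop_1", "ex:linkedTo", "dbpedia:Soybean"),
    ("ex:crop_2", "ex:linkedTo", "dbpedia:Soybean"),
    ("ex:crop_2", "ex:linkedTo", "dbpedia:Maize"),
]


def link_coverage(resources, links):
    """Fraction of local resources that have at least one outgoing link."""
    if not resources:
        return 0.0
    linked = {subject for subject, _, _ in links if subject in resources}
    return len(linked) / len(resources)


def distinct_targets(links):
    """Number of different external resources the links point to."""
    return len({target for _, _, target in links})


coverage = link_coverage(resources, links)
targets = distinct_targets(links)
print(f"{len(links)} links, {coverage:.0%} of resources linked, {targets} distinct targets")
```

Reporting coverage and the number of distinct targets next to the raw link count would show whether a large count reflects many linked resources or many links to a few external entities.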

As the original claim was to integrate different datasets, I am surprised to read little about the interlinking between them. If the spatial relationships are the only method of interlinking, I would suggest elaborating on this topic. I expect that most Linked Data enthusiasts, including myself, are not familiar with the geospatial terminology and techniques.
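
For readers in the same position, the kind of explanation I have in mind could be as simple as the following sketch of deriving links from a spatial relationship. It assumes, purely for illustration, that each resource carries an axis-aligned bounding box; the identifiers, coordinates, and the `ex:intersects` predicate are hypothetical, and the paper may well use a different geometric test or vocabulary term.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned bounding box in arbitrary map units."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def intersects(a: BBox, b: BBox) -> bool:
    """True if the two boxes share at least one point."""
    return (a.min_x <= b.max_x and b.min_x <= a.max_x
            and a.min_y <= b.max_y and b.min_y <= a.max_y)


# Hypothetical spatial units and features from two different source datasets.
cells = {
    "ex:cell_1": BBox(0.0, 0.0, 1.0, 1.0),
    "ex:cell_2": BBox(1.5, 0.0, 2.5, 1.0),
}
features = {
    "ex:river_a": BBox(0.5, 0.5, 2.0, 0.6),
    "ex:road_b": BBox(3.0, 3.0, 4.0, 4.0),
}

# Emit one link triple for every cell/feature pair whose boxes intersect.
spatial_links = [
    (cell_id, "ex:intersects", feature_id)
    for cell_id, cell_box in sorted(cells.items())
    for feature_id, feature_box in sorted(features.items())
    if intersects(cell_box, feature_box)
]

for triple in spatial_links:
    print(triple)
```

Here the river is linked to both cells and the road to neither. An explanation at this level, stating which relationship is computed and between which kinds of resources, would make the interlinking much easier to assess.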

Usefulness
==
A potential strong point of the paper is the example use case. The online tutorials that the paper refers to are impressive; they give a useful introduction from a technical perspective. The paper would benefit from an explanation of the types of use cases that can be supported. What has the linked data enabled or simplified?

Finally, the conclusion adds very little to the paper.

Solicited review by Francois Scharffe:

The paper presents a dataset focused on the Amazon rainforest. The dataset aggregates various statistical data on the rainforest. The paper is clearly presented, introducing various aspects of the dataset. The motivation for the dataset is strong, with some possible impact on correlating agricultural and economic development with environmental issues.

As stated, it would have been better to reuse the Data Cube vocabulary instead of creating a new one.
The figures for the links to DBpedia are questionable. What do the 608896 links to crops in DBpedia represent? Certainly not links from crop varieties to crop varieties. They are then probably links from a crop instance to the corresponding crop in DBpedia. Is this then an instantiation link?
A class "Cattle" seems to be introduced. Is there at least a mapping between this class and the corresponding one in DBpedia or another related vocabulary?

There is a missing sentence in Section 4: The main problem was the …

Solicited review by Dave Kolas:

* Quality of the dataset

The dataset appears to be derived from authoritative sources (Brazil's INPE), so the data has a high quality provenance.

The modeling of the data is interesting. The representation of the Variables is verbose, but seems robust in the face of many data sources. It is not obvious how it could be reduced without losing utility. The Observation format derived from the spreadsheet data seems very useful for merging various data sources.

* Usefulness (or potential usefulness) of the dataset

The dataset presented would be of great utility to anyone studying the Brazilian Amazon rainforest, though I do not know how large that community is. The cross-domain nature of the dataset allows analysis that is often not available to researchers due to constraints on data retrieval.

I think this dataset is also useful as a template of how to approach cross-domain linked data science. Some of the key features, such as the observational model and the access to statistical packages, should prove particularly useful.

* Clarity and completeness of the descriptions

The paper is clearly written and explains the key parts of the dataset. There are good examples of how the data is formatted. The paper includes appropriate references, and a good plan for the future.
