Global Weather Sensor Dataset

Tracking #: 877-2087

Authors: 
Hoan Nguyen
Hung Ngo
Manfred Hauswirth
Danh Le Phuoc

Responsible editor: 
Pascal Hitzler

Submission type: 
Dataset Description
Abstract: 
The National Oceanic and Atmospheric Administration has recently published their database to address a pressing need for an integrated global database of hourly land surface climatological data. The database is quickly becoming one of the biggest weather datasets, containing approximately 350 gigabytes for over 20,000 weather sensor stations around the world, with observation data from as early as 1900 to the present. In this article, we will describe how the NOAA dataset can be transformed and published as linked data (Linked NOAA dataset with over 177 billion triples), with the target of making these data publicly accessible and linkable with other linked data sources, by applying Linked Data principles, accessing and processing methods.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Thomas Narock submitted on 18/Nov/2014
Suggestion:
Minor Revision
Review Comment:

Overall, this is a well written paper describing a potentially very useful Linked Dataset. In its raw form, the NOAA global weather dataset is already an important resource for climatological studies. The authors take an important, and necessary, step in bringing this dataset into the realm of Linked Data and semantic e-science. The paper is clearly written and provides a timely contribution to the literature. I do, however, have a few comments regarding the completeness of the paper and the quantity of links to other datasets.

The choice of modeling - in particular the reuse of SSN ontology - is well grounded. Yet, I do question the authors' comments (section 2.4) that there is a lack of standard ontology for describing observed values and measurements. There is an ISO international standard (19156) addressing observations and measurements for geographical information. This standard has been published as a formal ontology - see for example:
http://ceur-ws.org/Vol-1063/paper1.pdf It would make a valuable contribution to the paper if the authors acknowledged this standard and mentioned if it is applicable, and if not, why.

In terms of connections to other datasets, I feel there is one dataset in particular that is not mentioned that could significantly enhance the utility of this NOAA dataset. There is ongoing work to convert the U.S. Global Change Information System into Linked Data. See:
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6558476
http://www.sciencedirect.com/science/article/pii/S1364815214002254
http://data.globalchange.gov/
The climate change information and provenance would be a valuable resource to link to in the NOAA dataset.

The authors note that their implementation relies on the EAGLE system. There does not appear to be a reference for EAGLE. Can the authors provide a reference and modest background material?

In the Shortcomings section the authors mention limitations in system performance and an inability to handle queries that are "too complicated". There is a mention of hardware limitations; however, few details are given. I feel the reader would benefit greatly if this section were expanded. Do the limitations in system performance refer to query execution time? If so, what might a typical user experience in terms of performance? Can "too complicated" be quantified as a certain subset of SPARQL features, or in some way tied to the logic of the ontology? What can a user not query on?

Finally, given the continual updating of the NOAA dataset, do the authors intend to utilize the W3C Provenance Ontology?

Grammatical Errors
- in the Introduction - "effects to natural live" should this be "effects to natural life"?
- in the Introduction - "NOAA roles a supplier" should be "NOAA's role as a supplier"

Review #2
By Raúl García-Castro submitted on 25/Nov/2014
Suggestion:
Reject
Review Comment:

The paper describes the Linked NOAA dataset, a dataset that publishes weather data from the National Oceanic and Atmospheric Administration (NOAA).

The presented ontology contains few new classes and properties and the paper just enumerates the main classes reused. Therefore, in terms of the ontology there is not a big contribution.

There are no details on how the ontology has been developed. For example, how the ontology has been modelled from the schema of the original data source, whether it covers the original schema, how optional values are being modelled in the ontology, etc.

In the introduction one of the problems highlighted from the original data is that their use requires significant knowledge. How is this problem solved with the ontology?

The ontology is not available online.

The URI naming scheme used is not valid. Sensors are identified by a URI that embeds a hash of the sensor's location. As a result, if the sensor moves, the hash changes and, therefore, so does the URI.

Furthermore, there is also a problem with observation URIs. The URI naming scheme will not work if there are two sensors of the same type in the same station.
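
The sensor-URI concern can be illustrated with a minimal Python sketch. It assumes that the location hash is a standard geohash and that sensor URIs follow the pattern of the one example URI quoted later in this review (hash, sensor type, "noaa", station number); neither assumption is confirmed by the paper. The coordinates are arbitrary illustrative values, not the real position of any station, and `geohash_encode` and `sensor_uri` are hypothetical helper names.

```python
# Illustrative sketch only. The URI template below is an assumption inferred
# from a single example URI; it is not taken from the authors' implementation.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lon, precision=8):
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits are interleaved, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid  # keep the upper half of the interval
        else:
            bits = bits << 1
            rng[1] = mid  # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def sensor_uri(lat, lon, sensor_type, station_id):
    """Build a sensor URI using the assumed <hash>_<type>_noaa_<station> pattern."""
    return ("http://graphofthings.org/resource/sensor/"
            f"{geohash_encode(lat, lon)}_{sensor_type}_noaa_{station_id}")


# The same (hypothetical) sensor before and after being moved 0.1 degrees north.
before = sensor_uri(57.64911, 10.40744, "WindSensor", "10010")
after = sensor_uri(57.74911, 10.40744, "WindSensor", "10010")
print(before)
print(after)
print("same URI after move:", before == after)
```

Under these assumptions, relocating the sensor yields a different identifier for what is logically the same resource. A location-independent key, such as the station and sensor identifiers alone, would avoid this; that is a possible design direction rather than something the paper proposes.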

There are no details about the resulting data. For example, are the data in RDF the same as in the original data source? Is there any addition or removal?

The authors should provide a complete example of how the data presented in figure 6 has been transformed into RDF.

One challenging aspect of the dataset is its size and dynamicity. However, in section 3 the authors just present the tool that has been used for storing and publishing the dataset (EAGLE), but not how the problem has been tackled and solved.

Right now the dataset is updated hourly, but it takes from 1 to 3 hours to be generated. What are the reasons for such variability in processing time? What will happen when the data size increases?

The maturity of the presented dataset is not clear. Apart from the previous comments there are other issues described next.

The URI used in section 2.3 does not contain any resource (http://graphofthings.org/resource/sensor/gu9gdbby_WindSensor_noaa_10010), gives a 404. Trying other URLs from the ontology and the data also gives errors.

In section 2 it is mentioned that the URL http://noaa.graphofthings.org/ contains dereferenceable information about the vocabulary used and metadata about the vocabulary. That URL just points to a web page and those explicit entities are not found there.

The paper does not state the license of the NOAA data; besides, it doesn't state the license of the generated data (in table 1 the licensing information is missing). A proper analysis of the licenses and how they are presented in the RDF data is needed.

It is not clear whether the authors have published metadata about the dataset and whether these metadata appear in registries so people can discover the dataset.

In the paper the terms "weather" and "climate" are used as synonyms, but they are different terms. The authors should review the use of these terms in the paper.

Some things in the paper could be presented earlier to facilitate its understanding, such as the format and structure of the original data.

In the conclusions the authors mention that "This application is public at http://noaa. graphofthings.org/sparql/", but that URL is not valid.

The writing of the paper must be reviewed.

Check the use of capital letters in references.

Review #3
By Adam Shepherd submitted on 26/Nov/2014
Suggestion:
Minor Revision
Review Comment:

The dataset described in this paper, through its application of the Semantic Sensor Network ontology, has great value to a geosciences community that is in need of improved data discovery and access for accomplishing next generation science. The SSN ontology, with its maturity and sound engineering, provides the level of granularity necessary for this dataset to be useful by other linked datasets that vary in scope and perspective.
From the ocean sciences community, I can identify a number of linked datasets that could find value in linking to this dataset across concepts of geographic location, sensor, and measurement. Notably, the SAMOS initiative (http://samos.coaps.fsu.edu/html/) at Florida State University and their efforts to expose meteorological linked data collected aboard the U.S. academic fleet of research vessels (http://meetingorganizer.copernicus.org/EGU2014/EGU2014-11865.pdf). Also, the Rolling Deck to Repository (R2R) initiative (http://www.rvdata.us/) with its linked data about underway shipboard data (http://data.rvdata.us/). Because of the flexibility of the SSN ontology, and the well-implemented NOAA data described here, the interlinking between SAMOS and R2R to this NOAA dataset should be fairly straightforward while providing enormous value to the oceanographic science community. This dataset alludes to the generative value the ocean science community has hoped would be a derivative of the linked data effort.
It was excellent to see that http://noaa.graphofthings.org/ resolved to something in a browser, and it was also great to see a working SPARQL endpoint. I was hoping to discover some triples using the ssn:observationResult predicate, but maybe my query wasn’t correct. I also expected to see more predicates available through a query like: “SELECT DISTINCT ?p WHERE { ?s ?p ?o }”. A helpful improvement to the paper would be to discuss how to access the data that is returned from executing the SPARQL queries in Section 4, Examples 2 and 3. After executing those queries I tried finding the data that generated those results, but maybe they are fueled by Blur? Also, there are a handful of spelling errors and grammatical mistakes throughout the paper.
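
For readers who want to repeat the predicate check, the following is a minimal Python sketch of how such a query could be sent and its results read. It relies only on the standard SPARQL 1.1 Protocol (a GET request with a `query` parameter) and the standard SPARQL JSON results layout. The endpoint URL is the one quoted from the paper and may not resolve; the `LIMIT 100` clause, the helper names, and the sample response are additions for illustration, not output from the actual endpoint.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint as quoted from the paper; availability is not guaranteed.
ENDPOINT = "http://noaa.graphofthings.org/sparql/"

# The query from the review, with a LIMIT added here (an assumption) to keep
# the request cheap on a very large store.
PREDICATES_QUERY = "SELECT DISTINCT ?p WHERE { ?s ?p ?o } LIMIT 100"


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


def predicates(results_json):
    """Extract the values bound to ?p from a SPARQL JSON results document."""
    doc = json.loads(results_json)
    return [row["p"]["value"] for row in doc["results"]["bindings"] if "p" in row]


request = build_sparql_request(PREDICATES_QUERY)
print(request.full_url)

# Hand-written placeholder response in the standard results format; it is NOT
# real output from the endpoint.
sample = json.dumps({
    "head": {"vars": ["p"]},
    "results": {"bindings": [
        {"p": {"type": "uri",
               "value": "http://purl.oclc.org/NET/ssnx/ssn#observationResult"}},
    ]},
})
print(predicates(sample))
```

To run the query for real, the request object can be passed to `urllib.request.urlopen` and the response body handed to `predicates`; whether the endpoint answers, and with which predicates, has to be checked against the live service.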
Overall, this is excellent work.

Review #4
Anonymous submitted on 08/Dec/2014
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Data Description' and should be reviewed along the following dimensions: (1) Quality of the dataset. (2) Usefulness (or potential usefulness) of the dataset. (3) Clarity and completeness of the descriptions.

-The SPARQL endpoint was not available when I tried it (404 error).

-The paper describes an important dataset that is a great new resource for the linked data cloud, NOAA-published weather data, of considerable size (177 billion triples) and frequent (hourly-ish) update. It will be really interesting to see what kind of applications are written for this. However, the authors make it clear that the size and update frequency are creating performance problems, and I hope that this publication will encourage the development of solutions for this case. Aside from this consideration, I think the dataset will be very useful.

- The authors missed several related work papers that need to be addressed: http://knoesis.org/ssn2014/paper_5.pdf in ssn2014, http://ceur-ws.org/Vol-904/paper10.pdf in ssn2012 and http://www.semantic-web-journal.net/sites/default/files/swj281_0.pdf in this journal 2011.

- There are many readability problems that need improvement, far too many to record here.

- The process of conversion from the original form is well described and might help others with similar problems.

- page 2, 4 stars: you have misinterpreted the second star, as it refers to the ontology, not this instance data. Nearly all of your ontology is reused, but this star should apply to your bit (e.g. got:StationarySensor).

- Many ontologies are reused, which is fine, but they are not all sufficiently well identified by the prefix you use, their purpose, the paper description/citation, and the source file. I suggest you put this in a simple table for all of them.

- figure 1: you have a system with a hasSubSystem that is an instance of an aws:TemperatureSensor that is (presumably) a Sensor. Similarly for your WindSensor. You should be using a SensingDevice instead of a Sensor here, so that it is also a System and the hasSubSystem property is used as intended, between Systems.

- you need a citation for the geohash technique

-fig 4 your use of MeasurementCapability is wrong; however your air_temperature is properly an instance of a ssn:Property.

-fig 4 you should probably define your own class(es) of ssn:SensorOutput with the right units, so you don't need to repeat them for each instance of an observationResult. The same goes for your Observations: create a subclass for the specific sensor, observedProperty and SensorOutput subclass, so that you don't need to repeat this for every observationResult and ObservationResultTime. This does require either the use of a reasoner to get the RDF assertions you would want, or a smart client to figure it out, and maybe this is impossible at your scale? Please consider!

-sec 4 ; what is blur?

- sec 4 query example: misleading variable name. You attached haslocation to systems in fig 1, not to sensors.
For each of your queries, please describe their purpose (could be in caption text).

- "ASCII" surely UTF-8?
- references need formatting improvements.