Transforming Meteorological Data into Linked Data

Paper Title: 
Transforming Meteorological Data into Linked Data
Authors: 
Ghislain Atemezing, Oscar Corcho, Daniel Garijo, José Mora, María Poveda, Pablo Rozas, Daniel Vila-Suero, and Boris Villazón-Terrazas
Abstract: 
This paper describes the process followed in order to make some of the public meteorological data from the Agencia Estatal de Meteorología (AEMET, Spanish Meteorological Office) available as Linked Data. The method followed has already been used for publishing geographical, statistical, and leisure data. The data selected for publication are generated every ten minutes by the 250 automatic stations that belong to AEMET and that are deployed across Spain. These data are available as spreadsheets in the AEMET data catalog, and contain more than twenty types of measurements per station. These spreadsheets were retrieved from the website, then processed with Python scripts, transformed to RDF according to an ontology network that reuses the W3C SSN Ontology, published in a triple store and visualized in maps with map4rdf.
Full PDF Version: 
Submission type: 
Application Report
Responsible editor: 
Krzysztof Janowicz
Decision/Status: 
Major Revision
Reviews: 

Review 1 by Simon Scheider

The revised version of this application report addresses some of the issues raised in the reviews. Basically, the authors refer to some more literature, explain the underlying ontologies in more detail, and add some more discussion of the approach.
However, I still doubt that the article is suitable for publishing in its current state. It did not convince me that simply publishing meteorological data as linked data is more than just producing "merely more data" (see the paper of Jain et al. from 2010). Or, to cite another paper by Bechhofer et al. (2010), why should "linked data be enough" for scientists? What is the "Quality, importance, and impact of the described application", as required for SWJ reports?
An application report needs to provide good answers to these questions by demonstrating the use and evaluation of the application in terms of application scenarios. This includes first of all application scenarios (which are missing), but also scalability issues, technical update cycles and assurance of data quality, and maybe also things like provenance and reusability. The authors claim that "reusing the data is easier" (p.10), but do not provide evidence for this claim. Many of the more interesting statements of the paper (e.g. "assumptions made on the properties of the data presented are very few" p.10; "Some limitations have been detected during the agile development process"; "experience suggests that the process should be fairly generalizable" p.5) are not discussed or backed up empirically. Arguments for using SSN on p. 6 ("The SSN Ontology ... is more appropriate as the ontology can be used for sensor perspective") are not really discussed in detail (why is the modelling of sensors useful in the first place?). Also, the critique of related work in Section 2 asks for an "iterative and incremental data life cycle", but the paper does not address this issue either. As an example of how to address some of these issues in an application report about linked data publishing, the authors may have a look at (http://www.semantic-web-journal.net/content/linkedgeodata-core-web-spati...) in this journal.

Review 2 by Michael Lutz

The revised manuscript is slightly improved in comparison with the previous submission. However, the goal and intended category of the paper are still not entirely clear. Unfortunately, no summary of changes was submitted with the revised manuscript, in which the authors could have explained their intentions with the paper and how they addressed each of the comments.

For a Full paper (containing original research results. (...) These submissions will be reviewed along the usual dimensions for research contributions which include originality, significance of the results, and quality of writing) the analysis of previous work and the discussion on how the presented approach can be generalised are still not sufficient.

Furthermore, not all of the detailed comments were addressed by the authors (see below). Again, the reasons for not addressing some of the comments raised by reviewers should have been explained in an accompanying "change summary" document.

As already stated in the previous review, in the revised version the authors should analyse the generic problems associated with publishing spatio-temporal (sensor) data as linked data (how to deal with space, time, phenomena etc.) and compare their proposed solutions to existing work. Also, they should focus less on the exact method they followed, but rather present the final recommendations/results and explain why these were chosen. Again, where possible, the results should be generalised.

Detailed comments:
- The authors have included a short section with some related work. They then conclude that "From all these works, none of them consider how to design the URIs of the resources coming from the sensors or the reuse of well-known available ontologies for modelling sensor networks. None of them follow either an iterative and incremental linked data life cycle in the process." I get the impression that the authors have deliberately selected those papers that do NOT address the topics they want to cover, in order to provide a rationale for the present paper. In a revised version, the "previous work" section should be more thorough and also include previous work that IS related to the presented approach (e.g. see some of the references suggested in the previous reviews) and discuss how the presented approach differs and goes beyond them.
- p2, last para: "more precisely in measurements" --> "more precisely on measurements"
- The description of the data in section 3 is still not very extensive. IMO, the whole passage from "Data from the different stations …" to "… and the value recorded." is not very relevant for the paper and could therefore be strongly condensed or deleted.
- The pattern proposed in the last 2 paragraphs of section (now) 4.3 is still not very clear. Why have you chosen composite strings rather than a hierarchical URI?
- My previous comments on table 1 have not been addressed:
* What is the difference between DateTime and Instant? In general, it would be good to introduce the ontology used (at least at a high level) before explaining the details on how to create the URIs.
* Explain the Interval pattern in the text. Why do you use "tenMinutes" in the local ID? How do you encode a "from-to" interval?
- Section 6.1, last 2 paras. Some of the sentences in this part are not proper English and hence difficult to understand. This seems to be a crucial part in the paper, so please rephrase.
- Section 6.1: It is still not quite clear how the analysis of the presented ontologies fits in the methodology presented in section 6.2, where, from the ontologies discussed in 6.1, only the SSN ontology seems to be used, while some other base ontologies are not elaborated (ssn) or not discussed at all (time, wgs84_pos) in 6.1.
- Fig. 1, p.7: "Also, a mapping between the relationships "aemet:locatedIn" and "wgs84_pos:location" has been defined." -- What is the exact relationship between the 2 properties? Where can this be found in Fig. 1?
- The authors still don't provide a discussion of the benefits of the SPARQL-based approach presented in Section 7 over other, non-linked data, approaches, e.g. using an OGC sensor observation service (http://www.opengeospatial.org/standards/sos).
- My previous comments on the SPARQL query have not been addressed:
* Why do you use a URI for the location rather than coordinates? In this context, the draft GeoSPARQL standard (http://portal.opengeospatial.org/files/?artifact_id=44722), which defines a vocabulary for representing geospatial data in RDF and an extension to the SPARQL query language for processing geospatial data, may be relevant.
* Why are all the ?dateTime statements (e.g. ?dateTime w3ctime:hour ?h) included, when the relevant filter expression is using the xsd dateTime format?
- p9: The model used for observed properties should be explained in more detail (this is one aspect where the paper could make a generalisable contribution).

Review 3 by Arne Bröring

The authors have incorporated most of my comments from the first review.

The main concern of relating the work to the area of 'Semantic Sensor Web' research has been addressed. Still, I recommend referencing also the following highly related paper: http://www.tandfonline.com/doi/abs/10.1080/17538947.2011.614698.

Also, the concern about external ontologies in Section 4.1 has been addressed.

Hence, I still suggest accepting this paper, since it nicely describes an approach for bringing a meteorological data set into the linked data cloud. The approach is generic enough to be applied to other domains too.

The reviews below are for the initial submission; the PDF is an updated version.

Review 1 by Simon Scheider

I am unsure about the concrete goals of this paper. It does not seem to be intended as a research paper, since the authors content themselves with describing the process of publishing Spanish weather service data as linked data using already existing principles and ontologies.

It may be acceptable as a "description of ontologies", but the authors do not provide a sufficiently rigorous discussion of their synthesized ontology (Section 4.2). Why exactly do they use concepts additional to the W3C SSN (semantic sensor network) ontology? E.g. "Measurements" and "Locations", even though both concepts seem already contained in the SSN ontology (the first concept may be captured by the sensor output, observation value, or property, while the second one seems to be a spatial region value of the sensor platform; see http://www.w3.org/2005/Incubator/ssn/wiki/Report_Work_on_the_SSN_ontology). This is of course not to say that SSN should not be extended, but the authors need to provide reasons for doing so. Furthermore, it remains unclear which meteorological concepts are used and how the concepts from these different ontological sources are formally connected (Fig. 1 is just a superficial sketch), i.e. by which relations from which ontologies (is "observed by" a relation invented by the authors?).
So we have the situation that the ontology design decisions as well as the ontological concepts (e.g. about properties added) and ontological commitments at the conceptual links remain totally obscure. In addition, the work presented builds largely on the SSN ontology, while the authors do not cite a single (!) SSN reference, as e.g. the paper [K. Janowicz and M. Compton: The Stimulus-Sensor-Observation Ontology Design Pattern and its Integration into the Semantic Sensor Network Ontology]. The reference list mainly cites very general linked data research, but misses almost all the research that has been done in the area of semantic sensor networks.

It may be even more acceptable as an application report. But the same requirements mentioned above apply here, too. Additionally, for an application report, some evaluation of the application quality needs to be provided, such as scalability issues, technical update cycles and data quality, and the like. These are completely missing.

To summarize: The authors need to decide first of all about the type of paper they intend to write. The rather extended discussion about URI and RDF generation is rather uninteresting, since it just applies well-known principles. It can be shortened in order to focus on the ontological aspects or the application evaluation. Then the authors should make transparent and discuss their ontological design decisions and concepts used, especially how they interlink the distinct ontologies used. Or they should provide an evaluation of the application. But in its current state, the paper needs to be rejected.

Review 2 by Arne Bröring

The paper presents the process of creating linked data from the publicly available data sets of the AEMET organization. The up-to-date meteorological data coming from 250 weather stations in Spain are currently made available as spreadsheets on AEMET's FTP servers. The authors developed an approach to access the data, generate RDF (based on an ontology designed by the authors), and publish the RDF triples. The authors explain in detail how this process was carried out, show how the linked data can be used by presenting a mapping application, and finalize their paper by describing the lessons learned and giving conclusions.
Overall, the presented approach of bringing meteorological data into the linked data cloud is nicely described and generic enough to be applied to other domains, too. Hence, I suggest accepting this paper.

Considerations to the main evaluation criteria:
- the significance of this work to this field of research is high (see my considerations above).
- the work is very relevant for the journal.
- the methodology is clearly laid out. It is aligned with previous work of the authors [2].
- the literature review seems good. Here and there, additional references could be included (see below).
- the writing style/clarity is very good. The paper is well structured and easy to read.

Minor comments:

Section 2.
For consideration of referencing, I'd like to point the authors to another paper which has dealt with the design of meaningful URIs for sensor data:

Janowicz, K., Bröring, A., Stasch, C., and Everding, T., 2010a. Towards meaningful URIs for linked sensor data. Towards Digital Earth: Search, Discover and Share Geospatial Data, Workshop at Future Internet Symposium, September 20th, 2010, Berlin, Germany, 640.
[or its successor: Janowicz, K., Bröring, A., Stasch, C., Schade, S., Everding, T., and Llaves, A. (2011; in press): A RESTful Proxy and Data Model for Linked Sensor Data. International Journal of Digital Earth. http://geog.ucsb.edu/~jano/RESTfulSOS.pdf]

Also, the above-mentioned papers address the bridging between linked (sensor) data and the idea of the Sensor Web. Other papers have dealt with this notion before as well. I think it would be valuable if the authors related their work to this context. Hence, links to those works may be included here, too:

Patni, H., Henson, C., and Sheth, A., 2010a. Linked sensor data, in: 2010 International Symposium on Collaborative Technologies and Systems, IEEE, 362–370.

Phuoc, D.L. and Hauswirth, M., 2009. Linked open data in sensor data mashups, in: Proceedings of the 2nd International Workshop on Semantic Sensor Networks (SSN09), CEUR, Vol-522, 1–16.

Page, K., De Roure, D., Martinez, K., Sadler, J., and Kit, O., 2009. Linked sensor data: Restfully serving rdf and gml, in: Proceedings of the 2nd International Workshop on Semantic Sensor Networks (SSN09), CEUR, Vol-522, 49–63.

Section 4.1.
In this section other ontologies relevant for the area of work are presented. It remains a bit unclear what the authors derive from the acquired knowledge about the related ontologies. Are they reused? Are there links introduced to point to parts of those ontologies? It should be better pointed out whether that's the case.

Review 3 by Michael Lutz

The paper describes the process of publishing meteorological sensor data as linked data. The paper is well written and covers a relevant and timely topic. However, it does not analyse or discuss how the problem at hand (publishing meteo sensor data) and the chosen approach can be generalised. Furthermore, the paper does not discuss how the proposed approach relates to other work in this area. Thus, the scientific contribution of the paper is difficult to judge.

In the revised version the authors should analyse the generic problems associated with publishing spatio-temporal (sensor) data as linked data (how to deal with space, time, phenomena etc.) and compare their proposed solutions to existing work. Also, they should focus less on the exact method they followed (e.g. in section 4.2, p6, second column), but rather present the final recommendations/results and explain why these were chosen. Again, where possible, the results should be generalised.

Detailed comments:
- p1: "Among all of the data made available there, we have focused on surface meteorological observing stations." -- It should be discussed what the specific characteristics of the selected data are and how the chosen approach might have to be adapted for other types of (sensor) data (e.g. coverages of temperature or pressure distributions, iso-lines etc.).
- Section 1 and 2. Before going into the details of the proposed solution in section 2, the source data should be described and analysed, e.g. which parameters are included, how are they represented, etc. (There is some material on this at the end of section 4.1, which should be included in such a section). The problems identified in this analysis should be generalised where possible.
- p2: URIs local IDs --> URIs' local IDs (or: the local IDs of URIs)
- p3: differentiated --> distinguished; payed --> paid
- p3: "design patterns proposed by the community" - which community do you refer to (meteo or semantic web)?
- Section 2.2, last 2 paragraphs: The proposed pattern is not very clear. Why have you chosen composite strings rather than a hierarchical URI?
- Table 1: What is the difference between DateTime and Instant? In general, it would be good to introduce the ontology used (at least at a high level) before explaining the details on how to create the URIs.
- Table 1: Explain the Interval pattern in the text. Why do you use "tenMinutes" in the local ID? How do you encode a "from-to" interval?
- p4: "convenient solution for the naive user" -- which user do you refer to?
- Section 4.1: It is not quite clear how the analysis of the presented ontologies fits in the methodology presented in section 4.2, where a number of other ontologies are presented that seem to be much more relevant for the presented work.
- p5: visibility like events --> visibility as events
- p5: the restriction to providing measurements in 3h, 6h and 24h aggregates seems quite limiting. Why is this approach chosen?
- Section 4.1: The descr
- p6: Table 2 --> Table 3
- p6: There are some spaces that should be removed between namespaces and class names, e.g. "ssn:: Platform"
- p6: "However, the AEMET ontology was not completely aligned with the SSN ontology as the meteorological observation properties, obtained from the non ontological knowledge resource DESCRIBE_VAR.csv21, were modelled mostly as attributes instead of following the SSN model." -- This is not quite clear and should be better explained, maybe using an example.
- p6, Observations Ontology: Be more precise on what the "non-ontological resources" are and how they were transformed.
- Fig. 1: A more detailed figure showing the relationships between classes (rather than ontologies) would be useful.
- Section 5: The main problem addressed here is how to query observation data. This is a more generic problem, and could be solved by other approaches than linked data, e.g. using an OGC sensor observation service (http://www.opengeospatial.org/standards/sos). Discuss how the presented approach is different and what its benefits are in comparison to others (like SOS).
- p8, SPARQL query: Why do you use a URI for the location rather than coordinates? In this context, the draft GeoSPARQL standard (http://portal.opengeospatial.org/files/?artifact_id=44722), which defines a vocabulary for representing geospatial data in RDF and an extension to the SPARQL query language for processing geospatial data, may be relevant.
- p8, SPARQL query: Why are all the ?dateTime statements (e.g. ?dateTime w3ctime:hour ?h) included, when the relevant filter expression is using the xsd dateTime format?
- Table 3: It might be useful to include a column explaining what has changed from one iteration to the next (and why).
- p9: Why not use ?obs ssn:observedProperty directly?
- p9: The model used for observed properties should be explained in more detail (this is one aspect where the paper could make a generalisable contribution).
- p9: How to query on space and time are other generic aspects that should be addressed by the paper. See also relevant publications in this area:
[1] C. Stasch, S. Schade, A. Llaves, K. Janowicz and A. Broering (2011, in press). Aggregating Linked Sensor Data. Semantic Sensor Network Workshop 2011, Bonn, Germany. Available from: http://geog.ucsb.edu/~jano/SSN2011aggragation.pdf
[2] K. Janowicz, A. Broering, C. Stasch, S. Schade, T. Everding and A. Llaves (2011). A RESTful Proxy and Data Model for Linked Sensor Data. International Journal of Digital Earth. Available from: http://geog.ucsb.edu/%7Ejano/RESTfulSOS.pdf
[3] K. Janowicz, Schade, S., Bröring, A., Keßler, C., Maué, P. and Stasch, C. (2010). Semantic Enablement for Spatial Data Infrastructures. Transactions in GIS, 14(2), pp 166-178. Available from: http://www.geovista.psu.edu/publications/2010/janowicz_etal_tgis_semanti...
- p10: The authors should mention what the overall effort was for developing the application. This could be used to back up some of the claims in the conclusions.
- p10: "Never forget to use ..." - where should this be used?
- p10: "make decimation weeks or months" -- this is unclear - rephrase!
- Conclusions: There are a number of claims here that are not backed up by evidence: "The advantages for the regular user may not be obvious, by using semantic technologies and well established standards the development time is reduced, and the products obtained are more easily maintainable and reusable."; "for example less time would be required for the development of applications based on this data (as we have already shown with the visualizer)." (this would need to be backed up at least by a statement on the overall effort - see comment above)
- p10: "This limitation ... in this paper." -- This is a really long sentence and unintelligible - rephrase.