LinkedGeoData: A Core for a Web of Spatial Open Data

Paper Title: 
LinkedGeoData: A Core for a Web of Spatial Open Data
Authors: 
Claus Stadler, Jens Lehmann, Konrad Höffner, and Sören Auer
Abstract: 
Data integration on and off the web requires comprehensive datasets and vocabularies to enable the disambiguation and alignment of information. Many of such real-life information integration and aggregation tasks are impossible without comprehensive background knowledge related to spatial features of the ways, structures and landscapes surrounding us. In this paper, we contribute to the development of a spatial Data Web by elaborating on how the collaboratively collected OpenStreetMap data can be interactively transformed and represented adhering to the RDF data model. We describe how this data is interlinked with other spatial data sets, how it can be made accessible for machines according to the Linked Data paradigm and for humans by means of several applications, including a faceted geo-browser. The spatial data, vocabularies, interlinks and some of the applications are openly available in the LinkedGeoData project.
Submission type: 
Other
Decision/Status: 
Accept
Reviews: 

Review 1 by Simon Scheider

The revised version of the paper addresses most of the review critique in an appropriate way.
One could still ask for a better motivation and embedding into recent research on spatial information integration. What makes LinkedGeoData special with respect to common approaches, like gazetteers, illustrated by example?
However, since I consider the paper rather a report on tools and systems, this lack does not seem a major burden to publication.

Review 2 by Prateek Jain

My recommendation is based on the following actions taken by the authors in response to my original comments:

Comment 1. The table explaining the conversion between the LGD and Geonames datasets is, in my opinion, extremely useful, considering the shallow ontology that Geonames provides. It would be interesting if the SPARQL endpoint for LGD could support queries over Geonames using the mappings that have been constructed. This would be useful to the community both as a way of getting access to Geonames via SPARQL and as a way around the Geonames modeling issues.

Action taken: I haven't noticed anything done by the authors with respect to this point. However, this isn't a major issue and it was a suggestion to increase the usability of the work. Hence, this is a minor point.

Comment 2. It would be quite interesting if LGD could create links to other datasets beyond owl:sameAs links. There is a brief discussion of part-of relationships creating issues with respect to mapping. Geonames provides a property "parentFeature". Perhaps a technique can be incorporated into the overall architecture that uses the parentFeature link to map part-of relationships. While it is a straightforward extension, it would make LOD richer with relations beyond owl:sameAs.

Action taken: The authors have explained in detail their views and actions with respect to linking to other datasets such as MusicBrainz. They also briefly discuss issues with respect to modeling other relationships.

Comment 3. The evaluation with manual verification of 6526 is fairly comprehensive and, in the absence of an existing benchmark, probably the best the authors could have achieved.

Action taken: None required

Comment 4. This comment is more about the overall state of the datasets present in LOD than about the paper itself. The authors have given examples of applications that use the dataset. However, the majority of these are academic research lab applications. I am eager to see LOD datasets used in applications beyond those constructed in academic labs. The only example I have seen is perhaps the use of DBpedia by Watson.

Action taken: I am happy to see a very detailed discussion and explanation of the different real-life applications that use LinkedGeoData. This is exciting for the LOD community as a whole.

Comment 5. A discussion of plans to link LGD to other LOD datasets would be worthwhile.

Action taken: Addressed as part of one of the comments above.

Review 3 by Dalia Varanka

Accept as is, some minor editorial corrections are suggested.

The reviews below address a previous version of the manuscript.

Review 1 by Simon Scheider

This paper describes a well-recognized contribution to the development of a spatial data web. It gives an overview of solutions that were developed to publish OSM data in the form of RDF, spanning from OSM-RDF mapping, ontology building, methods of data access, interlinking with Geonames and FAO data, live synchronizations, and tools built on LGD.

Although the paper is obviously not intended as a research paper (it may actually be listed as a "report on tools and systems"), it is nevertheless required that the authors refer to and discuss the relevant state of the art. And this is my main point of critique. The authors take a semantic web perspective on VGI, but fail in many parts to take existing research in GI Science into account. They take a tabula rasa approach to GeoInformation, ignoring work that could be valuable for comparison or reference. This can be seen already from the reference list: With few exceptions ([4], [8]), GI Science research does not really appear.

For a report on tools and systems with a demonstrable value, the paper may be acceptable provided the authors address the issues mentioned. So I recommend conditional accept.

These are the more specific points of critique:

1) Introduction: I also believe that LGD could be a valuable core for a spatial data web. But a claim like "many real-life information integration and aggregation tasks are, however, impossible without comprehensive background knowledge related to spatial features..." needs references. The tasks mentioned are treated in various research on location-based services and GI web services. In the conclusion, the terms "geo-data syndication" and "semantic-spatial searches" appear for the first time without explanation or reference.

2) Interlinking (6): "Only LinkedGeodata nodes are used for matching [between Geonames and LGD] as they have names as well as positions": Besides the fact that ways actually have positions that are regions, I wonder whether the authors are aware that there are more possibilities to calculate a similarity between two arbitrary spatial geometries than just the distance between two reference points. From the very beginning of GI research, complex spatial operators like point-in-polygon or topological relations (9-intersection) have been discussed. They are available in every PostGIS database. And the decision to leave more complex geometries out of consideration actually turns out to be a major problem: the large variance of the factor c on p. 9, the maximum distance by which two points describing the same object are reasonably expected to differ, is of course largely influenced by the complex geometry hidden underneath. For example, if we match Germany in both databases, then the DBpedia point may be located far away from a centroid of the respective OSM polygon, e.g. in Berlin, while a point-in-polygon test may nevertheless be able to correctly infer similarity. The same holds for matching roads. Since I can't see any arguable reason for this decision, it seems rather an ad hoc approach. This might be acceptable if the authors had referred to existing work for remedy. There are numerous papers from the last 10 years on matching gazetteer footprints, starting with Linda Hill: "Core Elements of Digital Gazetteers: Placenames, Categories, and Footprints", or Wu, Winter: "Inferring Relevant Gazetteer Instances to a Placename", or Janowicz, K. and Keßler, C. (2008): "The Role of Ontology in Improving Gazetteer Interaction". There is also work on combining spatial and thematic similarity measures that could be cited, e.g. Janowicz, K., Wilkes, M., and Lutz, M. (2008): "Similarity-based Information Retrieval and its Role within Spatial Data Infrastructure".
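
To make the Germany example concrete, the following minimal Python sketch contrasts a distance-to-centroid check with a point-in-polygon (ray-casting) test. It is an illustration only, not the authors' matching procedure: the rectangular "country" outline, the 100 km threshold, and all function names are hypothetical assumptions, and the centroid is a crude vertex average.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance between two lon/lat points, in kilometres."""
    lon1, lat1, lon2, lat2 = map(radians, (lon1, lat1, lon2, lat2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


def vertex_centroid(ring):
    """Crude centroid: the average of the ring's vertices (an assumption)."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    return sum(xs) / len(xs), sum(ys) / len(ys)


def point_in_polygon(lon, lat, ring):
    """Ray casting: count how many edges a ray going east from the point crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > lat) != (y2 > lat):  # edge spans the point's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


# Hypothetical, heavily simplified outline of a country (lon, lat vertices).
country_ring = [(6.0, 47.0), (15.0, 47.0), (15.0, 55.0), (6.0, 55.0)]
# Hypothetical reference point placed at the capital rather than the centre.
reference_point = (13.4, 52.5)

c_lon, c_lat = vertex_centroid(country_ring)
distance = haversine_km(reference_point[0], reference_point[1], c_lon, c_lat)

max_distance_km = 100.0  # hypothetical threshold playing the role of c
matches_by_distance = distance <= max_distance_km
matches_by_containment = point_in_polygon(*reference_point, country_ring)
```

Under these assumptions the distance check rejects the pair while the containment test accepts it, which is the effect described above. In a PostGIS setting the same test would normally be delegated to the database's spatial predicates rather than hand-written.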

3) LGD Browser (9.1) and spatial query optimization: I do not understand the sentence "This is due to the fact that the database can only use either the longitude or latitude index". A spatial database like PostGIS or Oracle Spatial is able to handle any form of spatial index. And what is the authors' reason for choosing a "quadtile" index (a name obviously invented by the OSM community)? I can't see any difference from the well-known "quadtree index" (why then use a name not common in science?). Furthermore, how do they know that an R-tree is not better suited? There is also extensive research on spatial indices that may be cited; have a look at H. Samet: "Foundations of Multidimensional and Metric Data Structures".
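
For readers unfamiliar with the term, the sketch below shows one common way a quadtree cell can be linearized into a single integer key by interleaving the bits of the x and y cell coordinates. It is an assumption-laden illustration of the general idea, not the layout actually used by OSM or LGD: the 16-bit resolution, the scaling, and the function names are all hypothetical.

```python
def quadtile_key(lon, lat, bits=16):
    """Interleave the bits of the scaled lon/lat into one integer key.

    Each pair of key bits (x bit, then y bit) selects one of four quadrants,
    so the key is a root-to-leaf path through a quadtree.
    """
    scale = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * scale)
    y = int((lat + 90.0) / 180.0 * scale)
    key = 0
    for i in range(bits - 1, -1, -1):  # most significant bit first
        key = (key << 1) | ((x >> i) & 1)
        key = (key << 1) | ((y >> i) & 1)
    return key


def tile_prefix(key, level, bits=16):
    """Identifier of the quadtree cell that contains `key` at depth `level`."""
    return key >> (2 * (bits - level))


def tile_key_range(prefix, level, bits=16):
    """All keys inside one cell form a single contiguous integer range."""
    shift = 2 * (bits - level)
    return prefix << shift, ((prefix + 1) << shift) - 1


# Two nearby points and one distant point (illustrative coordinates).
near_a = quadtile_key(12.37, 51.34)
near_b = quadtile_key(12.38, 51.35)
far = quadtile_key(-74.0, 40.7)

cell = tile_prefix(near_a, 6)
low, high = tile_key_range(cell, 6)
```

Because every cell maps to one contiguous key range, a lookup for a cell can be answered with a range scan over a single ordinary index column. Whether this outperforms an R-tree for the browser's workload is exactly the open question raised above; the sketch does not settle it.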

Some minor suggestions:
- Figure numbers 2 and 5-9 seem wrong, since they do not match the text.
- There are several grammatical errors, e.g. "all to points, as every point may at some point be connected to way" on p. 13.

Review 2 by Prateek Jain

The work presents a description of the LinkedGeoData dataset and the methodology employed for its creation. LinkedGeoData is a geographical dataset constructed by converting data from OpenStreetMap (OSM) to RDF. The paper describes the methodology for the creation of these datasets and the applications that have been constructed using them. The dataset is extremely useful for the Semantic Web community, and the efforts put in to create the datasets are laudable. Due to the nature of the work and the details presented, the work has been evaluated as an Ontology Paper, as specified in the call for papers provided at http://www.semantic-web-journal.net/reviewers. My comments with respect to each of the criteria for ontology papers are as follows:

(a) Quality and relevance of the described ontology (convincing evidence must be provided): The paper describes in detail the applications that have been built using the ontology, so it definitely provides enough evidence and details. Besides, the dataset is one of the major and prominent datasets about geographical information available on LOD. The authors seem to have followed a sound technique for the construction of the ontology and, having personally used it, I can vouch for the quality of the ontology as well.

(b) Illustration, clarity and readability of the describing paper, which shall convey to the reader the key aspects of the described ontology: The paper is very well written and describes in detail the key aspects and issues with respect to the construction of the dataset.

I have a few comments with respect to the paper.

Comments

1. The table explaining the conversion between the LGD and Geonames datasets is, in my opinion, extremely useful, considering the shallow ontology that Geonames provides. It would be interesting if the SPARQL endpoint for LGD could support queries over Geonames using the mappings that have been constructed. This would be useful to the community both as a way of getting access to Geonames via SPARQL and as a way around the Geonames modeling issues.
2. It would be quite interesting if LGD could create links to other datasets beyond owl:sameAs links. There is a brief discussion of part-of relationships creating issues with respect to mapping. Geonames provides a property "parentFeature". Perhaps a technique can be incorporated into the overall architecture that uses the parentFeature link to map part-of relationships. While it is a straightforward extension, it would make LOD richer with relations beyond owl:sameAs.
3. The evaluation with manual verification of 6526 is fairly comprehensive and, in the absence of an existing benchmark, probably the best the authors could have achieved.
4. This comment is more about the overall state of the datasets present in LOD than about the paper itself. The authors have given examples of applications that use the dataset. However, the majority of these are academic research lab applications. I am eager to see LOD datasets used in applications beyond those constructed in academic labs. The only example I have seen is perhaps the use of DBpedia by Watson.
5. A discussion of plans to link LGD to other LOD datasets would be worthwhile.
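
The extension suggested in comment 2 could look roughly like the following Python sketch, which carries Geonames parentFeature links over to LGD resources via existing owl:sameAs links. It is only a sketch under stated assumptions: the identifiers, the "ex:partOf" property name, and the function name are hypothetical, and triples are plain tuples rather than entries in an RDF store.

```python
def derive_part_of(same_as, parent_feature, part_of="ex:partOf"):
    """Derive part-of links between LGD resources.

    same_as:        iterable of (lgd_resource, geonames_resource) pairs
    parent_feature: iterable of (geonames_child, geonames_parent) pairs
    part_of:        hypothetical property name used for the derived links
    """
    # Index LGD resources by the Geonames resource they are linked to.
    gn_to_lgd = {}
    for lgd, gn in same_as:
        gn_to_lgd.setdefault(gn, set()).add(lgd)

    derived = set()
    for child, parent in parent_feature:
        # A link is derived only if both ends have an LGD counterpart.
        for lgd_child in gn_to_lgd.get(child, ()):
            for lgd_parent in gn_to_lgd.get(parent, ()):
                derived.add((lgd_child, part_of, lgd_parent))
    return derived


# Hypothetical identifiers, for illustration only.
same_as_links = [
    ("lgd:city1", "gn:city1"),
    ("lgd:country1", "gn:country1"),
]
parent_links = [
    ("gn:city1", "gn:country1"),
    ("gn:village9", "gn:country1"),  # no LGD counterpart, so nothing is derived
]

part_of_links = derive_part_of(same_as_links, parent_links)
```

The quality of the derived links depends entirely on the quality of the underlying owl:sameAs links, so any matching error would propagate into the part-of relations.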

Minor Comments

1. On page 3, "as shown in Figure 2" → "as shown in Figure 1".
2. On page 14, "omit approximately 20mio triples" → (probably) "20 million triples".

Overall, the work is a good description of a geographical dataset, the methodology, and the applications built using it. The dataset is already a valuable contribution to the community. The related paper will provide further benefits to developers and researchers in the community.

Review 3 by Dalia Varanka

The paper reports on extensive and advanced work creating the linkages between OpenStreetMap and the Semantic Web. A full life cycle of multiple steps of the project is explained in detail. The paper is richly detailed, with adequate supporting documentation for specialist issues such as multilingual data and spatial dimensions. The application solutions are well respected, highly interesting, and internationally valuable. The paper falls within the scope of the journal as described on the home page.
The only weakness of the paper is that the authors do not articulate a research framework or context for the project. For example, research issues of a broad scope are not identified or discussed and the work focuses narrowly on solutions to the specific application in question. Some evidence for this is that a section devoted to related or similar projects appears at the end. Other work is written descriptively and without analysis, thereby failing to draw the work of these authors into very much context. This special issue of the Semantic Web Journal, however, welcomes papers describing highly applied work.
I recommend accepting the paper with some revision: editing the paper for fewer technical details (the paper reads a bit like a technical user's guide) and expanding the discussion of the implications of the solution relative to the broader state of the Semantic Web and its current research topics.


Comments

Sorry for the late requested review.

This paper describes developments in the OpenStreetMap (OSM) project that enhance its place in the semantic open data environment. OSM is the poster child for volunteered geographic information, a community initiative to map the globe. It represents one of the richest geospatial data sources, and its position in the open linked data environment is very significant, as it essentially provides a free geospatial layer that is globally available. The enhancements include the addition of semantic formats and interfaces to the data, such as an RDF encoding, a lightweight ontology, SPARQL endpoints for static and live data, and a RESTful interface, as well as links to related data such as GeoNames and DBpedia, and tools such as an enhanced browser. The significance of the work, as well as the technical content, makes this paper very suitable for consideration by SWJ.

The paper begins with an introduction to OSM and its enhancements, then proceeds to describe each enhancement in detail, including performance statistics for some of the new functions. Some tools developed by the authors, and others, are then described; these operate over the data and make use of the supplied formats and interfaces. The related work compares the OSM dataset to others, and discusses database-to-RDF mapping efforts as well as efforts to map RDF instances with minor ontology support. The paper concludes with some scalability problems that need solving.

Because the application is significant and interesting, and the technical discussion is informed, the paper is worthy of publication with some moderate to major changes, to address the following issues:

1. Language: the manuscript needs to polish its use of English, to correct some small grammar issues and some awkward constructions. For example, “allows to” should be replaced with “enable…” or “allows one to”. I have attached an annotated manuscript with comments and suggestions, and recommend the paper is thoroughly proofed by a native speaker.

2. Context: the paper does not provide enough context, explanation, or references, for many items. It is as if the paper strives to be comprehensive and therefore cannot afford the space to explain each item, even if the items are secondary, or it assumes the reader has very high familiarity with some of these items. Some short but clear explanation around any newly introduced thing would help greatly. See the annotated ms for examples.

3. Structure: as written, the paper reads more like an engineering report, and not enough like a scientific paper. The authors do a reasonable job of describing the enhancements, but provide little motivation for these enhancements. Thus the original contribution of the paper gets diluted. We learn about new access to important data, but not why this is significant and how it is new with respect to the previous version, and not enough about the differences with other related systems. The paper should at least identify the gaps in the previous version, as well as the gaps in other systems, why it is important to overcome these gaps, and make clear the benefits that would ensue from doing so. One or more use-cases that illustrate the gaps and that highlight the need for the new functions would go a long way here. The paper might then describe the new functions in terms of overcoming gaps (and use-cases), which will make it read less like a catalog of new developments. An implementation section could show how the data and tools are used to accomplish this—this is missing, or wrapped into the description.

4. Some Figure and Table numbers do not correspond between the text and caption (e.g. Fig 5-6, but also others). Please check.

In summary, this paper deserves to be published, but first needs to polish its language usage, improve clarity in places, and most importantly clarify the original contribution through better motivation and some re-structuring.