A Foundation for Spatial Data Warehouses on the Semantic Web

Tracking #: 1412-2624

Authors: 
Nurefsan Gur
Torben Bach Pedersen
Esteban Zimanyi
Katja Hose

Responsible editor: 
Mark Gahegan

Submission type: 
Full Paper
Abstract: 
Large volumes of geospatial data is being published on the Semantic Web (SW), yielding a need for advanced analysis of such data. However, existing SW technologies only support advanced analytical concepts such as multidimensional (MD) data warehouses and Online Analytical Processing (OLAP) over non-spatial SW data. To remedy this need, this paper presents the QB4SOLAP vocabulary which supports spatially enhanced MD data cubes over RDF data. The paper also defines a number of Spatial OLAP (SOLAP) operators over QB4SOLAP cubes and provides algorithms for generating spatially extended SPARQL queries from the SOLAP operators. The proposals are validated by applying them to a realistic use case.
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Grant McKenzie submitted on 11/Sep/2016
Suggestion:
Minor Revision
Review Comment:

This submission entitled “A foundation for spatial data warehouses on the semantic web” presents a vocabulary for accessing and analyzing RDF data stored in multi-dimensional data cubes. The authors introduce QB4SOLAP which augments existing QB4OLAP through the inclusion of “S,” spatial operators for OLAP.

Overall the paper is well written and organizationally sound. The structure of the submission is straightforward to follow, which is appreciated, given the complexity of some of the content. In many ways, this document reads like a technical white paper with quite a few examples outlining the value and purpose of such an approach. It should be noted that some of this work is also contained in a previous conference publication (as mentioned in the introduction). However, the content presented in this submission is adequately novel, including formal semantics for the spatial operations.

A number of questions, concerns and recommendations are outlined below.

1. In Definition 1 (Section 2.2), a number of spatial aggregation methods are presented, one of which is “Buffer.” Buffering on its own is not a spatial aggregation method but simply a reclassification based on a spatial distance. Buffering does not “combine two or more spatial objects” as the text in this section states.

2. The use of “gnw:customer” in the location level specification is somewhat misleading (Example 4). Perhaps I misread this example, but it would appear that this is not the actual customer, but instead the customer location or customer level? As the customer is of type LevelProperty then this is referring to the level of customer rather than the specific customer instance? Perhaps the term gnw:customerLevel would be more appropriate? Some clarification would be useful.

3. Example 9 shows an example of how the spatial levels work from an “instance” perspective. In the example, it is shown that “customer_1” has a customerGeo which is presented in WKT. Given that there could possibly be multiple customers at the same location, would it not be better to reference a spatial object that has the WKT geometry as a property? This would remove the redundancy of storing (possibly very complex geometry) multiple times in the dataset.

E.g.

gnw:customer_1 gnw:customerGeo geo:geom12345 .

geo:geom12345 geo:hasGeometry “POINT(13.099...” .

The same could be said for all geometries (e.g. city instances). This would significantly speed up spatial queries as well since (for example) multiple customers could reference the same geometry and that geometry would only need to be intersected with a query polygon once. Just a thought.
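To illustrate the point about shared geometries, here is a minimal Python sketch. It is not taken from the paper or from QB4SOLAP: the identifiers (`geom_a`, `geom_b`, `customers_in_box`), the coordinates, and the bounding-box query are all hypothetical. It only shows that when customers reference a geometry by identifier, the spatial test runs once per distinct geometry rather than once per customer.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)

# Hypothetical data: each geometry is stored once and referenced by id.
geometries: Dict[str, Point] = {
    "geom_a": (13.1, 52.4),
    "geom_b": (2.35, 48.85),
}

# Two customers share geom_a, so its coordinates are not duplicated.
customer_geom: Dict[str, str] = {
    "customer_1": "geom_a",
    "customer_2": "geom_a",
    "customer_3": "geom_b",
}


def point_in_box(p: Point, box: Box) -> bool:
    """Return True if point p lies inside the axis-aligned box (inclusive)."""
    x, y = p
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def customers_in_box(
    customer_geom: Dict[str, str],
    geometries: Dict[str, Point],
    box: Box,
) -> Tuple[List[str], int]:
    """Return matching customers and the number of spatial tests performed."""
    tests = 0
    hit: Dict[str, bool] = {}
    # Test each distinct geometry exactly once.
    for geom_id in set(customer_geom.values()):
        hit[geom_id] = point_in_box(geometries[geom_id], box)
        tests += 1
    # Map the per-geometry result back to every customer that references it.
    matches = sorted(c for c, g in customer_geom.items() if hit[g])
    return matches, tests


query_box: Box = (13.0, 52.0, 14.0, 53.0)
matches, tests = customers_in_box(customer_geom, geometries, query_box)
print(matches, tests)
```

With three customers and two distinct geometries, the sketch performs two spatial tests instead of three. The saving would grow with the number of customers per location and with the cost of the geometry test.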

4. In Example 13, two methods for executing a “spatial dice” are presented. While it is great to show that there are multiple approaches to how one could execute such a query, it might also be useful to state which would be the most efficient given the sample dataset.

5. As stated above, in general some idea of the complexity or time required to execute these types of queries on a sample dataset (e.g., GeoNorthWind) would be useful. I fully realize that the purpose of the paper is to present and formalize this new approach, but it may be useful to the reader to offer some numbers/stats on some of these spatial queries.

Minor note: There are a number of small spelling and grammatical errors in this paper. E.g. Section 3, GeoNorthwind is misspelled and “data is” in the abstract. Please make every effort to fix these issues.

Review #2
By Benjamin Adams submitted on 19/Oct/2016
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This paper presents QB4SOLAP, a new vocabulary for describing spatial data cubes using RDF, and spatial OLAP operators that can be translated into spatial SPARQL queries. It is a substantial re-working of a previously published conference paper with much new material. The authors present a very thorough set of formal definitions for the elements of the QB4SOLAP vocabulary built on the previously defined QB4OLAP vocabulary, and SOLAP operators over these elements. The SPARQL generator functions are well-defined as well; a link to a working service that performs these functions would be a nice addition to this work, but is not currently included.

Overall, the formalisms appear sound and are clearly written, but my main critique of the paper is that it remains unclear to me what the key advantages of translating SOLAP operations to SPARQL are. To me this is a key missing component of the paper. Some kind of evaluation is required that demonstrates that using QB4SOLAP is a viable alternative to existing SOLAP technologies, either in terms of efficiency of processing or in terms of expressivity. If it is the case that you can just reproduce the kinds of operations that can already be done in a traditional SOLAP tool that is built using spatial indexing, then it needs to be shown that it is also efficient to do so with spatial SPARQL queries over RDF. Otherwise, some examples of the kind of reasoning that is only possible with a QB4SOLAP-defined MD data cube are in order. Without these, it raises the question of why someone would go to the trouble of encoding their spatial data cube in RDF in the first place.

With regard to the organization of the paper, putting the related work section at the end instead of after the introduction, and including almost no references in the introduction, gives the impression that the Spatial OLAP operations being defined are new to this work. In fact, SOLAP has been around for over 15 years with many of these operators defined previously, in some cases quite formally. The authors need to differentiate their work from this previous work more clearly. Many key references for SOLAP are missing in this regard, including:

Sonia Rivest, Yvan Bédard and Pierre Marchand "Toward Better Support for Spatial Decision Making: Defining the Characteristics of Spatial On-Line Analytical Processing (SOLAP)"

Joel da Silva, Valéria C. Times, Ana Carolina Salgado "A Set of Aggregation Functions for Spatial Measures"

Leticia Gómez, Sophie Haesevoets, Bart Kuijpers, Alejandro A. Vaisman "Spatial aggregation: Data model and implementation"

Ines Fernando Vega Lopez, Richard T. Snodgrass, and Bongki Moon "Spatiotemporal Aggregate Computation: A Survey"

Overall, this is a nicely written paper but as stated above, I believe it requires some more justification for why this work will have impact on users of spatial data warehouses.

Minor comments:

A summary table listing all the main definitions (and their number), operators, and generators would aid the readability of the paper.

pg.1 "This does not allows" -> "This does not allow"

pg.2 "Third, we provide algorithms for generation spatially" -> "Third, we provide algorithms for generation of spatially"

pg.2 "This paper extends a previous conference paper very significantly..." -- Not sure this sentence needs to be in the paper, seems more like a comment to the editor.

pg.2 I take issue with the classification of Intersection, Union operators as "Spatial Aggregation" operators, although this appears to be usage in other papers as well. These are not aggregations in a spatial sense, rather in the sense of record aggregation.

pg.2-3 Need citations to RCC8 and DE-9IM

pg.18 delete box character at end of paragraph in column 2.

Review #3
By Kristin Stock submitted on 21/Nov/2016
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The paper presents an extension of QB4OLAP to include spatial operators in order to represent spatial data cubes, a set of spatial operators to query data using the representation, and a set of algorithms to implement the spatial operators using SPARQL. The paper combines semantic web (through RDF and SPARQL) with OLAP/data cubes in a spatial context.

The paper is well written, clear and well structured. The definitions and formalisations are thorough.

I have two main comments. Firstly, the example is fairly limited. It uses a very small case study with only a few data items, so the work is only validated in a very limited context. I think it would be better to present a more realistic case study (ideally with real world data) to validate the work. The methods are very thoroughly presented, but given that this is proposed as a tool for dealing with complex, potentially large volumes of spatial data (of the kind that would use data cubes), the validation should reflect that scale. It is also implied on p. 25, among criticisms of previous approaches, that this work might be suitable for RDF data that is frequently updated. However, it is not shown that this method would work well for that kind of data.
Secondly, I think the benefits of this approach need to be demonstrated more clearly. Certainly it includes Semantic Web technologies, which can be used for various purposes, but I think the benefits of doing that need to be demonstrated more clearly. A more extensive case study might do this, but it could also be explained more clearly. You are bringing together several different standards/approaches, but I think the benefit of this needs to be made clearer. What does using RDF and SPARQL give you that you didn’t already have?

Some other minor comments:

I’m not sure that it is appropriate to class intersection, convex hull, MBR as spatial aggregation operators. Their primary purpose is not aggregation, intersection being the difference and convex hull and MBR being modified/summary versions of the source geometry.
p5. refers to Capitalised terms (in Figure 5), but this is slightly misleading. Capitalised implies the entire word is capitalised. Better to use the term upper camel case and lower camel case, or leading capital perhaps.
The example in the top right column of page 7 shows four uses of dimensionProperty, one of which has a leading capital, and three of which don’t. Should they all have a leading capital?

There are minor typographic errors throughout.

I am slightly confused by the notion that a fact is spatial if it relates several levels, where two or more are spatial. Does this mean that a hierarchy could consist of several levels, some of which are spatial and some not? How would this work? If one level of the hierarchy is spatial, are they not all spatial, since otherwise they would not be part of the same hierarchy? Are you alluding to the case where you have a hierarchy of city, county, country, continent (for example), but you only have geometries for city and county, and not for country and continent (so the objects they represent are still spatial, but you just don’t have data for them)? I may be misunderstanding this, but either way, I would suggest clarification of this point in the paper (it is mentioned in several places).