Multidimensional Enrichment of Spatial RDF Data for SOLAP

Tracking #: 2418-3632

Authors: 
Nurefsan Gur
Torben Bach Pedersen
Katja Hose
Mikael Midtgaard

Responsible editor: 
Boyan Brodaric

Submission type: 
Full Paper
Abstract: 
Large volumes of spatial data and multidimensional data are being published on the Semantic Web, which has led to new opportunities for advanced analysis, such as Spatial Online Analytical Processing (SOLAP). The RDF Data Cube (QB) and QB4OLAP vocabularies have been widely used for annotating and publishing statistical and multidimensional RDF data. Although such statistical data sets might have spatial information, such as coordinates, the lack of spatial semantics and spatial multidimensional concepts in QB4OLAP and QB prevents users from employing SOLAP queries over spatial data using SPARQL. The QB4SOLAP vocabulary, on the other hand, fully supports annotating spatial and multidimensional data on the Semantic Web and enables users to query endpoints with SOLAP operators in SPARQL. To bridge the gap between QB/QB4OLAP and QB4SOLAP, we propose an RDF2SOLAP enrichment model that automatically annotates spatial multidimensional concepts with QB4SOLAP and in doing so enables SOLAP on existing QB and QB4OLAP data on the Semantic Web. Furthermore, we present and evaluate a wide range of enrichment algorithms and apply them to a non-trivial real-world use case involving governmental open data with complex geometry types.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Alberto Abelló submitted on 23/Mar/2020
Suggestion:
Major Revision
Review Comment:

The paper presents four algorithms to enrich OLAP cubes with geographic relationships. The paper is original and contains a very good piece of engineering. Nevertheless, a more in-depth analysis of what is really happening is needed. It is claimed that "QB4SOLAP enables SOLAP operations". However, the information is actually assumed to be there, ready to be analysed and made explicit. The really relevant discussion then is (similar to materialized view selection) whether it is worth precomputing these relationships or better to discover them on the fly.

The way of obtaining "development cost" in Table 6 is not clear at all. Moreover, its comparison in the different scenarios looks unfair. The time to develop the ETL could be compared against that of developing the annotation algorithms, for example, which corresponds to the data preparation. Then, once the data is prepared in both systems, you can write the corresponding queries and compare how hard that is.

Regarding the discrepancies in the results of the different systems, they should have been clearly introduced much earlier. Section 6.4 comes as a surprise. Then, in 6.6, it is said that "RDF2SOLAP demonstrated accurate results". However, it is not clear which part of the (in)accuracy comes from the presented algorithms and which from the library being used.

The paper is well written but contains some unnecessary, overwhelming details. For example, many details of the implementation are not needed after Section 4. These could be either summarized, integrated into Section 4, moved to an appendix, or simply kept in a longer report version. Only Section 5.5 is really relevant, and it comes too late. A conclusions section of almost one page looks excessive. In contrast, the distinction between topological and aggregation relations should be made clearer from the very beginning, in the introduction.

- There appears to be a missing arrow in Figure 5, between parish_8648 and water_159.
- The concept of "phases" suggests some sequentiality, which is not the case. "Steps" could be more appropriate, for example.
- Page 23, line 1, "contant"
- Page 25, line 36, "the performance our approach"
- Table 6, row 5, units are missing

Review #2
Anonymous submitted on 29/Apr/2020
Suggestion:
Major Revision
Review Comment:

The paper presents an approach to automatically enrich multidimensional RDF data compliant with the QB4OLAP vocabulary with triples that explicitly indicate the spatial relationship between members of the cube. These relationships are derived from the geometries that describe such members; the algorithms to extract spatial relationships are formalized and evaluated in terms of both effectiveness and efficiency.

The paper discusses an interesting research topic at the crossroads between the areas of semantic web and (S)OLAP analysis. The contribution of the paper is not groundbreaking, but it presents and evaluates a framework for spatial enrichment of RDF data which is worth considering for publication. The quality of the paper is good, both in terms of presentation and self-containment; related work is accurately discussed. Nonetheless, I found some issues that require revision by the authors.

First of all, Section 4.1 and Algorithms 3-4 seem to describe a relatively simple process in a quite complicated way. If my understanding is correct, Algorithm 3 simply retrieves the pairs of linked level members where both are described by geometries and verifies the spatial relationship between the latter; Algorithm 4 does the same, with the only difference that level members are not directly linked, but they belong to different levels within the same hierarchy. Therefore, 1) in both cases, a verbal description that explains the intuition behind the algorithms is missing; 2) wouldn't Algorithms 3 and 4 be better represented by a simple and concise SPARQL query rather than a notation-heavy process?

Second, I question the relevance of Section 5. Except for Section 5.5 (which makes interesting observations about the state of the art of spatial technologies for the Semantic Web), this section describes the code of the implementation in great depth. What is the scientific relevance of this part? Considering that the implementation carefully follows the algorithms presented in the previous section and that the code is available on GitHub, I don't see the point of such a discussion. I would advise the authors to 1) significantly reduce the discussion and retain only the aspects (if any) that are interesting from a scientific/research perspective; 2) possibly move the detailed discussion to an appendix if the authors believe it absolutely must be kept in the paper. Otherwise, please provide a solid motivation for discussing the implementation code in a core section of the paper.

Finally, I have some doubts about the soundness of the evaluation in Section 6.2.
In the comparison of Algorithms 3 and 5 (both of which are based on explicit relationships), I would have expected to see an execution time proportional to the number of relationships; the results clearly prove this expectation wrong. I suspect this is due to the different nature of the relationships considered in the two algorithms. Unless I missed this, the authors should give more details to explain these results.
But my main concern is about Table 6. It seems unfair to compare exact results on run time with (what appear to be) rough estimates of the development cost. The authors need to provide further details about these development costs and about how they have been obtained. Also, it is not clear whether the user's expertise has been taken into account. Please explain this part in more detail, highlighting the critical phases (conversion? loading?) and distinguishing the objective data (i.e., actual times) from subjective aspects (i.e., user expertise).

Other remarks are indicated in the following:
- Fig.1: not clear what the semantics of arrows is.
- Check the use of acronyms across the paper. For instance, MD is introduced in p.1 line 40 left and then re-introduced in p.2 line 22 left; similarly, SOLAP is introduced in p.1 line 51 right and p.2 line 24 right, but not used in p.2 line 5 right. This issue occurs throughout Sections 1, 2, and 7.
- p.3 Contributions: at this point, it is not clear what the meaning of "explicit" and "implicit" hierarchy steps (and fact-level relations) is. Either explain this before or change the contributions to make them more general.
- Fig.7: still not clear what the semantics of arrows is. The figure is introduced as representing a "process flow", but it looks more like an architectural view of the framework. Also, it appears from the figure that queries on the triplestore are never formulated by the user, not even through GeoSemOLAP. Is this correct? What is the meaning of arrows outgoing from the "queries" module? If they are the response from the incoming arrows, wouldn't it be better to have single double-ended arrows? What is the legend of the symbols? Please revise this figure.
- Maybe I missed it, but what are "p" and "k" in p.10 lines 17,22 right?
- p.11 line 45 left: "The output of the helper function (Vs(ac)) keeps the spatial attribute values of the child level member idI(lmc)". Isn't this an abuse of notation, since vs(a) has been defined in p.10 line 40 right as a set of literals?
- Section 6: please convert execution times from seconds to minutes where necessary for easier reading; Table 6 has high run times in seconds (e.g., 2622) and development costs in minutes (e.g., 5 minutes = 300 seconds). Please present the results consistently.

Minor comments & typos:
- p.1 line 39 left: which allow
- p.2 line 12 left: for instance?
- p.3 line 4 left: should begin with "An illustration of..."