Knowledge Graph OLAP: A Multidimensional Model and Query Operations for Contextualized Knowledge Graphs

Tracking #: 2269-3482

Christoph Schuetz
Loris Bozzato
Bernd Neumayr
Michael Schrefl
Luciano Serafini

Responsible editor: 
Harald Sack

Submission type: 
Full Paper
A knowledge graph (KG) represents real-world entities and their relationships with each other. The thus represented knowledge is often context-dependent, leading to the construction of contextualized KGs. Due to the multidimensional and hierarchical nature of context, the multidimensional OLAP cube model from data analysis is a natural fit for the representation of contextualized KGs. Traditional systems for online analytical processing (OLAP) employ cube models to represent numeric values for further processing using dedicated query operations. In this paper, along with an adaptation of the OLAP cube model for KGs, we introduce an adaptation of traditional OLAP query operations for the purposes of working with contextualized KGs. In particular, we decompose the roll-up operation from traditional OLAP into a merge and an abstraction operation. The merge operation corresponds to the selection of knowledge from different contexts whereas abstraction replaces entities with more general entities. The result of such a query is a more abstract, high-level view on the contextualized KG.

Major Revision

Solicited Reviews:
Review #1
By Aidan Hogan submitted on 29/Nov/2019
Minor Revision
Review Comment:

This paper describes an OLAP-style approach to modelling contextual knowledge graphs. Much like in traditional OLAP settings, the authors consider a multidimensional cube in which slice-and-dice and roll-up operations are supported. Unlike traditional OLAP settings where the cells of this cube are typically numeric values, in this setting, OLAP cubes are knowledge graphs. The authors propose that the dimensions of the cube can then serve as contextual dimensions, thus using an OLAP-style representation to manage context for knowledge graphs. The authors first define and formalise their "KG-Cube" model, covering the overall schema of the cube, the dimensions and levels, and the cells themselves; they then define the "knowledge modules" contained in each cell, which are described using an "object language" (examples are defined using DL syntax). Thereafter they describe the main query operations considered, including two contextual operations: slice-and-dice and merge (roll-up); as well as three graph operations: abstraction, pivoting and reification. The authors then briefly describe a proof-of-concept implementation using an off-the-shelf SPARQL store (GraphDB); using this, they outline some experiments using artificially generated data for an air-traffic-control use-case, over which SPARQL translations of the key operations defined earlier are introduced; the results in general show a linear relation between the cost of the principal operations and the size of the data.

The paper is of clear relevance to the journal, and tackles an important problem (managing contextual data) with an interesting and technical approach. The paper is well-written, and has a good balance of motivation (the air-traffic-control use-case is very appealing), intuition, formal definitions, abstract examples, and concrete examples. The approach itself I found (pleasantly) surprising: I was expecting another graph-esque representation of OLAP, but having OLAP where cells contain graphs is really something new for me and something that captured my attention; also the authors establish an interesting relation between OLAP and context that, though obvious once pointed out, I had not seen before and is, for me, a valuable observation. I also appreciate the provision of experiments for the various operators.

In summary, I like the paper quite a lot!

There are, however, some (minor-ish) points to improve upon:

* The paper never directly addresses the issue of incompleteness, which is one of the characteristic features of knowledge graph-style applications (though perhaps not the specific use-case chosen). For example, what could be done if some dimension members are not known for a particular cell? While this might be something for future work, I think the paper would benefit from some discussion on what completeness is assumed, and how incompleteness could be handled (either now, or in the future).

* I found the mix of expressiveness a bit distracting. In Figure 6, though K0 appears to be pure RDFS, in K1, some more expressive DL constructs are used, such as datatype facets. But these are not supported in OWL RL (Section 3.2.3). Again the expressiveness drops in the experiments where only RDFS is considered. I understand that this is not central to the objectives of the paper, but ideally this switching of expressiveness could be cleaned up somehow.

* Some of the discussion could perhaps be made more concise. A particular case of this is Example 9 and Example 10, which are, respectively, the second and third examples for abstractions. I am not sure what they add versus Example 8, which already exemplifies all of the different types of abstraction. I would suggest to either clarify at the start what the example additionally contributes so the reader knows what to be looking out for, or otherwise (if just another example of the same idea), remove it/them.

* Regarding Definition 5 and the merge operators, since RDF is used as the serialisation format, and since reification steps are considered, I think it would be worthwhile to mention something about blank nodes, either to simply say that the framework does not consider them (e.g., considers that they have been skolemised to constants beforehand), or to add the RDF merge operator [] as a merge option.

* The appendix has some useful material (e.g., to get an idea of the queries), but it's not included at the end of paper and took me quite some time to find it. I understand that it is very long (admittedly I only skimmed it for this reason), but maybe some part of it could be included at the end of the paper with the rest being posted online? Admittedly I'm not sure I have a good suggestion on how to handle this.
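To illustrate the blank-node remark above: a plain set union of two RDF graphs conflates blank nodes that happen to share a label, whereas an RDF merge first renames blank nodes apart. The following is a minimal sketch, assuming triples are plain Python tuples and blank nodes are strings starting with `_:`; the data, the helper names, and the `g<i>_` renaming scheme are illustrative assumptions and are not taken from the paper.

```python
def is_bnode(term):
    """Treat any string starting with '_:' as a blank node label (assumption)."""
    return isinstance(term, str) and term.startswith("_:")


def rename(term, graph_index):
    """Give a blank node a label that is unique to its source graph.

    The 'g<i>_' prefix is a hypothetical scheme; it assumes no input graph
    already uses labels of that shape.
    """
    return f"_:g{graph_index}_{term[2:]}" if is_bnode(term) else term


def naive_union(*graphs):
    """Plain set union: blank nodes with equal labels are conflated."""
    return set().union(*graphs)


def rdf_merge(*graphs):
    """Union after standardising blank nodes apart, one namespace per graph."""
    merged = set()
    for i, graph in enumerate(graphs):
        merged |= {tuple(rename(t, i) for t in triple) for triple in graph}
    return merged


# Two hypothetical cell graphs that reuse the blank node label _:b0.
g1 = {("_:b0", "rdf:type", "ex:Runway"), ("_:b0", "ex:status", "closed")}
g2 = {("_:b0", "rdf:type", "ex:Taxiway"), ("_:b0", "ex:status", "open")}

union = naive_union(g1, g2)
merged = rdf_merge(g1, g2)

union_subjects = {s for s, _, _ in union}
merged_subjects = {s for s, _, _ in merged}
```

In the naive union, the single node `_:b0` ends up being both a runway and a taxiway; after the merge, the two anonymous resources stay distinct. Skolemising blank nodes to constants beforehand would sidestep the issue in a similar way.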

In summary I like the paper a lot. I think the above comments can be addressed within a minor revision. Please also review the following (more) minor remarks.


* "A knowledge graph (KG) serves organizations to represent real-world entities ..." Slightly awkward phrasing.

* "The Resource Description Framework (RDF) is the standard representation format for KGs ..." I personally think this should say "a standard representation format"; even though there is no standard, this gives the impression of RDF being the one and only representation format.

* "The roll-up operation that sums up ..." Awkward and difficult to read. Rephrase.

* "in [the] form of messages"

* "using the cases of ATM" -> "following the use-cases of ATM"

* I am a bit confused as to why D (dimensions) is defined to be a set of atomic *roles*. I would have imagined a set of dimension names, like "Importance", "Location", etc. I thought that maybe the dimension role is the dimensional ordering, but this is actually defined separately. Perhaps this could be clarified/explained better.

* "s.t. for j with ..." Awkward phrasing.

* "called [the] object language"

* "we assume that [the] meta and object level[s]"

* "is then [a] DL language"

* "come in various fashions" -> "come in various forms" (The current phrasing is slightly off and a bit distracting; one could perhaps also say "Note that various fashions of slice-and-dice operations ...")

* "serve as [a] grouping property"

* "RDFpro allows for the specification of queries across different graphs, a feature needed for the reasoning ..." SPARQL also supports this (through FROM, FROM NAMED and GRAPH).

* "causing a vast number of lower-level cells to be merged" What does "vast" mean? Please be specific.

* "regardless [of] contextualization"

* "different serializations and publication formats of traditional OLAP cubes rather than KGs" In other words, a traditional OLAP cube represented as a graph is not a KG (it does not "represent real-world entities and their relationships with each other")? I think perhaps the argument here can be rephrased to avoid this unnecessary implicit claim (e.g., end the sentence after "cubes").

* I appreciate that the link in footnote 5 provides material to reproduce experiments, but the link should be made more prominent; from the context of the text where the footnote is referenced it's not clear at all what is in the link.

Review #2
By Sven Groppe submitted on 10/Feb/2020
Major Revision
Review Comment:

The strong points of the paper are:
- It is easy to read and contains a great introduction into OLAP and other relevant basics
- It covers a strong formalization of the discussed problem and solution
- It introduces KG-OLAP as variant of OLAP not being discussed so far in the existing literature
- The research focus on aggregation is very important and interesting, as aggregation hasn't been extensively investigated in the Semantic Web community so far (at least from my impression)
- The paper discusses an extensive use case from the real world (in the area of air traffic management), which helps much to get a feeling for the significance of the proposed approach in daily life.

The weak points of the paper are:
- The authors don't detail the additional benefits KG-OLAP has in comparison to traditional OLAP and Graph-OLAP. The authors don't motivate this point in the use case, where it should be clarified in concrete "steps" in the running example what can be done in KG-OLAP that cannot be done in OLAP/Graph-OLAP, or what can be done in a simpler way (by utilizing the additional abstraction layer of the ontology).
- How would the running example be modeled in OLAP/Graph-OLAP? This could help to show what is additionally possible or easier in KG-OLAP...
- There is some integration of ontology and its inference into KG-OLAP (see Knowledge modules), but the integration hasn't been brought to its full potential:
  - It looks like the abstraction operator takes over some functionality that could also be achieved by taking the inference of a suitable ontology into account.
  - Some operators like the roll-up operator should be rethought as well: The roll-up operator could e.g. roll up to the instances of super classes (organized in a suitable ontology). Furthermore, the authors should think about and discuss what other OWL features (besides subclass relationships) can be integrated into KG-OLAP and its operations.
- In the performance evaluation, only the runtimes of the authors' own system are measured; they are not compared with those of any other existing system. As it is the first system for KG-OLAP, the authors could use systems for traditional OLAP and Graph-OLAP for comparison. Furthermore, please check whether there are any existing OLAP benchmarks that could be applied (after adapting them to KG-OLAP) in the experiments, too.
- The runtimes of the KG-OLAP operations are relatively bad; they should be on the order of a few seconds for the system to be used in daily life. The authors should think about improved approaches with better performance. It would be nice to see efficient approaches scaling up to something like billions of statements (instead of "only" 35 million statements), such that we could call it a Big Data approach.
- It is quite disappointing that only a subset of RDFS is considered in the experiments and not OWL (or one of its fragments/profiles like OWL2RL/OWL2QL/...).

Minor comments:
Figure 1: The example in b) is more general than the one in a), but should be analogous to a) representing the same information (only in triple form), such that readers get a direct comparison in the example.

Review #3
Anonymous submitted on 04/Mar/2020
Major Revision
Review Comment:

This work presents KG OLAP, an approach for representing multidimensional data based on contextualized knowledge graphs. The authors present the semantics of the KG OLAP Cube Model and its relationship with the Contextualized Knowledge Representation (CKR) Framework. Then, the semantics of the contextual operations – slice and dice, union merge, and intersection merge – and the graph operations – abstraction based on triple, individual, or value generation, pivoting, and reification – are presented. The manuscript reports on the performance of a proof-of-concept implementation of KG OLAP to show the feasibility of contextualized KGs using synthetic data.
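The merge and abstraction operations summarised above can be illustrated with a minimal sketch. It assumes that a cube is a dictionary from coordinate tuples to sets of triples and that a roll-up hierarchy is a plain child-to-parent mapping; all names and data are hypothetical, and the sketch is not the authors' implementation.

```python
from collections import defaultdict


def merge(cells, roll_up, how="union"):
    """Merge all cells whose coordinates roll up to the same coarser cell.

    cells:   {coordinate tuple: set of triples}
    roll_up: {member: parent member}; members without an entry stay as-is
    how:     'union' keeps every triple, 'intersection' only shared triples
    """
    groups = defaultdict(list)
    for coords, triples in cells.items():
        coarse = tuple(roll_up.get(member, member) for member in coords)
        groups[coarse].append(triples)
    combine = set.union if how == "union" else set.intersection
    return {coords: combine(*graphs) for coords, graphs in groups.items()}


def abstract(triples, generalize):
    """Replace individuals with more general entities inside one cell."""
    return {tuple(generalize.get(t, t) for t in triple) for triple in triples}


# Hypothetical cube with dimensions (location, date).
cells = {
    ("airportA", "day1"): {("rwy16", "status", "closed"),
                           ("rwy11", "status", "closed")},
    ("airportA", "day2"): {("rwy16", "status", "closed"),
                           ("twyB", "status", "open")},
}
roll_up = {"day1": "month1", "day2": "month1"}

union_cube = merge(cells, roll_up, how="union")
intersection_cube = merge(cells, roll_up, how="intersection")

# Abstraction: both runways are replaced by one grouping individual.
abstracted = abstract(cells[("airportA", "day1")],
                      {"rwy16": "runways", "rwy11": "runways"})
```

Here the union merge yields a single month-level cell with three triples, the intersection merge keeps only the triple shared by both days, and the abstraction collapses the two runway triples into one.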

Reviewing Criteria:
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

(1) Originality
This work presents an approach for representing and querying multidimensional data based on the RDF data model. While the proposed approach KG OLAP is sufficiently novel with respect to other solutions discussed in the Related Work, the novel research contributions of this work with respect to the authors’ previous work should be clearly presented, especially with references [15] and [36] in the manuscript.

(2) Significance of the results
The proposed solution has been showcased in the knowledge domain of air traffic management. Furthermore, it is easy to see how the proposed solution can be applied to other domains that require multidimensional representation.

The manuscript reports on theoretical results (i.e., Theorem 1), yet, no formal proof has been provided. A sound empirical evaluation of a proof-of-concept implementation is presented in the manuscript. The experimental results clearly show the tradeoff between repository size, dimensionality, number of contexts, and the performance (runtime) of the proposed OLAP operations.

(3) Quality of writing
The manuscript is very well written. The authors have included extensive examples that clearly illustrate the proposed concepts. It was a pleasure to read this work.

In summary, the research contributions of this work are of great value to the journal and the Semantic Web Community. Nonetheless, there are a few crucial inaccuracies that should be resolved before the manuscript is accepted. Further details below.

Major Comments/Questions:

1. Please summarize the novel contributions of this manuscript with respect to the authors’ previous work.

2. The manuscript frequently refers to the appendix for additional theoretical properties, empirical results or other relevant information about the approach. Unfortunately, the appendix is not included in the paper. Is this an online appendix? Given the prominence of this appendix in the presented work, the authors should clearly provide the references to the additional information.

3. How is the notion of context coverage from KG OLAP language formally defined in the CKR core language? The manuscript only mentions that “context coverage is a partial order relation in R” (page 11).

4. In Definition 3 (KG-OLAP cube model), the symbols d, d_A, m are not defined in conditions (iv), (v), and (vi), respectively.

5. In Definition 3 (KG-OLAP cube model), condition (vi) presents \mathcal{M} \entails c_1 \preceq c_2 as a condition. Shouldn’t this condition simply be c_1 \preceq c_2? Please check.

6. Theorem 1 has not been formally demonstrated in this work. In particular, the proof for the CKR extensions that include the computation of the cell coverage relation has not been provided.

7. In Definition 8 (Reification), the assertion hasObject(R-a-b, r) should be hasObject(R-a-b, b). Please check.
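The correction in comment 7 can be made concrete with a small sketch of reification. It assumes a role assertion R(a, b) is represented as the tuple `(a, R, b)` and that the statement individual is named `R-a-b` as in the comment; only `hasObject` appears in the text above, so `hasSubject` and `hasPredicate` are assumed property names used for illustration.

```python
def reify(triple):
    """Turn the assertion R(a, b) into statements about an individual R-a-b.

    hasSubject and hasPredicate are assumed names; only hasObject is
    mentioned in the review comment.
    """
    subject, role, obj = triple
    statement = f"{role}-{subject}-{obj}"
    return {
        (statement, "hasSubject", subject),
        (statement, "hasPredicate", role),
        (statement, "hasObject", obj),  # points at b, not at the role R
    }


reified = reify(("a", "R", "b"))
```

With this reading, the object of `hasObject` is the individual `b`, while the role `R` is recorded separately, which is the distinction the comment asks the authors to check.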

Minor comments:
- (page 1) “The Resource Description Framework (RDF) is the standard representation format for KGs”. RDF is not a format but a data model and it is a recommendation by the W3C to publish KGs using (semantic) web technologies.

- In Figure 6, axiom (1), is the intersection of Runway and Taxiway a subconcept of Runway Taxi? (the comma is ambiguous here).

- In Figure 7 (b), after merging, should c1 have a new name? A similar comment applies to Figure 9 with the context c.

- For the sake of self-containment, the notions of complex concepts and complex roles (used for example in Definition 6) should be introduced in the paper.

- In the evaluation, please clearly explain in the text the slight difference in the size of the repositories when varying the number of contexts observed in the plots of Figures 17 and 18.