Morph-KGC: Scalable Knowledge Graph Materialization with Mapping Partitions

Tracking #: 2924-4138

Julián Arenas-Guerrero
David Chaves-Fraga
Jhon Toledo
María S. Pérez
Oscar Corcho

Responsible editor: 
Elena Demidova

Submission type: 
Full Paper

Knowledge graphs are often constructed from heterogeneous data sources, using declarative rules that map them to a target ontology and materializing them into RDF. When these data sources are large, the materialization of the entire knowledge graph may be computationally expensive and not suitable for those cases where a rapid materialization is required. In this work, we propose an approach to overcome this limitation, based on the novel concept of mapping partitions. Mapping partitions are defined as groups of mapping rules that generate disjoint subsets of the knowledge graph. Each of these groups can be processed separately, reducing the total amount of memory and execution time required by the materialization process. We have included this optimization in our materialization engine Morph-KGC, and we have evaluated it over three established benchmarks. Our experimental results show that, compared with state-of-the-art techniques, Morph-KGC presents the following advantages: i) it decreases significantly the time required for materialization, ii) it reduces the maximum peak of memory used, and iii) it scales to data sizes that other engines are not capable of processing currently.
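As a rough illustration of the core idea the reviews below discuss, the following sketch partitions mapping rules by the invariant (constant) prefix of their subject templates: rules whose invariants are not prefixes of one another can never produce the same subject IRI, so their output triple sets are disjoint and each group can be materialized independently (sequentially to cap memory, or in parallel to cut time). This is a toy approximation under simplifying assumptions, not Morph-KGC's actual algorithm, which also considers the predicate, object, and graph positions; the rule dictionaries and helper names are hypothetical.

```python
def invariant(template: str) -> str:
    """Constant prefix of a template, up to its first reference '{...}'."""
    pos = template.find("{")
    return template if pos == -1 else template[:pos]


def partition_by_subject_invariant(rules):
    """Group mapping rules so that rules in different groups have subject
    invariants where neither is a prefix of the other; such groups generate
    disjoint subject IRIs. Groups are merged transitively: a rule whose
    invariant overlaps several groups forces those groups together."""
    groups = []  # each entry: (list of invariants, list of rules)
    for rule in rules:
        inv = invariant(rule["subject_template"])
        # Find every group whose invariants overlap (prefix relation) with inv.
        matching = [g for g in groups
                    if any(inv.startswith(i) or i.startswith(inv) for i in g[0])]
        merged_invs, merged_rules = [inv], [rule]
        for g in matching:
            merged_invs += g[0]
            merged_rules += g[1]
            groups.remove(g)
        groups.append((merged_invs, merged_rules))
    return [members for _, members in groups]


# Hypothetical example: tm1 and tm3 share the invariant
# "http://ex.org/person/" and so may collide -> same group;
# tm2 can never clash with them -> its own group.
rules = [
    {"id": "tm1", "subject_template": "http://ex.org/person/{id}"},
    {"id": "tm2", "subject_template": "http://ex.org/city/{code}"},
    {"id": "tm3", "subject_template": "http://ex.org/person/{id}/address"},
]
parts = partition_by_subject_invariant(rules)
```

Each group in `parts` could then be handed to a separate worker process, since no two groups can generate the same triple and no cross-group duplicate elimination is needed.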

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 30/Dec/2021
Major Revision
Review Comment:

The paper tackles the problem of knowledge graph (KG) construction. It proposes planning and execution techniques to speed up KG construction pipelines specified with [R2]RML mapping rules. The proposed method relies on partitioning the mapping rules so that evaluating the groups in the partition reduces duplicate RDF triple generation and maximizes the parallel execution of the mapping rules. The proposed techniques are implemented in Morph-KGC, an RML-compliant engine; the behavior of Morph-KGC is assessed on three existing benchmarks. The reported results suggest that the proposed methods can accelerate the execution of mapping rules such as those composing the studied benchmarks.

Overall, the paper is relatively well written and presents an efficient solution to a relevant data management problem. The experimental results provide evidence of the benefits that planning the execution of the mapping rules brings to the process of KG construction. Moreover, the outcomes of the empirical evaluation put in perspective the improvements achieved by the proposed planning techniques. However, the current version of this work suffers from several issues that considerably reduce its value. First, the problem is not formally defined, the definitions are not well formulated, and several relevant concepts are ill-defined. Second, the complexity and correctness of the proposed algorithms, and the conditions required for their performance and soundness, are not even mentioned. Third, the experimental study is conducted over a limited set of test cases, thus limiting the reproducibility of the reported outcomes. Finally, the state of the art is analyzed only superficially, and the proposed techniques are not positioned with respect to similar data management techniques.

All these issues prevent a positive evaluation of the current version of the work. The recommendation is a major revision addressing all the following comments.

Mathematical Pitfalls:
• Definition of equivalent triples maps. The conditions under which two triples maps are equivalent are not formally stated. Note that two triples maps can produce equivalent results even if they are defined over two different logical sources; the informal definition presented on page 3, however, suggests that both triples maps must be defined over the same data source. Moreover, the property that “two equivalent term maps with different positions are not equal” is stated without making the distinction explicit: the difference between equivalent and equal must be differentiated. What is the complexity of the problem of deciding whether two triples maps are equivalent?
• “Because of RDF set semantics”. An RDF document is formalized as a graph; please clarify and provide a reference for the RDF set semantics to which the authors refer.
• It is never defined when an RDF triple is generated from the evaluation of a triples map over a logical data source. Please formally state the result of evaluating a triples map.
• The definitions of position(.), type(.), value(.), and literaltype(.) incorrectly assume a single element T in the domain. On the contrary, the domain in all these cases must be a set of term maps.
• Self-joins are used without any definition in the context of mapping rules.
• Definition 1. The concepts [R2]RML document, equivalent normalized [R2]RML document, and [R2]RML document without self-joins are used without any previous formal definition. Consequently, the description of a canonical [R2]RML document is mathematically incorrect.
• Definition 2. A partition of a set X is a set of subsets of X, and the relationship of the parts G_i of P is not defined. Also, please, note the time complexity of enumerating all the possible subsets of a set X.
• Definition 3. A template should be defined in terms of the different structures that it may take, and then a prefix can be defined.
• Definition 4. An invariant is wrongly defined: T is a term map, while an invariant I is a string, so making I equal to T is a type mismatch. Also, the “if” conditions state only sufficient prerequisites. Please provide necessary and sufficient conditions for the values of an invariant I.
• Definition 5. Term maps are treated as sets, even though a term map is never presented in this way.
• Definition 6. The definition of disjoint mapping rules relies on the notion of the triple set generated by a mapping rule, which has never been defined.
• Definition 7. It presents a property instead of a definition of disjoint mapping groups. Also, it relies on Definition 6 (which is incorrect) and assumes that a triples map is a set.
• Definition 8. It defines a Maximal Mapping Partition of an [R2]RML document as the one with the largest number of mapping groups. This is exactly the one where each group is a singleton set unless another missing condition needs to be satisfied. What is the complexity of finding a Maximal Mapping Partition?
Proposed Algorithms:
• Algorithm 1 simply replaces an object reference with the template definition of the corresponding parent triples map. Algorithms 2 and 3 use functions that are not defined.
• The heuristics implemented by Algorithm 3 are not defined, and because the problem solved by Algorithm 3 is not defined, it is impossible to demonstrate its correctness.
• The complexity of Algorithm 3 needs to be analyzed and demonstrated.
Empirical Evaluation
• The empirical evaluation does not consider engines that also implement similar planning techniques, e.g., SDM-RDFizer 4.0 [1].
• The meaning of Table 1 is not clear, and the reported results do not have any connection with the rest of the outcomes presented in this section. Please report the results if the experimental study was also conducted over all these datasets. Otherwise, eliminate this table. It is misleading for the reader.
• The absolute values of Figures 4 and 5 should be reported.
• A detailed discussion is needed of the conditions that a data integration system must meet to benefit from the proposed techniques. For example, what happens when the joins between triples maps are not self-joins and the selectivity changes? The authors claim to have discussed the parameters that impact the execution of the KG construction process in previous work; however, only a few of them are considered in this study, thus reducing the reproducibility of the results in more general testbeds.

Related Work
• This section simply describes tools instead of analyzing the data management techniques proposed by each of the existing approaches. Please provide a deeper analysis of the problems solved by the approaches reported in the literature and the pros and cons of their techniques, and position your techniques with respect to them.

Minor comments
• The benchmark Genomics – TIB is mentioned, but then the COSMIC testbed is used. Are they the same? Do the authors refer to the benchmark available at [2] or to another benchmark?


Review #2
Anonymous submitted on 31/Dec/2021
Review Comment:

This paper presents, to the best of my knowledge, an original work on partition-based graph materialization using RML approaches.
The results presented in this paper are useful, and the paper can find applications in several interesting domains for RDF materialization. The paper is generally clearly written; however, it is not self-contained: a brief introduction to vital methods, e.g., normalization, should be included in the paper itself.

The source code is provided on GitHub and is well documented.

For the evaluation, some engines are excluded from some benchmarks without any justification, e.g., Ontop was only used in the GTFS-Madrid-Bench, while SDM-RDFizer was not evaluated on NPD; it feels like the engines were randomly selected.
Also, in many cases the performance of SDM-RDFizer remained close to that of Morph-KGC, which is not sufficiently discussed.

Review #3
By Sherzod Hakimov submitted on 21/Jan/2022
Minor Revision
Review Comment:

The paper presents a method for materialization of triples that are used for knowledge graph construction. Two different algorithms are presented for mapping partitions, which group mapping rules that create triples. The presented algorithms are evaluated on three benchmark datasets and compared against four existing systems. The experimental results suggest that the presented sequential and parallel processing solutions perform better than the compared systems in terms of memory usage and execution time, respectively.

Originality: the paper addresses an important problem of reducing memory usage and execution time for knowledge graph construction. The novelty aspect of the paper lies in the presented two ways of using mapping partitions: parallel or sequential execution.

Significance of the results: the authors compare against existing systems on multiple benchmark datasets. The presented methods seem to perform better in both memory usage and elapsed time. Even though the shown results are better than those of the compared systems, two different configurations (parallel and sequential execution) are highlighted for elapsed time and memory usage, respectively. The authors compare them side by side in Figure 3, where the advantages/disadvantages of each system can be seen. What is the performance of each configuration, Morph-KGC^p+ and Morph-KGC^p, on the GTFS, COSMIC, and NPD datasets? For instance, Figure 4a on the GTFS dataset shows only Morph-KGC^p+ (parallel), and Figure 4b shows Morph-KGC^p (sequential). How does each configuration (parallel or sequential) compare with the other systems for both memory and execution time? A conclusion on how one can combine the configurations to build a more efficient materialization pipeline for KGC is missing. And what cost does execution time incur to keep memory usage low, or vice versa, when these two configurations are merged? These points are not made clear in the paper.

Also, five different configurations of Morph-KGC are mentioned, but only three are compared/evaluated.

Quality of writing: The paper is written very well with clear structure and adequate language.

Overall: the results section and conclusion should also discuss the drawbacks of each highlighted configuration, which are visible in the reported results. Besides this point, the paper addresses an important challenge and presents significant contributions.