Morph-KGC: Scalable Knowledge Graph Materialization with Mapping Partitions

Tracking #: 3090-4304

Authors: 
Julián Arenas-Guerrero
David Chaves-Fraga
Jhon Toledo
María S. Pérez
Oscar Corcho

Responsible editor: 
Elena Demidova

Submission type: 
Full Paper
Abstract: 
Knowledge graphs are often constructed from heterogeneous data sources, using declarative rules that map them to a target ontology and materializing them into RDF. When these data sources are large, the materialization of the entire knowledge graph may be computationally expensive and not suitable for those cases where a rapid materialization is required. In this work, we propose an approach to overcome this limitation, based on the novel concept of mapping partitions. Mapping partitions are defined as groups of mapping rules that generate disjoint subsets of the knowledge graph. Each of these groups can be processed separately, reducing the total amount of memory and execution time required by the materialization process. We have included this optimization in our materialization engine Morph-KGC, and we have evaluated it over three established benchmarks. Our experimental results show that, compared with state-of-the-art techniques, Morph-KGC presents the following advantages: i) it decreases significantly the time required for materialization, ii) it reduces the maximum peak of memory used, and iii) it scales to data sizes that other engines are not capable of processing currently.
Full PDF Version: 
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Sherzod Hakimov submitted on 18/Mar/2022
Suggestion:
Accept
Review Comment:

The paper presents a method for materialization of triples that are used for knowledge graph construction. Two different algorithms are presented for mapping partitions, which group mapping rules that create triples. The presented algorithms are evaluated on three benchmark datasets and compared against four existing systems. The experimental results suggest that the presented sequential and parallel processing solutions perform better than the compared systems in terms of memory usage and execution time, respectively.

Originality: the paper addresses an important problem of reducing memory usage and execution time for knowledge graph construction. The novelty aspect of the paper lies in the presented two ways of using mapping partitions: parallel or sequential execution.

Improvement: The authors have taken into account the comments regarding the inclusion of additional configurations of the presented model for some experiments, inclusion of a discussion section to finalize the findings from the presented solution. The authors have shared the resources in Zenodo for the reproducibility.

Overall, the paper is well written with clear structures, after the requested changes from the initial reviews.

Review #2
Anonymous submitted on 07/May/2022
Suggestion:
Minor Revision
Review Comment:

This document reports the second version of the work on planning and execution techniques for [R2]RML mapping rules. The proposed method relies on the partition of mapping rules. Evaluating the groups in the partition, reduces the duplicated generation of RDF triples and maximizes the parallel execution of the mapping rules. The proposed techniques are implemented in MorphKGC, an RML-compliant engine; the behavior of MorphKGC is assessed in three existing benchmarks. The reported results suggest that the proposed methods can accelerate the execution of mapping rules as the ones composing the studied benchmarks.

Overall, this second version addresses some comments on the previous version. The new experimental results provide evidence of the benefits that planning the execution of the mapping rules brings to the process of KG construction. Moreover, the competitive behavior with other engines puts the critical role played by logical and physical planning into perspective. Despite the improvements, this new version of this work still presents imprecise statements which need to be addressed in a new version of the paper.

Definitions:

Definitions 1 and 2 are still not well-formulated; please, reuse the notation and conventions in 2.2. The formal proof of whether given a data source when these two mapping documents will produce the same set of RDF triples is necessary; the correctness of the proposed approach dependents on that.

I do not agree with this statement
“Answer: We have included an additional function const(.) in Section 2.2 to solve this. Note that `if` here does not refer to logics, we have replaced it with `when` to avoid reader confusion. As the values of a term map must be {constant, template or reference}, the invariant is well-defined with the bullets that consider the three possibilities.”
Since “when” and “if” represent logical conditionals, clarify the sufficient and necessary conditions in Definition 5.

Proposed Algorithms:

Please, clearly state the assumptions under which Algorithm 3 is able to generate the Maximal Mapping Partition of an [R2]RML document.

Empirical Evaluation
Please, include absolute values and explain the impact of the selectivity of the joins in the performance of the compared engines.

Minor comments
SDM-RDFizer v4.1.1 and Chimera v2.1. are interpreters of RML and not only parsers. Please, clarify this point.