KG-MDL: Mining Graph Patterns in Knowledge Graphs with the MDL Principle

Tracking #: 3339-4553

Francesco Bariatti
Peggy Cellier
Sébastien Ferré

Responsible editor: 
Bernardo Cuenca Grau

Submission type: 
Full Paper

Abstract: 
Nowadays, increasingly more data are available as knowledge graphs (KGs). While this data model supports advanced reasoning and querying, KGs remain difficult to mine due to their size and complexity. Graph mining approaches can be used to extract patterns from KGs. However, this raises two main issues. First, graph mining approaches tend to extract too many patterns for a human analyst to interpret (pattern explosion). Second, real-life KGs tend to differ from the graphs usually treated in graph mining: they are multigraphs, their vertex degrees tend to follow a power law, and the way in which they model knowledge can produce spurious patterns. Recently, a graph mining approach named GraphMDL+ has been proposed to tackle the problem of pattern explosion, using the Minimum Description Length (MDL) principle. However, GraphMDL+, like other graph mining approaches, is not suited to KGs without adaptations. In this paper we propose KG-MDL, a graph pattern mining approach based on the MDL principle that, given a KG, generates a human-sized and descriptive set of graph patterns, and does so in a parameter-less and anytime way. We report on experiments on medium-sized KGs showing that our approach generates sets of patterns that are both small enough to be interpreted by humans and descriptive of the KG. We show that the extracted patterns highlight relevant characteristics of the data: both of the schema used to create the data, and of the concrete facts it contains. We also discuss the issues related to mining graph patterns on knowledge graphs, as opposed to other types of graph data.
Solicited Reviews:
Review #1
Anonymous submitted on 10/Mar/2023
Review Comment:

This paper proposes KG-MDL, a method for pattern extraction from knowledge graphs (KGs). The method is based on the Minimum Description Length (MDL) principle, which states that the best representation model is the one that encodes the data with the shortest possible description, i.e., the one that best compresses the data. The paper discusses the differences between KGs and other kinds of graphs that are usually encountered in graph mining, and shows how to adapt an existing MDL-based method called GraphMDL+ to KGs. The paper evaluates KG-MDL on small and medium-sized KGs and shows that it can effectively extract a minimal set of patterns that describe the structure and possible knowledge in the graph.
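
For readers unfamiliar with the principle, the two-part MDL criterion mentioned above can be illustrated with a minimal sketch. The toy encoding below (character costs, code-word costs) is purely illustrative and is not the encoding used by KG-MDL or GraphMDL+:

```python
import math

def description_length(model, data):
    """Generic two-part MDL score: L(M) + L(D|M).

    `model` is a set of patterns (strings); `data` is a list of items.
    Items covered by a pattern are encoded with a short code word;
    uncovered items are spelled out literally. Toy encoding only.
    """
    # L(M): cost of describing the model itself (one character ~ 8 bits)
    l_model = sum(8 * len(p) for p in model)
    # L(D|M): a covered item costs one code word (log2 of the model
    # size, with a 1-bit floor); an uncovered item is spelled out.
    code_cost = math.log2(len(model)) if len(model) > 1 else 1.0
    l_data = sum(code_cost if item in model else 8 * len(item)
                 for item in data)
    return l_model + l_data

data = ["abc"] * 10 + ["xy"]
# A model containing the frequent item compresses the data better than
# an empty model, so it gets a lower (= better) MDL score.
assert description_length({"abc"}, data) < description_length(set(), data)
```

The point of the two-part score is exactly the trade-off the review describes: a richer model is only worth its own description cost if it saves at least as much when encoding the data.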

*Strong points
S1. The paper is well written and easy to follow.
S2. The paper addresses an interesting problem: that of graph mining for KGs. KGs are nowadays becoming very complex, making it hard for humans to detect interesting facts.
S3. The authors offer a well-documented open source implementation of their methods.

*Weak points
W1. The paper has limited contribution for a journal paper. Its core modelling and algorithmic parts build on previous work from the authors.
W2. The paper does not provide any comparison with other related efforts for KG graph mining or summarization.
W3. The paper misses some important related work (RW) on graph mining and RDF schema extraction.

*Detail comments.
D1. Intro: The contributions stated by the article in the intro are not clear. E.g., the MDL-based method and the search algorithm are variations of GraphMDL+. The discussion of important distinctions between graph mining and KGs is not convincing, since they boil down to replacing rdf:type edges with labels and splitting literal nodes. The evaluation is performed on very small graphs, which is nowadays not the common case for KGs.
For those reasons I think this article should be drastically improved and resubmitted.
D2. RW. You miss some works from the semantic web and database communities, which are worth citing in your work. For example, relevant works include traditional graph mining such as "Graph Pattern Mining and Learning through User-Defined Relations" (ICDM 2018) or "GRAMI: Frequent Subgraph and Pattern Mining in a Single Large Graph" (VLDB 2014); RDF mining such as "RDFRules: Making RDF rule mining easier and even more efficient" (Semantic Web Journal 2021); and RDF summarization such as "RDF graph summarization for first-sight structure discovery" (VLDBJ 2020).
It would be good to include and position your work in terms of more relevant works and stress on what type of patterns your methods are good at. As presented now, your framework seems to offer generic capabilities for very different types of graph mining, which confuses the reader.
D3. Section 4, Fig. 3 left. An RDF graph will not necessarily use the same node for literals of the same value. E.g., two subject nodes each having a literal as a value are not necessarily represented as pointing to the same node. It would be good to show (via the steps of your parsing algorithm) how you move from triples to your KG-MDL representation.
D3. Section 4. It is not clear what the overall goal of adopting KG-MDL is. Is it to speed up pattern extraction? Is it to reduce the number of, and the space needed for, the graph patterns? And if so, is it to speed up SPARQL querying on them? There is a general statement of the MDL principle, but what is the overall goal? What do you achieve with this encoding?
Maybe this should be stated more clearly in the intro and become part of the justification for your contributions.
D4. Figure 5. Are the edges directed in the rewritten graph?
D5. 4.2.3: This section is a bit hard to follow, especially the first part. Provide an example of how you compute the description length of a pattern in the CT.
D6. 4.3: Search of a pattern. It is not clear how you decide that a node is a port. E.g., sink or root nodes? Please explain.
D7. There is no complexity analysis of the search algorithm. Also, a figure showing an example of how a small CT is gradually constructed would help.
D8. The search algorithm does not seem to scale (and may exhibit exponential complexity for complex graphs); I think you already mention that in your considerations and the conclusions. However, it is something that limits the contribution of your work, as KGs nowadays are huge. I think you could come up with a greedy or heuristic variation that improves the complexity of your algorithm.
D9. The row cover timeout is not a principled way to prune the search space of your algorithm, as it depends on the initial size of the graph (you present this in Fig. 7) and on the hardware resources (memory, CPU, etc.). I would propose experimenting with other stopping criteria, such as the percentage of patterns added to a CT, or stopping when the difference in compression ratio (%) falls below a threshold.
D10. Experiments: A comparison with other approaches is missing. Even a comparison with your own previous works would offer valuable results.
D11. Table 1. You report the number of triples, but not the equivalent nodes and edges of the RDF graph that you use as the intermediate representation for the KG-MDL graph. If possible, report the |V| and |E| of this graph. Also, why is |V| << |V_L|? Is it because of the rdf:type (or partOfSpeech) predicate?
D12. P12 ”when looking at the data in details… This is consistent with the idea of finding patterns to describe the structure of the data, without caring about the concrete values of the different properties..”
KGs do follow a schema, or at least an inherent one. Have a look, e.g., at the paper "Relational schema optimization for RDF-based knowledge graphs" (Information Systems 2022), which shows this property through the notion of extended characteristic sets. So essentially, what is the interesting part of this finding? Also, please refer in your RW to such approaches that aim at extracting the inherent structure of KGs.

Review #2
Anonymous submitted on 07/Jun/2023
Major Revision
Review Comment:

The authors introduce a pattern mining approach for knowledge graphs (KGs) based on the Minimum Description Length (MDL) principle. It aims to mine a relatively small number of compact patterns, which can represent frequent patterns in the KG. Informally, a pattern is a labelled graph that can be embedded into the KG via an injective function with its vertex and edge labels subsumed by those of the mapped vertices and edges in the KG. For example, "a book is authored by a person" is a pattern which may have many occurrences (or embeddings) in a KG about books and authorship. Graph mining methods based on MDL have provided ways to calculate the description lengths of patterns for the selection of sets of patterns with minimum lengths. Yet, no MDL-based pattern mining methods have been proposed for KGs previously.

The authors adapt and extend the graph mining method GraphMDL+ [18] for KGs, as GraphMDL+ does not restrict the shapes of the patterns. To achieve this, the proposed approach first preprocesses the KG and then searches for patterns based on the adapted MDL metrics. Intuitively, the search starts with a large number of primitive patterns and iteratively combines them to form more representative patterns. While the algorithm looks intuitive, I have two major concerns. First, it appears that the code table, i.e., the list of patterns, keeps increasing as new patterns are iteratively added (Alg. 1, l9). I would expect the combined patterns to replace the original patterns, instead of simply being added. If the code table only increases in size, how can the description length decrease (Alg. 1, l10)? Maybe I missed something important here.
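
[For context on the concern above: in Krimp-style MDL pattern mining, which GraphMDL+ follows in spirit, a candidate is typically kept only when adding it lowers the total description length, so the code table does not grow with every candidate considered. A minimal sketch of such a greedy acceptance loop, with a purely illustrative toy scoring function and hypothetical names:]

```python
def greedy_code_table_search(candidates, dl, initial_table):
    """Greedy Krimp-style search: accept a candidate pattern only if it
    reduces the total description length; otherwise discard it.

    `dl` maps a code table (list of patterns) to its total description
    length (model + data); here it is treated as a black box.
    """
    table = list(initial_table)
    best = dl(table)
    for cand in candidates:
        trial = table + [cand]
        score = dl(trial)
        if score < best:  # keep only length-reducing patterns
            table, best = trial, score
        # otherwise the candidate is rejected, so the table does not
        # grow monotonically with every candidate considered
    return table, best

# Toy score: each pattern costs 5 bits of model description, but every
# pattern of length >= 3 saves 20 bits of data encoding (illustrative).
def toy_dl(table):
    return 100 + sum(5 - (20 if len(p) >= 3 else 0) for p in table)

table, best = greedy_code_table_search(["ab", "abc", "x"], toy_dl, [])
assert table == ["abc"] and best == 85  # "ab" and "x" were rejected
```

Whether KG-MDL additionally prunes previously accepted patterns (as Krimp does in its post-pruning step) is exactly the point the review asks the authors to clarify.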

Another concern is that the search, performed in such a straightforward manner, appears inefficient. The initial code table is already very large, as it contains the singleton patterns corresponding to all the vertices and edges of the KG. Then, the number of combinations between each pair of existing patterns is also huge (Alg. 1, l6), while many of the combinations may turn out to be undesired. At some point, I assume it needs to decide whether two patterns in the code table are equivalent through graph isomorphism, and maybe even detect pattern redundancy through injective homomorphism. Moreover, the computation of description lengths after each new pattern is added to the code table (Alg. 1, l10) is rather expensive. I wonder how the algorithm could scale over medium-sized KGs. The only heuristic mentioned is used to select pattern combinations (Alg. 1, l6), where only a reference is provided. It is unclear to me whether such a heuristic has a major impact on the efficiency of the search algorithm.

The computation of description lengths is rather involved, but it is left unexplained what such a complex computation achieves. That is, the intuition behind the equations in Fig. 6 should be explained. For instance, I wonder why a simple count of the occurrences of symbols is insufficient. I am sure the design of the computation has much merit, which is based on the research on MDL principle. Yet, it should be explained for readers who are unfamiliar with MDL literature.

For the experiments, the choice of datasets could be further explained. It is unclear why some widely used datasets extracted from common KGs are not considered, such as FB15K237 from Freebase and NELL-995. Again, the numbers of patterns in Tab. 2 suggest that during the search, besides adding new patterns, some old patterns are removed. It would also be interesting to know the range of sizes of the individual patterns in terms of vertices and edges.

Overall, the idea of computing representative patterns of relatively small numbers and sizes is interesting, and the experiment results also look promising. Although the approach is based on GraphMDL+, the authors have clearly explained the novel contributions compared to GraphMDL+. The paper is generally easy to follow, but I have some doubts on the pattern search algorithm as mentioned above. Also, some clarification on the experiment design and results would be desired.

Some detailed comments are as follows:
- p1, l23, the power law may be worth a brief explanation when it is first mentioned.

- p1, l27, a "human-sized" set of patterns is mentioned several times but it does not sound right. It sounds like the size of a human being. Maybe "human-comprehensible"?

- p2, l6, it says existing graph mining algorithms were designed for small undirected graphs such as molecules. I do not think this claim holds for the significant amount of research on network mining, such as social networks.

- p6, l26, rdf:nil as subject => as an object.

- p6, l29-30, can this property of the converted KGs be stated formally?

- p7, l27, port vertices are undefined, and are easily confused with ports.

- p12, l26, "more than 80% of the labels are described by a pattern" => by the patterns