Hierarchical Blockmodelling for Knowledge Graphs

Tracking #: 3698-4912

Authors: 
Marcin Pietrasik
Marek Reformat
Anna Wilbik

Responsible editor: 
Maria Maleshkova

Submission type: 
Full Paper
Abstract: 
In this paper, we investigate the use of probabilistic graphical models, specifically stochastic blockmodels, for the purpose of hierarchical entity clustering on knowledge graphs. These models, seldom used in the Semantic Web community, decompose a graph into a set of probability distributions. The parameters of these distributions are then inferred allowing for their subsequent sampling to generate a random graph. In a non-parametric setting, this allows for the induction of hierarchical clusterings without prior constraints on the hierarchy's structure. Specifically, this is achieved by the integration of the Nested Chinese Restaurant Process as well as the Stick Breaking Process into the generative model. In this regard, we propose a model leveraging such integration and derive a collapsed Gibbs sampling scheme for its inference. To aid in understanding, we describe the steps in this derivation and provide an implementation for the sampler. We evaluate our model on synthetic and real-world datasets and quantitatively compare against benchmark models. We further evaluate our results qualitatively and find that our model is capable of inducing coherent cluster hierarchies in small scale settings. The work presented in this paper provides the first step for the further application of stochastic blockmodels for knowledge graphs on a larger scale. We conclude the paper with potential avenues for future work on more scalable inference schemes.
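As an illustrative aside, the Stick Breaking Process referred to above can be sketched in a few lines of Python. This is a generic sketch of the construction only, not the implementation accompanying the paper; the function name and the truncation level are assumptions made for the example.

    import numpy as np

    def stick_breaking_weights(alpha, truncation, rng=None):
        # Truncated stick-breaking (GEM) weights: each Beta(1, alpha) draw
        # breaks off a fraction of whatever remains of a unit-length stick.
        rng = np.random.default_rng() if rng is None else rng
        fractions = rng.beta(1.0, alpha, size=truncation)
        remaining = np.concatenate(([1.0], np.cumprod(1.0 - fractions[:-1])))
        return fractions * remaining  # sums to slightly less than 1 due to truncation

    # Example: ten weights drawn with concentration parameter 1.0
    weights = stick_breaking_weights(alpha=1.0, truncation=10)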
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Ioannis Dasoulas submitted on 23/Jul/2024
Suggestion:
Minor Revision
Review Comment:

This paper explores the use of stochastic blockmodeling for inferring a hierarchical entity clustering on knowledge graphs. The approach combines stochastic processes such as the nested Chinese Restaurant Process (nCRP) and the Stick Breaking Process to split entities into a tree-like structure of entity communities and to model the interactions between those communities, showing the relationships between clusters. The paper is well written, and I appreciated the diagrams and examples the authors use to explain their modeling approach and evaluation results.
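For readers unfamiliar with the nested Chinese Restaurant Process, its level-wise seating rule can be sketched as follows. This is only an illustrative sketch under a hypothetical counts data structure, not the authors' Matlab/C++ implementation.

    import numpy as np

    def sample_ncrp_path(counts, depth, gamma, rng=None):
        # Sample a root-to-leaf path of fixed depth with the nested CRP.
        # `counts` maps a node (a tuple of child indices from the root) to
        # the number of entities that have passed through it; at each level
        # an existing child is chosen proportionally to its count and a new
        # child proportionally to the concentration parameter gamma.
        rng = np.random.default_rng() if rng is None else rng
        path = ()
        for _ in range(depth):
            children = [n for n in counts
                        if len(n) == len(path) + 1 and n[:len(path)] == path]
            weights = np.array([counts[c] for c in children] + [gamma], dtype=float)
            choice = rng.choice(len(weights), p=weights / weights.sum())
            if choice == len(children):  # open a new child node
                child = path + (max((n[-1] for n in children), default=-1) + 1,)
                counts[child] = 0
            else:
                child = children[choice]
            counts[child] += 1
            path = child
        return path

    # Example: assign three entities to paths in a depth-3 tree
    counts, paths = {}, []
    for _ in range(3):
        paths.append(sample_ncrp_path(counts, depth=3, gamma=1.0))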

The total time complexity of the sampling procedure makes the approach infeasible for large knowledge graphs. The authors respond to this by modifying the collapsed Gibbs sampler they use and by proposing future work that could be leveraged for working with larger knowledge graphs.

The authors present quantitative evaluations, comparing the work with prominent baselines, and qualitative evaluations, viewing the inferred clusters as taxonomies and analyzing their quality by comparing them with their neighboring clusters. Their approach underperforms on the SBT dataset, which contains only two predicates and whose entity relations are drawn in higher proportions between neighboring communities at lower levels of the hierarchy. However, it shows good results on the Wikidata and FB15k-237 datasets, which have more predicates and for which clustering at a high level is simpler.

They show that their clustering depends heavily on the chosen set of hyper-parameters. In the context of the problem they are trying to address in an unsupervised manner, there is no clear way to adjust these hyper-parameters. They also observe conciseness problems, with different communities sharing similar entities; this depends a lot on the dataset and is crucial for the results obtained.

My main remarks regarding this work are the following:
1. Related work offers an analytical review of hierarchy induction models, which either discover subsumption axioms for KG classes or leverage embeddings. I would expect a comment comparing the two: when is the first approach better to follow, and when is the second? What do the authors expect to gain by going with the first option?
2. I would expect a short discussion about the possibility of the existence of literals in the knowledge graph. Is there a way that probabilistic graphical approaches like these can take literals into account as well? How could something like this be done? For example, embedding approaches can leverage literal values. Would they be better in knowledge graphs with literals?
3. Why was L=4 used in the quantitative evaluation? Is this enough to assess complex hierarchies for large datasets? How would a larger L affect the evaluations, in their opinion?
4. In the evaluation, I would expect a discussion of the authors' estimation as to why their approach underperforms on the SBT dataset but performs better, in comparison with the baselines, on the other two datasets. Is it because SBT has fewer predicates, or because of the higher density of predicates at lower-level communities? In general, I would expect a discussion of the kinds of datasets their approach is a better fit for.
5. In the GitHub repository, I would appreciate a more detailed description of what the repository contains, as well as more detailed installation commands and installation alternatives for C++/Boost without a Matlab license. Also, are additional or different configurations required for different operating systems?

6. The right side of Figure 1 seems wrong to me, since the t1 community is not associated with the r0 property.
7. Figure 2: The description mentions patron p6, who is then described as e6.
8. Figure 2: The table t2 is occupied, so the probability function given in the figure caption seems wrong (cf. the standard CRP seating probabilities quoted after this list).
9. Page 11: “Thus, in order generate…”. I assume it should be: “Thus, in order to generate…”
10. Page 21: “it is possible it perform…”. I assume it should be: “it is possible to perform…”
11. Page 22: “it necessary…”. I assume it should be: “it is necessary…”
12. Page 27: “[making]”. Word inside brackets.
13. Page 28: “advantage our mour”. Probably a mistake
14. Page 28: It would probably be better to change the font when mentioning entities like "nationality", or to use quotes. This applies to the full paper.
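For reference regarding point 8: under the standard Chinese Restaurant Process convention (which I assume the figure intends), the n-th arriving patron joins an already occupied table k or opens a new table with probabilities

\[
P(\text{join occupied table } k) = \frac{n_k}{n - 1 + \alpha},
\qquad
P(\text{open a new table}) = \frac{\alpha}{n - 1 + \alpha},
\]

where n_k is the number of patrons already seated at table k and \alpha is the concentration parameter. Since t2 is already occupied, the first case should apply to e6.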

Review #2
Anonymous submitted on 10/Oct/2024
Suggestion:
Minor Revision
Review Comment:

In their submission, 'Hierarchical Blockmodelling for Knowledge Graphs', the authors present a novel approach to capturing a knowledge graph's inherent structural properties and performing hierarchical clustering based on probabilistic graphical models.

I consider the topic relevant to the scope of the Semantic Web Journal. Furthermore, I think the authors' work is a valuable scientific contribution. The submitted manuscript is well-written and self-contained. However, to be self-contained, quite a few mathematical foundations needed to be introduced in detail, which made the manuscript quite lengthy and technical.

In terms of correctness, the manuscript reads plausibly. The applied techniques are well-motivated, and related work is discussed. However, I am not a mathematician and, thus, cannot reliably verify the correctness of every detail in the more extensive formulas, like Eqn. 16 onwards. The scalability issues of the proposed approach are discussed, and the authors plan to investigate modifications to the presented approach to remedy them. To get a better idea of its practicability, it would be nice to have the overall runtimes and hardware specifications as part of the evaluation results.

The Matlab/C++ source code to reproduce the experiments is provided. However, I could not test it in depth due to the lack of a proper Matlab/C++ development environment.

Minor issues and typos:

- Throughout the paper, 'WikiData' is used (camel case) instead of 'Wikidata'
- p. 3, l. 3: "The learning process is then infer" -> 'to infer'
- p. 5, l. 35: "Henry Ford's occupation is an engineer" -> 'is being an engineer'
- p. 6, l. 33: "the chance observing" -> 'chance of observing'
- p. 8, Caption Fig. 3: "dashed lines indicate indicate"
- p. 8, l. 25: "This principle become relevant" -> 'becomes'
- p. 8, l. 25/26: "when controlling for the branching factor" -> 'when controlling the branching factor'
- p. 9, l. 20: "hasn't" -> 'has not'
- p. 9, l. 25: "in the tree , the probability" -> remove space before comma
- p. 11, l. 19: "these community relations are modelled with respect to a predicate in the knowledge graph": I think this needs a bit more explanation at that part of the paper.
- p. 11, l. 21: "in order generate" -> 'in order to generate'
- p. 11, l. 41/42: "empty communities and removed" -> 'are removed'
- p. 12, l. 8: "it's" -> 'it is'
- p. 12, l. 41: "it's" -> 'it is'
- p. 16, l. 7: "Firtly" -> 'Firstly'
- p. 17, l. 50: "which a constant" -> 'which are constant'
- p. 18, l. 45: "it's" -> 'it is'
- p. 19, l. 44: "it's" -> 'it is'
- p. 21, l. 35: "Gamma forms the Beta function" -> 'of the Beta function'
- p. 21, l. 37: "it's" -> 'it is'
- p. 22, l. 15: "effect on model likelihood" -> 'on the model likelihood'
- p. 22, l. 26: "it necessary" -> 'it is necessary'
- p. 22, l. 51: should be a proper URL with https:// instead of the www.
- p. 25, l. 33: "it's" -> 'it is'
- p. 25, l. 43: "we note a dips": mix of singular and plural
- p. 26: Fig. 9 appears on p. 26 but is referenced on p. 28, after the references of Fig. 10 and Fig. 11; maybe reorder figures
- p. 27, l. 31: "isn't reflected" -> 'is not'
- p. 28, l. 32: "obtained by out method" -> 'our method'
- p. 28, l. 40: "there is no prior constraint on the this structure"
- p. 28, l. 46ff: maybe use \emph{} to highlight predicates and classes
- p. 28, l. 47: "advantage of our mour method"
- p. 28, l. 49: "Footballers" -> 'footballers'
- p. 28, l. 50: "than athlete" -> 'athletes'
- p. 28, l. 50: "Nations" -> 'nations'
- p. 29, l. 17: "presenting a novel and principled for qualitative": word(s) missing
- p. 29ff: Not all acronyms are in capital letters in the references (Icml, Rdf, ...)
- p. 31, l. 36/37: "Lecture notes for..."