MADLINK: Attentive Multihop and Entity Descriptions for Link Prediction in Knowledge Graphs

Tracking #: 2748-3962

Russa Biswas
Mehwish Alam
Harald Sack

Responsible editor: Dagmar Gromann

Submission type: Full Paper
Knowledge Graphs (KGs) comprise interlinked information in the form of entities and relations between them in a particular domain, and they provide the backbone for many applications. However, KGs are often incomplete, as links between entities are missing. Link Prediction is the task of predicting these missing links in a KG based on the existing ones. Recent years have witnessed many studies on link prediction using KG embeddings, which is one of the mainstream tasks in KG completion. To do so, most of the existing methods learn the latent representation of the entities and relations, whereas only a few of them also consider contextual information and the textual descriptions of the entities. This paper introduces an attentive encoder-decoder-based link prediction approach that considers both the structural information of the KG and the textual entity descriptions. A path selection method is used to encapsulate the contextual information of an entity in a KG. The model explores a bidirectional Gated Recurrent Unit (GRU) based encoder-decoder to learn the representation of the paths, whereas SBERT is used to generate the representation of the entity descriptions. The proposed approach outperforms most of the state-of-the-art models and achieves results comparable with the rest when evaluated on the FB15K, FB15K-237, WN18, WN18RR, and YAGO3-10 datasets.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 13/Apr/2021
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This work integrates the embeddings of entity text descriptions and paths into knowledge graph embedding.

This type of work is encouraged, especially as a way of exploring different methods to improve KGE; however, a good comparison to the state of the art is required. Moreover, for a fair evaluation, it is highly recommended that all methods release their code, or at least submit it as supplementary data.

It is not clear how the proposed path embeddings help in learning entity representations, since the method considers "any" paths that the entities lie on. In particular, in the context of the link prediction task, the head and tail are at a distance of 1, so most paths that include one of them will likely include the other. The method can, however, be useful for learning negative samples. DistMult is a comparatively old scoring function and has difficulty differentiating inverse relations; I suggest using a newer scoring function, which might improve the overall model.
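The DistMult limitation mentioned here can be illustrated with a minimal sketch (toy random vectors, not the paper's learned embeddings): DistMult scores a triple by summing the element-wise product of head, relation, and tail vectors, which is symmetric in head and tail, so (h, r, t) and (t, r, h) receive the same score.

```python
import numpy as np

def distmult_score(h, r, t):
    # DistMult: score(h, r, t) = sum_i h_i * r_i * t_i
    return float(np.sum(h * r * t))

rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 8))  # toy 8-dimensional embeddings

# The element-wise product is symmetric in h and t, so DistMult
# cannot distinguish a relation from its inverse direction.
print(np.isclose(distmult_score(h, r, t), distmult_score(t, r, h)))  # True
```

This symmetry is the formal reason behind newer scoring functions such as ComplEx or RotatE, which can model antisymmetric relations.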

There are other methods cited in the paper, and some not cited, such as "Attentive Path Combination for Knowledge Graph Completion", that use path learning and to which this work can relate. A discussion is required to differentiate this work from them. The same holds for embeddings using entity text descriptions (for example, the paper "Zero-Shot Entity Linking by Reading Entity Descriptions"). Their existence challenges the novelty of this paper; therefore, the similarities and differences must first be explained in order to understand the novelty of the work.

The related work and result sections are outdated. The reported results of RotatE are lower than those reported in its original paper. If that method is handicapped in the comparison, it should be explained how, and the reason must be made clear. Besides, other methods with better results are missing, for example QuatE (NeurIPS 2019), MDE (ECAI 2020), and TuckER (EMNLP 2019).

(3) The text does not convey the intended message properly. I would suggest removing redundant details, such as the definition of a path and the encoder-decoder mechanism, and focusing instead on how/where/why they improve the method.

Note for improving the text: r, R, and r_e are used in different places with different meanings throughout the paper. This is confusing. Other symbols could be used to explain the idea.

Section 4.1 fails to explain the mechanism of path selection; instead, it explains what a path is. After reading this section, it is still not clear how the paths are selected and what their properties are (length, criteria for selecting the start and end nodes, etc.).
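For concreteness, the properties this review asks about (start node, maximum length, termination) are exactly the parameters a path sampler would need to make explicit; the following random-walk sketch over a toy KG is purely hypothetical and is not the paper's actual selection procedure.

```python
import random

# Toy KG as adjacency lists: entity -> list of (relation, entity)
KG = {
    "Berlin": [("capitalOf", "Germany")],
    "Germany": [("memberOf", "EU"), ("hasCapital", "Berlin")],
    "EU": [("hasMember", "Germany")],
}

def sample_path(start, max_len, rng=random):
    # Random-walk path sampling: the start node, the maximum number
    # of hops, and the termination condition (dead end) are all
    # explicit parameters/choices here.
    path, node = [start], start
    for _ in range(max_len):
        edges = KG.get(node, [])
        if not edges:
            break
        rel, node = rng.choice(edges)
        path += [rel, node]
    return path

random.seed(0)
p = sample_path("Berlin", max_len=3)
print(p[0] == "Berlin")  # True
```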

Review #2
By Vit Novacek submitted on 12/Jul/2021
Minor Revision
Review Comment:

The paper presents an approach to link prediction and/or knowledge base completion based on a method for computing knowledge graph embeddings that uses not only single triples as "first class citizens" and de facto data points, but also entire paths in the input graph. This is assumed to provide embeddings that better reflect the structure of the graph. The embeddings are further augmented by incorporating representations of the textual information associated with the entities as well.

The paper might benefit from another polishing round to resolve the many minor language imperfections here and there, but overall it is well written and clearly presented. The validation involves several standard benchmarks and recent approaches, largely following the state-of-the-art methodology in this domain, and it shows improvement over related models. I could not find a link to the code in the paper, but the description of the method, design decisions, and hyper-parameters is rather extensive and should be a good basis for reproducibility.

My major critical remark is related to the evaluation: it appears that triples that do not contain entities with textual descriptions were removed from the experimental data, similarly to the DKRL model. This does not seem entirely correct to me; without further justification, it looks like adjusting the benchmark to the comparative strengths of the presented framework. In the revised version of this work, all experiments should either be performed in a manner that does not introduce a possible positive bias for the validated method, or it should be explained very precisely why the chosen approach does not provide an unfair advantage to the authors' method.

Another comment (not really specific to this paper but rather to the whole host of similar approaches) is related to the relevance of the presented evaluation: the fact that a tool outperforms others on the chosen standard (generic and rather artificial) benchmarks does not really say much about how good the tool might be in a practical scenario. Adding an actual predictive problem motivated by a realistic need, and validating the approach in that context, would have provided for a much stronger contribution, if possible at all. But this is not a remark that needs to be addressed to make this submission acceptable; it is more a general/personal complaint about the way things are done in this area of computer science research.

Review #3
By Bassem Makni submitted on 30/Jul/2021
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

The authors propose an approach for knowledge graph embedding that combines the contextual structural information and the textual descriptions of the entities. The motivation of the work is well established, and the paper is clearly written. However, the baseline experiments need to include more relevant approaches, and the ablation study needs to be more thorough.

For the baseline experiments, I encourage the authors to compare their approach with other KG embedding approaches that incorporate textual information, such as LiteralE [1] and SSP [2]. The only approaches incorporating textual information that they compare with are DKRL and Jointly (ALSTM), and the comparison does not span all benchmarks.

In the ablation study, it is not clear why there is a discrepancy between the results of MADLINK in Table 4 and Table 6. I assume that the results in Table 6 do not include the path information (as hinted in the last sentence of the paragraph "Impact of Text."). If this is the case, the caption of the table should clarify this.
More importantly, the authors set a hyperparameter for the length of the paths to 5. I would highly encourage the authors to study the impact of the path information by varying the length of the paths.

The description of the approach does not clearly explain how the set of paths (P1 ... Pn in Fig. 2) is ordered before being encoded and fed to the neural network model. Neither does it address the strategy used when the number of paths is less than the threshold set at 1000. I assume the authors randomly ordered the different paths, but this should be clearly stated.

I would highly encourage the authors to publish the code for their approach and experiments. It would even be better if the authors can implement their approach in one of the KG embedding frameworks such as PyKEEN, DGL-KE, GraphVite, or others in order to facilitate the comparison with other approaches.

Minor suggestions:
The authors could include a discussion about encoding the textual information of entities that have multiple textual literals, such as labels, descriptions, summaries, etc. Would a concatenation of the literals be enough, or do certain literals, such as the label, need to have higher weights?
Also, encoding the textual information of the relations using their labels could be added to the discussion/future work.

Minor comments:

"BERT outperforms most of the SOTA results for a wide variety of tasks [36]." This statement is outdated.

"The transformer encoder reads the entire sequence of words at once which allows the model to learn the context of a word based on its surroundings, whereas the other models read the input sequentially." This statement is partially correct. Some models, like word2vec (skip-gram), which read the input sequentially, can also learn the context of a word based on its surroundings.

"the nodes marked in green would have greater attentions than the ones marked in yellow" I would encourage the authors to visualize the attention weights for this example to confirm this claim.

Missing references:
[1] Incorporating Literals into Knowledge Graph Embeddings
Agustinus Kristiadi, Mohammad Asif Khan, Denis Lukovnikov, Jens Lehmann, Asja Fischer

[2] SSP: Semantic Space Projection for Knowledge Graph Embedding with Text Descriptions
Han Xiao, Minlie Huang, Lian Meng, Xiaoyan Zhu

[3] Utilizing Textual Information in Knowledge Graph Embedding: A Survey of Methods and Applications



Page 1:

Abstract: comprise of -> comprise
whereas -> , whereas

39: growing containing -> growing, containing
42: inter-connectivity -> interconnectivity
43: LOD -> the LOD
35: manually-curated -> manually curated
48: To-date -> To date,

Page 3:
20: whereas achieves -> , whereas it achieves

structure based representation -> structure-based representation
description based representation -> description-based representation

Page 4:
23: aforementioned models -> aforementioned models,
40: represents relation -> represents the relation
45: comprise of textual -> comprise textual

Page 5:
36: loose -> lose
36: domain specific -> domain-specific
37: fine tuned -> fine-tuned

Page 6:
25: fixed length -> fixed-length
25: called as -> called a

Page 8:
49: is -> are

Page 9:
Link Prediction can be defined by a mapping function which -> Link Prediction can be defined as a mapping function that
Page 10:
18: is already -> are already

Page 12:
34: very less information -> much less information

34: Therefore, the two research questions addressed in Section 3 is tackled as the use of contextual information plays a vital role in the link prediction task. -> This whole sentence is not clear and needs to be reformulated.

Review #4
Anonymous submitted on 03/Aug/2021
Major Revision
Review Comment:

The authors present MADLINK, an encoder-decoder-based approach with attention for link prediction, which considers both structural and textual information for learning entity representations. The structural information, in the form of paths, is integrated via a GRU-based seq2seq model, while the textual information is encoded using SBERT. Both vectors are concatenated and fed, together with the learned relation vectors, into a DistMult scoring function. Experiments are conducted on the standard benchmark datasets (and subsets thereof) Freebase, WordNet, and YAGO, and the results show comparable or superior performance with respect to baseline methods.
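As a schematic illustration of the data flow just summarized (the dimensions, the projection matrix, and the use of plain concatenation are assumptions made for this sketch, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 8  # toy embedding dimension (illustrative)

def entity_vector(path_vec, desc_vec, W):
    # Concatenate the structural part (path encoder output) with the
    # textual part (SBERT description embedding), then project back
    # to DIM so the DistMult product is well defined.
    return W @ np.concatenate([path_vec, desc_vec])

def distmult(h, r, t):
    # Triple plausibility score: sum of the element-wise product.
    return float(np.sum(h * r * t))

W = rng.normal(size=(DIM, 2 * DIM))  # shared projection (assumed)
head = entity_vector(rng.normal(size=DIM), rng.normal(size=DIM), W)
tail = entity_vector(rng.normal(size=DIM), rng.normal(size=DIM), W)
rel = rng.normal(size=DIM)           # learned relation vector

score = distmult(head, rel, tail)
print(isinstance(score, float))  # True
```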

Most existing methods only consider 1-hop or n-hop information from a KG but do not include information from textual descriptions. However, there already exist some approaches that take multimodal data (such as text, images, dates, geometries) into account for learning vector representations. MADLINK combines existing concepts and algorithms (seq2seq, SBERT, attention layer, and DistMult) to form a new method for link prediction. Here, paths in the KG are considered as sentences, which serve as input to the seq2seq-based model.

Significance of the results:
The results for the tasks link prediction and triple classification show comparable or superior performance of MADLINK compared to baseline methods. However, not all results are available for the other baselines, especially the two methods DKRL and Jointly (ALSTM), which also make use of textual entity descriptions.
It should be discussed in more detail to what extent the research questions posed in Section 3 have been answered by the experiments. Since the other methods also use contextual information (sometimes implicitly rather than explicitly incorporating n-hop neighbors), how can we evaluate the influence of contextual information in MADLINK (see RQ1)?
To answer RQ2, an ablation study is shown in Table 6, which only compares MADLINK with DistMult, where all results are already included in Table 4. It cannot be judged whether the improvement comes from the textual descriptions or from the overall different architecture in MADLINK. It would be necessary to compare MADLINK without the description vector and MADLINK with the description vector to make a more accurate statement about the impact of the text information.

Quality of writing:
The writing is mostly clear, and the paper is well-structured. Some sentences could benefit from paraphrasing for better readability, and the punctuation should be checked again (e.g., comma before “respectively” or after introductory phrases).
Some comments on writing:
- The title could be slightly reformulated to include a noun after “multihop”, which is an adjective, or changed to “Attentive Multihop Entity Descriptions …”?
- In some cases (when the word + "based" forms an adjective), there should be a hyphen before "based", e.g., "translation-based", "attention-based". This also applies to other two-word adjectives.
- “h” is used for head (entity), distance to neighbors (h-hop), hidden states, and hidden layers, which might be confusing.
- The capitalization of the captions (figures/tables) is not consistent, also sometimes in the text (e.g., “link Prediction” on p.9 line 46 and “Link Prediction” on p.9 line 50).
- p.9 second column: MRR is the average of the reciprocal ranks of the “correct” entities, not the “predicted”.
- The use of American English and British English is mixed, e.g., optimize/optimise, vectorise, initialise/initialize.
- It is recommended to integrate equations (e.g., equations (1)–(3)) into the text to improve the reading flow.
- p.5. line 36.: loose -> lose
- It would be nice to include the notation (x_t, H, etc.) in Fig. 2.
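Regarding the MRR remark in the list above: MRR averages the reciprocal ranks of the correct entities over all test queries, as in this small sketch (the ranks are illustrative values):

```python
def mean_reciprocal_rank(ranks):
    # ranks: 1-based rank of the *correct* entity for each test query
    return sum(1.0 / r for r in ranks) / len(ranks)

# e.g. the correct entity is ranked 1st, 2nd, and 4th in three queries:
print(mean_reciprocal_rank([1, 2, 4]))  # ≈ 0.583
```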

Further comments:
- There is no code provided for the implementation of the method. For better reproducibility, it is recommended to add a reference to the implementation.
- It is stated that very few methods for link prediction take neighborhood information into account (p.2 line 23). The area of graph neural networks (GNNs) is not mentioned at all, even though GNNs can capture relational information via spectral decomposition or message passing over several hops. In particular, methods like R-GCN (Schlichtkrull et al. – Modeling Relational Data with Graph Convolutional Networks. ESWC 2018) and CompGCN (Vashishth et al. – Composition-based Multi-Relational Graph Convolutional Networks. ICLR 2020) have been specifically developed for KGs. These methods should be included in the related work section and preferably in the experiments as well.
- How are semantic-based (p.3 line 30) and structure-based (p.3. line 36) models defined?
- p.9 line 22: It is stated that all entities are implicitly given equal importance because “the number of properties and the paths are the same for all of them”. Why are the number of properties and the paths the same for all entities?
- p.10 line 19: As an explanation for why MADLINK performs better on datasets without inverse relations, it is stated that only “directed paths are considered”. Are the inverse relations not also part of directed paths?

A revised version should address the following points:

- Revision of the text for better readability, correct punctuation/spelling, and clearer notation (see comments on quality of writing)
- Provision of source code implementation
- Complete experimental results for the two methods DKRL and Jointly (ALSTM), which also make use of textual entity descriptions, for all datasets and tasks (right now, only one out of five link prediction experiments contains results for both methods), or a comprehensible explanation of why these experiments were not conducted
- Specific answers and discussions for RQ1 and RQ2 from Section 3; otherwise, the research questions could also be reformulated to fit the experiments and results
- RQ1: How can we evaluate the influence of contextual information in MADLINK? How do the experiments answer RQ1?
- RQ2: The ablation study is shown in Table 6, which only compares MADLINK with DistMult, where all results are already included in Table 4. It cannot be judged whether the improvement comes from the textual descriptions or from the overall different architecture in MADLINK. It would be necessary to compare MADLINK without the description vector and MADLINK with the description vector to make a more accurate statement about the impact of the text information.
- Inclusion of the area of (relational) graph neural networks in the related work section
- Inclusion of at least one (relational) graph neural network method in the experiments (e.g., R-GCN, CompGCN; CompGCN seems to perform slightly better than MADLINK), or an explanation of why these experiments were not conducted
- Some clarifications with respect to the content (see further comments)