End-to-End Learning on Multimodal Knowledge Graphs

Tracking #: 2727-3941

Authors: 
Xander Wilcke
Peter Bloem
Victor de Boer
Rein van 't Veer

Responsible editor: 
Guest Editors DeepL4KGs 2021

Submission type: 
Full Paper
Abstract: 
Knowledge graphs enable data scientists to learn end-to-end on heterogeneous knowledge. However, most end-to-end models solely learn from the relational information encoded in graphs' structure: raw values, encoded as literal nodes, are either omitted completely or treated as regular nodes without consideration for their values. In either case we lose potentially relevant information which could have otherwise been exploited by our learning methods. We propose a multimodal message passing network which not only learns end-to-end from the structure of graphs, but also from their possibly divers set of multimodal node features. Our model uses dedicated (neural) encoders to naturally learn embeddings for node features belonging to five different types of modalities, including numbers, texts, dates, images and geometries, which are projected into a joint representation space together with their relational information. We implement and demonstrate our model on node classification and link prediction for artificial and real-worlds datasets, and evaluate the effect that each modality has on the overall performance in an inverse ablation study. Our results indicate that end-to-end multimodal learning from any arbitrary knowledge graph is indeed possible, and that including multimodal information can significantly affect performance, but that much depends on the characteristics of the data.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Dagmar Gromann submitted on 11/Apr/2021
Suggestion:
Major Revision
Review Comment:

This article proposes a multimodal message passing network to learn embeddings from knowledge graphs, a method that considers their multimodal node features instead of only relying on relational information. The five types of modalities considered are numbers, text, dates, images, and geometries that are represented in a single space together with relational information. By considering all these different types of information in embedding training, this approach seeks to increase information types available for individual entities. It is tested on a synthetically generated dataset as well as five available multimodal datasets on node classification and link prediction.

Overall evaluation:
This paper presents an interesting study on multimodal knowledge graph embeddings, considering several datasets and two tasks. Both the originality of the idea and the quality of writing are definitely high, even though the structure could be improved, particularly in terms of the presentation of tables and results. While it is highly appreciated that the significance of results is supported by proper statistical significance tests, the results show such radical differences regarding the impact of modalities on the analyzed tasks (node classification and link prediction) that detailed analyses of all factors would be required. As it is now, little discussion or detailed analysis on this point is offered. Instead, general assumptions with little or no evidence are offered. These differences could be due to a large number of factors, including the chosen architecture, the encoding strategy, or an optimization procedure built on a potentially biased synthetic dataset. Some experiments, or at least a detailed manual analysis of results, would be needed to allow any conclusions about different types of modalities and to ensure that the differences are not caused by other factors. Comparisons to other, baseline models generally also help with this process.

Baseline:
In order to truly extol the virtues of this idea, it would be necessary to prove that multimodal information indeed provides an improvement in performance over considering structural information alone. Nevertheless, the authors argue against the idea of a baseline but still claim a "better performance", which in fact cannot be supported by the results. In other words, I am not convinced by the argument that the model should explicitly not be compared against SOTA or other models. In my view, a better performance can only be established against some reasonable baseline that should ideally correspond to state-of-the-art approaches. Such a baseline also has the effect that differences across datasets can be clearly attributed to the datasets if they are observed across models.

Architectural choices:
- The statement that more than two R-GCN layers do not improve performance could benefit from references where this has been shown.
- I am not entirely sure how to read the embedding matrix in Figure 2. In the text it is stated that the rows of the embedding matrix represent the multimodal embeddings; however, in the figure this seems not to be the case. Should it be read as transposed?
- Do I understand correctly that numerical information would, by definition, always result in an embedding dimensionality of 1, since the normalized values are treated as embeddings? How does this compare to other approaches, e.g. [11], where a vector of the same dimensionality as the entity embedding is fused with the latter? (A sketch of this contrast follows after this list.)
- In fact, I am wondering how this difference in dimensionalities might impact the performance of individual modalities. Have any experiments in this direction been performed?
- If I understood this approach correctly, the model is trained on an input matrix where the input vectors of all modalities are concatenated for each entity. Why was this decision taken over first training embeddings individually and then fusing them? Also, how does this approach allow for a fair comparison, given that pretrained embeddings were used for visual information but not for any other modality?
- In reference to textual information, I am wondering why the standard BPE encoding method was not considered. It has been successfully applied in many multilingual large language models, but instead a CNN with character-level encoding is used.
- Why does it make sense for this approach to train from scratch rather than re-use amply available large pretrained networks for textual information? For instance, for textual information XLM-R has shown remarkable performance across languages, domains and tasks. It seems that this route has been taken for visual information, but not for text. The use of language models is briefly addressed in the conclusion but merely as future work.
- Is the connection between a literal and an entity in the graph treated as a relation with its own relation embedding, or does the relation-embedding training only consider entity-entity links?
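To make the dimensionality question above concrete, the following sketch contrasts the two options as I read them; all names and dimensions are hypothetical, and this is not the authors' implementation:

    import torch
    import torch.nn as nn

    EMB_DIM = 128  # hypothetical entity-embedding dimensionality

    # Option A (my reading of the paper): the normalized numeric literal is
    # used directly, i.e. a 1-dimensional embedding that ends up concatenated
    # next to much wider embeddings from other modalities.
    def encode_number_as_is(x: torch.Tensor) -> torch.Tensor:
        return x.unsqueeze(-1)  # shape: (batch, 1)

    # Option B (LiteralE-style, cf. [11]): project the scalar to the entity
    # embedding dimensionality and fuse the two, e.g. with a learned gate.
    class NumericLiteralFusion(nn.Module):
        def __init__(self, emb_dim: int = EMB_DIM):
            super().__init__()
            self.project = nn.Linear(1, emb_dim)
            self.gate = nn.Linear(2 * emb_dim, emb_dim)

        def forward(self, entity_emb: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
            lit = torch.tanh(self.project(x.unsqueeze(-1)))                # (batch, emb_dim)
            g = torch.sigmoid(self.gate(torch.cat([entity_emb, lit], dim=-1)))
            return g * lit + (1.0 - g) * entity_emb                        # fused (batch, emb_dim)

In Option A the model effectively sees a single extra feature column per numeric literal, which is what makes me wonder about the relative impact of the modalities' dimensionalities.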

Experiments:
- How was the configuration of the training epochs finalized? Was it a random number, or is there a specific rationale behind choosing 400 and 1,000 epochs, respectively? How does this setting ensure optimized training? Since this is even brought up in the discussion as a factor that might influence the results, it would be beneficial to include a rationale for this decision, or to improve on it by finding a well-defined number with standard methods.
- It would be interesting to also report on the validation performance, as this becomes increasingly common and generally offers interesting insights into the training procedure and model performance, as well as potentially further information on the differences between datasets.

Results:
- The way that the results are presented right now requires a considerable effort in terms of scrolling back and forth and finding the right table. This is partly because the tables are not positioned where they are referenced in the text and are mostly spread across a whole page, where the orientation also changes in the Appendix from one page to the next (and the appendix is merged with the bibliography). I propose merging the split/merged results into a single table to reduce the number of tables, and including all tables in the text similarly to Tables 10 and 11. It would also be beneficial to highlight the best results in each column or for each modality.
- The results suggest that the impact of specific modalities varies greatly across datasets and tasks. It almost seems as if the choice of (not) considering them should be part of the training and optimization procedure, leaving this choice to the network. However, given the way the information is currently encoded, with concatenated embeddings of strongly varying dimensionality (some pretrained, some not), it is doubtful that this option is given to the optimization procedure. In the discussion, the difference between synthetic and real-world datasets in terms of the negative impact of individual modalities is clearly addressed and attributed to the type of network chosen. However, no evidence for this assumption is presented, and it is questionable whether this difference might not stem from a substantial bias in the synthetic dataset or from the encoding strategy. Further investigations on this point would be extremely interesting and might have been supported by including SOTA baselines.
- The differences between modalities can also be observed even when the model has been fine-tuned in terms of hyperparameter settings for a specific task and dataset. As it stands, the paper presents more information on the differences between datasets and other factors than between modalities, since the hyperparameters and training parameters (esp. the number of epochs) are relatively unclear and do not warrant the conclusions presented.

Minor comments in order of appearance:
real-worlds datasets => real-world datasets
real-worlds knowledge graphs => real-world knowledge graphs
but rather than have completely => having
example of knowledge graph => a knowledge graph
a non-linearity activation function => non-linear
dataset—AIFB+—contained => dataset AIFB+ contained

Formatting:
- Tables and information in the appendix and/or that did not fit on the page should not be placed in the middle of the bibliography.
- Figure 2 is not anchored in the text, i.e., never referenced.
- Also Table 3 is not mentioned in the text.

Review #2
Anonymous submitted on 28/Apr/2021
Suggestion:
Minor Revision
Review Comment:

The paper proposes a multimodal R-GCN (MR-GCN) based on message passing, which makes use of information from several different modalities that are present in knowledge graphs (KGs). The modalities include numerical, temporal, textual, visual, and spatial features, which are, together with the relational information, encoded jointly as node embeddings. The method is applied to node classification and link prediction tasks on both synthetic and real-world datasets. The experiments and ablation studies show that the performance depends on the specific dataset, but it is possible to improve performance by including multimodal data. The implementation of the method is publicly available, and details are given to reproduce the experiments.

Originality:
The paper focuses on end-to-end learning that uses as much raw or original knowledge as possible. The motivation to include multimodal information is described in the introduction and supported by examples. The topic is relevant since it makes sense to exploit as much information as possible for prediction.
Related work, especially work on KG embedding methods that consider multimodal information, is presented in detail. The idea of including multimodal information as features is not new, and many methods already exist. The authors also use existing methods (neural encoders, e.g., (temporal) CNNs) for encoding the modalities, which are then concatenated to form the node representations. The decoder for link prediction is a DistMult model.
The main novelty consists of the inclusion of several additional modalities and the extraction of information about modalities from the graph itself rather than from external sources.
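For reference, DistMult scores a candidate triple by a trilinear product of the subject, relation, and object embeddings; in this setting the subject and object embeddings would be the multimodal node representations produced by the encoder. A minimal sketch with illustrative names:

    import torch

    def distmult_score(subj_emb: torch.Tensor,
                       rel_emb: torch.Tensor,
                       obj_emb: torch.Tensor) -> torch.Tensor:
        # <e_s, w_r, e_o> = sum_i e_s[i] * w_r[i] * e_o[i]
        return (subj_emb * rel_emb * obj_emb).sum(dim=-1)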

Significance of the results:
The authors conduct extensive experiments on both synthetic and real-world datasets and show that the results are mostly statistically significant. The results do not show a consistent pattern for the application of the MR-GCN in different settings (split/merged, inclusion of single/all/no modalities) but rather depend on the specific properties of the datasets, the modalities, or even the tasks (node classification and link prediction). Even though it is not clear which setting is optimal for a specific problem, the discussion is elaborate and provides possible explanations and solutions. It would be nice to see the performance of some state-of-the-art methods, but the current results could serve as a solid basis for further investigation of multimodal KGs.

Quality of writing:
The paper is overall well-structured and clearly written. Some comments about clarity and minor issues can be found below.

Comments:
(There are no lines available in the paper, so I will only refer to page and left/right column.)
- Introduction: The proposed method should provide end-to-end learning (the title also includes “end-to-end learning”). On p.2 left, it is stated that transforming data to be represented as a KG is a natural first step in an end-to-end learning pipeline. However, this transformation does not seem to be part of the proposed method, which consumes data already represented as a knowledge graph.
- Fig.1: On p.5, it is stated that there are two entities in the graph. Which are the two entities?
- Section 4: In the first paragraph, three feature encoders f, g, and h are introduced but not mentioned again. The subsequent feature encoders are always called f_{…}. It is difficult to connect these three encoders and thus also Fig. 2 to the five modalities.
- Equations (1) and (5) could be stated directly in a more general form (H^{i+1} = …), with H^1 given as a special case; a possible general form is sketched after this list. The paragraph below Eq. (5) is a little confusing, and it is not immediately clear why A^r H_l W^r_l = 0. Also, in my opinion, Fig. 3 is not helpful for understanding Eq. (5) and could be left out.
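For illustration, one possible general statement, written in standard R-GCN notation (my notation and normalization, which may differ from the paper's):

    H^{(i+1)} = \sigma\Big( \sum_{r \in \mathcal{R}} \hat{A}^{r} H^{(i)} W^{(i)}_{r} \Big),
    \qquad
    H^{(1)} = \sigma\Big( \sum_{r \in \mathcal{R}} \hat{A}^{r} X W^{(0)}_{r} \Big),

where \hat{A}^{r} is the (normalized) adjacency matrix of relation r and X holds the input node features. Stated this way, it would also be explicit when a term \hat{A}^{r} H^{(i)} W^{(i)}_{r} vanishes, e.g. when the adjacency matrix of relation r contains only zeros.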

Minor issues:
- “real-worlds” -> “real-world”
- p.1 left: “has lead” -> “has led”
- It is usually considered bad style to have only one subsection (2.1, 4.1, 5.1, 8.1)
- Fig. 5: “with a class” -> “within a class”?
- p.11 left: “link predication” -> “link prediction”
- p.13 left: “each modalities affects the performance roughly similar” -> “each modality affects the performance roughly similarly”
- Several commas are missing after introductory phrases (e.g., “For each configuration[,] we”, “When comparing the performance gain per modality[,] it”, “Similar to the classification results[,] …”).

Review #3
Anonymous submitted on 19/May/2021
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing. Please also assess the data file provided by the authors under “Long-term stable URL for resources”. In particular, assess (A) whether the data file is well organized and in particular contains a README file which makes it easy for you to assess the data, (B) whether the provided resources appear to be complete for replication of experiments, and if not, why, (C) whether the chosen repository, if it is not GitHub, Figshare or Zenodo, is appropriate for long-term repository discoverability, and (D) whether the provided data artifacts are complete. Please refer to the reviewer instructions and the FAQ for further information.

The authors of the paper present an end-to-end multimodal message passing model for knowledge graphs. The idea is to build on previous models published from graph deep learning researchers in order to be able to process multimodal node features.

The paper is easy to read and understand. However, at the end of the Introduction section, I was not sure what the contribution was and whether it was novel enough for a publication at the SWJ. The idea of multimodal message passing seems more of an engineering / neural-architecture problem than a pure research one. In a nutshell, message passing can be done with multimodal data as long as each message is properly encoded, for instance through an MLP (or any SOTA architecture for images / text), to output a fixed-size vector for each node.
This is what the authors present here for the different media types (images, text, etc.). So what is the real novelty here? (A minimal sketch of this pattern follows below.)
Furthermore, the end of the Introduction section is vague: "by including as much of the original knowledge as possible, in as natural of a fashion as possible, we can, in certain cases, help our models obtain a better overall performance". This sentence is not clear about the motivation; we already know that adding more signal to a GCN can provide better classification / link prediction results. Point 2 seems to rephrase point 1 to some extent. I am also unconvinced by the experimental section, which avoided any baseline comparison with a SOTA approach (some unimodal approaches could easily be extended for a fair comparison).
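To make the point concrete, here is a minimal sketch of that pattern (hypothetical module names and dimensions, not the authors' implementation): each literal is pushed through a modality-specific encoder into a fixed-size vector, after which an ordinary message-passing layer no longer needs to know which modality the vector came from.

    import torch
    import torch.nn as nn

    class MultimodalNodeEncoder(nn.Module):
        """Map raw node features of any modality to a fixed-size vector."""

        def __init__(self, out_dim: int = 128):
            super().__init__()
            self.num_enc = nn.Linear(1, out_dim)                              # scalar literals
            self.text_enc = nn.Sequential(nn.LazyLinear(out_dim), nn.ReLU())  # e.g. bag of characters
            self.img_enc = nn.Sequential(nn.LazyLinear(out_dim), nn.ReLU())   # e.g. pretrained CNN features

        def forward(self, modality: str, x: torch.Tensor) -> torch.Tensor:
            if modality == "numeric":
                return self.num_enc(x.unsqueeze(-1))
            if modality == "text":
                return self.text_enc(x)
            return self.img_enc(x)  # any encoder works, as long as out_dim is fixed

Any standard GCN/R-GCN layer can then be run on top of the resulting node matrix, which is why the architectural contribution is, by itself, hard to assess without baselines.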

Detailed comments:

* Abstract: typo: "possibly divers set of"
* Source code availability is very helpful! Thanks for sharing, but the dataset link returns a 404 (https://gitlab.com/wxwilcke/mmkg), so I could not run any test.
* The related work does not really present any work on graph deep learning and focuses solely on knowledge graph embeddings etc.
* From the end of the related work, it seems that several approaches are already very similar to what is proposed in this paper ([16, 17]); maybe the authors can add a subsection explaining the differences and the caveats of these approaches.

* In Subsection 4.1.2, "we limit our domain to years between −9999 and 9999". What about information related to older periods (e.g., the Jurassic)?
* There is no discussion of the loss functions, or at least of auxiliary loss functions, for the different modality encoders.
* Character-level representation: why not use more classical state-of-the-art approaches, for instance Word2vec or GloVe embeddings on the words (these are available for multiple languages), and fall back to the character-level representation only when a word is bogus or unknown (see the first sketch after this list)?
* Spatial information: what about the order of the points? Points do not have to be ordered the same way to represent similar shapes / polygons, so the correct representation should take a set of points as input; many approaches have been presented recently to handle this, especially in computer vision for point clouds ("PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation", Qi et al.). A sketch of such an order-invariant encoder follows after this list.
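A sketch of the fallback scheme suggested above (hypothetical names; assumes pretrained word vectors are available as a plain dictionary): look each token up in the pretrained table and drop to the character-level encoder only for out-of-vocabulary tokens.

    import torch

    def embed_tokens(tokens, word_vectors, char_encoder):
        # word_vectors: dict of pretrained vectors (e.g. GloVe); char_encoder: any
        # module mapping a character string to a vector of the same dimensionality
        out = []
        for tok in tokens:
            if tok in word_vectors:
                out.append(torch.as_tensor(word_vectors[tok], dtype=torch.float))
            else:
                out.append(char_encoder(tok))  # fallback for unknown or bogus tokens
        return torch.stack(out)                # (num_tokens, dim)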
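And a minimal sketch of the order-invariant point-set encoding referred to above (PointNet-style, with hypothetical dimensions): a shared per-point MLP followed by a symmetric max-pool, so the output does not depend on the order in which the points are listed.

    import torch
    import torch.nn as nn

    class PointSetEncoder(nn.Module):
        def __init__(self, out_dim: int = 128):
            super().__init__()
            # the same MLP is applied to every (x, y) point independently
            self.point_mlp = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                                           nn.Linear(64, out_dim))

        def forward(self, points: torch.Tensor) -> torch.Tensor:
            # points: (num_points, 2); max-pooling is symmetric, hence order-invariant
            return self.point_mlp(points).max(dim=0).values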

* The experiments do not provide any comparison with other approaches and are basically an ablation study with no definitive results, as stated by the authors: "there appears to exist no discernible pattern in the performances amongst modalities".