Weight-aware Tasks for Evaluating Knowledge Graph Embeddings

Tracking #: 3320-4534

Weikun Kong
Xin Liu
Teeradaj Racharak
Guanqun Sun
Qiang Ma
Le-Minh Nguyen

Responsible editor: Claudia d'Amato

Submission type: Full Paper
Knowledge graph embeddings are widely used together with deep learning to solve many problems, such as natural language understanding and named entity recognition. The quality of knowledge graph embeddings strongly affects the performance of models on many knowledge-involved tasks. Link prediction (LP) and triple classification (TC) are widely adopted to evaluate the performance of knowledge graph embeddings. Link prediction predicts the missing entity that completes a triple, which represents a fact in a knowledge graph, while triple classification determines whether an unknown triple is true or not. Both link prediction and triple classification intuitively reflect the performance of a knowledge graph embedding model, but they treat every triple equally and therefore cannot evaluate embedding models on knowledge graphs that provide weight information on the triples. As a consequence, this paper introduces two weight-aware extensions of LP and TC, called weight-aware link prediction (WaLP) and weight-aware triple classification (WaTC), respectively, aiming to better evaluate the performance of embedding models on weighted knowledge graphs. WaLP and WaTC emphasize the ability of the embeddings to predict and classify triples with high weights, respectively. Lastly, we respond to the newly introduced tasks by proposing a general method, WaExt, to extend existing knowledge graph embedding models to weight-aware extensions. We test WaExt on four knowledge graph embedding models, achieving competitive performance compared with the baselines. The code is available at: https://github.com/Diison/WaExt.

Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 26/Feb/2023
Major Revision
Review Comment:

The paper addresses the problem of training and evaluating graph embedding models for weighted knowledge graphs, in which different edges can have different importance or reliability weights assigned. To deal with this problem, the authors define two extensions of typical evaluation tasks for graph embedding models: weight-aware link prediction and weight-aware triple classification. These extensions assign different contributions to predicted edges based on the edge weights. To handle such weight-aware tasks, the paper proposes weight-aware modifications to 4 commonly used embedding models: TransE, TransH, ComplEx, and DistMult. In the evaluation experiments, these extended models are compared to the baseline ones and an improvement in performance is shown on both tasks.

Overall, the paper targets a relevant problem which often occurs in practical scenarios, where different edges either have different importance or different reliability (e.g., for graphs constructed by information extraction from text). The evaluation tasks and method extensions described in the paper are potentially useful in tackling such problems. Evaluation experiments also appear to support the claims and show better performance of weight-aware model extensions.

However, the added value of the proposed model extensions over the state of the art is neither sufficiently discussed in the paper nor supported by experiments. The proposed extensions to the task definitions and models appear rather straightforward. Moreover, the topic of weight-aware knowledge graph embeddings is not completely novel: other approaches already exist for training graph embeddings that take edge weights into account, e.g., [1], [2], [3], [4] (the latter proposes to learn the edge weights from graph content, but these can be pre-defined in the tasks considered in the paper). The experiments, in contrast, only compare the proposed methods to the baseline version of each model. To demonstrate the added value of the proposed weight-aware extensions, the paper needs more discussion comparing the proposed models with other weight-aware embedding methods, as well as an experimental comparison with the state of the art (or a discussion of why existing methods are not compatible with the tasks being considered).


Relevant topic and potentially useful model extensions, but more comparison with state of the art is needed to assess the added value.

p.3: "based model" -> "base model"

1. Nayyeri et al., Link Prediction of Weighted Triples for Knowledge Graph Completion Within the Scholarly Domain. IEEE Access 9, 116002-116014, 2021.
2. Seo, M., Lee, K. Y., A Graph Embedding Technique for Weighted Graphs Based on LSTM Autoencoders. Journal of Information Processing Systems 16(6), 1407-1423, 2020.
3. Wu et al., ProbWalk: A Random Walk Approach in Weighted Graph Embedding. Procedia Computer Science 183(1), 683-689.
4. Mai, G., Janowicz, K., Yan, B., Support and Centrality: Learning Weights for Knowledge Graph Embedding Models. EKAW 2018.

Review #2
By Heiko Paulheim submitted on 16/Mar/2023
Major Revision
Review Comment:

The authors claim that current knowledge graph embedding evaluations are limited by treating each triple equally. They propose weight-aware variants of well-known evaluation metrics, as well as weight-aware adaptations of four common embedding models, i.e., TransE, TransH, DistMult, and ComplEx.

Respecting weights in knowledge graphs seems to be useful per se, and it is a bit surprising that this topic has not caught much attention so far.

My main concern about the paper is that the results do not really support the original motivation. The authors state that current, non-weight-aware evaluation protocols are not sufficient, and that weight-aware evaluation protocols bring additional insights. However, when looking at the evaluation results presented in Tables 2 and 3, the ranking of approaches stays almost stable across all evaluations (e.g., for CN15K, ComplEx is on par with TransE, DistMult follows closely, and TransH is much worse). This is observed over all datasets in Tables 2 and 3 and also in Table 4. On the other hand, there seems to be a considerable gain from respecting weights during training when evaluating on non-weighted standard metrics, so this is maybe the more interesting finding to focus on.

One thing I struggle with is the introduction of the activation function g. In my opinion, it is never clearly motivated why simply using the weights as they are (i.e., g(x)=x) should be inferior. This should be better motivated and also be included as a baseline. Moreover, the chosen activation functions lead to unintuitive final metrics, like F1 scores above 1, which can be very confusing. In evaluations, it would be better to have weighted evaluation metrics that are on the common ranges (i.e., between 0 and 1 for precision, recall, and F1).
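To make the request for metrics on the common ranges concrete, here is a minimal sketch of one possible normalization. It is not the definition used in the paper: the function name `weighted_prf`, the toy labels and weights, and the assumption that every evaluated triple (including negatives) carries a non-negative weight are all hypothetical. Each triple contributes g(weight) to the true-positive, false-positive, or false-negative mass, so precision, recall, and F1 remain ratios in [0, 1] for any non-negative g, including the identity g(x)=x.

```python
from typing import Callable, Sequence, Tuple


def weighted_prf(
    labels: Sequence[bool],
    preds: Sequence[bool],
    weights: Sequence[float],
    g: Callable[[float], float] = lambda w: w,  # identity baseline g(x) = x
) -> Tuple[float, float, float]:
    """Weighted precision, recall, and F1 where each triple counts g(weight).

    Hypothetical formulation for illustration only; assumes g(w) >= 0.
    """
    tp = fp = fn = 0.0
    for y, p, w in zip(labels, preds, weights):
        mass = g(w)
        if p and y:
            tp += mass          # correctly accepted triple
        elif p and not y:
            fp += mass          # wrongly accepted triple
        elif y and not p:
            fn += mass          # wrongly rejected triple
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Toy data (made up): four triples with weights in [0, 1].
labels = [True, True, False, True]
preds = [True, False, True, True]
weights = [0.9, 0.2, 0.5, 0.4]

p, r, f1 = weighted_prf(labels, preds, weights)                       # g(x) = x
p2, r2, f1_exp = weighted_prf(labels, preds, weights, g=lambda w: 2.0 ** w)
print(p, r, f1)
print(p2, r2, f1_exp)
```

Because the same g appears in both numerator and denominator, changing g reweights the triples without pushing the scores outside [0, 1], which also makes the g(x)=x baseline directly comparable to other choices.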

Another concern I have with the activation function is that it is also used in the evaluation protocol. In Section 4.3, the authors report different optimal activation functions used for the different models, with ComplEx ultimately using a radically different evaluation function than the other models. In my opinion, this means that the results of the models are not really comparable, since they use different evaluation metrics. Therefore, I feel like the results reported for the Wa* metrics are misleading, since they suggest comparisons between models (e.g., ComplEx has a higher Wa* than DistMult, but for both, the Wa* metrics are computed differently).

The concern gets even worse for the dynamic metrics. According to Fig. 5, some of the dynamic base activation functions are almost uniform, i.e., close to g(x)=1 anyway. Given that the authors state that they train for 1,000 epochs, the dynamic activation function ultimately reaches (1003/1002)^x, which is almost 1^x. The Wa* evaluation metrics should therefore not differ much from the corresponding non-Wa* metrics, yet the reported results are radically different. I cannot really see how values of g(x) that are so close to a constant 1 can yield results that differ so much from the non-weight-aware metrics.

Another issue I have with the dynamic metrics is that they become even less comparable when early stopping is used. Let's assume one model stops after 10 epochs, while another one takes the full 1,000. Then they are evaluated on different metrics, which raises the same concern as above (using different evaluation metrics for different models). The paper, unfortunately, is also not very clear here.
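The two points about the dynamic activation function can be checked numerically with a small sketch. It rests on assumptions that are not confirmed by the paper: the base is taken to be (epoch + 3) / (epoch + 2), inferred only from the (1003/1002)^x value quoted above for 1,000 epochs, and weights x are assumed to lie in [0, 1]. The names `dynamic_g` and `max_deviation_from_one` are hypothetical.

```python
def dynamic_g(x: float, epoch: int) -> float:
    """Assumed dynamic activation: ((epoch + 3) / (epoch + 2)) ** x."""
    base = (epoch + 3) / (epoch + 2)
    return base ** x


def max_deviation_from_one(epoch: int, steps: int = 100) -> float:
    """Largest |g(x) - 1| over a grid of weights x in [0, 1]."""
    return max(abs(dynamic_g(i / steps, epoch) - 1.0) for i in range(steps + 1))


late = max_deviation_from_one(1000)  # base 1003/1002, full training run
early = max_deviation_from_one(10)   # base 13/12, model stopped early

print(f"epoch 1000: max |g(x) - 1| = {late:.6f}")
print(f"epoch 10:   max |g(x) - 1| = {early:.6f}")
```

Under these assumptions, g stays within about 0.001 of the constant 1 after 1,000 epochs, while a model stopped at epoch 10 would be scored with a g that deviates by about 0.083, so the two models would effectively be evaluated with different weightings.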

Summarizing, the authors tackle an interesting task. The gains when using weights and evaluating on standard metrics are a good finding, maybe not exciting enough for a journal publication, but they could make an interesting conference paper. On the other hand, the main claimed contribution, in my opinion, is hardly backed by the empirical findings, and the evaluation setup in general is problematic.