Review Comment:
The authors claim that current knowledge graph embedding evaluations are limited by treating each triple equally. They propose weight-aware variants of well-known evaluation metrics, as well as weight-aware adaptations of four common embedding models, namely TransE, TransH, DistMult, and ComplEx.
Respecting weights in knowledge graphs seems useful per se, and it is a bit surprising that this topic has not received much attention so far.
My main concern about the paper is that the results do not really support the original motivation. The authors state that current, non-weight-aware evaluation protocols are not sufficient and that weight-aware evaluation protocols bring additional insights. However, when looking at the evaluation results presented in Tables 2 and 3, the ranking of approaches stays almost stable across all evaluations (e.g., for CN15K, ComplEx is on par with TransE, DistMult follows closely, and TransH is much worse). This is observed across all datasets in Tables 2/3 and also in Table 4. On the other hand, there seems to be a considerable gain from respecting weights during training while evaluating on non-weighted standard metrics, so this is perhaps the more interesting finding to focus on.
One thing I struggle with is the introduction of the activation function g. In my opinion, it is never clearly motivated why simply using the weights as they are (i.e., g(x)=x) should be inferior. This should be better motivated and also be included as a baseline. Moreover, the chosen activation functions lead to unintuitive final metrics, such as F1 scores above 1, which can be very confusing. In evaluations, it would be better to have weighted evaluation metrics that stay within the usual ranges (i.e., between 0 and 1 for precision, recall, and F1), e.g., along the lines of the sketch below.
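To illustrate the last point: a weighting scheme in which numerator and denominator use the same g-transformed weights is bounded by 1 by construction, for any positive g, including the identity baseline. This is only a minimal sketch of what I have in mind, not the metric definition used in the paper:

```python
import numpy as np

def weighted_precision_recall_f1(y_true, y_pred, weights, g=lambda x: x):
    """Weight-aware P/R/F1 that stay in [0, 1] because numerator and
    denominator use the same g-transformed weights.
    (Sketch only; not the definition from the paper under review.)"""
    w = g(np.asarray(weights, dtype=float))
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)

    tp = w[y_true & y_pred].sum()                 # g-weighted true positives
    precision = tp / max(w[y_pred].sum(), 1e-12)  # over g-weighted predictions
    recall = tp / max(w[y_true].sum(), 1e-12)     # over g-weighted positives
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return precision, recall, f1

# Example: all three values are ratios of a weighted count over a larger
# weighted count, hence never exceed 1, regardless of the choice of g.
p, r, f = weighted_precision_recall_f1(
    y_true=[1, 1, 0, 0], y_pred=[1, 0, 1, 0], weights=[0.9, 0.5, 0.2, 0.1])
```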
Another concern I have with the activation function is that it is also used in the evaluation protocol. In Section 4.3, the authors report different optimal activation functions for the different models, with ComplEx ultimately using a radically different evaluation function than the other models. In my opinion, this means that the results of the models are not really comparable, since they are effectively scored with different evaluation metrics. Therefore, I feel that the results reported for the Wa* metrics are misleading, since they suggest comparisons between models (e.g., ComplEx has a higher Wa* than DistMult, but for both, the Wa* metrics are computed differently).
The concern gets even worse for the dynamic metrics. According to Fig. 5, some of the dynamic base activation functions are nearly constant, i.e., close to g(x)=1 anyway. Given that the authors state that they train for 1,000 epochs, the dynamic activation function ultimately reaches (1003/1002)^x, which is almost 1^x; the Wa* evaluation metrics should therefore not differ much from the corresponding non-Wa* metrics, yet the reported results are radically different. I cannot really see how values of g(x) that are this close to a constant 1 can yield results that differ this much from the non-weight-aware metrics.
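A quick numeric check of this claim (assuming triple weights roughly on the order of 0 to 10, which covers typical confidence scores):

```python
# With a base of 1003/1002, the per-triple factor g(w) = (1003/1002)**w
# deviates from 1 by at most about 1% across this weight range, so the
# g-weighted counts should be almost identical to the unweighted ones.
base = 1003 / 1002
for w in (0.1, 0.5, 1.0, 5.0, 10.0):
    print(f"w={w:5.1f}  g(w)={base ** w:.6f}")
# w=  0.1  g(w)=1.000100
# w=  0.5  g(w)=1.000499
# w=  1.0  g(w)=1.000998
# w=  5.0  g(w)=1.005000
# w= 10.0  g(w)=1.010025
```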
Another issue I have with the dynamic metrics is that they become even less comparable when using early stopping. Suppose one model stops after 10 epochs, while another takes the full 1,000 epochs. Then they are evaluated with different metrics, which corresponds to the same concern as above (different evaluation metrics for different models); see the sketch below. The paper, unfortunately, is also not very clear on this point.
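To make this concrete, here is an illustration under the assumption that the dynamic base depends on the stopping epoch t in the way the reported 1,000-epoch value (1003/1002) suggests, e.g. as (t+3)/(t+2); this is my extrapolation, not a formula taken from the paper:

```python
# Hypothetical dynamic activation, assuming the base is (t+3)/(t+2) at
# stopping epoch t (extrapolated from the stated 1,000-epoch value).
def g_dynamic(w, t):
    return ((t + 3) / (t + 2)) ** w

for t in (10, 1000):
    print(t, g_dynamic(1.0, t))
# t=10:   ~1.0833  -> a triple with weight 1 counts roughly 8% more
# t=1000: ~1.000998 -> essentially unweighted
# Two models stopping at different epochs are thus scored with visibly
# different weightings of the same test triples.
```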
In summary, the authors tackle an interesting task. The gains from using weights during training while evaluating on standard metrics are a good finding; it is maybe not exciting enough for a journal publication, but it could make an interesting conference paper. The main claimed contribution, on the other hand, is in my opinion hardly backed by the empirical findings, and the evaluation setup in general is problematic.