GTrans: generic knowledge graph embedding via multi-state entities and dynamic relation spaces

Tracking #: 1647-2859

Authors: 
Zhen Tan
Xiang Zhao
Yang Fang
Weidong Xiao

Responsible editor: 
Claudia d'Amato

Submission type: 
Full Paper
Abstract: 
Knowledge graph embedding aims to construct a low-dimensional and continuous space, which is able to describe the semantics of high-dimensional and sparse knowledge graphs. Among existing solutions, translation models have drawn much attention lately; they use a relation vector to translate the head entity vector, the result of which should be close to the tail entity vector. Compared with classical embedding methods, translation models achieve state-of-the-art performance; nonetheless, the rationale and mechanism behind them still call for deeper understanding and investigation. In this connection, we examine the essence of translation models and present a generic model, namely GTrans, which subsumes all existing translation models. In GTrans, each entity is interpreted by a combination of two states - eigenstate and mimesis. The eigenstate represents the features that an entity intrinsically owns, and the mimesis expresses the features that are affected by associated relations. The weighting of the two states can be tuned, and hence dynamic and static weighting strategies are put forward to best describe entities in the problem domain. Besides, GTrans incorporates a dynamic relation space for each relation, which not only enables the flexibility of our model, but also reduces the noise from other relation spaces. In experiments, we evaluate the proposed model with two benchmark tasks - triplet classification and link prediction. Experimental results show significant and consistent performance gains offered by GTrans over existing alternatives.
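
For readers unfamiliar with the translation principle the abstract refers to, the following is a minimal sketch of the TransE-style scoring idea that GTrans generalises (an illustration only; GTrans itself adds the two-state entity representation and dynamic relation spaces described above):

import numpy as np

# Minimal sketch of the translation principle: a triple (h, r, t) is considered
# plausible when the translated head h + r lies close to the tail t.
def translation_score(h, r, t, norm=1):
    return -np.linalg.norm(h + r - t, ord=norm)

h, r, t = np.random.randn(3, 50)      # toy 50-dimensional embeddings
print(translation_score(h, r, t))     # higher (less negative) = more plausible
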
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 02/Jun/2017
Suggestion:
Reject
Review Comment:

This paper proposes a novel Knowledge Graph Embedding model based on TransE [1].

The proposed model is more convoluted than the original one, but the underlying ideas are interesting.

In short, the authors propose a dual representation for each entity and relation type in the knowledge graph: eigenstate and mimesis. Each entity is then described as a linear combination of its eigenstate and its mimesis, where the weights can either be fixed (Static Weighting - SW) or selected dynamically using graph properties (Dynamic Weighting - DW), as sketched below.
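
To make the two strategies concrete, here is a rough sketch of how I read the combination; the exact form of the dynamic weights is my assumption, not taken from the paper:

import numpy as np

def combine(eigenstate, mimesis, alpha=0.5, beta=0.5):
    # Static Weighting (SW): alpha and beta are fixed hyper-parameters.
    return alpha * eigenstate + beta * mimesis

def dynamic_weights(entity_degree, max_degree):
    # Dynamic Weighting (DW): weights derived from graph properties, e.g. the
    # entity's relational degree (an illustrative choice, not the paper's formula).
    beta = entity_degree / max_degree
    return 1.0 - beta, beta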

Furthermore, the authors reformulate TransE by replacing the common Euclidean distance with a "standardised Euclidean distance" (SED).
What is the main motivation behind this choice?
Is this distance metric proposed or used elsewhere in the literature?
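
For reference, the textbook form of a standardised Euclidean distance rescales each dimension by its variance; whether the paper uses exactly this form should be clarified. A minimal sketch:

import numpy as np
from scipy.spatial.distance import seuclidean

x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 0.0, 3.5])
V = np.array([0.5, 2.0, 1.0])                     # per-dimension variances
d_manual = np.sqrt(np.sum((x - y) ** 2 / V))      # standardised Euclidean distance
assert np.isclose(d_manual, seuclidean(x, y, V))  # matches SciPy's implementation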

In Sect. 3.2, the authors relax some of the requirements in TransE by replacing a set of hard constraints with soft constraints, but this decision is not properly justified. That said, it is fine - it can be seen as a form of L2 regularisation, similar to the one used in [2, 3].
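
The contrast I have in mind is roughly the following (a sketch only; the exact penalty GTrans uses would need to be stated explicitly in the paper):

import numpy as np

def hard_constraint(entity_emb):
    # TransE-style hard constraint: renormalise entity vectors to unit L2 norm.
    return entity_emb / np.linalg.norm(entity_emb, axis=1, keepdims=True)

def soft_constraint(loss, entity_emb, lam=1e-3):
    # Soft alternative: add an L2 penalty to the loss instead of projecting,
    # in the spirit of the regularisation used in [2, 3].
    return loss + lam * np.sum(entity_emb ** 2)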

In addition, the authors also rely on two strategies from the literature for sampling negative examples (unif, bern), sketched below.
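
For completeness, the two corruption strategies (unif from Bordes et al. [1], bern from Wang et al.'s TransH), simplified for illustration:

import random

def corrupt_unif(h, r, t, entities):
    # unif: replace the head or the tail with equal probability.
    if random.random() < 0.5:
        return random.choice(entities), r, t
    return h, r, random.choice(entities)

def corrupt_bern(h, r, t, entities, tph, hpt):
    # bern: replace the head with probability tph / (tph + hpt), where tph and
    # hpt are the average tails-per-head and heads-per-tail of relation r.
    if random.random() < tph / (tph + hpt):
        return random.choice(entities), r, t
    return h, r, random.choice(entities)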

The authors also do a great job of surveying related methods (Sect. 2.2), showing they are aware of the current literature on the topic.
They also provide a comparison with several other models from the literature on two standard benchmarks - WN18 and FB15k.

My main concern is that, despite being aware of it, the authors intentionally chose not to compare their proposed model with ComplEx [3].

Despite being much simpler than GTrans, ComplEx achieves significantly better results both on WN18 and FB15k.

For instance, after briefly checking the paper in [3], it is easy to see that ComplEx achieves an 84% Hits@10 on FB15k,
while GTrans - despite being noticeably more convoluted and relying on a much larger bag of tricks - only reaches 75.3%.
A similar phenomenon also happens for WN18.

However, despite being aware of the model in [3], the authors intentionally chose to omit its results from the comparison.

This behaviour might (incorrectly) convince readers that GTrans achieves state-of-the-art results on these two standard benchmarks,
while this is not the case - by a large margin.

For such a reason, I do not recommend the acceptance of this paper.

[1] Bordes et al. - Translating Embeddings for Modeling Multi-relational Data - NIPS 2013
[2] Yang et al. - Embedding Entities and Relations for Learning and Inference in Knowledge Bases - ICLR 2015
[3] Trouillon et al. - Complex Embeddings for Simple Link Prediction - ICML 2016 - http://proceedings.mlr.press/v48/trouillon16.pdf

Review #2
By Michael Cochez submitted on 11/Sep/2017
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

In this paper, the authors present a knowledge graph embedding approach based on the idea that each entity (and relation) should be embedded using what is called the eigenstate and the mimesis.
The eigenstate represents the intrinsic aspects of the entity, while the mimesis is representative of what the authors call exogenous properties.

The paper is well written, despite some minor errors in grammar, which could be corrected by intensive proofreading.
I do think there is some novelty in the paper, as the concept seems somewhat different from earlier approaches.
However, as I will elaborate below, a stronger argumentation is needed to convince a reader that the approach chosen is a) solving the right problems and b) actually solving them.
Further, there is an issue regarding the significance of the results. Many of the results obtained in this work are dominated by approaches which predate this submission.
Now, there are some results for which this paper still brings merit, and it might well be that the approach in this paper scales better to larger knowledge graphs than the currently dominating approaches.
Hence, I advise a major revision of this paper in which the authors would investigate the behavior of their approach on larger graphs.

I will first go into some details of the major issues I observed in the paper. Then I will list smaller issues below.

1. As mentioned, the results in this paper are often dominated by other approaches. Exemplary is TransG [1], which dominates many of the results (others include KG2E, PTransE).
One might argue that there would not be a need to compare to that model, as it is a generative model and not translation-based.
In my view, that argument is, however, void. To me, it seems that TransG also has some of the same ideas in the background as the proposed approach, namely that having an identical translation for all the triples of the same relation is not necessarily a good idea. Therefore, in the revised version, an in-depth comparison would be necessary to work out the relation between these approaches.
The same goes for the already included related work. There is a short discussion establishing that TransE and TransR are special cases (parameterizations) of GTrans, but the relations to other TransX models are not elaborated.

2. The two main reasons put forward for why the approach is needed are 1) that current approaches underestimate the complexity of entities, and 2) a statement saying that something is wrong with the function which is typically optimized. I am not sure I completely get what the problem is that they want to point out, but I understand it to be that the relation vector (or matrix, in some approaches) of a relation is not the same for all head-tail pairs. Now, the first argument is pretty vague and, in my opinion, not sufficiently supported in the paper. The second one I agree with, and the discussion could be augmented with some examples.
All in all, there is some support for a new approach, but it should be argued more clearly. Then, when presenting the approach, it would be good to indicate explicitly how the model solves the indicated issues, and especially how it does so better than other approaches. For example, while reading, I started thinking that using normal TransE with double the vector length might be as good as having two vectors, or that having a translation matrix would cover anything one can do with two translation vectors. Some of this becomes clearer in Section 3, but a reader has to find this out independently, instead of the comparison being made explicitly.

3. I might be getting the wrong impression, but it seems to me that the model is presented as more involved than its actual strength warrants. Continuing from equation (6), I can do a rewrite as follows:

h = \alpha h_e + \beta (r_a h_a^T h_e)
h = \alpha h_e + \beta (r_a h_a^T) h_e
h = (\alpha + \beta (r_a h_a^T)) h_e

Hence, it seems that the whole first term, related to the 'abstract' vectors, only results in a linear scaling of the 'eigenstate'.
This means that the whole abstract part is not able to change specific parts (dimensions) of the eigenstate, but can only scale the whole vector.
It seems to me that this diminishes the power of the model greatly.
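
A quick numerical check of this factoring (a sketch, under my assumption that r_a h_a^T denotes the scalar inner product of r_a and h_a; if it is meant as an outer product, the argument would need to be restated):

import numpy as np

rng = np.random.default_rng(0)
d = 5
alpha, beta = 0.6, 0.4
h_e, h_a, r_a = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)

s = r_a @ h_a                               # scalar <r_a, h_a>
h_original = alpha * h_e + beta * s * h_e   # alpha h_e + beta (r_a h_a^T) h_e
h_factored = (alpha + beta * s) * h_e       # (alpha + beta (r_a h_a^T)) h_e
assert np.allclose(h_original, h_factored)  # the abstract term only rescales h_e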

4. There is a comparison of the complexity of the approaches; however, a timing comparison is missing. I agree that the complexity analysis is essential, mainly because the graphs currently used are rather small. From experiments performed as part of my own work, I have observed that some of the approaches cited in this paper will indeed take a very long time to complete, or will not complete within a reasonable time on large graphs. Further, it is known that a (worst-case) complexity analysis does not always give a complete view of the actual time needed to run the algorithm on real test cases. Hence, I would also suggest providing a comparison between the approaches regarding the time it takes to build the model for a given (large) dataset.

Besides these major issues, there are several other parts to pay attention to in the revised work.

* Besides the works mentioned above, there are also other works aiming to embed knowledge graphs. One branch of approaches is the one related to (biased) RDF2Vec, which also scales to much larger KGs.

* When introducing NTN, mention explicitly where the model was first conceived.

* The sensitivity on the alpha and beta parameters could be investigated explicitly.

* In section 3.0, when introducing \delta', provide a forward reference to where it gets defined.

* In the text after eq. (9), it is unclear what the i-th entity e refers to. Is it only heads, tails or both?

* In section 3.1, you write "Moreover, SED has lower time complexity and can be applied in large-scale knowledge representation." It seems to me that you should try both measures and then say whether they are feasible. At the moment, it seems like only one was tried. Dismissing one of them for reasons of theoretical time complexity is not sufficient.

* I see the argumentation for changing 1/\sigma to W_r as unnecessary. Writing it as W_r is just easier for notation. The fact that this also helped you to do some (preliminary?) optimization is not important.

* In equation (19), some prime signs (') are missing in the second summation. Besides, it would be fair to mention the source of inspiration for this cost function explicitly.

* I am confused by the #parameters column of table 1. It seems you mean to indicate some sort of space complexity, but I am not too sure.

* You mention "elegant performance on large-scale KG". I like the lyrical language, but I only accept it if you support this claim with experiments, as mentioned above.

* Using TransE for initializing the vectors for the proposed embedding algorithms seems like a good idea indeed. It would be useful if you can report how much training time this saved overall.

* Your experiments cover the datasets which are often used for evaluating embeddings. However, for some experiments, some datasets are left out without a clear explanation (e.g., why is WN18 missing from Table 3?).

* As already mentioned, the graphs currently used are not that large at all. So, the statement "GTrans is experimentally demonstrated to be applicable to large-scale knowledge graphs." must be toned down as long as no large KGs (e.g., Wikidata and DBpedia) have been embedded using this approach.

* In 4.4, Distribution of relation space: I see why this technique gives an indication of how even the distribution of the relation space is. What I do not understand, however, is why this would immediately mean that the embedding is better. Perhaps there are specific interrelations which are only representable by a non-even spreading. Moreover, comparing the error measurements of the different methods directly, as seems to be done in the last paragraph, does not make sense. This is mainly because the functions to be optimized are different and hence their errors cannot be compared.

* Language issues
There were several issues related to language in the paper. They were minor and not disturbing. Pay attention to the use of singular and plural, in particular subject-verb agreement.
Besides, articles are missing in some places.

[1] Xiao, Han, Minlie Huang, and Xiaoyan Zhu. "TransG: A Generative Model for Knowledge Graph Embedding." In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, pp. 2316-2325. 2016.

Review #3
By Heiko Paulheim submitted on 05/Oct/2017
Suggestion:
Major Revision
Review Comment:

The paper introduces a knowledge graph embedding approach called GTrans. It is based on two notions of an entity, i.e., eigenstate and mimesis. In an evaluation on three different datasets, the authors show that their approach outperforms state-of-the-art techniques like Trans*, although the setting is a bit questionable.

I have two main issues with the paper, i.e., the presentation of the approach, and the setup of the evaluation. Furthermore, the paper requires proof-reading by a native or at least proficient speaker, since it contains quite a few sentences which are barely understandable.

As far as the presentation of the approach is concerned, I find this paper particularly difficult to follow. A more intuitive description of what the authors mean by eigenstate and mimesis would clearly help improve the understandability. A running example using some well-known entity or entities would also help explain the authors' intuition.

There are some items which remain particularly unclear and/or lack justification and explanation:
* the two shortcomings on p. 2 (underestimation of complexity, design of evaluation rules) should be explained using an example. Where exactly do those models fall short?
* on p.2, "complex relation" is not defined. What do the authors mean by a complex relation?
* on p.4, "abstract" is introduced, but not explained
* on p.4, "state of entities" is introduced, but not explained
* on p.7, the authors make statements about the time complexity of their approach, which lack any argumentation
* on p.12, the textual interpretation of Fig. 6 is unclear. The authors should first explain their evaluation measures, and then present their findings.

There are some claims which are partly counterintuitive (at least to me) and should therefore be more carefully explained:
* p.2. The authors state that other approaches do not "fully describe entities" - my understanding is that focusing on the essence of an entity instead of fully describing it is actually one of the goals of using vector space embeddings
* p.5. The authors claim that "the more occurences of a relation, the more important this relation means to the entity" - a typical counterexample would be birthplace. There are thousands of people whose birthplace is the US, but this relation barely influences the meaning of the concept "US". On the other hand, a book may have only one author and one genre, but both are pretty essential for describing the book

As far as the evaluation setup is concerned, the comparison is not fair - it is actually comparing apples to oranges. GTrans is run with specific parameters for each dataset, while the other approaches use a fixed parameter set across all three datasets. The authors have to either tune the parameters of *all* approaches, or run their approach with a parameter set fixed for all datasets.

Furthermore, the setup for determining the optimal parameter settings is unclear. Did the authors split the training set into a training and a validation part, or did they tune against the evaluation set? It would also be good to make a statement about the stability of the approach, i.e., how sensitive is it w.r.t. parameterization?

In total, the approach is not very well described, and the current evaluation setup does not prove that the approach is actually better than state-of-the-art approaches (contrary to what the authors state). I do not recommend acceptance.

Minor issues:
* p.5: "s_e_i,s_e_j indicates that the entity e_j has k relationships with the entity e_i" - there has to be a k in the first part
* The authors use the term "triplets", but "triples" is the term usually used in the community
* Freebase does not contain "billion entities"; it is smaller, see [1] for actual numbers
* On p.7, the authors mention "mini-batch", which is never explained
* On p.7, there is a broken reference to an equation

[1] Heiko Paulheim: Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8(3), 2017