A Biaswalk Based RDF Entity Embeddings

Tracking #: 2401-3615

Thi Thu Van Duong
Md Anwarul Islam
Young-Koo Lee1

Responsible editor: 
Harald Sack

Submission type: 
Full Paper
Resource Description Framework(RDF) graph has become an important data source for many knowledge discoveries and data mining tasks. However, to enable complex analytic, most of the knowledge discovery algorithms require data in vector representation. Therefore, several works have been recently proposed which aim to represent entities in the RDF graph as low dimensional vectors by graph walking. However, sequences generated by graph walking capture only structure related context, it cannot capture latent context such as semantically related information which is an important property of RDF data. In this paper, we proposed a new method to map each entity in RDF to vector using word2vec as a language modeling to learn embedding. In order to use word2vec, we produce a bias random walk, to generate sequences as node context. In this paper, we provide a new concept of similar entities which trade-off between the label of outgoing edge and outgoing nodes. By using entity similarity, we provide a structural similarity that calculates the similarity of two entities in each case of the current sequence. Moreover, we proposed a latent sequences which cannot be generated by traveling the graph, but provide more semantic sentences. Experimental results and the case study on real graphs demonstrates that our method achieves better quality and efficiency.
Full PDF Version: 


Solicited Reviews:
Click to Expand/Collapse
Review #1
By Heiko Paulheim submitted on 29/Jan/2020
Review Comment:

The authors propose a new method for creating entity embeddings on RDF graphs. They compare their results to RDF2vec.

The whole paper is extremely difficult to grasp, also due to language issues. From what I get, the authors propose (a) a new weighting scheme for creating non-random walks, which dynamically selects the next node to visit on a not entirely random basis, and (b) an approach for generating latent sequences (probably some generalizations of the sequences before), which is only presented in a very short paragraph.

The first approach, introducing a new weighting scheme, coined "bias walk" by the authors, does not improve much over random walks, and cannot outperform RDF2vec. The second approach, applied on top of biaswalk, shows sometimes better results than RDF2vec according to the numbers presented by the authors.

That being said, the authors picked some numbers in tables 1-6 which are not the best numbers for RDF2vec. RDF2vec has shown to work better with kernels on smaller datasets (which means that on the AIFB dataset, the authors' approach would only outperform RDF2vec on a small margin of .95 vs. .93, whereas on the MUTAG dataset, RDF2vec is still superior with .96 vs .93). On the DBpedia (not BDpedia!) based evaluations, the authors used the numbers for 4-depth walks, while in the original paper, 8-depth walks have been shown to work better. Using the 8-depth walks instead, RDF2vec is only outperformed in two out of four cases, not all four, as claimed by the authors.

Moreover, if the latent walk generation technique can be applied on top of biaswalk, and biaswalk performs worse than RDF2vec, it would be straight forward to apply it on top of RDF2vec. It is unclear why this has not been tried.

I also miss some performance figures. When running such experiments on top of graphs such as DBpedia, this would be very interesting to see.

In addition, the authors propose a way of handling string literals by first embedding them into a doc2vec model. While I can see that this is an advantage in the situation discussed (i.e., the handling of titles of academic papers), it is not applicable in general (e.g., authors' names), and the authors should discuss this more clearly. To make the approach applicable in a fully automated pipeline, it should be capable of figuring out the promising cases itself (i.e., apply the literal handling on titles, but not on authors' names).

In the experiments on DBLP and CiteSeer, the ground truth is not clear. Moreover, a comparison to RDF2vec should be carried out, as in the previous section.

In summary, the paper lacks both clarity in the approaches applied, and rigor in carrying out the experiments.

Some additional comments:
* The code and the data used are not openly available, which makes replicability hard.
* The English language quality of the paper is not very good, and it should be proof-read by a native speaker.
* Some references are simply missing ([]) or not properly resolved ([?]). Moreover, encoding iussues exist (see bottom of p. 8, left column). These are trivial things to figure out *before* submitting to a high-quality journal.

Review #2
By Mehwish Alam submitted on 23/Feb/2020
Review Comment:

The paper introduces an algorithm for using biased walks on RDF graphs for generating entity embeddings.

Major comments:

The authors claim that the main difference between their algorithm and SOTA algorithms is that this method provides encodes semantic information into the vector space. However, biased walks also capture the structural information.

The authors are completely by passing the algorithms generating knowledge graph embeddings such as translational models such as TransE, models based on Graph Convolutional Networks etc. There should be some comparison provided in reference to these algorithms. GAKE [1] is one of the algorithms which captures the contextual information and is worth discussing here.

Section 4.1 introduces a way to deal with text literals by learning their vectors using Doc2Vec and then KNN is used for obtaining the new labels for the literals. These are then captured while performing random walks. What I would really like to know how these new labels are helping in capturing the semantics of the entity for which the literal has been used? What authors are doing seems like adding these labels (not as an entity) in the vector space.

Please refer to the methods on generating Knowledge Graph Embeddings with Literals [2].

Definition 4. Computes the similarity between the nodes which is the starting point of the algorithm. If the authors are computing the similarities this way at this point why do they need to compute embeddings later. It is a bit confusing.

The authors need to situate their work based on the reference 3 in their paper which is also performing biased walks.

In the results section 5, it is not clear if the biased walks represents the work of the authors or the work proposed in reference 3.

All in all, I fail to see the novelty of the approach.

More detailed remarks:

- The English used in the paper needs to be improved a lot.
- Many of the technical terms are not used correctly such as “travelling the graph” in abstract. For graphs usually the word traversing is used.
- Introduction keeps a lot of repetitive information such as second paragraph “vectors are embeddings where embeddings are transformations from graph entities to their …….”.
- The paper is written very carelessly. In many please the references are missing such as section 5 misses the references to the datasets.
- There are many other typos and the paper at times becomes hardly understandable.

[1] https://www.aclweb.org/anthology/C16-1062.pdf
[2] Genet Asefa Gesese, Russa Biswas, Mehwish Alam, Harald Sack: A Survey on Knowledge Graph Embeddings with Literals: Which model links better Literal-ly? CoRR abs/1910.12507 (2019)

Review #3
By Michael Cochez submitted on 12/Mar/2020
Major Revision
Review Comment:

In this paper, the authors propose an improvement over the RDF2vec algorithm by biasing the random walks created as a first step of the algorithm.
Besides, the quality is further improved by including information from specific information from string literals.

While this paper has some merit, my view is that it does not have sufficient new contributions to warrant a journal publication. A major issue is that there is no comparison with other works that have investigated biasing the random walks and inclusion of literal information.
From the positive side, the idea of clustering the literals, could indeed be explored further. Also, the idea of biasing walks differently still has potential.
My recommendation is hence a major revision of the paper, but that revision will need significantly more exploration.

A secondary concern is regarding the amount of revision of the paper. There are many grammatical issues and in many places, there are technical problems with the compilation of the manuscript. These technical issues need to be addressed and a thorough language review is necessary before resubmission.

In this review, I do refer to several of my own works, which might bias my views. The reason I still kept these is that we have done research in a very similar direction.

* One major rationale used in the paper is that normal random walks will often miss walks that contain important information - more specifically important nodes. While that claim seems justified, it would be beneficial to
1. quantify this more precisely . You could for example, do an experiment in which you measure how many nodes are missed in e.g. 500 walks from each node. Then, you would also need a measurement to quantify when that is actually significant.
2. Compare with methods that actually include all paths trough statistics, e.g. [1]. Note that that work also includes results for biased walks.

* In your related work, you also refer to another work which biases the walks for RDF2vec [2], but no comparison is provided. Note that in that work many biasing strategies are explored, not only the one you mention in the related work.

* In the work on RDF2vec, literals are indeed ignored. However, it is clear that they could have been included without any technical or algorithmic changes. Hence, a comparison the that would have been beneficial. Also, there is a recent paper and survey [3], which I would recommend to the authors.

* You do perform a manual selection of the relevant literals in the graphs. While this is not wrong by itself, it means that we are moving from a completely unsupervised method to a method that requires a form of manual feature selection. It would be interesting to see how performance changes without this manual selection.

Other issues:

* You claim in several places that word2vec is "the most efficient language modelling technique". First, this claim is vague as you do not define what efficient means. But secondly, there are many other techniques which have led to much better performance, strongly depending on the task and how performance is measured.

* You talk about "Common bag of words", which is not correct.
* There is no 'removing' of layers in CBOW
* You state a training complexity result without any context. Also, much of the explanation on the word2vc methods seems very superficial.
* Your reference section needs to be cleaned up, there are several issue, including capitalization inside paper titles.

[1] Michael Cochez, Petar Ristoski, Simone Paolo Ponzetto, and Heiko Paulheim. Global RDF vector space embeddings. In Claudia d’Amato, Miriam Fernandez, and others, editors, The Semantic Web – ISWC 2017: 16th International Semantic Web Conference, Vienna, Austria, October 21–25, 2017, Proceedings, Part I, pages 190–207. Springer International Publishing, Cham, 2017d. ISBN 978-3-319-68288-4. doi: 10.1007/978-3-319-68288-4.

[2] Michael Cochez, Petar Ristoski, Simone Paolo Ponzetto, and Heiko Paulheim. Biased graph walks for RDF graph embeddings. In Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, WIMS ’17, pages 21:1–21:12, New York, NY, USA, 2017c. ACM. ISBN 978-1-4503-5225-3. doi: 10.1145/3102254.3102279.

[3] https://arxiv.org/abs/1802.00934 and https://arxiv.org/abs/1910.12507