RDF2Vec: RDF Graph Embeddings and Their Applications

Tracking #: 1643-2855

Petar Ristoski
Jessica Rosati
Tommaso Di Noia
Renato De Leone
Heiko Paulheim

Responsible editor: 
Freddy Lecue

Submission type: 
Full Paper

Abstract:
Linked Open Data has been recognized as a valuable source for background information in many data mining and information retrieval tasks. However, most of the existing tools require features in propositional form, i.e., a vector of nominal or numerical features associated with an instance, while Linked Open Data sources are graphs by nature. In this paper, we present RDF2Vec, an approach that uses language modeling approaches for unsupervised feature extraction from sequences of words, and adapts them to RDF graphs. We generate sequences by leveraging local information from graph sub-structures, harvested by Weisfeiler-Lehman Subtree RDF Graph Kernels and graph walks, and learn latent numerical representations of entities in RDF graphs. We evaluate our approach on three different tasks: (i) standard machine learning tasks, (ii) entity and document modeling, and (iii) content-based recommender systems. The evaluation shows that the proposed entity embeddings outperform existing techniques, and that pre-computed feature vector representations of general knowledge graphs such as DBpedia and Wikidata can be easily reused for different tasks.
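The walk-based half of the pipeline the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy graph, entity names, and walk parameters are hypothetical, and the resulting token sequences would in practice be fed to a word2vec model rather than printed.

```python
import random

# Toy RDF graph as an adjacency list: subject -> [(predicate, object), ...].
# The graph content and names are illustrative, not taken from the paper.
GRAPH = {
    "dbr:Berlin": [("dbo:country", "dbr:Germany")],
    "dbr:Germany": [("dbo:capital", "dbr:Berlin"),
                    ("dbo:currency", "dbr:Euro")],
    "dbr:Euro": [],
}

def random_walks(graph, entity, num_walks, depth, rng):
    """Generate random graph walks of the given depth starting at `entity`.

    Each walk is a token sequence alternating entities and predicates;
    RDF2Vec-style approaches treat such sequences as "sentences" for a
    word2vec model, yielding one embedding per entity and predicate.
    """
    walks = []
    for _ in range(num_walks):
        walk, node = [entity], entity
        for _ in range(depth):
            edges = graph.get(node, [])
            if not edges:
                break  # dead end: stop this walk early
            predicate, node = rng.choice(edges)
            walk.extend([predicate, node])
        walks.append(walk)
    return walks

rng = random.Random(42)
walks = random_walks(GRAPH, "dbr:Germany", num_walks=2, depth=2, rng=rng)
for w in walks:
    print(" ".join(w))
```

Each printed line is one "sentence" over the graph's vocabulary; training skip-gram or CBOW on many such sentences produces the latent entity representations evaluated in the paper.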
Decision:
Minor Revision

Solicited Reviews:
Review #1
By Achim Rettinger submitted on 03/Aug/2017
Minor Revision
Review Comment:

Thank you for addressing the points of our previous review; however, some open points remain and need revising:

"While there are many graph embedding approaches, the approaches based on translating embeddings have shown to outperform the rest of the approaches on the task of link predictions." This is not correct in light of newer approaches such as ComplEx [1] and HolE [2] (which we already mentioned in our previous review).

While you included the three translational embedding approaches TransE, TransH, and TransR, some information is still missing: the dimension parameter should be mentioned, and different variations could be shown. It would also have been beneficial to include different types of graph embedding approaches, such as the aforementioned ComplEx [1] and HolE [2], or RESCAL and NTN, which are already mentioned in the paper. At least mention them and explain why you do not compare against them or why they are not applicable.

Please make it explicit that the knowledge graph embeddings from the other approaches are trained for link prediction, not for the different evaluation tasks such as classification.

Below are further open points:

-- Missing explanation / reference --

* It is not discussed why the edge information was used for the Weisfeiler-Lehman graph kernel. There are link prediction approaches that ignore the predicate, so a comparison with the edge information excluded could be interesting.
p.10: "The dataset contains three smaller-scale RDF datasets (i.e., AIFB, MUTAG, and BGS) [...] Details on the datasets can be found in [71]" Please explain why you left out the AM dataset, which was the fourth dataset described in [71].

-- Parameter Settings & Results --

p.11: "We use two pairs of settings, d = 2; h = 2 (WL_2_2) and d = 4; h = 3 (WL_4_3)." and the same for "the length of the paths l" The values changed from the previous version, but we still could not find any explanation of, or insight into, the choice of these values.
p.14: Is there any explanation for the surprisingly low accuracy of SVM combined with DB2vec on AAUP, especially in comparison to the WD2vec combination?
p.23ff: Are there any insights into structural differences or coverage of Wikidata and DBpedia that would explain the difference in performance, e.g., for DB2vec SG 200w 200v 4d compared to WD2vec SG 200w 200v 4d?

-- Notation, Style & Errata--

p.13: "On the other hand, “Facebook” and “Mark Zuckerberg” are not similar at all, but are highly related, while “Google” and “Mark Zuckerberg” are not similar at all, and have somehow lower relatedness value." The word "somehow" does not fit in this context. Lower relatedness value compared to what?
p.15: the table is too big and extends into the top margin
p.21: https://github.com/sisinflab/LODrecsys-datasets should be in a footnote like the rest of the URIs
p.21: "versions of Movielens, LibraryThing" extends into the margin
p.25: https://github.com/sisinflab/lodreclib should be in a footnote like the rest of the URIs
* In addressing the reviewers' comments, you introduced some long, cumbersome sentences that disrupt the reading flow. Please proofread the paper again and restore a good reading flow.

[1] Trouillon, Théo, et al. "Complex Embeddings for Simple Link Prediction." Proceedings of The 33rd International Conference on Machine Learning. 2016.

[2] Nickel, Maximilian, Lorenzo Rosasco, and Tomaso Poggio. "Holographic Embeddings of Knowledge Graphs." Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. 2016.

Review #2
Anonymous submitted on 11/Aug/2017
Review Comment:

This manuscript is a revision of manuscript #1495-2707, which I had reviewed as Reviewer #1. Therefore, the following is just a follow-up review of this revision in comparison with the previous version of the manuscript.

I think that this new revision of the manuscript shows significant improvements with respect to the initial manuscript. The authors have effectively taken into account all the remarks and issues raised by my review and responded to them satisfactorily.
There remain a few typos and minor editorial issues (e.g., a footnote mark before punctuation in Section 4) which the authors can fix through a thorough re-read of the manuscript.

An issue which, in my opinion, should still be solved before the manuscript is published is the following: in Sections 3.1.1 and 3.1.2, the authors fail to properly distinguish between a vertex and its label, an edge and its label, and a subtree and its label. While, in most cases, the reader can reconstruct what the authors meant from the context, such lack of rigor does not help clarity and should be avoided, at the cost of being slightly more verbose.

Review #3
By Jiewen Wu submitted on 12/Sep/2017
Review Comment:

This version addressed the concerns and issues that I raised in my last review. The paper is easier to follow compared to the last version, and ideas are well presented. The review on related work seems complete. The experimental evaluation substantiates the claim that the proposed RDF2Vec method can be applied to RDF datasets for uses in different machine learning tasks. I recommend acceptance.

A few minor comments:

Section 5.1: It would be helpful to mention, for the smaller RDF datasets like AIFB, how much they overlap with the training RDF datasets (DBpedia/Wikidata) in terms of entity labels.

Section 5.3: Figure 2 comes after Figure 3.