Extracting Entity-specific Substructures for RDF Graph Embeddings

Tracking #: 1955-3168

Authors: 
Muhammad Rizwan Saeed
Charalampos Chelmis
Viktor K. Prasanna

Responsible editor: 
Guest Editors Knowledge Graphs 2018

Submission type: 
Full Paper
Abstract: 
Knowledge Graphs (KGs) have become useful sources of structured data for information retrieval and data analytics tasks. Enabling complex analytics, however, requires entities in KGs to be represented in a way that is suitable for Machine Learning tasks. Several approaches have recently been proposed for obtaining vector representations of KGs based on identifying and extracting relevant graph substructures using both uniform and biased random walks. However, such approaches lead to representations comprising mostly popular, instead of relevant, entities in the KG. In KGs in which different types of entities often exist (such as in Linked Open Data), a given target entity may have its own distinct set of most relevant nodes and edges. We propose specificity as an accurate measure for identifying the most relevant, entity-specific, nodes and edges. We develop a scalable method based on bidirectional random walks to compute specificity. Our experimental evaluation shows that specificity-based biased random walks extract more meaningful (in terms of size and relevance) substructures compared to the state of the art, and that the graph embeddings learned from the extracted substructures perform well against existing methods in selected data mining tasks.
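
As a rough illustration of the approach described in the abstract, the sketch below shows how a specificity-style score could bias random walks over a toy RDF graph and how the resulting walk sequences could be fed to a skip-gram model to learn entity embeddings. The toy graph, the per-predicate scores, and the use of gensim's Word2Vec are illustrative assumptions, not the authors' implementation.

# Minimal sketch (illustration only, not the paper's implementation) of
# specificity-biased random walks over an RDF-style graph, followed by
# skip-gram embedding learning over the generated walk sequences.
import random
from gensim.models import Word2Vec

# Toy RDF graph: subject -> list of (predicate, object) edges (assumed data).
graph = {
    "dbr:Inception": [("dbo:director", "dbr:Christopher_Nolan"),
                      ("dbo:language", "dbr:English_language")],
    "dbr:Christopher_Nolan": [("dbo:birthPlace", "dbr:London")],
    "dbr:English_language": [],
    "dbr:London": [],
}

# Hypothetical specificity scores in [0, 1]: higher means the predicate is
# assumed more specific (relevant) to the target entity, lower means generic.
specificity = {"dbo:director": 0.9, "dbo:language": 0.2, "dbo:birthPlace": 0.6}

def biased_walk(start, depth):
    """Sample one walk, choosing each outgoing edge with probability
    proportional to its predicate's specificity score."""
    walk = [start]
    node = start
    for _ in range(depth):
        edges = graph.get(node, [])
        if not edges:
            break
        weights = [specificity.get(pred, 0.1) for pred, _ in edges]
        pred, obj = random.choices(edges, weights=weights, k=1)[0]
        walk.extend([pred, obj])
        node = obj
    return walk

# Generate walks for a target entity, then learn embeddings with skip-gram.
walks = [biased_walk("dbr:Inception", depth=2) for _ in range(50)]
model = Word2Vec(walks, vector_size=32, window=4, min_count=1, sg=1, epochs=20)
print(model.wv["dbr:Inception"][:5])  # first few embedding dimensions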
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 16/Aug/2018
Suggestion:
Major Revision
Review Comment:

This paper studies how to extract substructures of given entities and learn entity embeddings from those extracted substructures. The idea might not be bad, but the paper is not well written. A lot of terminology and key technical details are not clearly defined or described, and the experimental results are not that promising. Detailed comments are given as follows.

1. This paper uses a lot of (different) terminologies, e.g., "representative neighborhood for each target entity", "representative subgraphs of target entities", "extract relevant/interpretable/meaningful entity representations". It is not clear what the authors mean by "representative neighborhood", "representative subgraphs", and "entity representations". Do they all refer to the same thing? If so, please unify the terminologies and provide precise definitions.

2. In Definition 2, the authors define a semantic relationship as a triple . But in the follow-up sections, a semantic relationship actually means p^d.

3. Please re-organize Section 4, which gets only a single subsection 4.1.

4. Algorithm 1 and Algorithm 2 seem to be complicated. Could the authors further provide a formal complexity analysis?

5. There are key technical details missing. It is not clear how to obtain entity embeddings after computing Specificity (or Specificity^H).

6. It is not clear how to create graph embeddings for those entities with the type Film/Book/Album/Country/City.

7. The authors mention that "we use three different datasets from different domains for the tasks of classification and regression". What are these three datasets?

8. Figure 6: why not conduct the same experiments on the Book data?

9. Figure 7: why not conduct the same experiments on the Book or Album data?

10. Section 6.5: why use a new dataset DBpedia Pagerank here? How does the new dataset relate to those introduced in Section 6.1?

11. Could the authors further explain how to interpret Figure 8?

12. The experimental results for the entity recommendation task (Figure 9) are not promising enough. The proposed method cannot beat the best performing baselines. And sometimes it performs substantially worse (Figure 9(b) and Figure 9(e)).

Review #2
By Sujit Rokka Chhetri submitted on 05/Oct/2018
Suggestion:
Minor Revision
Review Comment:

The manuscript proposes a specificity-based metric to find relevant entity-specific nodes and edges to aid the embedding of the knowledge graph. The authors use this metric along with biased random walks to extract relevant substructures. These substructures are then used to learn the embedding vectors. The manuscript is well written and easy to follow.

Strengths:

1. The use of the specificity metric has allowed the authors to achieve relatively higher precision and recall values while extracting small substructures, compared to other state-of-the-art algorithms.
2. The manuscript provides additional results to validate the use of the specificity metric compared to their previous work "Extracting Entity-Specific Substructures for RDF Graph Embedding."
3. The authors further elucidate the semantics of specificity-based embeddings using Figure 10, where similar entities are located close together when the vectors are projected to 2 dimensions using PCA.
4. Compared to their previous work, the authors have extended the manuscript by also comparing the specificity-based knowledge graph embeddings on regression and classification tasks, while also introducing a new metric that captures the hierarchical classes of entities when calculating specificity.

Weakness:

1. Although the authors provide an experimental evaluation of the computational complexity of the proposed specificity computation, the theoretical bounds of Algorithms 1 and 2 are not analyzed.
2. In Algorithm 2 and in the rest of the paper, no intuition is provided for the selection of the value of β, either theoretically or experimentally. Although the authors mention the diminishing influence in the last paragraph of Section 4.1, it would be more convincing to see results on the effect of varying it.
3. Section 6.3.1, which describes the applicability of the new hierarchical specificity metric, seems shallow. Compared to specificity, the hierarchical specificity seems to improve only one semantic relation, "dbo:language", shown in Table 1.
4. Moreover, the hierarchical specificity-based metric did not improve the average number of walks per entity for subgraph extraction in Figure 8. In fact, it got worse than the specificity-based metric alone.

Detailed comments

1. Section 2, paragraph 3, “Unliked [14]” -> “Unlike [14]”.
2. Figures 3, 4, and 10 could be updated with optimized spacing and figure-to-text ratio.
3. Figures 6(a), 6(b), and 7(a) are too dense. One way to improve them would be to change the range of the y-axis to 50-100% for specificity in Figure 6(a) and 30-100% in Figure 6(b).
4. Figure 8 needs more explanation; for instance, in 8(c), depth 2 in some cases has a lower average number of random walks per entity than depth 3, which is not explained in the text.
5. The decrease in precision and recall for “db:book” in Figures 9(b) and 9(e) for specificity-based embeddings is not explained.
6. For Figures 10 and 11, hierarchical specificity-based results are not present. It would be interesting to compare them with the other baselines as well.

Review #3
Anonymous submitted on 09/Oct/2018
Suggestion:
Accept
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This paper is aimed at extracting entity-specific substructures for KGs. The most important contribution is defining and computing "specificity" to quantify the relevance of a relation in a subgraph. Compared with previous methods, which focus more on "popular" entities, this paper focuses on "relevant" ones.

This paper is an extended edition of the authors' conference paper 'Extracting Entity-specific Substructures for RDF Graph Embedding', which was published at IRI. The core contribution of the two papers is proposed in the conference paper. The additional contributions of this paper are: 1) more experiments (using the embeddings for regression and classification tasks); and 2) a variation of specificity which takes into account the hierarchy of classes in the schema ontology associated with a KG.

The idea of this paper is interesting, and this paper is also clearly written.