A Semantic similarity measure for predicates in Linked Data

Tracking #: 2104-3317

Authors: 
Rajeev Irny
P Sreenivasa Kumar

Responsible editor: 
Jens Lehmann

Submission type: 
Full Paper
Abstract: 
Semantic similarity measures are used in several applications such as link prediction, entity summarization, knowledge-base completion, and clustering. In this paper, we propose a new semantic similarity measure called Predicate Semantic Similarity (PSS), specifically for predicates in linked data. Accounting for the apparent similarity between a pair of inverse predicates such as influences and influenced-by is one of the motivations for the work. We exploit implicit semantic information present in linked data to compute two quantities that capture the context and (semantic) proximity aspects of a given pair of predicates, respectively. We build on the Normalized Semantic Web Distance (NSWD) and generalise it to predicates to handle the context aspect. We also propose a novel measure based on neighbourhood-formation computation on a bipartite graph of predicates and classes to capture the proximity aspect. Thus we compute similarity along two semantic facets, namely context and proximity; a weighted sum of these gives us the new measure PSS. Through experiments, we evaluate the performance of PSS against existing similarity measures, including RDF2Vec. We find that including only one of context or proximity is insufficient. We create ground truths to facilitate a thorough evaluation. The results indicate that PSS improves over all the existing measures of semantic similarity between predicates.
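
A minimal sketch of the combination step described in the abstract, assuming both facet scores are already normalised to [0, 1]; the weight w, the function names, and the stub values are illustrative assumptions, not taken from the paper:

    def pss(p, q, context_sim, proximity_sim, w=0.5):
        """Predicate Semantic Similarity as a weighted sum of the
        context facet and the proximity facet (w is a tunable weight)."""
        return w * context_sim(p, q) + (1 - w) * proximity_sim(p, q)

    # Usage with stub facet functions:
    sim = pss("influences", "influenced-by",
              context_sim=lambda p, q: 0.8,    # stub context score
              proximity_sim=lambda p, q: 0.9)  # stub proximity score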
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Afshin Sadeghi submitted on 18/Mar/2019
Suggestion:
Reject
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions, which include (1) originality, (2) significance of the results, and (3) quality of writing.

1. The methods proposed in the paper seem incremental. The first method is a co-occurrence measure conditioned on the "type" relation, under the assumption that "type" relations always exist. The second proposed method looks like an incremental extension of [9] (Neighborhood Formation and Anomaly Detection in Bipartite Graphs) to predicates and "class" entities, under the assumption that such entities exist in the dataset, with the comparisons done in vector space.
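
For reference, a minimal sketch of the neighbourhood-formation idea from [9], i.e., random walk with restart on a bipartite graph; the biadjacency matrix, restart probability, and iteration count are illustrative assumptions:

    import numpy as np

    def neighborhood_scores(A, query, c=0.15, iters=100):
        """Relevance of every left-side node to a query left-side node,
        computed by random walk with restart on a bipartite graph.
        A: |L| x |R| biadjacency matrix (e.g., predicates x classes);
        assumes every row and column of A has at least one nonzero entry."""
        P_lr = A / A.sum(axis=1, keepdims=True)      # left -> right transitions
        P_rl = A.T / A.T.sum(axis=1, keepdims=True)  # right -> left transitions
        v = np.zeros(A.shape[0])
        v[query] = 1.0                               # restart distribution
        u = v.copy()
        for _ in range(iters):
            # One step left -> right -> left, with restart probability c.
            u = (1 - c) * (P_rl.T @ (P_lr.T @ u)) + c * v
        return u  # u[i] is the relevance score of left node i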

2. I am not sure how the results are compared to RDF2Vec: the authors have defined a similarity measure for predicates, but RDF2Vec creates vectors only for entities, not for predicates. Also, it is not clear which RDF2Vec model used in the evaluation is trained on which dataset. I would also request the source code of the evaluations for the sake of reproducibility.

3. The paper is difficult to read and the definitions of the methods are hard to follow. I suggest restructuring the paper to make the content coherent and revising the definitions. For example, the "Neighbourhood Formation" definition was easily understandable from [9], but I got confused reading it here. The notation used in Algorithm 1 and in the definitions is not explained and needs to be added.

General comment:
Due to low scores on paper structure, readability, and the evaluation section, I suggest rejection of the work. I offer a couple of suggestions that the authors can consider in case they target a resubmission:
I suggest that the problem definition of the paper be revised: instead of targeting similarity in Linked Data in general, which may not include the "type" predicate and "class" entities assumed in the article, the authors could define the similarity on the knowledge bases that they include in their evaluation. For the sake of evaluation and the open policy of the journal, I would also suggest releasing the source code, the test datasets, and the evaluations openly online. I would further suggest that the evaluation include a comparison to other graph-based methods, as well as to more general word-based similarity measures such as Word2Vec and GloVe.
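
To make the last suggestion concrete, a sketch of a word-based baseline that scores a predicate pair by the cosine similarity of embeddings of their labels; the vectors below are placeholders, in practice they would come from a pre-trained Word2Vec or GloVe model:

    import numpy as np

    # Placeholder label embeddings; in practice, load pre-trained
    # Word2Vec or GloVe vectors for the (tokenised) predicate labels.
    emb = {
        "influences":    np.array([0.21, 0.78, 0.10]),
        "influenced-by": np.array([0.25, 0.70, 0.15]),
    }

    def label_similarity(p, q):
        """Cosine similarity between the label embeddings of two predicates."""
        u, v = emb[p], emb[q]
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    print(label_similarity("influences", "influenced-by"))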

Review #2
Anonymous submitted on 17/Jul/2019
Suggestion:
Major Revision
Review Comment:

The paper presents an approach for calculating semantic similarity between predicates in Linked Data. The approach builds on the existing Normalized Semantic Web Distance to calculate context-based similarity between predicates. Furthermore, the authors propose a proximity-based similarity, which is calculated using Neighborhood Formation. The authors combine both the context-based and the proximity-based similarity into one final Predicate Semantic Similarity (PSS). The approach is evaluated on 3 datasets and compared to a number of baselines and related approaches. The evaluation is in favor of the proposed approach.

The paper is well structured, well written, and easy to follow and understand. It addresses an interesting problem, and the authors present some interesting ideas; however, there are several drawbacks.

- While the presented approach is interesting and seems to perform well, the authors don't provide a good motivation for developing it. For example, in which applications can this approach be used? The authors vaguely list several possible applications in the introduction, but then don't provide any details on how their approach could be used in these applications. It would be beneficial to see the approach applied in a real task or application, such as ontology matching or identifying similar predicates across different LOD datasets. An evaluation with a comparison to the corresponding related approaches for that task would significantly increase the value of the paper.

- The evaluation protocol is sound; however, the size of the datasets is rather small and does not allow drawing any statistically significant conclusions. The gold standard must be extended to at least a couple of hundred predicate pairs in order to be considered valid.

- It is not clear why the authors don't compare their approach to [5]. Furthermore, it would be interesting to see a comparison to other graph embeddings, such as TransR and HolE, which also embed relations.

- The example depicted in Figure 2 would be more useful if it contained more nodes and different types of predicates. It would be interesting to show the advantage of the proposed approach over the related approaches.

Minor comments:
- The figures and tables should be placed on the same page where they are referenced, e.g., Table 1 should be moved to page 9, Table 2 to page 10, Table 3 to page 11, etc.
- Remove "and" from the list of authors
- There should always be a space before a reference number, e.g., "similarity measure[8]" should be "similarity measure [8]"
- "Table1" -> "Table 1"; "Table2" -> "Table 2" etc..

Review #3
Anonymous submitted on 02/Aug/2019
Suggestion:
Minor Revision
Review Comment:

This paper tackles an interesting area of semantic similarity assessment in knowledge graphs by presenting a novel predicate-based semantic similarity metric.
Most of the paper is well written and easy to follow; however, some details are missing, which are outlined below in this review.
The major weaknesses of the paper are:
1) lack of a clear formulation;
2) the claim of scalability without a formal proof of the complexity and without empirical evidence.
The paper also lacks references to related work on similarity assessment in the biomedical domain, e.g., no reference to the predicate-based similarity used in RDF clustering by Silvia.
Page 3, Eq. 1, last line: vout(x) [the "(x)" is missing]
It is confusing to follow the running example on page 5, line 23. It would be good to show Cs(q), Co(q), and cu(q) with both values and types before calculating Cf and Cr.
E.g., it is unclear how lines 34 and 35 are obtained, given (x1 P X5) and that X5 is of type C3.
Page 7, line 12: add the formulation for the edge weights, along with an example.
I would recommend taking Figure 1 and creating a running example for 4.2 (PS) as well.
- Scalability and performance: what is the complexity of
  - creating the bipartite graphs with the class information?
  - calculating the edge weights for both graphs?
  - the subsequent proximity calculation?
Given that KGs are essentially incomplete, how would you handle missing class information, or the situation where most of the entities belong to the class Thing?
There is no discussion of the sparsity of the resulting graphs and of the resulting vectors.
It is important to discuss how many subjects and objects have class information in the selected KGs; these statistics are currently missing.
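
One way to gather such statistics, sketched with SPARQLWrapper; the endpoint URL is illustrative, and the queries assume the KG expresses class membership via rdf:type (the "a" keyword):

    from SPARQLWrapper import SPARQLWrapper, JSON

    def count(endpoint_url, query):
        """Run a single-row COUNT query and return the integer result."""
        sw = SPARQLWrapper(endpoint_url)
        sw.setQuery(query)
        sw.setReturnFormat(JSON)
        res = sw.query().convert()
        return int(res["results"]["bindings"][0]["n"]["value"])

    url = "http://localhost:8890/sparql"  # illustrative endpoint
    total = count(url, "SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE { ?s ?p ?o }")
    typed = count(url, "SELECT (COUNT(DISTINCT ?s) AS ?n) WHERE { ?s a ?c }")
    print(f"subjects with class information: {typed} of {total}")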
How do you handle hierarchical class information? Do you take any measures to incorporate this information?
Both the context-based similarity and the proximity-based similarity take the class information into account, yet I do not see a comparative discussion of the two similarity metrics proposed in the paper, as both use only the class information of entities.
This limits the application to clean KGs with generous class information. How do you deal with the problem of missing classes?
Have you thought about using common objects/classes in one of these metrics?
Why has RDF2Vec not been tested on the GeoSpecies and SWDF datasets?
The evaluation protocol is valid and the top-10 similar predicates are interesting.

In summary, the paper is well written, but more details need to be added for a clearer understanding of the proposed approach. In addition, important aspects and shortcomings of the approach (e.g., class information, the sparsity of the graphs, and complexity) should be discussed in detail.


Comments

There seems to be an error in Equation 2: log f_lambda(x) is used twice in the min function; perhaps the second x should be a y. Also, the formula could be simplified by taking the log function outside the max and min functions.
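
For reference, the NGD-style template that Equation 2 presumably follows, with the second argument of the min corrected to y; this is a reconstruction, not the authors' published formula:

    \[
      \mathrm{NSWD}(x, y) =
        \frac{\max\bigl(\log f_\lambda(x),\, \log f_\lambda(y)\bigr) - \log f_\lambda(x, y)}
             {\log N - \min\bigl(\log f_\lambda(x),\, \log f_\lambda(y)\bigr)}
    \]

Since log is monotonically increasing, \max(\log a, \log b) = \log \max(a, b), so the expression can indeed be simplified by pulling the logs outside:

    \[
      \mathrm{NSWD}(x, y) =
        \frac{\log\bigl(\max(f_\lambda(x), f_\lambda(y)) / f_\lambda(x, y)\bigr)}
             {\log\bigl(N / \min(f_\lambda(x), f_\lambda(y))\bigr)}
    \]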