Analysis of the Performance of Representation Learning Methods for Entity Alignment: Benchmark vs. Real-world Data

Tracking #: 3775-4989

Authors: 
Ensiyeh Raoufi
Bill Gates Happi Happi
Pierre Larmande
Francois Scharffe
Konstantin Todorov

Responsible editor: 
Guest Editors OM-ML 2024

Submission type: 
Full Paper

Abstract: 
Representation learning for Entity Alignment (EA) aims to map, across two Knowledge Graphs (KGs), distinct entities that correspond to the same real-world object, using an embedding space. Hence, the similarity of the learned entity embeddings serves as a proxy for that of the actual entities. Although many embedding-based models show very good performance on established synthetic benchmark datasets, in this paper we demonstrate that benchmark overfitting limits the applicability of these methods in real-world scenarios, where we deal with highly heterogeneous, incomplete, and domain-specific data. While there have been efforts to employ sampling algorithms to generate benchmark datasets that reflect real-world scenarios as closely as possible, there is still a lack of comprehensive analysis and comparison of the performance of methods on synthetic benchmark datasets versus original, real-world heterogeneous datasets. In addition, most existing models report their performance after excluding from the alignment candidate search space all entities that are not part of the validation data. This under-represents the knowledge and the data contained in the KGs, limiting the ability of these models to find new alignments in large-scale KGs. We analyze models with competitive performance on widely used synthetic benchmark datasets, such as the cross-lingual DBP15K. We compare the performance of the selected models on real-world heterogeneous datasets beyond DBP15K and show that, due to the above-mentioned drawbacks, most current approaches cannot effectively discover mappings between entities in the real world. We compare the studied methods from different aspects and measure joint semantic similarity and profiling properties of the KGs to explain the models' performance drop on real-world datasets. Furthermore, we show how tuning EA models by restricting the search space to validation data only affects the models' performance and causes generalization issues. By addressing practical challenges in applying EA models to heterogeneous datasets and providing valuable insights for future research, we signal the need for more robust solutions in real-world applications.
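To make the abstract's search-space point concrete, the following minimal sketch (synthetic data and hypothetical sizes, not the paper's actual setup) contrasts Hits@1 when alignment candidates are restricted to the test-pair targets versus drawn from the full target KG. Since the restricted candidate set is a subset of the full one and still contains the gold match, the restricted score can only be higher:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 1000 target entities, 64-d embeddings; the first
# 100 source entities are gold test pairs aligned to targets 0..99.
src = rng.normal(size=(100, 64))                        # test source embeddings
tgt_all = rng.normal(size=(1000, 64))                   # ALL target embeddings
tgt_all[:100] = src + 2.0 * rng.normal(size=src.shape)  # noisy gold matches

def hits_at_1(src_emb, tgt_emb, gold):
    # Cosine similarity: L2-normalize, then take inner products.
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    pred = (s @ t.T).argmax(axis=1)  # nearest target for each source entity
    return (pred == gold).mean()

gold = np.arange(100)
print("Hits@1, search space restricted to test targets:",
      hits_at_1(src, tgt_all[:100], gold))
print("Hits@1, full target entity set:",
      hits_at_1(src, tgt_all, gold))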
Tags: 
Reviewed

Decision/Status: 
Minor Revision

Solicited Reviews:
Review #1
By Vasilis Efthymiou submitted on 06/Feb/2025
Suggestion:
Accept
Review Comment:

Overall, I am happy with the changes made in the revised version and I can now recommend its acceptance.
Please see my previous comments regarding the readability of the code on GitHub.
Thank you!

Review #2
Anonymous submitted on 19/Mar/2025
Suggestion:
Minor Revision
Review Comment:

The authors have addressed several concerns raised in the initial review:
1. Expanded experiments: results on the ICEWS-WIKI and ICEWS-YAGO datasets using RDGCN (Table 4), though other models such as BERT-INT and i-Align were not tested on these datasets.
2. Clarified the JS divergence calculation (Table 2, footnote 6) by noting that the values were scaled by 100 to express them as percentages (see the sketch after this list).
3. Added a discussion of algorithmic enhancements, such as interaction training models, as robust solutions for real-world data.
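On point 2, a minimal sketch of that computation, assuming base-2 JS divergence over two hypothetical property distributions (the actual distributions in Table 2 come from the paper's KG profiling):

import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical relation-type distributions of two KGs (each sums to 1).
p = np.array([0.50, 0.30, 0.20])
q = np.array([0.25, 0.25, 0.50])

# SciPy returns the JS *distance* (the square root of the divergence),
# so square it; base=2 bounds the divergence in [0, 1].
jsd = jensenshannon(p, q, base=2) ** 2

# Scaled by 100 so it reads as a percentage, as in the paper's Table 2.
print(f"JS divergence: {100 * jsd:.1f}")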

Remaining Weaknesses:
1. Include results for at least one interaction training model (e.g., BERT-INT) on the ICEWS datasets to compare against RDGCN and validate the claim that interaction training generalizes better.
2. The authors advocate for interaction training models, but a deeper analysis of why these models handle heterogeneity better is needed, for instance qualitative examples showing how interaction training reduces noise (see the sketch below).
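One way to illustrate point 2: the pairwise-interaction idea (in the spirit of BERT-INT's neighbor-by-neighbor similarity matrices with max-pooling; this sketch is heavily simplified and is not the authors' implementation) lets a few strong neighbor matches survive even when an aggregated neighborhood embedding would be dominated by heterogeneous or noisy neighbors:

import numpy as np

def interaction_score(nbrs_a, nbrs_b):
    # Compare every neighbor embedding of entity A with every neighbor of
    # entity B, then max-pool: the best match per neighbor is kept, so a
    # handful of strong correspondences is not washed out by averaging
    # over a noisy, heterogeneous neighborhood.
    a = nbrs_a / np.linalg.norm(nbrs_a, axis=1, keepdims=True)
    b = nbrs_b / np.linalg.norm(nbrs_b, axis=1, keepdims=True)
    sim = a @ b.T                                  # |A| x |B| cosine matrix
    return 0.5 * (sim.max(axis=1).mean() + sim.max(axis=0).mean())

# Toy usage with random neighbor sets of different sizes in 16 dimensions.
rng = np.random.default_rng(1)
print(interaction_score(rng.normal(size=(4, 16)), rng.normal(size=(6, 16))))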

1. Originality: The paper addresses a critical issue in entity alignment (EA) research: benchmark overfitting and the gap between synthetic and real-world dataset performance. The categorization of embedding-based EA methods into four groups and the analysis of dataset heterogeneity provide a valuable and fresh perspective. While benchmarking models on real-world vs. synthetic data is not entirely novel, the depth of the heterogeneity analysis and the focus on interaction training models for real-world scenarios are original contributions.
2. Significance of the results: The findings reveal a performance drop on real-world datasets. The authors' identification of semantic similarity as a key factor influencing model performance, and their recommendation of interaction training models for real-world applications, are impactful.
3. Quality of writing: The paper is well-structured, but more intuitive explanations would help in sections such as the metric definitions. There are still many typos and grammatical errors to be addressed; see the details below:
Page 2 line 16: proving -> providing
Page 2 line 38: Embdding -> Embedding
Page 4 line 50: Similarity -> Similarly
Page 7 line 38: diffirent -> different
Page 14 line 3: resutling -> resulting

Page 3 line 15-16: Research has been done on extracting more realistic EA benchmark datasets [30] from large knowledge bases like DBpedia [36].
Rephrase suggestion: Previous research has extracted more realistic EA benchmark datasets [30] from large knowledge bases like DBpedia [36].

Page 12 line 18: a higher amount of semantic heterogeneity: Consider using "degree" instead of "amount"

Ensure consistent usage of "real-world" vs. "real world" throughout (prefer "real-world")