STEM: Stacked Threshold-based Entity Matching for Knowledge Base Generation

Tracking #: 1762-2974

Enrico Palumbo
Giuseppe Rizzo
Raphael Troncy

Responsible editor: 
Guest Editors ML4KBG 2016

Submission type: 
Full Paper
One of the major issues encountered in the generation of knowledge bases is the integration of data coming from a collection of heterogeneous data sources. A key task when integrating data instances is entity matching. Entity matching is based on defining a similarity measure between entities and classifying an entity pair as a match if the similarity exceeds a certain threshold. This parameter introduces a trade-off between the precision and the recall of the algorithm: higher threshold values lead to higher precision and lower recall, and lower values lead to higher recall and lower precision. In this paper, we propose a stacking approach for threshold-based classifiers. It runs several instances of classifiers corresponding to different thresholds and uses their predictions as a feature vector for a supervised learner. We show that this approach is able to break the trade-off between the precision and recall of the algorithm, increasing both at the same time and enhancing the overall performance of the algorithm. We also show that this hybrid approach performs better, and is less dependent on the amount of available training data, than a supervised learning approach that directly uses properties' similarity values. In order to test the generality of the claim, we have run experimental tests using two different threshold-based classifiers on two different data sets. Finally, we show a concrete use case describing the implementation of the proposed approach in the generation of the 3cixty Nice knowledge base.
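
The stacking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the threshold grid, the toy similarity scores and labels, and all function names are hypothetical, and a plain logistic regression trained by batch gradient descent stands in for the supervised learner, which the abstract does not specify. Each base classifier is the same similarity score cut at a different threshold, and the resulting binary predictions form the feature vector of the meta-learner.

```python
import math

# Hypothetical grid of thresholds, one per base classifier (an assumption).
THRESHOLDS = [0.2, 0.4, 0.6, 0.8]

# Toy similarity scores for entity pairs and their match labels (invented data).
TRAIN_SCORES = [0.10, 0.30, 0.50, 0.55, 0.65, 0.70, 0.90, 0.95]
TRAIN_LABELS = [0, 0, 0, 0, 1, 1, 1, 1]


def threshold_features(score, thresholds=THRESHOLDS):
    """Run one threshold-based classifier per threshold; return their 0/1 votes."""
    return [1.0 if score >= t else 0.0 for t in thresholds]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_meta_learner(X, y, iterations=2000, lr=0.5):
    """Fit a logistic regression on the stacked votes with batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(iterations):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_match(score, w, b):
    """Classify an entity pair from the votes of the base classifiers."""
    x = threshold_features(score)
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x))) >= 0.5


X_train = [threshold_features(s) for s in TRAIN_SCORES]
weights, bias = train_meta_learner(X_train, TRAIN_LABELS)

for s in (0.45, 0.62):
    print(s, threshold_features(s), predict_match(s, weights, bias))
```

With a single similarity score, as in this toy setup, the meta-learner can only recover a decision boundary lying between two grid thresholds; the sketch shows the mechanics of stacking, not the precision and recall gains reported in the paper.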

Solicited Reviews:
Review #1
By Ondřej Zamazal submitted on 20/Dec/2017
Review Comment:

I would like to thank the authors for their work on improving the paper. They successfully addressed all my remarks. Additionally, I spotted two minor typos:
* regarding runtime performance in Section 6.3, I would say that there should be T_{STEM} instead of T_{total}.
* in Section 7, the authors added an explanation for "rigid search mechanism". Please check the typo related to "are" in "...resolved any conflict of representation by optimizing the selection criteria are:".

Review #2
Anonymous submitted on 06/Jan/2018
Minor Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The paper has improved significantly since the initial submission. It has been reorganised and partly rewritten to take most of the reviewers' suggestions into account.

My remaining concerns are:
- The related work section has been enriched. However, it lacks a deep comparison with existing entity matching works, apart from noting that the STEM approach can be used on top of any pairwise numerical threshold-based classifier. Some ensemble learning approaches should be mentioned even if they do not deal with entity matching (

- I really liked the problem formulation section, but it lacks a summary paragraph that formulates the problem as an ensemble learning problem over a set of entity matching decisions provided by different threshold-based systems.

- Section 4.2 is clearer now and supports the soundness of the proposed approach. Maybe the authors should give an idea of how \lambda in equation (22) is estimated (this is important for being convinced by equations (26) and (27)).

- minor remarks:
- paragraph before definition 3: “… e1 and e2 is carried out on a set OF literal value …”
- in definition 5: add a line break.
- section 4.3: “However, as a rule of thumb, ….. that: O(N ∗ g^2) < Ttrain(N, g)(N, g) < O(N ∗ g^3)” ==> “However, as a rule of thumb, ….. that: O(N ∗ g^2) < Ttrain(N, g) < O(N ∗ g^3)”
- section 4.3: use the LaTeX symbol ‘\leq’ instead of ‘<=’

Review #3
By Mohamed Sherif submitted on 09/Jan/2018
Review Comment:

This is the second version of the article “STEM: Stacked Threshold-based Entity Matching for Knowledge Base Generation”.
I thank the authors for addressing all the raised issues. The paper is definitely suitable for publication.

Here are some minor remarks for the authors to address in the camera-ready version of the paper:
• Definition 5: The confidence vector equation exceeds the column limit.
• Definition 6: “… matching systems in stating that …” -> “… matching to state that …”
• In my opinion, it would be better to distinguish the \hat{f} used in definition 7 from the one used in definition 3, maybe by adding a sub- or superscript.