STEM: Stacked Threshold-based Entity Matching for Knowledge Base Generation

Tracking #: 1548-2760

Enrico Palumbo
Giuseppe Rizzo
Raphael Troncy

Responsible editor: Guest Editors ML4KBG 2016

Submission type: Full Paper
Abstract:
One of the major issues encountered in the generation of knowledge bases is the integration of data coming from a collection of heterogeneous data sources. A key task when integrating data instances is entity matching: a similarity measure is defined among entities, and an entity pair is classified as a match if the similarity exceeds a certain threshold. This parameter introduces a trade-off between the precision and the recall of the algorithm, as higher values of the threshold lead to higher precision and lower recall, while lower values lead to higher recall and lower precision. In this paper, we propose a stacking approach for threshold-based classifiers. It runs several instances of classifiers corresponding to different thresholds and uses their predictions as a feature vector for a supervised learner. We show that this approach is able to break the trade-off between precision and recall, increasing both at the same time and enhancing the overall performance of the algorithm. We also show that this hybrid approach performs better and depends less on the amount of available training data than a supervised learning approach that directly uses properties' similarity values. To test the generality of the claim, we have run experimental tests using two different threshold-based classifiers on two different data sets. Finally, we show a concrete use case describing the implementation of the proposed approach in the generation of the 3cixty knowledge base.
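The stacking scheme described in the abstract can be illustrated in a few lines. The following is a minimal toy sketch, not the authors' implementation: the similarity scores, the amplitude value, and the logistic-regression meta-learner (which stands in for the SVM used in the paper) are all illustrative assumptions.

```python
import numpy as np

def threshold_features(sims, amplitude=0.2):
    """One base classifier per threshold t in {0, a, 2a, ..., 1}:
    each votes 1 (match) if sim >= t, else 0 (no match).
    The votes form the feature vector fed to the meta-learner."""
    thresholds = np.arange(0.0, 1.0 + 1e-9, amplitude)
    return (np.asarray(sims, dtype=float)[:, None] >= thresholds[None, :]).astype(float)

def fit_meta(X, y, lr=0.5, epochs=2000):
    """Tiny logistic-regression meta-learner trained on the base
    classifiers' votes (a stand-in for the paper's SVM with RBF kernel)."""
    w, b = np.zeros(X.shape[1]), 0.0
    y = np.asarray(y, dtype=float)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted match probability
        grad = p - y
        w -= lr * X.T @ grad / len(y)
        b -= lr * grad.mean()
    return w, b

def predict(w, b, X):
    return (X @ w + b > 0).astype(int)

# Toy similarity scores for candidate entity pairs, with gold labels
sims   = [0.95, 0.82, 0.40, 0.15, 0.77, 0.30, 0.65, 0.55]
labels = [1,    1,    0,    0,    1,    0,    1,    0]

X = threshold_features(sims)   # shape: (8 pairs, 6 thresholds)
w, b = fit_meta(X, labels)
print(predict(w, b, X).tolist())
```

Rather than committing to one threshold (and its precision/recall trade-off), the meta-learner weighs all the thresholded decisions at once, which is what lets the stacked model exceed each individual base classifier.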
Major Revision

Solicited Reviews:
Review #1
By Ondřej Zamazal submitted on 27/Feb/2017
Minor Revision
Review Comment:

The article deals with the integration of data coming from heterogeneous sources, namely entity matching, which is an important and timely obstacle within the process of knowledge base generation. The submitted article presents original research with a compelling impact on the research community. The approach, STEM, is based on stacking, where individual threshold-based classifiers are involved. The predictions of the base classifiers are used as a feature vector for a supervised model. The article provides three different contributions in the area of entity matching and knowledge base generation, which are well supported throughout the paper. It proposes a generic ensemble-based framework improving the performance of its involved base classifiers. Next, it shows that STEM is less dependent on a large amount of training data in comparison with complex pure machine learning approaches (SVM, Random Forest). Finally, it provides a description of STEM's involvement in the generation of the knowledge base. I would prefer the article to provide more information with regard to certain aspects, and I suggest some reformulations; I specify them below. According to the call there is no strict page limit, so information can be added.

Detail remarks:

We can see the effect of the STEM approach when Duke or Silk is involved separately. What about including both Duke and Silk at the same time?

STEM-LIN has been evaluated on the DOREMUS datasets, but this evaluation is missing for STEM-NB. Adding it would make the evaluation more complete.

The runtime performance of STEM has not been evaluated in the article. This should be added to the evaluation and discussion.

The authors write that the experiment with the 3cixty knowledge base reconciliation also proved that the approach is scalable. But how is this claim supported? By the size of the given task? That information is not provided.

Regarding the data model in Section 7, the authors write that they established a rigid search mechanism. Could you be more specific about this mechanism?

Similarly, could you provide more information about the two teams of investigators from the Data Sources part? How many people were involved in those teams, and what is their expertise? Perhaps the members of those teams are the experts mentioned in the corresponding text, but this is not clear.

In the experiment with knowledge base generation, the 3cixty Nice KB has been used. Could you provide more information about how the Nice KB was built?

Further remarks:

p. 2, footnote 4: “3cixtyhttps://www.3cixty.com” - the footnote label and the URL are run together.

p. 3, “Training can be used for learning matching rules, learning in which orders matchers should be applied, automatically setting critical parameters and/or determining weights to combine matchers similarity values and the most commonly used supervised learners are Decision Trees and SVM [18,13,14,15].” - this should be reformulated. Splitting it into several sentences would improve readability.

p. 3, “A pair of records is considered to be a match if the degree of confidence” → “A pair of entities...”

p. 4, After Equation 3 there is a stray dot starting a new line.

p. 4, The label for the vertical axis is missing in Figure 1.

p. 5, STEM is described using several itemized lists. This could be improved by numbering the steps of the workflow.

p. 5, “which corresponds to the decision rule Eq. 20” → “ the decision rule Eq. 2”

p. 8, Regarding the four heterogeneities: “There are 4 types of heterogeneities that these data manifest, that we have selected from the nine in task 1 and that appear to be the most problematic:” Is this selection introduced by the authors (“...we have selected...”), or is the task provided within OAEI?

p. 8, URLs in footnotes 13 and 14 are the same.

p. 9, “We extend the search space to a small subset” → “We reduce the search space to a small subset”

Review #2
Anonymous submitted on 07/Mar/2017
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The general problem tackled in this paper is the data linking problem, where the aim is to build a system that detects whether two different descriptions refer to the same real-world entity. To do so, the authors address the well-known problem of fixing the reconciliation threshold (i.e. a value in [0;1] according to which a system can decide whether a pair of instances refers to the same real-world entity) for numerical data linking approaches.

They propose an approach named STEM that uses a stacking principle to create an ensemble of base classifiers and then combines their results by means of a supervised learner. To consider different values of the threshold, they use a fixed amplitude a in [0;1]. For the classification part they use two different classifiers: Duke, which is based on a Naive Bayes classifier, and Silk, which is a linear classifier. As a supervised learner they use an SVM with an RBF kernel.

Several experiments have been conducted, where the authors give results on: (i) the improvements in precision and recall when STEM is used on top of a classifier like Duke or Silk; (ii) the low dependency of STEM on the size of the training set; and (iii) how effective STEM may be in the context of knowledge base generation. In total, three datasets have been used: the FEIII2016 challenge, DOREMUS and the 3cixty dataset.

The paper is well written and all the needed notions are explained. However, this paper needs improvements on different points:
1- the theoretical contribution should be clarified, enriched and/or better explained
2- clarify the assumptions concerning the considered data linking problem: the open/closed world assumption? The unique name assumption? Is the ontology mapping problem considered as solved? If not, how do you deal with it? How is the possible set of property mappings used? Is the existence of an ontology needed?
3- does the approach consider object properties? Is there any decision propagation? This needs a comparison with existing approaches (like Dong et al. 2005, Sais et al. 2009, Al-Bakri et al. 2016, etc.). What about the inverse side of the properties?
4- the scalability of the approach is not discussed at all. While in the LOD cloud we manage datasets of millions of triples (DBpedia, YAGO, …), the datasets used here do not exceed a thousand instances. What about runtime?

General comments:
(1) originality: the problem of finding a good threshold has existed since the problem of data reconciliation (record linkage) was formalised, i.e., since the work of Fellegi and Sunter. The idea of using a stacking approach to deal with this problem seems to be original.

(2) significance of the results: from an experimental point of view, the authors showed the effectiveness of their approach on different datasets. However, on the theoretical side, the real contributions are not easily identifiable. To some extent, apart from using existing works and putting them together to achieve the data linking task, one can ask what scientific challenges are dealt with in this paper. Giving detailed algorithms may help to appreciate the difficulty of the problem that is tackled and the relevance of the proposed solution.

(3) quality of writing: I found the paper well written and easy to read. All the needed basic notions are introduced and well explained. The experiments are well commented and discussed.

Detailed comments:

Section 2: When the authors give a classification of the existing work (“By taking into account the matching method, …”), they should give references for each category of work. In the related work, the authors should clearly position their work with regard to existing ones.

Section 3: in the formalisation given by Fellegi and Sunter, “unmatch” decisions are also considered. Why do you not consider them as a problem? Since you are using a kind of blocking, are you not interested in unmatch decisions?

Section 4: The six items explain how STEM works. Figure 2 depicts the different steps; however, only the first three are depicted, and blocking is not listed among the six steps. The figure should be clarified.

Subsection 4.2 :
The probabilistic model is similar to the one used by Fellegi and Sunter; if not, a theoretical comparison is needed.
I think there is an error when you say “Finally, assuming that, a priori, P(Match) = P(No Match)”. It should be “ P(Match) = 1- P(No Match)”.

sub-section 5.1: the size of the datasets should be given.
sub-section 5.2: do the values of p_i, r_i and f_i correspond to the best value obtained for the best threshold, or to an average? This should be clarified.
Section 6: the experiments should be enriched by:
1- compare Duke and STEM-NB on DOREMUS
2- why is the LIE dataset not used in FFIEC?

Examples of features (N) should be given for each dataset.

Minor remark:
The title of reference 27 is not given.

Review #3
By Mohamed Sherif submitted on 22/Aug/2017
Major Revision
Review Comment:

The paper proposes STEM, a framework based on stacked generalization for improving the performance of threshold-based entity matching systems. The authors evaluate STEM with two different threshold-based entity matching systems, where they show that STEM is able to increase both precision and recall of the original approaches and is weakly dependent on the amount of manually annotated examples. Finally, the authors present the 3cixty knowledge base, where STEM is used for carrying out the data reconciliation process.

The paper starts by introducing the work and giving a wide view of the related work, which I found very informative. Afterwards, the authors introduce a formal representation of the entity matching problem and the trade-off between precision and recall. In the next section, the general STEM framework is introduced, followed by the two threshold-based entity matching classifiers that are used in the rest of the paper, i.e. the Linear Classifier and the Naive Bayes Classifier. To experiment with STEM, Section 5 gives a detailed description of the experimental setup, including the datasets, performance measures and other frameworks/approaches used to carry out the experiments. In Section 6, the authors present the results of the experiments, where the aforementioned claims are justified. Finally, Section 7 presents the application of STEM in the data reconciliation process of the 3cixty knowledge base.

The paper is written in good English and well structured. The extensive details provided by the authors and the illustrative examples make the paper easy to follow.

Concerning the evaluation:
To ease reproducibility, I expected to see a list of the similarity measures/distances or link specifications used in configuring Silk and Duke.
I like the idea presented in Section 6.2 of comparing STEM vs. supervised learning on similarity values (i.e. Random Forest, Logistic and SVM). What is still missing, from my point of view, is how STEM would perform when compared with supervised link discovery approaches like the ones presented in SILK [1] and LIMES [2,3].

Given the fact that this paper is a research paper and not a dataset paper, I found Section 7 concerning the 3cixty knowledge base too long. To keep the reader focused on the STEM approach, it would in my opinion be beneficial to summarize Section 7 into two paragraphs: one for the dataset creation and the other describing how STEM is used in the dataset reconciliation process.

[1] R. Isele, C. Bizer, Learning linkage rules using genetic programming, Proceedings of the 6th International Conference on Ontology Matching-Volume 814, pp. 13-24, 2011
[2] A.C.N. Ngomo, K. Lyko, Eagle: Efficient active learning of link specifications using genetic programming, Extended Semantic Web Conference, pp. 149-163, 2012
[3] M.A. Sherif, A.C. Ngonga Ngomo, J. Lehmann, WOMBAT - A generalization approach for automatic link discovery, Proceedings of the 14th Extended Semantic Web Conference, Portoroz, Slovenia, 2017

Other comments:
- Footnote 4 has a typo
- Equation 1: replace a by e_1 and b by e_2 to be consistent with the rest of the equations
- Section 4: define the amplitude a or add a reference to section 5 where it is defined
- Left justify equations 11, 12, 13, 14, 15 and 17. The authors could gain some space by reducing the space in the equation environment in LaTeX.
- End of Section 4.2: “Duke asks to the user in an interactive” → remove “to”
- End of Section 5.1: “There are 4 types of heterogeneities that these data manifest” → “There are 4 types of heterogeneities between these data manifest”
- Section 6: “The graph clearly shows the trade-off between precision and recall of the algorithm ruled by the threshold t and that the trend for both curves is non-linear, with moderate changes in the central part and sudden variations at the sides.” → too long
- Section 7: “The output of such an investigation leads to a survey, which has is cross-validated by two domain experts who ...” → remove has
- Reference 27: missing title