Review Comment:
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.
The general problem tackled in this paper is data linking, where the aim is to build a system that can detect whether two different descriptions refer to the same real-world entity. To this end, the authors address the well-known problem of setting the reconciliation threshold (i.e. a value in [0,1] above which a system decides that a pair of instances refers to the same real-world entity) in numerical data linking approaches.
They propose an approach named STEM that uses a stacking principle: it creates an ensemble of base classifiers and then combines their results by means of a supervised learner. To consider different values of the threshold, they use a fixed amplitude a in [0,1]. For the classification part they use two different base classifiers: Duke, which is based on a Naive Bayes classifier, and Silk, which is a linear classifier. As the supervised learner they use an SVM with an RBF kernel.
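To check my understanding of the combination scheme, here is a minimal sketch in Python (my own reconstruction, not the authors' implementation; a single base classifier is used for brevity, and the toy data and all names are illustrative assumptions):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    n_pairs = 200
    base_scores = rng.random(n_pairs)          # similarity scores from one base linker (e.g. Duke or Silk)
    labels = (base_scores > 0.6).astype(int)   # toy ground truth, illustrative only

    amplitude = 0.1                            # the fixed amplitude a in [0, 1]
    thresholds = np.arange(0.0, 1.0 + amplitude, amplitude)

    # One binary decision per threshold: the ensemble of thresholded base classifiers.
    features = (base_scores[:, None] >= thresholds[None, :]).astype(int)

    # The supervised combiner: an SVM with an RBF kernel.
    meta = SVC(kernel="rbf", gamma="scale")
    meta.fit(features, labels)
    print(meta.predict(features[:5]))

If this reconstruction is inaccurate, that in itself argues for including detailed algorithms in the paper (see general comment (2) below).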
Several experiments are reported, in which the authors show: (i) improvements in precision and recall when STEM is used on top of a classifier such as Duke or Silk; (ii) the low dependency of STEM on the size of the training set; and (iii) the effectiveness of STEM in the context of knowledge base generation. In total, three datasets are used: the FEIII2016 challenge dataset, DOREMUS, and the 3cixty dataset.
The paper is well written and all the needed notions are explained. However, it needs improvements on several points:
1- The theoretical contribution should be clarified, enriched and/or better explained.
2- Clarify the assumptions underlying the considered data linking problem: the open/closed-world assumption? The unique name assumption? Is the ontology mapping problem considered solved, and if not, how is it dealt with? How is the set of possible property mappings used? Is the existence of an ontology required?
3- Does the approach consider object properties? Is there any decision propagation? This calls for a comparison with existing approaches (e.g. Dong et al. 2005, Sais et al. 2009, Al-Bakri et al. 2016). What about the inverse direction of the properties?
4- The scalability of the approach is not discussed at all. While in the LOD we manage datasets of millions of triples (DBpedia, YAGO, …), the datasets used here do not exceed a thousand instances. What about runtime?
General comments:
(1) originality: the problem of finding a good threshold has existed since data reconciliation (record linkage) was first formalised, i.e., since the work of Fellegi and Sunter. The idea of using a stacking approach to deal with this problem seems to be original.
(2) significance of the results: from an experimental point of view, the authors show on different datasets the effectiveness of their approach. On the theoretical side, however, the real contributions are not easily identifiable. To some extent, apart from taking existing works and putting them together to achieve the data linking task, one may ask which scientific challenges are actually addressed in this paper. Giving detailed algorithms may help the reader appreciate the difficulty of the problem tackled and the relevance of the proposed solution.
(3) quality of writing: I found the paper well written and easy to read. All the needed basic notions are introduced and well explained. The experiments are well commented and discussed.
Detailed comments:
Section 2: When the authors give a classification of the existing work (“By taking into account the matching method, …”), they should give references for each category of work. In the related work, the authors should also clearly position their work with respect to existing approaches.
Section 3: in the formalisation given by Fellegi and Sunter, “unmatch” (non-link) decisions are also considered. Why are they not considered part of the problem here? Since a kind of blocking is used, unmatch decisions should be of interest as well.
Section 4: the six items explain how STEM works, and Figure 2 depicts the different steps. However, only the first three are depicted, and blocking is not listed among the six steps. The figure should be clarified.
Subsection 4.2:
The probabilistic model appears similar to the one used by Fellegi and Sunter; if it is not, a theoretical comparison is needed.
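For reference, the standard Fellegi and Sunter decision rule I have in mind for this comparison (textbook formulation, not a claim about the paper's model): for a record pair with comparison vector \gamma,

    W(\gamma) = \log \frac{P(\gamma \mid M)}{P(\gamma \mid U)}

with two thresholds T_\mu \ge T_\lambda: declare a link if W(\gamma) \ge T_\mu, a non-link if W(\gamma) \le T_\lambda, and a possible link (clerical review) otherwise.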
I think there is an error in the sentence “Finally, assuming that, a priori, P(Match) = P(No Match)”. It should presumably read “P(Match) = 1 - P(No Match)”.
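To make the two readings explicit (they are not equivalent, so the authors should state which one is intended):

    P(\mathrm{Match}) = 1 - P(\mathrm{No\ Match})              (holds by definition)
    P(\mathrm{Match}) = P(\mathrm{No\ Match}) = \tfrac{1}{2}   (a uniform-prior assumption)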
Sub-section 5.1: the sizes of the datasets should be given.
Sub-section 5.2: do the values of p_i, r_i and f_i correspond to the best value obtained for the best threshold, or to an average? This should be clarified.
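Assuming p_i, r_i and f_i denote the usual precision, recall and F-measure, i.e.

    p_i = \frac{TP_i}{TP_i + FP_i}, \quad r_i = \frac{TP_i}{TP_i + FN_i}, \quad f_i = \frac{2\, p_i\, r_i}{p_i + r_i},

it should still be stated whether the reported values are taken at a single best threshold or averaged over thresholds.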
Section 6: the experiments should be enriched by:
1- comparing Duke and STEM-NB on DOREMUS;
2- explaining why the LIE dataset is not used in FFIEC.
Examples of features (N) should be given for each dataset.
Minor remark:
The title of reference 27 is not given.