Review Comment:
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.
This paper addresses the problem of sameAs link discovery, and in particular the efficiency of similarity measure computation. To this end, the authors present a new implementation of the Jaro-Winkler distance, which is known to work well for comparing person names. To optimize the execution of this distance, the paper presents a series of filters, some based on the lengths of the strings and others on character frequencies. These filters are used to avoid useless similarity computations. Furthermore, the authors propose several extensions of the Jaro-Winkler distance by introducing the notion of an upper bound, which is used to estimate the maximum similarity of two strings.
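To make the filtering idea concrete, the following is a minimal sketch of a length-based filter, not the authors' implementation. It assumes the standard Jaro-Winkler definition (prefix weight p = 0.1, prefix length capped at 4) and derives the bound by assuming that every character of the shorter string matches with no transpositions. The function names `length_upper_bound` and `link_candidates` are hypothetical.

```python
def jaro(s, t):
    """Standard Jaro similarity in [0, 1]."""
    if not s and not t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    m = 0
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    s_seq = [c for c, f in zip(s, s_matched) if f]
    t_seq = [c for c, f in zip(t, t_matched) if f]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3


def jaro_winkler(s, t, p=0.1, max_prefix=4):
    """Jaro similarity boosted by the length of the common prefix."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def length_upper_bound(len_s, len_t, p=0.1, max_prefix=4):
    """Best Jaro-Winkler score any two strings of these lengths can reach.

    Assumption: all characters of the shorter string match, there are no
    transpositions, and the common prefix is as long as allowed.
    """
    shorter, longer = sorted((len_s, len_t))
    if longer == 0:
        return 1.0
    if shorter == 0:
        return 0.0
    jaro_ub = (2 + shorter / longer) / 3
    prefix = min(max_prefix, shorter)
    return jaro_ub + prefix * p * (1 - jaro_ub)


def link_candidates(sources, targets, threshold):
    """Yield (s, t, similarity) for pairs reaching the threshold,
    skipping pairs whose length bound already rules them out."""
    for s in sources:
        for t in targets:
            if length_upper_bound(len(s), len(t)) < threshold:
                continue  # cheap filter: no full comparison needed
            sim = jaro_winkler(s, t)
            if sim >= threshold:
                yield s, t, sim
```

Because the bound depends only on the two lengths, a pair such as a 2-character and a 20-character string can be discarded at a threshold of 0.9 without running the full comparison.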
The approach presented in this paper seems to be sound and original in the link discovery research field. The paper is in general well written. However, to allow the reproducibility of the results, I would suggest some clarifications and additional explanations.
- In section 2.1, in the definition of M’, the beginning of the formula uses gamma(s, t), whereas the definition itself uses the function sigma(s, t), which is not defined.
- In the beginning of section 3 the function theta_e(s, t) is not defined.
- In the explanation of the Jaro-Winkler extensions, some derivations are not easy to follow, especially the derivation of equations 10 and 11 from equation 9.
- In section 3.3.2, the authors introduce the tree named ‘tau’ without explaining how this tree is built or what its root, nodes, and edges represent.
My concern with this paper is that the experiments do not report results on the quality of the approach. How generic is this approach? As we know (Cohen et al 2003), the Jaro-Winkler distance has been shown to obtain good results on the comparison of person names. It is, however, difficult to generalize these results to all the properties describing persons (date of birth, addresses, …), and it is even more difficult to generalize them to other datasets, since the quality of data linking results mainly depends on the choice of the similarity measures (Levenshtein, Jaccard, …) that are used to compare the descriptions. It would be good if the authors extended the paper:
- by giving more experiments on the quality (recall, precision and F-measure) of the obtained links, on more than one dataset, including datasets that do not concern person descriptions.
- by discussing how the proposed approach can be applied to other similarity measures, and how it can be used by a data linking tool in which a different similarity measure can be chosen for each property.
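For the quality evaluation requested above, the measures could be computed as in the following minimal sketch. It assumes that a reference set of correct links is available for each dataset. The function name `link_quality` and the example pairs are hypothetical and are not taken from the paper.

```python
def link_quality(found_links, reference_links):
    """Return (precision, recall, f_measure) of found links
    against a reference set of correct links."""
    found, reference = set(found_links), set(reference_links)
    true_positives = len(found & reference)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical links: pairs of (source resource, target resource).
found = {("a", "x"), ("b", "y"), ("c", "z")}
reference = {("a", "x"), ("b", "y"), ("d", "w"), ("e", "v")}
precision, recall, f_measure = link_quality(found, reference)
```

Reporting these three values for each similarity threshold would also show whether the filters leave the set of discovered links unchanged.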
The related work section, which I found well written, could be improved by covering more semantic approaches to blocking, i.e. those that use ontology knowledge, such as disjunction axioms between classes, to filter the linking space (Sais et al, 2009).