On the Efficient Execution of Bounded Jaro-Winkler Distances

Tracking #: 944-2155

Authors: 
Kevin Dressler
Axel-Cyrille Ngonga Ngomo

Responsible editor: 
Guest Editors Ontology and Linked Data Matching

Submission type: 
Full Paper
Abstract: 
Over the last years, time-efficient approaches for the discovery of links between knowledge bases have been regarded as a key requirement towards implementing the idea of a Data Web. Thus, efficient and effective measures for comparing the labels of resources are central to facilitate the discovery of links between datasets on the Web of Data as well as their integration and fusion. We present a novel time-efficient implementation of filters that allow for the efficient execution of bounded Jaro-Winkler measures. We evaluate our approach on several datasets derived from DBpedia 3.9 and LinkedGeoData and containing up to $10^6$ strings and show that it scales linearly with the size of the data for large thresholds. Moreover, we also show that our approach can be easily implemented in parallel. We also evaluate our approach against SILK and show that we outperform it even on small datasets.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 09/Feb/2015
Suggestion:
Major Revision
Review Comment:

This paper focuses on the efficient execution of bounded Jaro-Winkler measures, especially for large data collections. It is well written and contains interesting results for the particular topic. The only exceptions are Section 3.3.2, which needs revisiting, and the evaluation experiments, which need a bit more information (see the following detailed comments).

The authors state that this submission is an extension of [11], which was presented at an ISWC workshop. For completeness, please explain, even briefly, the additions included in this submission in comparison to the workshop paper.

Section 2.2:
o Please explain t and provide an example.
o The use of t both as a symbol and as a letter within a word confuses the reader. It would be good to use something else, for example quotes for the letters and italics for the symbol.
o Why is the "boost threshold" set to 0.7? How was this value chosen? Consider moving this information to the experiments section, since it does not really belong in the preliminaries.
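
For context on this comment: 0.7 is the boost-threshold value from Winkler's original implementation, below which the prefix boost is simply not applied. In the standard formulation (written here in our own notation, which may differ from the paper's):

```latex
\sigma_{jw}(s,t) =
\begin{cases}
  \sigma_j(s,t) + \ell \, p \, \bigl(1 - \sigma_j(s,t)\bigr) & \text{if } \sigma_j(s,t) > b_t,\\
  \sigma_j(s,t) & \text{otherwise,}
\end{cases}
```

where \sigma_j is the plain Jaro similarity, \ell is the length of the common prefix of s and t (capped at 4), p = 0.1 is the scaling factor, and conventionally b_t = 0.7.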

Section 3.1:
o What is the relation between m and t?
o After Eq. 5 the authors write that "we do not have to actually compute ?" - please explain this, e.g., give the intuition behind it.

Section 3.2:
o "Note that for practical usage of this bound the threshold must be chosen greater than ..."? - please explain this.

Section 3.3.2: This part of the paper is not clearly written and must be revised for being able to fully understand it. For instance:
o Please provide an example of a tie using "hello"
o What is the relation between ?(t_x) and B_?(s_x)?
o What does B_u denote?

Section 4:
o The authors need to explain the algorithms used in the plot figures, i.e., "naive", "range(.)", "length(.)", ..., "r+l+f2", etc. This information can be included before, or directly after, Section 4.1.
o As far as I know, DBpedia does not really contain all required matches, and in some cases the included matches are incorrect. Did the authors observe such behavior, and if so, how does this affect the evaluation?

Spelling, etc.
- Section 3.2: a "." is missing before sentence starting with "Using Equation 6 and Equation 7...."
- the sentence directly before section 3.3.1: "ist"
- first sentence of Section 3.3.1 needs to be rewritten
- last line of page 6: "they are been widely used ..."

Review #2
Anonymous submitted on 16/Feb/2015
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

This paper addresses the problem of sameAs link discovery, in particular the problem of improving the efficiency of similarity measures. To this end, the authors present a new implementation of the Jaro-Winkler distance, which is known to work well for comparing person names. To optimize the execution of this distance, a series of filters is presented in the paper: some are based on the lengths of the strings, others on character frequencies. These filters are used to avoid useless similarity computations. Furthermore, the authors also propose several extensions of the Jaro-Winkler distance by introducing the notion of an upper bound, which is used to estimate the maximum possible similarity of two strings.
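
The length-based filtering idea the review refers to can be illustrated with a short sketch (a minimal illustration, not the authors' actual implementation; `jaro_length_bound` and the threshold value are hypothetical). From the lengths alone, the Jaro similarity is maximized when every character of the shorter string matches with no transpositions, which gives an upper bound that lets us skip pairs that cannot reach the threshold:

```python
def jaro(s: str, t: str) -> float:
    """Standard Jaro similarity between two strings."""
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    if ls == 0 or lt == 0:
        return 0.0
    # characters count as matches only within this sliding window
    window = max(ls, lt) // 2 - 1
    s_used, t_used = [False] * ls, [False] * lt
    m = 0
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(lt, i + window + 1)):
            if not t_used[j] and t[j] == c:
                s_used[i] = t_used[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # transpositions: half the matched characters that appear out of order
    t_matches = [t[j] for j in range(lt) if t_used[j]]
    s_matches = [s[i] for i in range(ls) if s_used[i]]
    transpositions = sum(a != b for a, b in zip(s_matches, t_matches)) // 2
    return (m / ls + m / lt + (m - transpositions) / m) / 3


def jaro_length_bound(ls: int, lt: int) -> float:
    """Upper bound on jaro(s, t) from the lengths alone: at best,
    every character of the shorter string matches, with no
    transpositions."""
    if ls == 0 or lt == 0:
        return 0.0
    m = min(ls, lt)
    return (m / ls + m / lt + 1.0) / 3


# Filter: discard pairs whose length bound already falls below the
# threshold, before running the quadratic character-matching step.
theta = 0.95
pairs = [("MARTHA", "MARHTA"), ("DWAYNE", "D")]
survivors = [(s, t) for s, t in pairs
             if jaro_length_bound(len(s), len(t)) >= theta]
```

Pairs rejected by the bound never need the full similarity computation; the paper's contribution refines this idea with tighter bounds and additional character-frequency filters.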

The approach presented in this paper seems to be sound and original in the link discovery research field. The paper is in general well written. However, to allow reproducibility of the results, I would suggest some clarifications and additional explanations.

- In section 2.1, in the definition of M’, the authors use gamma(s, t) at the beginning of the formula, but in the definition itself they use the function sigma(s, t), which is not defined.
- In the beginning of section 3 the function theta_e(s, t) is not defined.
- In the explanation of the Jaro-Winkler extensions, some equation derivations are not obvious to understand, especially the derivation of Equations 10 and 11 from Equation 9.
- In section 3.3.2, the authors introduce the tree named ‘tau’ without explaining how this tree is built and what its nodes, edges, and root are.

My concern with this paper resides in the fact that the experiments do not provide qualitative results of the approach. How generic is this approach? As we know (Cohen et al. 2003), it has been shown that the Jaro-Winkler distance obtains good results when comparing person names. It is, however, difficult to generalize these results to all properties describing persons (date of birth, addresses, …), and it is even more difficult to generalize them to other datasets, since the quality of data linking results depends mainly on the choice of the similarity measures (Levenshtein, Jaccard, …) used to compare the descriptions. It would be good if the authors extended the paper:
- by giving more experiments on the quality (recall, precision and F-measure) of the obtained links on more than one dataset, including datasets that do not concern person descriptions.
- by discussing how the proposed approach can be applied to other similarity measures, and how it can be used by a data linking tool where the similarity measures can be chosen differently for different properties.

The related work, which I found well written, could be improved by covering more semantic approaches to blocking, i.e., those that use ontology knowledge such as disjunction axioms between classes to filter out the linking space (Sais et al., 2009).