Network Metrics for Assessing the Quality of Entity Resolution between Multiple Datasets

Tracking #: 2333-3546

Al Idrissou
Frank van Harmelen
Peter van den Besselaar

Responsible editor: Guest Editors EKAW 2018

Submission type: Full Paper
Matching entities between datasets is a crucial step for combining multiple datasets on the semantic web. A rich literature exists on different approaches to this entity resolution problem. However, much less work has been done on how to assess the quality of such entity links once they have been generated. Evaluation methods for link quality are typically limited to either comparison with a ground truth dataset (which is often not available), manual work (which is cumbersome and prone to error), or crowd sourcing (which is not always feasible, especially if expert knowledge is required). Furthermore, the problem of link evaluation is greatly exacerbated for links between more than two datasets, because the number of possible links grows rapidly with the number of datasets. In this paper, we propose a method to estimate the quality of entity links between multiple datasets. We exploit the fact that the links between entities from multiple datasets form a network, and we show how simple metrics on this network can reliably predict their quality. We verify our results in a large experimental study using six datasets from the domain of science, technology and innovation studies, for which we created a gold standard. This gold standard, available online, is an additional contribution of this paper. In addition, we evaluate our metric on a recently published gold standard to confirm our findings.

Minor Revision

Solicited Reviews:
Review #1
By Dmitry Ustalov submitted on 28/Nov/2019
Minor Revision
Review Comment:

The authors performed a substantial amount of work to address the reviewers' comments, including mine. I am still not convinced that parametric metrics allow a fair evaluation of link networks, because these parameters could be overfitted to a dataset. However, the article offers an interesting discussion of network metrics and contributes datasets that are useful for future work. I recommend accepting this paper after careful proofreading.

Page iii, Algorithm 1. The algorithm is not referenced anywhere in the article except in the Future Work section. Please mention the sections in which it is used.

Page iii, Algorithm 1. Why are the strength values needed and how are they provided?

Page iii, Algorithm 1. Please define N and C before they are used.

Page iii, Algorithm 1. C_1 and C_2 terms are not initialized. How can they be accessed in lines 19, 20, 22?

Page iii, Algorithm 1. A small illustration would simplify the explanation of the algorithm.

Page iv, line 10; Page xiv, line 46. Why is sigmoid defined as \frac{x}{|x| + \nu} with a parameter \nu? Why did the authors deviate from the commonly used logistic function for sigmoid?

Page x, line 17. I would mention the Majority Class Classifier before the e_Q metric in the caption because the header has it in a different order: MCC | e_Q.
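
To make the sigmoid question above concrete, the following minimal sketch compares the form quoted from the paper, \frac{x}{|x| + \nu}, with the standard logistic function. The function names and the choice \nu = 1 are illustrative assumptions and are not taken from the paper. The two functions have different ranges: the quoted form maps to (-1, 1) and is 0 at x = 0, while the logistic function maps to (0, 1) and is 0.5 at x = 0.

```python
import math


def algebraic_sigmoid(x: float, nu: float) -> float:
    """The form quoted in the review: x / (|x| + nu). Assumes nu > 0."""
    return x / (abs(x) + nu)


def logistic(x: float) -> float:
    """The commonly used logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


if __name__ == "__main__":
    # nu = 1.0 is an arbitrary illustrative value, not the paper's setting.
    for x in (0.0, 1.0, 3.0):
        print(x, algebraic_sigmoid(x, nu=1.0), logistic(x))
```

For non-negative inputs the quoted form starts at 0 rather than 0.5, and \nu controls how quickly it approaches 1, which may be one motivation for the deviation; the paper would need to confirm this.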

I recommend that the authors pay special attention to the mathematical notation. It must be consistent, clear, and unambiguous. Otherwise, it would be difficult for other researchers to apply the methodology proposed in this article. For instance, every term must be introduced before it is used.

Review #2
Anonymous submitted on 04/Dec/2019
Review Comment:

The Reviewer thanks the authors for the effort they put into preparing the revision of the manuscript.
The Reviewer reviewed the new version by verifying the changes, also in light of the comments provided by the other Reviewers.
This Reviewer is satisfied with this new version and suggests accepting the paper for publication in the Semantic Web Journal.

Review #3
By Francesco Corcoglioniti submitted on 09/Dec/2019
Minor Revision
Review Comment:

I would like to thank the authors for having answered all the comments in my previous review and for having at least partially addressed them. Overall, I think the paper has improved, and my recommendation is minor revision, to give the authors the chance to integrate the further suggestions reported below, at their discretion.

== Major comments ==

All my previous major comments C1-C5 were addressed by authors, both in the paper and in their answer.

I'm fine with the revisions/answers for C1, C2, C4, C5:

* regarding the choice of hyper-parameters (C1), I appreciate the clarifications and the additional experiment changing \eta. The fact that different values of \eta (and especially \eta = 0.1, which mostly picks the sigmoid value) lead to similar F_1 scores raises, however, some questions about the contributions and definitions of the bridge, distance and closure metrics supporting the eq metric (e.g., the effectiveness of the normalizations used for n_b and n_d). More generally, it would be interesting -- as future work -- to carry out an experiment similar to the one varying \eta, in which different variants of eq (e.g., leaving out one or more of the bridge, distance and closure metrics, or always using the sigmoid, etc.) are tested and compared -- a sort of "feature analysis" experiment.

* regarding the performance on imbalanced data (C2), I argue that in many practical applications the majority class should be known and correspond to the positive class, otherwise it means we are carrying out an automated evaluation (via metric eq) of some entity resolution system whose results are mostly wrong and thus of limited practical use. Let's say that in these imbalanced cases, metric eq can confirm what the majority class is, and in case of large eq values we know that there is the risk that eq might be underestimating the system performance.

* regarding the non-normalized weights (C4), the normalization of edge weights using some (small) ground truth data was an example of a possible solution to the problem of edge weights lacking a precise semantics. I understand the authors' decision to avoid any ground truth data, and thus a solution like the exemplified one, and I'm fine with that. I would only like to point out that the problem of unclear edge weights remains. Apart from interpreting the meaning of a particular weight value (e.g., 1) in different datasets, there is also the problem that different datasets may present different weight ranges (e.g., [0.8, 1] in section 7, [0.1, 1] in section 8). There might be other forms of normalization worth considering that use only unlabeled edges/ILNs (e.g., to determine the min-max weight range and re-scale accordingly, or to use the unweighted eq metric as an "oracle" to roughly estimate edge correctness probabilities). Coming up with a weighted eq metric that is effective for any given weight configuration, without any normalization or tuning of metric hyper-parameters, seems a very difficult task to me. That said, I'm fine with the authors' response.

* regarding the expert vs. non-expert evaluation (C5), I appreciate the clarifications and I understand that collecting ground truth data was expensive. Based on that, I see the expert annotations as a kind of validation of the main ground truth collected by the non-expert annotator, and I wouldn't say (as I might have hinted before) that the evaluation numbers for the non-expert case are of "little value", although there might be some noise in the performance figures reported in section 7. I suggest that the authors consider merging the two ground truth datasets into a single ground truth dataset (keeping expert data in case of disagreement), so as to get rid of the double expert vs. non-expert evaluation in section 7 and later in section 10. There are two reasons for that: (i) expert data is scarce (1 to ~a dozen samples for many ILN sizes), so the resulting performance figures (see table 2) and plots (figure 8b) are not very informative; and (ii) the double evaluation makes it harder for readers to assess the merits of the proposed metrics, also because for the ILN sizes where both expert and non-expert data are available, the reported performance figures are not directly comparable (different samples; too few samples for the expert case). If the datasets are merged, it is still useful to comment on the expert vs. non-expert disagreement as an indication of the quality of the ground truth data. If the authors keep both evaluations, I would be fine with that.
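
The min-max re-scaling mentioned under C4 above can be sketched as follows. This is only an illustration of that one suggestion, under the assumption that the weights of all unlabeled edges of a dataset are available as a plain list; the function name `rescale_weights` and the handling of the degenerate case (all weights equal) are hypothetical choices, not part of the paper.

```python
from typing import List


def rescale_weights(weights: List[float]) -> List[float]:
    """Min-max re-scale edge weights to [0, 1] using only the observed range.

    No ground truth is needed: the range is estimated from the (unlabeled)
    edges themselves.
    """
    lo, hi = min(weights), max(weights)
    if hi == lo:
        # Assumption: if all weights are identical, treat them as maximal.
        return [1.0 for _ in weights]
    return [(w - lo) / (hi - lo) for w in weights]


if __name__ == "__main__":
    # Weight ranges as reported for sections 7 and 8 of the paper.
    print(rescale_weights([0.8, 0.9, 1.0]))
    print(rescale_weights([0.1, 0.55, 1.0]))
```

After re-scaling, both example lists span the same [0, 1] range, so a given weight value is at least comparable in relative terms across datasets. Note that this maps the weakest observed edge to 0, which may or may not be desirable for a weighted metric.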

Instead, my comment C3 on the evaluation of weighted metrics being inconclusive remains:

* in the paper, the only reason why weighted metric eq_w is described as "the way to go" w.r.t. unweighted metric eq is that (quoting) "the eq_w metric (in 5 cases out of 7), followed by eq (in 3 cases out of 7), appear to deviate from the baseline far less on average than the remaining approaches" (see section 10). However, the reported counts are incorrect, as are the highlights in bold in table 8: the counts should be 3 cases out of 7 for each of eq, eq_w, and also the weighted metric eq_avg. In other words, based on the authors' own criterion there is no "best" metric among eq, eq_w and eq_avg.

* in the authors' response, the statement that "using the average scores, the weighted metrics perform equal to or better than the unweighted e_Q in all cases" is also incorrect, since for the Clip baseline metric eq performs best. I may concede, however, that if we avoid distinguishing between the three weighted metrics eq_w / eq_avg / eq_min, then the unweighted eq performs best in 3 cases and the weighted eq_w / eq_avg / eq_min perform best in 5 cases. Still, knowing that in most cases (5 vs 3) there is "some" weighted metric among eq_w / eq_avg / eq_min that outperforms eq, without knowing which one, does not help much in practice.

* in any case, the criterion based on column "average" of table 8 is debatable, as it compares small differences without accounting for statistical significance. I would rather compare each pair of metrics using a statistical significance test (e.g., Approximate Randomization, McNemar) to check for a statistically significant difference in good/bad classification performance between the two. This is just a suggestion that the authors may disregard.

* finally, concerning the authors' response, please note that my original criticism of inconclusiveness was based on a wrong computation of average differences in table 8 (while now it is based on corrected figures in table 8); also, my original comment on "few significant figures" in reported numbers in table 8 is justified by the substantial differences between previous and new average values in that table resulting from implementing my suggestion.
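
As an illustration of the pairwise significance testing suggested above, here is a minimal sketch of an exact (binomial) two-sided McNemar test for two metrics that each label the same ILNs as GOOD or BAD. The function name `mcnemar_exact` and the list-based input format are assumptions made for this example; the toy labels are not data from the paper.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(gold: Sequence[str],
                  pred_a: Sequence[str],
                  pred_b: Sequence[str]) -> float:
    """Two-sided exact McNemar p-value for two classifiers on the same items.

    Only discordant items matter: b counts items that A gets right and B gets
    wrong, c counts the reverse. Under the null hypothesis, b ~ Binomial(b + c, 0.5).
    """
    b = sum(1 for g, a, p in zip(gold, pred_a, pred_b) if a == g and p != g)
    c = sum(1 for g, a, p in zip(gold, pred_a, pred_b) if a != g and p == g)
    n = b + c
    if n == 0:
        return 1.0  # the two classifiers never disagree in correctness
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    gold = ["GOOD", "GOOD", "BAD", "GOOD", "BAD"]
    metric_a = ["GOOD", "GOOD", "BAD", "GOOD", "BAD"]   # always right
    metric_b = ["BAD", "BAD", "GOOD", "BAD", "GOOD"]    # always wrong
    print(mcnemar_exact(gold, metric_a, metric_b))
```

Even when one metric is right on all five toy items and the other on none, the p-value stays above 0.05, which shows how little can be concluded from small differences on few samples.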

I will not insist further on C3, also in the light of the other improvements to the paper. I only recommend that the authors fix the incorrect counts and the unsupported claim in the text of section 10, as well as the highlights in table 8.

== Minor comments ==

All my previous minor comments M1-M12 were addressed by authors, either in the paper (M1-M2, M5-M7, M9-M12) or by providing clarifications in their answer (M3-M4, M8). I'm fine with the authors' revisions/clarifications. Just a few notes:

* regarding the "simple clustering algorithm" paragraph (M2), the rephrasing of that paragraph (esp. adding "in order to generate ILNs") now makes its purpose and utility clearer to me and addresses my concerns, so I now think it is fine to keep it. I also appreciate the addition of the paragraph "Complexity" in section 4 and the insights it provides, especially regarding the diameter metric (I don't see problems with the reported complexity, as I expect n and m to be small and I agree with the workaround of adding upper bounds to their values).

* regarding the distribution of ILNs by size (M9), I think the plots justify the focus on ILNs of size 3 (esp. for the geo+names scenario). As a minor suggestion, I think that also including the bin size "2" in figures 5, 7, 8 may help give a more complete picture, even if ILNs of size 2 cannot be evaluated by the proposed eq metrics.

* regarding ILNs of size 2 (M12), I agree that external knowledge may help address these cases (e.g., via reasoning on explicit/inferable owl:sameAs, owl:disjointWith assertions), but I would argue that such knowledge is complementary to the eq metric, rather than being embeddable in the eq metric itself, since it's unclear to me how such knowledge could affect eq scores apart from setting them to 0 or 1.

Reading again the revised paper, I would like to report two additional minor remarks that I missed in my first pass (just for authors' consideration, no response expected):

M13. [section 5.3] "the weaker the strength of a bridge gets, the less it negatively affects the quality of an identity network" -> this statement, which is coherent with equation 7, appears unintuitive (to me). Let's take an ILN G with a single bridge consisting of edge e. According to equation 7, n_b_w(G) increases with weight w(e), and thus eq_w(G) correlates negatively with w(e). However, my intuition is that eq_w(G) should rather correlate positively with w(e), since that edge with its weight is the only evidence available for considering identical the nodes in the two components of G connected by the bridge.

M14. [section 5.3] I wonder whether subtracting 1 from eDiam(G) in equation 8 is actually intended or is rather a typo, since that subtraction is already included in the definition of eDiam(G) and, by repeating it in equation 8, the argument (and result) of the sigmoid may become negative.

== List of typos and other presentation issues ==

All the previously reported issues T1-T22 were fixed. For authors' convenience, here I list some additional typos/suggestions regarding the revised manuscript:

T23. [section 1] "All data of these experiments are available online)" -> remove ")"
T24. [section 2] "the best of all computed strengths" -> "largest"?
T25. [section 2] "and therefore minimizing to algorithm time complexity (O(m) where m is ...)" -> "therefore minimizing the algorithm time complexity (O(m) in the best case, where m is ...)"
T26. [equations 3 and 4] "a, b, \in V" -> remove "," after "b"
T27. [section 6] "about 2700 HE institutions" -> I guess acronym "HE" stands for "higher education", but it is not explicitly defined in the text
T28. [section 7] "with as goal to investigate the coverage" -> "with the goal"
T29. [section 7.2] "We predict a GOOD or BAD score ... resulting in F_1 scores" -> suggest specifying which class the F_1 scores refer to (e.g., "resulting in F_1 scores for the GOOD class")
T30. [section 7.3] "the evaluation by non-experts is not bias" -> "biased"
T31. [section 8.1] "hopping for the geolocation to correct obvious noise" -> "hoping"?
T32. [section 8.2] "c3 = {}" -> ""
T33. [figures 7 and 8] I suggest keeping the two figures close in the paper (e.g., via subfig) to ease comparison. Labels "before" and "after" are not immediately clear and might be omitted (the two settings "geo-only" and "geo+names" are already clearly indicated in figures titles)
T34. [section 8.3] "is of course privy to.." -> remove one "."
T35. [section 9.1] "Now, the penalty for having one bridge is fix ..." -> "fixed"?
T36. [table 8] The best average F_1 difference for the "Clip baseline", highlighted in bold, is 0.03856 (metric "eq") and not 0.03881 (metric "eq_w")
T37. [section 11.2] "external knowledge can be used for ... could also be used to ..." -> "external knowledge that can be used for ..."