Review Comment:
The paper proposes an Estimated Quality metric "eq" for the automated prediction of the quality of an Identity Link Network (ILN) connecting 3 or more entities with identity links generated by some Entity Resolution (ER) system. Four variants of the metric are proposed: "eq" for non-weighted ILNs and "eq_min", "eq_avg", "eq_w" for weighted ILNs. All variants take values on a [0, 1] scale, discretizable to good (all links correct) / bad (some link wrong) / undecided ILN labels. The variants are all defined in terms of a weighted average of three network-based metrics (bridge, diameter, closure metrics), which assess, in different & complementary ways, how much an ILN is similar to a fully connected network. Three empirical evaluations of the metric in its four variants are conducted on the ILNs obtained by: (i) linking research institutions in 6 datasets via simple name matching methods; (ii) combining proximity and name matching methods in 3 of the 6 datasets; and (iii) reusing data from [16]. Evaluations (i) and (ii) compare the good/bad output of "eq" variants against a majority class baseline, based on ground truth human assessment of ILNs' correctness. Evaluation (iii) compares the F1 scores for 6 entity resolution systems computed manually (in [16]) and automatically via the good/bad outputs of "eq".
The paper extends an EKAW 2018 publication [1] by the same authors. By dealing with ER, a relevant Semantic Web topic, and investigating novel techniques for its evaluation, the paper falls within the scope of the Journal and meets the criteria required for a full paper submission. Accordingly, this review focuses on the dimensions of originality, significance of results, and quality of writing.
== Originality ==
Up to section 8, the paper is basically the same as [1], with only a few very minor additions: clustering Algorithm 1 (not very useful, actually), the confusion matrices of tables 5 and 6 (useful), and the plot of F1 deviations in figure 7 (useful). The novel contributions mainly reside in sections 9 and 10, where the metric extensions "eq_min", "eq_avg", "eq_w" for weighted ILNs are proposed and then evaluated in the same settings as metric "eq" in [1]. While "eq_min" and "eq_avg" are trivial extensions, "eq_w" is more interesting; however, the definitions of these metrics appear in some cases rather arbitrary, they handle weights in a debatable way (see comments later), and their evaluation results are inconclusive, showing no appreciable benefit in using weighted metrics in place of the simpler "eq".
As a result, I believe that the submission in its current state does not appreciably advance the state of the art w.r.t. what was previously done in [1], and I'm not sure the novel contributions here qualify as a sufficient extension for acceptance as a full paper in this Journal. However, I believe these shortcomings can be addressed in a revision of the paper, at least through further analysis of the proposed metrics and/or by providing further details and discussion of aspects previously not covered (see review comments), which would shed further light on the behavior of the metric originally proposed in [1].
== Significance of results ==
In the reported experiments, the proposed metrics (at least, eq and eq_w) correlate well with judgments by humans, demonstrating the potential for applying them to quickly assess the quality of ER identity links. However, these are largely results already shown in [1], and overall I have the following major concerns regarding the design and evaluation of the metrics that negatively affect the significance of the presented work (C1, C2, C5 apply also to [1] and are not addressed here; C3, C4 are specific to the novel contributions of this paper):
C1. Unclear hyper-parameter estimation and generalization. The defined metrics depend on a few hyper-parameters (the bridge metric parameter 1.6 and the thresholds 0.75 and 0.90, all introduced in section 4) that the authors claim to have empirically determined, without providing further details (both here and in [1]). These parameters appear to be crucial for the accuracy of the metrics (in particular the thresholds), so I would like the authors to detail their estimation in this paper, also to make clear that they were not estimated on the same datasets used in the evaluation (which would amount to overfitting). Moreover, testing the impact of the hyper-parameters on metric performance and their generalization to multiple datasets (only two datasets are used in the paper) is something I would like to see covered in this paper and not left as future work (see section 11.2), as this aspect is strictly tied to the robustness and practical usability of the proposed metrics.
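To make concrete the kind of estimation report I have in mind, here is a minimal, fully hypothetical sketch (all scores, labels, and grid values are made up, and this is my own toy code, not the authors' procedure): tune the two thresholds by grid search on a held-out development set of annotated ILNs, disjoint from the evaluation data.

```python
from itertools import product

# Hypothetical development set: each ILN is represented by its eq score
# and a human good/bad annotation (all numbers are invented).
dev_set = [(0.95, "good"), (0.88, "good"), (0.92, "bad"),
           (0.60, "bad"), (0.72, "bad"), (0.97, "good")]

def label(score, t_low, t_high):
    """Discretize an eq score with the paper's two-threshold scheme."""
    if score >= t_high:
        return "good"
    if score < t_low:
        return "bad"
    return "undecided"

def accuracy(t_low, t_high):
    """Fraction of dev ILNs whose discretized label matches the annotation."""
    hits = sum(label(s, t_low, t_high) == gold for s, gold in dev_set)
    return hits / len(dev_set)

# Grid-search the two thresholds on the development set only.
best = max(product([0.70, 0.75, 0.80], [0.85, 0.90, 0.95]),
           key=lambda p: accuracy(*p))
```

Reporting the grid, the development data, and the resulting accuracy surface would make both the chosen values and their sensitivity transparent.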
C2. Low metric performance on "imbalanced" data. The evaluation results show that the metrics perform poorly when applied to ILNs produced by ER systems tuned for precision (e.g., via higher similarity thresholds), whose links are fewer but more accurate, to the point of being outperformed by a majority class approach. See, specifically: (i) the expert evaluation for sizes 3, 4, 5 in table 2; (ii) the negative predictive value of 0.238 (precision of the negative class) in table 6, which suggests that any "bad" label coming from the metric in this setting is most likely wrong; (iii) the geo+names evaluation in table 7; (iv) the increasing ranking errors for larger thresholds in figure 7. I understand these are "imbalanced" settings, but in my opinion this kind of setting is also desirable and frequent, as it will occur any time the metric is applied to high quality identity links, for which the poor accuracy of the metric will limit its practical utility (i.e., the better the links, the less useful the metrics). This "imbalanced" setting is precisely the setting where using weight information may help, as it provides the metrics with an indication that the links forming an ILN are more accurate, and this increased accuracy may balance the negative evidence coming from "bad" network metrics due to missing links (which are likely to increase in number in a setting tuned for precision).
C3. Inconclusive evaluation of weighted metrics. The weighted metrics are the novel contribution of this paper, but their evaluation (section 10) fails to show any concrete benefit in using them w.r.t. the eq metric. The authors suggest that the analysis in table 8 "shyly helps breaking the tie between the two metrics" (eq vs weighted eq variants). This analysis is based on computing the average of the 4 differences between F1 scores coming from eq metrics and the corresponding F1 scores coming from human judgments, over 4 threshold settings. There are two problems here that make the analysis inaccurate: (i) the authors average **signed** differences, so a large positive error may cancel out a correspondingly large negative error, whereas it would have been appropriate to consider unsigned (absolute) differences; and (ii) too few significant figures are considered, so that a difference between 0.00325 and 0.0035 is expanded to the difference between 0.003 and 0.004 after rounding (table 8, eq_w vs eq for MCENTER).
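To illustrate point (i) with made-up numbers: averaging signed differences can report a zero error even when every individual deviation is large.

```python
# Hypothetical per-threshold F1 deviations (eq-based F1 minus human-based F1).
diffs = [0.05, -0.05, 0.04, -0.04]

mean_signed = sum(diffs) / len(diffs)               # cancellation hides the error
mean_abs = sum(abs(d) for d in diffs) / len(diffs)  # true average deviation

print(mean_signed)  # -> 0.0, misleadingly perfect
print(mean_abs)     # ~0.045, the actual typical deviation
```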
C4. Reliance on non-normalized, non-comparable weights. The weights used to derive eq_min, eq_avg, eq_w (see section 9) are not normalized, so neither these weights nor the resulting weighted metrics are comparable across different usage scenarios. Concrete example: let's assume we have an edge e1 in the "geo" setting and an edge e2 in the "geo+names" setting that both get the same weight w1 = w2 = 1. In the "geo" setting, the weight w1 = 1 reflects evidence coming only from the comparison of geographical information, whereas in the "geo+names" setting the weight w2 = 1 would be backed by stronger evidence that also includes perfect name similarity, evidence that is not reflected in the weight (w.r.t. the "geo" case). Unless there is some hyper-parameter tuning (not the case here), the metric "eq" has no way to treat "geo+names" weights differently from "geo" weights, and the good/bad labels resulting from its application would likely tend to treat the "geo+names" ILNs as having lower quality than the corresponding ILNs in the "geo" setting. This is a major issue that affects what is the novel contribution of the paper, and it might have led to the inconclusive evaluation results of section 10. I strongly suggest the authors address this issue. For instance, they may try to normalize the weights so that they assume a precise meaning, e.g., that of "calibrated probabilities" of link correctness. This can be achieved based on some ground truth good/bad link annotations, using, e.g., the Platt method or a similar one (see, e.g., https://scikit-learn.org/stable/modules/calibration.html for concrete solutions).
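As a sketch of the calibration idea (a pure-Python stand-in for the scikit-learn utilities linked above; all weights and labels below are invented): fit, separately per setting, a sigmoid mapping raw link weights to probabilities of link correctness, so that a weight of, say, 0.7 means the same thing in "geo" and "geo+names".

```python
import math

# Hypothetical sample from one setting: raw link weights with ground-truth
# labels (1 = link correct, 0 = link wrong) from human annotation.
weights = [0.2, 0.4, 0.5, 0.7, 0.8, 0.9, 1.0, 1.0]
labels  = [0,   0,   1,   0,   1,   1,   1,   1]

def platt_fit(ws, ys, lr=0.5, steps=5000):
    """Fit sigmoid(a*w + b) to the labels by gradient descent on the
    logistic loss (a minimal version of Platt scaling)."""
    a, b = 1.0, 0.0
    n = len(ws)
    for _ in range(steps):
        ga = gb = 0.0
        for w, y in zip(ws, ys):
            p = 1.0 / (1.0 + math.exp(-(a * w + b)))
            ga += (p - y) * w / n
            gb += (p - y) / n
        a -= lr * ga
        b -= lr * gb
    return a, b

a, b = platt_fit(weights, labels)
calibrated = [1.0 / (1.0 + math.exp(-(a * w + b))) for w in weights]
```

One calibrator per setting ("geo", "geo+names", ...) would put all weights on a common probabilistic scale before feeding them to the weighted metrics.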
C5. Use of disagreeing expert vs non-expert ground truth data. I don't understand the utility of evaluating the approach with both "low" and "high" quality ground truths, respectively from non-expert and expert annotators (see sections 6.3, 8). I see two explanations for the expert vs non-expert differences here: (i) the annotation task is inherently difficult (e.g., one entity does not correspond exactly to another - e.g., it represents a branch of a bigger organization - so annotators may disagree), a case that deserves further investigation with an assessment of inter-annotator agreement based on precise annotation guidelines; or (ii) reliable human annotation is feasible and the differences are to be ascribed only to errors by the non-expert annotator, which means that the evaluation numbers reported for the non-expert case are of little value, and a merged ground truth dataset (expert annotations + checked/revised non-expert annotations) should be used instead.
Of the above major comments, I think all of them except C2 can be (at least partially) addressed by the authors in a relatively short time, and that's the main reason backing my major revision recommendation. Besides, I also report below some minor comments related to specific passages of the paper, which the authors may find useful and/or may consider in improving the paper.
== Quality of writing ==
The paper is overall well written, with the intuitions behind the proposed metrics nicely presented. The authors decided to first introduce and evaluate the base "eq" metric for unweighted graphs (basically, the contribution of the EKAW paper) and later introduce and evaluate its weighted extensions (the main new contribution), and I find the resulting paper structure acceptable. There are some typos, and the definitions of the weighted metrics can be improved, as can a few figures, but the required changes are limited: I list all of these issues below, for the authors' convenience.
== Minor comments ==
M1. [section 1] I suggest clearly listing in the text the additional contributions w.r.t. prior work [1] by the same authors.
M2. [section 2] I agree with the authors that the aim is not clustering (nor proposing another method for entity resolution), and honestly I don't feel the need for the paragraph "Simple Clustering Algorithm", which reports the very obvious clustering algorithm used by the authors to detect ILNs in a weighted graph. If the authors decide to keep it, then: (i) please check the addition of multiple strengths to the same edge (~ lines 37-38), as I don't see it as necessary and, if it is, then it means metric "eq" can work with multigraphs, which should be emphasized and motivated in the paper; and (ii) please note that the worst case complexity is not O(m) but likely O(m log(m)), as merging clusters in line "C_b.add(C_s.items())" is not O(1). That said, I don't think a complexity analysis of Algorithm 1 is of much interest, and, if complexity is considered, then the complexity of evaluating the "eq" metrics themselves should also be discussed.
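On point (ii), a minimal illustration of where the extra cost comes from (my own toy code, not the authors' Algorithm 1): even with the standard smaller-into-larger heuristic, each item may be moved O(log n) times across merges, giving O(m log m) overall rather than O(m).

```python
def merge(item_to_cluster, a, b):
    """Merge the clusters containing a and b, moving the smaller into
    the larger: the move costs O(|smaller cluster|), not O(1)."""
    ca, cb = item_to_cluster[a], item_to_cluster[b]
    if ca is cb:
        return
    if len(ca) < len(cb):  # smaller-into-larger: each item moves <= log n times
        ca, cb = cb, ca
    for item in cb:
        item_to_cluster[item] = ca
    ca.extend(cb)

# Tiny example: four singleton clusters merged pairwise, then together.
item_to_cluster = {x: [x] for x in "abcd"}
merge(item_to_cluster, "a", "b")
merge(item_to_cluster, "c", "d")
merge(item_to_cluster, "a", "c")
```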
M3. [section 3] References [14, 15] refer to Coreference Resolution and Entity Linking (EL), two NLP tasks dealing with entity mentions in text. It's OK to mention them, but in that case I would explicitly name the tasks they address. Also, these tasks can be seen as building identity graphs whose nodes are entity mentions and (for EL only) KB entities, so I don't see them as incompatible scenarios in which to use the proposed metrics, although I can understand if those scenarios are out of scope for this work.
M4. [section 4] Based on how it is defined, closure metric n_c is strictly < 1, whereas the bridge metric and diameter metric can reach 1. As a consequence, the eq metric may never reach 0. This is not a problem, of course, but the authors might want to consider slightly revising the definition so as to guarantee that eq covers the whole [0, 1] range.
M5. [figure 3, caption] I would move the statement "to evaluate eq, all possible links are evaluated" into section 4 to make it more apparent to readers, as this is a requirement for the proper application of the metric (due to how the bridge, diameter and closure metrics are normalized).
M6. [section 6.1] While the authors assume here that datasets may contain duplicates, please note that in multi-dataset ER the opposite is often assumed (and leveraged) - see, e.g., [A]. I don't see problems in applying "eq" in those settings, however, as the knowledge that a dataset is duplicate-free or, more generally, that two entities are distinct, can be used before applying eq to immediately mark as "bad" any ILN containing a link between those necessarily distinct entities.
M7. [section 6.2, Table 2] How are "Positive" and "Negative" ground truth samples defined? I infer "positive = all ILN links are correct" and "negative = some ILN link is wrong", but I suggest the paper explicitly specify that.
M8. [section 6.2] Why were ILNs of size < 5 not considered in the non-expert evaluation? I understand there are a lot of them and considering all of them is infeasible, but if I had to sample ILNs, I would try to get a representative sample for each size, to better investigate dependencies on size and avoid possible biases. Besides, in the expert evaluation (section 6.3), all sizes >= 3 were used.
M9. [section 7.3] I would find it interesting to see the distribution of ILNs by size also in this case, similarly to what is reported in Figure 4. In particular, I'm curious whether the non-considered ILNs are mainly of size 2 (and thus out of scope for the proposed metric) or whether there is a relevant number of ILNs of size > 3, a sample of which could have been assessed to avoid possible evaluation biases (see comment C7).
M10. [section 8, figures 6, 7] I like the outcomes of this evaluation, as it shows that metric "eq" can be used to get an approximate indication of the performance (F1) of an ER system - at least when the system is not tuned for maximum precision (i.e., a high threshold). What I find confusing is the talk of "ranking test", "ranking algorithms" and "ranking error". To me, "ranking" here would mean establishing an order of algorithms, from best to worst performing (for a certain threshold). Based on that, a "ranking error" occurs if the ranking induced by applying metric "eq" differs from the ranking computed using human annotations, and to quantify that error I would use some rank correlation measure, e.g., Kendall's Tau. Instead, what seems to be evaluated as "ranking deviation" in Figure 7 is the difference between the F1 scores computed via "eq" and via human annotations. Small F1 differences are good, but they don't imply that a similar algorithm ranking would be obtained by using "eq" (as claimed in section 11.1). Also, the text talks of the "potential to rank clustering algorithms whenever they show \emph{significant performance differences}". Should I take "significant" as "statistically significant"? In that case, I don't see how the claim is supported by Figures 6 and 7. If that is the intended meaning, perhaps the authors may check for statistically significant differences in F1 (e.g., using the approximate randomization test [B]) both when using human assessment and when using "eq", and check whether the same differences ("algorithm X significantly different from algorithm Y") are detected.
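To make the distinction concrete, this is how I would quantify a ranking error with Kendall's Tau (the six systems and all F1 numbers below are invented for illustration): small F1 deviations can still hide rank swaps.

```python
from itertools import combinations

# Hypothetical F1 scores for six ER systems at one threshold, computed once
# from human annotations and once from the good/bad labels of "eq".
f1_human = {"A": 0.91, "B": 0.88, "C": 0.75, "D": 0.74, "E": 0.60, "F": 0.52}
f1_eq    = {"A": 0.89, "B": 0.90, "C": 0.74, "D": 0.76, "E": 0.61, "F": 0.50}

def kendall_tau(x, y):
    """Kendall rank correlation: (concordant - discordant) / total pairs."""
    pairs = list(combinations(range(len(x)), 2))
    c = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) > 0)
    d = sum(1 for i, j in pairs if (x[i] - x[j]) * (y[i] - y[j]) < 0)
    return (c - d) / len(pairs)

systems = sorted(f1_human)
tau = kendall_tau([f1_human[s] for s in systems], [f1_eq[s] for s in systems])
# Two swapped pairs (A/B and C/D) despite F1 deviations of at most 0.02:
# tau = 11/15, i.e. about 0.733, not a perfect ranking agreement.
```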
M11. [section 10] I suggest providing some quantitative measures of the differences in performance of different weighted metrics.
M12. [section 11.2] I don't see how to apply the intuitions behind the eq metrics to networks of size 2. These networks consist of exactly one edge, for which there is little to compute in terms of network metrics.
== List of typos and other presentation issues ==
T1. [section 1] "the proposed metrics indeed reliably estimates" -> either "metric" or "estimate"
T2. [section 1] "our contributions is a method" -> "contribution"
T3. [section 2] "Fig. ??" -> "Fig. 1".
T4. [section 2] "they belongs to different clusters" -> "belong"
T5. [section 4] "For example, n_c and n'_c treat a Tree, Star..." -> "and n_b"
T6. [section 5] "OpenAire: 2018.08.16" -> I suspect the date is wrong, as it is in the future w.r.t. Jan 2018
T7. [section 6] check special characters (tm, (c)) in footnote 13.
T8. [section 6.2] "Negative Predicted Value (NPC)" -> "NPV"
T9. [section 6.3] "to our results. \footnote{...}" -> drop space between "." and "\footnote{...}"
T10. [section 7.2] "c1 = {{a_1}, {b_3}}..." -> why nested sets? This suggests that something more than a subset of nodes within a graph is needed to identify an ILN, and I think the distinction between datasets is apparent also using a regular "flat" set.
T11. [figure 5] is there a meaning to the line dashing used for different edges?
T12. [table 5] "IDLINEs" -> "ILNs"
T13. [section 8] "between the baseline and the four eq metrics" -> up to this point in the paper, there is only one eq metric
T14. [section 8] "and display it in Figure 7" -> "displayed"
T15. [section 8] "Figure 7 shows a deviation of +-0.97" -> looking at table 8, I think the correct number here is 0.096
T16. [section 9] "e_i = (v_{i-1}, v_i} \in L where v_i \in V for in \in [1,k]" -> what is k? why not just say that the two vertices are \in V? what is v_{i-1}? (I later understood it is an arbitrary vertex in a path, but here it is unclear)
T17. [section 9] definition 2 of "dist(a,b)" uses a strange notation; I would write "dist(a,b) = min_{\pi \in \Pi(a,b)} |\pi|", similarly to how "dist_w(a,b)" has been defined
T18. [section 9] I would also revise the notation in definitions 3, 4 of "diam(G)" and "diam_w(G)", e.g., "diam(G) = max_{a,b \in V} dist(a,b)"
T19. [section 9.2] in definition 6 of "eq_avg", replace "we" with "w"
T20. [section 9.3] "\frac{2.2}{2} = 1" -> I get the point that the branch for "eDiam(G) > n - 2" is applied, but written like this it looks weird
T21. [figure 8] the use of overlapping boxes makes the figure very difficult to read
T22. [section 11.1] "it estimates the quality of links" -> "of ILNs"
T23. [table 8] what's the meaning of bold vs. underlined average values in the table?
== References ==
[A] M. Nentwig, E. Rahm. Incremental Clustering on Linked Data. ICDM Workshops. 2018.
[B] E. W. Noreen. Computer Intensive Methods for Testing Hypothesis. John Wiley & Sons. 1989.