Remixing Entity Linking Evaluation Datasets for Focused Benchmarking

Tracking #: 1583-2795

Jörg Waitelonis
Henrik Jürges
Harald Sack

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper
In recent years, named entity linking (NEL) tools were primarily developed as general-purpose approaches, whereas today numerous tools focus on specific domains, such as the mapping of persons and organizations only, or the annotation of locations or events in microposts. However, the available benchmark datasets necessary for the evaluation of NEL tools do not reflect this trend toward specialization. We have analyzed the evaluation process applied in the NEL benchmarking framework GERBIL [17] and all its benchmark datasets. Based on these insights, we have extended the GERBIL framework to enable a more fine-grained evaluation and in-depth analysis of the available benchmark datasets with respect to different emphases. This paper presents the implementation of an adaptive filter for arbitrary entities and customized benchmark creation, as well as the automated determination of typical NEL benchmark dataset properties, such as the extent of content-related ambiguity and diversity. These properties are integrated on different levels, which also makes it possible to tailor customized new datasets out of the existing ones by remixing documents based on desired emphases. The implemented system, as well as an adapted result visualization, has been integrated into the publicly available GERBIL framework. In addition, a new system library to enrich provided NIF [3] datasets with statistical information, as well as best practices for dataset remixing, are presented.

Major Revision

Solicited Reviews:
Review #1
By Michelle Cheatham submitted on 06/Apr/2017
Review Comment:

The paper describes an extension to the NEL benchmarking framework GERBIL and an accompanying standalone library that allows researchers to configure new benchmarks, based on existing ones, that have particular features such as a focus on one type of entity, varying levels of ambiguity, prominence of the entities, etc. in such a way that the datasets can be reproduced by other researchers. The paper is based on a previous publication; a modest amount of new content has been added for this version.

While the originality of the work is fairly low, this is offset by its utility to the field. I believe that many NEL researchers would be interested in this work.

Section 1 clearly and convincingly lays out the motivation for the work. Section 2 describes the different metrics related to the benchmark datasets that the system is capable of computing. This section largely rehashes the existing work of others and could possibly be shortened. Regarding the annotation density metric, the authors do warn that this metric only estimates the number of missing annotations. While they present some support for this in Section 4 by showing the correlation between this metric and precision, it would be better to support this using human judgment on a sample dataset. Regarding the prominence metric, it is not clear to me why the authors choose cutoff points for high, medium, and low prominence rather than allowing the user to specify these. Section 3 describes the implementation. The discussion of the visualizations at the end of subsection 3.1 would greatly benefit from one or more figures. Section 4 presents a very interesting analysis that is likely to be of interest to researchers in this field. The discussion of Figure 4 in Section 4.1 seems to have a small mistake – it seems that there are five datasets (not six) that contain empty documents, and three of them (not four) show a significant number of empty documents. In the concluding section, I think the second paragraph, which describes prominence values of different datasets in great detail, can be omitted because the preceding paragraph convincingly argues that those do not impact performance.

The paper is on the whole very well organized and easy to read. There are a few minor issues:

The word ‘the’ is missing in many sentences.

The word ‘amount’ is frequently used when ‘number’ is meant (e.g. amount of a collective substance, like water, but number of a countable set of things, like surface forms)

In section 3.1 “to be be applied” => “to be applied”

In the comment in figure 13 “with more then three” => “with more than three”

Review #2
By Ziqi Zhang submitted on 20/Apr/2017
Major Revision
Review Comment:

This paper describes an extension of the GERBIL framework with the aim of evaluating the quality of entity linking datasets and ultimately of automatically creating (by remixing) balanced datasets. This could be a valuable contribution to the community because, on the one hand, entity linking is a very important task for the Semantic Web and the problems with existing datasets discussed in the paper are valid; on the other hand, creating high quality datasets will enable a balanced and thorough evaluation of newly developed methods. Unfortunately, the quality of the paper is rather unsatisfactory for acceptance for three reasons: 1) the research problem described in this paper has been largely addressed already in the authors’ earlier work [20], as there is very little added value in terms of their methods. The major development is a tool and for this reason, I do not think that the originality or the significance of results is good enough for a ‘full research’ paper; 2) the way the paper is currently written has too many issues, primarily in the definition of the measures and the mathematical formulations, which are confusing and difficult to follow; 3) while the authors argue that the proposed measures have been implemented in a tool that can be used to create better quality datasets, there are no experiments to support this. To address this, a significant number of experiments should be undertaken to compare, for example, a number of state-of-the-art systems on both the sets of original datasets and the sets of remixed datasets, to demonstrate the issues with the original datasets and that the proposed measures do indeed address the dataset quality.
Addressing these issues will require a significant amount of work, which, in my view, might be very challenging. However, considering the importance of this topic and the potential impact once these issues are addressed, I think the authors should be given another chance to consider a major revision. Detailed comments below.

1. Originality and significance of results
On page 2, with respect to the novelty of this work, the authors state that ‘… the work in [20] is brought up-to-date and consolidated. ... extended with new additional dataset measures, a standalone library …. as well as a vocabulary to enrich ….’. However, it is not at all clear how much this work contributes to the *research* problem on top of the previous work. The library and the vocabulary are certainly interesting and useful, but as this paper is submitted as a full research paper, it needs to be evaluated against novelty in terms of the methodologies that address research problems, i.e., how to measure the quality of a dataset.
For this it is not clear what ‘new additional dataset measures’ are. And by comparing with [20] it appears that the novelty is rather limited, as the main measures (not annotated documents, density, prominence, confusion, dominance) have all been introduced before. Without experiments it is also not possible to evaluate how the three newly added measures contribute to address the research problem, and therefore, how ‘significant’ this novelty really is.
To improve this, the authors should clearly identify the improvement brought by this work in the introduction, and back that up with empirical evidence (see point 3).

2. Quality of writing
The paper has some major problems with its quality of writing. Terminology and mathematical notation are used very inconsistently. For example, I cannot understand how exactly you compute confusion and dominance. The authors should substantially revise their mathematical notation and have it double-checked to ensure it is consistent and makes sense.

First of all, at the beginning of section 2, define all the terms you will use in the following sections. What is a document and its notation? What is an entity and its notation? What is a surface of an entity (notation)? What are the surfaces for the entire dataset? What is a dictionary, is this the same for all different datasets? What are all the entities in the dictionary and all the surfaces in the dictionary and how to denote them mathematically?
Next, are your measures applied to dataset, document, dictionary, entity, or surface? You should have something similar to table 1 to clarify this.
I will now list confusing notations below on a page-by-page basis.

Page 3 section 2.1: \mathcal{D} is a dataset and t is a document. But then what is D in equation 1? (note that \mathcal{D} is different from D).
Page 3 section 2.2: how is len(t) calculated? You should define it here, not later in page 8 section 4
Page 4 left column bottom of the page: ‘the dictionary know to the dataset containing the document is \mathcal{D}’. But just before you said \mathcal{D} is a dataset, and now it is a dictionary?
Page 4 right column top: ‘the overall set of all possible surface form is \mathcal{V}_e’, so by definition \mathcal{V} denotes a set of *surface forms*. But just the paragraph before you said ‘the overall set of all possible *entities* for a surface form is \mathcal{V}_{sf}’, where \mathcal{V} is a set of *entities*.
Page 4 – I also cannot understand the relations between the dictionary known to annotation, the dictionary known to the dataset containing the document (which document? Is dictionary different depending on document?), and the overall set of all possible entities for a surface form. Again, defining all these terms upfront and use examples can help clarify.
Page 4 equations 4 and 5 do not make sense. E.g., in eq. 4, you are summing the number of e, given s \in W. First of all, W is never defined and I suspect you mean \mathcal{W}. Second, the condition has nothing to do with e, so it does not make sense to have e conditioned on s \in W.
Page 5 left column last paragraph: ‘… the amount of surface forms used for one specific entity in the dataset e(D)...’ this time e(D) is the number of surface forms for an entity, but before, on page 4 eq. 1, you said e(D) is the number of not annotated documents. Also, maybe here you mean \mathcal{D}, not D. But again, because of the significant level of inconsistency, I cannot be certain.
Page 5 just before eq. 6. “the average dominance for an entire dataset is computed over all entities e \in D”, OK but what is e \in E in eq 6 and 7? What is E?
Page 5 the example about ‘angelina_jolie’ does not make sense given eq. 6.
Page 6 first sentence ‘… for a dictionary W and an entire dataset ...recall is defined as … where max recall:’ is not a parsable sentence.
Page 12 figure 10: the right axis is missing.

3. Empirical results
The paper lacks empirical evaluation to back up their claims.
First, on page 3, section 2.3 you say ‘… a power law distribution of the pagerank values over all entities is assumed...’ I doubt this is the case. Did you check this using real datasets? If the values do not follow such a distribution, this measure would not make much sense.
Second, empirically, how do your proposed measures help to remix new datasets of better quality? You have mentioned in several places the ‘correlation with precision and recall’, but it is not clear at all which P and R we are talking about: P and R of what system(s)? On what datasets? To be convincing, I think you need to run a number of state-of-the-art systems on the set of original datasets, evaluate their P, R, F1; then evaluate them again on a set of remixed datasets (remixed based on some rationale, which you need to define and justify) and compare the P, R, F1. If there are significant changes in their performance it could mean that 1) the methods are sensitive to particular characteristics of some datasets, e.g., as you said, some datasets may be too easy; 2) by remixing you changed the nature of the dataset and that makes the dataset better quality (but you must justify why) and the task harder. In summary, you need to carefully design your experiments, identify datasets that appear to be imbalanced according to your proposed measures, remix these datasets, and run experiments to observe and analyse the difference (if any).
In your conclusion, you say ‘according to our evaluation, the best suited datasets for … are ...’. What do you mean by ‘suited’? Again, if you do the experiments suggested above, it is more convincing.
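The before/after comparison requested above could be scored along the following lines. This is only a minimal sketch: it assumes micro-averaged scoring over exact matches of (document, start, end, URI) tuples, and the function name `micro_prf` and the toy annotations are hypothetical, not taken from the paper or from GERBIL.

```python
def micro_prf(gold, predicted):
    """Micro-averaged precision, recall and F1 over sets of annotations.

    Each annotation is a hashable tuple, here (doc_id, start, end, uri).
    Only exact matches count as true positives (an assumption).
    """
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


# Hypothetical toy data: three gold annotations, four system annotations.
gold = {
    ("d1", 0, 5, "dbr:Paris"),
    ("d1", 10, 16, "dbr:France"),
    ("d2", 3, 9, "dbr:Berlin"),
}
predicted = {
    ("d1", 0, 5, "dbr:Paris"),
    ("d1", 10, 16, "dbr:French_language"),  # wrong link
    ("d2", 3, 9, "dbr:Berlin"),
    ("d2", 20, 25, "dbr:Spree"),            # spurious annotation
}

p, r, f1 = micro_prf(gold, predicted)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Running the same function once per system on an original dataset and once on its remixed counterpart would give the pairs of P, R, F1 values that the comparison calls for.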

Review #3
By Heiko Paulheim submitted on 27/Aug/2017
Major Revision
Review Comment:

The authors introduce an extension to GERBIL, which allows for (a) selecting subsets of existing entity linking benchmark datasets using various criteria, and (b) creating new synthetic datasets of existing ones by combining subsets out of those.

While I acknowledge that there is substantial difference between entity linking benchmarks, it is puzzling that the results presented in this paper actually show that they seem not to have an overly strong impact on the measured performance of entity linking tools, which somewhat questions the original motivation of the paper (i.e., since NEL tools are specialized towards certain cases, we need specific benchmarks).

Starting from table 3, we can observe that KEA is the best tool (except for places), and FOX is the worst in all cases. Since I was curious, I made some more calculations based on the table, looking at the Pearson correlation and the Spearman rank correlation of the tools' individual results with the unfiltered results (see attachment). From those numbers, it can actually be observed that PageRank and HITS subsets do not have any significant impact on the comparison of the tools' relative performance (despite the differences in absolute results), and also the subsets by types only have a limited impact - which is quite the opposite of what the authors state as their motivation.
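For readers who want to repeat this kind of check, the sketch below computes both correlations for one filtered row against the unfiltered row, using the values from the attached table. It is an illustration under assumptions, not the reviewer's own procedure: the helper names are hypothetical, and ties are given average ranks before Pearson is applied to the ranks.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Rows "No Filter" and "Persons" as listed in the attached table.
no_filter = [0.53, 0.56, 0.39, 0.33, 0.32, 0.63, 0.59, 0.58, 0.52]
persons = [0.81, 0.69, 0.53, 0.57, 0.44, 0.84, 0.77, 0.80, 0.74]

print(round(pearson(no_filter, persons), 4))
print(round(spearman(no_filter, persons), 4))
```

The Pearson value obtained this way for the Persons row agrees with the attached table to three decimals. The rank correlation depends on the ranking and tie-handling procedure, which the review does not state, so that figure need not coincide with the table.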

Some more notes on table 3: it would be interesting to also see a breakdown for ambiguity and other metrics. Unspecified entities should be added as a line in table 3 as well. Using a breakdown by more metrics could also reveal some differences between the tools (i.e., what the authors are actually hunting for) -- e.g., some tools that work better for disambiguating more ambiguous entities and/or reveal less bias towards more prominent entities.

Furthermore, it is puzzling why the performance on all subsets is better than on the unfiltered results. Especially for the PageRank and HITS subsets, which form a non-overlapping partitioning of the original datasets, it is not clear why the results for all partitions are better than the unfiltered results. My naive expectation would be that the unfiltered results (which is the union of the three subsets) should yield results somewhere in the middle of the results achieved on the three subsets. The authors should explain why this is not the case.
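The expectation stated above can be made precise under two assumptions that the review does not spell out: the scores are micro-averaged, and every counted annotation falls into exactly one of the partitions $k$. With $TP_k$, $FP_k$, $FN_k$ denoting the counts in partition $k$, the pooled scores are ratios of sums and therefore weighted means of the per-partition scores:

```latex
% Assumption: micro-averaging, and the unfiltered counts are the sums of the partition counts.
\[
P \;=\; \frac{\sum_k TP_k}{\sum_k (TP_k + FP_k)}
  \;=\; \sum_k w_k \, P_k ,
\qquad
w_k = \frac{TP_k + FP_k}{\sum_j (TP_j + FP_j)} ,
\quad
P_k = \frac{TP_k}{TP_k + FP_k}
\]
\[
\min_k P_k \;\le\; P \;\le\; \max_k P_k
\]
% The same argument applies to F1, written as a ratio of counts.
\[
F_1 \;=\; \frac{\sum_k 2\,TP_k}{\sum_k (2\,TP_k + FP_k + FN_k)}
\quad\Longrightarrow\quad
\min_k F_{1,k} \;\le\; F_1 \;\le\; \max_k F_{1,k}
\]
```

If the unfiltered counts are not the sum of the partition counts, or if the scores are macro-averaged over documents, this bound need not hold.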

For the breakdown by PageRank and HITS, the authors partition the dataset given the ranking of the entities in the dataset. This means that for a dataset with a stronger bias towards head entities, entities that are in the middle or low segment would be in the high segment for a dataset with a more even distribution. In my opinion, this limits the utility of the metric. I would prefer a global partitioning by those metrics -- i.e., determining the cutoffs for 10% and 55% globally for the target KB at hand, and using the same interval delimiters for all datasets.
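A global partitioning of this kind could look like the sketch below. It assumes that "10%" and "55%" refer to shares of the KB entity list ranked by descending score; the function names, the bucket labels, and the toy scores are hypothetical. Integer percentages are used to avoid floating-point surprises when computing cutoff positions.

```python
def global_cutoffs(kb_scores, high_pct=10, mid_pct=55):
    """Score thresholds for the top high_pct% and top mid_pct% of the whole KB.

    kb_scores maps every KB entity to its PageRank (or HITS) score.
    """
    ranked = sorted(kb_scores.values(), reverse=True)
    n = len(ranked)

    def threshold(pct):
        # Ceiling of n * pct / 100 in integer arithmetic, at least 1 entity.
        count = max(1, -(-n * pct // 100))
        return ranked[count - 1]

    return threshold(high_pct), threshold(mid_pct)


def bucket(entity, kb_scores, cutoffs):
    """Assign one entity to 'high', 'medium' or 'low' using the global cutoffs."""
    score = kb_scores.get(entity)
    if score is None:
        return "unknown"  # entity has no score in the KB
    high_cut, mid_cut = cutoffs
    if score >= high_cut:
        return "high"
    if score >= mid_cut:
        return "medium"
    return "low"


# Hypothetical KB with 100 entities and scores 100.0 down to 1.0.
kb = {f"e{i}": float(100 - i) for i in range(100)}
cuts = global_cutoffs(kb)

# The same cutoffs are then reused for every benchmark dataset.
dataset_entities = ["e0", "e10", "e54", "e55", "not_in_kb"]
print({e: bucket(e, kb, cuts) for e in dataset_entities})
```

Because the thresholds are derived once from the target KB, an entity lands in the same segment regardless of which dataset it appears in. Entities with tied scores at a threshold all fall into the higher segment in this sketch.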

For the remixed datasets, there are no quantitative results, so it is questionable what the impact of those remixed datasets would be. Following the observations above, having more remixed datasets in addition to the original benchmarks might as well reveal just a lot "more of the same" results.

Further remarks:
* In the introduction, the authors claim "an uprising trend for many systems to focus on the solution of rather specific tasks", but there is no evidence given for that claim.
* References should be provided for the datasets and tools used in the evaluation in section 4.

In summary, I like the idea of the paper and the quantitative comparison of the tools by different dataset characteristics. However, in the current state, the findings are somewhat limited and actually contradict the authors' motivation.


This is the table I referred to as the attachment. I am sorry that I am not allowed to provide it in better formatting.

Babelfy DBpedia Spotl. Dexter FOX KEA TagMe 2 WAT AGDISTIS Correlation Rank Correlation
No Filter 0.53 0.56 0.39 0.33 0.32 0.63 0.59 0.58 0.52
Persons 0.81 0.69 0.53 0.57 0.44 0.84 0.77 0.8 0.74 0.928476908948633 0.735456429244596
Org. 0.71 0.83 0.65 0.75 0.55 0.88 0.79 0.8 0.77 0.809032389920648 0.855723970397881
Places 0.77 0.82 0.57 0.55 0.54 0.78 0.81 0.8 0.75 0.962609017713317 0.782216917237991
PageRank 10% 0.68 0.76 0.5 0.48 0.39 0.79 0.74 0.75 0.63 0.97939505820968 0.911667111420618
PageRank 10%-55% 0.69 0.75 0.5 0.5 0.4 0.8 0.75 0.74 0.62 0.974564339690133 0.93727940573435
PageRank 55%-100% 0.72 0.7 0.48 0.46 0.36 0.81 0.74 0.75 0.63 0.980776643431662 0.923320523989795
HITS 10% 0.67 0.78 0.48 0.48 0.4 0.82 0.74 0.74 0.62 0.971750053927405 0.93727940573435
HITS 10%-55% 0.69 0.74 0.51 0.52 0.4 0.79 0.75 0.75 0.64 0.972568213887863 0.970914599352419
HITS 55%-100% 0.68 0.69 0.48 0.47 0.36 0.79 0.74 0.73 0.61 0.979745692600116 0.986907536948871