Remixing Entity Linking Evaluation Datasets for Focused Benchmarking

Tracking #: 1783-2996

Jörg Waitelonis
Henrik Jürges
Harald Sack

Responsible editor: 
Guest Editors Benchmarking Linked Data 2017

Submission type: 
Full Paper
In recent years, named entity linking (NEL) tools were primarily developed as general-purpose approaches, whereas today numerous tools focus on specific domains, e.g. the mapping of persons and organizations only, or the annotation of locations or events in microposts. However, the available benchmark datasets necessary for the evaluation of NEL tools do not reflect this trend towards specialization. We have analyzed the evaluation process applied in the NEL benchmarking framework GERBIL and all its benchmark datasets. Based on these insights, we have extended the GERBIL framework to enable a more fine-grained evaluation and in-depth analysis of the available benchmark datasets with respect to different emphases. This paper presents the implementation of an adaptive filter for arbitrary entities and customized benchmark creation, as well as the automated determination of typical NEL benchmark dataset properties, such as the extent of content-related ambiguity and diversity. These properties are integrated on different levels, which also makes it possible to tailor customized new datasets out of the existing ones by remixing documents based on desired emphases. Besides a new system library to enrich provided NIF datasets with statistical information, the paper presents best practices for dataset remixing and an in-depth analysis of the performance of entity linking annotators on special focus datasets.

Minor Revision

Solicited Reviews:
Review #1
By Ziqi Zhang submitted on 13/Jan/2018
Minor Revision
Review Comment:

The revision has addressed the major issue, i.e., lack of novelty, by adding substantial, carefully designed experiments that reveal some interesting insights into the proposed measures of NEL datasets. I very much appreciate the dedication of the authors and I think the quality and value of the paper have greatly improved. But there remain many issues that perhaps a minor revision can address.

Among these, the most prominent one remains the math notation. Unfortunately, the improved math still contains too many errors, making it difficult to read. *** I urge the authors *** to invite a math person to proofread your next revision before submission, to make sure your math makes sense. To help you, I have rewritten large parts of your math and included this at the very end of my review. PLEASE BE WARNED that my recommendation is no guarantee of correctness, as for many of them I had to rely on my own understanding of your intention, which was unclear in many places.

Next, please answer/correct the following minor issues:
- page 13, just under the section 4.2 title: 'to gain more insights on the interplay of annotators' ---- I think you should use 'systems' instead of 'annotators' from this point onwards, as the latter is by convention often understood as human annotators. Your previous occurrences of the word 'annotators' also meant humans, so it is confusing here.
- table 4: add a note to say the table is best viewed in colour. It was not clear to me until later that the table is not just black and white, which caused confusion while reading.
- page 16, figure 19: 5 systems had a dip in performance in partition 7. This is certainly not just because of the confusion measure; can you give some insight into why this happened?
- page 17 bottom: 'there is a weak correlation among hits and confusion of entities. This could be interpreted as with increasing partition number there are less entities with lower popularity, which might cause better results'. Why do we need to know the correlation between hits and confusion of entities? I thought you only need to talk about the correlation between hits and performance?
- page 20, 2nd paragraph on the left: the unfair dataset results are poor; could this be because the amount of data is very small? Are any of the NEL systems supervised? If so, this may cause overfitting and thus bad results, not necessarily bias in your datasets. Please comment on this.
- page 20, last paragraph on the left: you should say a few words drawing a conclusion from section 4.2, to explain why the 'easy' dataset is easy and why the 'difficult' one is difficult.

Finally, run a spell checker to correct misspellings. I will not list individual cases.

Your math still has a lot of errors. I suggest: 1) make corrections based on the recommendations below - I have done these based on *my own* understanding, which is no guarantee that they are all correct or in the best form; 2) you MUST have someone familiar with math and your work proofread it before the next submission; 3) add a table as an appendix listing all notations you used with an explanation - this is optional but will greatly improve readability, considering that you have too many different notations in the paper.
- Above equation 1: A dataset D is a set of documents d \in D (instead of t, which is confusing). A document consists of annotations and text as a tuple (d_t, d_a), where d_t is the text of d and d_a is the set of annotations in d
- Equation 1: change t to d
- Equation 2: delete this, the number of annotations in a document can now be |d_a|. Change accordingly your text underneath this equation
- Under equation 2: ‘let E^{D} (instead of E_{D}) denote all entities within a dataset D and S^{D} (instead of S_{D}) denote all used surface forms within a dataset D’. QUESTION: What is the relation between ‘annotations’, ‘entities’ and ‘surface forms’? You should clarify this.
- Above equation 3: in general, the number of annotations within a document |d_a| is a measure … the average number of annotations PER DOCUMENT IN THE CORPUS (if not what you meant, change it), na(D), divides the total number of annotations in the corpus by the total number of documents: na(D) = \sum_{d \in D}{|d_a|} / |D| ------------ your original equation does not make sense because your upper term is total # of annotations in corpus, your lower term, according to your current definition, is ‘the number of annotations WITHIN A (SPECIFIC) DOCUMENT’. What’s the semantics of dividing these two numbers and which document would it be?
- Equation 4 change to: nad(D) = |{d | |d_a| = 0}| / |D| ----------- your original equation is wrong because the upper term does not make sense. The sum must add up numbers, but inside your sum ‘a(t)=0’ returns a Boolean
- Above equation 5: however, we propose…. As the relation between the number of annotations in the ground truth and the overall document length len(d) determined by the number of words, with ma …
- Equation 5: ma(D) = \sum_{d \in D}{|d_a|} / \sum_{d \in D}{len(d)}
- Equation 6: By convention, capital letters denote sets and small letters denote individuals. So: ‘let p(e) denote the PageRank computed for e, and let the category interval be given by a, b \in [0, 1]: E^{D}_{a,b} = {e \in E^{D} | a <= p(e) <= b}’ ---------- your original equation p(D,P) is misleading: the equation returns a SET of ENTITIES, so use variations of the capital letter E instead.
- Left, first paragraph: The overall set of all possible entities for a surface form is E^{s}, which is also referred to… The dictionary known to the annotator E’^{s} is a subset of E^{s} … ------ again, you are referring to a set of entities, so consistently use the capital E instead
- Following the above part: the surface form of a dataset S_{D} can also be interpreted as a subset of V_{sf} =>> this does not make sense because S_D are surface forms, V_{sf} (now E^{s}) are entities. How could the first be a subset of the second??
- Following the above, ‘the likelihood of confusion for the surface form …. determined by the cardinality of the union of the known entities IN THE DATASET and the entities known to the annotators: E^{s}_{D} \cup E’^{s} --------- again, your original equation D \cup W_sf does not make sense. D is DOCUMENTS, W_sf is ENTITIES, so the union doesn’t make sense
- The second paragraph on the left column: … the overall set of all possible surface forms for an entity is S^{e} (outer lower box), which is also …. The annotators know only a subset S’^{e} which is a subset of S^{e} … the dataset … only contains S^{e}_{D} which is also a subset of S^{e}…. The likelihood of confusion ….by the cardinality of the union of the known surface forms S^{e}_{D} \cup S’^{e} --------- again, for the same reason stated above, your original D \cup W_e does not make sense because it is a union between D documents and W_e surface forms
*** you should also correct **** your figures 2 and 3 accordingly, once you changed your math notations.
- On the right column: this part is very confusing. First, you said before that the confusion measure will use ‘E^{s}_{D} \cup E’^{s}’ and ‘S^{e}_{D} \cup S’^{e}’, but these do not appear in your equations 7 and 8. Also, what is the ‘annotator system dictionary W’, and how is it related to ‘entities known in the dataset’ and ‘entities known to the annotator’? My understanding is that it is neither the first nor the second, but that it includes the first? I do not know how to correct this part. But it seems your equations 7 and 8 should be: c_{S}(D) = ( |{E^{s}_{D} \cup E’^{s} \forall s \in S^{D}}| ) / |S^{D}|, and c_{E}(D) = ( |{S^{e}_{D} \cup S’^{e} \forall e \in E^{D}}| ) / |E^{D}|. In words, the first is the ratio between (all entities known to the dataset plus all entities known to the annotator) and (the number of surface forms in the dataset); the second is the ratio between (all surface forms known in the dataset and known by annotators) and (the number of entities in the dataset). Please re-phrase these properly.
- Equations 9 and 10: these appear to be generally OK, but you need to use || to return the size of sets and change the notations accordingly. You also need to make it clear how W (which should now be replaced by something to do with E) relates to ‘entities known in the dataset’ and ‘entities known to the annotator’. So:
- Equation 9: dom_S(W,D) = ( \sum_{s \in S^D} { |E^D_{s}| / |E^W_{s}| } ) / |S^D|
- Equation 10: dom_E(W,D) = ( \sum_{e \in E^D} { |S^D_{e}| / |S^W_{e}| } ) / |E^D|
- Equation 11: this is still wrong. s \in W_sf doesn’t make sense. W_sf returns a set of entities, so how can a ‘surface form’ s belong to a set of entities? According to your definitions in words, it should be something like: max_recall(W, D) = | { S^{e}_W \forall e \in E^D } | / |S^D|, where S^{e}_W returns the set of surface forms for entity e in dictionary W. Again, I cannot emphasize enough that you ***must*** clarify the relation of W to ‘entities known in the dataset’ and ‘entities known to the annotator’.
- Equation 12 is, again, wrong. You say t is in the range (0,1), but your equation 12 certainly does not return a fraction, but integers greater than 1. Rewrite as: let T(e) return the type of an entity e; the set of entities of a specific type t in the dataset D is {e | e \in E^D and T(e) = t}.
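
To make the intended semantics of the rewritten equations 3 to 6 concrete, they can be sketched in a few lines of Python. This is only a minimal illustration of the formulas as recommended above, not the authors' implementation: the `Document` class, the function names, the whitespace tokenization used for len(d), and the `pagerank` lookup dictionary are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """A document d = (d_t, d_a): its text and its ground-truth annotations."""
    text: str
    # Hypothetical representation: each annotation is a (surface_form, entity) pair.
    annotations: list = field(default_factory=list)


def na(dataset):
    """Equation 3: average annotations per document, sum(|d_a|) / |D|."""
    return sum(len(d.annotations) for d in dataset) / len(dataset)


def nad(dataset):
    """Equation 4: fraction of documents without annotations, |{d : |d_a| = 0}| / |D|."""
    return sum(1 for d in dataset if not d.annotations) / len(dataset)


def ma(dataset):
    """Equation 5: annotation density, sum(|d_a|) / sum(len(d)).

    Assumption: len(d) is the number of whitespace-separated words.
    """
    total_words = sum(len(d.text.split()) for d in dataset)
    return sum(len(d.annotations) for d in dataset) / total_words


def entities_in_interval(entities, pagerank, a, b):
    """Equation 6: E^D_{a,b} = {e in E^D | a <= p(e) <= b}; returns a set of entities."""
    return {e for e in entities if a <= pagerank[e] <= b}
```

Note that `na` and `nad` divide by the number of documents |D|, and that `entities_in_interval` returns a set rather than a number, which is exactly the distinction the comments on equations 3, 4 and 6 are about. All three ratio functions assume a non-empty dataset.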

Review #2
By Michelle Cheatham submitted on 14/Jan/2018
Minor Revision
Review Comment:

One of my concerns with the previous version of the paper was related to the annotation density metric. The authors warn that this metric only estimates the number of missing annotations, and they provide some support for this by showing the correlation between this metric and precision. However, that is not in my view sufficient to show that the density metric is valuable -- for that, the authors would need to show that the metric successfully predicts missing annotations by comparing it to documents that have been manually annotated by humans. This still has not been done, so I remain unconvinced of this metric's utility.

On a similar note, I think the claims made in the second paragraph of section 4.1 are too strong (e.g. "For A2KB tasks, these datasets well lead to an increased false positive rate..."). It is possible that the documents that contain no annotations actually don't have any text relevant to the topics being annotated, and are therefore true rather than false positives. The actual situation can't be determined without any manual analysis.

The additional experimental results added to the paper are interesting and add significantly to its value. However, I think the authors should explain some aspects of the results more fully. For example, the Pearson correlations of the "fair" and "unfair" datasets (these labels are perhaps not ideal) with the entire dataset are actually quite similar. In addition, as the authors state, three annotators actually perform better on the unfair dataset than on the fair one. Both of these things are counter-intuitive and would benefit from more explanation. Similarly, it would also help to include, either at the end of section 4 or in the conclusion, a summarizing discussion of the overall advantages and disadvantages of the re-mixing process, as supported by section 4.

----- Minor Comments -----

I believe some problems with the formulas in the paper still remain. In particular:

I think there may be a problem with equation 3. You define |A| as the number of annotations within a particular document. If na(D) is intended to be the average number of annotations per document in the corpus then why divide by |A|?

The paragraph preceding equation 12 states that the range is (0, 1); however, the equation defines a set, not a number.
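
As a hedged illustration of the two remarks above, the formulas might read as follows when written in the notation suggested in Review #1 (a document d = (d_t, d_a), the entity set E^D, and a type function T(e)). The first line is the per-document average with |D| in the denominator. The second line is only one possible reading of equation 12: the normalisation by |E^D| is an assumption made here to obtain a value between 0 and 1, and is not taken from the paper.

```latex
\begin{align}
  na(D) &= \frac{\sum_{d \in D} |d_a|}{|D|} \\
  \frac{\left|\{\, e \in E^{D} \mid T(e) = t \,\}\right|}{|E^{D}|} &\in [0, 1]
\end{align}
```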

Additionally, there are a number of relatively minor errors that remain unaddressed:

When Section 4.2.2 describes table 4, it says there are only 10 items in the rightmost partition. It then lists more than ten items.

Section 4.2.7 refers to figure 13, but it actually discusses figure 24.

"in deep" -> "in depth"
"on document level" -> "on the document level" (same for on dataset level, on entity mention level)
"will explained" -> "will be explained"
"with text fragment" -> "with the text fragment"
"as vocabulary of surface forms" -> "as a vocabulary of surface forms"
"Booth functions" -> "Both functions"
"und" -> "and"
"Besides with the existing NIF vocabulary the statistics has been..." -> this sentence is very confusing.
"It is subject of future work" -> "It is a subject of future work"
"Based on these information embedded in the NIF dataset files, a cstomized..." -> "Based on the information embedded in the NIF dataset files, a customized..."
"combine the datasets to one large dataset" -> "combine the datasets into one large dataset"
"reasonable well" -> "reasonably well"
"only a very few number of items" -> "only a very few items"
"To achieve a more evenly" -> "To achieve a more even"
"reasonable even" -> "reasonably even"
"as disambiguation task" -> "as a disambiguation task"
"With exception of" -> "With the exception of"
"only very few" -> "only a very few"
"THis" -> "This"
"there are less entities" -> "there are fewer entities"
"only few" -> "only a few"
"of the tree different domains" -> "of the three different domains"
"annotation,s" -> "annotations,"
"the result datasets" -> "the resulting datasets"
"a raising number" -> "a rising number"
"Therefor" -> "Therefore"
"the annotators performance" -> "the annotators' performance"
The last sentence in the paper is worded in a very confusing way.

Review #3
By Heiko Paulheim submitted on 02/Mar/2018
Minor Revision
Review Comment:

I appreciate the care with which my original concerns have been addressed. In fact, the paper is now much clearer w.r.t. the original claim, and the empirical findings now support that claim rather than contradicting it.

I also see that, mainly due to the other reviews, the definitions have been revised to follow a stricter mathematical notation. However, from my point of view, these notations need a bit more care. I will detail those issues below. Furthermore, there are a few smaller inconsistencies, but nothing that could not be fixed with a careful revision.

Detailed list of points that should be addressed:
* Tables 1 and 2 are not consistent. E.g., prominence in table 2 is only en and an, while it is ds, doc, and an in table 1. Furthermore, "en" is not explained in the caption, I suppose it denotes "entity". Following that notion, PageRank should be "en", not "an" in table 2 (at least, PageRank and HITS should have the same tag).
* In eq 3, I suppose the denominator should be |D|, not |A|?
* p.4: "Documents without annotations lead to an increase of false positives" - this is a pretty strong claim at this point, although later supported empirically. "may lead" might be a bit more defensive. Later, on p. 11, this claim is repeated (sparser documents lead to more false positives) - actually, sparsity in the annotation can also stem from a combination of a knowledge base and a document. Very specific domain documents with little coverage in the KB will always be sparsely annotated, even if the annotation is complete w.r.t. the KB at hand.
* eq. 6 should rather be p(D,a,b) than p(D,P)?
* the captions in Fig.2 and Fig.3 (D, W, Ve and Vsf) are not in line with the accompanying text
* in the same context, it is confusing using D in a different way than before (it used to be the set of documents)
* p.5, the four possible locations, last equation: I'm not 100% sure, but I think this must be intersection, not union
* eq. 7 and 8: it is unfortunate to use e as a function here, since it was a variable for a single entity in the paragraph above. This is confusing
* Sec. 2.9: it is unclear why those measures are computed both as micro and macro, while others (prominence, ambiguity) are not. Without further explanation, this looks a bit arbitrary.
* Moreover, this explanation comes a bit late. You have defined your measures in the formulas in one or the other way (e.g., eq. 5 is clearly a micro formulation), and later state that it can be different as well. I would rather pull section 2.9 before the definitions and always give both definitions.
* p.11: you state that "a well balanced dataset should exhibit a relation of..." - it would be worthwhile pointing out which of the datasets comes closest to that optimum, and which is the farthest away.
* p.11: the explanation from the response letter of this revision why the different partitions according to PageRank and HITS do not sum to 100% should be included here.
* section 4.2: initially, 10 tools are mentioned, but all graphs and tables depict fewer (often different subsets). The reason is a bit unclear and hidden between the lines. Most prominently, table 5 misses WAT.
* section 4.2: there are often findings that are not explained, e.g., the peaks at bucket 7 in Fig. 18 and 19, the local minimum in Fig. 19, etc. In contrast to that, the findings presented are often trivial (e.g., at the end of 4.2.2). Here, a closer look would be appreciated.
* The explanation of the effects of partition 10 in 4.2.4 is a bit shallow. I guess that some of the effect comes from the confusion of dbp:United_States and dbp:Americas, but this should be examined more closely.
* For the remixed subsets in section 4.2.8, I would appreciate seeing a table with the metrics introduced before. This would help to better interpret the finding that different annotators work better for different domains, and to understand whether this is an effect of the domain per se or rather of another characteristic of that particular subset.

Minor issues:
* p.3: The sentence "A dataset D is a set of documents..." is barely readable. Rephrasing it as "A dataset D is a set of documents. A document t \in D consists of annotations and text, i.e., t = (T,A), where..." would increase understandability.
* p.4: "The number of not annotated documents is calculated..." - the below equation computes a *fraction*, not a *number*
* p.4: fix hyphenation of "PageRank"
* Fig. 2+3: I wonder whether V_sf and V_e are actually finite sets - the figures indicate that by using rectangles, but can we actually know all possible surface forms for an entity and vice versa? Moreover, this set may not be stable over time. This is just a small philosophical detail, but I wanted to share this anyways.
* p.13, last paragraph of 4.2 should end in a .
* p.14: "The red horizontal lines" - in my printed copy, I see red points, no red lines here
* p.15: "only 10 items [...] these are in particular" - the list following this sentence is actually longer than 10
* Caption of Fig. 20: capitalization of PageRank
* p.18: reference should go to Fig. 24, not 13