Review Comment:
This paper presents an approach to building a large-scale annotated dataset for training NER taggers in the domain of cultural heritage (CH) objects, specifically for recognising titles of artworks.
This is a well-written and clearly structured paper that is easy to follow. The authors identify cultural heritage objects as a broad category that has received little attention in fine-grained NER research and for which few or no NER training datasets exist. Moreover, they provide a realistic test case of a heterogeneous set of OCR'ed documents that contain text-recognition errors and from which most structural/layout information has been removed. This is a common situation that poses additional challenges which cannot easily be brushed aside by focusing on a small set of highly curated and correct documents.
Beyond the challenges of working with digitised CH texts, the authors do a good job of highlighting and detailing the specific challenges and kinds of errors made in recognising names in the domain of CH objects.
The paper is well embedded in a large body of recent and older literature. I particularly like that the authors explicitly discuss how this work extends one of their previous contributions; this is really helpful to readers who are familiar with that work.
The evaluation shows how the three stages influence overall performance, and the authors include an analysis of how the size of the dataset affects performance, which I wish more authors would do, as it gives an indication of the cost-benefit trade-off of putting in additional annotation effort. To get an even better impression of this, it would be useful if the authors included even smaller fractions (e.g. 5, 10, 15, and 20%), as that would probably make the overall shape of the curve clearer. I wouldn't call this a required revision, but it would definitely make the contribution even stronger.
The main issue I have with this paper is how the authors define the scope of their problem. The choice to narrow "artwork" to only paintings and sculptures prompts several questions. Why focus only on those types of artworks? Is it preferable to use this broad label for a narrow interpretation of the category (covering only these few types), or to use a different label for the narrower subset (e.g. would "visual artwork" be more appropriate)? And to what extent does the answer depend on the domain and types of texts that are to be NER'ed?
The public datasets from Wikidata and Getty are probably biased towards more popular CH objects, which probably skews the NER tagger towards the titles of these objects, leaving it to miss or underperform on the long tail of titles of less well-known objects. This is not a criticism of the paper, just a question out of interest (I don't expect the authors to tackle all the problems of NER in the CH domain in one go). Is there any reason to believe the annotated set of titles is or is not representative of artwork titles in general? I would like the authors to add some reflection on whether they think this bias is present and causes a problem for recognising artworks in general, and if so, what the consequences of this bias are and whether there are possible ways to deal with it.
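As a concrete, purely hypothetical way to probe this concern (not the authors' method), one could look at the distribution of Wikipedia sitelink counts, a rough popularity proxy, over the painting items that serve as title sources; a heavily head-skewed distribution would support the worry about over-representation of famous works. A minimal sketch against the public Wikidata SPARQL endpoint, where the query, the sample limit, and the use of sitelinks as a proxy are all my own assumptions:

```python
import requests

# Hypothetical probe, not the authors' pipeline: fetch sitelink counts
# (a rough popularity proxy) for a sample of Wikidata painting items.
QUERY = """
SELECT ?links WHERE {
  ?item wdt:P31 wd:Q3305213 ;     # instance of: painting
        wikibase:sitelinks ?links .
}
LIMIT 5000
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "title-bias-probe/0.1 (review illustration)"},
)
resp.raise_for_status()
counts = sorted(
    int(b["links"]["value"]) for b in resp.json()["results"]["bindings"]
)

# A median far below the maximum indicates a long-tailed, head-heavy
# popularity distribution, i.e. famous works dominate the source.
print(f"n={len(counts)}, median={counts[len(counts) // 2]}, max={counts[-1]}")
```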
Overall, this paper makes a clear contribution towards domain-specific fine-grained NER, with consequences for any downstream tasks that rely on it, and it can be improved further with some minor revisions.
One comment on the fact that this paper is submitted to a special issue of SWJ: although the authors use Wikidata and Getty datasets as sources of titles, the connection with the Semantic Web is not so clear, as the focus is on NLP/NER problems and solutions.
Finally, the paper makes no mention of whether the annotated dataset and the trained models are or will be made available.
Strong points:
- the focus on the CH domain for NER addresses domain-specific problems and categories and results in a training dataset that is a valuable resource for further research and for use in enriching and linking CH collections.
- the authors discuss the specific challenges in a clear and structured way.
Points for improvement:
- the exclusion of many types of artworks needs to be discussed, as do the consequences of using a broad category label for a narrow definition.
- the criteria used in filtering the list of titles to include in the annotated dataset in the first of the three stages need to be argued for and their consequences discussed (preferably with statistics on how each step reduces the size of the dataset and how it qualitatively changes the nature of the dataset).
Specific comments:
- p. 4, col. 2, lines 45-46: "to generate high-quality training corpus" -> "to generate a high-quality training corpus"
- p. 6, col. 1, lines 43-44: "that can be found in dictionary" -> "that can be found in a dictionary"
- p. 7, col. 1, lines 37-38: "refine annotations for artwork named entity." -> "refine annotations for artwork named entities."
- p. 7, col. 2, lines 26-29: The authors remove artwork names consisting of a single word, as many of these are highly generic words. This is a very significant step, but its consequences are not discussed in much detail. I can imagine that a significant number of one-word titles are highly uncommon or even unique to the artwork. The authors should discuss why the problem of generic words is tackled by focusing only on the length of the title (a single word) and not on the commonness of the single title word (see the sketch after this list of comments). Moreover, it would be useful to provide statistics on how many/what fraction of artworks are removed in each filtering step, to make the consequences clearer.
- p. 8, col. 1, line 23: "by expert user community" -> "by an/the expert user community" or "by expert user communities"
- p. 8, col. 2, line 45: "in entity dictionary" -> "in the entity dictionary"
- p. 8, col. 2, lines 45-46: "was maintained as spans " -> "were maintained as spans"
- p. 9, col. 1, lines 23-24: "with the help of set of labelling functions and patterns" -> "a set of" or "sets of"
- p. 10, col. 1, lines 26-27: "referring an artwork" -> "referring to an artwork"
- p. 10, col. 2, lines 26-27: "After filtering out English texts and performing initial" -> This sounds as though the English texts are removed from the dataset, although the authors earlier on indicated that they focus on English. If that is the case, I think it would be clearer to say "After removing all non-English texts ..."
- p. 10, col. 2, lines 44-49: It seems as though the authors are focusing on paintings and sculptures, yet refer to them with the broader title of "artwork" and comment on the OntoNotes5 category of "work_of_art" as including many other things beyond paintings and sculptures. I think this needs to be discussed more clearly in the introduction or section 3, e.g. what types of entities do the authors include in and exclude from the category of artwork in this task, and why. Wikipedia and several dictionaries consider "artwork" and "work of art" to be synonyms and to include all these types of objects. Novels, films, musical pieces and video games are also artworks and cultural heritage objects, so the chosen focus on paintings and sculptures requires some discussion as to whether, and if so how, they are different from other types of artworks. Example 6 in Table 5, where the name of a novel is tagged as an artwork, would be interesting to incorporate in that discussion. To compare the challenges and issues with e.g. identifying book titles, see refs [1-3] below (full disclosure, I'm co-author on [1]).
- p. 11, col. 1, lines 28-29: "on Ontonotes5 dataset" -> "on the Ontonotes5 dataset"
- p. 11, col. 1, lines 46-47: "an NER framework in form of" -> "an NER framework in the form of"
- p. 12, col. 2, lines 32-33: "in semi-automated manner" -> "in a semi-automated manner"
- p. 13, col. 2, line 50: "of annotation dataset" -> "of the annotation dataset"
- p. 14, col. 2, line 28: "including artwork" -> the sentence structure suggests this should be "including artworks"
- p. 14, col. 2, line 29: "a few examples texts" -> "a few example texts"
- p. 14, col. 2, line 38: "texts that needs" -> "texts that need"
- p. 15, col. 1, line 23: "in semi-automated manner" -> "in a semi-automated manner"
- p. 15, col. 1, line 46: "on existing knowledge graph" -> "on existing knowledge graphs"
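To make the comment on p. 7, col. 2, lines 26-29 concrete, here is a minimal sketch of the commonness-based filter I have in mind, assuming the wordfreq package; both the package choice and the Zipf-frequency cutoff are my own illustrative assumptions, not something the paper proposes:

```python
from wordfreq import zipf_frequency

def keep_title(title: str, max_zipf: float = 3.5) -> bool:
    """Drop one-word titles only when the word is common in general English.

    zipf_frequency returns roughly 0 for unseen words and about 7 for the
    most frequent English words; the 3.5 cutoff is purely illustrative.
    """
    words = title.split()
    if len(words) != 1:
        return True  # multi-word titles are kept, as in the paper's setup
    return zipf_frequency(words[0].lower(), "en") < max_zipf

print(keep_title("Spring"))    # likely False: a generic dictionary word
print(keep_title("Guernica"))  # likely True: rare outside the artwork
```

Such a filter would keep rare or coined one-word titles that the length-only filter discards, which is exactly the class of titles I suspect is being lost.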
References:
[1] Bogers, T., Hendrickx, I., Koolen, M., & Verberne, S. (2016). Overview of the SBS 2016 mining track. In CLEF (Working Notes) (pp. 1053-1063).
[2] Ollagnier, A., Fournier, S., & Bellot, P. (2016). Linking Task: Identifying Authors and Book Titles in Verbose Queries. In CLEF (Working Notes) (pp. 1064-1071).
[3] Ziak, H., & Kern, R. (2016). KNOW At The Social Book Search Lab 2016 Suggestion Track. In CLEF (Working Notes) (pp. 1183-1189).