Review Comment:
This revised version of the paper improves on many of the issues raised previously. The authors added more detail and nuance about the experimental setup, the artwork category and its connection to other categories, and the potential popularity bias. The authors now also include a link to the code and trained models.
As I mentioned in my previous review, this paper makes a good contribution by discussing the value and challenges of NER for artworks and by providing datasets and models. The one remaining point for improvement is the limited discussion of the choice to exclude one-word titles: what kind of bias this creates and how it potentially affects users of the model's output.
Specific comments:
P. 8: On the previous version of this paper I asked the authors to discuss the consequences of removing one-word titles from the annotation data. In this revision, the authors mention that only 5% of titles are affected by this step. I appreciate the elaboration, but although that is a low percentage, they still don't discuss the implications of this exclusion, which introduces a systematic bias into the ground truth dataset. Thinking about using the output of an NER tagger trained on this data, I wonder how users would respond if told that the available titles exclude all artworks with a one-word title.
P. 12, footnote 16: It would be good to include the SpaCy and Flair version numbers as well as the names and versions of the specific trained models that were used, since improved versions, especially of pre-trained models, are released regularly. The requirements.txt in the GitHub repository only has the SpaCy and Flair version numbers, not the models that were re-trained. If this takes up too much space in the paper, refer to the repo README and add the details there.
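As a concrete suggestion (the package and model names below are hypothetical placeholders, not the versions the authors actually used), spaCy models can be pinned to an exact release directly in requirements.txt, which would make the setup fully reproducible:

```text
# Hypothetical example of pinning library AND model versions in requirements.txt
spacy==3.4.1
flair==0.11.3
# spaCy model pinned to an exact release via a direct wheel URL (placeholder version)
nl_core_news_lg @ https://github.com/explosion/spacy-models/releases/download/nl_core_news_lg-3.4.0/nl_core_news_lg-3.4.0-py3-none-any.whl
```

For Flair, where models are fetched by name at load time, stating the exact model identifier and release in the README would serve the same purpose.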
P. 12, col. 1: "was configures" -> "was configured"
P. 14, sec. 6.1: the addition of the smaller sample sizes is very useful, as it shows that the curves are rapidly stabilising (as is to be expected), although the SpaCy model under relaxed conditions seems to keep improving more strongly, particularly on precision. This suggests there is value in increasing the training set size, but also that at the current dataset size, the model is mainly getting better at roughly spotting where titles are mentioned, and much less at identifying the exact titles.