Abstract:
The ability to compare frameworks from the same domain is of central importance for their introduction into complex applications. In the domains of named entity recognition and entity linking, the large number of systems and their orthogonal evaluation with respect to measures and datasets has led to an unclear landscape regarding the abilities and weaknesses of the different frameworks. We present Gerbil---an improved platform for repeatable, storable and citable semantic annotation experiments---and describe how we have extended it since its release. With Gerbil, we narrowed this evaluation gap by generating concise, archivable, human- and machine-readable experiments, analytics and diagnostics. The rationale behind our framework is to provide developers, end users and researchers with easy-to-use interfaces that allow for the agile, fine-grained and uniform evaluation of annotation tools on multiple datasets. By these means, we aim to ensure that both tool developers and end users can derive meaningful insights into the extension, integration and use of annotation applications. In particular, Gerbil provides comparable results to tool developers so as to allow them to easily discover the strengths and weaknesses of their implementations with respect to the state of the art. With the permanent experiment URIs provided by our framework, we ensure the reproducibility and archiving of evaluation results. Moreover, the framework generates data in a machine-processable format, allowing for the efficient querying and post-processing of evaluation results. Additionally, the tool diagnostics provided by Gerbil allow users to derive insights into the areas in which tools should be further refined, thus enabling developers to create an informed agenda for extensions and end users to select the right tools for their purposes. Finally, we implemented additional types of experiments, including entity typing.
Gerbil aims to become a focal point for the state of the art, driving the research agenda of the community by presenting comparable, objective evaluation results. Furthermore, we tackle the central problem of the evaluation of entity linking, i.e., we answer the question of how an evaluation algorithm can compare two URIs to each other without being bound to a specific knowledge base. Our approach to this problem opens a way to address the deprecation of URIs in existing gold standards for named entity recognition and entity linking, a feature which is currently not supported by the state of the art. We derived the importance of this feature from usage and dataset requirements collected from the Gerbil user community, which has already carried out more than 24,000 single evaluations using our framework. Through the resulting updates, Gerbil now supports 8 tasks, 46 datasets and 20 systems.