Benchmarking Question Answering Systems

Tracking #: 1578-2790

Ricardo Usbeck
Michael Röder
Michael Hoffmann
Felix Conrads
Jonathan Huthmann
Axel-Cyrille Ngonga Ngomo
Christian Demmler
Christina Unger

Responsible editor: 
Ruben Verborgh

Submission type: 
Full Paper
The need to make the Semantic Web more accessible to lay users, together with the uptake of interactive systems and smart assistants for the Web, has spawned a new generation of RDF-based question answering systems. However, the fair evaluation of these systems remains a challenge due to the different types of answers that they provide. Hence, repeating currently published experiments, or even benchmarking on the same datasets, remains a complex and time-consuming task. We present a novel online benchmarking platform for question answering (QA) that relies on the FAIR principles to support the fine-grained evaluation of question answering systems. We show how the platform addresses the fair benchmarking of question answering systems through the rewriting of URIs and URLs. In addition, we implement different evaluation metrics, measures, datasets and pre-implemented systems, as well as support for novel formats for the interactive and non-interactive benchmarking of question answering systems. Our analysis shows that most current frameworks are tailored towards particular datasets and challenges and do not provide generic models. In addition, while most frameworks perform well in the annotation of entities and properties, the generation of SPARQL queries from annotated text remains a challenge.
Minor Revision

Solicited Reviews:
Review #1
By Philipp Cimiano submitted on 20/Apr/2017
Minor Revision
Review Comment:

This paper presents a benchmarking framework for question answering systems that builds on the well-known GERBIL framework for evaluating named entity linking systems.
The paper does not present a novel method but a framework supporting the fair and objective comparison of different QA systems over linked data. This is a very important contribution to research, as it will allow others to exploit the benchmarking framework to progress in their research through systematic comparison to other systems. Without frameworks such as the one presented in this paper, such a fair comparison would simply not be possible.
In implementing this framework, the authors have exploited their existing framework, GERBIL. GERBIL can be said to have had an important impact on the community. It is widely used by people developing entity linking algorithms and has brought more transparency to research results on named entity linking, allowing everyone working in the field to use the same evaluation criteria and datasets.
Building on the success of GERBIL, the authors present a new instantiation of the framework, which they dub GERBIL QA, for the case of question answering systems over linked data. In general, the work is well motivated and gives appropriate treatment to the state of the art.

My main concern, which I would like to see addressed, is the missing motivation for the particular experiment types considered. In particular, I do not understand why the authors decided to evaluate Resources to KB and Properties to KB separately. The other experiment types seem adequate, but a better motivation for why they were chosen is needed.

There are a number of small mistakes in the paper that need to be corrected before it can be published. I leave it to the authors to proof-read the manuscript and will not list the minor glitches here.

Review #2
Anonymous submitted on 21/Apr/2017
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

1) Originality: The authors present a benchmarking tool for QA systems. GERBIL QA enables testing systems in a fair environment, with traceable experiment links, the same dataset splits for all tested systems, etc. As the authors mention, previous QALD systems did not provide any web service, so it is not possible for future researchers to compare against them in a fair fashion. The sub-experiments mentioned in the paper are also important for understanding the behavior of QA systems.

Minor changes to the tool: I think it is important to provide versions of the tool for running an experiment. Since the tool might be extended or changed in the future, the user should be able to test against those versions as well. Also, it would be useful for the tool to ask for the languages a QA system supports when that system is added.

2) Significance of results: Since the authors present a benchmarking tool for QA systems, they do not perform any evaluation of their own.

3) Quality of writing: The writing is quite clear, and the points are explained in the appropriate order. The paper is easy to follow and understand.

Review #3
By Lei Zou submitted on 07/Oct/2017
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

General Comments:
The paper presents an online benchmarking platform for question answering, which includes different evaluation metrics, measures, datasets and pre-implemented systems.
It makes sense to compare different QA systems in such a fair environment. Generally, this paper is interesting and deserves publication.

Positive Comments:
1. The topic of this paper has practical value: a credible online benchmarking platform for question answering makes comparing different QA systems more convenient, which can promote research on question answering.
2. The proposed question answering datasets are of high quality, with gold-standard SPARQL queries produced by human annotation. The datasets contain not only special questions but also general questions and imperative sentences. To answer these questions, QA systems must handle many challenges such as multi-hop relations, variables, implicit relations and aggregations.
3. The benchmark supports both online systems and file-based evaluation campaigns, which is convenient for QA researchers debugging their systems. Requiring an online web service may avoid human intervention in the QA system, which guarantees a fair comparison. The benchmark also offers many pre-implemented systems and provides various kinds of evaluation analysis.

Negative Comments:
1. Although the datasets are of high quality, the number of questions is still too small. QALD-6_train_Multilingual has 333 questions, but almost 300 of them already occurred in previous datasets. The real number of distinct questions is no more than 600. It is hard to train a neural network model on such small-scale data.
2. The benchmark system is complex but not very user-friendly. For example, there are many metrics and data points on the experiment result page, but users cannot tell what they mean.
3. There are still some bugs in the benchmark system. Sometimes the results are confusing, and sometimes an experiment runs for too long. A user who is not logged in can access any experiment via a URL containing the experiment ID.


Decision: Accept

The authors present the GERBIL system for benchmarking Q/A systems.
The results section only evaluates the proposed framework; it does not compare against any competitors.

It would be preferable to participate in the CLEF competition for the Q/A task...

Why do the authors use precision and not the C@1 measure for the evaluation [Peñas and Rodrigo, 2011]?

C@1 = (1 / n) * (n_C + (n_U * n_C / n))
n: number of problems
n_C: number of correct responses
n_U: number of unanswered questions
Unlike precision and recall, this measure takes into account cases of indecision.
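The C@1 formula above can be sketched as a small function (a minimal illustration; the function name and example counts are my own, not from the review):

```python
def c_at_1(n_correct, n_unanswered, n_total):
    """C@1 measure (Penas and Rodrigo, 2011).

    Unlike plain accuracy, C@1 rewards a system for leaving a
    question unanswered instead of answering it incorrectly:
    each unanswered question is credited at the rate n_correct / n_total.
    """
    if n_total == 0:
        return 0.0
    return (n_correct + n_unanswered * n_correct / n_total) / n_total

# A system answering 60 of 100 questions correctly and abstaining
# on 20 scores higher than its raw accuracy of 0.6:
print(c_at_1(60, 20, 100))  # 0.72
```

Note that with no unanswered questions, C@1 reduces to plain accuracy, which is why it only changes the picture for systems that can abstain.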

Regarding the references: why are there no recent references from 2017?

Dear Seifeddine Mechti,

sorry for the late reply; I missed the notification that there was a new review.

1) We do participate with the system in different challenges, e.g.,
2) GERBIL QA is extensible, so if you want to have the "C@1 measure", please open an issue at . We primarily reused well-known measures used by most participants to evaluate their approaches.
3) There are no recent references, since the paper was submitted in early 2017, before most major conferences took place.

Thanks for your open review! Stay tuned for the final version.