Review Comment:
This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.
General Comments:
The paper presents an online benchmarking platform for question answering, which includes different evaluation metrics, measures, datasets and pre-implemented systems.
It makes sense to compare different QA systems in such a fair environment. Generally, this paper is interesting and deserves publication.
Positive Comments:
1. The topic of this paper has practical value: a credible online benchmarking platform for question answering makes comparing different QA systems more convenient, which can advance research on question answering.
2. The proposed question answering datasets are of high quality, with gold SPARQL queries produced by human annotation. The datasets contain not only wh-questions but also yes/no questions and imperative sentences. To answer these questions, QA systems must handle many challenges such as multi-hop relations, variables, implicit relations and aggregations (see the sketch after this list).
3. The benchmark supports both online systems and file-based evaluation campaigns, which is convenient for QA researchers when debugging their systems. Requiring an online web service can rule out human intervention in the QA system, which guarantees a fair comparison. The benchmark also offers many pre-implemented systems and provides various kinds of evaluation analysis.
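To illustrate the kind of annotation referred to in point 2, a hypothetical QALD-style entry might look as follows. The concrete question, the field layout and the DBpedia URIs are illustrative assumptions rather than an excerpt from the actual datasets; note the aggregation (COUNT) in the gold query.

# Hypothetical QALD-style dataset entry (illustration only; question,
# field layout and URIs are assumptions, not taken from the real data).
example_entry = {
    "id": "42",
    "question": [{"language": "en",
                  "string": "How many movies did Stanley Kubrick direct?"}],
    # Gold SPARQL query annotated by a human; uses an aggregation (COUNT).
    "query": {
        "sparql": "SELECT (COUNT(DISTINCT ?film) AS ?count) WHERE { "
                  "?film <http://dbpedia.org/ontology/director> "
                  "<http://dbpedia.org/resource/Stanley_Kubrick> . }"
    },
}

print(example_entry["question"][0]["string"])
print(example_entry["query"]["sparql"])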
Negative Comments:
1. Although the datasets are of high quality, the number of questions is still small. QALD-6_train_Multilingual has 333 questions; however, almost 300 of them already occurred in the previous datasets. The real number of distinct questions is no more than 600. It is hard to train a neural network model on such small-scale data.
2. The benchmark system is complex but not very user-friendly. For example, the experiment result page shows many metrics and figures, but users cannot easily tell what they mean (see the sketch after this list).
3. There are still some bugs in the benchmark system. Sometimes the results are confusing, and sometimes an experiment runs for too long. A user who is not logged in can access any experiment via a URL with its experiment ID.
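Regarding point 2: result pages of such benchmarks typically report macro- and micro-averaged precision, recall and F-measure. The following sketch shows how the two averaging schemes differ; the per-question gold and system answer sets are invented for illustration, and GERBIL's exact definitions may differ.

# Sketch of macro- vs micro-averaged precision and recall over questions.
# The gold/system answer sets below are invented for illustration.
def prf(gold, system):
    tp = len(gold & system)                       # true positives
    p = tp / len(system) if system else 0.0       # precision
    r = tp / len(gold) if gold else 0.0           # recall
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1

questions = [
    ({"Berlin"}, {"Berlin"}),        # fully correct
    ({"Paris", "Lyon"}, {"Paris"}),  # partially correct
    ({"42"}, set()),                 # unanswered
]

# Macro: compute per-question scores, then average them.
per_q = [prf(gold, system) for gold, system in questions]
macro_p = sum(p for p, _, _ in per_q) / len(per_q)
macro_r = sum(r for _, r, _ in per_q) / len(per_q)

# Micro: pool all answers first, then compute a single score.
tp = sum(len(gold & system) for gold, system in questions)
micro_p = tp / sum(len(s) for _, s in questions)
micro_r = tp / sum(len(g) for g, _ in questions)

print(f"macro P={macro_p:.2f} R={macro_r:.2f}")   # macro P=0.67 R=0.50
print(f"micro P={micro_p:.2f} R={micro_r:.2f}")   # micro P=1.00 R=0.50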
Review Comment:
Decision: Accept
The authors present the GERBIL system for benchmarking Q/A systems.
The results section only evaluates the proposed framework and does not really compare it with any competitors.
It would be preferable to participate in the CLEF competition for the Q/A task...
Why do the authors use precision and not the C@1 measure for the evaluation [Peñas and Rodrigo, 2011]?
c@1 = (1 / n) * (nc + (nu * nc / n))
n: number of problems
nc: number of correct responses
nu: number of unanswered questions
Unlike precision and recall, this measure takes into account cases of indecision.
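For concreteness, a small sketch of this computation (variable names follow the definition above; the counts in the example call are invented):

# C@1 as defined above: c@1 = (1/n) * (nc + nu * nc / n)
#   n  - number of problems
#   nc - number of correct responses
#   nu - number of unanswered questions
def c_at_1(n, nc, nu):
    if n == 0:
        return 0.0
    return (nc + nu * nc / n) / n

# Invented example: 333 questions, 200 correct, 30 left unanswered.
print(c_at_1(n=333, nc=200, nu=30))  # ~0.655, vs. plain accuracy 200/333 ~ 0.601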
Regarding the references: are there no recent references from 2017?
Addressing your review
Dear Seifeddine Mechti,
sorry for the late reply, but I missed the notification that there was a new review.
1) We do participate with the system in different challenges, e.g., http://2018.nliwod.org/challenge
2) GERBIL QA is extensible, so if you want to have the C@1 measure, please open an issue at https://github.com/dice-group/gerbil/issues . We primarily reused well-known measures used by most participants to evaluate their approaches.
3) There are no recent references, since the paper was submitted in early 2017, before most major conferences took place.
Thanks for your open review! Stay tuned for the final version.