MQALD: Evaluating the impact of modifiers in Question Answering over Knowledge Graphs

Tracking #: 2585-3799

Authors: 
Lucia Siciliani
Pierpaolo Basile
Pasquale Lops
Giovanni Semeraro

Responsible editor: 
Harald Sack

Submission type: 
Dataset Description
Abstract: 
Question Answering (QA) over Knowledge Graphs (KG) has the aim of developing a system that is capable of answering users' questions using the information coming from one or multiple Knowledge Graphs, like DBpedia, Wikidata and so on. Question Answering systems need to translate the question of the user, written using natural language, into a query formulated through a specific data query language that is compliant with the underlying KG. This translation process is already non-trivial when trying to answer simple questions that involve a single triple pattern and becomes even more troublesome when trying to cope with questions that require the presence of modifiers in the final query, i.e. aggregate functions, query forms, and so on. The attention over this last aspect is growing but has never been thoroughly addressed by the existing literature. Starting from the latest advances in this field, we want to make a further step towards this direction by giving a comprehensive description of this topic, the main issues revolving around it and, most importantly, by making publicly available a dataset designed to evaluate the performance of a QA system in translating such articulated questions into a specific data query language. This dataset has also been used to evaluate the best QA systems available at the state of the art.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Simon Gottschalk submitted on 05/Nov/2020
Suggestion:
Major Revision
Review Comment:

The authors introduce the dataset MQALD, which is made of SPARQL queries with modifiers, their results, and verbalisations. As most QA systems struggle with SPARQL queries that contain modifiers, this is a relevant dataset for evaluating the expressiveness of QA systems. Part of the dataset comes from the existing QALD challenges, plus 100 newly created queries. The dataset itself is freely available on Zenodo and has well-structured and well-verbalised queries in four languages. The evaluation of QA systems on these datasets is very detailed and interesting. However, I see several drawbacks with respect to the dataset creation, quality, and usefulness. This review follows four steps: a review of the paper itself, a review of the dataset and code, a check of the SWJ dataset paper evaluation criteria, and some minor comments.

== Paper ==

- Related Work/QA Systems analysis: The main motivation of the paper is to create a dataset of queries that QA systems are known not to handle well. While the authors state that "only few works" consider the construction of queries containing modifiers, they still evaluate their dataset on three systems that are not contained in the respective citations (i.e., [5,6,7]). If such systems do not consider modifiers, bad results do not come as a surprise. I still think that the evaluation in Table 5 and Table 6 is interesting, as it shows how close systems can still come to answering these types of questions. However, I would expect an analysis beforehand of which systems consider which types of modifiers. Also, in contrast to what the authors state (page 9, line 41ff in the right column), I can retrieve the SPARQL queries generated by TeBaQA and gAnswer (at least using their websites).

- MQALD vs MQALD_ext: The authors' biggest contribution is the manual creation of the MQALD_ext dataset. However, Section 3 starts with a strong focus on a subset of the QALD datasets, which at first gives the impression that nothing more than that has been done. I suggest clarifying this early (e.g., in the intro of Section 3). I would also suggest renaming the MQALD_ext dataset, as it is not just an extension of (M)QALD but the main contribution. The text describing the QALD queries mentions several "errors" in this data (e.g., "OFFSET 0"). For the dataset described in the paper (i.e., MQALD), I would suggest providing a well-integrated version containing a cleaned version of the QALD queries with modifiers (e.g., without things like "OFFSET 0") plus the newly created queries (MQALD_ext) in one file (and maybe a training and test split).
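
To make the "OFFSET 0" point concrete, a minimal sketch of what such a cleaning step could look like (a hypothetical query, not taken from QALD; dbo:Country and dbo:populationTotal are used only for illustration). Since OFFSET 0 is a no-op in SPARQL, both forms return the same result:

PREFIX dbo: <http://dbpedia.org/ontology/>

# As reportedly found in some QALD queries: the trailing OFFSET 0 is a no-op.
SELECT ?country WHERE {
  ?country a dbo:Country ;
           dbo:populationTotal ?population .
}
ORDER BY DESC(?population)
LIMIT 1
OFFSET 0

# Cleaned, equivalent form for an integrated MQALD file.
SELECT ?country WHERE {
  ?country a dbo:Country ;
           dbo:populationTotal ?population .
}
ORDER BY DESC(?population)
LIMIT 1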

- Creation of questions: Although highly relevant for a dataset paper, there is no detailed description of how MQALD_ext was created. Who created the questions? How were they selected? Was the SPARQL query or the verbalisation written first? Were they motivated by the QALD questions (see, e.g., the similar questions about countries with the Euro as currency)? I also do not understand the statement about languages on page 7, right column, line 39f (does this imply that machine translation was used for French and Spanish?).

- GERBIL: On page 9, line 12ff (left column), the authors mention two error messages coming from GERBIL. Unresolved errors are not something I would expect in a paper, especially since they sound like the error is on the side of MQALD. I suggest that the authors either fix those errors (maybe with the help of the GERBIL team?) or change that passage in the paper. In general, an inclusion of MQALD into GERBIL would be a great way to demonstrate the visibility and usefulness of the dataset.

- Related Work: There are more QA datasets that should be mentioned. Most importantly, the second version of LC-QuAD, but also others such as TempQuestions, ComplexWebQuestions, Event-QA, and more.

- KG independent queries: I am not sure if I agree that the query in Listing 8 is actually problematic (compared, e.g., to the really problematic examples at the beginning of page 7): in any case, a QA system needs to identify a property. Why not multiple ones?

- OWA and versions: Several of the presented questions assume a complete (e.g., "Which is the fifth most populous country in the world?") or up-to-date knowledge graph (Listing 9). It is totally fine to keep these questions, but I would expect a brief discussion of these aspects (the open-world assumption and the necessity to fix the knowledge graph version and the question date when using a QA dataset). Actually, it is not quite clear which DBpedia version is used (the 2016-10 version is mentioned on page 12, but does that also refer to MQALD?).

== Dataset & Code ==

- Variety: MQALD_ext lacks topical variety of questions. For example, there are several questions about Elvis and Harry Potter, about books, cities, countries, mountains, lakes, and artists, but none about sports, science, conflicts, or other topics that are prominent in DBpedia.

In general, the dataset is well-formatted and the English query verbalisations look good. I still have comments on some questions (based on a randomly selected subset of questions in MQALD_ext):
- Question 161: Change to "OFFSET 2" to retrieve the third-highest mountain (see the sketch after this list).
- Question 162: This is a question that particularly suffers from the OWA and also from a missing definition of what still counts as a mountain. So, although syntactically all fine, I would prefer to replace this question (this is debatable, though).
- Question 169: "creator" or "painter" instead of "author" in the verbalisation
- Question 239: Results should be given as two columns, not two rows. Otherwise, one cannot tell which value is the population and which is the population density.
- Question 244: "?child dbo:child dbr:Elvis_Presley" needs to be changed to "dbr:Elvis_Presley dbo:child ?child"
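
A minimal sketch of the two corrections suggested above (the class and property names, e.g. dbo:Mountain and dbo:elevation, are my assumptions and may differ from the ones actually used in the dataset):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Question 161: OFFSET 2 skips the two highest mountains,
# so LIMIT 1 returns the third-highest one.
SELECT ?mountain WHERE {
  ?mountain a dbo:Mountain ;
            dbo:elevation ?elevation .
}
ORDER BY DESC(?elevation)
LIMIT 1
OFFSET 2

# Question 244: dbr:Elvis_Presley must be the subject of dbo:child, not the object.
SELECT ?child WHERE {
  dbr:Elvis_Presley dbo:child ?child .
}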

The code on the GitHub page has no proper documentation in the README file.

== Evaluation Criteria (see http://www.semantic-web-journal.net/reviewers) ==

- "such a paper shall ... give information ... on the following aspects of the dataset: name, URL, version date and number, licensing, availability, etc.; ... method of creation": There is no information about date, number and licensing given in the paper (concerning the zenodo release, I am surprised to see the MIT license used for QALD, which is typically used for software). There is no proper description of the method of creation (see my comment above).

(1) Quality and stability of the dataset - evidence must be provided: No evidence is given, and my look at example queries in the datasets confirms errors in the data (see my comments above).

(2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided: There is no evidence given of how the dataset is used yet (only the authors themselves have used it to benchmark three QA systems). One suggestion would be a proper integration with GERBIL.

(3) Clarity and completeness of the descriptions: As stated above, the actual MQALD_ext dataset description is somewhat hidden. For example, the statistics in Table 1 do not include this dataset. A joint description of the contributions would clarify that.

- "Papers should usually be written by people involved in the generation or maintenance of the dataset, or with the consent of these people". A large part of the dataset and the paper is about QALD so I was indeed expecting a statement about this.

== Minor ==

(page, line number, left/right column)
- P2, 8l: "DBpedia [3] and Wikidata"
- P2, 38l: Remove space after "dbr:"
- P2, 42l: "When was the President of the USA born?"
- P3, 12l: use multicite ([5,6,7])
- P3, 17l: "MQALD[, or :] a"
- P3, 5r: It can be a list of values as well?
- P3, 20r: "SimpleQuestions" instead of "SimpleQuestion"
- P4, 30l,ff: Change order: "questions extraction from ... that require ..."
- P6, 13l,ff: Use proper quotation marks
- P6: 48l: "SPARQL [q]uery forms"
- P8, 31r: Use proper capitalisation for the WDAqua dataset
- P10, Table 4: There are more than six datasets in the table.
- P3, 2l: DESCRIBE is never used, so I would probably skip it here. Actually, I am not even sure whether I would consider "ASK" a query modifier or a particularly complex task, but it's okay.
- Please take some time to beautify the SPARQL and JSON listings. Avoid column or page breaks within listings (see Listings 2 and 8), compactify them (see the end of Listing 2), capitalise all reserved words (also "year", "now", "regex", ...), and maybe even add some code highlighting (a sketch of the intended style follows after this list).
- I do not like the look of "MQALD_ext". I suggest using a proper subscript.
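
To illustrate the listing style suggested above (a made-up query, not Listing 2 or 8 from the paper; dbo:Person and dbo:birthDate are only placeholders): reserved words capitalised, the query kept compact, and no column or page break inside the listing:

PREFIX dbo: <http://dbpedia.org/ontology/>

# Reserved words (SELECT, WHERE, FILTER, YEAR, NOW, REGEX, STR, LIMIT) in upper case.
SELECT ?person WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?date .
  FILTER ( YEAR(NOW()) - YEAR(?date) < 30 && REGEX(STR(?person), "Smith") )
}
LIMIT 10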

Review #2
By Ricardo Usbeck submitted on 10/Nov/2020
Suggestion:
Minor Revision
Review Comment:

This manuscript was submitted as 'Data Description' and has been reviewed along the following dimensions.

# Summary/Description

The article describes MQALD, a version of the QALD datasets focused on SPARQL translations of natural language questions that contain modifiers (aggregate functions, solution sequence modifiers, query forms). This is the first dataset to thoroughly study the effect of specific SPARQL modifiers and how to bridge the lexical gap for them.
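
For context, a made-up example of such a modifier-bearing translation (not taken from MQALD; dbo:director is used only for illustration): the question "How many films did Stanley Kubrick direct?" cannot be answered with a plain triple pattern but needs an aggregate function:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# COUNT is the aggregate-function modifier; DISTINCT avoids duplicate bindings.
SELECT (COUNT(DISTINCT ?film) AS ?count) WHERE {
  ?film dbo:director dbr:Stanley_Kubrick .
}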

# Short facts

Name: MQALD
URL: http://doi.org/10.5281/zenodo.4050353
Version date and number: 2.0, May 21, 2020
Licensing: MIT
Availability: guaranteed
Topic coverage: not applicable
Source for the data: The existing QALD benchmark series
Purpose and method of creation and maintenance: By extracting SPARQL queries containing modifiers and adding 100 (!) novel questions
Reported usage: None. Maybe in the future.
Metrics and statistics on external and internal connectivity: none.
Use of established vocabularies (e.g., RDF, OWL, SKOS, FOAF): RDF
Language expressivity: English, Italian, French, Spanish only.
Growth: none.
5-star data?: no.

# Quality and stability of the dataset - evidence must be provided
Due to the merge process, there is likely test set leakage. The QALD-7 test set is entirely contained in the QALD-8 training set, and the QALD-8 test set is contained in the QALD-9 training set. Thus, systems see the same questions in the training data as well as in the test data. In any case, that can easily be fixed.

# Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided.
They were not provided.

# Clarity and completeness of the descriptions.
The paper is well-written, and the description is clear, which enables replication.

# Overall impression and Open Questions
The cited surveys and approaches all rely on classical IR/QA approaches. I wonder how deep learning approaches, as mentioned in [2], handle the situation. While most systems are hard to reproduce, I would be happy to see a literature cross-check.

Why is there no reference to LC-QuAD 2.0, and why wasn't it or the other datasets used as a base? Please add a short justification to Section 2.

Given the low number of modified questions (see Table 2), the dataset is undoubtedly not suited to train neural approaches, but it can serve as an adversarial dataset for testing the generalizability of neural approaches.
Another highlight is the extension's focus on KG-agnostic query modifiers, which work on any KG and will drive the field of KG-agnostic QA systems forward. A downside of the extension is that it reinforces the skewed distribution towards popular (?) modifiers that we see in Table 2. That could be fixed by adding more template-based questions to MQALD and paraphrasing them afterward.
Also, previous work [1] (not mentioned in the paper at hand) highlighted the difficulty of aggregation functions only broadly, while this dataset goes deeper!
It is also positive that the authors provide their evaluation script online.

# Minor issues
P2:l37 - put the footnotes at the end, not in the middle of a URI
P2:l50 - Citation of [12] is not necessary as it describes the entity linking evaluation, not the QA evaluation. Please remove.
P5 - The JSON formatting could be improved by setting it over two columns and putting closing and opening brackets on the same line. The way it is done now makes it hard to read and does not add to the paper.
P5:l42 - add a ~ between \ref{...}~Table to avoid breaking to a new line.
P8: Add to the paper the DBpedia version against which your extension was created. That is, on which dump can all MQALD questions be executed?
P9: The second issue with GERBIL (uploading the answer and dataset files) should be solved; see https://github.com/dice-group/gerbil/issues/344

[1] Saleem, M., Dastjerdi, S. N., Usbeck, R., & Ngonga Ngomo, A.-C. (2017, October). Question Answering Over Linked Data: What is Difficult to Answer? What Affects the F scores? In BLINK/NLIWoD3@ISWC.
[2] Chakraborty, Nilesh, et al. "Introduction to neural network-based approaches for question answering over knowledge graphs." arXiv preprint arXiv:1907.09361 (2019).

Review #3
By Rony Hasan submitted on 02/Dec/2020
Suggestion:
Minor Revision
Review Comment:

In this work, the authors introduce a dataset, MQALD, which is developed by adopting data from the previous QALD challenges where questions contain modifiers. The authors describe the different modifiers used in the dataset with statistics and examples. However, the annotation process is not clearly explained.

In the abstract, the following claim is not completely true: "The attention over this last aspect is growing but has never been thoroughly addressed by the existing literature." Aren't modifiers already discussed in Dubey et al., 2016 (AskNow: A Framework for Natural Language Query Formalization in SPARQL, ESWC)?
In the introduction, add some references to support the claim "since it allows the creation of a Natural Language Interface (NLI)". How does it allow the creation of an NLI? In Section 3 (MQALD), proper indentation in Listing 2 would make the data more readable. In Section 3.2.8, Listing 8 is broken; it would be better to reposition the listing so that it does not break.
In Section 3.3, details about the annotators and the annotation tools used are not provided. How many annotators were used? And how many annotators per language? What was the inter-annotator agreement?
Finally, the overall writing quality of the paper could be improved. Consider breaking some of the long sentences into smaller, simpler sentences. For example, the following sentence in the abstract could be written as multiple sentences: "Starting from the latest advances in this field, we want to make a further step towards this direction by giving a comprehensive description of this topic, the main issues revolving around it and, most importantly, by making publicly available a dataset designed to evaluate the performance of a QA system in translating such articulated questions into a specific data query language."
Typos:
———
In abstract:
state of the art -> state-of-the-art
In introduction:
- The quotation starts with a curly quotation mark but ends with a straight one in "Who directed Philadelphia?". This issue occurs in many places throughout the paper
- Linstings 1 -> Listing 1
- composed by -> composed of. (Several times in different sections)
- e.g. -> e.g.,
In related work:
- translated in -> translated into
- i.e. -> i.e.,
In 3:
- JSON Object -> JSON object
- relevant with -> relevant to

Repetition of similar issues (missing articles, wrong prepositions, missing punctuation marks) throughout the whole paper.

NB: Please check the page limit for dataset description papers. The submitted manuscript is 12 pages long (13 pages including the reference section). Please follow the submission guidelines and make changes to the paper so that it fits within the required page limit.