Review Comment:
The authors introduce the dataset MQALD, which consists of SPARQL queries with modifiers, their results, and their verbalisations. As most QA systems struggle with SPARQL queries that contain modifiers, this is a relevant dataset for evaluating the expressiveness of QA systems. Part of the dataset comes from the existing QALD challenges, plus 100 newly created queries. The dataset itself is freely available on Zenodo and has well-structured and well-verbalised queries in four languages. The evaluation of QA systems on these datasets is very detailed and interesting. However, I see several drawbacks with respect to the dataset creation, quality, and usefulness. This review follows four steps: a review of the paper itself, a review of the dataset and code, a check of the SWJ dataset paper evaluation criteria, and some minor comments.
== Paper ==
- Related Work/QA Systems analysis: The main motivation of the paper is to create a dataset of queries that QA systems are known not to handle well. While the authors state that "only few works" consider the construction of queries containing modifiers, they still evaluate their dataset on three systems that are not contained in the respective citations (i.e., [5,6,7]). If such systems do not consider modifiers, bad results come as no surprise. I still think that the evaluation in Table 5 and Table 6 is interesting, as it shows how close systems can still come to answering these types of questions. However, I would expect a preceding analysis of which systems consider which types of modifiers. Also, in contrast to what the authors state (page 9, line 41ff in the right column), I can also retrieve the SPARQL queries generated by TeBaQA and gAnswer (at least using their websites).
- MQALD vs MQALD_ext: The biggest contribution of the authors is the manual creation of the MQALD_ext dataset. However, Section 3 starts with a strong focus on a subset of the QALD datasets, which initially gives the impression that nothing more than that has been done. I suggest clarifying this early (e.g., in the intro of Section 3). I would also suggest renaming the MQALD_ext dataset, as this is not just an extension of (M)QALD but the main contribution. The text describing QALD queries mentions several "errors" in this data (e.g., "OFFSET 0"). For the dataset described in the paper (i.e., MQALD), I suggest providing a single, well-integrated file containing a cleaned version of the QALD queries with modifiers (e.g., without things like "OFFSET 0") plus the newly created queries (MQALD_ext) (and maybe a training and test split).
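To illustrate the kind of cleanup I mean (the triple pattern below is a hypothetical sketch, not taken from the actual QALD data): since "OFFSET 0" is a no-op in SPARQL, it can simply be dropped without changing the result set:

```sparql
# As found in some QALD queries: the modifier is present but has no effect.
SELECT ?country WHERE {
  ?country dbo:currency dbr:Euro .
}
OFFSET 0

# Cleaned version for an integrated MQALD file: identical results.
SELECT ?country WHERE {
  ?country dbo:currency dbr:Euro .
}
```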
- Creation of questions: Although highly relevant for a dataset paper, there is no detailed description of how MQALD_ext was created. Who created the questions? How were they selected? SPARQL query or verbalisation first? Were they motivated by the QALD questions (see, e.g., the similar questions about countries with the Euro as currency)? I also do not understand the statement about languages on page 7, right column, line 39f (does this imply that for French and Spanish, machine translation was used?).
- GERBIL: On page 9, line 12ff (left column), the authors mention two error messages coming from GERBIL. Unresolved errors are not something I would expect in a paper, especially as they sound like the errors are on the side of MQALD. I suggest that the authors either fix these errors (maybe with the help of the GERBIL team?) or revise that passage in the paper. In general, an inclusion of MQALD into GERBIL would be a great way to demonstrate the visibility and usefulness of the dataset.
- Related Work: There are more QA datasets that should be mentioned. Most importantly, the second version of LC-QuAD, but also others such as TempQuestions, ComplexWebQuestions, Event-QA, and more.
- KG independent queries: I am not sure if I agree that the query in Listing 8 is actually problematic (compared, e.g., to the really problematic examples at the beginning of page 7): in any case, a QA system needs to identify a property. Why not multiple ones?
- OWA and versions: Several of the presented questions assume a complete (e.g., "Which is the fifth most populous country in the world?") or up-to-date knowledge graph (Listing 9). It is totally fine to keep these questions, but I would expect a brief discussion of these aspects (open-world assumption and the knowledge graph and the necessity to fix knowledge graph version and question date when using a QA dataset). Actually, it is not quite clear which DBpedia version is used (the 2016-10 version is mentioned on page 12, but does that also refer to MQALD?).
== Dataset & Code ==
- Variety: MQALD_ext lacks topical variety of questions. For example, there are several questions about Elvis and Harry Potter, about books, cities, countries, mountains, lakes, and artists, but none about sports, science, conflicts, or other topics that are prominent in DBpedia.
In general, the dataset is well-formatted and the English query verbalisations look good. I still have comments on some questions (based on a randomly selected subset of questions in MQALD_ext):
- Question 161: Change to "OFFSET 2" to retrieve the third-highest mountain.
- Question 162: This is a question that particularly suffers from the OWA, and it also suffers from a missing definition of what you would still consider a mountain. So, although syntactically fine, I would prefer to replace this question (this is debatable, though).
- Question 169: "creator" or "painter" instead of "author" in the verbalisation
- Question 239: Results should be given as two columns, not two rows. Otherwise you cannot tell which value is the population and which is the population density.
- Question 244: "?child dbo:child dbr:Elvis_Presley" needs to be changed to "dbr:Elvis_Presley dbo:child ?child"
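To make the two fixes above concrete, here are corrected sketches of the queries in question (prefixes omitted; the exact variable names and triple patterns in the dataset may differ):

```sparql
# Question 161: the third-highest mountain -- skip the two higher ones.
SELECT ?mountain WHERE {
  ?mountain a dbo:Mountain ;
            dbo:elevation ?elevation .
}
ORDER BY DESC(?elevation)
LIMIT 1
OFFSET 2

# Question 244: dbr:Elvis_Presley must be the subject of dbo:child,
# not its object, to retrieve his children.
SELECT ?child WHERE {
  dbr:Elvis_Presley dbo:child ?child .
}
```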
The code on the GitHub page has no proper documentation in the README file.
== Evaluation Criteria (see http://www.semantic-web-journal.net/reviewers) ==
- "such a paper shall ... give information ... on the following aspects of the dataset: name, URL, version date and number, licensing, availability, etc.; ... method of creation": There is no information about date, version number, and licensing given in the paper (concerning the Zenodo release, I am surprised to see the MIT license used for QALD, as it is typically used for software). There is no proper description of the method of creation (see my comment above).
(1) Quality and stability of the dataset - evidence must be provided: No evidence is given, and my look at example queries in the datasets confirms errors in the data (see my comments above).
(2) Usefulness of the dataset, which should be shown by corresponding third-party uses - evidence must be provided: There is no evidence given of how the dataset is used yet (only the authors themselves have used it to benchmark three QA systems). One suggestion would be a proper integration with GERBIL.
(3) Clarity and completeness of the descriptions: As stated above, the description of the actual MQALD_ext dataset is somewhat hidden. For example, the statistics in Table 1 do not include this dataset. A unified description of the contributions would clarify this.
- "Papers should usually be written by people involved in the generation or maintenance of the dataset, or with the consent of these people": A large part of the dataset and the paper is about QALD, so I was indeed expecting a statement about this.
== Minor ==
(page, line number, left/right column)
- P2, 8l: "DBpedia [3] and Wikidata"
- P2, 38l: Remove space after "dbr:"
- P2, 42l: "When was the President of the USA born?"
- P3, 12l: use multicite ([5,6,7])
- P3, 17l: "MQALD[, or :] a"
- P3, 5r: It can be a list of values as well?
- P3, 20r: "SimpleQuestions" instead of "SimpleQuestion"
- P4, 30l,ff: Change order: "questions extraction from ... that require ..."
- P6, 13l,ff: Use proper quotation marks
- P6: 48l: "SPARQL [q]uery forms"
- P8, 31r: Use proper capitalisation for the WDAqua dataset
- P10, Table 4: There are more than six datasets in the table.
- P3, 2l: DESCRIBE is never used, so I would probably skip it here. Actually, I am not even sure if I would consider "ASK" a query modifier rather than a particularly complex task, but it's okay.
- Please take some time to beautify the SPARQL and JSON listings. Avoid column or page breaks within listings (see Listings 2 and 8), compactify them (see the end of Listing 2), capitalise all reserved words (also "year", "now", "regex", ...), and maybe even add some code highlighting.
- I do not like the look of "MQALD_ext". I suggest to use a proper subscript.