QA3: a Natural Language Approach to Statistical Question Answering

Tracking #: 1489-2701

Maurizio Atzori
Giuseppe Mazzeo
Carlo Zaniolo

Responsible editor: 
Guest Editors ENLI4SW 2016

Submission type: 
Full Paper

In this paper we present QA3, a question answering (QA) system over RDF cubes. The system first tags chunks of text with elements of the knowledge base, and then leverages the well-defined structure of data cubes to create the SPARQL query from the tags. For each class of questions with the same structure, a SPARQL template is defined, to be filled in with SPARQL fragments obtained by interpreting the question. The correct template is chosen using an original set of regex-like patterns, based on both syntactic and semantic features of the tokens extracted from the question. Preliminary results obtained using a very limited set of templates are encouraging and suggest a number of improvements. QA3 can currently provide a correct answer to 27 of the 50 questions in the test set of task 3 of the QALD-6 challenge, markedly improving the state of the art in natural language question answering over data cubes.
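To make the template idea above concrete, the following is a minimal sketch of what such an aggregation template could look like (the `<...>` placeholders and the overall shape are illustrative assumptions, not taken from the paper; only the `qb:` vocabulary is standard):

```sparql
# Hypothetical aggregation template for questions such as
# "How much did X spend on Y in 2007?". The <...> placeholders are
# replaced with URIs and values obtained from the tagged chunks.
PREFIX qb: <http://purl.org/linked-data/cube#>

SELECT (SUM(?value) AS ?total)
WHERE {
  ?obs a qb:Observation ;
       qb:dataSet      <DATASET_URI> ;      # data cube selected by the tagger
       <MEASURE_URI>   ?value ;             # measure property of the cube
       <DIMENSION_URI> <DIMENSION_VALUE> .  # one triple per tagged constraint
}
```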

Major Revision

Solicited Reviews:
Review #1
By John Bateman submitted on 03/Dec/2016
Minor Revision
Review Comment:

The paper presents a fairly clean method of setting up queries for a
particular class of questions that might be asked of RDF-cubes. The
principal problems dealt with are aggregate queries of statistical
information over sets of data and the automatic construction of formal
queries on the basis of unconstrained natural language. The method
operates by classifying the tokens of the input question according to
several kinds of linguistic information and keywords and taking this
as a key into specific query templates that may then be
combined. The query templates are accessed via regular expressions
over the kinds of information extracted for the natural language
queries. Although relatively simple, the method appears to perform
well in relation to some standard testset evaluations. It may then
serve well as a basis for further work and discussion. The paper is
well written and straightforward to understand; there are, however,
quite a few very minor language problems to be fixed up: these are
primarily missing or extra determiners (a/the). Before the paper
is published, therefore, it would be necessary to have it read over
carefully in order to correct these mistakes. There are a number of
small phrasing problems that should be improved as well. The following
suggests some of the kinds of problems, but is only an indication:
there are many more.

create the SPARQL query from the tags
--> a SPARQL query

publish information about the public expenses
--> about public expenses

for finding list of entities with specific properties
--> lists

must be as much accurate as possible
--> must be as accurate as possible

in an aggregation query missing a constraint
--> an aggregation query missing a constraint

which allows to define a set of
--> which allows the definition of a set of


Review #2
By Oscar Corcho submitted on 06/Jan/2017
Major Revision
Review Comment:

QA3: a Natural Language Approach to Statistical Question Answering

This paper describes an approach, implemented in an online system, that supports NL-based question answering over statistical data cubes that have been made available in RDF following the RDF Data Cube W3C recommendation. This work is quite novel, since there is only one other specialized system in the state of the art for this type of task (although there are general-purpose question answering systems that can obviously be applied to this type of data).

Before moving into a detailed section-by-section review, these are the main highlights of this review:
- The approach seems sensible in general, especially with respect to the definition of the query templates that are used as the core basis for NL question understanding. It is interesting to see how a small number of query templates allows answering a good number of the usual queries, according to the testbed used for evaluation.
- However, the initial step of tagging questions with elements of the KB, which is used to select the data cubes against which the query is evaluated, as well as for the following phases, is very vaguely described, which limits the understandability of the approach and would obviously limit the reproducibility of the experiments. This is the most relevant reason for the recommendation proposed for this paper.
- Only a small number of templates are made available in the paper. The whole set of patterns should be made available online (e.g., on Figshare, Zenodo, GitHub, or similar) so that others may reuse them and replicate the experiments. This is also relevant for the evaluation of the paper.

That said, the work presented here looks very promising as an automated way to evaluate NL-based queries over statistical data. The results, as presented in the evaluation, are not as good as those obtained with systems like SPARKLIS (see figure 6), but obviously they are different types of systems with different purposes.

Now a more detailed review, section by section:
- In section 1, some parts of the text are written in a very non-scientific manner. A clear example is the paragraph on natural language interfaces, where statements like “the current state-of-the-art system is Xser” are given without much further explanation of how that system works, or where phrases like “This witnesses the fact that…” are used.
- Indeed, in section 1 the state-of-the-art part is very poor in general and lacks details of how those systems are built. I would actually recommend that the authors extend it and provide a deeper state of the art of NL-based question answering in a separate related work or state of the art section.
- When referring to those systems, you comment that none of them are specialized for data cubes, but then in the evaluation you present CubeQA. Is that correct?
- Section 2. Data Cube is a recommendation (not strictly a standard following the terminology used by W3C).
- I don’t like figure 1, since it is in fact incorrect according to the RDF Data Cube model. Measure properties are not necessarily numbers according to the specification, if I am not wrong. You also comment that attribute properties associate observations with literal values (e.g., the year), which is not necessarily correct, as they could also be URIs, AFAIK.
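For reference, a minimal Data Cube observation in Turtle illustrates the point that an attribute value can be a resource rather than a literal (the ex: terms are invented for illustration; only the qb:, xsd:, and sdmx-attribute: vocabularies are standard):

```turtle
@prefix qb:  <http://purl.org/linked-data/cube#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix sdmx-attribute: <http://purl.org/linked-data/sdmx/2009/attribute#> .
@prefix ex:  <http://example.org/> .

# Hypothetical observation: all ex: terms are invented for illustration.
ex:obs1 a qb:Observation ;
    qb:dataSet ex:publicExpenses ;
    ex:refYear "2007"^^xsd:gYear ;           # dimension with a literal value
    ex:amount  "1250000.0"^^xsd:decimal ;    # measure value
    sdmx-attribute:unitMeasure <http://dbpedia.org/resource/Euro> .  # attribute value is a URI
```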
- “the RDF Data Cube dictionary” —> To be precise, it should be simply “RDF Data Cube”.
- Section 3.1. How do you calculate the weight of chunks?
- Section 3.2. You mention a dictionary. Where is it made available?
- System: I have checked it, and I see that some of the predefined questions work, but there are also other questions for which a template is not found, and I cannot easily understand why.
- Section 4. In general, it lacks the “why these results”. I would really like to see further discussion on why the results are as they are and how they may be improved.
- Section 5 repeats parts of the introduction, and in fact it is still very vague in general when describing existing systems.
- In general, I do not understand well what the discussion section brings to those willing to implement similar systems or to understand the approach that has been presented.

Review #3
Anonymous submitted on 18/Feb/2017
Major Revision
Review Comment:

The paper describes an approach for question answering over RDF knowledge bases using SPARQL templates. For each question type supported, a template is defined whose variables are filled in with information retrieved from the question using both syntactic and semantic features.

Although the motivation is well described, the emphasis in this section is mainly placed on comparing the proposed framework to other relevant ones, which should be part of the related work section. I would suggest that the authors revise this section to describe the challenges and motivation, give a few details on what RDF cubes are (why are RDF cubes important? what problems do they solve? etc.), explain the approach followed in this paper, and present the contribution of the paper more clearly. A short comparison to the state of the art (shortcomings, different approaches, etc.) could be included in this section, but it need not be so detailed; more details can be provided in the related work section.

Section 3.1
More details are needed here. For example: “which is based on the weighted percentage of Q that is covered by tagged chunks”, “Chunks have weight based on the a priori probability of being tagged and their length”. The authors should formalize these statements so that they can be better understood. Moreover, regarding “the results of the matching…”: what is the logic behind the matching, and how is it performed? Are semantics taken into account?

Section 3.2
As the authors also explain, the system can answer only specific types of questions with a certain structure. It would be very useful to summarize the structural question requirements of the framework somehow, e.g., in a table, along with examples of questions that are not supported. For example, each pattern in Fig. 3 could be enriched with some example textual questions, as well as with examples that are not supported.

Overall, the framework seems to impose some quite hard restrictions on the types of questions that it supports (e.g., tokens must appear in a specific order), but also on the way information is captured in the KB (e.g., each dataset should have a default measure). This first of all hampers flexibility, but it also makes it difficult to use the framework on top of existing KBs that do not follow the RDF cubes model. The authors should further elaborate on these restrictions and describe how this inflexibility can be overcome.

Also, I believe that section 3 should be revised in order to better formalize the algorithms and metrics used. Although a verbal description of a methodology is sometimes easier to understand, important details are suppressed, which makes it difficult to obtain a clear view of the proposed approach.

A worked example is also missing; one would further help the reader understand the approach.

Section 4
As mentioned earlier, the framework requires a preprocessing phase in order to transform a KB into cubes. Although the authors do not explicitly mention this, it seems that the graphical tool that has been developed for evaluating the framework runs over a preprocessed snapshot of QALD-6. If this is the case, more details are needed regarding the effort required to transform a dataset into cubes and whether there are restrictions. For example, can any dataset be transformed into cubes? Is there any generic tool that can be used to achieve this? Or do we need to develop a separate tool from scratch for each dataset?

Also, the results of the experiments seem to test the ability of the system to find the correct dataset but not the correct answer. If this is the case, it seems that the experimental evaluation is only relevant for comparison with QA systems that also follow the RDF cubes model. So, here I have two questions:
1. Is it possible to compare the system to other general-purpose QA systems that do not follow the RDF cube model (for example, the systems that participate in the QALD challenges)?
2. Why wasn’t it possible to measure precision and recall on the actual results returned by the SPARQL queries, rather than computing them at the level of datasets?

Also, the term “statistical question answering” used in the title seems a little misleading. From what I have understood, there is no statistical reasoning involved in the framework, at least not “heavy” reasoning that would justify classifying the approach as statistical. A title containing “RDF cubes” or similar seems more meaningful to me.