IQA: Interactive SPARQL Query Construction in Semantic Question Answering Systems

Tracking #: 1953-3166

Hamid Zafar
Mohnish Dubey
Jens Lehmann
Elena Demidova

Responsible editor: 
Guest Editors Knowledge Graphs 2018

Submission type: 
Full Paper
Semantic Question Answering (QA) systems automatically interpret users' questions expressed in natural language in terms of formal semantic queries. This process involves uncertainty, such that the resulting semantic queries do not always accurately match the users' intent, especially for more complex and less common questions. In this article, we aim to empower users to guide QA systems towards the intended semantic queries by means of interaction. We introduce a probabilistic framework that enables seamlessly incorporating user interaction and feedback directly in the question answering process. We propose and evaluate a model for the interaction process. Our evaluation shows that using this framework, even a small number of user interactions can lead to significant improvements in the performance of QA systems.
Solicited Reviews:
Review #1
Anonymous submitted on 15/Oct/2018
Minor Revision
Review Comment:

The paper proposes a probabilistic framework for incorporating user interaction into the Question Answering process. It proposes and evaluates a specific model compliant with this framework, demonstrating some encouraging outcomes.

The paper presents some nice ideas, but falls down on presentation, particularly in the first half. It is missing adequate motivation and intuition for many of the concepts introduced and assumptions made. It is thus rather matter-of-fact in its presentation and does not easily convince the reader of the fundamental significance of what is being described.

In Section 2, it is unclear where the SPARQL query (mentioned at the top of the second column of page 3) comes from. It is stated that "such a formal query does not have to include interpretations of all the information nuggets" but it is not made clear why not. A bit later on it is stated that the SPARQL query "interprets the user question as a whole", which seems to contradict the earlier statement.

What is the meaning of a set of candidates "jointly interpreting" the elements of q_STR?

At the bottom of the second column of page 3, there is mention of "edge relations" and "The edge relation", so it is not clear if there is one or several of these.

On page 4, towards the end of Section 2, there is again a mixing in of SPARQL as one of the ways of representing a query interpretation. It would be clearer to give two separate formalisations, with and without the SPARQL aspects. The description also says "can be represented in two ways" which may imply that it could be one or the other. Whereas, from my understanding, it is actually both. So that phrase should rather be "is represented in two ways".

At the beginning of Section 3, what is the meaning of a "principled" user interaction scheme?

At the end of 3.1, there is mention of CQI_FQ whereas in equation (1) we have CQI_PL.

At the bottom of page 4, column 2, what is the intuition behind the four categories C1-C4? Why these, no more, no less?

Likewise, at the top of page 5, column 1, some intuition and explanation is required for the conditions 1)-4).

Later on in that column, we are again missing an intuition/justification of the complexity options C1-C4.

At the end of Section 5.1, it is unclear how the number of user evaluation questions - 400 - is arrived at. There are four complexity categories and we are told that 100 questions are taken from each apart from category 5, but this doesn't add up.

At the bottom of page 9, it is stated that "Each user evaluated ... questions on average" - the number is not given!

In Section 6.1, the third paragraph is unclear in its discussion of fixed templates including the rdf:type constraint. More details are needed.

In the discussion of results towards the end of page 11, insufficient explanation is given of why these results might arise.

The wording at the beginning of Section 6.2 needs some more precision: "such as IQA-OG and IQA-IG", "lays by four interactions"

Review #2
By Mariano Rico submitted on 22/Oct/2018
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions which include (1) originality, (2) significance of the results, and (3) quality of writing.

The quality of writing is good. Only a few typos have been found, and they are pointed out at the end of this review.
Concerning originality, this paper provides an interactive system to create SPARQL queries from NL sentences. From my perspective, the formalization of the pipeline is interesting, although it can be tough to follow due to the number of symbols and subtle mathematical details. However, the interactive approach is not new, and several related works are shown in the related work section. I miss a relatively recent system named Sparklis [1]. A comparison with this system should be provided.
The significance of the results is, in my opinion, the weak part of this paper. The main claim of this paper, as stated in the abstract, is that "a small number of user interactions can lead to significant improvements in the performance of QA systems". However, figure 6 shows that interaction is useful only for queries with low complexity (2 in the 2-5 range). The authors claim in section 6.1 that there is proof of this claim, but it only considers the approaches IQA-OG and IQA-IG used over the proposed system.
I also consider it important to be able to reproduce the results shown in the paper. To this end, I recommend providing a valid URL to test the tool. Otherwise it requires a leap of faith.
See more details below.

[1] Ferré, Sébastien. "Sparklis: an expressive query builder for SPARQL endpoints with guidance in natural language." Semantic Web 8.3 (2017): 405-418.

- Colors in figures. Printed in b&w, it is impossible to distinguish the bars (figs. 4 and 5) or lines (fig. 6). Please fill the bars with patterns instead of solid colors. The lines in figure 6 should have bullets with different shapes.
- Error bars. These are fine in figure 5, but I miss them in figures 4, 7 and 8.
- The introduction presents the balance between the number of interactions and usability. However, usability is not measured in the experiments with users.

Section 5.3.2: "Each user evaluated ... questions on average". Please specify the number.

Review #3
Anonymous submitted on 10/Nov/2018
Major Revision
Review Comment:

This paper aims to provide advances in the area of semantic question answering systems. In particular, it focuses on supporting the translation of the users’ natural language queries into formal (SPARQL) queries by proposing an interaction based approach. The goal is to propose an interaction mechanism that (1) balances uncertainty reduction in the QA pipeline and user interaction costs; and (2) maintains system usability. To that end, the paper proposes and evaluates a user interaction scheme based on the notion of “Option Gain”.

The topic of the paper in interesting and timely, and the authors present solid work that is promising for a journal publication. At the same time, the paper has the following shortcoming which need to be addressed before publishing the paper, as follows.

Comments related to problem setting:

The paper focuses on “complex” queries – yet this term is never really defined. What constitutes a complex query? Is it a long query which mentions many entities and relations (this corresponds to the definition of the complexity categories of LC-QuAD discussed in Section 5.1)? Or does the complexity rather come from the difficulty to match the query terms to the underlying knowledge graph structure (this would be the case of the example query given in Section 1)?

Comments related to the proposed solution:

The proposed solution needs to be better explained. I found the paper hard to follow – a lot of theoretical aspects of the proposed solution are presented upfront without an intuitive example that would allow an easier understanding of these rather abstract parts (an intuitive example appears only on page 7; this would ideally be presented much earlier). Furthermore, many details remain unclear: for example, the heuristics for defining usability(IO) remain completely unmotivated (what is the intuition behind those?) and weakly defined (e.g., C1 – does similarity refer to *string* similarity, and how is it computed?; C2 – does path refer to "shortest path"?). I did not understand section 3.3 – the proposed model is neither justified nor explained in sufficient detail. It is also questionable whether the proposed solution is generic, as it is defined for a certain type of QA pipeline.

The concrete realization of the proposed approach (which is also used as the basis of the experimental evaluation) appears to deviate from the approach itself. In particular, the user interface in Section 4.7 does not allow a user to enter a question as is stipulated by step 1 of the interaction process in Section 3.4. (Btw, the interface in Figure 1 also presents a typo: "refers to" should be "refer to".) This deviation raises questions about whether the evaluation really evaluates the proposed approach.

Comments related to the evaluation:

In general, the evaluation section suffers from a lack of explanation of how the evaluation goals influence the evaluation setup. The authors make a number of choices in terms of baseline systems to compare against and configurations to be tested, but the rationale behind these choices with respect to supporting concrete evaluation goals remains implicit and not accessible to the reader.

In fact, the evaluation section would benefit from an introductory part that clearly explains the evaluation goals and the two evaluation settings (i.e., oracle- and user-based) used. Currently, sections 5.1 and 5.2 describe details of these settings, *before* these settings are actually introduced in section 5.3 (to this reviewer this was rather confusing).

Section 5.1: The sentence “In order to be able to compare with external baseline …” does not make much sense within this paper – or should be better explained.

The choice of the IQA configurations tested should be better justified. Currently, the authors test a configuration that relies solely on the information gain and a configuration that also considers usability. But why is this important? What is the hypothesis that the authors wish to test? And how does this configuration selection support the overall evaluation goal announced at the beginning of Section 5? This should be explained.

A similar comment holds also for the choice of baselines: it is nice that authors choose three different baselines, yet, they should better explain in what ways these baselines are relevant for their evaluation goals.

Details of the user feedback questionnaire are missing: what questions did users have to answer? How were those questions chosen and why were they relevant to the evaluation goal?

Section 5.3.2 fails to mention the number of questions evaluated by users (currently shown as “…”).

The interpretation of the results could be improved as follows:
* consider using boxplots in figures 7 and 8 to more clearly show average values (in particular, for figure 8 the text refers to average values which are not shown on the graphic);
* better align the reasons for deviation between user judgement and benchmark listed as a bulleted list (left column, page 13) and Table 2 (in its current form, it is difficult to identify which reason maps to which row in Table 2);
* the user feedback was rather inconclusive: in what way is the obtained feedback useful for the evaluation goal?

Comments related to the novelty of the work and positioning with respect to related work:

The positioning of the work in the landscape of related work is insufficient, with section 7 failing to clearly identify the novelty of this work. Currently, this section presents a number of interactive QA approaches both on top of RDBMS and on knowledge graphs. However, no discussion closes this section that would identify gaps in these works and explain how this paper addresses those gaps. Although Table 3 presents an overview of these works, (1) it is never referenced, explained or discussed in the text (e.g., its dimensions are not explained – why are they relevant?); (2) it lists works that are never discussed in the text, namely those referenced as [19, 22, 23] – unfortunately, these are approaches applied on knowledge graphs and therefore very relevant; (3) works discussed in the text do not appear in the table, i.e., reference [31]. A further drawback is the lack of any mention of conversational approaches, such as chatbots, which use an interactive approach to question answering – CuriousCat [1], for example. This section needs to be thoroughly extended and revised.

The conclusion section resembles a summary. I miss a critical discussion of whether the objectives of the work were achieved, and a clear link between the main evaluation results and the goals of the work. Also, limitations and threats to validity should be discussed (for example, the fact that users cannot really ask their own questions).

To sum up, while this paper is certainly promising, there is major space for improvement. In particular, I encourage the authors to:
* clearly state what the novelty of their work is. For example, the introduction of the “Option Gain” notion appears to be a novel contribution at the solution level and could be made a central point in the paper narrative.
* improve the presentation of the solution by using an intuitive example in tandem with the theoretical part of the solution;
* settle on a set of hypotheses that they can use to derive meaningful and convincing evaluation goals, and then use these goals as a motivation for the chosen evaluation setups as well as to explain the observed results;
* overall, provide the important details the paper currently misses.

[1] Luka Bradesko, Michael J. Witbrock, Janez Starc, Zala Herga, Marko Grobelnik, Dunja Mladenic: Curious Cat-Mobile, Context-Aware Conversational Crowdsourcing Knowledge Acquisition. ACM Trans. Inf. Syst. 35(4): 33:1-33:46 (2017)