Explainable multi-hop dense question answering using knowledge bases and text

Tracking #: 2902-4116

Mohsen Kahani
Somayeh Asadifar
Saeedeh Shekarpour

Responsible editor: 
Guest Editors Ontologies in XAI

Submission type: 
Full Paper
Much research has been conducted on extracting a response from either text sources or a knowledge base (KB). The challenge becomes more complicated when the goal is to answer a question with the help of both text and KB. In these hybrid systems, we address the following challenges: i) excessive growth of the search space, ii) extraction of the answer from both KB and text, iii) extraction of the path to reach the answer, and iv) scalability in terms of the volume of documents explored. A heterogeneous graph, guided by question decomposition, is utilized to tackle the first challenge. The second challenge is met by adopting the idea behind an existing text-based method and customizing it for graph development. Based on this method for multi-hop questions, an approach is proposed for the extraction of answer explanations to address the third challenge. Since the basic method uses a dense vector for scalability, the final challenge is also addressed in the proposed hybrid method. Evaluation reveals that the proposed method is able to extract answers in an acceptable time and volume while offering competitive accuracy, and that it achieves a trade-off between performance and accuracy in comparison with the base methods.

Major Revision

Solicited Reviews:
Review #1
By Carlos Badenes-Olmedo submitted on 10/Nov/2021
Major Revision
Review Comment:

# Review

The article presents a hybrid question-answer system that works jointly on a knowledge base and on a set of sentences. The strategy it uses decomposes the original question, in case it is multi-hop, into single questions (i.e. single-hop) and chains the partial answers until the final answer is reached. It considers that the set of partial answers are explanatory of the final answer and its main contribution is the ability to extract answers to complex questions using a knowledge base and text in the form of sentences.
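As a purely illustrative aid (not taken from the manuscript), the decompose-and-chain strategy summarized above can be sketched as follows. The toy knowledge base, the entity names, and the function `answer_multi_hop` are all hypothetical; the sketch assumes each sub-question has already been mapped to a single relation, which is a simplification of what the paper describes.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: (subject, relation) -> object.
ToyKB = Dict[Tuple[str, str], str]


def answer_multi_hop(kb: ToyKB, start: str, relations: List[str]) -> Tuple[str, List[str]]:
    """Answer one single-hop sub-question per relation and chain the partial answers.

    Returns the final answer and the list of partial answers, which plays
    the role of the explanation path in this toy setting.
    """
    current = start
    chain: List[str] = []
    for relation in relations:
        current = kb[(current, relation)]  # one single-hop lookup
        chain.append(current)
    return current, chain


# Assumed example data, invented for illustration only.
kb: ToyKB = {
    ("film_A", "directed_by"): "director_B",
    ("director_B", "born_in"): "city_C",
}

# "Where was the director of film_A born?" as two single-hop steps.
answer, chain = answer_multi_hop(kb, "film_A", ["directed_by", "born_in"])
print(answer, chain)
```

In this simplified view, the chain of partial answers is what would be presented as the explanation of the final answer.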

# Relevance

The work deals with a very relevant topic but does not provide much novelty over existing work beyond its approach to combining structured and unstructured knowledge. In that respect, much more detail is expected on how facts are created from the passages (i.e. sentences) related to the question in order to extend the graph from which the answer is extracted. What does k mean? How is the probability of Ve measured? What do the edges between Ve and Vd or Vf mean? I suggest thoroughly revising Section 4.1.2 to describe in more detail how that process is performed. Perhaps a guided example on a simple query would help to clarify this process. Otherwise, the work could look like an extension of the PullNet system with the decomposition of complex (i.e. multi-hop) queries into simple (i.e. single-hop) queries, and lose much of its importance.

# Technical Description

The study needs more rigor in its statements, either by giving references (e.g. "which is necessary in the real world" (Section 3.3), "using the embedding vector comparison method" (Section 5.1)), or by using more precise terms (e.g. "The analysis method is performed in such a way that, with little supervision..." (Section 4.1.1), "...since the database is not available at this time,..." (Section 5.2)), or with a more detailed explanation (e.g. "a text corpus that uses entity link tools..." (Section 5.2): what tools? How is entity linking performed?).

The method is helpfully presented (Section 4) by means of an example. At a general level, it allows the reader to get an idea of the stages into which the process is decomposed. It would be advisable to increase the resolution of Figure 1.

I suggest avoiding the term 'document' (Section 4), as it may lead readers to think of paragraphs, when in fact the texts considered in this work are sentences.

# Reproducibility
This is a critical point. The work presented is not reproducible, which in my opinion is fundamental. The source code is not provided, nor is any content linked that would allow reproducing the results obtained in the evaluation, so it cannot be evaluated as part of the state-of-the-art.

# Evaluation

One of the aspects highlighted in the article is the ability of the system to extract responses "in acceptable time and volume". However, the evaluation does not support such a claim since it measures performance only in terms of accuracy (Tables 1, 3, 4 and 5). In this line it is not made clear how it is checked that the answer obtained is the correct one, i.e., what distance (based on Strings) has been used to calculate Hits@1?
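To make the question about Hits@1 concrete, here is a minimal sketch of one common way the metric is computed, assuming exact string match after lowercasing and whitespace normalization. The matching rule and the names `normalize` and `hits_at_1` are assumptions for illustration; the manuscript does not state which comparison it uses, which is precisely the point raised above.

```python
from typing import List, Set


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def hits_at_1(ranked_predictions: List[List[str]], gold_answers: List[Set[str]]) -> float:
    """Fraction of questions whose top-ranked prediction matches a gold answer."""
    if not gold_answers:
        return 0.0
    hits = 0
    for ranked, gold in zip(ranked_predictions, gold_answers):
        gold_norm = {normalize(g) for g in gold}
        # Only the first (top-ranked) prediction counts for Hits@1.
        if ranked and normalize(ranked[0]) in gold_norm:
            hits += 1
    return hits / len(gold_answers)


# Invented example: the first question is a hit, the second is a miss.
score = hits_at_1(
    [["Answer One", "other"], ["wrong"]],
    [{"answer  one"}, {"right"}],
)
print(score)
```

A softer rule (e.g. an edit-distance threshold) would give different scores on the same predictions, so the choice needs to be reported alongside the results.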

Table 2 (Section 5.2) is mentioned, but this table does not appear in the document.

Considering its hybrid capability, it would perhaps be interesting to perform the evaluation on recently published datasets with verbalized answers (e.g. VQuAnDa [1], VANiLLa [2]).

[1] - Kacupaj, Endri, Hamid Zafar, Jens Lehmann and M. Maleshkova. “VQuAnDa: Verbalization QUestion ANswering DAtaset.” The Semantic Web 12123 (2020)
[2] - Biswas, Debanjali, Mohnish Dubey, Md. Rashad Al Hasan Rony and Jens Lehmann. “VANiLLa : Verbalized Answers in Natural Language at Large Scale.” ArXiv abs/2105.11407 (2021)

# Recommendation
The proposed method, although promising, cannot be considered valid until the three key aspects mentioned above are corrected: it should be reproducible, it should describe in sufficient detail the procedure for incorporating unstructured data into the knowledge graph from which it extracts the answer, and it should limit its claimed contributions to those supported by the evaluation results. In my opinion the article cannot be accepted in its present state.

Review #2
Anonymous submitted on 11/Jan/2022
Review Comment:

Although this paper investigates an important problem of automatic question answering, I'm not convinced by the solution proposed by the authors.
I have a few issues with the paper, which I list below:
1) with the title and the keywords and contribution list the authors promise "Explainable multi-hop dense question answering", but the explanation part is not demonstrated.
2) In Section 4, the authors state 3 fundamental axes on which they base their solution. Those axes are not motivated, nor is it explained where they come from.
3) The evaluation in Section 5 is only partial. As the authors say "We need the database to contain multi-hop questions, some of whose answers are in text snippets and some in KB entities for evaluations. However, since the database is not available at this time, the proposed system, GraphMDR, is compared separately with two base systems, and experiments on such datasets are left to future work". This means that the proposed system is not fully tested, and hence the integration of the text and knowledge base is not tested. This also raises the question of whether such systems are required in practical applications.
4) I'm not able to assess the significance of the results due to the quality of writing.
- In the introduction I'm missing a clear problem definition and a justification of why this problem is important. I think I can guess what the authors want to say, but I would prefer to read this.
- Some abbreviations are not explained, and some sentences are hard to understand, especially because of the many referring expressions that can refer to different subjects.
- The methodology of performing this research is not described
- Figure 1 is more information flow than architecture of the system
- In section 4.1.3. authors calculate a probability of finding the sequence. Are there any assumptions about this probability, i.e. independence? Can it be justified?
- In section 5 (especially 5.1) the authors ask the reader to hop between the text and appendices.
- Which statistics are provided in Table 1? what is dev?
- The formatting style is not kept through the paper.
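To spell out the independence question raised above about Section 4.1.3: by the chain rule, the probability of a sequence of intermediate answers factorizes without any assumption, whereas treating the hops as independent given the question is a strictly stronger claim. The symbols below ($q$ for the question, $a_i$ for the answer at hop $i$, $n$ for the number of hops) are assumed notation for illustration, not the manuscript's.

```latex
% Chain rule: always valid, no independence assumption.
P(a_1, \dots, a_n \mid q) = \prod_{i=1}^{n} P(a_i \mid q, a_1, \dots, a_{i-1})

% Independence given q: holds only if each hop ignores the earlier answers.
P(a_1, \dots, a_n \mid q) = \prod_{i=1}^{n} P(a_i \mid q)
```

Which of the two forms the paper's sequence probability corresponds to is what needs to be stated and justified.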

"Long-term stable URL for resources" was not provided, hence was not evaluated.

Review #3
By Ivan Donadello submitted on 14/Jan/2022
Major Revision
Review Comment:

The paper proposes a question-answering system for multi-hop questions by leveraging both free text and a knowledge base as sources of information.

===== Originality =====

I am not an expert in the field, and from my reading of the paper it is difficult to assess the originality of this work. The proposal is to extend the sota works PullNet and MDR, but it is not clear whether the implemented extension concerns just technicalities or strong scientific principles that can be used by other researchers. These differences are explained in Section 4.1.2, but for non-expert readers it is difficult to discern the technicalities from the scientific principles. Nor are the RQs clear. It seems to me that question decomposition (RQ1) has already been studied, as well as the simultaneous use of both text and knowledge bases (RQ4); see the hybrid methods in the related work. The same holds for RQ3: the extraction of the intermediate answers has already been studied by multi-hop systems. Moreover, with this formulation, RQ3 is not a research question. In addition, Section 4 states that “The present study provides a solution by integrating the three considered axes to balance the accuracy and the efficiency, while extracting the answer from both sources, considering the constraints stated in the question.” However, it seems to me that this is already done by PullNet and MDR.

The authors should better (and explicitly) state their scientific contributions with respect to the other works and especially with MDR and PullNet.

The considerations in the Discussion Section seem to me to be findings already made by the sota:
- The findings in Section 6.1 are not very informative. Points ii) and iv) are just trivial considerations coming from the sota. Point iii) is simply false: probabilistic methods enable the extraction of explanations just as non-probabilistic methods do. It depends on how you develop the explanation system. This point is too vague.
- Section 6.2 does not add much. It reads as a remark on the importance of explanation extraction, sub-question decomposition (which seems to me already done by other works) and finding the best sequence of intermediate answers (maybe this is the main contribution of the paper?).
- Some points of Section 6.3 (the 1st and the 4th) are not novel findings.

===== Significance of the results =====

The Experiments and Results Section is well written and easy to follow, and the numeric results seem to confirm the validity of the method over the main competitors. However, there are some passages that deserve a better discussion:

- Table 2: Why did you test COMPLEXWEBQUESTIONS on the dev set and not on the test set?
- Table 2: The PullNet results concern the KB + Text setting; however, PullNet in the text-only setting has much better results. I understand that the authors want to use the KB to preserve the semantics of the answers, but they must show the better PullNet results in the text-only setting. This can be followed by a discussion on the importance of the semantics.
- Tables 2, 4, 5: are these results extracted from the test set?
- Fig 2 should include the results also for PullNet for a fair comparison.
- Section 5.2: “Fifty percent of the KB is utilized …”, why not the whole KB?
- Section 5.2: “GraphMDR is more accurate than PullNet”, this is true only if you use 3 hops in the MetaQA dataset. The authors should make this explicit.

===== Quality of writing =====

The writing is really lacking clarity and the structure of the paper can be improved for a better comprehension for non-expert readers. In general, the paper is verbose with many technical details, it seems more a technical report than a scientific paper with strong principles.

- The word “Explainable” in the title is misleading, as no evaluation of the explainability has been carried out. See [1] for an example of an evaluation of explainability with real users. The system is able to give explanations, but we do not know whether these explanations are good or not.
- The introduction starts abruptly with no real introduction about the topic, it seems more a list of related works. On the other hand, the text from “In recent years … ” to “and unstructured sources.” in the related work is perfect for the introduction.
- After the introduction, a background section that explains, with some formalism, the topic and the two main works of the sota (MDR and PullNet) would help non-expert readers. This is partially done at the beginning of Section 4, from “As mentioned …” to point iii). I suggest moving this part into a Background Section.
- The Related Work Section is verbose: explaining in detail what every single work does and how it does it is of little use to the reader. It would be better to define some important features of QA works and state to what extent the current works exhibit such features.
- Section 4 is the core of the paper and unfortunately is the least comprehensible part. Figure 1 is not the architecture of the system but just a flow diagram merged with an example. I suggest drawing a real architecture showing all the used computational blocks (even if they are from other works) and their input/output in order to include every single symbol in the equations. Section 4 is verbose, some pseudocode would make it clearer. The pseudocode should be a formal representation of the new figure. It is not clear what the three types of inference are, you cannot refer to another paper as the presented work should be self-contained.
- Section 4.1.4, you cannot refer to another paper for the use of the constraints. The presented paper should be self-contained.
- Section 5.2, Why is only 50% of KB used? Is this a common practice? Please specify why.

===== Other concerns =====

- Many grammatical errors, the paper needs a full linguistic revision.
- Section 3.3: in two main model -> in two main models
- Section 4: searched in sequence -> searched in a sequence
- Section 4 just before 4.1: triples, facts, documents, concepts: all plurals.
- Section 5.1: “there are two main categories”: of what??
- Section 7: goals are search -> Goals are search
- Section 7: using and answer extraction answers from -> using answer extraction from

[1] Donadello, I., Dragoni, M., & Eccher, C. (2019). Persuasive explanation of reasoning inferences on dietary data. In SEMEX: 1st Workshop on Semantic Explainability (Vol. 2465, pp. 46-61). CEUR-WS.org.