Answer Selection in Community Question Answering Exploiting Knowledge Graph and Context Information

Tracking #: 2675-3889

Golshan Afzali
Heshaam Faili

Responsible editor: 
Guest Editors DeepL4KGs 2021

Submission type: 
Full Paper
With the increasing popularity of knowledge graphs (KGs), many applications such as sentiment analysis, trend prediction, and question answering use KGs to match entities mentioned in text to entities in the KG. Despite the usefulness of the commonsense and factual background knowledge in KGs, to the best of our knowledge, such KGs have rarely been used for answer selection in community question answering (CQA). In this paper, we propose a novel answer selection method for CQA that uses the knowledge embedded in a KG. Our method is a deep neural network-based model that, besides using a KG, uses a latent-variable model to learn representations of the question and answer by jointly optimizing generative and discriminative objectives. Specifically, the proposed model leverages external background knowledge from the KG to help identify entity mentions and their relations. It also uses the question category to produce a context-aware representation for both the question and the answer. Moreover, the model uses variational autoencoders (VAEs) in a multi-task learning process with a classifier to produce a class-specific representation for each answer. Experimental results on three widely used datasets demonstrate that the proposed method significantly outperforms all existing models in this field.

Major Revision

Solicited Reviews:
Review #1
By Simon Gottschalk submitted on 21/Feb/2021
Major Revision
Review Comment:

The authors propose an approach for answer selection in community question answering based on a Siamese network that incorporates the question category as well as knowledge graphs. On the positive side, the authors provide a long list of related approaches, they compare to 11 other approaches and outperform them, and they provide an ablation study that shows that all components of their approach contribute to its performance. However, I see several drawbacks w.r.t. problem definition, the quality of writing, the experiments and the idea behind the approach:

- Problem statement: The task of answer selection in community question answering is never (formally) introduced. This could be done easily and would introduce part of the notation in Table 2. The output consisting of three classes is only mentioned late (Section 4.3).

- Intuition/Example: The selected example (Table 1) is not very intuitive, as it is hard to understand (What is the Qatar Living Lounge? How is it related to Nissan Vehicles? The question is not answered!). In general, given that categories are a core element of the approach, more examples of such categories would be helpful. Also, the second answer is too simple/wrong.

- Quality of writing: The paper has several typographical, grammatical and stylistic errors (see minor comments provided below for a few but not exhaustive examples). There are several long sentences (often including many semicolons) that are hard to read and hinder understanding (e.g., Section 2.2 or the third paragraph in Section 4.3).

- Discussion: I would like to see a more detailed discussion of the task and the approach, potentially based on examples for which the wrong answers were selected. The suggested approach does basically return the answer that is the most similar to the question. Therefore, I have several questions: (i) To what extent does question/answer similarity contribute towards the idea of finding good answers, assuming the goal is to find the best answer among different answers, all related to the question? (ii) Does the use of a Siamese network imply that P@1 (i.e., the correctness of the highest ranked result) would drop to zero if a copy of the question is added to the answers for each question in the test data? (iii) How does the approach generalise to other datasets?

- Related Work: This section is rather lengthy. I do not see much benefit from Section 3.1; readers of the Semantic Web Journal should be aware of application scenarios of KGs. This does not require ten references (a few examples plus a survey may be sufficient). For Section 3.2, a compact representation of the existing approaches in a table may improve the overview.

- Results: Evaluation results of the compared approaches differ from what is stated in the papers: For example, the CETE paper reports MAP=0.947 for SemEval 2015 (in contrast to 78.63 as in Table 4). Also, for example, [*] provides a very similar table (Table 2 in [*]) with highly different numbers. How can this be explained? Is it due to the footnote in Section 5.4? If so, why did the authors decide on the three-class setting? Is the comparison to the baselines still fair, and does it prove the superiority of the suggested approach?

[*] Yang, Min, et al. "Knowledge-enhanced Hierarchical Attention for Community Question Answering with Multi-task and Adaptive Learning." IJCAI. 2019.

- The authors introduce the abbreviation "KGs" for "Knowledge Graphs", but then consistently use "KG", which is very confusing. I would prefer to see "KGs" in the text.
- Refer to figures in the text via "Fig." instead of "Figure".
- Fig. 1: Fig. 1 should come with higher resolution, or better, as a vector graphic (/PDF). Also, the caption should provide more explanation, at least about input and output. It could already help to mention Table 2 immediately when mentioning Fig. 1 (plus in its caption).
- Footnotes are reset to "1" on each page?
- Section 1: "the model can correctly assign ... due to the entities and facts exist in it"?
- Section 1: "to encod[e] all semantic information"
- Section 2: "sentence representation[s]"
- Section 2: "parameters, respectively, which the lower bound"?
- Section 3, title: "Related Work"
- Section 3.2: "research duration"?
- Section 3.2: "SemEval organised a similar task" -> similar to what?
- The text about Babelfy in Section 4.1 is a bit technical and could in general be structured better (e.g., simply as a list of the three steps).
- Section 5.3 "for trying"?
- Section 5.6: "Figure 2 and Figure 3[ ]indicate"
- Table 3: Is it true that the numbers for SemEval 2016 and SemEval 2017 are partially identical?
- Table 4: "p-value the < 0.05"?
- Conclusion: "In this [article]"
- Fig. 2 and Fig. 3 look like they were copied straight out of Excel or Google Sheets, which I personally do not like.
- Fig. 3: "macr[o]"
- Bibliography should be cleaned and more consistent.

Review #2
By Ruijie Wang submitted on 21/Mar/2021
Major Revision
Review Comment:

In this paper, the authors focus on the answer selection task in Community Question Answering and propose a Siamese architecture-based model that classifies candidate answers to a given question into three categories (i.e., Good, Potential, and Bad). With the following review, my suggested decision on this paper is Major Revisions Required, and the revision should address all the negative aspects listed below.

The proposed model consists of three modules: Initial Representation, Attention Layer, and Multi-task Learning. 1) The Initial Representation module obtains pre-trained vector representations of the given question, the candidate answers, and the question context (i.e., question category and subject). The focus of this module is to disambiguate fragment semantics with knowledge graph-based entity linking. 2) The Attention Layer addresses the redundancy and noise problem of questions/answers by computing attentional question/answer representations based on the question context representation. 3) The Multi-task Learning module employs a pair of Siamese variational autoencoders to encode question/answer representations and compare them in order to classify candidate answers. The decoding of the latent question/answer representations and the latent representation-based classification are jointly trained.
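For readers less familiar with this setup, the joint generative-discriminative objective summarized above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the toy dimensions, the single-layer weight matrices, and the mean-squared reconstruction term are all assumptions made only to show how a shared (Siamese) VAE encoder, a reconstruction/KL term, and a three-class classification loss combine into one objective.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_LAT, N_CLASSES = 8, 4, 3            # toy sizes (hypothetical)
W_enc = rng.standard_normal((D_IN, 2 * D_LAT)) * 0.1
W_dec = rng.standard_normal((D_LAT, D_IN)) * 0.1
W_cls = rng.standard_normal((2 * D_LAT, N_CLASSES)) * 0.1

def encode(x):
    """Shared (Siamese) VAE encoder: identical weights for question and answer."""
    h = x @ W_enc
    return h[:, :D_LAT], h[:, D_LAT:]       # mean, log-variance

def reparameterize(mu, log_var):
    # Sample z = mu + sigma * eps so the objective stays differentiable.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_term(mu, log_var):
    # KL(q(z|x) || N(0, I)), summed over latent dimensions.
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=1)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def joint_loss(q, a, labels):
    """Generative (reconstruction + KL) plus discriminative (3-class) loss."""
    mu_q, lv_q = encode(q)
    mu_a, lv_a = encode(a)
    z_q, z_a = reparameterize(mu_q, lv_q), reparameterize(mu_a, lv_a)

    recon = np.mean((z_q @ W_dec - q) ** 2) + np.mean((z_a @ W_dec - a) ** 2)
    kl = np.mean(kl_term(mu_q, lv_q)) + np.mean(kl_term(mu_a, lv_a))

    # Classify the question/answer pair from the concatenated latent codes
    # into Good / Potential / Bad via cross-entropy.
    probs = softmax(np.concatenate([z_q, z_a], axis=1) @ W_cls)
    xent = -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))
    return recon + kl + xent

q = rng.standard_normal((2, D_IN))          # two toy question vectors
a = rng.standard_normal((2, D_IN))          # two toy answer vectors
loss = joint_loss(q, a, np.array([0, 2]))   # labels: 0=Good, 1=Potential, 2=Bad
```

The "Siamese" aspect is simply that `encode` applies the same weights to both inputs; joint training then balances reconstructing the inputs (generative) against separating the three answer classes (discriminative).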

Strengths: 1) the ambiguity issue of questions/answers is addressed with information from the knowledge graph, and the Siamese architecture-based model achieved superior performance in comparison with the baseline methods, which demonstrates the significance of the results; 2) an ablation study was conducted to scrutinize the contribution of each module.

Weaknesses: 1) several existing methods, e.g., Babelfy and NASARI, are directly utilized in the model. However, there is no sufficient introduction to the methods used, which hampers the readability of the paper.

2) the introduction to the attention layer takes only two short paragraphs, without any formal definitions or equations, which makes this module entirely unclear.

3) the given "implementation details" are not detailed at all. Only the layer numbers of the autoencoder and the dimensions of latent representations are given. What are the parameter settings of the initial representations, the attention layer, the MLP classifier, and the convolutional filters?

These two issues make it difficult to reproduce the system as well as the experimental setup. Please provide both the formal definitions as well as the parameter settings used for each component to enable reproducibility.

4) In the ablation study, when any one of the key components (e.g., the knowledge graph-based disambiguation, or the attention layer) was removed, the model still outperformed most of the baseline models. However, there was only one component removed every time in the ablation study. The baseline of the proposed model (i.e., the vanilla version without any component) is not evaluated. Is it possible to start with the evaluation of the baseline, incrementally add one component at a time, and analyze how the performance could be increased with more components added? Also, it would be appreciated to have more information about the comparison set-up as well as the implementation details of the proposed model and the other compared models.

5) the originality of the paper is limited because: first, the main modules of the proposed model, e.g., the knowledge graph-based disambiguation and the Siamese autoencoder, have already been widely utilized in the literature; second, given the lack of details of the model design and implementation (e.g., how the existing methods are integrated into the model, and how the attention layer is customized in this model), the originality of each proposed module is difficult to assess.

6) The quality of writing does not meet the requirement of this journal due to the aforementioned lack of readability and minor errors such as:
In Section I - Paragraph 1, several applications, e.g., recommender systems, are listed without adequate references.
In Section I - Paragraph 3, there is no reference for SemEval2015.
In Section I - Paragraph 7, "unable to encode" instead of "unable to encoding", and it should be a period instead of a semicolon at the end of the paragraph.
In Section I - the last contribution, there should be references for the three listed datasets.
In Section II, please use adequate mathematical expressions instead of English characters when denoting variables and parameters.
In Section II - Paragraph 2, "as follows" instead of "as follow".
In Section III, it is claimed that "none of existing methods have considered the context in question-answer representation". However, after reading the related works introduced in the paper itself, I am skeptical of this claim.
Please thoroughly check the writing of the paper.

Review #3
Anonymous submitted on 25/Jun/2021
Major Revision
Review Comment:

The paper proposes a new answer selection approach for community question answering which combines a knowledge graph and a Siamese architecture-based model.

The method appears to be sound, and it was evaluated against eleven baselines on three standard benchmarks. The ablation study is interesting and highlights the importance of the knowledge graph in the process. However, the paper also presents several shortcomings regarding 1) the quality of writing, 2) lack of explanation for some important concepts and implementation details, and 3) reproducibility. I will detail them in the following.

The quality of the English needs to be improved. The paper contains many convoluted sentences with strange constructions, as well as several typos. I suggest that the authors revise it thoroughly.

The paper does not define several background concepts that are crucial to its understanding. In particular, Section 1 lacks a formal introduction of the question answering task that will be addressed, as well as of the three categories of candidate answers ('Good', 'Potential', and 'Bad'). The reader learns much later in the paper what the input and the output are.
Table 1 is also quite confusing. What is a question category, and how is it used? Is answer '1' a good answer and answer '2' a bad one? I suggest that the authors define a good running example in the introduction and use it also in the explanation of the method. The example should also illustrate the three categories that the paper uses.

The approach needs to be explained in more detail, considering also that the Semantic Web Journal is not a machine learning venue. For instance, the attention layer introduced in Section 4.1 needs to be formalized and explained further.

The evaluation is quite comprehensive, and the proposed approach is tested on 3 well-known datasets (SemEval 2015, SemEval 2016, SemEval 2017) against 11 alternative methods. The results seem very good. However, I have some doubts about the baselines that were reimplemented for the three-class classification task. Which methods were re-implemented? Do the authors think this change can affect their performance? It is also important to give more details about their implementation and parameters; otherwise, reproducibility is quite low.
Another issue with the evaluation is that the data and code are not shared. I strongly suggest that the authors make the data available in accordance with the FAIR principles. This is particularly important since it appears that many of the original baselines were reimplemented and are therefore not available in the version tested in the paper.

The note to Tables 4 and 5, "Numbers mean that improvement from our model is statistically significant over the baseline methods (t-test, p-value the < 0.05)", sounds quite cryptic. I assume it means that the authors ran a t-test on the results. I suggest briefly discussing it in the paper, specifying exactly which methods were compared and reporting all the relevant p-values.

In conclusion, this is a potentially interesting article, but the current version presents several issues regarding writing quality, reproducibility, and missing details. Therefore, I suggest Major Revisions.