Answer Selection in Community Question Answering Exploiting Knowledge Graph and Context Information

Tracking #: 2863-4077

Golshan Afzali
Heshaam Faili
Yadollah Yaghoobzadeh

Responsible editor: 
Guest Editors DeepL4KGs 2021

Submission type: 
Full Paper
With the increasing popularity of knowledge graphs (KGs), many applications such as sentiment analysis, trend prediction, and question answering use KGs for better performance. Despite the obvious usefulness of the commonsense and factual information in KGs, to the best of our knowledge, KGs have rarely been integrated into the task of answer selection in community question answering (CQA). In this paper, we propose a novel answer selection method for CQA that uses the knowledge embedded in KGs. We also learn a latent-variable model for the representations of the question and answer, jointly optimizing generative and discriminative objectives. The model also uses the question category to produce context-aware representations for questions and answers. Moreover, it uses variational autoencoders (VAEs) in a multi-task learning process with a classifier to produce class-specific representations for answers. The experimental results on three widely used datasets demonstrate that our proposed method is effective and outperforms the existing baselines significantly.

Minor Revision

Solicited Reviews:
Review #1
By Ruijie Wang submitted on 11/Oct/2021
Minor Revision
Review Comment:

In the last review (regarding submission #2675-3889), I raised the following questions/weaknesses and expected the authors to address them. Please find the updates and my follow-up questions below:

1) several existing methods, e.g., Babelfy and NASARI, are directly utilized in the model. However, there is a lack of sufficient introduction to the used methods, which hampers the readability of the paper.

Update: The authors added some introduction to the models used in Section 4.1. However, the readability of this section has not been largely improved because the introduction does not fit the context of this paper very well. For example, in the introduction to Babelfy, it is very difficult to understand what “lexicalized semantic network” and “semantic signature” mean because these terms are directly taken from the Babelfy paper without any link to the context of this paper. In addition, some acronyms are used without giving their full names, e.g., EL and WSD.

2) the introduction to the attention layer only takes two short paragraphs without any formal definitions or equations which makes this module totally unclear.

Update: The authors added the corresponding equations in Section 4.2. However, if I understand the paper correctly, Eqs. 4 and 6 are wrong unless the authors indeed selected only ONE word (indexed by $i$) from each question/answer sentence to represent the question/answer.
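
For illustration only, a standard attention pooling that uses every word of a sentence could be written as below. The symbols $h_i$, $W$, $w$, $b$, $e_i$, $\alpha_i$, $r$, and $n$ are assumptions made for this sketch and are not taken from the manuscript. The relevant part is the sum over $i$ in the last line: a formula that leaves $i$ as a free index without a summation would represent the sentence by a single word.

```latex
% Assumed notation (not from the manuscript):
%   h_i : hidden state of the i-th word, n : sentence length,
%   W, w, b : hypothetical attention parameters, r : sentence representation.
\begin{align}
  e_i      &= w^{\top} \tanh(W h_i + b), \\
  \alpha_i &= \frac{\exp(e_i)}{\sum_{j=1}^{n} \exp(e_j)}, \\
  r        &= \sum_{i=1}^{n} \alpha_i h_i .
\end{align}
```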

3) the given "implementation details" are not detailed at all. Only the layer numbers of the autoencoder and the dimensions of latent representations are given. What are the parameter settings of the initial representations, the attention layer, the MLP classifier, and the convolutional filters?
These two issues make it difficult to reproduce the system as well as the experimental setup. Please provide both the formal definitions as well as the parameter settings used for each component to enable reproducibility.

Update: The “implementation details” part has been extended, and, in the response letter, the authors promise that the source code link will be provided in the camera-ready version.

4) In the ablation study, when any one of the key components (e.g., the knowledge graph-based disambiguation, or the attention layer) was removed, the model still outperformed most of the baseline models. However, there was only one component removed every time in the ablation study. The baseline of the proposed model (i.e., the vanilla version without any component) is not evaluated. Is it possible to start with the evaluation of the baseline, incrementally add one component at a time, and analyze how the performance could be increased with more components added? Also, it would be appreciated to have more information about the comparison set-up as well as the implementation details of the proposed model and the other compared models.

Update: The authors added an evaluation of the baseline model and conducted experiments on how the model’s performance improves with the modules proposed in this paper being added incrementally.

5) the originality of the paper is limited because: first, the main modules of the proposed model, e.g., knowledge graph-based disambiguation, and the Siamese autoencoder, have been already widely utilized in the literature; second, given the lack of details of the model design and implementation (e.g., how the existing methods are integrated into the model, and how the attention layer is customized in this model), the originality of each proposed module is difficult to assess.

Update: Regarding the originality concern, I accept the explanations given in the response letter. And the authors have added more details on model design and implementation in the manuscript.

6) The quality of writing does not meet the requirement of this journal due to the aforementioned lack of readability and minor errors such as:
In Section I - Paragraph 1, several applications, e.g., recommender systems, are listed without adequate references.
In Section I - Paragraph 3, there is no reference for SemEval2015.
In Section I - Paragraph 7, "unable to encode" instead of "unable to encoding", and it should be a period instead of a semicolon at the end of the paragraph.
In Section I - the last contribution, there should be references for the three listed datasets.
In Section II, please use adequate mathematical expressions instead of English characters when denoting variables and parameters.
In Section II - Paragraph 2, "as follows" instead of "as follow".
In Section III, it is claimed that "none of existing methods have considered the context in question-answer representation". However, after reading the related works introduced in the paper itself, I am skeptical of this claim.
Please thoroughly check the writing of the paper.

Update: the authors have addressed some minor issues and improved the quality of writing. However, my concern regarding claiming “none of existing methods have considered the context in question-answer representation” has not been responded to. In Table 2, the description for the references [10] and [43] is “used CNNs for similarity matching and the label of previous and next answer for CONTEXT modeling through LSTM”. Why do you think that they did not consider context?

In conclusion, it is appreciated that the authors have addressed most of my concerns and updated the manuscript accordingly. However, as given above, there are still some follow-up questions, drawbacks, and even errors regarding the current version. Therefore, my suggested decision is a minor revision.

Review #2
Anonymous submitted on 18/Oct/2021
Review Comment:

The paper proposes a new answer selection approach for community question answering which combines a knowledge graph and a siamese architecture-based model.

The authors did a good job of addressing the issues that I mentioned in the previous review and the new version of the paper is much more robust. I am happy for it to be accepted.

Review #3
Anonymous submitted on 21/Oct/2021
Minor Revision
Review Comment:

In comparison to the previous revision, the authors have substantially improved the introduction and the definition of the problem, strengthened the formalism of their proposed model, and extended the evaluation.

I still have three comments on the revised article:

Evaluation: With the current evaluation setting, it is still unclear whether the superiority over the baselines can be explained by the proposed model architecture or simply by the fact that the proposed model is trained, and its ranking evaluated, with respect to all three classes.
In the revised version of the paper, it now says “having a three-class classification model, needs the model to be more accurate and it is perhaps the superiority of our proposed approach over the competitors”. Instead of stating “perhaps” here, it would be good to have a proper evaluation demonstrating that the exploitation of a knowledge graph and context information leads to the superiority over the baselines and not simply the training on three classes.
In the case of SemEval 2017, the test set does not even contain the “Potential useful” class. The training set, however, is exactly the same as SemEval 2016 (as also indicated in Table 4). This would explain why evaluation results are better for SemEval 2017 than for SemEval 2016 (because only two classes need to be considered). This should be clarified in the paper.
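
To make the concern about the class setting concrete, the following is a minimal sketch of how a ranking metric such as mean average precision (MAP) depends on which labels count as relevant. It assumes that only answers labeled "Good" are relevant and that "PotentiallyUseful" and "Bad" are both non-relevant. The label strings, function names, and toy threads are hypothetical and are not taken from the manuscript or from any official scorer.

```python
from typing import Dict, List

# Assumption: only "Good" answers are relevant for ranking evaluation;
# "PotentiallyUseful" and "Bad" are both treated as non-relevant.
RELEVANT = {"Good"}


def average_precision(ranked_labels: List[str]) -> float:
    """Average precision for one question, given gold labels in ranked order."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label in RELEVANT:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Threads without any relevant answer contribute 0.0 in this sketch.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(threads: Dict[str, List[str]]) -> float:
    """Mean of the per-question average precision over all threads."""
    if not threads:
        return 0.0
    return sum(average_precision(labels) for labels in threads.values()) / len(threads)


# Toy data: gold labels of the answers, ordered by a hypothetical model's score.
threads = {
    "q1": ["Good", "PotentiallyUseful", "Good", "Bad"],
    "q2": ["Bad", "Good", "Bad"],
}
map_score = mean_average_precision(threads)
```

Under this assumption, a model trained on three classes and a model trained on two classes can be scored with the same function, so re-training every baseline in both settings and reporting both numbers would separate the effect of the class setting from the effect of the architecture.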

Baselines: Another question is how the evaluation results are computed for the baselines that have originally been evaluated in the two-class setting. According to the revised paper, six baselines were “re-implemented … for three-class classification”. I wonder if simply the output of the three baselines has been evaluated in the new three-class setting or if really all baselines were extended to three classes and re-trained. In the answer to reviewer 3, the authors state that “the link for the source codes will be provided in camera ready”, which makes it impossible for the reviewers to see the code before a potential acceptance. I strongly suggest that the authors provide the requested source code with the next revision.

Problem Definition: Following the authors’ answers to my questions and their clarifications in the paper, I understand that the task is well-aligned to the SemEval task description, along with the limitations this brings (e.g., no annotation of correct answers and the missing generalizability due to the limitation to the QatarLiving forum – and no others such as Stack Overflow [*]). However, I understand that the article follows the data, annotation and evaluation established in the SemEval task and that the problem definition is now much clearer.
[*] Shirani, Amirreza, et al. "Question relatedness on stack overflow: the task, dataset, and corpus-inspired models." arXiv preprint arXiv:1905.01966 (2019).