Multilingual Question Answering Systems for Knowledge Graphs—A Survey

Tracking #: 3417-4631

Aleksandr Perevalov
Andreas Both
Axel-Cyrille Ngonga Ngomo

Responsible editor: Philipp Cimiano

Submission type: Survey Article

This paper presents a systematic survey of the research field of multilingual Knowledge Graph Question Answering (mKGQA). We employ a systematic review methodology to collect and analyze the research results in the field of mKGQA by defining scientific literature sources, selecting relevant publications, extracting objective information (e.g., problem, approach, evaluation values, used metrics, etc.), analyzing the information, searching for new insights, and generalizing them in a structured manner. Our insights are mainly derived from 37 publications: 21 papers about mKGQA systems, 9 papers about benchmarks and datasets, and 7 systematic survey articles. This work introduces a taxonomy for the methods used for the development of mKGQA systems. In addition, we formally define three major characteristics of these methods: resource efficiency, multilinguality, and portability. The formal definitions are intended to serve as landmarks for the selection of a particular method for mKGQA in a given use case. We provide all the collected data, scripts, and documentation as an online appendix. Finally, we discuss the challenges of mKGQA, offer a general outlook on the investigated research field, and define important future research directions.
Major Revision

Solicited Reviews:
Review #1
By Hugo Gonçalo Oliveira submitted on 18/May/2023
Minor Revision
Review Comment:

This paper surveys Multilingual Question Answering Systems for Knowledge Graphs (i.e., for the task of mKBQA).

(1) It describes each of the 17 systems found to meet the defined criteria, which is a good starting point for someone working in the area. Yet, to serve as an introductory text on the topic, it could go deeper in explaining the area of KBQA before mKBQA, provide more details on common approaches and challenges, and on the actual application of the referred techniques to mKBQA.

(2) Inclusion criteria are well-defined and coverage of target systems is the best available to my knowledge. Still, some parts of the selection step could be clearer. For instance:
Why were the initially accepted 63 papers further narrowed down to 37?
Why were the sources considered for three languages (specifically, English, German, Russian)?
What were the 12 selected publications that the authors were previously aware of? Why were they not selected with the common procedure?

(3) The paper is well written and well organised. However, I feel that Section 4.1 may be too long and the authors could try to compress it, perhaps by describing similar systems together. The taxonomy in Section 4.2 helps to better understand the relations between papers, so the authors could try to align Section 4.1 even more with the columns of Table 4, and even introduce them before describing the papers.
Some descriptions are not completely systematic. For instance, why is it mentioned for some systems that they are implemented in Java, while the programming language is not given for others?
Moreover, it does not make sense to say "we assume the ... following papers do provide details" and, a few paragraphs later, actually describe these papers. For papers that are several years old, it is also not very natural to describe their future work as something for the future. More important would be to confirm whether it has been tackled, either by the same or by other authors.
Finally, would it make sense to refer to the first described paper, from 2011, as some kind of landmark? Reasons could also be provided for the absence of anything earlier, and for what triggered mKBQA.

(4) The covered material is definitely important for the Semantic Web community, but also for others, like Natural Language Processing (NLP). I would highlight the website with links to related papers, and the leaderboards for several datasets.
Still, bearing in mind current advances in NLP, the pros and cons of KBQA compared to, e.g., prompting Large Language Models (LLMs) for answer generation should be discussed.

Additional comments:

The Introduction states that "the extraction of the direct answers is enabled by the introduction of the Semantic Web", but this is not exclusive to the Semantic Web, as there are other tasks with a similar goal, such as Extractive Question Answering and, more recently, tools like LLMs.

In the taxonomy (Section 4.2), I would not put group G3 at the same level as the others, because it is exclusively related to translation between natural languages.

Section 4.3 describes three characteristics of the methods, but the section could be less theoretical and better supported by the surveyed papers, also, if possible, speculating on how they could be measured. Otherwise, they are not much more than ideas.

Future directions in Section 6.2 suffer from a similar problem. They do make sense, and some are quite obvious, but they could be linked specifically to the identified challenges and supported by conclusions drawn in the surveyed works.

Finally, Section 5, on benchmarking, could include a deeper discussion of the best approaches for each dataset. Perhaps a summary of the leaderboards, including conclusions on the most suitable / promising approaches.

Possible typos / grammar issues:
p6, l46: there are -> there were ?
p9, l44: will allow to set of up the corresponding services
p11, l24: encoder on the data on a data-rich
p11, l26: This paradigm to be adapted to KGQA to build
p12, ll36: require the other ones to form (which ones?)
p14, l34: in the NLP as
p16, l42: Section 4 (4.1)
p18, l11: worth underlining, that (remove comma)
p21, l8: showed, that the assignment (remove comma)

Plus, there is no introductory text in Sections 4.2, 5, and 6; and numbers below 13 should be written in full (e.g., "four" instead of "4").

Review #2
By Vanessa Lopez submitted on 26/May/2023
Minor Revision
Review Comment:

The paper presents a systematic survey of the field of multilinguality in Question Answering systems over KGs.

The paper motivates well the need for this review, as the previous 7 surveys on KGQA systems that the authors refer to do not cover this aspect in depth, while highlighting it as an important challenge to tackle. In fact, the subject is barely addressed in these surveys, perhaps because at the time they were written there were not many systems focused on proposing solutions for that challenge. In any case, this review is timely, justified, and suitable as an introductory text targeted at researchers, students, and practitioners.

The paper is well motivated. However, in the motivation (and throughout the rest of the paper), the authors do not contextualise this work in relation to the recent remarkable progress and advantages of LLMs for QA and conversational systems. Starting with the introduction, where the authors state "the extraction of the direct answers is enabled by the introduction of the Semantic Web, which aims at making the Web data machine readable". I think this statement is outdated; it was fine a few years ago, but the role of these systems and their abilities needs to be revised, in particular considering the conversational and QA abilities of LLMs. While I think the topic of this survey is still relevant (even if I honestly don't agree with the statement that "KGQA systems or its underlying technologies are serving a major role in providing users with access to knowledge available on the Web" for many reasons, despite KGQA being a hot research topic for the last 20 years), and the paper is really well written and very thorough, LLMs are having and will continue to have a major impact that needs to be acknowledged in this survey.

In the introduction section, a minor clarification is needed. The authors state that the type of answer expected for the majority of these systems is a resource/URI of an entity, a literal, or a boolean. What about a list of (ranked or not) resources (just as an example: "Sort all mountains in Asia by elevation")?

The authors follow a systematic review methodology explained in Section 2. One strength of this section is the emphasis on reproducibility and transparency in the methodology for selecting the papers. An online repository is also provided. Following a set of well-justified criteria, 37 publications were selected for the review: 21 related to systems, 9 to benchmarks, and 7 survey papers. I only have two minor questions here: (1) for the authors to clarify why the search was conducted in only 3 languages (English, German, and Russian), missing other key languages like Spanish or French; (2) in Section 2, the authors state that the phases of sources selection, publications selection, information extraction, and results systematisation are described in the following subsections, but only the first three are described, in 2.1, 2.2, and 2.3. Other than that, this section is very thorough and the methodology used throughout the different steps is sound. The differences with respect to each of the previous surveys are also summarised clearly, and a description of each surveyed system is provided.

For the systematic review, the authors propose an interesting taxonomy of methods. The classification of these methods is based on three major groups: (1) rules and templates; (2) statistical methods; and (3) machine translation methods. This classification provides a good picture/overview of the area, and it makes sense for analysing multilinguality aspects further, considering that the methods in the first two groups are widely used in monolingual KGQA, while the third group is better suited for multilinguality, as described very well in this section. Three characteristics are further defined and discussed for the methods: resource efficiency, multilinguality, and portability.

Some observations are provided at the end of Section 4, where the authors state "we foresee the following research challenges and research directions for the mKGQA". However, the list provided reads more like a good set of requirements and best practices rather than challenges or future directions: for example, the need for reproducibility, new metrics or justification of the selected metrics, and the need to explore further multilingual or language-agnostic methods. I agree with all of these; they are common-sense best practices that are good to emphasise and should apply in general, rather than deeper insights or an analysis of technical challenges / future directions.

Regarding benchmarks, the authors state "only five benchmarks exists that tackle multiple languages". This may not seem like a lot when compared to QA systems over documents or on the Web; however, it would be helpful to consider the total number of benchmarks typically used to evaluate QA over KGs. So, five out of how many benchmarks used to evaluate QA systems over KGs? Or in other words, how many benchmarks for QA over KGs do not consider multilingualism at all?

In sum, I think that the review methodology followed here is a good example of best practices for surveys of this nature. In addition, the authors state that, in the future, they will regularly update the survey results on the online leaderboard to keep track of the SoA; that is a valuable contribution and commitment.

The main weakness of this survey is that the discussion of challenges and future research directions is too shallow. For example, the summary section highlights that the majority of (at least multilingual) benchmarks only use Wikidata as a KG; it also mentions the lack of standardisation. Those are good and actionable insights, but I believe there are more in-depth insights the authors could discuss, as well as providing their own expert vision on the subject. For example, some of the topics, such as the lack of systems targeting languages that originate from different language groups, the benchmarks being small, traits of different cultures, and improving quality in languages other than English, are somewhat obvious and lack a deeper analysis of the reasons. Are we now in a better position to tackle those than we have been until now? Are the barriers still there? Are the topics still relevant enough? The future work section highlights gaps and guidelines, but with little insight or few novel suggestions on future research topics. A key aspect missing entirely from the discussion of challenges and future work is how the context of KGQA has changed since the introduction of Large Language Models, which have impressive zero-shot and multilingual capabilities that could actually be leveraged here. I believe the authors could improve this paper greatly by improving the discussion and future work.

Review #3
Anonymous submitted on 05/Jun/2023
Review Comment:

This article is a survey of multilingual question answering (QA) systems for knowledge graphs (KGs), or mKGQA. To that end, a systematic review methodology is applied (Sect. 2). A review of survey papers discovered during the systematic search highlights that none of these explicitly consider multilinguality/multilingual KGQA (Sect. 3). Sect. 4 discusses findings based on the 17 primary studies (i.e., papers describing actual mKGQA systems) identified during the search. Sect. 4.1 provides a summary of each of the 17 papers and synthesizes the data in an overview table (yet, I was not able to find an analysis of this extracted data focusing on the individual columns of the table as cross-cutting comparison criteria). Sect. 4.2 identifies "methods" (probably "technique" would be a more correct word here) used for developing mKGQA systems and organises them into a taxonomy. Sect. 4.3 derives three characteristics of methods for implementing mKGQA (resource efficiency, multilinguality, and portability), although the way these characteristics are derived is not entirely clear. Sect. 4.4 is very brief and points to a list of collected evaluation results for the reviewed systems, without offering any in-depth analysis or conclusions. Sect. 5 collects and presents benchmarks for evaluating mKGQA and sums up some of their shortcomings. Finally, Sect. 6 presents challenges and future work; both are rather briefly described, and it is not clear how they were derived (e.g., there is no reference back to the sections of the paper that would substantiate these challenges).

While the paper presents several aspects of mKGQA, it suffers from a number of critical shortcomings.

Lack of a clear definition/demarcation of mKGQA systems. For a survey paper, which often acts as introductory material to a subject area, it is essential to clearly define and position that subject area (in this case, to define what is/is not meant by an mKGQA system), typically in a dedicated section. Also, since mKGQA systems are a subset of the broader set of KGQA systems, one would expect the proposed definition to be discussed/compared with existing definitions of KGQA. Yet, the paper limits itself to positioning these systems briefly in the introduction, following a logic that is rather confusing (i.e., by superposing concepts such as web search, the Semantic Web, and knowledge graphs). No part of the paper defines what *exactly* an mKGQA system is and how it differs from a KGQA system.

Methodological issues:
•What were the research questions that guided the design of the study? What did the authors want to find out? The goal of the study defined by its research questions has a major influence on the subsequent steps such as the definition of the search keywords.

•When (at what date) was the search for the papers performed? This is very important for a systematic survey for reasons of reproducibility. Yet, the only information related to the timing of the search is that, during the study, papers published in the period 2011-2023 were selected. This would indicate a search in (late) 2023 (to truly include relevant papers published throughout 2023), yet the survey itself was submitted for review in mid-March 2023. It is also quite infeasible that the search was performed in early 2023, as completing and writing up such a study is hardly feasible in a couple of months (additionally, footnote 18 states that the work began in 2021, which again casts doubt on a search date in 2023, given that searching is one of the first steps in conducting such a study). Overall, the contradictory information on such an objective fact challenges the credibility and methodological correctness of the work.

•How was the paper selection process organised? Who were the involved researchers? Did they have the same selection guidelines? Was there a process for ensuring agreement between the researchers? In most studies, a group of researchers is involved and there are coordination stages in which the reviewers synchronise on the guidelines being followed to ensure that their work is well aligned.

•Extracted information (Section 2.3): As no research questions are announced, it is difficult to understand how the extracted information items were decided upon. For example, why was the "dataset" extracted? What did the authors plan to do with this information? What does dataset actually mean (while the extraction of the dataset is proposed in Section 2, it only becomes clear in Section 4 and Table 4 that the authors mean "evaluation datasets")? Additionally, it is not clear: (a) whether the data was extracted in textual form, or whether a more structured data extraction was used (e.g., selecting from a predefined set of values, for example for metrics); (b) who extracted this information? One person? More than one, and if so, what was their inter-annotator agreement?

•Analysis of the extracted data is very limited: only a structured summary of each reviewed paper is provided in Section 4.1.

Due to these methodological issues (especially the lack of guiding research questions), the paper often feels like a collection of weakly coordinated topics, without a clear logical sequence between sections or between the extracted data and the derived conclusions.

Based on the considerations above, my assessment according to the main four review criteria described by the SWJ guidelines is as follows:

(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. – Partially. On the one hand, the article addresses a range of topics related to mKGQA. On the other hand, it lacks a clear definition of such systems, is unfocused, and the connection between the collected data and the analysis/conclusions is often rather weak.

(2) How comprehensive and how balanced is the presentation and coverage. – Good coverage by considering several systems.

(3) Readability and clarity of the presentation. – Weak. While the text itself is easy to read, the goal and research questions of the paper are not clearly stated. The analysis is not sufficiently related to the extracted data.

(4) Importance of the covered material to the broader Semantic Web community. – Unclear. It is unclear whether a dedicated survey of mKGQA is necessary on its own, or whether such systems could be covered in a broader survey of KGQA. This lack of clarity partly stems from the lack of a clear definition of mKGQA systems and their demarcation from KGQA.

While the authors have clearly made a significant effort to collect various aspects of these systems, the paper itself lacks clear focus and methodological correctness, as well as a proper analysis of the extracted information. It is questionable whether these shortcomings can be addressed after the completion of the study.

Smaller comments:
•Consider replacing "doubtful" with "unclear" or "questionable".