Overcoming Challenges of Semantic Question Answering in the Semantic Web

Tracking #: 1207-2419

Konrad Höffner
Sebastian Walter
Edgard Marx
Jens Lehmann
Axel-Cyrille Ngonga Ngomo
Ricardo Usbeck

Responsible editor: 
Marta Sabou

Submission type: 
Survey Article
Semantic Question Answering (SQA) removes two major access requirements to the Semantic Web: the mastery of a formal query language like SPARQL and knowledge of a specific vocabulary. Because of the complexity of natural language, SQA presents difficult challenges and many research opportunities. Instead of a shared effort, however, many essential components are redeveloped, which is an inefficient use of researchers' time and resources. This survey analyzes 62 different SQA systems, which are systematically and manually selected using predefined inclusion and exclusion criteria, leading to 70 selected publications out of 1960 candidates. We identify common challenges, structure solutions, and provide recommendations for future systems. This work is based on publications from the end of 2010 to July 2015 and is also compared to older but similar surveys.
Full PDF Version: 

Major Revision

Solicited Reviews:
Click to Expand/Collapse
Review #1
By Chris Biemann submitted on 07/Jan/2016
Major Revision
Review Comment:

This manuscript was submitted as 'Survey Article' and should be reviewed along the following dimensions: (1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic. (2) How comprehensive and how balanced is the presentation and coverage. (3) Readability and clarity of the presentation. (4) Importance of the covered material to the broader Semantic Web community.

(1) Mostly
(2) Partly
(3) Very
(4) Very

This paper presents an extensive survey over Question Answering Systems on Linked Data (Semantic Question Answering, SQA) for the period of 2010-2015.
The task of SQA is to retrieve answers from structured SW resources for natural language queries.

The paper does a great job in introducing the subject and motivating its goal: to provide a more natural access than SPARQL queries and to abstract over potentially differing vocabularies.
The methodology for selecting work to be included in this survey is brilliant and I have rarely encountered a more inclusive, systematic and transparent approach. The final selection of 62 surveyed systems is, as far as I can tell, comprehensive and excellently suited for a survey. Also, the authors did a great job in relating their survey to previous surveys and I agree with their conclusion that a more recent, and a more comprehensive survey is in fact desirable.
Despite its brevity, the overview of systems is successful: the authors strike a good balance in highlighting the important aspects. I also like the fact that the 62 systems are not all described in a flat list, but rather are grouped by different aspects: a full picture is given in Table 7, and more systems are reviewed in place in the next section.
The identification of challenges and the survey of techniques to address them can be seen as the main contribution of this paper. It becomes clear which are the most pressing challenges, and not surprisingly, they relate to the lexical-semantic interface: NL, as opposed to formal languages, is ambiguous on many levels and uses several expressions for the same meaning. The length and depth of the subsections of section 5 reflect their relative importance, which is also reflected in Table 5.

So far, the survey is (apart from some terminology glitches, see below) truly excellent, as it sums up all relevant issues in SQA as well as current and past attempts to solve them. One thing is missing though: there is no attempt to qualitatively compare these systems, so readers with less background are left puzzled as to what actually makes a contribution and what does not. While I am not suggesting to re-enact a QALD challenge overview here, it would be commendable to mention evaluation efforts and to include at least some hints and general trends so readers would know what to concentrate on. Since some of the authors have been involved in QALD, this should not impose a tremendous burden.

From Section 6 on, however, the quality of the survey drops considerably: findings are seemingly hastily summarized and the conclusion only briefly reflects the content. What I'd like to see is not only an assessment of the past and present, but also some directions for the future: What are the most promising recent developments, what technologies should we look out for? It is not sufficient to wish for "reusable libraries to lower the entrance effort"; one should also detail their functionality and give a realistic account of what can be done with libraries that function independently from the actual task. One column in Table 6 does not suffice here.
For Lexical Gap and Ambiguity, I would suggest looking at respective surveys from the NLP community instead of trying to re-invent the wheel. There is ample work on distributional semantic models beyond ESA like e.g. neural embeddings, there is short text similarity, lexical substitution, paraphrasing, word sense disambiguation with respect to lexical resources, and many other tasks that have been successfully attained in the (lexical) semantics tracks of e.g. ACL/NAACL/EACL conferences. These could also be discussed with respect to multilinguality: what language-specific resources are required for which approach, and do they exist for other languages than English?
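As a concrete illustration of the kind of distributional technique suggested above, here is a minimal, hypothetical sketch of embedding-based similarity; the vocabulary and 3-dimensional vectors are invented for illustration and merely stand in for real, high-dimensional neural embeddings:

```python
import math

# Toy embedding table: these vectors are invented for illustration and
# stand in for real, high-dimensional neural word embeddings.
EMB = {
    "film":  [0.90, 0.10, 0.00],
    "movie": [0.85, 0.15, 0.05],
    "bank":  [0.10, 0.90, 0.20],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Near-synonyms score far higher than unrelated words; this is what lets
# such models bridge the gap between question words and KB labels.
print(cosine(EMB["film"], EMB["movie"]))  # close to 1.0
print(cosine(EMB["film"], EMB["bank"]))   # much lower
```

This is only a sketch of the principle; the surveys referred to above cover trained models and evaluation on short-text similarity benchmarks.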
And, relating to the QALD comment above, how should future challenges be designed to actually measure progress specifically in mentioned research challenges?

The overall length of the survey's text is rather short, compared to previous surveys in this journal. For making the paper acceptable, I would advise the authors to continue writing at the point where they have stopped, and add evaluation, recommendation and a stronger future outlook.

Below some factual errors and glitches that can be addressed rather easily. Especially the terminology glitches are crucial fixes for a survey article, as terminology from surveys is likely to be adopted in subsequent work. Please spend a few lines on defining your concepts clearly and thoroughly, in accordance with the common interpretation of terms.
* Terminology Glitch 1: ambiguity, homonymy, polysemy, synonymy, entailment, abstract, concrete. You write "polysemy, i.e., words with different forms have the same sense" [factual error] and "which can be concrete (Barack Obama) as well as abstract (love, hate)" [misleading]. Ambiguity is the phenomenon of the same string having different meanings; these can be structural/syntactic (as in "flying planes") or lexical-semantic (as in "bank"). We distinguish between homonymy, where the same string accidentally refers to different concepts (as in money bank vs. river bank), and polysemy, where the same string refers to different but related concepts (as in bank as a company vs. bank as a building). The flipside of ambiguity is the lexical gap, where we distinguish between synonymy and taxonomic relations such as metonymy and hypernymy. Entailment captures their directionality: synonyms entail each other, whereas e.g. hypernyms entail in one direction only ("a sparrow flies" entails "a bird flies", but not the other way around). Abstract and concrete should be extended with "instance": from your example one might conclude that named objects are concrete and others are not, which is not the case. E.g. person and president are concrete, and Barack Obama is an instance of them, whereas feelings are abstract notions.
* Terminology Glitch 2: statistics vs. corpus-based vs. resource-based vs. semantics. You write "statistical disambiguation relies on word co-occurrences, while corpus-based disambiguation also uses synonyms, hyponyms and others" [unclear] and "While statistical disambiguation works on the phrase level, semantic disambiguation works on the concept level" [unclear]. Further, Underspecification [113] and DUDES [23] are listed under statistical approaches [wrong]. Statistics involves the use of counts, often used in normalized form as probabilities and in frameworks like HMMs, MLNs etc. Corpus-based methodology, in my understanding, draws these counts from unstructured text collections. Resource-based methods rely on the content or structure of lexical or semantic resources (e.g. connectivity, degree, path length, conceptual density), and 'semantics' as a term is vastly underspecified. Thus, co-occurrences are in fact statistical but also corpus-based, and one could also use X-onyms inside statistical approaches. All of these approaches in this context try to solve the mapping between lexical units (in the query) and concepts (in e.g. DBpedia), so they work both on the lexical (phrase) level AND on the concept level; the difference is rather where the information for disambiguation comes from.
* Table 4: the "similarity measure" example is not very instructive; find something that is not identical to "running"?
* [on mapping NL to RDF] "... whose relations form a tree. Thus, RDF graph structures cannot be directly mapped." This is not the main reason. Even making it a graph on the NL side, which is in fact one aspect of the new effort on Universal Dependencies, would not allow a direct mapping, since NL does not necessarily express the question in the form that matches the database.
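The entailment directionality discussed in Glitch 1 can be made concrete with a small sketch; the mini-lexicon below is hypothetical and serves only to illustrate the asymmetry:

```python
# Hypothetical mini-lexicon, for illustration only.
HYPERNYM = {"sparrow": "bird", "bird": "animal"}   # hyponym -> hypernym
SYNONYM = {"buy": "purchase", "purchase": "buy"}   # mutual entailment

def entails(a: str, b: str) -> bool:
    """a entails b if b is a synonym or a (transitive) hypernym of a."""
    if SYNONYM.get(a) == b:
        return True
    node = a
    while node in HYPERNYM:
        node = HYPERNYM[node]
        if node == b:
            return True
    return False

print(entails("sparrow", "bird"))  # True: the hyponym entails the hypernym
print(entails("bird", "sparrow"))  # False: not the other way around
print(entails("buy", "purchase"))  # True: synonyms entail each other
```

The same asymmetry is what a survey's terminology section should make explicit when distinguishing synonymy from taxonomic relations.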

Review #2
Anonymous submitted on 13/Feb/2016
Minor Revision
Review Comment:

Review of “Overcoming Challenges of Semantic Question Answering in the Semantic Web”

The paper "Overcoming Challenges of Semantic Question Answering in the Semantic Web" presents an interesting and exhaustive overview of the field of Semantic Question Answering through the discussion of 62 systems as they emerge from the 2010-2015 scientific literature. The rules adopted for the selection of systems and the choice of criteria used to survey their main aspects are clearly presented and provide a further contribution of the paper.
The paper in fact introduces early the selection policy and the definitions/guidelines used to delimit the scope of the analysis in the SQA area. In this way, the paper provides a specific focus on Question Answering as the process of retrieving formalized information (RDF triples or structured relations) from knowledge repositories typical of the Semantic Web (SW).
The paper is valuable as for its synthesis over a large and critical area of the current SW research. It is mostly clear and its coverage is good.
In the review, I would like to further discuss three major issues related to the current version of the paper.
1. Coverage and validity of the adopted notion of SQA
2. Organisation of the paper that embodies a structured overview of this very broad area
3. Paper impact, i.e. whether or not the paper has contributed to the aim stated in its title.

1. Coverage and Validity of SQA
In my view, by focusing ONLY on the retrieval of structured data, the adopted QA notion is consistent but risks not giving a complete account of a field that has much to do with the systematic integration of unstructured information. I refer, for example, to current work on topics such as textual similarity and paraphrasing as well as textual entailment, and their impact on the retrieval of passages or other pointwise information that does not assume any specific KB being available (with reference, i.e. gold, entity information or typed relations). This work is about semantic tasks dealing with text understanding and retrieval, but it is underrepresented here. An example is the PARALEX approach, correctly cited in the paper and firmly based on learning from text. However, the PARALEX approach is representative of a wider set of open-domain QA systems, such as the one presented in (Bordes et al., 2014). These make use of paraphrase learning methods that integrate linguistic generalization (e.g. neural embeddings) with knowledge graph biases. Neglecting this line of research (just because it does not directly insist on RDF-like resources nor explicitly employ disambiguation steps that depend on some form of reasoning) can be seen as a limitation of this paper.
In general, the paper pays little attention to distinctions that characterize the approaches in terms of the type and nature of the employed inference algorithms. Generative and discriminative inductive methods as well as symbolic methods are discussed interchangeably and, when targeted at a single phenomenon, mixed in the same sections. The choice is to survey SQA phenomena and detect the underlying (application) challenges rather than to focus on the involved functionalities, i.e. on one (or more) of their possible decompositions into a range of tasks (and thus methods to solve them).
As an outcome, no consensus seems to be found around a general architecture for the QA process, so that no discussion can be traced around best practices related to individual subtasks.
It must be said that the area has a very broad nature, and its heterogeneous methods do not make it easy to define, on the one side, analogies among systems and best practices and, on the other side, a reference architectural decomposition. However, the authors make no effort to go in such a direction.
I think that this does not help the paper to achieve the goal to shed more light on the field.
2. Paper organisation.
The core of the paper is to discuss some specific challenges that SQA systems seem to face today. The different reference challenges defined and discussed in Section 5 are:
• Lexical Gap
• Ambiguity
• Multilingualism
• Complex Queries (but it is "Operators" in Table 5)
• Distributed Knowledge
• Procedural, Temporal or Spatial Questions
• Templates
For each of the above challenges, first the main solutions provided by surveyed SQA systems are introduced, as an exemplification of the challenge. They are also used as triggers for a comparative discussion among different techniques proposed. Then, in Section 6, a general analysis is provided as a way to detect trends and prospects of the SQA research in the near-medium term.
I have to say that the selection of the challenges is not clearly motivated, as some of them seem to be poorly representative of the field (e.g. multilingualism, an issue covered by very few systems) and, on the other hand, some of them are vaguely defined and possibly cluster too large an area of research, whose comparative analysis is very complex.
Lexical Gap. The lexical gap issue, for example, is representative of too large a set of phenomena to be presented as a single challenge. As in the introductory text, it is defined as the problem of mapping text tokens onto the KB primitives:
"Each textual tokens in the question needs to be mapped to a Semantic Web-based individual, property, class or even higher level concept. Most natural language questions refer to concepts, which can be concrete (Barack Obama) as well as abstract (love, hate). Similarly, RDF resources, which are designed to represent concepts, are characterized by binary relationships with other resources and literals, forming a graph. However, natural language text is not graph-shaped but a sequence of characters which represent words or tokens, whose relations form a tree."
This problem is not a lexical problem, as the authors admit by mentioning the (ontological?) mismatch between RDF graphs and the syntagmatic nature of word graphs or parse trees: it is not clear here whether the authors refer to constituent-based grammatical approaches, which in fact employ parse trees among syntagmatic structures, e.g. complex noun phrases, or dependency-based approaches, which represent grammar through binary relations among words, e.g. heads of complex phrases.
However, in synthesis, I see several independent issues clustered in this challenge:
1. The lexical mismatch between named entities (or linguistic labels for other more abstract concepts) and knowledge graph node names
2. The complexity in the interpretation of individual linguistic relations (as recognized at the level of grammatical representation of sentences/queries) in terms of semantic or conceptual relations, interoperable with the semantics of the targeted RDF KBs
3. The complexity of the overall matching between grammatical graphs and the knowledge graph, where all grammatical relations in the query interact with all the involved arcs in the knowledge graph and joint inferences are required.
In the proposed subfields, i.e. "Normalization and Similarity", "Automatic Query Expansion", "Pattern libraries", "Entailment", "Document Retrieval Models for RDF resources" and "Composite Approaches", a mix of solutions (e.g. "Query Expansion") vs. phenomena (e.g. "Entailment"), and of algorithmic techniques (e.g. "Normalization") wrt. modelling paradigms (e.g. "Similarity"), is presented, which is not helpful in developing a clear picture of the topic (i.e. the kind of challenge targeted) and the surveyed contributions (i.e. the scientific framework in which the research can be organised).
Again, I think that an "architectural" approach that proceeds from a decomposition of the problem (e.g. LexicalMatching < SyntacticInterpretation < EntityMatching < SemanticRelationMapping < JointInterpretationOfEntitiesAndRelations) to paradigms and methods for each step would have been clarifying. Notice how much this "Lexical Gap" challenge overlaps with the "Ambiguity" area.
As a general suggestion I would thus reorganise the discussion, or introduce it with a clear picture of the subtasks involved in each challenge. In any case I would:
- Rename the "Lexical Gap" area by merging it with the "Ambiguity" one, given their gross overlaps;
- Avoid mentioning explicit sub-challenges (e.g. "Document Retrieval models for RDF Resources") that are exemplified by only one system;
- Keep quite independent tasks (such as "Normalization" vs. "Similarity-based lexical matching") separate.
A possible renaming of the labels defined for the challenges is the following:

Current Challenge → Suggested labelling
- Lexical Gap + Ambiguity → Semantic Interpretation
- Multilingualism → (may be not needed, see below)
- Complex Queries → Question Expressivity
- Distributed Knowledge → Knowledge Locality and Heterogeneity
- Procedural, Temporal or Spatial Questions → Question Types and Retrieval Complexity
- Templates → KB Query Formalism

Notice how all the proposed labels are phenomena/processes (e.g. Knowledge Locality and Heterogeneity, Semantic Interpretation) or methodologies/solutions (e.g. KB Query Formalism).

3. Paper impact
It is important, in the light of the above observations, to establish whether or not the paper succeeds in achieving its aims. I feel that the paper has a strong potential to shed light on the area, but this is not fully achieved in this version. As for its coverage, the paper is very valuable, and for this reason I strongly think it should be accepted for publication.
My problem is that, given its current organisation (developed around systems and challenges), the paper does not fully clarify what the computational trends in the area are, such as:
- Which techniques are most useful?
- Which subtasks most impact the quality of the overall SQA chain?
- Which are the missing aspects, i.e.
(1) already studied subtasks for which accurate solutions do not yet exist, or
(2) challenges that have been underestimated, for which more work is needed?
On the contrary, several statements in Section 6 are not fully justified.
An example is the discussion about multilingualism:
"By this means, future research has to focus on language-independent SQA systems to lower the adoption effort. For instance, DBpedia [76] provides a knowledge base in more than 100 languages which could form the base of a next multilingual SQA system."
It is not clear why a language-independent SQA system is necessary. Is it useful, or simply necessary given the resources in different languages? The problem here is that it is not even clear what "language-independent SQA" means, given that no system seems to make clear use of a multilingual technique and no definition is given: is independence related to the language of the question or to the language to which descriptions in the KB refer? This difference is deep, as completely independent issues are targeted in the two cases.
In synthesis, again, an architectural view would have been helpful in focusing the discussion on real (empirically verified) limitations of the current technology and providing some basis to sketch a roadmap for future research.

Summary Review
(1) Suitability as introductory text, targeted at researchers, PhD students, or practitioners, to get started on the covered topic.
The paper is very well suited as an introduction to the field, as for its coverage and clarity of focus.
(2) How comprehensive and how balanced is the presentation and coverage.
The proposed material fails to capture some relevant aspects of current research in QA, especially as concerns QA over unstructured data, but this is a strongly motivated choice of the authors.
(3) Readability and clarity of the presentation.
The presentation is very clear with a good impact on readability.
I do not agree with some of the adopted definitions, which are misleading in my view, but a suitable renaming is proposed above.
(4) Importance of the covered material to the broader Semantic Web community.
The lack of an architectural view of the SQA process makes it hard to properly frame most of the discussion. A more process-oriented view would have been beneficial. The authors are requested to improve their manuscript by devising a general reference (and comprehensive) architecture of an SQA process and then discussing most of the current material according to such an organised view.

Pointwise observations
Page 5. “Answer presentation”.
Why, just after a review of existing systems that gives no structural view of their general workflow, is the first aspect to be discussed Answer Presentation?
I find it a secondary issue in the organization of a general view of the field and would postpone this discussion to later in the paper. After all, entity summarization (i.e. Cheng et al. [22] ….) IS NOT verbalization of RDF triples (i.e. Ngonga Ngomo et al.), and the cluster is not entirely justified.

Page 5. “Thus, a new research sub field focusses on question answering frameworks, i.e., frameworks to combine different SQA systems”
The notion of framework intended here is not clear. I see that it is not just a software framework, but mostly a methodological framework, used to make different tools compatible within a unified QA architecture. This notion should be discussed at length by outlining typical examples of components (or subsystems) to be reused and whose integration requires a common framework. An independent subsection is likely required here.

Page 6. “Lexical Gap”
I do not think it is just a lexical gap; it is more precisely a gap between linguistic knowledge and the target encyclopaedic knowledge of an RDF repository.
After all, a sense is the outcome of the entire information expressed in a sentence and thus the outcome of interpretation. It is not just a dictionary phenomenon, i.e. information missing from the lexicon.

Page 6. “Normalization & Similarity”
This perhaps means "Advanced candidate matching techniques": normalization (as well as lemmatization) and fuzzy matching are all methods to optimize candidate identification for SQA at the lexical level. But similarity, as a phenomenon that depends on the language or on the theory behind a KB, is NOT a process/technique such as "Normalization".

Page 7. “Patterns”
The knowledge patterns involved here can refer to RDF triples or to textual structures. In the latter case I would call them linguistic patterns; "knowledge patterns" would be more precise in the former. It seems to me that the first choice is better here. Notice that patterns are also used to infer the ranking of candidate answers and should thus also be linked to section 5.2.

Page 7. “Entailment”
I am not sure it is a good choice, as entailment is also a logical property between formulas (i.e. knowledge subgraphs). Notice that entailment between texts (e.g. question and answer pairs) is the focus of a large area of research called Textual Entailment Recognition, and the term can lead to confusion here.

Page 8. “Ambiguity”
Here it seems that two kinds of ambiguity are discussed: sentence ambiguity, that is, ambiguity in the linguistic interpretation of the question, as well as ambiguity in the matching of entities as answer candidates. This should be better clarified, as the different subsections (Semantic vs. Statistical disambiguation) in fact deal with both problems interchangeably.

Page 9. Semantic Disambiguation. “While statistical disambiguation works on the phrase level, semantic disambiguation works on the concept level.”
Quite a critical distinction...
Semantics is always the OUTCOME of a disambiguation process, so it always proceeds at the concept level. Statistical methods are just OFTEN tied only to lexical information and its distributional behaviour, making explicit use of these properties within a quantitative inference model. Semantic methods are more often tied only to capturing dictionary senses or KB entity information, but (1) most methods work in a hybrid manner, and (2) at both levels quantitative approaches and algorithmic techniques are usually applied. Graph-based models (e.g. random surfer approaches) applied to knowledge graphs are always statistical in nature.
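The point that statistically driven disambiguation still selects among concepts can be sketched with a minimal co-occurrence overlap example; the sense inventory and signature words below are hypothetical, invented for illustration:

```python
from collections import Counter

# Hypothetical sense inventory: each candidate KB concept is described by
# a few signature words (in practice, drawn from corpus co-occurrences).
SENSES = {
    "bank/finance": "money account loan deposit interest",
    "bank/river": "river water shore fishing flood",
}

def disambiguate(context: str) -> str:
    """Pick the concept whose signature overlaps most with the context.

    The evidence is lexical and count-based (statistical), but the output
    is a concept: the method operates on both levels at once.
    """
    ctx = Counter(context.lower().split())
    return max(SENSES, key=lambda s: sum(ctx[w] for w in SENSES[s].split()))

print(disambiguate("I opened an account at the bank to deposit money"))
print(disambiguate("fishing from the river bank after the flood"))
```

Even this toy example shows why "phrase level vs. concept level" is the wrong axis: the counts are lexical, yet the decision is between concepts.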

Page 10. “Alternative approaches”
They seem mostly to be approaches where natural language processing is not applied, so that a sort of controlled language is adopted for querying. The title does not capture this aspect explicitly.

Page 11. “GETARUNS”
The original reference to GETARUNS should be
Delmonte, R. (2008). Computational Linguistic Text Processing - Lexicon, Grammar, Parsing and Anaphora Resolution. New York: Nova Science Publishers. ISBN: 978-1-60456-749-6.
Page 11. Footnote 20: Such as "List the Semantic Web people and their affiliation."
If you decide that an explanation of the notion of coreference is needed, then you should be more explicit here, explicitly mentioning the coreferent "their" and the referred entity "people".

Page 11. “… handling procedural questions ….”
Is this still QA under your initial assumption? Why is this case different from complex but plain document-oriented retrieval?

Page 11. “… statistic distribution…”
… statistical distribution …

Page 12. “Xu et al [12] …”
You also need to introduce the Xser acronym of the corresponding system here, as it is referred to afterwards in the text. I think that a reference to the system rather than to the authors is better here.

Page 14. Conclusion
In the text
“Future research should be directed at more modularization, automatic reuse, self-wiring and encapsulated modules with their own benchmarks and evaluations. Thus, novel research field can be tackled by reusing already existing parts and focusing on the research core problem itself.”
This is exactly what the overview of sections 4 and 5 does not allow one to define: reusable modules and parts are never defined, even in section 6.

“Another research direction are SQA systems as aggregators or framework for other systems or algorithms to benefit of the set of existing approaches.”
The space dedicated to outlining a potential unifying framework is negligible. As it is reported here as a future research direction, it should be traced more carefully.

“Furthermore, benchmarking will move to single algorithmic modules instead of benchmarking a system as a whole.“
The target of local optimization is benchmarking a process at its individual steps, but global benchmarking is still needed to measure the impact of error propagation across the chain. A Turing-test-like spirit would suggest that the latter is more important, as local measures are never fully representative.


A. Bordes, J. Weston, and N. Usunier. Open question answering with weakly supervised embedding models. In Proceedings of ECML-PKDD'14. Springer, 2014.

Review #3
By Chris Welty submitted on 20/Feb/2016
Review Comment:

I think this paper is acceptable and should be published quickly so as not to be out of date, as all survey papers quickly are. Also, I suggest a title change to reflect that this is a survey of SQA systems and does not overcome any obstacles.