Multilingual Question Answering using Lexico-Syntactic Patterns

Tracking #: 2174-3387

Authors: 
Nikolay Radoev
Amal Zouaq
Michel Gagnon

Responsible editor: 
Philipp Cimiano

Submission type: 
Full Paper
Abstract: 
Continued work on existing knowledge bases (KBs) has given access to a large amount of structured data. However, most of the existing QA systems using those KBs are designed to handle questions in a single language (mostly English), and those that handle multiple languages take a rather generic approach, leading to reduced overall performance. We present a different method for transforming natural language questions into SPARQL queries. Our method focuses on leveraging the syntactic information of questions to generate one or multiple SPARQL triples. We present a QA system aimed at multilingual query processing and describe a set of lexico-syntactic patterns used to generate the SPARQL queries. Our evaluation was done by applying the described patterns in our QA system over DBpedia and measuring the impact on the overall performance.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Ricardo Usbeck submitted on 23/May/2019
Suggestion:
Minor Revision
Review Comment:

# Originality & Significance

Based on the references found in online databases, this article describes an extension of LAMA [2], which is itself an extension of AMAL [1]. AMAL and this extended version of LAMA use the same or similar modules to classify the type of the question, and DBpedia Spotlight for entity extraction. However, the ways in which the property lexica are formed, complex questions are deconstructed into simple questions, and SPARQL queries are formed are different and novel. Thus, the paper describes a novel QA system, although even the individual subparts of the extended system are known to the QA-over-KG community. The significance of this system lies in the simplicity of its solution (POS and dependency tree patterns, which can be easily adapted to other languages).

The discussion of the datasets is well done. Please refer to SQA as LC-QuAD.

There is no source code or online demo available for this system, which prevents reproduction of the experiments and comparisons via platforms such as GERBIL QA [3].

# Quality of Writing

The paper is well written and easy to follow. It can be a good entry paper for someone starting in QA.
Unfortunately, the reader has to stop at Section 5.2 and read [2] first to understand the rest completely.
However, the reference section is not well done:
- citations 12, 15, 31 are empty
- citation 34 is strange
- citation 7 is not proper: Usbeck, R., Ngomo, A. C. N., Haarmann, B., Krithara, A., Röder, M., & Napolitano, G. (2017, May). 7th open challenge on question answering over linked data (QALD-7). In Semantic Web Evaluation Challenge (pp. 59-69). Springer, Cham.
- citation 8 is not proper: Trivedi, P., Maheshwari, G., Dubey, M., & Lehmann, J. (2017, October). Lc-quad: A corpus for complex question answering over knowledge graphs. In International Semantic Web Conference (pp. 210-218). Springer, Cham.
- for citation 3, the most recent version is: Diefenbach, Dennis, et al. "QAnswer: A Question Answering prototype bridging the gap between a considerable part of the LOD cloud and end-users." 2019.
- Why are there three different citations for WDAqua (citations 32, 3, 23)?
- There is a dangling heading "French Patterns" at the end of the citations
- Citations 26 - 35 do not appear in the text

# Major issues

- The authors claim to be knowledge-base agnostic (page 2, l 11-17) but do not prove it. The SPARQL patterns are DBpedia-centric (imagine a birthplace property that goes from a place to a person). Suggestion: remove the claim.
- The entity extraction (page 6, l 45-51) is greedy and known from HAWK. It remains unclear how entities are chosen if shorter and longer combinations both carry valid URLs.
- The formula on page 6 seems to be constructed ad hoc, and the factors 2 and length(e)/2 also seem chosen ad hoc. An evaluation of the impact of these parameters would benefit the paper a lot.
- Also, the thresholds on page 7, 3 for the Levenshtein distance and 0.6 for the similarity, seem ad hoc. An evaluation would be appropriate here to increase the soundness of the approach.
- Assuming that a timed-out boolean query must be considered false is wrong; the system should return null (see the sketch after this list). Assuming a return value of true or false heavily influences the performance of the system w.r.t. ASK queries in QALD and LC-QuAD, as both datasets are skewed and do not contain 50% false and 50% true answers.
- Sections 3/4: Without the evaluations mentioned above, the manual patterns seem to be implemented to reverse-engineer the actual datasets. They do not seem to be universal or general enough. Maybe the authors neglect to describe how they crafted the patterns. Which system or methodology did they use? Are the two tables the complete lists?
- Page 11, l 19-22: What is the accuracy of the parsing mechanism? Without comparing it, this argument is invalid.
- Page 11, r 29: Why is pattern 7 applied first and not pattern 2? The description of the pattern combination is not satisfactory at the moment.
- Page 11, r 44: Filter and aggregation functions ("the rest of the system") are strongly underexplained. Please describe these additions, even if they are ad hoc.
- Is it possible that one complex parse tree leads to different SPARQL queries? Please discuss how this is handled.
- Section 5.1. What pattern gets chosen and why?
- Sections 5.1.1 and 5.1.2 do not take into account how the QALD/LC-QuAD datasets were created. While QALD is made from real search queries, SQA is generated from patterns, which in turn explains the distributions a bit more.
- Section 5.2: Which F-measure is used: macro, micro, or QALD? See [3]. It is also unclear why the system is punished so hard if it does not return all answers.
- Table 9: What about the influence of aggregations and filters? What are the remaining errors at 0.9 F-measure? Why is 0.1 missing? A more in-depth error analysis would be appreciated.
- Page 16, right column: To compare QALD-7 F-measure, the authors should rather use task 1 instead of task 4.
- Conclusion: If the authors claim to improve the performance over the base system, they should at least mention the base system and its performance explicitly.
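
To make the null-on-timeout point concrete: a minimal sketch in Python using SPARQLWrapper (the endpoint and timeout values are illustrative assumptions, not the authors' actual setup) that returns None rather than a default truth value when an ASK query times out:

```python
from typing import Optional
from SPARQLWrapper import SPARQLWrapper, JSON  # pip install sparqlwrapper

def answer_ask(query: str,
               endpoint: str = "https://dbpedia.org/sparql",  # illustrative
               timeout_s: int = 30) -> Optional[bool]:
    """Run a SPARQL ASK query; return None on timeout instead of a default."""
    client = SPARQLWrapper(endpoint)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    client.setTimeout(timeout_s)
    try:
        # ASK results come back as {"head": {}, "boolean": true/false}
        return bool(client.query().convert()["boolean"])
    except Exception:
        # Defaulting to False here would be scored as correct on every
        # false-labelled question; since the ASK questions in QALD and
        # LC-QuAD are not split 50/50, that silently inflates the metrics.
        return None
```

A benchmark harness can then count None as an unanswered question instead of a (possibly lucky) false.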

# Minor issues

- Page 5, r 47 => "For example, the solver"
- Page 8, r 17 => "for a triple"
- Page 13 l 46 => "Frequencies"
- Page 14 l 50 => "cannot"
- Page 16 l 45 whitespaces after comma missing
- Page 16 r 2 (REF)???

# References
[1] Radoev, Nikolay, et al. "AMAL: Answering French natural language questions using DBpedia." In Mauro Dragoni, Monika Solanki, and Eva Blomqvist, editors, Semantic Web Challenges, pages 90–105, Cham, 2017. Springer International Publishing.
[2] Radoev, Nikolay, et al. "A language adaptive method for question answering on French and English." Semantic Web Evaluation Challenge. Springer, Cham, 2018.
[3] Usbeck, Ricardo, et al. "Benchmarking Question Answering Systems." Semantic Web Journal, 2019

Review #2
By Dennis Diefenbach submitted on 21/Jun/2019
Suggestion:
Major Revision
Review Comment:

This manuscript was submitted as 'full paper' and should be reviewed along the usual dimensions for research contributions, which include (1) originality, (2) significance of the results, and (3) quality of writing.

(1) originality: the presented work focuses on testing the coverage of dependency and POS-tagging patterns on two popular benchmarks and the effect they have on a QA system. This is done for French and English. POS tagging and dependency parsing are well-known techniques, used for example in PowerAqua, Treo, and DEANNA (for POS tagging) and Intui2, Intui3, and Freya (for dependency parsing). The novelty is that this approach was also tested on French using Universal Dependencies.
(2) significance: the paper evaluates these approaches on some newer benchmarks, getting good results. Unfortunately, the metrics used do not respect the corresponding benchmark metrics, because two benchmarks were fused and the evaluation metrics seem to have been changed. Moreover, no comparison with related work is made. I would accept this paper only if the evaluations are carried out so that they are comparable with the state of the art. I strongly recommend using the GERBIL for QA platform for this; this will also show runtime values.
(3) quality of writing: the paper is globally well structured, but many points are unclear. I have left many questions in the part below.

Introduction
- "Many applications" -> which ones
- "opt for keyword search" -> references
- "those systems" -> references
- [4] is not a QA system
- saying multilingual and restricting to English / French is a bit strange
- you present LAMA, but you give a reference?!

Methodology
- I do not agree that by looking at the datasets one can see the type of questions asked by an average user. The datasets in question are both artificial, i.e., they were created by expert users.
- dbo: also represents properties! E.g., dbo:birthDate.
- If I understand correctly, you merged the QALD-7 and QALD-8 datasets and you made one yourself. I consider this bad practice because it does not allow your results to be compared easily with others.
- before "The dataset", a period is missing
- before "The system", a space is missing
- Figure 1: I find figure 1 very difficult to understand at this stage of the paper; it is not clear how all the modules are connected to each other, since there is no clear workflow, only many arrows
- The relation between figure 1 and figure 2 is not clear
- "more details in section 2.2.1": why is this appearing in section 2.2.2? Should the sections not be merged then?
- "classification is done by using patterns" -> how many patterns are there? Is this not limiting multilinguality?
- "Question solver" -> how many questiion solvers are there? how many custom rules and heuristics? Is this again not a problem for multilinguality?
- 2.2.3 -> I would move this to an offline phase, it cuts the explanation of the pipeline
- domains -> prefixes
- regarding the HashTable, do you not need stemming?
2.2.4
- how is the replacement done, no index needed?
- do you not perform lemmatization in the wrong order? Is "Queen" expanded to "queens", "Queen", and "Queens"? Should it not be the other way around?
- length(e) is not defined: number of tokens or letters?
- it is not clear at this point what POS and dependency patterns are; moreover, it is not clear what the baseline is then
- why use the Levenshtein distance, do you do stemming this way?
- the example with writer/write is strange, because the Levenshtein distance is less than 3 (see the sketch at the end of this list), so you do not need word2vec
- "SPARQL pattern extraction": is this included in the baseline or not? For a journal paper it is not a problem to repeat previous work
- [15]: no citation. Moreover, a SPARQL query can be much more complex than this, no? The body can contain filters, for example
- SPARQL triples -> are called triple patterns
- so 2.3 explains what a SPARQL query is, while 2.2.6 describes how the queries are generated; should the order not be inverted?
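
To make the writer/write remark concrete, a small self-contained sketch (plain Python, not the authors' implementation) showing that the edit distance is already below the threshold of 3 reported in the paper:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# "writer" -> "write" is a single deletion, so the distance is 1,
# already under the threshold of 3; no word2vec fallback is needed.
assert levenshtein("writer", "write") == 1
```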

Dependency-based patterns
- [16] is this citation right? Ah, you say what a transition-based dependency parser is. Should you not cite SyntaxNet?
- 3.2 and pattern 4: what would happen if in the KB the information is encoded like
hobbit written by Tolkien
silmarillion written by Tolkien
?
Do you have a pattern for this as well?

POS-based patterns
- I didn't find a pattern for “Skype was developed by who?”; could you point it out for me?
- you mention “before 2010”: how do you recognize this expression and how do you normalize it?

Experiments
- I feel there is at least one frequent case that is not covered by the dependency patterns, namely: “Give me museums in Berlin”. Do you cover this case? Does it not appear in QALD?
- typo in "frequencies"
- Partial answers are not accepted -> this means that you are not using the same evaluation metrics as in QALD!
- For me, the no-patterns case is not clear at all!
- In the evaluation you use the French DBpedia. But you can follow the wiki interlinks to map French labels to the English DBpedia. Then you will not have the problem you are reporting.
- What is the pattern for “Which museum exhibits the scream by Munch?”
- You cannot choose your own evaluation if you want to compare with other systems!!!
- You do not compare with any other system. This is very strange.

Future work
- .The -> space missing

Related works
- a good place to cite a review paper on the subject, such as:
Survey on challenges of question answering in the semantic web
Core techniques of question answering systems over knowledge bases: a survey

Conclusions
- Web Semantics -> semantic web

Review #3
Anonymous submitted on 11/Jul/2019
Suggestion:
Reject
Review Comment:

This paper presents a QA approach based on lexico-syntactic patterns. The approach is multilingual in principle and is shown to work for English and French. The system is evaluated on the QALD and SQA datasets, showing improvements over a version of the system that does not use patterns. Patterns are defined over POS tag sequences as well as dependency relations.

Overall, I like the approach proposed and find it novel and interesting. However, I have to criticize the presentation and I find that the exposition of the approach lacks clarity and has technical shortcomings.

First of all, I would have liked to see at the beginning a clear motivation for using patterns. What is indeed surprising is that the authors propose a very small set of patterns. The first thing one wonders is how many questions this small set of patterns is able to cover. The target is not clear either. The patterns are only evaluated in an additive scenario, that is, by adding them on top of an existing system, in this case the LAMA system. While the impact is there, the paper does not discuss the coverage of the patterns, i.e., how many questions can be covered correctly by the pattern-based approach alone. This would be important, I think.

The exposition is odd in the sense that a lot of information is first given on lexicon generation, entity and property extraction, etc. However, these are all helper functions that in some sense are not at the core of the pattern-based approach. I would have expected the paper to start by explaining and formalizing what patterns are. It seems that for each of these patterns there is a specific implemented algorithmic behaviour that maps the pattern to a query. However, this is not clearly articulated. I would have expected a formalization of patterns together with a framework / language / operation set in which the behaviour of the patterns can be specified. The introduction of patterns in Sections 3 and 4 comes way too late in the paper. For instance, regarding lexico-syntactic pattern representation, in the example introduced by "where we can use the following mapping", I cannot differentiate which parts are generic and which parts are specific to the particular example, as the language for these "mappings" has never been introduced. Nor has the language for specifying or describing dependency-based patterns been generally introduced. This makes it very difficult to understand the generic proposal from a technical point of view. In some sense I am missing a detailed description and explanation of what we see in Table 4; having this just "hidden" in a table is not sufficient. The mechanisms for defining and matching patterns need to be described from a general point of view in the text.

It was also unclear to me how the POS- and dependency-based patterns play together. Are they all matched? What if several patterns match? Is there a preference for one type of pattern over the other? This is not clear.

For the case of adapting the patterns to another language, French in this case, it is not clear from the paper how this is done. Do the patterns need to be re-engineered for another language? I miss details on this aspect in the paper.

Regarding the integration of the pattern-based approach into LAMA, I miss a clear description of how the pattern-based approach is integrated into the system. This is not clearly exposed.

Overall, I found the proposed approach very interesting and relevant, but the presentation of the approach needs to be substantially improved before this paper can be accepted IMHO.