French and English Question Answering using Lexico-Syntactic Patterns

Tracking #: 2537-3751

Nikolay Radoev
Amal Zouaq
Michel Gagnon

Responsible editor: 
Philipp Cimiano

Submission type: 
Full Paper
Continued work on existing knowledge bases (KBs) has given access to a large amount of structured data. However, most existing QA systems using those KBs are designed to handle questions in a single language (mostly English), and those that handle multiple languages take a rather generic approach, leading to reduced overall performance. We present a different method for transforming natural language questions into SPARQL queries. Our method focuses on leveraging the syntactic information of questions to generate one or more SPARQL triples. We present a QA system aimed at multilingual query processing and describe a set of lexico-syntactic patterns used to generate the SPARQL queries. We evaluate the described patterns by applying them in our QA system over DBpedia and measuring their impact on overall performance.
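The pipeline the abstract describes (question → lexico-syntactic pattern → SPARQL triple) can be illustrated with a minimal sketch. The pattern, lexica, and function below are hypothetical stand-ins for illustration only, not the system's actual implementation.

```python
import re
from typing import Optional

# Illustrative sketch only: one lexico-syntactic pattern of the kind the
# abstract describes, mapping a "Who <verb> <entity>?" question to a single
# SPARQL triple over DBpedia. The lexica and pattern are assumed, not taken
# from the paper.

PROPERTY_LEXICON = {"wrote": "dbo:author"}          # assumed property lexicon entry
ENTITY_LEXICON = {"The Hobbit": "dbr:The_Hobbit"}   # assumed entity-linking output

def question_to_sparql(question: str) -> Optional[str]:
    """Match a WP-VBD-NP surface pattern and emit one SPARQL triple, or None."""
    m = re.match(r"Who (\w+) (.+)\?", question)
    if not m:
        return None
    verb, entity_phrase = m.groups()
    prop = PROPERTY_LEXICON.get(verb)
    entity = ENTITY_LEXICON.get(entity_phrase)
    if prop is None or entity is None:
        return None
    return f"SELECT ?x WHERE {{ {entity} {prop} ?x }}"
```

In a full system each pattern would be defined over POS tags or dependency edges rather than a regular expression, and several patterns could contribute triples to one query.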
Reject (Two Strikes)

Solicited Reviews:
Review #1
By Ricardo Usbeck submitted on 21/Aug/2020
Minor Revision
Review Comment:

First of all, thanks to the authors for their very detailed comments and answers, they are appreciated.

# Reply to author's answer
The authors clarified all review remarks from version 1 and added corresponding remarks in the manuscript.

- Thanks for clarifying the KB-agnosticism issue. One can follow the intuition that adjustments to the core should suffice.
- Concerning the patterns' engineering, please state in the manuscript that only the training dataset was used for extracting the patterns. Sorry if the pointer in the text was overlooked.
- "The system outputs always false if the timeout hits"/"no investigation into the parameters for the S(e) formula"/Other arguments instead of new experiments: While I see your points, e.g., the closed-world assumption and the merging of QALD 7 and QALD 8 (which was done in QALD 9 [2,3]), not reevaluating the system hints at poor reproducibility of the system in later research. Using QALD 9 instead of an own dataset would greatly benefit reproducibility.
- Side note: There is consent with reviewer two w.r.t. the evaluation setting and the uncited references. While the manuscript clearly shows the benefit of lexico-syntactic patterns, there are few connection points for future approaches to reuse/reevaluate.

# Originality & Significance
Based on the references found in online databases, this article describes an extension of LAMA, which is an extension of AMAL. AMAL and this extended version of LAMA use the same or similar modules to classify the type of the question and DBpedia Spotlight for entity extraction. However, the way the property lexica are formed, complex questions are deconstructed into simple questions, and how a SPARQL query is formed are different and novel. Subparts of the extended system are known to the QA over KG community.

However, the paper describes a novel QA system and shows the benefit of just a few POS/dependency patterns. The system's significance lies in the simplicity of its solution (POS and dependency tree patterns), which can be easily adapted to other languages.

The discussion of the datasets is well done.

The authors suggest in the introduction that the patterns can be used in any QA system. However, there is no source code or online demo available for this system, which blocks the reproduction of experiments and comparisons via platforms such as GERBIL QA [1]. Still, since tables 4 and 6 seem to be complete, a resource-intensive recoding could work.

# Quality of Writing
The paper is well written and easy to follow. It can be a good entry paper for someone starting in QA. Every time LAMA is mentioned, it should be made clear whether the base system, i.e., reference [6], is meant or the system from this paper.

# Major issues
- The evaluation is still not reproducible, see above.
- It would be good to see an ablation study for the single components (e.g., question classification) and their impact on the overall performance. What is the accuracy of the Entity Extraction component alone, and with Spotlight? What is the accuracy of the different parts of the Property Extraction component?

# Minor issues
- The authors have named their dataset "LC-QuAD"; my bad for giving the wrong abbreviation in the first place.
- Page 1, r, line 40, please add a sentence on the contribution of this paper over LAMA to clarify it upfront.
- Page 3, l, line 5, insert space between LC-QuAD and (...)
- Page 3, r, line 44, pre-processing
- Page 8, r, line 22: "patterns used in LAMA" => "patterns used in the extended version of LAMA"?
- Page 10, r, If the manuscript gives the accuracy of Parsey McParseFace, please also give it for SyntaxNet. Please point here to Section 5.4, where it is mentioned.
- Page 13, l, explain how the SPARQL query is built in the base system (No pattern). If one has no access to a Springer library, it is impossible to look up reference [6] legally.
- Page 15, r, line 4, add a space between sentences
- Page 16, l, line 12-14: Note that this is a different F-Score than what the manuscript shows and thus misleads readers. Please clarify that or remove this part of the sentence. Performance comparison to other systems is not needed if we follow the intuition that this paper investigates pattern usage.

# References

Review #2
Anonymous submitted on 21/Oct/2020
Review Comment:

I thank the authors for reworking the paper. Unfortunately, many issues have not been satisfactorily addressed. The paper has somewhat improved, but the contribution remains unclear and large parts of the approach are not well described. The evaluation lacks comparability to other systems working on QALD and LC-QuAD. I detail my criticisms below.

Main contribution: The article is a solid piece of engineering work, presenting a nice approach to question answering based on linguistic analysis and patterns. However, the work is still not well positioned, in the sense that the main contribution of the paper is not really well articulated. The main claim is that patterns help. But this is a very coarse claim that is not detailed enough. It is unclear what the baseline condition for comparison really is. This should have been worked out in the introduction with a clear discussion of how the claimed contribution fits into the existing landscape of QALD approaches. The authors miss a number of related works here (see below). Neither is the contribution made clear, nor is it well positioned in the context of existing work. Regarding the baseline condition: while the LAMA system is explained in more detail, it is not clear how SPARQL queries are generated in the LAMA system. One supposes that some mechanism similar to patterns is needed to map the (collapsed) dependency tree into a SPARQL query. But this mechanism is not detailed, so it is unclear what the pattern-based approach is really compared to.

Patterns: It is unclear how the patterns were designed and which data were used.

Mechanisms: The mechanism by which dependency patterns are mapped to the body of a SPARQL query remains unclear. On page 9, bottom, the authors describe the relevant dependencies for the Tolkien example. Then some magic seems to happen to generate the triple patterns mentioned a few lines later. How this works is not explained properly. One can only guess here. Similar remarks hold for the POS-based patterns described in Section 4. The example discussed, involving the conjunction, shows that multiple POS patterns can match. However, it is not described how the results of different patterns are combined to yield an overall consistent query interpretation. I suppose that some compositionality principle is involved here, but it is not explained.

Role of patterns: The role of patterns with respect to the baseline LAMA system is not clear as mentioned above. It remains thus unclear what the pattern-based approach is really compared to.

Interplay/conflicts between patterns: I am not satisfied with the authors' answer on this issue. If one has multiple patterns, even at different levels (POS and dependency trees), then patterns will overlap, in the sense that multiple patterns will match the same question. One then needs to resolve whether the patterns are combined or whether they conflict with each other. There is no hint as to how this conflict resolution or combination works.

Evaluation: The evaluation is non-standard for the English case, as the way the F-Measure is computed differs from the standard way the F-Measure is computed on the QALD task. I really wonder why the authors did not implement the same F-Measure so that their approach becomes comparable to related work. I see the point of the French evaluation following a different setting, as the QALD data relies on queries over the English DBpedia only. So the manual evaluation step is acceptable for French.

Missing related work:

There are a number of very related approaches that the authors miss. I see some overlap in terms of methods, so a discussion of these works would have been mandatory:

The work on the Aqualog system by Lopez et al. also relies on certain patterns that are matched over dependency parses. This is a seminal approach to question answering over linked data that is not even mentioned:

Vanessa López, Michele Pasin, Enrico Motta:
AquaLog: An Ontology-Portable Question Answering System for the Semantic Web. ESWC 2005: 546-562

More recent work has presented a multilingual QA system based on universal dependency parses that is also not even mentioned:

Sherzod Hakimov, Soufian Jebbara, Philipp Cimiano:
AMUSE: Multilingual Semantic Parsing for Question Answering over Linked Data. International Semantic Web Conference (1) 2017: 329-346

Due to the above reasons, I cannot propose this paper for acceptance.