Review Comment:
# Originality & Significance
Based on the references found in online databases, this article describes an extension of LAMA [2], which is itself an extension of AMAL [1]. AMAL and this extended version of LAMA use the same or similar modules to classify the question type and DBpedia Spotlight for entity extraction. However, the way the property lexica are formed, the way complex questions are decomposed into simple questions, and the way a SPARQL query is constructed are different and novel. Thus, the paper describes a novel QA system, although the individual subparts of the extended system are already known to the QA-over-KG community. The significance of this system lies in the simplicity of its solution (POS and dependency tree patterns, which can easily be adapted to other languages).
The discussion of the datasets is well done. Please refer to SQA as LC-QuAD.
There is no source code or online demo available for this system, which prevents reproduction of the experiments and comparisons via platforms such as GERBIL QA [3].
# Quality of Writing
The paper is well written and easy to follow. It can be a good entry paper for someone starting in QA.
Unfortunately, the reader has to stop at Section 5.2 and read [2] first in order to fully understand the rest of the paper.
However, the reference section is not well done:
- citations 12, 15, 31 are empty
- citation 34 is strange
- citation 7 is not properly formatted; it should read: Usbeck, R., Ngomo, A. C. N., Haarmann, B., Krithara, A., Röder, M., & Napolitano, G. (2017, May). 7th open challenge on question answering over linked data (QALD-7). In Semantic Web Evaluation Challenge (pp. 59-69). Springer, Cham.
- citation 8 is not properly formatted; it should read: Trivedi, P., Maheshwari, G., Dubey, M., & Lehmann, J. (2017, October). Lc-quad: A corpus for complex question answering over knowledge graphs. In International Semantic Web Conference (pp. 210-218). Springer, Cham.
- citation 3 should be updated to the latest version: Diefenbach, Dennis, et al. "QAnswer: A Question Answering prototype bridging the gap between a considerable part of the LOD cloud and end-users." 2019.
- Why are there three different citations for WDAqua (citations 3, 23, and 32)?
- There is a dangling heading "French Patterns" at the end of the reference section
- Citations 26 - 35 do not appear in the text
# Major issues
- The authors claim to be knowledge-base agnostic (page 2, l. 11-17) but do not prove it. The SPARQL patterns are DBpedia-centric (imagine a KB in which a birthPlace property points from a place to a person). Suggestion: remove the claim.
- The entity extraction (page 6, l. 45-51) is greedy and already known from HAWK. It remains unclear how entities are chosen if both shorter and longer combinations carry valid URIs (e.g., when both "Barack Obama" and "Obama" map to DBpedia resources).
- The formula on page 6 seems to be constructed ad hoc, and the factors 2 and length(e)/2, respectively, also seem chosen ad hoc. An evaluation of the impact of these parameters would benefit the paper a lot.
- Also, the thresholds on page 7, 3 (Levenshtein distance) and 0.6 (similarity), seem ad hoc. An evaluation would be appropriate here to increase the soundness of the approach; a sketch of such a parameter study follows this list.
- Assuming that a boolean query that times out should be counted as false is wrong; the system should return null instead. Defaulting to true or false heavily influences the system's performance on ASK queries in QALD and LC-QuAD, as both datasets are skewed and do not contain 50% false and 50% true answers (see the toy calculation after this list).
- Sections 3/4: Without the evaluations mentioned above, the manual patterns seem to be implemented to reverse-engineer the actual datasets; they do not seem universal or general enough. Perhaps the authors omitted a description of how they crafted the patterns. Which system or methodology did they use? Are the two tables the complete lists?
- Page 11, l. 19-22: What is the accuracy of the parsing mechanism? Without reporting and comparing it, this argument is invalid.
- Page 11, r. 29: Why is pattern 7 applied first and not pattern 2? The description of the pattern combination is not satisfactory at the moment.
- Page 11, r. 44: Filter and aggregation functions ("the rest of the system") are strongly underexplained. Please describe these additions, even if they are ad hoc.
- Is it possible that one complex parse tree leads to several different SPARQL queries? Please discuss how this case is handled.
- Section 5.1: Which pattern gets chosen, and why?
- Sections 5.1.1 and 5.1.2 do not take into account how the QALD and LC-QuAD datasets were created. While QALD is built from real search queries, SQA (LC-QuAD) is generated from templates, which in turn explains the distributions a bit better.
- Section 5.2: Which F-measure is used, macro, micro, or the QALD measure? See [3] and the macro/micro sketch after this list. It is also unclear why the system is punished so hard if it does not return all answers.
- Table 9: What about the influence of aggregations and filters? What are the remaining errors at 0.9 F-measure? Why is 0.1 missing? A more in-depth error analysis would be appreciated.
- Page 16, right column: To compare the QALD-7 F-measure, the authors should use task 1 rather than task 4.
- Conclusion: If the authors claim to improve performance over the base system, they should at least mention the base system and its performance explicitly.
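
To make the threshold remark concrete, here is a minimal sketch of the kind of parameter study I have in mind. `run_system` is a hypothetical placeholder for one run of the authors' pipeline over a development split; it is not part of the paper.

```python
# Sketch of the parameter study suggested above. run_system() is a
# hypothetical stand-in for one run of the authors' pipeline over a
# development split with the given thresholds; it is NOT part of the paper.
from itertools import product

def run_system(levenshtein_max: int, similarity_min: float) -> float:
    """Placeholder: would return the (macro) F-measure of a full run."""
    return 0.0

def threshold_sweep():
    scores = {}
    for lev, sim in product(range(1, 6), (0.4, 0.5, 0.6, 0.7, 0.8)):
        scores[(lev, sim)] = run_system(lev, sim)
    # Reporting the full grid (not only the best cell) would show how
    # sensitive the system is to the chosen values of 3 and 0.6.
    best = max(scores, key=scores.get)
    return best, scores
```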
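
On the timeout default for ASK queries, a toy calculation with invented numbers (they do not reflect the real QALD/LC-QuAD distributions) shows how much a hard-coded "false" can be worth on a skewed answer set.

```python
# Toy illustration: counts are invented and do NOT reflect the real
# QALD / LC-QuAD ASK answer distributions.
def accuracy_with_default(gold, timeouts, default):
    correct = 0
    for i, g in enumerate(gold):
        # Answered queries are assumed correct for simplicity; timed-out
        # queries are replaced by the chosen default value.
        answer = default if i in timeouts else g
        correct += int(answer == g)
    return correct / len(gold)

gold = [False] * 8 + [True] * 2         # hypothetical skewed ASK gold answers
timeouts = set(range(len(gold)))        # worst case: every ASK query times out
print(accuracy_with_default(gold, timeouts, default=False))  # 0.8 "for free"
print(accuracy_with_default(gold, timeouts, default=None))   # 0.0
```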
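
On the F-measure question in Section 5.2, a minimal sketch with invented answer sets shows that macro and micro aggregation can diverge substantially, which is why the paper should state which one is reported.

```python
# Minimal macro vs. micro F-measure example over three questions, each with a
# gold answer set and an invented system answer set.
def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def counts(gold, pred):
    tp = len(gold & pred)
    return tp, len(pred - gold), len(gold - pred)

questions = [
    (set("abcdefghij"), {"a"}),   # 1 of 10 gold answers returned
    ({"k"}, {"k"}),               # fully answered
    ({"l", "m"}, set()),          # no answer returned
]

macro = sum(f1(*counts(g, p)) for g, p in questions) / len(questions)
tp, fp, fn = (sum(counts(g, p)[i] for g, p in questions) for i in range(3))
micro = f1(tp, fp, fn)
print(round(macro, 2), round(micro, 2))   # ~0.39 vs ~0.27: the choice matters
```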
# Minor issues
- Page 5, r. 47 => "For example, the solver"
- Page 8, r. 17 => "for a triple"
- Page 13, l. 46 => "Frequencies"
- Page 14, l. 50 => "cannot"
- Page 16, l. 45: whitespace after comma missing
- Page 16, r. 2: unresolved reference placeholder "(REF)"
# References
[1] Radoev, Nikolay, et al. "AMAL: Answering French natural language questions using DBpedia." Semantic Web Challenges. Springer, Cham, 2017, pp. 90-105.
[2] Radoev, Nikolay, et al. "A language adaptive method for question answering on French and English." Semantic Web Evaluation Challenge. Springer, Cham, 2018.
[3] Usbeck, Ricardo, et al. "Benchmarking Question Answering Systems." Semantic Web Journal, 2019.