Review Comment:
First of all, I thank the authors for their very detailed comments and answers; they are much appreciated.
# Reply to the authors' answers
The authors addressed all review remarks on version 1 and added corresponding clarifications to the manuscript.
- Thanks for clarifying the KB-agnosticism issue. One can follow the intuition that adjustments to the core should suffice.
- Concerning the engineering of the patterns, please state explicitly in the manuscript that only the training dataset was used to extract the patterns. My apologies if I overlooked a pointer to this in the text.
- "The system outputs always false if the timeout hits"/"no investigation into the parameters for the S(e) formula"/Other arguments instead of new experiments: While I see your points, e.g., the closed-world assumption and the merging of QALD-7 and QALD-8 (which was done in QALD-9 [2,3]), not re-evaluating the system suggests poor reproducibility for later research. Using QALD-9 instead of a custom dataset would greatly benefit reproducibility.
- Side note: I agree with Reviewer 2 regarding the evaluation setting and the missing references. While the manuscript clearly shows the benefit of lexico-syntactic patterns, it offers few connection points for future approaches to reuse or re-evaluate.
# Originality & Significance
Based on the references found in online databases, this article describes an extension of LAMA, which is itself an extension of AMAL. AMAL and this extended version of LAMA use the same or similar modules to classify the question type, and both use DBpedia Spotlight for entity extraction. However, the way the property lexica are built, the way complex questions are decomposed into simple questions, and the way SPARQL queries are constructed are different and novel. Subparts of the extended system are already known to the QA over KG community.
Nevertheless, the paper describes a novel QA system and shows the benefit of just a few POS/dependency patterns. The system's significance lies in the simplicity of its solution (POS and dependency tree patterns), which can easily be adapted to other languages.
The discussion of the datasets is well done.
The authors suggest in the introduction that the patterns can be used in any QA system. However, there is no source code or online demo available for this system, which prevents reproducing the experiments and comparing via platforms such as GERBIL QA [1]. Since Tables 4 and 6 appear to be complete, a resource-intensive reimplementation might nevertheless be feasible.
# Quality of Writing
The paper is well written and easy to follow. It could serve as a good entry paper for someone starting in QA. Every time LAMA is mentioned, it should be made clear whether the base system, i.e., reference [6], or the system from this paper is meant.
# Major issues
- The evaluation is still not reproducible, see above.
- It would be good to see an ablation study for the individual components (e.g., question classification) and their impact on the overall performance. What is the accuracy of the Entity Extraction component on its own, and with Spotlight? What is the accuracy of the different parts of the Property Extraction component?
# Minor issues
- According to http://lc-quad.sda.tech/, the authors of the dataset named it "LC-QuAD"; my apologies for giving the wrong abbreviation in the first place.
- Page 1, r, line 40: please add a sentence stating the contribution of this paper over LAMA, to clarify it upfront.
- Page 3, l, line 5: insert a space between LC-QuAD and (...)
- Page 3, r, line 44, pre-processing
- Page 8, r, line 22: "patterns used in LAMA" => "patterns used in the extended version of LAMA"?
- Page 10, r: if the manuscript gives the accuracy of Parsey McParseFace, please also give it for SyntaxNet, and point here to Section 5.4, where it is mentioned.
- Page 13, l: explain how the SPARQL query is built in the base system (No pattern). Without access to a Springer library, it is impossible to look up reference [6] legally.
- Page 15, r, line 4: add a space between sentences.
- Page 16, l, lines 12-14: Note that this is a different F-score from the one reported in the manuscript and thus misleads readers. Please clarify this or remove this part of the sentence. A performance comparison to other systems is not needed if we follow the intuition that this paper investigates pattern usage.
# References
[1] http://www.semantic-web-journal.net/system/files/swj1578.pdf
[2] QALD-9 / QUANT
[3] https://dblp.uni-trier.de/rec/journals/semweb/HoffnerWMULN17.html?view=b...