Review Comment:
Summary: The authors use an OWL/SWRL reasoner to help generate RDF representations of arithmetic word problems involving multiple agents transferring multiple types of objects to one another. These representations are used to generate natural language word problems. The authors also use language models to translate natural language word problems into RDF representations and solve them with the ontology reasoner. They show that this approach performs better than simply asking ChatGPT 3.5 or Gemini to solve the problems.
(1) Originality: I am not an expert in this area, but I do not have reason to doubt the authors' claim that complex transfer AWP solving is a new problem. The bibliography is extensive.
(2) Significance: A large improvement over bare LLMs is demonstrated. The authors claim that existing transfer AWP datasets do not contain complex transfers, so existing work is not directly comparable. The LLM method used for comparison is admittedly naive compared to the way LLMs have been used on simple AWPs (with chain-of-thought prompting), and it remains to be seen whether the current approach beats other, non-naive uses of LLMs. However, the performance gain over naive use of LLMs is more than incremental, and the problem is interesting because it exposes an intuitively simple domain in which LLMs struggle.
(3) Quality of Writing: The paper could be better organized. The overall design of the system should be described in full before specifics are introduced. For example, the SWRL rules used in the system probably should not be presented until after explaining that the system calls an ontology reasoner (and that it calls the reasoner repeatedly, once per transfer). The sentence "If value of any quantity gets updated using an SWRL rule and it is required in the transfer that follows, it creates the sequential reasoning situation" is perfectly intelligible now that I have read the whole paper, but it was mysterious when I first encountered it.
Questions: Algorithm 1, lines 9-11: I do not understand what Sync-Reasoner does. At each iteration, are the hasUpdatedValue(q,v) atoms used to actually update the quantValue(q,v) atoms before the SWRL rules are applied again? If not, I do not see how this process constitutes correct sequential reasoning.
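To make the question concrete, here is a minimal sketch of the loop I assume is needed for correctness. The predicate names quantValue and hasUpdatedValue come from the paper; the function names and data structures are my own illustration, not the authors' implementation:

    def solve_transfers(quant_value, transfers, run_reasoner):
        """Sequentially process transfers, writing inferred updates back
        before the next reasoning step.

        quant_value  -- dict: quantity id -> current value (the quantValue atoms)
        transfers    -- ordered list of transfer descriptions
        run_reasoner -- calls the OWL/SWRL reasoner once and returns the
                        inferred hasUpdatedValue(q, v) atoms as a dict {q: v}
        """
        for transfer in transfers:
            inferred = run_reasoner(quant_value, transfer)
            # The write-back below is the step I am asking about: unless the
            # inferred hasUpdatedValue(q, v) atoms replace the corresponding
            # quantValue(q, v) atoms before the reasoner is called again, the
            # next transfer would be computed from stale values.
            quant_value.update(inferred)
        return quant_value

If Sync-Reasoner performs something equivalent to this write-back, stating that explicitly (or releasing the code) would resolve my confusion.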
Data artifacts:
(A) Organization: Good. The README describes everything in the repo.
(B) Reproducibility: I do not think I could reproduce the experiments using these resources. I would like to have the actual code and information on the BERT-based language model used (version, number of training epochs, etc.); it would be easiest to simply include the full training code. And given other questions like the one about hasUpdatedValue above, I am not confident I understand exactly what the authors did without the code.
(C) Repository stability: GitHub is good.
(D) Completeness: The JSON file of problems used in the assessment contains only 80 of the 200 problems used. Why?