Generating and Solving Complex Transfer Type Arithmetic Word Problems: An Ontological Approach

Tracking #: 3722-4936

Authors: 
Suresh Kumar
P. Sreenivasa Kumar

Responsible editor: 
Guest Editors Education 2024

Submission type: 
Full Paper
Abstract: 
Most existing Arithmetic Word Problem (AWP) solvers focus on solving simple examples. Transfer-Case AWPs (TC-AWPs) involve scenarios where objects are transferred between agents. The widely used AWP datasets mainly consist of simple TC-AWPs (problems that involve a single object transfer). Current Large Language Models (LLMs) are capable of solving most of these simple TC-AWPs effectively. In this work, we focus on assessing the solving capability of LLMs (ChatGPT and Gemini) for complex TC-AWPs (where multiple types of objects are transferred or more than one transfer of an object is performed). Since the popular AWP datasets contain only simple TC-AWPs, we first generate complex TC-AWPs using an ontological approach. We utilize these complex examples to assess LLMs' word-problem-solving capabilities. We observe that the accuracy of LLMs falls rapidly as the number of object transfers increases to 3 or 4. An approach for solving TC-AWPs using ontologies and machine learning exists in the literature. We propose an extension of this approach that can handle complex TC-AWPs and find that, compared to current LLMs, the proposed solution gives better accuracy on complex TC-AWPs. We analyze the failed cases of the LLM approach and find that the reasoning capabilities of LLMs need substantial improvement.
Tags: 
Reviewed

Decision/Status: 
Major Revision

Solicited Reviews:
Review #1
By Joseph Zalewski submitted on 01/Oct/2024
Suggestion:
Minor Revision
Review Comment:

Summary: The authors use an OWL/SWRL reasoner to help generate RDF representations of arithmetic word problems, involving multiple agents transferring multiple types of objects to each other. These are used to generate natural language word problems. The authors also use language models to translate natural language word problems into RDF representations and solve them using the ontology reasoner. They show that this approach performs better than simply asking ChatGPT 3.5 or Gemini to solve the problems.

(1) Originality: I am not an expert in this area, but I do not have reason to doubt the authors' claim that complex transfer AWP solving is a new problem. The bibliography is extensive.

(2) Significance: Large improvement over bare LLMs is demonstrated. Authors claim that existing transfer AWP datasets do not contain complex transfers, so existing work is not directly comparable. The LLM method used for comparison is admittedly naive compared to the way LLMs have been used on simple AWPs (with chain-of-thought prompting), and it remains to be seen whether the current approach beats other non-naive uses of LLMs. However, the performance gain over naive use of LLMs is more than incremental, and the problem is interesting, because it shows an intuitively simple domain where LLMs struggle.

(3) Quality of Writing: The paper could be better organized. The overall design of the system should be described in full before mentioning specifics. For example: the SWRL rules used in the system probably should not be presented until after explaining that the system calls an ontology reasoner (and that it calls the reasoner repeatedly, once for each transfer). The sentence "If value of any quantity gets updated using an SWRL rule and it is required in the transfer that follows, it creates the sequential reasoning situation" is perfectly intelligible now that I've read the whole paper, but when I encountered it, it was mysterious.

Questions: Algorithm 1, lines 9-11: I do not understand what Sync-Reasoner does. At each iteration, are the hasUpdatedValue(q,v) atoms used to actually update the quantValue(q,v) atoms, before applying the SWRL rule again? If not, I do not see how this process is correct sequential reasoning.
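To make the concern concrete, here is a small Python sketch of my reading of Algorithm 1 (the names quantValue and hasUpdatedValue follow the paper, but the loop structure is my interpretation, not the authors' code). It contrasts syncing derived values back into the fact base before the next rule firing with leaving the original quantValue facts untouched:

```python
def apply_transfer_rule(values, transfer):
    """One SWRL-style rule firing: derive hasUpdatedValue facts for a transfer."""
    giver, taker, amount = transfer
    return {giver: values[giver] - amount,
            taker: values[taker] + amount}

def solve(initial, transfers, sync):
    facts = dict(initial)   # quantValue facts
    derived = {}            # hasUpdatedValue facts accumulated across firings
    for transfer in transfers:
        # With sync, each firing sees the results of earlier transfers;
        # without it, every firing reads the original (stale) quantValue facts.
        source = facts if sync else initial
        updated = apply_transfer_rule(source, transfer)
        derived.update(updated)
        if sync:
            facts.update(updated)  # write hasUpdatedValue back into quantValue
    return {**initial, **derived}

# Alice gives Bob 3 apples, then Bob gives Carol 2 apples.
start = {"alice": 5, "bob": 1, "carol": 0}
steps = [("alice", "bob", 3), ("bob", "carol", 2)]
print(solve(start, steps, sync=True))   # {'alice': 2, 'bob': 2, 'carol': 2}
print(solve(start, steps, sync=False))  # Bob's stale value yields bob = -1
```

Without the write-back step, the second firing subtracts from Bob's original count and the final state is wrong, which is why I ask whether Sync-Reasoner performs this update between iterations.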

Data artifacts:

(A) Organization: Good. The README describes everything in the repo.

(B) Reproducibility: I do not think I could reproduce the experiments using these resources. I would like to have the actual code and information on the BERT-based language model used (version, training epochs, etc.); it would be easiest to just have the full code used for training. And due to other questions, like the one about hasUpdatedValue above, without the code I am not confident I understand exactly what the authors did.

(C) Repository stability: GitHub is good.

(D) Completeness: JSON file of problems used in the assessment only contains 80 of the 200 problems used. Why?

Review #2
Anonymous submitted on 04/Mar/2025
Suggestion:
Major Revision
Review Comment:

Overall impression:
The paper is generally well written and contains a somewhat novel idea that may merit further investigation and development. The only thing I find deficient is that the paper presents a proposed approach rather than a fully developed one, with a complete methodology and dedicated software that goes beyond tests run by hand in a website's browser interface. The ontology development, transformation, and reasoning aspects are much more developed, but these are only half of the picture.

As such, I would recommend accepting the paper only after heavy revision, and this recommendation would be significantly strengthened if additional test data were added beyond what is in the GitHub repository, in order to automate the experiment and allow a much larger test set. Given the difficulty of interpreting the reasons for the success or failure of a black-box commercial LLM, as in this experiment, I think this is crucial for the scientific quality of the paper.

(1) originality
The experiment is original and contributes to a growing body of work that examines the benefits of combining LLMs with structured data in ontologies. The approach appears novel, though it is more a proposed approach than a fully realized one.

(2) significance of the results
The results are of minor significance but show potential for further investigation.

(3) quality of writing
The writing is acceptable, with few enough issues that nothing significant came to my attention. The authors are nevertheless recommended to revise the language if they choose to rewrite.

(4) Long-term stable URL for resources
The GitHub is well organized and the content is clear, with a sufficient README. I suspect there may be relevant files that are not included but would improve the completeness of the repo, such as the input ontologies and links in the README to relevant tools that can be used to reproduce the results. All files load properly.

Notes:
- There are quite a few abbreviated terms (e.g., AWP, BT). While this is to be expected in a technical paper, it would be helpful to write out some of them for readability. This applies to the ontology classes as well, which could benefit from descriptive labels (this may improve LLM results too).
- The background section on Semantic Web standards is a bit too long; e.g., we probably don't need an explanation of what IRIs are.
- I find the location of the related work section odd; it may fit better after the introduction.
- Is there a reason why you use tc:quantType and not rdf:type? If so, I would be interested to hear it; otherwise, this may be worth exploring in case it has an impact.
- The methodology is at times thorough and at times sparse. As a reader I had to jump around to grasp the entire picture. I would recommend reorganizing the experimental sections to clarify and define, in concise steps, the overall system that is proposed.
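On the tc:quantType point above, a hypothetical Turtle fragment (the tc: namespace and class names are assumed for illustration, not taken from the authors' ontology) shows the two modelling options being contrasted:

```turtle
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix tc:  <http://example.org/tc#> .   # hypothetical namespace

# Option A (as I read the paper): the quantity's type is an ordinary property,
# so it is opaque to OWL class reasoning.
tc:q1 tc:quantType  tc:Apple ;
      tc:quantValue 5 .

# Option B: the type is expressed via rdf:type, so standard class-based
# reasoning (subclass inference, class-restricted rules) applies to tc:q1.
tc:q1 rdf:type      tc:AppleQuantity ;
      tc:quantValue 5 .
```

If Option A was chosen deliberately (e.g., to keep quantities as plain individuals for the SWRL rules), stating that rationale in the paper would answer the question.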