Deep Understanding of Everyday Activity Commands for Household Robots

Tracking #: 2973-4187

Sebastian Höffner
Robert Porzel
Maria M. Hedblom
Mihai Pomarlan
Vanja Sophie Cangalovic
Johannes Pfau
John Bateman
Rainer Malaka

Responsible editor: 
Katja Hose

Submission type: 
Full Paper
Going from natural language directions to fully specified executable plans for household robots involves a challenging variety of reasoning steps. In this paper, a processing pipeline to tackle these steps for natural language directions is proposed and implemented. It uses the ontological Socio-physical Model of Activities (SOMA) as a common interface between its components. The pipeline includes a natural language parser and a module for natural language grounding. Several reasoning steps formulate simulation plans, in which robot actions are guided by data gathered using human computation. As a last step, the pipeline simulates the given natural language direction inside a virtual environment. The major advantage of employing an overarching ontological framework is that its asserted facts can be stored alongside the semantics of directions, contextual knowledge, and annotated activity models in one central knowledge base. This allows for a unified and efficient knowledge retrieval across all pipeline components, providing flexibility and reasoning capabilities as symbolic knowledge is combined with annotated sub-symbolic models.

Solicited Reviews:
Review #1
Anonymous submitted on 05/Jan/2022
Review Comment:

This revised version addresses my comments about being constrained to the kitchen scenario and about the need for clarifying how this work builds upon previous work, comments shared with other reviewers.
It also addresses comments from other reviewers, such as the need to highlight the robot scenario, and clarifies several of the assumptions made.
In particular, the new Section 1.1 is important: it narrows the focus to directions as a subset of all possible instructions, establishes the context, and introduces earlier the properties relevant for evaluation that justify the choice of the running example.
The enhanced related work in Section 2, which discusses literature suggested in other reviews, should also be acknowledged.

My questions regarding the semiotic triangle and mean path depth are addressed appropriately in this version.

The proposal for a more quantitative evaluation in the last section could prove relevant for evaluating the pipeline end to end.
It is proposed as a framework to study in future work, so this version contains no additional quantitative results of that kind.
Still, as another reviewer also pointed out, it would have been beneficial to discuss how to evaluate individual components of the pipeline, at least where doing so seems sensible for a component.

Minor comments:
- There are some occurrences (three or four at least) of a colon followed by a phrase starting in uppercase.

Review #2
By Stefano Borgo submitted on 09/Jan/2022
Review Comment:

The new version of the paper considerably improves the quality of the presentation, clarifies the actual contribution of this work, describes more clearly the limitations of the approach and explains the connections across the different components of the architecture (however, see below). The treatment of these points was weak in the previous version. Now they have been addressed and largely solved.
In this version the work is well presented and the original results are identified. I find particularly relevant the overall clean and well-motivated framework that the authors obtain. I think it provides a reliable modular setting to develop research towards flexible, extensible and comprehensive robotic systems. Furthermore, having a common framework allows alternative solutions to be compared systematically.

A couple of points:

In Fig. 1, in my PDF, the arrow from "SCG Ontology" to "SCG" looks much thicker than the other solid arrows.

Sect.3 "After that, the pipeline extracts image schema theories..."
The expression "After that" suggests that steps (e) and (f) are sequential. This means that knowledge of the context (d) and referents (f) may be used to extract the image schema(s). This would make sense. The difference can be seen by comparing the command "hold the dish till I put the napkin under it" with the command "raise the dish till I put the napkin under it".
It is unclear to me which schemas are triggered by the two commands. Support? Verticality? Path? Unlike in the running example, in this case there is no location that is the target of the movement, and there might be no suitable empty space on the table to play that role. This information is available from (f), which indeed would help to trigger the right image schemas independently of which command one uses.
The architecture (Fig. 1) does not suggest that sequentiality is even possible. From the figure it looks like (d)+(f) and (e) run in parallel and independently. It would be great if the authors added a few more words on this point.

Review #3
By Christopher Myers submitted on 11/Jan/2022
Review Comment:

The authors present an architecture designed to take directions as natural language input (e.g., take the cup to the table) and act on the provided direction. The architecture, or pipeline, uses a combination of the Socio-physical Model of Activities (SOMA) and a Deep Language Understanding (DLU) system to parse perceptual and linguistic information and ground it to objects and actions required for carrying out the instruction.

General Comments & Criticisms

The authors present a concise and novel advance in the approach to providing direction to intelligent systems. In general, the authors have done a good job in addressing the issues I found in the originally submitted manuscript. It is my opinion that the revised manuscript should be published in the Semantic Web Journal.