Ontology-based Understanding of Everyday Activity Instructions

Tracking #: 2861-4075

Sebastian Höffner
Robert Porzel
Maria M. Hedblom
Mihai Pomarlan
Vanja Sophie Cangalovic
Johannes Pfau
John Bateman
Rainer Malaka

Responsible editor: 
Katja Hose

Submission type: 
Full Paper

Abstract:
Going from natural language instructions to fully specified executable plans for household robots involves a challenging variety of reasoning steps. In this paper, a processing pipeline to tackle these steps is proposed and implemented. It uses the ontological Socio-physical Model of Activities (SOMA) as a common interface between its components. The major advantage of employing an overarching ontological framework is that its asserted facts can be stored alongside the semantics of instructions, contextual knowledge, and annotated activity models in one central knowledge base. This allows for a unified and efficient knowledge retrieval across all pipeline components, providing flexibility and reasoning capabilities as symbolic knowledge is combined with annotated sub-symbolic models.
Major Revision

Solicited Reviews:
Review #1
Anonymous submitted on 22/Sep/2021
Major Revision
Review Comment:

This work proposes a processing system built on top of closely related literature, aiming for an ambitious understanding of natural language instructions.

The supplemental resources included in the live online platform provide a way of inspecting several of the components alongside examples corresponding to the ones used throughout the paper.

A possible concern is the path towards generalization, as the system is currently constrained to the kitchen scenario and to the moving-based action. Section 3.6, in particular, relies heavily on the proximity theory.
Scaling up is mentioned among the assumptions and the possible lines of future work, yet such an extension might end up with a smaller scope than the one claimed in some parts of this work, which suggest a wider space of possible commands.

* The paper would also benefit from further discussion, for example on the following points:

- In Section 3.4, from some perspective, it could seem that the mappings
from symbol to concept (performed by the SCG) and
from concept to object (given in the semantic map)
could simply be composed. The text, however, processes them further, implying that composition is not sufficient.
It might initially be confusing why it is necessary to complete the semiotic triangle.

- It is not clear whether the method based on maximal mean path depth is an assumption introduced in this work, or an established part of the terminology in the literature.
In the first case, the reader might doubt the claim that this method rarely fails; instead, it might seem that there are cases where this assumption is insufficient to find the object referred to by the symbol.

- Semantic mapping is not performed but rather assumed to be provided by another system (Section 3.3). The existence of the semantic map is again listed as one of the assumptions in Section 4, but not really discussed. Is there an existing system in the literature that could hint at possible ways to perform semantic mapping for an actual complete system like the one in this work?

- Evaluation is another area where the work would benefit from further clarification, since no systematic, repeatable, and comparable performance measurement is specified. The problem to be addressed is itself rather vaguely established, as corresponding to "simulating a command from the user". A more concrete notion of the simulation goal could help clarify the evaluation strategy for the whole system. Discussions or observations about more detailed component-level evaluations would also be very interesting.

* Some minor issues and suggestions:

- In page 3, line 51:
"Next, DLU determines referents (f)" -> probably "Next, DLU determines referents (c)"?
- Several textual strings are strangely linked to the top of the file upon clicking (for example: page 1, line 25; page 2, lines 34, 43; page 3, line 41; page 4, line 22)
- In page 10, line 12: "error free..." -> "error-free..."
- In page 10, lines 47-48: "advance... advanced"

Review #2
By Stefano Borgo submitted on 06/Oct/2021
Major Revision
Review Comment:

Review swj2861

This paper describes parts of a pipeline aimed at achieving a formal understanding of commands given in natural language.
The topic is interesting for SWJ, and the paper addresses problems that have received attention in robotics for many years, like symbol grounding, or in more recent research, like the use of image schemas for planning.
The research question is explicitly stated: “how can we use ontological knowledge to extract and evaluate parameters from a natural language instruction in order to simulate it formally?” and the evaluation of the result is done via simulation in a virtual environment previously developed, at least in part, by the authors themselves. This kind of evaluation is suitable for the given research question and the paper content. Yet, more information should be provided to make the material accessible and reusable to the community. I must confess that at times I'm confused about what is assumed to be true in general and what is true because of the way the system elaborates the available data. This might explain some of the points I make below.

The paper’s title and the research question address instructions in natural language but the examples in the paper are about commands. The capability of the pipeline to manage instructions in a proper sense is unclear. For example, a food recipe is made of instructions but goes beyond the cases discussed in the paper: usually a recipe includes positive and negative commands, alternative suggestions, motivations and other kinds of explanatory text. Which of these the proposed pipeline covers is not clear to me.

The abstract should be expanded to better situate the research and to give a more informative description of the paper’s content, in particular the pipeline structure.
Also, the paper is entirely focused (motivations, development and simulation) on the robot scenario. If not the title, at least the keywords should include pointers to robotics or robotic applications.

The choice to evaluate the results via simulation in a virtual environment is suitable given the domain of application. I would suggest also adding checks internal to the pipeline, e.g., a theoretical validation that all input receivable by a node in the pipeline is processable by that node; and I suspect some consistency check across annotations in the same scene might be useful: what happens if two objects are dubbed fridge under the overall assumption that there is only one fridge in the kitchen?

My other concern is that the simulation case is limited with respect to the variety of possible commands. What is the outcome of the pipeline when the command itself presents some ambiguity? E.g., “Put the chair on the table” (where the chair is mentioned by mistake while meaning the bowl previously put on it); “Take the cheese out with the bottle” (perhaps having said earlier “take the bottle out of the fridge”); “Put the bottle back” (meaning: back to the fridge).
It is unclear which of these cases are out of scope and how the others are managed by the system. For the former, this information should be stated clearly from the start. For the latter, I believe more simulations are needed to verify whether and up to which point the system works correctly.
In short, the research question asks how we can solve a general problem, but the paper remains unclear about the restrictions assumed for the pipeline to solve it, and the evaluation section does not clarify this either.

I had to guess what is assumed at the different stages of the information flow. Overall the pipeline makes sense at first sight, but what kind of input is assumed at each step with respect to a real scene, what kind of information is manipulated at each step, and what the result is are only suggested (notwithstanding Listings 1–4). I suggest a more systematic presentation of the information flow.

The authors could be clearer about the parts of the pipeline that are newly introduced (if any) and those that preexist it. This could easily be shown by marking the nodes of Fig. 1 with thick boundaries (say, for the new parts) and thin boundaries (for the old).

Fig. 1: in the legend for data flow and annotation, put just the arrows (drop the slots).
According to the figure, the DLU ontology sends annotations to the context (d) without any input regarding the scene, which is really weird; how does this actually work?
In the caption of Listing 2, rephrase its content in natural language.

Similarly, in Listing 3 we have
ref:type dlu:Cup
ref:type dlu:Cup
ref:type soma:DesignedContainer
ref:type soma:DesignedContainer
where one is a symbol and the other an object.
In which sense are these of the same type?

Why is the UUID used in Listing 2 not one of the identifiers occurring in Fig. 3? Is there a reason?

The authors should further clarify the use of the term context as introduced in Sect. 3.3. Apparently it is the scene, but practically it seems to be the scene plus the set of sentences parsed so far, that is, the internal representation of the knowledge of the scene at this point in time. Yet the text also uses ‘context description’ as if the two notions should be kept apart.

Pg. 4 suddenly restricts attention to a special type of commands: “For a command to be executed, an environment with objects to manipulate is mandatory.”
Does this exclude commands like “Go near the fridge”, “Stay away from the cooker” and “Check if the pot is on the burner”, where no manipulation (except of the robot itself) is needed or expected?

It is unclear whether the claim “Each object in the kitchen scene is internally annotated with unique identifiers and semantic labels aligned with SOMA” is a prerequisite for the system to apply or an observation about the input provided by component (f). In other words, can it happen that an object has no semantic label? What happens in this case?
(I must confess that I have this kind of reading problem across the paper, especially when it refers to the pipeline: statements are given in isolation, which does not help one understand whether they are derived by the system or assumed to be true at that stage, even though they might not hold while running the system.)
I wonder whether the claim “a semantic map is assumed to be provided” trivializes the earlier discussion and the grounding problem itself. This should be clarified.

Sect. 3.4: Is the mapping between objects and symbols bidirectional? All the other maps are described as being unidirectional.

It would be interesting to know where the heuristic of Table 1 fails. Could you add some examples? Are there checks for these cases?

In Fig. 3, I have a hard time making sense of links like
id:8e065… (a trajectory) dul:classifies id:18b08… (a cup).
Please add more information on how such links should be read.

Review #3
By Christopher Myers submitted on 07/Oct/2021
Minor Revision
Review Comment:

The authors present an architecture designed to take instructions as natural language input (e.g., take the cup to the table) and act on the provided instructions. The architecture, or pipeline, uses a combination of the Socio-physical Model of Activities (SOMA) and a Deep Language Understanding (DLU) system to parse perceptual and linguistic information and ground it to the objects and actions required for carrying out the instruction.

General Comments & Criticisms
The authors are attempting to tackle a set of problems critical to the increased deployment of robotics within the home, such as, but not limited to, 1) natural language understanding, 2) context sensitivity, 3) grounding/referencing, and 4) action selection and execution. Each of these is a significant challenge and a distinct research focus. One of the contributions of this work is showing how to pull all of these pieces together by leveraging ontologies, which speaks to the originality of the reported research. Indeed, the DLU+SOMA approach is one of several possible for achieving (re)taskable robots. Another end-to-end system/framework/architecture/approach/solution that was not mentioned in the text is Interactive Task Learning (Gluck & Laird, 2018; Kirk, Mininger, & Laird, 2016). It provides an alternative to the DLU+SOMA solution presented in the paper, and it would be worth discussing the differences between the approaches, their strengths, weaknesses, etc. The writing is clear and concise, and the various supplemental materials appear to be complete.

Another general criticism I have is that the entirety of the paper provides a description of the system without any real evaluation of the system's performance as a whole, issues with individual components, etc. Such an evaluation would give readers good indicators of where to focus future research (knowledge gaps, semantic map acquisition, etc.) to make the proposed architecture more robust, and would add weight to the points made in the discussion and limitations section. I recommend being as quantitative as possible (failure/success rates, processing speeds, etc., over multiple runs). Along these lines, one example instruction is provided: "put the cup on the table". It would be very informative to report the variety of instructions that are handled and the variety that are not handled by the system.

Specific Comments & Criticisms
• Page 1: a point was made that “Most commonly, any robot activity starts with the robot receiving instructions for that specific activity…verbally from a human or through written texts such as recipes or procedures found in online repositories.” This statement is more suggestive of a ‘could be possible’ or a ‘potential future’ than of a currently available method for providing instructions to household robots to complete tasks (especially wikiHow as an instruction-set source), but I could be wrong about this.
• The question posed at the end of the introduction sets the stage: “How can we use ontological knowledge to extract and evaluate parameters from a natural language instruction in order to simulate it formally?”, and the authors specify their approach as a potential solution. I just want to note that research reported by Eberhart et al. (2020) also attempts to answer the same question, using a generalizable ontology of instruction.
• A game engine is used for simulating the agent’s/robot’s behavior. This is a common approach for demonstrating the capabilities and gaps of robots and intelligent systems in general. However, the original context describing the instruction-taking robot was helping out in a kitchen. Are the simulated instantiation and the physical context close enough that the demonstration generalizes between the two?
• In the sentence beginning on line 39, page 4, what happens if the underlying grammar cannot map the semantics onto the SOMA concepts?
• In the sentence beginning on line 39, page 6, it would be worthwhile to briefly describe how the terminal scene is constructed within the system.
• The paragraph beginning on line 28 of page 8 describes the DLU’s use of ‘human computation’ to solve the problem of relative spatial closeness. Is this a requirement of the system for any task that demands spatial approximation and guidance, or a side effect of the intelligence residing within a virtual environment?

Eberhart, A., Shimizu, C., Stevens, C., Hitzler, P., Myers, C. W., & Maruyama, B. (2020). A Domain Ontology for Task Instructions. In B. Villazón-Terrazas, F. Ortiz-Rodríguez, S. M. Tiwari, & S. K. Shandilya (Eds.), Knowledge Graphs and Semantic Web: Second Iberoamerican Conference and First Indo-American Conference, KGSWC 2020 (pp. 1–13). Mérida, Mexico. Communications in Computer and Information Science, vol. 1232.

Gluck, K. A., & Laird, J. E. (Eds.). (2018). Interactive task learning: Humans, robots, and agents acquiring new tasks through natural interactions. Cambridge, MA: The MIT Press.

Kirk, J., Mininger, A., & Laird, J. (2016). A demonstration of interactive task learning. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI 2016), 4248–4249.