Let Me Help You!
Neuro-Symbolic Short-Context Action Anticipation

The Robotics Institute, Carnegie Mellon University

Abstract

In an era where robots are becoming available to the general public, the applicability of assistive robotics extends across numerous aspects of daily life, including in-home robotics. This work presents a novel approach for such systems, leveraging long-horizon action anticipation from short observation contexts. In an assistive cooking task, we demonstrate that predicting human intention leads to effective collaboration between humans and robots. Compared to prior approaches, our method halves the observation time of human behavior required before accurate predictions of future actions can be made, thus allowing for quick and effective task support from short contexts. To provide sufficient context in such scenarios, our proposed method analyzes the human user and their interactions with surrounding scene objects by imbuing the system with additional domain knowledge that encodes the scene objects' affordances. We integrate this knowledge into a transformer-based action anticipation architecture, which modulates the attention mechanism by either boosting or attenuating the attention between different visual features. Through this approach, we achieve an improvement of up to $9\%$ on two common action anticipation benchmarks, namely 50Salads and Breakfast. After predicting a sequence of future actions, our system selects an appropriate assistive action that is subsequently executed on a robot for a joint salad preparation task between a human and a robot.

Neuro-Symbolic Short-Context Action Anticipation

We employ a graph propagation approach to identify the affordances linked to each object in the scene and the tools required to realize them in the desired manner. The representations of these objects and affordances are then used to adjust the attention over visual features in both the transformer encoder and decoder. From this, we derive the sequence of actions predicted to occur in the future portion of the video. The robot executes the actions for which our model expresses sufficient confidence, using the skill library (outlined in the section below) to assist the human.
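As a minimal sketch of the attention-adjustment idea (assuming PyTorch; the function and variable names here are illustrative and not the released implementation), symbolic affordance scores can be added to the attention logits so that compatible object/tool feature pairs are boosted and incompatible ones are attenuated:

```python
import torch
import torch.nn.functional as F

def affordance_biased_attention(q, k, v, affinity, alpha=1.0):
    """Scaled dot-product attention whose logits are shifted by a
    symbolic affinity matrix (hypothetical illustration).

    q, k, v  : (batch, n_tokens, d) visual feature projections
    affinity : (batch, n_tokens, n_tokens) scores derived from the
               affordance graph; positive entries boost attention
               between compatible features, negative entries attenuate it
    alpha    : scalar weight on the symbolic bias
    """
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5   # standard attention logits
    logits = logits + alpha * affinity            # neuro-symbolic boost / attenuation
    weights = F.softmax(logits, dim=-1)
    return weights @ v

# Toy usage with random features and a random affinity matrix.
q = k = v = torch.randn(1, 6, 32)
affinity = torch.randn(1, 6, 6)
out = affordance_biased_attention(q, k, v, affinity)
print(out.shape)  # torch.Size([1, 6, 32])
```

The same biased attention can be applied in both the encoder and the decoder, since only the logits are modified and the rest of the transformer block stays unchanged.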

Overview

Human-Robot Collaboration Demonstration

We present a sample video showcasing the human-robot collaboration facilitated by our system. In this scenario, a robot observes human actions through a top-view RGB camera. It employs our novel neuro-symbolic action anticipation framework, NeSCA, to predict the sequence of actions the human is most likely to perform. The robot then assists the person by executing the actions it is confident about based on its predictions.
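The sketch below illustrates this confidence-gated action selection. It is an assumption-laden simplification: the threshold value and all names (e.g. assist_step) are placeholders, not the actual interface.

```python
# Illustrative confidence-gated action selection; names and the threshold
# below are assumptions for this sketch, not the released interface.
from typing import Callable, Dict, List, Optional, Tuple

CONFIDENCE_THRESHOLD = 0.8  # assumed value, tuned per deployment

def assist_step(predictions: List[Tuple[str, float]],
                skill_library: Dict[str, Callable[[], None]]) -> Optional[str]:
    """Execute the first anticipated action that is both confidently
    predicted and available as a robot skill; otherwise keep observing."""
    for action, confidence in predictions:
        if confidence >= CONFIDENCE_THRESHOLD and action in skill_library:
            skill_library[action]()  # run the corresponding low-level skill
            return action
    return None

# Toy usage with dummy skills and anticipated (action, confidence) pairs.
skills = {"add_vinegar": lambda: print("pouring vinegar"),
          "hand_over_spatula": lambda: print("handing over the spatula")}
predictions = [("cut_tomato", 0.95),   # a human action the robot cannot take over
               ("add_vinegar", 0.88)]
print(assist_step(predictions, skills))  # pours vinegar, returns "add_vinegar"
```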

Skill Library

We perform the actions anticipated by NeSCA using a pre-defined skill library {S0, S1, ..., Sm}, where each high-level skill Si corresponds to a specific sequence of low-level control inputs. The skills in the library are broadly categorized into three "grasp types": a top-down grasp, suitable for pick-and-place actions with items like vegetables; a sideways grasp, ideal for picking up and pouring objects such as olive oil or vinegar bottles; and an aligned grasp, designed for handling oriented tools like knives and spatulas. The aligned grasp is specifically designed for bringing tools to a human collaborator and handing them over. In this process, the robot first brings the requested tool near its likely area of use for easy accessibility.
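As an illustration of how such a library can be organized (the skill names, grasp enum, and waypoints below are hypothetical, not the actual controller code), each skill can be stored as a grasp type plus a sequence of low-level motion targets:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Tuple

class GraspType(Enum):
    TOP_DOWN = auto()   # pick-and-place of items such as vegetables
    SIDEWAYS = auto()   # picking up and pouring bottles (oil, vinegar)
    ALIGNED = auto()    # oriented tools (knife, spatula), used for hand-overs

@dataclass
class Skill:
    """A high-level skill S_i: a grasp type plus low-level waypoints."""
    name: str
    grasp: GraspType
    waypoints: List[Tuple[float, float, float]]  # end-effector targets (x, y, z), illustrative

# Hypothetical entries of the skill library {S0, ..., Sm}.
SKILL_LIBRARY = {
    "pick_place_tomato": Skill("pick_place_tomato", GraspType.TOP_DOWN,
                               [(0.40, 0.10, 0.25), (0.40, 0.10, 0.05), (0.55, -0.20, 0.05)]),
    "add_vinegar":       Skill("add_vinegar", GraspType.SIDEWAYS,
                               [(0.30, 0.30, 0.20), (0.50, 0.00, 0.30)]),
    "hand_over_spatula": Skill("hand_over_spatula", GraspType.ALIGNED,
                               [(0.20, -0.30, 0.15), (0.55, 0.00, 0.25)]),
}

print(SKILL_LIBRARY["hand_over_spatula"].grasp)  # GraspType.ALIGNED
```

Keeping skills keyed by the action labels predicted by the anticipation model makes the lookup from prediction to execution a simple dictionary access.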

Demonstration videos of the robot executing individual skills:
Add Vinegar
Hand Over Spatula
Pick & Place Tomato
Add Pepper
Pick & Place Cheese
Add Dressing
Pick & Place Cucumber
Add Salt

Dummy Kitchen Action Anticipation Dataset

We open-source the trajectories collected in our real-world kitchen setup. This dataset promotes the application of video-based action anticipation to real-world human-robot interaction.

Sample videos from the dataset


Download our dataset!