MotIF: Motion Instruction Fine-tuning
Main Authors:
Format: Journal Article
Language: English
Published: 16-09-2024
Summary: While success in many robotics tasks can be determined by observing only the final state and how it differs from the initial state (e.g., whether an apple has been picked up), many tasks require observing the robot's full motion to correctly determine success. For example, brushing hair requires repeated strokes that correspond to the contours and type of hair. Prior works often use off-the-shelf vision-language models (VLMs) as success detectors; however, when success depends on the full trajectory, VLMs struggle to make correct judgments for two reasons. First, modern VLMs are trained only on single frames and cannot capture changes over a full trajectory. Second, even when state-of-the-art VLMs are given an aggregate input of multiple frames, they still fail to detect success due to a lack of robot data. Our key idea is to fine-tune VLMs on abstract representations that capture trajectory-level information, such as the path the robot takes, by overlaying keypoint trajectories on the final image. We propose motion instruction fine-tuning (MotIF), a method that fine-tunes VLMs on these abstract representations to semantically ground the robot's behavior in the environment. To benchmark and fine-tune VLMs for robotic motion understanding, we introduce the MotIF-1K dataset, containing 653 human and 369 robot demonstrations across 13 task categories. MotIF assesses the success of a robot motion given the image observation of the trajectory, the task instruction, and a motion description. Our model significantly outperforms state-of-the-art VLMs, achieving at least 2x higher precision and 56.1% higher recall, and generalizes across unseen motions, tasks, and environments. Finally, we demonstrate practical applications of MotIF in refining and terminating robot planning and in ranking trajectories by how well they align with task and motion descriptions. Project page: https://motif-1k.github.io
DOI: 10.48550/arxiv.2409.10683
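The core representation described in the abstract, overlaying a keypoint trajectory on the final image so that a single frame carries trajectory-level information, can be illustrated with a short sketch. This is not the authors' code: the function name, array shapes, and the start-to-end color gradient are assumptions made for this example.

```python
# Illustrative sketch (assumed interface, not the MotIF implementation):
# draw a time-ordered 2D keypoint trajectory onto the final frame of an
# episode, producing a single image that encodes the path of the motion.
import numpy as np
import cv2


def overlay_keypoint_trajectory(final_frame: np.ndarray,
                                keypoints: np.ndarray) -> np.ndarray:
    """Overlay a tracked keypoint path (e.g., the end-effector) on the final frame.

    final_frame: HxWx3 uint8 BGR image (last observation of the episode).
    keypoints:   Tx2 array of (x, y) pixel coordinates over the T timesteps.
    """
    canvas = final_frame.copy()
    pts = keypoints.round().astype(int)
    num_steps = len(pts)
    for t in range(1, num_steps):
        # Fade the segment color from blue (start) to red (end) so the
        # single image still conveys the direction of motion.
        alpha = t / max(num_steps - 1, 1)
        color = (int(255 * (1 - alpha)), 0, int(255 * alpha))  # BGR
        p0 = (int(pts[t - 1, 0]), int(pts[t - 1, 1]))
        p1 = (int(pts[t, 0]), int(pts[t, 1]))
        cv2.line(canvas, p0, p1, color, thickness=2)
    # Mark the start and end points explicitly.
    cv2.circle(canvas, (int(pts[0, 0]), int(pts[0, 1])), 5, (255, 0, 0), -1)
    cv2.circle(canvas, (int(pts[-1, 0]), int(pts[-1, 1])), 5, (0, 0, 255), -1)
    return canvas


if __name__ == "__main__":
    # Toy usage: a synthetic gray frame and a circular "stirring" motion.
    frame = np.full((256, 256, 3), 200, dtype=np.uint8)
    t = np.linspace(0, 4 * np.pi, 200)
    traj = np.stack([128 + 60 * np.cos(t), 128 + 60 * np.sin(t)], axis=1)
    cv2.imwrite("trajectory_overlay.png", overlay_keypoint_trajectory(frame, traj))
```

Encoding temporal order as a color gradient in one image is one plausible way to keep the input compatible with VLMs that accept a single frame, which is the limitation the abstract identifies.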