Latent Plan Transformer for Trajectory Abstraction: Planning as Latent Space Inference
Main Authors:
Format: Journal Article
Language: English
Published: 07-02-2024
Summary: In tasks aiming for long-term returns, planning becomes essential. We study generative modeling for planning with datasets repurposed from offline reinforcement learning. Specifically, we identify temporal consistency in the absence of step-wise rewards as one key technical challenge. We introduce the Latent Plan Transformer (LPT), a novel model that leverages a latent variable to connect a Transformer-based trajectory generator and the final return. LPT can be learned with maximum likelihood estimation on trajectory-return pairs. In learning, posterior sampling of the latent variable naturally integrates sub-trajectories to form a consistent abstraction despite the finite context. At test time, the latent variable is inferred from an expected return before policy execution, realizing the idea of planning as inference. Our experiments demonstrate that LPT can discover improved decisions from sub-optimal trajectories, achieving competitive performance across several benchmarks, including Gym-Mujoco, Franka Kitchen, Maze2D, and Connect Four. It exhibits capabilities in nuanced credit assignment, trajectory stitching, and adaptation to environmental contingencies. These results validate that latent variable inference can be a strong alternative to step-wise reward prompting.
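The summary describes the model only verbally. One formalization consistent with that description (the notation z, tau, y, beta, gamma is assumed here for illustration, not drawn from the record) treats the trajectory and the final return as conditionally independent given the latent plan:

```latex
% Assumed notation: z = latent plan, \tau = trajectory, y = final return.
% Marginal likelihood of a trajectory-return pair:
p_\theta(\tau, y) = \int p(z)\, p_\beta(\tau \mid z)\, p_\gamma(y \mid z)\, dz,
\qquad \theta = (\beta, \gamma).

% MLE on trajectory-return pairs; with a fixed prior p(z), the gradient is
% an expectation under the posterior, which is where the posterior sampling
% mentioned in the summary enters:
\nabla_\theta \log p_\theta(\tau, y)
  = \mathbb{E}_{p_\theta(z \mid \tau, y)}
    \big[ \nabla_\theta \big( \log p_\beta(\tau \mid z)
      + \log p_\gamma(y \mid z) \big) \big].

% Test-time planning as inference: condition the latent on a desired
% return y^*, then roll out the trajectory generator:
z^* \sim p_\theta(z \mid y = y^*), \qquad \tau \sim p_\beta(\tau \mid z^*).
```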
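For the test-time step, where the latent is inferred from an expected return before execution, a minimal sketch of one way to realize that inference is short-run Langevin dynamics on the latent. Everything below is an assumption for illustration: the N(0, I) prior, the unit-variance Gaussian return head, the module name `return_head`, and the hyperparameters are not taken from the paper.

```python
import torch
import torch.nn as nn

def infer_latent_plan(return_head: nn.Module, y_target: float,
                      latent_dim: int = 128, n_steps: int = 100,
                      step_size: float = 0.1) -> torch.Tensor:
    """Approximate a sample from p(z | y = y_target) with Langevin dynamics
    on log p(z) + log p(y | z), assuming a N(0, I) prior over z and a
    unit-variance Gaussian return head (both modeling assumptions)."""
    z = torch.randn(1, latent_dim)
    y = torch.full((1, 1), y_target)
    for _ in range(n_steps):
        z = z.detach().requires_grad_(True)
        # log N(z; 0, I) and log N(y; return_head(z), I), up to constants.
        log_prior = -0.5 * (z ** 2).sum()
        log_lik = -0.5 * ((return_head(z) - y) ** 2).sum()
        (grad,) = torch.autograd.grad(log_prior + log_lik, z)
        # Langevin step: gradient move plus injected Gaussian noise.
        z = z + 0.5 * step_size ** 2 * grad + step_size * torch.randn_like(z)
    return z.detach()
```

The trained trajectory generator would then be conditioned on the inferred z to produce actions; that decoding step is omitted here because the record gives no architectural detail about it.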
DOI: 10.48550/arxiv.2402.04647