Generating attentive goals for prioritized hindsight reinforcement learning

Bibliographic Details
Published in: Knowledge-Based Systems, Vol. 203, p. 106140
Main Authors: Liu, Peng, Bai, Chenjia, Zhao, Yingnan, Bai, Chenyao, Zhao, Wei, Tang, Xianglong
Format: Journal Article
Language: English
Published: Amsterdam: Elsevier B.V.; Elsevier Science Ltd, 05-09-2020
Description
Summary: Typical reinforcement learning (RL) performs a single task and does not scale to problems in which an agent must perform multiple tasks, such as moving a robot arm to different locations. The multi-goal framework extends typical RL with a goal-conditional value function and policy, whereby the agent pursues different goals in different episodes. By treating a virtual goal as the desired one, and thereby giving the agent rewards more frequently, hindsight experience replay has achieved promising results in the sparse-reward setting of multi-goal RL. However, these virtual goals are sampled uniformly from the experience following the replayed state, regardless of their significance. We propose a novel prioritized hindsight model for multi-goal RL in which the agent is provided with more valuable goals, as measured by the expected temporal-difference (TD) error. An attentive goals generation (AGG) network, consisting of temporal convolutions, multi-head dot-product attention, and a last-attention network, is structured to generate the virtual goals to replay. The AGG network is trained by following the gradient of the TD-error calculated by an actor-critic model, so that it generates goals that maximize the expected TD-error of the replayed transitions. The whole network is fully differentiable and can be learned in an end-to-end manner. The proposed method is evaluated on several robotic manipulation tasks and demonstrates improved sample efficiency and performance.
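The abstract describes the mechanism in enough detail to sketch: a goal-generation network (temporal convolution, multi-head self-attention, and a final attention pooling) maps the goals achieved after a replayed state to a virtual goal, and is trained to increase the TD-error that an actor-critic assigns to the resulting hindsight transition. The code below is a minimal illustrative sketch, not the authors' implementation; the class and function names, tensor shapes, the DDPG-style critic/policy interfaces, and the simplified objective (maximizing the absolute one-step TD-error with a detached bootstrap target, so gradients reach the generator only through Q(s, a, g)) are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class AttentiveGoalGenerator(nn.Module):
    """Temporal conv -> multi-head self-attention -> attention pooling -> virtual goal (sketch)."""

    def __init__(self, goal_dim: int, hidden: int = 64, heads: int = 4):
        super().__init__()
        self.temporal_conv = nn.Conv1d(goal_dim, hidden, kernel_size=3, padding=1)
        self.self_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.pool_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.pool_query = nn.Parameter(torch.zeros(1, 1, hidden))  # learned query for the final attention
        self.out = nn.Linear(hidden, goal_dim)

    def forward(self, achieved_goals: torch.Tensor) -> torch.Tensor:
        # achieved_goals: (batch, T, goal_dim), goals achieved after the replayed state
        h = self.temporal_conv(achieved_goals.transpose(1, 2)).transpose(1, 2)  # (B, T, hidden)
        h, _ = self.self_attn(h, h, h)                     # multi-head self-attention over time
        q = self.pool_query.expand(h.size(0), -1, -1)      # (B, 1, hidden)
        pooled, _ = self.pool_attn(q, h, h)                # attend once more to summarize the sequence
        return self.out(pooled.squeeze(1))                 # virtual goal, (B, goal_dim)


def td_error(critic, policy, s, a, r, s_next, g, gamma=0.98):
    # One-step TD-error of a goal-conditioned actor-critic; the bootstrap target is
    # treated as a constant, so gradients flow to the generator only through Q(s, a, g).
    with torch.no_grad():
        target = r + gamma * critic(s_next, policy(s_next, g), g)
    return target - critic(s, a, g)


def generator_step(gen, gen_opt, critic, policy, batch, reward_fn):
    # Sample a transition plus the achieved goals that followed it, generate a
    # differentiable virtual goal, recompute the hindsight reward, and ascend the
    # magnitude of the expected TD-error (minimizing its negative).
    s, a, s_next, future_achieved = batch
    g_virtual = gen(future_achieved)
    r = reward_fn(s_next, g_virtual)
    loss = -td_error(critic, policy, s, a, r, s_next, g_virtual).abs().mean()
    gen_opt.zero_grad()
    loss.backward()
    gen_opt.step()
```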
ISSN: 0950-7051; 1872-7409
DOI: 10.1016/j.knosys.2020.106140