Improve generative adversarial imitation learning with reward variance regularization

Bibliographic Details
Published in: Machine Learning, Vol. 111, No. 3, pp. 977-995
Main Authors: Zhang, Yi-Feng; Luo, Fan-Ming; Yu, Yang
Format: Journal Article
Language: English
Published: New York: Springer US, 01-03-2022
Springer Nature B.V.
Summary: Imitation learning aims at recovering expert policies from limited demonstration data. Generative Adversarial Imitation Learning (GAIL) applies the generative adversarial learning framework to imitation learning and has shown great potential. GAIL and its variants, however, are found to be highly sensitive to hyperparameters and hard to converge well in practice. One key issue is that the supervised-learning discriminator learns much faster than the reinforcement-learning generator, which makes the generator gradient vanish. Although GAIL is formulated as a zero-sum adversarial game, its ultimate goal is to learn the generator, so the discriminator should act more like a teacher than a real opponent. Therefore, the learning of the discriminator should take into account how the generator can learn. In this paper, we show that enhancing the gradient of the generator training is equivalent to increasing the variance of the fake reward provided by the discriminator output. We thus propose an improved version of GAIL, GAIL-VR, in which the discriminator also learns to avoid generator gradient vanishing through regularization of the fake-reward variance. Experiments on various tasks, including locomotion tasks and Atari games, indicate that GAIL-VR improves training stability and imitation scores.
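
To make the idea concrete, the following is a minimal sketch of how a GAIL discriminator update could include a fake-reward variance regularizer, assuming a PyTorch setup. It is not the authors' implementation: the class and function names (Discriminator, discriminator_loss), the coefficient variance_coef, and the choice of fake reward r(s,a) = -log(1 - D(s,a)) are all illustrative assumptions based only on the abstract above.

```python
# Hypothetical sketch of a GAIL discriminator update with a fake-reward
# variance regularizer, loosely following the idea described in the abstract.
# Names and hyperparameters are illustrative assumptions, not the paper's
# actual GAIL-VR formulation.
import torch
import torch.nn as nn


class Discriminator(nn.Module):
    """Maps (state, action) pairs to a logit; sigmoid(logit) ~ P(expert)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def discriminator_loss(disc, expert_obs, expert_act, policy_obs, policy_act,
                       variance_coef: float = 0.1):
    """Standard GAIL binary-classification loss plus a regularizer that
    encourages high variance of the fake reward on generator samples."""
    bce = nn.BCEWithLogitsLoss()
    expert_logits = disc(expert_obs, expert_act)
    policy_logits = disc(policy_obs, policy_act)

    # Classify expert data as 1, generator (policy) data as 0.
    gail_loss = bce(expert_logits, torch.ones_like(expert_logits)) + \
                bce(policy_logits, torch.zeros_like(policy_logits))

    # Fake reward given to the generator, e.g. r(s,a) = -log(1 - D(s,a)).
    fake_reward = -torch.log(1.0 - torch.sigmoid(policy_logits) + 1e-8)

    # Encourage the discriminator to keep the fake-reward variance large,
    # so the generator's policy-gradient signal does not vanish.
    reward_variance = fake_reward.var()

    return gail_loss - variance_coef * reward_variance
```

Minimizing this loss trades off the usual binary-classification objective against keeping the variance of the generator's fake rewards large; variance_coef is a hypothetical hyperparameter, and the exact regularizer used in GAIL-VR is defined in the paper itself.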
ISSN: 0885-6125; 1573-0565
DOI: 10.1007/s10994-021-06083-7