An Adaptive Policy Evaluation Network Based on Recursive Least Squares Temporal Difference With Gradient Correction

Bibliographic Details
Published in:	IEEE Access, Vol. 6, pp. 7515-7525
Main Authors:	Li, Dazi; Wang, Yuting; Song, Tianheng; Jin, Qibing
Format:	Journal Article
Language:	English
Published:	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01-01-2018
Subjects:	Adaptive algorithms; Adaptive systems; Algorithm design and analysis; Algorithms; Approximation; Approximation algorithms; Basis functions; Computational complexity; Function approximation; Human computer interaction; Human-computer interface; Learning (artificial intelligence); Least squares; Machine learning; Mathematical analysis; Policy evaluation; Recursive least squares temporal difference with gradient correction; Reinforcement learning; Value function approximation

Description
Summary:	Reinforcement learning (RL) is an important machine learning paradigm that can be used to learn from data obtained through human-computer interfaces and interaction in human-centered smart systems. One of the essential problems in RL algorithms is the estimation of value functions, which are usually approximated by linearly parameterized functions. Prior RL algorithms that generalize in this way spend their learning time tuning the linear weights while leaving the basis functions untuned. In fact, the basis functions used in value function approximation also have a significant influence on performance. In this paper, a new adaptive policy evaluation network based on recursive least squares temporal difference (TD) with gradient correction (adaptive RC network) is proposed. The basis functions in the proposed algorithm are adaptively optimized, mainly with respect to their widths. The TD error and the value function are estimated by the RC algorithm and value function approximation. The gradient derived from the squared TD error is used to update the widths of the basis functions. The RC network can therefore adjust its network parameters adaptively, in a self-organizing way, as learning progresses. Empirical results on three RL benchmarks show the performance and applicability of the proposed adaptive RC network.
ISSN:	2169-3536
DOI:	10.1109/ACCESS.2018.2805298
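
The width update described in the summary can be illustrated with a minimal sketch. This is not the paper's adaptive RC network: the record gives no equations, so the sketch assumes Gaussian basis functions over a scalar state, a linear value function V(s) = theta . phi(s), fixed weights theta, and a plain gradient step on the full gradient of 0.5 * (TD error)^2 with respect to each width. The recursive least squares TD update with gradient correction for the weights is omitted, and all function names, parameters, and default values (gamma, lr, min_width) are hypothetical.

```python
import numpy as np


def rbf_features(s, centers, widths):
    """Gaussian basis functions phi_i(s) = exp(-(s - c_i)^2 / (2 * sigma_i^2))."""
    return np.exp(-((s - centers) ** 2) / (2.0 * widths ** 2))


def rbf_width_grad(s, centers, widths):
    """Elementwise derivative d phi_i(s) / d sigma_i."""
    phi = rbf_features(s, centers, widths)
    return phi * (s - centers) ** 2 / widths ** 3


def td_error(s, r, s_next, theta, centers, widths, gamma):
    """TD error delta = r + gamma * V(s') - V(s) for a linear value function."""
    v = theta @ rbf_features(s, centers, widths)
    v_next = theta @ rbf_features(s_next, centers, widths)
    return r + gamma * v_next - v


def width_step(s, r, s_next, theta, centers, widths,
               gamma=0.9, lr=0.05, min_width=1e-3):
    """One gradient step on 0.5 * delta^2 with respect to the widths.

    Assumption: theta is held fixed here; in the paper the weights are
    estimated by the RC algorithm, which this sketch does not implement.
    """
    delta = td_error(s, r, s_next, theta, centers, widths, gamma)
    # d delta / d sigma_i = theta_i * (gamma * dphi_i(s')/dsigma_i - dphi_i(s)/dsigma_i)
    d_delta = theta * (gamma * rbf_width_grad(s_next, centers, widths)
                       - rbf_width_grad(s, centers, widths))
    new_widths = widths - lr * delta * d_delta
    # Keep widths strictly positive so the Gaussians stay well defined.
    return np.maximum(new_widths, min_width), delta
```

A call such as `width_step(0.2, 1.0, 0.7, theta, centers, widths)` returns the updated widths together with the TD error of that transition. In an online setting this step would be interleaved with whatever weight update is in use.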