Kernel-Based Least Squares Temporal Difference With Gradient Correction

Bibliographic Details
Published in:	IEEE Transactions on Neural Networks and Learning Systems, Vol. 27, no. 4, pp. 771-782
Main Authors:	Tianheng Song, Dazi Li, Liulin Cao, Kotaro Hirasawa
Format:	Journal Article
Language:	English
Published:	United States, IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01-04-2016
Subjects:	Algorithm design and analysis; Algorithms; Approximation algorithms; Convergence; Function approximation; Kernel; Kernel method; Least squares approximations; Least squares method; Policy evaluation; Reinforcement learning (RL); Tuning; Value function approximation (VFA)

Description
Summary:	A least squares temporal difference with gradient correction (LS-TDC) algorithm and its kernel-based version (KLS-TDC) are proposed as policy evaluation algorithms for reinforcement learning (RL). LS-TDC is derived from the TDC algorithm. Because TDC is derived by minimizing the mean-square projected Bellman error, LS-TDC has better convergence performance. The least squares technique removes the step-size tuning required by the original TDC and enhances robustness. In KLS-TDC, the kernel method allows feature vectors to be selected automatically, and approximate linear dependence analysis is performed to sparsify the kernel representation. In addition, a policy iteration strategy motivated by KLS-TDC is constructed to solve control learning problems. The convergence and parameter sensitivities of both LS-TDC and KLS-TDC are tested on on-policy learning, off-policy learning, and control learning problems. Experimental results, compared with a series of corresponding RL algorithms, demonstrate that both LS-TDC and KLS-TDC have better approximation and convergence performance, higher sample efficiency, a smaller parameter-tuning burden, and less sensitivity to parameters.
ISSN:	2162-237X, 2162-2388
DOI:	10.1109/TNNLS.2015.2424233
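
Note:	The summary mentions approximate linear dependence analysis for kernel sparsification. The sketch below illustrates the generic form of that test as it is commonly described for kernel methods; it is not taken from the article. The Gaussian kernel, its width, the threshold value, and the names gaussian_kernel and ald_dictionary are all assumptions made for illustration, and the article's own kernel, threshold, and update scheme may differ.

```python
import numpy as np


def gaussian_kernel(x, y, width=1.0):
    """Gaussian (RBF) kernel; the kernel type and width are assumptions."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * width ** 2)))


def ald_dictionary(samples, threshold=0.1, kernel=gaussian_kernel):
    """Build a sparse dictionary with an approximate linear dependence test.

    A sample joins the dictionary only if its kernel feature cannot be
    approximated, to within `threshold` (squared error), by a linear
    combination of the features of the samples already stored.
    """
    dictionary = []
    for x in samples:
        if not dictionary:
            # The first sample always seeds the dictionary.
            dictionary.append(x)
            continue
        # Gram matrix of the current dictionary and kernel vector of x.
        gram = np.array([[kernel(a, b) for b in dictionary] for a in dictionary])
        k_vec = np.array([kernel(a, x) for a in dictionary])
        # Coefficients of the best linear approximation of x's feature.
        coeffs = np.linalg.solve(gram, k_vec)
        # Squared residual of that approximation in feature space.
        residual = kernel(x, x) - k_vec @ coeffs
        if residual > threshold:
            dictionary.append(x)
    return dictionary


if __name__ == "__main__":
    states = [[0.0], [0.0], [5.0], [0.01]]
    print(ald_dictionary(states, threshold=0.1))
```

For clarity the sketch rebuilds and solves the Gram system for every sample; an online implementation would more plausibly update the inverse Gram matrix recursively as the dictionary grows.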