A unified framework to control estimation error in reinforcement learning

Bibliographic Details
Published in: Neural Networks, Vol. 178, p. 106483
Main Authors: Zhang, Yujia; Li, Lin; Wei, Wei; Lv, Yunpeng; Liang, Jiye
Format: Journal Article
Language:English
Published: Elsevier Ltd, United States, 01-10-2024
Description
Summary: In reinforcement learning, accurate estimation of the Q-value is crucial for acquiring an optimal policy. However, current successful Actor–Critic methods still suffer from underestimation bias. Additionally, a significant estimation bias exists regardless of the method used in the critic initialization phase. To address these challenges and reduce estimation errors, we propose CEILING, a simple and compatible framework that can be applied to any model-free Actor–Critic method. The core idea of CEILING is to evaluate the superiority of different estimation methods by incorporating the true Q-value, calculated using Monte Carlo, during the training process. CEILING consists of two implementations: the Direct Picking Operation and the Exponential Softmax Weighting Operation. The first selects the optimal method at each fixed step and applies it in subsequent interactions until the next selection. The second utilizes a nonlinear weighting function that dynamically assigns larger weights to more accurate methods. Theoretically, we demonstrate that our methods provide a more accurate and stable Q-value estimation, and we analyze the upper bound of the estimation bias. Based on the two implementations, we propose specific algorithms and their variants, which achieve superior performance on several benchmark tasks.

Highlights:
• A framework that can combine any Actor–Critic estimation methods is proposed.
• Two algorithms and their variants are introduced within the framework.
• Elegant weighting methods are designed for more accurate value estimation.
• Estimation errors and bias upper bounds are analyzed for different methods.
• State-of-the-art performance is achieved by the proposed methods on multiple benchmarks.
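
This record does not include the paper's code; the sketch below is only a minimal, illustrative reading of the two operations described in the summary, assuming the mean absolute error against the Monte Carlo return as the accuracy measure. The function names (monte_carlo_return, direct_picking, softmax_weighting) and the temperature parameter are hypothetical, not taken from the paper.

    import numpy as np

    def monte_carlo_return(rewards, gamma=0.99):
        # Discounted return computed backward over a completed episode;
        # this plays the role of the "true" Q-value reference.
        g, returns = 0.0, []
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        return np.array(returns[::-1])

    def direct_picking(q_estimates, q_true):
        # Direct Picking Operation (sketch): choose the estimator whose
        # predictions are closest to the Monte Carlo reference; that
        # estimator is then reused until the next selection step.
        errors = [np.mean(np.abs(q - q_true)) for q in q_estimates]
        return int(np.argmin(errors))

    def softmax_weighting(q_estimates, q_true, temperature=1.0):
        # Exponential Softmax Weighting Operation (sketch): blend all
        # estimators, with exponentially larger weight on those showing
        # a smaller estimation error.
        errors = np.array([np.mean(np.abs(q - q_true)) for q in q_estimates])
        logits = -errors / temperature           # lower error -> larger logit
        weights = np.exp(logits - logits.max())  # numerically stable softmax
        weights /= weights.sum()
        return sum(w * q for w, q in zip(weights, q_estimates))

For example, with two candidate estimates per state (say, a single-critic estimate and a clipped double-Q estimate), direct_picking([q_single, q_clipped], q_true) returns the index of the more accurate estimator, while softmax_weighting returns a weighted combination of both.
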
ISSN: 0893-6080
1879-2782
DOI: 10.1016/j.neunet.2024.106483