A unified framework to control estimation error in reinforcement learning
Published in: Neural networks, Vol. 178, p. 106483
Main Authors: , , , ,
Format: Journal Article
Language: English
Published: United States: Elsevier Ltd, 01-10-2024
Summary:
In reinforcement learning, accurate estimation of the Q-value is crucial for acquiring an optimal policy. However, current successful Actor–Critic methods still suffer from underestimation bias. Moreover, a significant estimation bias exists regardless of the method used in the critic initialization phase. To address these challenges and reduce estimation errors, we propose CEILING, a simple and compatible framework that can be applied to any model-free Actor–Critic method. The core idea of CEILING is to evaluate how well different estimation methods perform by comparing them against the true Q-value, computed via Monte Carlo, during training. CEILING comprises two implementations: the Direct Picking Operation and the Exponential Softmax Weighting Operation. The first selects the most accurate method at fixed intervals and applies it in subsequent interactions until the next selection. The second uses a nonlinear weighting function that dynamically assigns larger weights to more accurate methods. Theoretically, we show that our methods provide more accurate and stable Q-value estimation, and we analyze the upper bound of the estimation bias. Building on the two implementations, we propose specific algorithms and their variants, which achieve superior performance on several benchmark tasks.
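To make the two operations concrete, below is a minimal Python sketch, not the paper's implementation: the function names, the `temperature` parameter, and the example numbers are illustrative assumptions, and the sketch presumes the candidate Q-estimates and the Monte Carlo return for the same state-action pair are already available.

```python
import numpy as np

def estimation_errors(q_estimates, mc_return):
    """Absolute error of each candidate Q-estimate against the
    Monte Carlo return, used here as the ground-truth Q-value."""
    return np.abs(np.asarray(q_estimates, dtype=float) - mc_return)

def direct_picking(q_estimates, mc_return):
    """Direct Picking Operation (sketch): pick the index of the
    estimation method closest to the Monte Carlo return; the caller
    would keep using that method until the next selection step."""
    return int(np.argmin(estimation_errors(q_estimates, mc_return)))

def softmax_weighting(q_estimates, mc_return, temperature=1.0):
    """Exponential Softmax Weighting Operation (sketch): a softmax
    over negative errors assigns larger weights to more accurate
    methods; returns the weighted Q-value."""
    errors = estimation_errors(q_estimates, mc_return)
    logits = -errors / temperature      # smaller error -> larger logit
    logits -= logits.max()              # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()
    return float(np.dot(weights, np.asarray(q_estimates, dtype=float)))

# Hypothetical usage: two candidate estimates (e.g. from a clipped
# double-Q critic pair and a single critic) for one state-action pair.
q_candidates = [3.2, 4.1]   # candidate Q-estimates (made-up numbers)
mc_q = 3.9                  # Monte Carlo return for the same pair
best_idx = direct_picking(q_candidates, mc_q)        # -> 1
blended_q = softmax_weighting(q_candidates, mc_q)    # weighted toward 4.1
```

Subtracting the maximum logit before exponentiation is a standard numerical-stability trick and leaves the softmax weights unchanged.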
Highlights:
- A framework that can combine any Actor–Critic estimation methods is proposed.
- Two algorithms and their variants are introduced in the framework.
- Elegant weighting methods are designed for more accurate value estimation.
- Estimation errors and bias upper bounds are analyzed for different methods.
- State-of-the-art performance is achieved by the proposed methods on multiple benchmarks.
ISSN: 0893-6080
EISSN: 1879-2782
DOI: 10.1016/j.neunet.2024.106483