Partially observable environment estimation with uplift inference for reinforcement learning based recommendation

Bibliographic Details
Published in: Machine Learning, Vol. 110, no. 9, pp. 2603–2640
Main Authors: Shang, Wenjie, Li, Qingyang, Qin, Zhiwei, Yu, Yang, Meng, Yiping, Ye, Jieping
Format: Journal Article
Language: English
Published: New York: Springer US, 01-09-2021
Springer Nature B.V.
Description
Summary: Reinforcement learning (RL) aims to find the best policy model for decision making and has proven powerful for sequential recommendation. Training a policy with RL, however, requires an environment to interact with, and in many real-world applications training in the real environment incurs an unbearable cost due to exploration. Estimating the environment from past data is thus an appealing way to unleash the power of RL in these applications. Estimating the environment essentially means extracting the causal effect model from the data. Real-world applications, however, are often too complex to offer fully observable environment information, so there may well be unobserved variables behind the data that obstruct an effective estimation of the environment. In this paper, by treating the hidden variables as a hidden policy, we propose a partially-observed multi-agent environment estimation (POMEE) approach to learn the partially observed environment. To better extract the causal relationship between actions and rewards, we design a deep uplift inference network (DUIN) model to learn the causal effects of different actions. By implementing the environment model with the DUIN structure, we propose a POMEE with uplift inference (POMEE-UI) approach to generate a partially observed environment with a causal reward mechanism. We analyze the effect of our method in both artificial and real-world environments. We first use an artificial recommender environment, abstracted from a real-world application, to verify the effectiveness of POMEE-UI. We then test POMEE-UI in the real application of Didi Chuxing. Experimental results show that POMEE-UI can effectively estimate the hidden variables, leading to a more reliable virtual environment. Online A/B testing results show that POMEE can derive a well-performing recommender policy in the real-world application.
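
The summary above describes uplift inference over actions only at a high level. As a rough illustration (not the authors' DUIN architecture, whose details are in the paper itself), the sketch below shows one common way to estimate per-action causal effects for a recommender state: a shared encoder with one reward head per action, where the uplift of an action is the difference between its predicted reward and that of a baseline action. All class and function names here are hypothetical.

```python
# Minimal uplift-style sketch, assuming a PyTorch setup; hypothetical names throughout.
import torch
import torch.nn as nn

class UpliftNet(nn.Module):
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Shared encoder for the (partially observed) state features.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One reward head per candidate action (e.g., per recommendation choice).
        self.heads = nn.ModuleList([nn.Linear(hidden, 1) for _ in range(n_actions)])

    def predicted_reward(self, state: torch.Tensor, action: int) -> torch.Tensor:
        return self.heads[action](self.encoder(state)).squeeze(-1)

    def uplift(self, state: torch.Tensor, action: int, baseline: int = 0) -> torch.Tensor:
        # Estimated causal effect of taking `action` instead of `baseline` in this state.
        return self.predicted_reward(state, action) - self.predicted_reward(state, baseline)

def training_step(model, optimizer, states, actions, rewards):
    # Fit each head only on logged transitions where its action was actually taken.
    optimizer.zero_grad()
    preds = torch.stack([
        model.predicted_reward(s.unsqueeze(0), int(a)).squeeze(0)
        for s, a in zip(states, actions)
    ])
    loss = nn.functional.mse_loss(preds, rewards)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, an environment model could query `uplift` to assign rewards with a causal flavor rather than purely correlational predictions; how the paper combines this with the hidden-policy estimation of POMEE is described in the full text.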
ISSN: 0885-6125
1573-0565
DOI: 10.1007/s10994-021-05969-w