Lean Reinforcement Learning


Despite huge successes in breaking human records, current training of RL agents is prohibitively expensive in terms of time, GPUs, and samples. For example, reaching human-level performance on Atari games, a common benchmark in modern RL, requires hundreds of millions or even billions of environment steps. That is only feasible in simulation, not in real-world problems such as robotics or industrial planning. Sample inefficiency is exacerbated in real environments, which can be stochastic, partially observable, noisy, or long-horizon. Another issue is model complexity: RL algorithms are becoming more complicated and come with numerous hyperparameters that must be tuned carefully, which further increases the cost of training RL agents.


Reinforcement Learning (source: Wikipedia)

Memory is the key to sample efficiency

One typical example of memory usage in RL is the replay buffer. First introduced in Deep Q-Networks, the replay buffer stores past observations (transitions) to build a database for training the value network via supervised learning. This does not enable sample efficiency immediately, however, since training networks with gradient descent is slow, and low-quality value networks often induce unreasonable policies. A complementary solution is to augment the training with explicit working or episodic memory. Working memory helps most in partially observable settings by maintaining the trajectory history that is critical for making decisions at later timesteps. In a different light, episodic memory enables quick reuse of past good behaviors, directly encouraging actions that reaped high rewards in the past. Both reduce training complexity. These are only a few of the many ways memory emerges in RL systems. In the next article, we will tour fundamental memory-based approaches that help RL achieve sample efficiency.
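To make the replay-buffer idea concrete, here is a minimal sketch in Python. The class and parameter names are illustrative rather than taken from any particular library: the agent pushes transitions as it interacts with the environment, and random mini-batches are later sampled to fit the value network.

```python
import random
from collections import deque

# A minimal replay buffer: store past transitions, then sample random
# mini-batches so the value network can be trained with supervised updates.
class ReplayBuffer:
    def __init__(self, capacity=100_000):
        # oldest transitions are evicted once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniform random sampling breaks temporal correlations in the data
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

# Usage (hypothetical training loop):
# buffer = ReplayBuffer()
# buffer.add(s, a, r, s_next, done)
# states, actions, rewards, next_states, dones = buffer.sample()
```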

Some of our work on this topic