Speeding up DQN on PyTorch: how to solve Pong in 30 minutes

Intro

Some time ago I implemented all the models from the article Rainbow: Combining Improvements in Deep Reinforcement Learning using PyTorch and my small RL library called PTAN. The code for all eight systems is here if you’re curious.

Initial numbers

As a starting point, I’ve taken the classical DQN version with the following hyperparameters:

  • Environment PongNoFrameskip-v4 from gym 0.9.3 was used,
  • Epsilon decays from 1.0 to 0.02 over the first 100k frames, then is kept at 0.02,
  • Target network synced every 1k frames,
  • Simple replay buffer of size 100k, prefilled with 10k transitions before training starts,
  • Gamma=0.99,
  • Adam with learning rate 1e-4,
  • Every training step, one transition from the environment is added to the replay buffer and training is performed on 32 transitions uniformly sampled from the replay buffer,
  • Pong is considered solved when the mean score for the last 100 games becomes larger than 18.
On top of that, the standard set of Atari wrappers from the OpenAI baselines project was applied (a sketch of the resulting wrapper stack follows the list):

  • EpisodicLifeEnv: ends the episode on every life lost, which helps the training converge faster,
  • NoopResetEnv: performs a random number of NOOP actions on reset,
  • MaxAndSkipEnv: repeats the chosen action for 4 Atari environment frames to speed up training,
  • FireResetEnv: presses FIRE at the beginning; some environments require this to start the game,
  • ProcessFrame84: converts the frame to grayscale and scales it down to 84×84 pixels,
  • FrameStack: passes the last 4 frames as the observation,
  • ClippedRewardWrapper: clips the reward to the -1..+1 range.
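
For reference, here is a minimal sketch of how such a wrapper stack can be composed. The class names follow the list above, but the wrappers module, the constructor arguments (noop_max, skip) and the exact ordering are assumptions, so adjust them to whatever copy of the baselines/PTAN wrappers you actually use:

import gym
# hypothetical local module holding the wrapper classes listed above
from wrappers import (EpisodicLifeEnv, NoopResetEnv, MaxAndSkipEnv,
                      FireResetEnv, ProcessFrame84, FrameStack,
                      ClippedRewardWrapper)

def make_env(env_name="PongNoFrameskip-v4"):
    env = gym.make(env_name)
    env = EpisodicLifeEnv(env)            # end the episode on every lost life
    env = NoopResetEnv(env, noop_max=30)  # random number of NOOPs on reset
    env = MaxAndSkipEnv(env, skip=4)      # repeat the action for 4 frames
    env = FireResetEnv(env)               # press FIRE to start the game
    env = ProcessFrame84(env)             # grayscale, 84x84
    env = FrameStack(env, 4)              # stack the last 4 frames
    env = ClippedRewardWrapper(env)       # clip the reward to -1..+1
    return env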
Convergence dynamics of the original version

Change 1: larger batch size + several steps

The first idea we usually apply to speed up Deep Learning training is a larger batch size. It applies to Deep Reinforcement Learning as well, but you need to be careful here. In the normal Supervised Learning case the simple rule “a larger batch is better” usually holds: you just increase the batch as far as your GPU memory allows, and a larger batch normally means more samples processed per unit of time, thanks to the enormous GPU parallelism.

In RL, however, two things happen simultaneously during training:

  1. your network is trained to give better predictions on the current data,
  2. your agent is exploring the environment.
So the change here is twofold: increase the batch size and, at the same time, play several steps in the environment on every training iteration. The resulting speeds for various numbers of steps:

  • steps=1: speed 154 f/s (obviously, it’s the same as the original version)
  • steps=2: speed 200 f/s (+30%)
  • steps=3: speed 212 f/s (+37%)
  • steps=4: speed 227 f/s (+47%)
  • steps=5: speed 228 f/s (+48%)
  • steps=6: speed 232 f/s (+50%)
Runs with steps varying from 1 to 6
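
In code, one way this change can look is the sketch below. The agent, buffer and train_on_batch names are hypothetical placeholders for whatever replay buffer and training function you already have; here the batch is simply scaled with the number of steps:

BASE_BATCH_SIZE = 32

def train_loop(agent, buffer, train_on_batch, steps=4):
    # agent.play_step() is assumed to return one (s, a, r, s', done) transition
    while True:
        # feed the buffer with several fresh transitions per optimization step
        for _ in range(steps):
            buffer.add(agent.play_step())
        # sample a batch scaled with the number of steps, so the ratio between
        # data generation and data consumption stays close to the original
        batch = buffer.sample(BASE_BATCH_SIZE * steps)
        train_on_batch(batch)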

Change 2: play and train in separate processes

In this step we’re going to look at our training loop, which is basically a repetition of the following steps:

  1. play N steps in the environment using the current network to choose actions,
  2. put observations from those steps into replay buffer,
  3. randomly sample batch from replay buffer,
  4. train on this batch.
Serial version
The only thing that connects playing (steps 1 and 2) with training (steps 3 and 4) is the replay buffer, so we can split the loop into two separate processes, as sketched below:

  • the first one communicates with the environment, feeding the replay buffer with fresh data,
  • the second samples training batches from the replay buffer and performs the training.
Parallel version
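
A rough sketch of this split with torch.multiprocessing follows. Agent, ReplayBuffer, build_net and train_on_batch are hypothetical placeholders (make_env is the wrapper-stack builder sketched earlier), and the details of weight sharing depend on whether the network lives on CPU or GPU; the point is the structure: the play process pushes fresh transitions into a queue, and the training process drains that queue into its replay buffer before every optimization step.

import torch.multiprocessing as mp

def play_proc(make_env, net, queue):
    # producer: plays the environment with the current (shared) network
    env = make_env()
    agent = Agent(net, env)              # hypothetical agent wrapper
    while True:
        queue.put(agent.play_step())     # ship every fresh transition to the trainer

def train_proc(net, queue, buffer, batch_size=32):
    # consumer: drains fresh transitions into the buffer, then does one training step
    while True:
        while not queue.empty():
            buffer.add(queue.get())
        if len(buffer) >= 10000:         # respect the initial 10k prefill
            train_on_batch(net, buffer.sample(batch_size))

if __name__ == "__main__":
    mp.set_start_method("spawn")         # fork does not play well with CUDA
    net = build_net()                    # hypothetical network constructor
    net.share_memory()                   # make the weights visible to the play process
    queue = mp.Queue(maxsize=1000)
    mp.Process(target=play_proc, args=(make_env, net, queue)).start()
    train_proc(net, queue, ReplayBuffer(capacity=100000))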

Change 3: async cuda transfers

The next step is simple: every time we call the cuda() method of a Tensor, we pass the async=True argument, which disables waiting for the transfer to complete. It won’t give you a very impressive speedup, but it sometimes gains you a bit and is very simple to implement.
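
For reference, that looks like the snippet below. Two caveats: in modern PyTorch the argument has been renamed to non_blocking=True (async became a reserved word in Python), and the copy only overlaps with computation when the source tensor sits in pinned (page-locked) host memory:

import torch

batch = torch.zeros(32, 4, 84, 84).pin_memory()  # page-locked host memory

# old PyTorch, as used at the time of the article:
# batch_gpu = batch.cuda(async=True)

# current PyTorch equivalent:
batch_gpu = batch.cuda(non_blocking=True)        # returns without waiting for the copy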

Change 4: latest Atari wrappers

As I’ve said before, the original version of DQN used some old Atari wrappers from the OpenAI baselines project. Several days ago those wrappers were changed in a commit named “change atari preprocessing to use faster opencv”, which is definitely worth trying.
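
If you want to try them, the updated module can be pulled straight from the baselines repo; around that commit it exposed make_atari and wrap_deepmind helpers roughly like this (double-check the exact keyword arguments in your checkout, they may have changed since):

from baselines.common.atari_wrappers import make_atari, wrap_deepmind

env = make_atari("PongNoFrameskip-v4")                         # NOOP reset + frame skip
env = wrap_deepmind(env, frame_stack=True, clip_rewards=True)  # the rest of the preprocessing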

Summary

Thanks for reading!
