My Deep RL book has been published

Max Lapan
Jul 8, 2018

Hi!

Almost a year ago I was contacted by the publisher Packt with a proposal to write a practical book about modern Deep Reinforcement Learning. For me, being just a self-educated Deep RL enthusiast, it was a slightly scary decision, but after some hesitation I agreed, optimistically thinking "it's going to be a fun experience".

It took almost a year, and it turned out to be much more than that: not only lots of fun, but also lots of new knowledge about the field, tons of papers studied, and methods implemented and experimented with.

I'm not going to say it was a completely smooth experience, of course not: no weekends or free time, the constant fear of "writing something stupid", and chasing chapter deadlines (two weeks per chapter, example code included). But overall it was a positive and very interesting experience.

And finally: TADA! The book has been published: Deep Reinforcement Learning Hands-On.

Before I quickly skim through the chapters, let me first say a few words about the idea behind the book.

When I started to experiment in the RL field three years ago, these were the sources of information available (maybe I've missed something minor, but these were the most important ones), and all of them are very far from practice:

  • The Sutton and Barto book, also known as "the RL book", gives only the theoretical foundation of the field.
  • Papers are published almost every day, but they very rarely contain links to actual code: only formulas and algorithms. If you're lucky, the hyperparameters are given.
  • David Silver's course, taught at UCL in 2015, gives a very good overview of the methods and the intuition behind them, but, again, theory dominates over practice.

At the same time, I was fascinated by the DeepMind paper ("A neural net can learn how to play Atari from pixels! Wow!") and had a feeling that lots of practical value hides behind the cold theory. So I spent lots of time learning the theory, implementing various methods, and debugging them. As you may guess, it's not an easy process: you can waste a couple of weeks tuning a method only to discover that your implementation is wrong (or, even worse, that you misunderstood the formula). I'm not saying this way of learning is a waste of time; in fact, I think it's the most reliable way to become familiar with something. But it takes tons of time.

Two years later, when I started to write, my basic intention was to give solid, practical information about RL methods to somebody who is getting familiar with this fascinating field, just like my past self :).

Now, a bit about the book itself. The main orientation is practice, and the book tries to keep the amount of theory and formulas to a minimum. It includes the key formulas, but no proofs are given, and an intuitive understanding of what's going on gets much more attention than rigor.

At the same time, some basic knowledge of Deep Learning and statistics is assumed. The book contains a chapter with a PyTorch overview (as all the examples use PyTorch), but this chapter is not meant to be a self-contained source of information about neural networks. If you've never heard of loss and activation functions before, you should start with other books; there are plenty of them nowadays.

The book has tons of examples of varying complexity, starting with very simple ones (the CrossEntropy method on the CartPole environment takes ~100 lines of Python) and finishing with medium-sized projects like AlphaGo Zero or an RL agent that trades stocks. The example code is on GitHub and amounts to more than 14k lines of Python in total.
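To give an idea of how compact the simplest examples are, here is a minimal sketch (not taken from the book's repository) of the classic Gym interaction loop: a random agent playing a single CartPole episode.

```python
import gym

# A random agent playing one CartPole episode (classic Gym API, where
# step() returns four values). Illustrative only, not the book's code.
env = gym.make("CartPole-v0")
obs = env.reset()
total_reward = 0.0

while True:
    action = env.action_space.sample()       # sample a random action
    obs, reward, done, _ = env.step(action)  # apply it to the environment
    total_reward += reward
    if done:
        break

print("Episode reward: %.2f" % total_reward)
```

The book's examples, of course, replace the random action choice with a trained neural network, but the surrounding loop stays just as small.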

In total there are 18 chapters covering the most important aspects of modern Deep RL:

  • Chapter 1: Gives an introduction to the Reinforcement Learning model, shows how it differs from supervised and unsupervised learning, and presents the central mathematical model of RL: Markov Decision Processes (MDPs). MDPs are introduced step by step, starting from Markov Chains, which are turned into Markov Reward Processes by adding rewards, and, finally, full Markov Decision Processes are obtained by adding the agent's actions to the picture.
  • Chapter 2: Covers OpenAI Gym, the unifying API for RL, which provides tons of environments, including Atari games, classical problems like CartPole, continuous control problems, and others.
  • Chapter 3: Provides a quick overview of the PyTorch API. This chapter is not supposed to be a full DL tutorial; rather, it lays the common ground for the later chapters. If you've used another Deep Learning toolkit, it should give you a good enough introduction to the elegant PyTorch model to understand the examples in the upcoming chapters. At the end of the chapter, a simple GAN is trained to generate and discriminate Atari screenshots from different games.
  • Chapter 4: Covers one of the simplest, but nevertheless quite powerful, methods: CrossEntropy. In this chapter, the first network is trained to solve the CartPole environment.
  • Chapter 5: Starts part 2 of the book, which is dedicated to the Value Iteration family of methods. In chapter 5, the simple case of tabular learning with the Bellman equation is used to solve the FrozenLake environment (a tiny value-iteration sketch follows this chapter list).
  • Chapter 6: Introduces DQN and applies it to Atari games. The architecture of the agent is the same as in the famous DeepMind paper.
  • Chapter 7: Covers several modern DQN extensions that improve the stability and performance of the basic DQN. The chapter follows the methods from "Rainbow: Combining Improvements in Deep Reinforcement Learning" and implements all of them, explaining the ideas and intuition behind each. Methods covered: N-step DQN, Double DQN, Noisy Networks, Prioritized Replay Buffer, Dueling DQN, and Categorical DQN. Finally, all of the methods above are combined into a single implementation, the same way it was done in the Rainbow paper.
  • Chapter 8: Provides the first medium-sized project in the book, showing the practical aspects of applying RL to real problems. In this chapter, a stock-trading agent is trained using DQN.
  • Chapter 9: Opens part 3 of the book, dedicated to the Policy Gradient family of methods. The chapter introduces PG methods and their strong and weak sides compared to the already familiar Value Iteration methods. The first method covered from the family is REINFORCE.
  • Chapter 10: Describes how to fight one of the most serious issues of RL: the variance of the policy gradient. After experimenting with PG baselines, the Actor-Critic method is introduced.
  • Chapter 11: Talks about ways to parallelise Actor-Critic on modern hardware. Two approaches are implemented: sample parallelism and gradient parallelism.
  • Chapter 12: The second practical example, covering NLP-related tasks. In this chapter, a simple chatbot is trained with RL methods on the Cornell Movie-Dialogs Corpus.
  • Chapter 13: Another practical example, dedicated to web automation using MiniWoB as a platform. Unfortunately, MiniWoB was abandoned by OpenAI, so it's hard to find information about it (some leftovers are here and here). But the idea of MiniWoB is brilliant, so in this chapter I show how to set it up and train an agent to solve some of the problems. Human recordings are also incorporated into the training process to increase training speed and the agent's performance (but this required a bit of VNC hacking).
  • Chapter 14: Starts the final part 4 of the book, which covers more advanced methods and techniques. Chapter 14 is about continuous control problems and uses the A3C, DDPG, and D4PG methods to solve several PyBullet environments.
  • Chapter 15: Talks more about continuous control problems and introduces the Trust Region techniques used in TRPO, PPO, and ACKTR.
  • Chapter 16: Dedicated to gradient-free (or black-box) methods in RL, which are supposed to be a more scalable alternative to DQN and PG methods. Evolution Strategies and Genetic Algorithms are implemented to solve several continuous control problems.
  • Chapter 17: Covers model-based approaches to RL and DeepMind's attempt to bridge the gap between model-free and model-based methods. In this chapter, the I2A agent is implemented to solve the Breakout game using imagination.
  • Chapter 18: The final chapter of the book is dedicated to the AlphaGo Zero method, which is applied to the game Connect 4. The resulting agent is then used in a Telegram bot to perform human validation of the results.
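As promised in the chapter 5 item above, here is a tiny, hypothetical sketch of tabular value iteration with the Bellman backup on FrozenLake. It is not the book's code: the book builds its tables from sampled experience, while this sketch cheats and reads the environment's transition table directly, just to show the backup itself (the environment id may also differ depending on your Gym version).

```python
import gym
import numpy as np

GAMMA = 0.9

env = gym.make("FrozenLake-v0")
n_states = env.observation_space.n
n_actions = env.action_space.n
# Transition table of the toy-text environment:
# P[s][a] is a list of (probability, next_state, reward, done) tuples.
P = env.unwrapped.P

values = np.zeros(n_states)
for _ in range(100):  # a fixed number of sweeps is enough for this tiny MDP
    for s in range(n_states):
        # Bellman optimality backup:
        # V(s) = max_a sum_{s'} p(s'|s,a) * (r + gamma * V(s'))
        q_values = [
            sum(p * (r + GAMMA * values[s2]) for p, s2, r, _ in P[s][a])
            for a in range(n_actions)
        ]
        values[s] = max(q_values)

print(values.reshape(4, 4))  # state values laid out on the 4x4 default map
```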

So, that’s it. The book is here: Deep Reinforcement Learning Hands-On. I’ll be glad to hear your opinion about it!
