Summary: Noisy Networks for Exploration

Max Lapan
2 min read · Oct 31, 2017

Original article

This article from DeepMind addresses a fundamental problem of Reinforcement Learning: the exploration/exploitation dilemma. The issue arises from the fact that our agent needs to keep a balance between exploring the environment and exploiting what it has already learned from that exploration.

There are two “classic” approaches to this task:

  1. epsilon-greedy: with some probability epsilon (given as a hyperparameter), your agent takes a random step instead of acting according to the policy it has learned. It's common practice to start training with epsilon equal to 1 and slowly decrease it to some small value, like 0.1 or 0.02 (see the sketch after this list).
  2. entropy regularisation: used in policy gradient methods, where we add the entropy of our policy to the loss function, punishing the model for being too certain about its actions.
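
For reference, here is a minimal sketch of epsilon-greedy action selection (the `q_values` array and `epsilon` value are just placeholders, not code from the paper):

```python
import random

import numpy as np


def select_action(q_values, epsilon):
    # With probability epsilon take a random action,
    # otherwise act greedily with respect to the learned Q-values.
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))
```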

There are other ways to handle this, but these two are the most commonly used. Neither is perfect: both need to be tuned to the environment, and neither takes into account the current situation the agent is experiencing.

In the paper, DeepMind propose a very simple and surprisingly efficient way of tackling the above issue. The method basically consists of adding Gaussian noise to the last (fully-connected) layers of the network. The parameters of this noise can be adjusted by the model during training, which allows the agent to decide when and in what proportion it wants to introduce uncertainty into its weights.
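
In other words, every noisy weight is sampled as mu + sigma * eps, where mu and sigma are trained along with the rest of the network and eps is redrawn from a standard normal distribution. A minimal sketch of this parameterisation (the shapes and the initial sigma value are illustrative, not taken from the paper):

```python
import torch

# Learnable parameters of a single noisy weight matrix (shapes are illustrative).
mu = torch.zeros(128, 64, requires_grad=True)             # learned mean
sigma = torch.full((128, 64), 0.017, requires_grad=True)  # learned noise scale

# Fresh noise is sampled for every forward pass; gradients flow into mu and sigma.
eps = torch.randn(128, 64)
weight = mu + sigma * eps
```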

In the paper, they applied this method to both the DQN and A3C algorithms, getting rid of epsilon-greedy and entropy regularisation. For both methods, a major improvement in the final score was obtained.

In the paper, they experiment with two ways to introduce the noise into the model:

  1. Independent Gaussian Noise: every weight of the noisy layer is independent and has its own mu and sigma learned by the model.
  2. Factorised Gaussian Noise: we keep two noise vectors, one with the length of the input and one with the length of the output. A special function is applied to both vectors, and their outer product gives a random matrix which is added to the weights (see the sketch after this list).
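
To make the factorised variant concrete, here is a rough sketch of how such a layer could look in PyTorch (a simplified illustration, not the code from the repository linked below; the transform f(x) = sign(x) * sqrt(|x|) follows the paper, while the initialisation constant is just a typical choice):

```python
import math

import torch
import torch.nn as nn


class NoisyFactorizedLinear(nn.Linear):
    """Linear layer with factorised Gaussian noise (simplified sketch)."""

    def __init__(self, in_features, out_features, sigma_zero=0.5, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        sigma_init = sigma_zero / math.sqrt(in_features)
        # Learnable noise scales for the weights (and bias).
        self.sigma_weight = nn.Parameter(
            torch.full((out_features, in_features), sigma_init))
        if bias:
            self.sigma_bias = nn.Parameter(torch.full((out_features,), sigma_init))
        # Noise buffers: resampled on every forward pass, not trained.
        self.register_buffer("eps_input", torch.zeros(1, in_features))
        self.register_buffer("eps_output", torch.zeros(out_features, 1))

    @staticmethod
    def _f(x):
        # The transform from the paper: f(x) = sign(x) * sqrt(|x|)
        return torch.sign(x) * torch.sqrt(torch.abs(x))

    def forward(self, x):
        self.eps_input.normal_()
        self.eps_output.normal_()
        eps_in = self._f(self.eps_input)
        eps_out = self._f(self.eps_output)
        # Outer product of the two vectors gives the full noise matrix.
        noise_w = eps_out @ eps_in
        weight = self.weight + self.sigma_weight * noise_w
        bias = self.bias
        if bias is not None:
            bias = bias + self.sigma_bias * eps_out.squeeze(1)
        return nn.functional.linear(x, weight, bias)
```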

I’ve implemented both methods in PyTorch. The layers, which can be used as a drop-in replacement for nn.Linear, are here: https://github.com/Shmuma/ptan/blob/master/samples/rainbow/lib/dqn_model.py

Here is a working example which implements NoisyNets in a DQN model using my small RL library PTAN: https://github.com/Shmuma/ptan/blob/master/samples/rainbow/04_dqn_noisy_net.py
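
As a usage illustration (a hypothetical network head reusing the NoisyFactorizedLinear sketch from above, not the actual sample code), the noisy layer simply replaces nn.Linear in the fully-connected part of the DQN, and no epsilon-greedy schedule is needed anymore:

```python
import torch.nn as nn

n_actions = 6  # e.g. the number of actions in Pong

# A hypothetical DQN head on top of the usual convolutional feature extractor
# (3136 = 64 * 7 * 7 features for an 84x84 input). The noisy layers take over
# exploration, so actions can be chosen greedily without an epsilon schedule.
head = nn.Sequential(
    NoisyFactorizedLinear(3136, 512),
    nn.ReLU(),
    NoisyFactorizedLinear(512, n_actions),
)
```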

The difference in training dynamics is quite impressive. The image below shows a partially-trained DQN on Pong with NoisyNets (blue line) and a classical DQN (orange line). The first chart shows raw reward values, the second one the mean over the last 100 episodes.

So, the method is simple and efficient and can be easily applied to both DQN and policy gradient methods.
