RL Series: REINFORCE
This post is part of my RL series.
In this post, we review the REINFORCE algorithm, a Monte-Carlo Policy Gradient (PG) method. In PG methods, we learn a parameterized policy that maps states directly to actions, instead of deriving the policy from a value function.
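Concretely, with a policy π_θ(a|s) parameterized by θ (for example, the weights of a neural network), the objective and a single-episode (Monte-Carlo) estimate of its gradient can be sketched as:

$$
G_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{k=t}^{T} \gamma^{\,k-t} r_k,
\qquad
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\big[\,G_0\,\big],
\qquad
\nabla_\theta J(\theta) \approx \sum_{t=0}^{T} G_t \,\nabla_\theta \log \pi_\theta(a_t \mid s_t)
$$

Here G_t is the discounted return from time step t to the end of the episode, and the last expression is estimated from episodes sampled with the current policy, which is exactly why the method is on-policy.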
In value-based methods, we first learn a value function and then derive the optimal policy from it. Policy gradient methods, by contrast, handle stochastic policies and continuous action spaces naturally. To use DQN on a continuous action space, you have to discretize it, which hurts performance and quickly becomes impractical when the number of actions is large. REINFORCE, however, works with both discrete and continuous action spaces. It is an on-policy method because it only uses samples gathered from the current policy.
There are different versions of REINFORCE. The first one has no baseline and works as follows:
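As a rough sketch of the loop in equations (α is the learning rate; the textbook version in Sutton and Barto also scales each step by an extra γ^t factor, which many implementations drop):

$$
\text{repeat:}\quad
\text{sample an episode } s_0, a_0, r_0, \dots, s_T \text{ with } \pi_\theta;\qquad
\text{for each } t:\;
\theta \leftarrow \theta + \alpha\, G_t\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)
$$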
In this version, we take a policy (here, a neural network) and initialize it with random weights. We then play one episode and, once it ends, compute the discounted return from each time step to the end of the episode. This discounted return (G in the pseudocode above) is multiplied by the gradient of the log-probability of the action taken. This G is…
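To make the procedure concrete, here is a minimal sketch of REINFORCE without a baseline in PyTorch. It assumes a Gymnasium environment with a discrete action space (CartPole-v1 here); the network size, learning rate, and episode count are illustrative choices, not values from this post.

```python
# Minimal REINFORCE (no baseline) sketch in PyTorch.
# Assumes a Gymnasium env with a discrete action space; hyperparameters are illustrative.
import gymnasium as gym
import torch
import torch.nn as nn
from torch.distributions import Categorical

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n
gamma, lr = 0.99, 1e-2

# Policy network: maps a state to action logits.
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=lr)

for episode in range(500):
    # Play one full episode with the current policy (on-policy sampling).
    log_probs, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        dist = Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated

    # Discounted return G_t for every time step, computed backwards from the end.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    returns = torch.as_tensor(returns, dtype=torch.float32)

    # REINFORCE loss: -sum_t G_t * log pi(a_t | s_t), i.e. gradient ascent on J(theta).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In practice, a common trick is to normalize the returns (subtract their mean and divide by their standard deviation) before computing the loss; this reduces the variance of the updates, which is also the motivation for the baseline versions of REINFORCE.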