RL Series: A2C and A3C
This is part of my RL series of posts.
In the last post, we talked about REINFORCE and policy gradients. We saw that the variance of vanilla policy gradients and REINFORCE is high, and that we can reduce it by subtracting a baseline from the return G. This baseline can be a value function, learned with gradient descent. We can call the rescaler G - V an advantage function (more precisely, an advantage estimate). In REINFORCE, we update the policy and value functions after each episode, using the accumulated reward of that episode. As long as we don’t use the learned value function as a critic, it is not actor-critic yet. Here we want to talk about a family of algorithms that lets us update our policy and value functions at every time-step, or every N time-steps; we are not forced to wait until the end of the episode. We also use the learned value function as a critic, which means we use bootstrapping (estimating the value of a state from the estimated values of the next states). I think the following part from the Sutton & Barto book can help us understand this concept better:
Although the REINFORCE-with-baseline method learns both a policy and a state-value function, we do not consider it to be an actor–critic method because its state-value function is used only as a baseline, not as a critic. That is, it is not used for bootstrapping (updating the value estimate for a state from the estimated values of subsequent states), but only as a baseline for the state whose estimate is being updated. This is a useful distinction, for only through bootstrapping do we introduce bias and an asymptotic dependence on the quality of the function approximation. As we have seen, the bias introduced through bootstrapping and reliance on the state representation is often beneficial because it reduces variance and accelerates learning. REINFORCE with baseline is unbiased and will converge asymptotically to a local minimum, but like all Monte Carlo methods it tends to learn slowly (produce estimates of high variance) and to be inconvenient to implement online or for continuing problems. As we have seen earlier in this book, with temporal-difference methods we can eliminate these inconveniences, and through multi-step methods we can flexibly choose the degree of bootstrapping. In order to gain these advantages in the case of policy gradient methods we use actor–critic methods with a bootstrapping critic.
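To make the distinction in the quote concrete, here is a tiny made-up numerical example (the rewards and value estimates below are invented just for illustration): the Monte Carlo return G that REINFORCE uses only touches real rewards, while the one-step bootstrapped target plugs in the estimated value of the next state.

gamma = 0.99
rewards = [1.0, 0.0, 2.0]          # rewards of a short toy episode starting at state s_t
V = {"s_t": 0.5, "s_t+1": 1.2}     # imperfect learned value estimates (made up)

# Monte Carlo return from s_t (what REINFORCE uses): only real rewards, no value estimates
G = rewards[0] + gamma * rewards[1] + gamma ** 2 * rewards[2]

# REINFORCE with baseline rescales the gradient by G - V(s_t): still no bootstrapping
advantage_mc = G - V["s_t"]

# One-step bootstrapped target (what a critic uses): the estimated value of the next state
# stands in for everything that happens after it
td_target = rewards[0] + gamma * V["s_t+1"]
advantage_td = td_target - V["s_t"]   # the TD error, a bootstrapped advantage estimate

print(G, advantage_mc, td_target, advantage_td)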
So here is the starting point of the actor-critic family: by using the value function as a critic and the advantage function as the rescaler of the gradient, we get an advantage actor-critic, or A2C.
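As a concrete (and heavily simplified) sketch, a single A2C-style update with a one-step bootstrapped advantage could look roughly like the code below in PyTorch. The network sizes, the Adam optimizer, and the dummy inputs at the bottom are all illustrative assumptions of mine, not a reference implementation.

import torch
import torch.nn as nn

# Toy networks: a softmax policy and a state-value function (sizes are illustrative)
policy_net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2), nn.Softmax(dim=-1))
value_net = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(list(policy_net.parameters()) + list(value_net.parameters()), lr=1e-3)

def a2c_update(state, action, reward, next_state, done, gamma=0.99):
    v = value_net(state).squeeze(-1)                  # V(s_t), the critic's estimate
    with torch.no_grad():
        v_next = value_net(next_state).squeeze(-1)    # V(s_{t+1}), used for bootstrapping
        td_target = reward + gamma * v_next * (1.0 - done)
    advantage = (td_target - v).detach()              # one-step advantage estimate (TD error)

    log_prob = torch.log(policy_net(state)[action])
    actor_loss = -log_prob * advantage                # policy gradient rescaled by the advantage
    critic_loss = (td_target - v).pow(2)              # move V(s_t) toward the bootstrapped target

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()

# Dummy one-step example (random numbers standing in for real environment data)
a2c_update(torch.randn(4), torch.tensor(1), torch.tensor(0.5), torch.randn(4), torch.tensor(0.0))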
There is another member of the actor-critic family called A3C, which stands for Asynchronous Advantage Actor-Critic. In this method, several parallel agents interact with their own copies of the environment (they can run on multiple CPU cores, for example). Here is the pseudo-code for this algorithm from the paper:
And here is a high-level architecture for A3C:
Here we have one global network, n workers, and n environments. Each worker has its own copy of the parameters. It interacts with its environment, gathers some data, and computes a gradient d_theta using its own parameters. At some point, it sends this gradient to the global network, which updates its weights using:
theta = theta + alpha * d_theta, where d_theta is the gradient coming from one of the workers and alpha is the learning rate.
After the global network updates its weights, the worker that sent d_theta gets the updated weights back and starts interacting with its own environment again. At some other point, another worker sends its own d_theta to the global network; the global network updates its weights again (these weights already contain the contribution of the previous worker), and that worker pulls the new weights. The same goes for the other n-2 workers. As you see, everything is asynchronous: the workers and the global network do not update at the same point in time. Throughout all of this we are using the advantage function as the rescaler and the value function as a critic. That’s A3C! The pictures below can help to understand the concept better:
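On top of the pictures, a tiny toy sketch of just the update mechanics might also help. This is not a full A3C implementation: fake_gradient is a placeholder I made up for "interact with the environment and compute d_theta with the local weights", and the lock only protects the shared vector while each worker pushes its gradient whenever it happens to be ready.

import threading
import numpy as np

# Toy stand-in for the global network: a single shared parameter vector
global_theta = np.zeros(5)
alpha = 0.1
lock = threading.Lock()   # protects the shared vector during the update itself

def fake_gradient(local_theta, worker_id, step):
    # Placeholder for a worker's rollout: in real A3C this would be the policy
    # and value gradients computed from a short segment of experience.
    rng = np.random.default_rng(1000 * worker_id + step)
    return rng.normal(size=local_theta.shape)

def worker(worker_id, steps=3):
    global global_theta
    for step in range(steps):
        with lock:
            local_theta = global_theta.copy()               # pull the current global weights
        d_theta = fake_gradient(local_theta, worker_id, step)
        with lock:
            global_theta = global_theta + alpha * d_theta   # theta = theta + alpha * d_theta
        # the next iteration pulls the updated weights back, like the worker in the text;
        # other workers push their own d_theta at different moments, so updates interleave

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(global_theta)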
Let’s talk about the synchronous version of A3C! One A is removed, and we get A2C again: Synchronous Advantage Actor-Critic. Here we have the same approach as A3C, but in a synchronous way. Every worker calculates its d_theta and sends it to the global network. The global network waits until all the workers have delivered their d_theta, averages them, and then updates its weights. After that, all the workers get the new weights and start working again, so all the workers share the same weights at all times.
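For contrast, here is the synchronous version of the same toy sketch: the global network waits for every worker's gradient, averages them, makes one update, and broadcasts the new weights. As before, fake_gradient is just my placeholder for a real rollout.

import numpy as np

n_workers = 4
alpha = 0.1
global_theta = np.zeros(5)
rng = np.random.default_rng(0)

def fake_gradient(local_theta):
    # Placeholder for one worker's rollout and gradient computation with the shared weights
    return rng.normal(size=local_theta.shape)

for update in range(3):
    # Every worker uses the *same* weights for this round of experience
    gradients = [fake_gradient(global_theta.copy()) for _ in range(n_workers)]
    # The global network waits for all workers, averages their d_theta, then updates once
    d_theta = np.mean(gradients, axis=0)
    global_theta = global_theta + alpha * d_theta
    # The new weights are broadcast back to all workers before the next round

print(global_theta)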
To compare the performance of A2C and A3C, we can refer to the following paragraphs from OpenAI’s post:
After reading the paper, AI researchers wondered whether the asynchrony led to improved performance (e.g. “perhaps the added noise would provide some regularization or exploration?”), or if it was just an implementation detail that allowed for faster training with a CPU-based implementation.
As an alternative to the asynchronous implementation, researchers found you can write a synchronous, deterministic implementation that waits for each actor to finish its segment of experience before performing an update, averaging over all of the actors. One advantage of this method is that it can more effectively use GPUs, which perform best with large batch sizes. This algorithm is naturally called A2C, short for advantage actor critic. (This term has been used in several papers.)
Our synchronous A2C implementation performs better than our asynchronous implementations — we have not seen any evidence that the noise introduced by asynchrony provides any performance benefit. This A2C implementation is more cost-effective than A3C when using single-GPU machines, and is faster than a CPU-only A3C implementation when using larger policies.
So it seems that they simply call this synchronous version A2C as well. As we go forward and see newer and more advanced RL methods, the naming and categorization will become less clear and trickier.
Here are some resources to learn more: