Model-Based Reinforcement Learning (MBRL) — Part 2
Let’s continue from where we left off with MCTS.
MCTS (Monte Carlo Tree Search)
This algorithm applies to discrete action spaces and is used in AlphaGo and AlphaZero. You keep track of the Q-value, an estimate of the long-term reward, for every state and action you want to consider, along with the number of times each state-action pair has been visited.
- Selection: in the tree, pick the action with the highest Q-value (in practice with an added exploration bonus such as UCB):
$$ \pi_k(s) = \arg\max_a Q_k(s,a) $$
- Backup: Propagate the Q-values to parent nodes:
$$ Q_{k+1}(s, a) = \frac{Q_k(s,a)\, N_k(s,a) + R}{N_k(s,a)+1} $$
$$ N_{k+1}(s,a) = N_k(s,a)+1 $$
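To make the selection and backup steps concrete, here is a minimal sketch in Python. The `Node` class, the UCB-style exploration bonus, and the constant `c` are illustrative assumptions on my part, not part of the algorithm description above.

```python
import math

class Node:
    """Tree node storing Q(s, a) and visit counts N(s, a) per action."""
    def __init__(self, actions):
        self.Q = {a: 0.0 for a in actions}   # long-term reward estimates
        self.N = {a: 0 for a in actions}     # visit counts
        self.children = {}                   # action -> child Node (expanded lazily)

def select_action(node, c=1.4):
    """Tree policy: exploit high Q-values, but explore rarely tried actions (UCB)."""
    total = sum(node.N.values()) + 1
    return max(node.Q, key=lambda a: node.Q[a]
               + c * math.sqrt(math.log(total) / (node.N[a] + 1)))

def backup(path, R):
    """Propagate the rollout return R to every (node, action) on the path:
       Q_{k+1} = (Q_k * N_k + R) / (N_k + 1),  N_{k+1} = N_k + 1."""
    for node, a in path:
        node.Q[a] = (node.Q[a] * node.N[a] + R) / (node.N[a] + 1)
        node.N[a] += 1
```

A full planner would repeat select → expand → simulate → backup many times and then act greedily with respect to the root's visit counts or Q-values.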
Trajectory Optimization
Instead of keeping track of a tree of many possibilities, you keep track of one possible action sequence.
- Evaluation: get trajectory reward $J(a) = \sum_{t=0}^H r_t$
- Back-propagation: because the reward and transition models are differentiable, you can compute the gradient of the total reward with respect to the actions via back-propagation, using the reward-model and transition-model derivatives.
$$ \nabla_a J = \sum_{t=0}^H \nabla_a r_t $$
$$ \nabla_a r_t = \nabla_s f_r(s_t, a_t)\, \nabla_a s_t + \nabla_a f_r(s_t, a_t) $$
$$ \nabla_a s_t = \nabla_a f_s(s_{t-1}, a_{t-1}) + \nabla_s f_s(s_{t-1}, a_{t-1})\, \nabla_a s_{t-1} $$
$$ \nabla_a s_{t-1} = \dots $$
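Here is a minimal sketch of this gradient-based shooting idea on a toy differentiable model. The linear dynamics, quadratic reward, horizon, and step size are all illustrative assumptions; the backward loop implements the same chain-rule recursion as the equations above.

```python
import numpy as np

# Toy differentiable model (an assumption for illustration):
#   dynamics  f_s(s, a) = A*s + B*a,   reward  f_r(s, a) = -(s - goal)**2 - 0.1*a**2
A, B, goal, H = 0.9, 0.5, 1.0, 20

def rollout(s0, actions):
    """Unroll the model; return the visited states and the trajectory reward J(a)."""
    s, states, J = s0, [], 0.0
    for a in actions:
        states.append(s)
        J += -(s - goal)**2 - 0.1 * a**2
        s = A * s + B * a
    return states, J

def grad_actions(s0, actions):
    """Back-propagate through the rollout (reverse-mode chain rule)."""
    states, _ = rollout(s0, actions)
    g = np.zeros(H)
    dJ_ds_next = 0.0                       # dJ/ds_{t+1}, accumulated backwards
    for t in reversed(range(H)):
        s, a = states[t], actions[t]
        # dJ/da_t = (dJ/ds_{t+1}) * df_s/da + df_r/da
        g[t] = dJ_ds_next * B - 0.2 * a
        # dJ/ds_t = df_r/ds + (dJ/ds_{t+1}) * df_s/ds
        dJ_ds_next = -2.0 * (s - goal) + dJ_ds_next * A
    return g

# Shooting by gradient ascent: repeatedly improve the single action sequence.
actions = np.zeros(H)
for _ in range(200):
    actions += 0.05 * grad_actions(0.0, actions)
```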
The differences between discrete and continuous actions can be summarized as follows: discrete actions naturally lead to tree search (as in MCTS), while continuous actions allow gradient-based trajectory optimization.
The continuous-action example above falls into the class of shooting methods.
Why so many variations? They all try to mitigate the issues we have looked at, such as sensitivity and poor conditioning, and getting stuck in local optima.
Addressing each leads to a different class of methods.
Sensitivity and poor conditioning
The shooting methods we have seen suffer from a particular issue: small changes in early actions lead to very large changes downstream. This becomes clearer when we expand the objective function.
$$ \max_{a_0,\dots,a_H} \sum_{t=0}^H r(s_t, a_t), \quad s_{t+1} = f(s_t, a_t) $$
$$ \sum_{t=0}^H r(s_t, a_t) = r(s_0, a_0) + r(f(s_0, a_0), a_1) + \dots + r(f(f(\dots),\dots), a_H) $$
This means that each state implicitly depends on all actions that came before it. The situation is similar to the exploding/vanishing gradient problem in RNNs, which hurts long-term credit assignment. But unlike RNN training, we cannot change the transition function, because it is dictated to us by the environment.
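A toy calculation, assuming simple linear dynamics, makes the analogy concrete: the sensitivity of a late state to an early action is a product of Jacobians, which vanishes or explodes with the horizon.

```python
# Sensitivity of the final state to the first action under toy linear dynamics
#   s_{t+1} = A*s_t + B*a_t   (an assumption for illustration).
# By the chain rule, d s_H / d a_0 = A**(H-1) * B: a product of H-1 Jacobians.
B = 1.0
for A in (0.8, 1.25):                  # contracting vs. expanding dynamics
    for H in (10, 50, 100):
        sensitivity = A ** (H - 1) * B
        print(f"A={A:5.2f}  H={H:3d}  ds_H/da_0 = {sensitivity:.3e}")
# A < 1: the gradient vanishes with the horizon; A > 1: it explodes.
# This is exactly the ill-conditioning that hurts long-horizon shooting.
```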
To address this problem, collocation methods optimize over states and/or actions directly, instead of over actions only, so we have a different set of parameters to optimize over.
$$ \max_{s_0,a_0,\dots,s_H,a_H} \sum_{t=0}^H r(s_t, a_t), \quad \| s_{t+1} - f(s_t, a_t) \| = 0 \leftarrow \text{explicit optimization constraint} $$
The dynamics now appear as an explicit constraint in the optimization problem, rather than being satisfied by construction as in shooting methods.
As a result, there are only pairwise dependencies between variables, unlike the dense dependency structure of shooting methods, where every state depends on all earlier actions.
These methods have:
- Good conditioning: changing $s_0, a_0$ has a similar effect as changing $s_H, a_H$.
- A larger but easier-to-optimize search space, which is useful for contact-rich problems such as some robotics applications.
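Below is a minimal collocation sketch, assuming the same toy linear model and a simple quadratic penalty in place of a proper constrained solver; the model parameters and the penalty weight `rho` are illustrative assumptions. Note how each gradient term couples only neighbouring variables.

```python
import numpy as np

# Collocation: optimize states AND actions jointly, enforcing the dynamics
# s_{t+1} = f(s_t, a_t) as an explicit (here, penalized) constraint.
A, B, goal, H, rho = 0.9, 0.5, 1.0, 20, 10.0

def objective_grad(s, a):
    """Gradients of  sum_t r(s_t, a_t) - rho * sum_t (s_{t+1} - f(s_t, a_t))**2
       with respect to the decision variables s (states) and a (actions)."""
    gs, ga = np.zeros(H + 1), np.zeros(H)
    # reward terms r(s_t, a_t) = -(s_t - goal)**2 - 0.1*a_t**2
    gs[:H] += -2.0 * (s[:H] - goal)
    ga += -0.2 * a
    # penalty terms couple only neighbouring variables (pairwise dependencies)
    defect = s[1:] - (A * s[:H] + B * a)
    gs[1:] += -2.0 * rho * defect
    gs[:H] += 2.0 * rho * A * defect
    ga += 2.0 * rho * B * defect
    return gs, ga

s = np.zeros(H + 1)          # decision variables: the whole state trajectory ...
a = np.zeros(H)              # ... and the action sequence
for _ in range(500):
    gs, ga = objective_grad(s, a)
    gs[0] = 0.0              # the initial state is fixed, do not move it
    s += 0.01 * gs
    a += 0.01 * ga
```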
Only reaching local optima
Some approaches try to avoid local optima by using sampling-based methods, such as the Cross-Entropy Method (CEM) and $\text{PI}^2$.
For example, instead of maintaining a single best trajectory, CEM maintains the mean and covariance of a distribution over trajectories, samples candidate plans from it, and refits the distribution to the best ones.
Despite being very simple, this works surprisingly well and comes with nice performance guarantees.
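Here is a minimal CEM sketch on the same kind of toy model (again an illustrative assumption), using a per-timestep mean and standard deviation in place of a full covariance: sample candidate plans, keep the elites, and refit the distribution.

```python
import numpy as np

# Cross-Entropy Method: keep a distribution (mean, std) over action sequences
# instead of a single plan. The toy model below is an assumption for illustration.
A, B, goal, H = 0.9, 0.5, 1.0, 20

def trajectory_reward(actions, s0=0.0):
    s, J = s0, 0.0
    for a in actions:
        J += -(s - goal)**2 - 0.1 * a**2
        s = A * s + B * a
    return J

mean, std = np.zeros(H), np.ones(H)
n_samples, n_elite = 100, 10
for _ in range(50):
    plans = mean + std * np.random.randn(n_samples, H)          # sample candidate plans
    scores = np.array([trajectory_reward(p) for p in plans])
    elites = plans[np.argsort(scores)[-n_elite:]]                # keep the best samples
    mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-3   # refit the distribution

best_plan = mean   # the final mean serves as the plan
```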
Why does this work?
- The search space of decision-time plans is much smaller than the space of policy parameters: e.g. 30x32 vs. 32x644x32.
- It is easier to find feasible plans than feasible policy parameters.
That’s it for this post. We will continue in the next post.
Thank you for taking the time to read my post. If you found it helpful or enjoyable, please consider giving it a like and sharing it with your friends. Your support means the world to me and helps me to continue creating valuable content for you.