What?

Deep Learning for semi-batch RL with a Dyna-style algorithm.

Why?

Online RL is hard to use in real life due to deployment difficulties (every policy update needs fresh interaction with the environment), while offline RL may be too restrictive and needs a lot of pre-collected data. We need a compromise between the two.

How?

(figures omitted; source: original paper)

TLDR: We want to reduce the number of policy deployments, where each deployment means running the current policy in the environment to collect a batch of data for the subsequent policy updates.

The authors propose BREMEN (Behavior-Regularized Model-ENsemble), whose policy update is:

$\theta_{k+1} = \arg\max_\theta \mathbb{E}_{(s,a)\sim\pi_{\theta_k}, \hat{f}_{\phi_i}} \left[\frac{\pi_\theta(a\mid s)}{\pi_{\theta_k}(a\mid s)} A^{\pi_{\theta_k}}(s,a)\right]$ s.t. $\mathbb{E}_{(s,a)\sim\pi_{\theta_k}, \hat{f}_{\phi_i}}\left[D_{\mathrm{KL}}\big(\pi_\theta(\cdot\mid s)\,\|\,\pi_{\theta_k}(\cdot\mid s)\big)\right] \leq \delta$, with the policy initialized as $\pi_{\theta_0} = \mathrm{Normal}(\hat{\pi}_\beta, 1)$.
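For intuition, here is a minimal sketch of this constrained step, assuming Gaussian policies in PyTorch and approximating the hard KL constraint with a simple penalty (all helper names here are illustrative, not from the paper):

import torch

def bremen_step_loss(pi, pi_k, states, actions, advantages, delta=0.05, kl_coef=10.0):
  # pi and pi_k are assumed to map a batch of states to torch.distributions.Normal
  # over action dimensions (shape [batch, act_dim]); pi_k stays frozen.
  dist, dist_k = pi(states), pi_k(states)
  # Importance-weighted advantage: (pi_theta / pi_theta_k) * A.
  ratio = torch.exp(dist.log_prob(actions).sum(-1) - dist_k.log_prob(actions).sum(-1))
  surrogate = (ratio * advantages).mean()
  # KL(pi_theta || pi_theta_k), averaged over states; the hard constraint KL <= delta
  # is replaced here by a hinge penalty instead of an exact trust-region solve.
  kl = torch.distributions.kl_divergence(dist, dist_k).sum(-1).mean()
  return -surrogate + kl_coef * torch.clamp(kl - delta, min=0.0)

Taking a few gradient steps on this loss while pi_k stays frozen approximates one iterate $\theta_k \to \theta_{k+1}$; the paper solves the hard-constrained version with TRPO on imaginary rollouts.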

Traditional pseudocode:

import random

def train(N, K, T):
  # N policy deployments, an ensemble of K dynamics models,
  # T imaginary policy-update steps per deployment.
  buffer_all = []
  dynamics_models = [init_model() for _ in range(K)]
  pi_target = random_init()
  for deployment in range(N):
    # 1. Deploy the current policy once and add the collected batch to the buffer.
    current_batch = sample_rollouts(pi_target)
    buffer_all += current_batch
    # 2. Re-fit every model of the ensemble on all data collected so far.
    for model in dynamics_models:
      update_dynamics_model(buffer_all, model)
    # 3. Behaviour-clone the latest batch and re-initialize the target policy
    #    as a Gaussian around it: pi_theta_0 = Normal(pi_beta, 1).
    pi_beta = behaviour_cloning(current_batch)
    pi_target = wrap_policy(mu=pi_beta, std=1)
    # 4. Model-based offline policy optimization with a trust region
    #    around the previous iterate.
    for _ in range(T):
      pi_prev = pi_target
      imaginary_rollout = rollout(pi_target, random.choice(dynamics_models))
      pi_target = policy_update(imaginary_rollout, pi_target, pi_prev)
  return pi_target
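
For completeness, a minimal sketch of the behaviour-cloning step (step 3 above), assuming (state, action) pairs and a PyTorch mean network; the signature and names are illustrative, not from the paper:

import torch

def behaviour_cloning(current_batch, mu_net, epochs=50, lr=1e-3):
  # Fit the mean network of a Gaussian policy to the deployed batch by MSE;
  # the re-initialized policy is then Normal(mu_net(s), 1).
  states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, _ in current_batch])
  actions = torch.stack([torch.as_tensor(a, dtype=torch.float32) for _, a in current_batch])
  opt = torch.optim.Adam(mu_net.parameters(), lr=lr)
  for _ in range(epochs):
    opt.zero_grad()
    torch.nn.functional.mse_loss(mu_net(states), actions).backward()
    opt.step()
  return mu_net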

And?