What?

Deep Learning for semi-batch RL with a Dyna-style algorithm.

Why?

Online RL is hard to use in real life due to deployment difficulties (every policy update needs fresh interaction with the environment), while offline RL may be too restrictive and needs a lot of pre-collected data. We need a compromise between the two.

How?

(figures omitted; source: original paper)

TLDR: We want to reduce the number of policy deployments, where each deployment means running the current policy in the environment to collect a batch of data for the subsequent policy updates.

The authors propose BREMEN (Behavior-Regularized Model-ENsemble), whose policy update is:

$\theta_{k+1} = \arg\max_\theta \mathbb{E}_{(s,a)\sim\pi_{\theta_k}, \hat{f}_{\phi_i}} \left[\frac{\pi_\theta(a\mid s)}{\pi_{\theta_k}(a\mid s)} A^{\pi_{\theta_k}}(s,a)\right]$ s.t. $\mathbb{E}_{(s,a)\sim\pi_{\theta_k}, \hat{f}_{\phi_i}}\left[D_{\mathrm{KL}}\big(\pi_\theta(\cdot\mid s)\,\|\,\pi_{\theta_k}(\cdot\mid s)\big)\right] \leq \delta$, with the policy initialized as $\pi_{\theta_0} = \mathrm{Normal}(\hat{\pi}_\beta, 1)$.
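For intuition, here is a minimal sketch of this constrained step, assuming Gaussian policies in PyTorch and approximating the hard KL constraint with a simple penalty (all helper names here are illustrative, not from the paper):

import torch

def bremen_step_loss(pi, pi_k, states, actions, advantages, delta=0.05, kl_coef=10.0):
  # pi and pi_k are assumed to map a batch of states to torch.distributions.Normal
  # over action dimensions (shape [batch, act_dim]); pi_k stays frozen.
  dist, dist_k = pi(states), pi_k(states)
  # Importance-weighted advantage: (pi_theta / pi_theta_k) * A.
  ratio = torch.exp(dist.log_prob(actions).sum(-1) - dist_k.log_prob(actions).sum(-1))
  surrogate = (ratio * advantages).mean()
  # KL(pi_theta || pi_theta_k), averaged over states; the hard constraint KL <= delta
  # is replaced here by a hinge penalty instead of an exact trust-region solve.
  kl = torch.distributions.kl_divergence(dist, dist_k).sum(-1).mean()
  return -surrogate + kl_coef * torch.clamp(kl - delta, min=0.0)

Taking a few gradient steps on this loss while pi_k stays frozen approximates one iterate $\theta_k \to \theta_{k+1}$; the paper solves the hard-constrained version with TRPO on imaginary rollouts.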

Traditional pseudocode:

import random

def train(N, K, T):
  # N policy deployments, an ensemble of K dynamics models,
  # T imaginary policy-update steps per deployment.
  buffer_all = []
  dynamics_models = [init_model() for _ in range(K)]
  pi_target = random_init()
  for deployment in range(N):
    # 1. Deploy the current policy once and add the collected batch to the buffer.
    current_batch = sample_rollouts(pi_target)
    buffer_all += current_batch
    # 2. Re-fit every model of the ensemble on all data collected so far.
    for model in dynamics_models:
      update_dynamics_model(buffer_all, model)
    # 3. Behaviour-clone the latest batch and re-initialize the target policy
    #    as a Gaussian around it: pi_theta_0 = Normal(pi_beta, 1).
    pi_beta = behaviour_cloning(current_batch)
    pi_target = wrap_policy(mu=pi_beta, std=1)
    # 4. Model-based offline policy optimization with a trust region
    #    around the previous iterate.
    for _ in range(T):
      pi_prev = pi_target
      imaginary_rollout = rollout(pi_target, random.choice(dynamics_models))
      pi_target = policy_update(imaginary_rollout, pi_target, pi_prev)
  return pi_target
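
For completeness, a minimal sketch of the behaviour-cloning step (step 3 above), assuming (state, action) pairs and a PyTorch mean network; the signature and names are illustrative, not from the paper:

import torch

def behaviour_cloning(current_batch, mu_net, epochs=50, lr=1e-3):
  # Fit the mean network of a Gaussian policy to the deployed batch by MSE;
  # the re-initialized policy is then Normal(mu_net(s), 1).
  states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, _ in current_batch])
  actions = torch.stack([torch.as_tensor(a, dtype=torch.float32) for _, a in current_batch])
  opt = torch.optim.Adam(mu_net.parameters(), lr=lr)
  for _ in range(epochs):
    opt.zero_grad()
    torch.nn.functional.mse_loss(mu_net(states), actions).backward()
    opt.step()
  return mu_net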

And?