Deep Learning for semi-batch RL with a Dyna-style algorithm.
Online RL is impractical in real-life settings because of deployment difficulties: every policy update effectively requires deploying a new policy. Offline RL avoids deployments but is also somewhat unrealistic, since it needs a large pre-collected dataset. We need a compromise between the two.
(Figure from the original paper.)
TLDR: We want to reduce the number of policy deployments, where a single deployment collects the batch of data used for the subsequent policy updates.
The authors propose BREMEN (Behavior-Regularized Model-ENsemble), which works as follows:
We keep an ensemble of transition models $\{\hat{f}_{\phi_1}, \hat{f}_{\phi_2}, \dots, \hat{f}_{\phi_K}\}$ and sample one of them at random before generating each imaginary rollout.
The models are updated by just minimising the MSE:
$\min_{\phi_i}\frac{1}{\lvert \mathcal{D}\rvert} \sum_{(s,a,s')\in \mathcal{D}}\frac{1}{2}\left\lVert s'-\hat{f}_{\phi_i}(s,a) \right\rVert^2_2$
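A minimal sketch of what this looks like in PyTorch (the class and function names here are illustrative, not from the paper; every ensemble member is fit to the same buffer of real transitions):

```python
import torch
import torch.nn as nn

class DynamicsModel(nn.Module):
    """One ensemble member: predicts the next state from (state, action)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def fit_ensemble(models, optimizers, states, actions, next_states):
    """One MSE step for every ensemble member on a batch of real transitions."""
    for model, opt in zip(models, optimizers):
        pred = model(states, actions)
        loss = 0.5 * ((next_states - pred) ** 2).sum(dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
```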
Imaginary trajectories are generated by alternating between sampling an action from the policy and querying the sampled transition model for the next state, for example:
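A sketch of this rollout loop (`policy.sample` and the model's call signature are assumed interfaces, not the paper's code):

```python
import random
import torch

@torch.no_grad()
def imaginary_rollout(policy, dynamics_models, start_state, horizon):
    """Roll the policy out inside one randomly chosen dynamics model."""
    model = random.choice(dynamics_models)   # one ensemble member per rollout
    state, trajectory = start_state, []
    for _ in range(horizon):
        action = policy.sample(state)        # assumed stochastic-policy interface
        next_state = model(state, action)    # model predicts the next state
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory
```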
The dynamics models are trained on all of the data seen during training, not only on the most recently collected batch.
The policy is then updated with a TRPO-style trust-region step on the imaginary rollouts: $\theta_{k+1} = \arg\max_\theta \, \mathbb{E}_{(s,a)\sim\pi_{\theta_k},\, \hat{f}_{\phi_i}} \left[\frac{\pi_\theta(a\mid s)}{\pi_{\theta_k}(a\mid s)}A^{\pi_{\theta_k}}(s,a)\right]$ s.t. $\mathbb{E}_{(s,a)\sim\pi_{\theta_k},\, \hat{f}_{\phi_i}}\left[D_{\text{KL}}\!\left(\pi_\theta(\cdot\mid s)\,\|\,\pi_{\theta_k}(\cdot\mid s)\right)\right] \leq \delta$, with $\pi_{\theta_0} = \text{Normal}(\hat{\pi}_\beta, 1)$.
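A rough sketch of one such update step (illustrative only: the paper uses TRPO with a hard KL constraint, which is replaced here by a KL penalty to keep the code short; `policy.distribution` is an assumed interface returning a `torch.distributions` object):

```python
import torch

def trust_region_step(policy, old_policy, optimizer,
                      states, actions, advantages, kl_coef=1.0):
    """One surrogate-objective step on imaginary data; the KL term is a penalty
    standing in for the paper's hard constraint D_KL <= delta."""
    with torch.no_grad():
        old_dist = old_policy.distribution(states)   # assumed interface
    dist = policy.distribution(states)
    ratio = torch.exp(dist.log_prob(actions) - old_dist.log_prob(actions))
    surrogate = (ratio * advantages).mean()
    kl = torch.distributions.kl_divergence(dist, old_dist).mean()
    loss = -(surrogate - kl_coef * kl)               # maximise surrogate, keep KL small
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Before the first of the $T$ updates, the policy is re-initialised as $\text{Normal}(\hat{\pi}_\beta, 1)$, a Gaussian centred on the behaviour-cloned policy, which is where the implicit behaviour regularisation comes from.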
Pseudocode:
```python
def train(N, K, T):
    """N policy deployments, K dynamics models in the ensemble, T imaginary updates per deployment."""
    buffer_all = []
    dynamics_models = [init_model() for _ in range(K)]
    pi_target = random_init()
    for deployment in range(N):
        # deploy the current policy and add the collected batch to the full buffer
        current_batch = sample_rollouts(pi_target)
        buffer_all.extend(current_batch)
        # retrain every ensemble member on all data collected so far
        for model in dynamics_models:
            update_dynamics_model(buffer_all, model)
        # behaviour-clone the latest batch and re-initialise the policy as Normal(pi_beta, 1)
        pi_beta = behaviour_cloning(current_batch)
        pi_target = wrap_policy(mu=pi_beta, std=1)
        for _ in range(T):
            pi_prev = deepcopy(pi_target)  # frozen old policy for the trust-region constraint
            # roll the policy out inside a randomly sampled dynamics model
            imaginary_rollout = poll_transition(pi_target, sample(dynamics_models))
            policy_update(imaginary_rollout, pi_target, pi_prev)
    return pi_target
```