What?
A study of a replay buffer (a nonparametric transition model) vs a parametric transition model.
Why?
Can a model-free algorithm with experience replay be better than a Dyna-like algorithm?
How?
(Figure source: original paper.)
The authors use planning in a very specific way; in fact, they use it in two ways. First, planning denotes any additional computation aimed at improving the agent's predictions. Second, they use planning to mean updating the agent with data sampled from the model (as in Dyna).
A general boilerplate for a Dyna-style algorithm:
```python
def mbrl(state_distr, model, policy, value, env):
    s = env.reset()  # reset/done handling omitted for simplicity
    for it in range(K):
        # usual RL stage + learning the model from real experience
        for step in range(M):
            a = policy(s)
            r, snext = env.step(a)
            model, state_distr = update_model(s, a, r, snext, model, state_distr)
            policy, value = update_agent(s, a, r, snext, policy, value)
            s = snext
        # planning stage: updates use data sampled from the model only!
        for pstep in range(P):
            s, a = state_distr.sample()
            r, snext = model(s, a)
            policy, value = update_agent(s, a, r, snext, policy, value)
```
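For contrast, the replay counterpart fits the same template, with the buffer playing the role of a nonparametric model. This is my own sketch, reusing the undefined helpers `K`, `M`, `P`, and `update_agent` from the boilerplate above:

```python
import random

def replay_rl(buffer, policy, value, env):
    s = env.reset()  # reset/done handling omitted, as above
    for it in range(K):
        # usual RL stage: act, store, learn online
        for step in range(M):
            a = policy(s)
            r, snext = env.step(a)
            buffer.append((s, a, r, snext))
            policy, value = update_agent(s, a, r, snext, policy, value)
            s = snext
        # "planning" stage: replay stored transitions instead of model samples
        for pstep in range(P):
            s_b, a_b, r_b, snext_b = random.choice(buffer)
            policy, value = update_agent(s_b, a_b, r_b, snext_b, policy, value)
```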
- Computational considerations:
- Model-based is clearly more computationally heavy.
- However, the replay buffer itself needs memory!
- If memory is limited, the capacity of this nonparametric model is effectively limited as well.
- However, a parametric model can compress experience and avoid spending a lot of memory (rough arithmetic below).
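A back-of-the-envelope sketch of the memory point, with my own illustrative numbers (Atari-style 84x84 grayscale uint8 observations, 4-frame stacks, a 1M-transition buffer; not figures from the paper):

```python
frame = 84 * 84                      # bytes per uint8 grayscale frame
naive = 4 * frame                    # storing the full 4-frame stack per transition
shared = frame                       # storing each frame once and re-stacking on read
buffer_size = 1_000_000
print(f"naive:  {naive * buffer_size / 1e9:.1f} GB")   # ~28 GB
print(f"shared: {shared * buffer_size / 1e9:.1f} GB")  # ~7 GB
# For comparison, a parametric model with a few million float32 parameters
# takes only tens of MB.
```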
- Equivalences
- For the states we observed, replay = perfect model.
- → learning a model for these states will not give us a better result.
- I believe this argument only holds as long as we do not evict transitions from the buffer.
- For the linear case, fitting a linear model and then solving it for the value function is equivalent to least-squares TD learning (LSTD); a quick numerical check below.
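A quick numpy check of the LSTD equivalence mentioned above. This is my own sketch; the batch is random data just to verify the algebra, not a real MDP:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 500, 4, 0.9
X = rng.normal(size=(n, d))    # features x_t of visited states
Xn = rng.normal(size=(n, d))   # features x_{t+1} of successor states
r = rng.normal(size=n)         # rewards

# LSTD: w = A^{-1} b with A = sum_t x_t (x_t - gamma x_{t+1})^T, b = sum_t r_t x_t
A = X.T @ (X - gamma * Xn)
b = X.T @ r
w_lstd = np.linalg.solve(A, b)

# Linear model: least-squares dynamics F and reward theta, then solve the
# Bellman equation inside the model: w = theta + gamma F^T w.
C = X.T @ X
F = np.linalg.solve(C, X.T @ Xn).T        # x_{t+1} ~= F x_t
theta = np.linalg.solve(C, b)             # r_t ~= theta^T x_t
w_model = np.linalg.solve(np.eye(d) - gamma * F.T, theta)

assert np.allclose(w_lstd, w_model)       # same fixed point
```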
- When do parametric models help learning?
- Useful for forward planning to help the policy.
- Replay does not easily allow generating imaginary rollouts. Also, the current state is unlikely to be in the buffer, so there is nothing to start sampling from.
- Using a model to select actions does not require as accurate a model as, for example, pixel prediction does (a small lookahead sketch below).
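A minimal sketch of decision-time forward planning with a learnt model. The interfaces `model(s, a) -> (r, s_next)` and `value(s) -> float` are my assumptions, not the paper's API:

```python
def lookahead_action(s, model, value, actions, depth=2, gamma=0.99):
    # Exhaustive depth-limited search over imagined futures.
    def q_plan(state, action, d):
        r, s_next = model(state, action)
        if d == 1:
            return r + gamma * value(s_next)
        return r + gamma * max(q_plan(s_next, a2, d - 1) for a2 in actions)

    # An inaccurate model here only biases the action choice; it does not
    # write wrong targets into the stored value function.
    return max(actions, key=lambda a: q_plan(s, a, depth))
```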
- Backward planning is cool!
- If we plan backwards, updating the value of a spurious ('fake') predecessor state does not affect how the agent acts in real states.
- With forward planning, however, a flawed model produces wrong targets for real states, so their value predictions can be misled (a tabular backward-update sketch below).
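A tabular sketch of one backward planning update. The interface `backward_model(s) -> (s_prev, a_prev, r)` and the dict-based Q-table are my assumptions:

```python
def backward_planning_step(s, backward_model, q, actions,
                           alpha=0.1, gamma=0.99, n_preds=5):
    # q: a collections.defaultdict(float) over (state, action) pairs.
    # Bootstrap from the (real) state s we just visited...
    bootstrap = max(q[(s, a)] for a in actions)
    # ...and push value backwards into imagined predecessors.
    for _ in range(n_preds):
        s_prev, a_prev, r = backward_model(s)  # may be a spurious state
        td_error = r + gamma * bootstrap - q[(s_prev, a_prev)]
        # Only the predecessor's own entry is touched, so even a spurious
        # s_prev cannot corrupt the values the agent uses in real states.
        q[(s_prev, a_prev)] += alpha * td_error
    return q
```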
- In a FourRooms grid example, the authors show that:
- Planning ahead even 2 steps before committing to an action already helps.
- Backward Dyna > Forward Dyna
- More sample-efficient in the deterministic case.
- Forward Dyna diverges in the stochastic case.
- The authors say "...may instead be due to the independent sampling of the successor state and reward which may result in inconsistent transition".
- I'd like to understand what that means, but I have no idea.
- A failure to learn
- The deadly triad comes up for model learning as well.
- Sampling from the model → state distribution different from the real dynamics → we will be solving a different MDP.
- For the linear model:
- $w \leftarrow (I-\alpha A)w + \alpha b$, where $A = \mathbb{E}[x_t(x_t - \gamma x_{t+1})^\top] = X^\top D(I-\gamma P)X$ and $b = \mathbb{E}[r_{t+1} x_t]$, $P$ is the transition matrix under $\pi$, and $D$ is a diagonal matrix with $[D]_{ii} = d(i) = P(S_t = i \mid \pi)$.
- Now, if the sampling distribution $d$ is not consistent with $P$ (e.g., states come uniformly from the buffer but successors from a learnt model), $A$ is no longer guaranteed to be positive definite and the update might diverge (a tiny numerical illustration follows this list).
- Proposition 1:
- Uniformly replaying transitions from a buffer containing full episodes and using them for the update above guarantees a stable algorithm.
- Proposition 2:
- Uniformly replaying states from a replay buffer, generating transitions from them with a learnt model, and using those for the TD update above can diverge.
- We are not completely doomed:
- We can iterate the model & sample transitions from the states and to the states.
- I did not get this one. Is the idea that impossible states won't be generated because we have no way of getting to them (the 'to' case above)?
- Use multi-step returns.
- Helps but leads to higher variance.
- There are some algorithms that help (mostly for the linear case), or one can sample whole trajectories.
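A tiny numerical illustration of the divergence point above. This is my own toy 'w, 2w'-style example, not the paper's: two states with 1-D features 1 and 2, zero reward, deterministic transition from the first to the second; updating only from the first state makes $A$ negative, so the expected update blows up.

```python
import numpy as np

gamma, alpha = 0.99, 0.1
x1, x2 = np.array([1.0]), np.array([2.0])   # features of the two states
A = np.outer(x1, x1 - gamma * x2)           # = 1 - 2*gamma < 0: not positive definite
b = np.zeros(1)                             # zero rewards

w = np.array([1.0])
for t in range(100):
    w = (np.eye(1) - alpha * A) @ w + alpha * b
print(w)  # grows without bound: TD diverges under the mismatched distribution
```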
And?
- Given a similar number of samples from the model, Rainbow DQN outperforms SimPLe (a model-based algorithm).
- In the intro, the authors say: "There are good reasons for building the capability to learn some sort of model of the world...models may allow transfer of knowledge in ways that policies and scalar value predictions do not."
- Why should a model transfer better than a policy?
- It is unclear to me whether, both in SimPLe and in this paper, the replay buffer is used when learning the model. If not, then comparing DQN + replay buffer against a model learnt online is unfair. If the buffer is used, then the setup becomes more complex, and it looks like we distill a nonparametric model (the buffer) into the parametric transition model.
- It would be great if the authors explained some of the design choices:
- Why does a model output the discounting coefficient?
- Why is a Dirichlet(1) prior used for the model?
- It would be cool if the authors did some experiments or reasoned about the following questions:
- It's believed that the larger the buffer is, the slower (but more stable) the learning is. How does this fit into the nonparametric model story?
- How does the choice of problem affect the model-based vs model-free comparison? In my intuition, learning a model is harder than learning a $Q$-function: for the latter we only need estimates accurate enough to pick the maximising action, whereas the transition model has to capture the whole next-state distribution.
- That's probably very related to the backward vs forward planning question.
- Does exploration for learning the $Q$-function differ from the exploration needed to learn a model?