What?

$\epsilon$-greedy with randomly-sampled action-repeat.

Why?

Exploration is an important problem in RL. $\epsilon$-greedy is dumb, but it is simple and often beats more sophisticated approaches. Can we keep it simple and still do better than $\epsilon$-greedy?

How?

Source: original paper. Check out my version below!

<aside> 💡 Main idea: Take $\epsilon$-greedy; on exploration steps, sample a duration (shorter than the horizon) in addition to the action, and repeat that action for the sampled number of steps.

</aside>

Pseudocode (Appendix B1 from the paper):

def ez_greedy(Q, eps, z):
  n = 0   # remaining duration of the repeated action
  w = -1  # currently assigned (repeated) action
  s = env.reset()
  done = False
  while not done:
    if n == 0:
      if random() < eps:
        n = z.sample()               # sample duration n ~ z
        w = action_space.uniform()   # sample a uniformly random action
        a = w
      else:
        a = argmax(Q(s))             # greedy action
    else:
      a = w
      n = n - 1                      # reduce the remaining duration by one
    s, r, done, _ = env.step(a)
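
To make this concrete, here is a small self-contained version packaged as a policy object. This is my own sketch, not the paper's code: the `EzGreedy` class, the callable `Q`, and the dummy usage at the bottom are illustrative assumptions, and the duration is drawn with NumPy's Zipf sampler (the zeta distribution discussed next).

```python
import numpy as np

class EzGreedy:
    """Temporally-extended eps-greedy: on exploration, repeat a random action
    for a duration sampled from a zeta (Zipf) distribution."""

    def __init__(self, Q, num_actions, eps=0.1, mu=2.0, seed=None):
        self.Q = Q                     # callable: state -> array of action values (assumption)
        self.num_actions = num_actions
        self.eps = eps
        self.mu = mu                   # zeta exponent; the paper uses mu = 2
        self.rng = np.random.default_rng(seed)
        self.n = 0                     # remaining duration of the repeated action
        self.w = -1                    # currently assigned (repeated) action

    def act(self, s):
        if self.n == 0:
            if self.rng.random() < self.eps:
                self.n = int(self.rng.zipf(self.mu))                # duration ~ zeta(mu)
                self.w = int(self.rng.integers(self.num_actions))   # uniform random action
                return self.w
            return int(np.argmax(self.Q(s)))                        # greedy action
        self.n -= 1                                                 # one repeat step used up
        return self.w

# Illustrative usage with a dummy Q-function over 4 actions:
policy = EzGreedy(Q=lambda s: np.zeros(4), num_actions=4, eps=0.3, seed=0)
for t in range(10):
    a = policy.act(s=0)
    # s, r, done, _ = env.step(a)  # plug into your environment loop here
```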

The duration can be sampled in multiple ways, but the authors stick to the zeta (Zipf) distribution (hence the name $\epsilon z$-greedy):

$$ z(n) \propto n^{-\mu}, $$

with $\mu=2$.
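
If you want to reproduce the duration sampling on its own, NumPy already ships a Zipf sampler; here is a minimal sketch. Truncating at the episode horizon is my own addition, matching the "duration < horizon" idea above rather than anything prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_duration(mu=2.0, horizon=100):
    """Draw n with P(n) proportional to n**(-mu), truncated at the horizon."""
    n = int(rng.zipf(mu))      # heavy-tailed: mostly 1s and 2s, occasionally very long
    return min(n, horizon)     # keep the repeat length below the episode horizon

print([sample_duration() for _ in range(10)])
```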

And?