What?
A massive multi-task RL (MTRL) and meta-RL (MLRL) benchmark.
Why?
Benchmarks have often driven progress in machine learning (and RL especially). There hasn't been an established benchmark for MTRL and MLRL.
How?
(Figure from the original paper.)
- MTRL problem statement:
- Maximise the average expected discounted return across all tasks (a small code sketch of this objective follows below):
- $\mathbb{E}_{\mathcal{T}\sim p(\mathcal{T})}\left[\mathbb{E}_{\pi}\left[\sum_{t=0}^{T}\gamma^t R_t(s_t, a_t)\right]\right]$
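In code, this objective is just the average of per-task discounted returns. A minimal Monte Carlo sketch; the `sample_task`, `rollout`, and `policy` names are illustrative placeholders, not from the paper:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    # sum_t gamma^t * R_t for a single episode
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def mtrl_objective(sample_task, rollout, policy, num_tasks=50, episodes_per_task=10):
    # Monte Carlo estimate of E_{T ~ p(T)}[ E_pi [ sum_t gamma^t R_t(s_t, a_t) ] ]:
    # average the discounted return over sampled tasks and over rollouts per task.
    per_task_returns = []
    for _ in range(num_tasks):
        task = sample_task()  # T ~ p(T)
        returns = [discounted_return(rollout(policy, task))  # one episode's rewards
                   for _ in range(episodes_per_task)]
        per_task_returns.append(np.mean(returns))
    return float(np.mean(per_task_returns))
```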
- Meta-RL problem statement:
- Quickly adapt to new test tasks, where meta-train and meta-test tasks are drawn from the same task distribution.
- Two axes of variability (a toy sketch follows this list):
- parametric variability:
- each task is induced by a set of parameters (e.g. the goal position);
- parameters are sampled from a continuous distribution;
- the typical MLRL scenario.
- non-parametric variability:
- drastic, discrete changes across tasks (e.g. open the window vs. open the drawer);
- more common in MTRL.
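To make the distinction concrete, here is a toy sketch of the two kinds of task sampling; the task names and the goal range are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_parametric_task():
    # Parametric variability: one task family ("reach"), only the goal position
    # changes, and it is drawn from a continuous distribution.
    goal = rng.uniform(low=[-0.1, 0.8, 0.05], high=[0.1, 0.9, 0.3])
    return {"family": "reach", "goal": goal}

def sample_nonparametric_task():
    # Non-parametric variability: tasks differ discretely and qualitatively.
    return {"family": rng.choice(["open-window", "open-drawer", "press-button"])}
```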
- Environments
- Shared action space;
- Same observation dimension across the environments:
- for convenience;
- some dimensions are not used for some of the tasks (a padding sketch follows below);
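A quick sketch of how a shared observation size can be handled in practice. This is my illustration of the idea, not the paper's exact observation layout: unused dimensions can simply be zero-padded, and MT10/MT50 (described below) additionally provide the task id as part of the observation, e.g. as a one-hot vector.

```python
import numpy as np

OBS_DIM = 12   # shared observation size across tasks (illustrative value)

def pad_observation(task_obs):
    # Zero-pad a task-specific observation so a single network can consume
    # observations from every task; unused dimensions just stay zero.
    obs = np.zeros(OBS_DIM, dtype=np.float32)
    obs[: len(task_obs)] = task_obs
    return obs

def append_task_id(obs, task_id, num_tasks):
    # One possible way of providing the task id: concatenate a one-hot vector.
    one_hot = np.zeros(num_tasks, dtype=np.float32)
    one_hot[task_id] = 1.0
    return np.concatenate([obs, one_hot])
```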
- Benchmarks (a usage sketch of the API follows this list):
- ML1
- few-shot adaptation to goal variation
- the typical meta-RL setting
- MT10, MT50
- learn one policy for all of the tasks;
- task id is provided as a part of the observation;
- ML10/ML45
- few-shot adaptation to new test tasks
- More challenging MLRL setting;
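For reference, constructing and using one of the benchmarks looks roughly like this in the metaworld package. This is based on the repo's README; task names and API details may differ between versions, so treat it as a sketch rather than a guaranteed recipe:

```python
import random
import metaworld

ml1 = metaworld.ML1('pick-place-v2')           # ML1: one task family, goal variation only
env = ml1.train_classes['pick-place-v2']()     # instantiate the environment
env.set_task(random.choice(ml1.train_tasks))   # sample a goal-parameterised training task

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
```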
- Success metrics:
- Rewards are often not indicative of how successful the policy is.
- In Meta-World, success is defined via the distance of the task-relevant object to the goal position: a task counts as solved once this distance falls below a threshold (a small sketch follows below);
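A minimal sketch of such a success check; the threshold value and names are illustrative, not the paper's exact constants:

```python
import numpy as np

def is_success(object_pos, goal_pos, threshold=0.05):
    # Success = the task-relevant object ends up close enough to the goal,
    # regardless of how much shaped reward was collected along the way.
    return bool(np.linalg.norm(np.asarray(object_pos) - np.asarray(goal_pos)) < threshold)
```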
- Experiments:
- MTRL
- Still an unsolved problem.
- With 50 tasks, the performance is really bad (less than 50% of the tasks are solved)
- Multi-headed MTRL SAC performs best;
- MLRL
- Same sub-par performance on the whole suite.
- Interestingly, RL^2 is much better on ML45, but the gap between it and PEARL is smaller on ML10.
- Nothing is solved, everything is exciting.
And?
- It would be great if this became the standard benchmark for MTRL.
- The ability to compare multiple SOTA methods is priceless. It would be great to have more analysis on why the gap between methods is drastically different on ML10 and ML45.
- I wanted to check out the code, and it turned out that the baselines are not in the repo; there is a link to another repo instead. That repo is huge, and it would be great if the authors provided instructions on how to replicate the results.
This note is a part of my paper notes series. You can find more here or on Twitter. I also have a blog.