What?
A massive multi-task RL (MTRL) and meta-RL (MLRL) benchmark.
Why?
Benchmarks have often driven progress in machine learning (and RL especially). There hasn't been an established benchmark for MTRL and MLRL.
How?
(Figure from the original paper.)
- MTRL problem statement:
- Maximise the average expected discounted return across all tasks (a small code sketch of this objective follows below):
- $\mathbb{E}_{\mathcal{T}\sim p(\mathcal{T})}\left[\mathbb{E}_{\pi}\left[\sum_{t=0}^{T}\gamma^t R_t(s_t, a_t)\right]\right]$
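In code, this objective is just the average of per-task discounted returns. A minimal Monte Carlo sketch; the `sample_task`, `rollout`, and `policy` names are illustrative placeholders, not from the paper:

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    # sum_t gamma^t * R_t for a single episode
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def mtrl_objective(sample_task, rollout, policy, num_tasks=50, episodes_per_task=10):
    # Monte Carlo estimate of E_{T ~ p(T)}[ E_pi [ sum_t gamma^t R_t(s_t, a_t) ] ]:
    # average the discounted return over sampled tasks and over rollouts per task.
    per_task_returns = []
    for _ in range(num_tasks):
        task = sample_task()  # T ~ p(T)
        returns = [discounted_return(rollout(policy, task))  # one episode's rewards
                   for _ in range(episodes_per_task)]
        per_task_returns.append(np.mean(returns))
    return float(np.mean(per_task_returns))
```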
- Meta-RL problem statement:
- Quickly adapt to new test tasks, where meta-train and meta-test tasks are drawn from the same task distribution.
- Two axes of variability (a toy sketch follows this list):
- parametric variability:
- each task is induced by a set of parameters (e.g. the goal position);
- parameters are sampled from a continuous distribution;
- the typical MLRL scenario.
- non-parametric variability:
- drastic, discrete changes across tasks (e.g. open the window vs. open the drawer);
- more common in MTRL.
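To make the distinction concrete, here is a toy sketch of the two kinds of task sampling; the task names and the goal range are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_parametric_task():
    # Parametric variability: one task family ("reach"), only the goal position
    # changes, and it is drawn from a continuous distribution.
    goal = rng.uniform(low=[-0.1, 0.8, 0.05], high=[0.1, 0.9, 0.3])
    return {"family": "reach", "goal": goal}

def sample_nonparametric_task():
    # Non-parametric variability: tasks differ discretely and qualitatively.
    return {"family": rng.choice(["open-window", "open-drawer", "press-button"])}
```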
- Environments
- Shared action space;
- Same observation dimension across the environments:
- for convenience;
- some dimensions are not used for some of the tasks (a padding sketch follows below);
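A quick sketch of how a shared observation size can be handled in practice. This is my illustration of the idea, not the paper's exact observation layout: unused dimensions can simply be zero-padded, and MT10/MT50 (described below) additionally provide the task id as part of the observation, e.g. as a one-hot vector.

```python
import numpy as np

OBS_DIM = 12   # shared observation size across tasks (illustrative value)

def pad_observation(task_obs):
    # Zero-pad a task-specific observation so a single network can consume
    # observations from every task; unused dimensions just stay zero.
    obs = np.zeros(OBS_DIM, dtype=np.float32)
    obs[: len(task_obs)] = task_obs
    return obs

def append_task_id(obs, task_id, num_tasks):
    # One possible way of providing the task id: concatenate a one-hot vector.
    one_hot = np.zeros(num_tasks, dtype=np.float32)
    one_hot[task_id] = 1.0
    return np.concatenate([obs, one_hot])
```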
- Benchmarks (a usage sketch of the API follows this list):
- ML1
- few-shot adaptation to goal variation
- the typical meta-RL setting
- MT10, MT50
- learn one policy for all of the tasks;
- task id is provided as a part of the observation;
- ML10/ML45
- few-shot adaptation to new test tasks
- More challenging MLRL setting;
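For reference, constructing and using one of the benchmarks looks roughly like this in the metaworld package. This is based on the repo's README; task names and API details may differ between versions, so treat it as a sketch rather than a guaranteed recipe:

```python
import random
import metaworld

ml1 = metaworld.ML1('pick-place-v2')           # ML1: one task family, goal variation only
env = ml1.train_classes['pick-place-v2']()     # instantiate the environment
env.set_task(random.choice(ml1.train_tasks))   # sample a goal-parameterised training task

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())
```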
- Success metrics:
- Rewards are often not indicative of how successful the policy is.
- In Meta-World, success is defined via the distance of the task-relevant object to the goal position: a task counts as solved once this distance falls below a threshold (a small sketch follows below);
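A minimal sketch of such a success check; the threshold value and names are illustrative, not the paper's exact constants:

```python
import numpy as np

def is_success(object_pos, goal_pos, threshold=0.05):
    # Success = the task-relevant object ends up close enough to the goal,
    # regardless of how much shaped reward was collected along the way.
    return bool(np.linalg.norm(np.asarray(object_pos) - np.asarray(goal_pos)) < threshold)
```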
- Experiments:
- MTRL
- Still an unsolved problem.
- With 50 tasks, the performance is really bad (less than 50% of the tasks are solved)
- Multi-headed MTRL SAC performs best;
- MLRL
- Same sub-par performance on the whole suite.
- Interestingly, RL^2 is much better on ML45, but the gap between it and PEARL is smaller on ML10.
- Nothing is solved, everything is exciting.
And?
- It would be great if this became the standard benchmark for MTRL.
- The ability to compare multiple SOTA methods is priceless. It would be great to have more analysis on why the gap between methods is drastically different on ML10 and ML45.
- I wanted to check out the code, and it turned out that the baselines are not in the repo; there is a link to another repo instead. That repo is huge, and it would be great if the authors provided instructions on how to replicate the results.
This note is a part of my paper notes series. You can find more here or on Twitter. I also have a blog.