RLlib Reference Results
Benchmarks of RLlib algorithms against published results. These benchmarks are a work in progress. For other results to compare against, see yarlp and more plots from OpenAI.
Ape-X Distributed Prioritized Experience Replay
rllib train -f atari-apex/atari-apex.yaml
Comparison of RLlib Ape-X to Async DQN after 10M time-steps (40M frames). Results compared to learning curves from Mnih et al, 2016 extracted at 10M time-steps from Figure 3.
env | RLlib Ape-X 8-workers | Mnih et al Async DQN 16-workers | Mnih et al DQN 1-worker |
---|---|---|---|
BeamRider | 6134 | ~6000 | ~3000 |
Breakout | 123 | ~50 | ~10 |
QBert | 15302 | ~1200 | ~500 |
SpaceInvaders | 686 | ~600 | ~500 |
Here we use only eight workers per environment in order to run all experiments concurrently on a single g3.16xl machine. Further speedups may be obtained by using more workers. Comparing wall-time performance after 1 hour of training:
env | RLlib Ape-X 8-workers | Mnih et al Async DQN 16-workers | Mnih et al DQN 1-worker |
---|---|---|---|
BeamRider | 4873 | ~1000 | ~300 |
Breakout | 77 | ~10 | ~1 |
QBert | 4083 | ~500 | ~150 |
SpaceInvaders | 646 | ~300 | ~160 |
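To scale beyond the eight workers used above, the same experiment can also be launched programmatically through Ray Tune. The snippet below is a minimal sketch only, assuming the Ray 1.x-era tune.run API; the env name and hyperparameter values are illustrative and not the exact atari-apex.yaml settings.

```python
# Minimal sketch (not the benchmark config): launch Ape-X via Ray Tune with a
# larger rollout-worker count than the 8 used in the tables above.
# Requires the Atari gym dependencies; values below are illustrative.
import ray
from ray import tune

ray.init()

tune.run(
    "APEX",
    stop={"timesteps_total": 10_000_000},
    config={
        "env": "BreakoutNoFrameskip-v4",
        "num_workers": 16,          # more workers -> faster sampling, per the note above
        "num_envs_per_worker": 8,
        "num_gpus": 1,
    },
)
```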
IMPALA and A2C
rllib train -f atari-impala/atari-impala.yaml
rllib train -f atari-a2c/atari-a2c.yaml
RLlib IMPALA and A2C on 10M time-steps (40M frames). Results compared to learning curves from Mnih et al, 2016 extracted at 10M time-steps from Figure 3.
env | RLlib IMPALA 32-workers | RLlib A2C 5-workers | Mnih et al A3C 16-workers |
---|---|---|---|
BeamRider | 2071 | 1401 | ~3000 |
Breakout | 385 | 374 | ~150 |
QBert | 4068 | 3620 | ~1000 |
SpaceInvaders | 719 | 692 | ~600 |
IMPALA and A2C vs A3C after 1 hour of training:
env | RLlib IMPALA 32-workers | RLlib A2C 5-workers | Mnih et al A3C 16-workers |
---|---|---|---|
BeamRider | 3181 | 874 | ~1000 |
Breakout | 538 | 268 | ~10 |
QBert | 10850 | 1212 | ~500 |
SpaceInvaders | 843 | 518 | ~300 |
Pong in 3 minutes
With a bit of tuning, RLlib IMPALA can solve Pong in ~3 minutes:
rllib train -f pong-speedrun/pong-impala-fast.yaml
DQN / Rainbow
rllib train -f atari-dqn/basic-dqn.yaml
rllib train -f atari-dqn/duel-ddqn.yaml
rllib train -f atari-dqn/dist-dqn.yaml
RLlib DQN after 10M time-steps (40M frames). Note that RLlib evaluation scores include the 1% random actions of epsilon-greedy exploration. You can expect slightly higher rewards when rolling out the policies without any exploration at all.
env | RLlib Basic DQN | RLlib Dueling DDQN | RLlib Distributional DQN | Hessel et al. DQN | Hessel et al. Rainbow |
---|---|---|---|---|---|
BeamRider | 2869 | 1910 | 4447 | ~2000 | ~13000 |
Breakout | 287 | 312 | 410 | ~150 | ~300 |
QBert | 3921 | 7968 | 15780 | ~4000 | ~20000 |
SpaceInvaders | 650 | 1001 | 1025 | ~500 | ~2000 |
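As noted above, the reported scores still include the 1% random actions from epsilon-greedy exploration. Below is a minimal sketch of evaluating a trained policy without exploration, assuming the Ray 1.x-era agents API; the env name and checkpoint path are placeholders.

```python
# Minimal sketch: greedy (exploration-free) evaluation of a trained DQN policy.
# "evaluation_config": {"explore": False} disables the epsilon-greedy random
# actions during the periodic evaluation rollouts, so those scores tend to be
# slightly higher than the training-time rewards in the table above.
import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init()

trainer = DQNTrainer(config={
    "env": "BreakoutNoFrameskip-v4",            # placeholder env
    "evaluation_interval": 1,                    # run evaluation on every train() call
    "evaluation_num_episodes": 10,
    "evaluation_config": {"explore": False},     # greedy rollouts
})
trainer.restore("/path/to/dqn/checkpoint")       # placeholder checkpoint path
result = trainer.train()
print(result["evaluation"]["episode_reward_mean"])
```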
Proximal Policy Optimization
rllib train -f atari-ppo/atari-ppo.yaml
rllib train -f halfcheetah-ppo/halfcheetah-ppo.yaml
2018-09:
RLlib PPO with 10 workers (5 envs per worker) after 10M and 25M time-steps (40M/100M frames). Note that RLlib does not use clip parameter annealing.
env | RLlib PPO @10M | RLlib PPO @25M | Baselines PPO @10M |
---|---|---|---|
BeamRider | 2807 | 4480 | ~1800 |
Breakout | 104 | 201 | ~250 |
QBert | 11085 | 14247 | ~14000 |
SpaceInvaders | 671 | 944 | ~800 |
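A config sketch of the setup described above (10 rollout workers with 5 envs each), assuming the Ray 1.x-era tune.run API; values are illustrative rather than the exact atari-ppo.yaml hyperparameters. Note that clip_param is a fixed constant in RLlib, which is what "no clip parameter annealing" means here.

```python
# Minimal sketch of the PPO setup described above; not the tuned benchmark config.
from ray import tune

tune.run(
    "PPO",
    stop={"timesteps_total": 25_000_000},
    config={
        "env": "BreakoutNoFrameskip-v4",  # placeholder Atari env
        "num_workers": 10,                # 10 rollout workers ...
        "num_envs_per_worker": 5,         # ... with 5 vectorized envs each
        "clip_param": 0.1,                # stays constant; RLlib does not anneal it
        "num_gpus": 1,
    },
)
```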
RLlib PPO wall-time performance vs other implementations using a single Titan XP and the same number of CPUs. Results compared to learning curves from Fan et al, 2018 extracted at 1 hour of training from Figure 7. Here we get optimal results with a vectorization of 32 environment instances per worker:
env | RLlib PPO 16-workers | Fan et al PPO 16-workers | TF BatchPPO 16-workers |
---|---|---|---|
HalfCheetah | 9664 | ~7700 | ~3200 |
2020-01:
Same setup as 2018-09, comparing only RLlib's TensorFlow (tf) and PyTorch (torch) PPO implementations, here at 20M time-steps.
env | RLlib PPO @20M (tf) | RLlib PPO @20M (torch) |
---|---|---|
BeamRider | 4142 | 3850 |
Breakout | 132 | 166 |
QBert | 7987 | 14294 |
SpaceInvaders | 956 | 1016 |
Soft Actor Critic
rllib train -f halfcheetah-sac/halfcheetah-sac.yaml
RLlib SAC versus the SoftLearning reference implementation (Haarnoja et al, 2018), benchmarked at 500K and 3M time-steps.
env | RLlib SAC @500K | Haarnoja et al SAC @500K | RLlib SAC @3M | Haarnoja et al SAC @3M |
---|---|---|---|---|
HalfCheetah | 9000 | ~9000 | 13000 | ~15000 |
MAML
MAML uses additional metrics to measure performance: episode_reward_mean measures the agent's returns before adaptation, episode_reward_mean_adapt_N measures the agent's returns after N gradient steps of inner adaptation, and adaptation_delta measures the difference in performance before and after adaptation.
rllib train -f maml/halfcheetah-rand-direc-maml.yaml
rllib train -f maml/ant-rand-goal-maml.yaml
rllib train -f maml/pendulum-mass-maml.yaml
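The MAML-specific metrics above are written out with the rest of the training results. Below is a minimal sketch of inspecting them from a trial's progress.csv, assuming the metric names appear as top-level columns; the file path is a placeholder and the exact set of adapt_N columns depends on the configured number of inner-adaptation steps.

```python
# Minimal sketch: read MAML's adaptation metrics from a Tune trial's progress.csv.
import pandas as pd

df = pd.read_csv("path/to/maml_trial/progress.csv")  # placeholder path

# Returns before adaptation vs. after the last inner-adaptation step.
adapt_cols = sorted(
    (c for c in df.columns if c.startswith("episode_reward_mean_adapt_")),
    key=lambda c: int(c.rsplit("_", 1)[1]),
)
print("pre-adaptation:  ", df["episode_reward_mean"].iloc[-1])
print("post-adaptation: ", df[adapt_cols[-1]].iloc[-1])
print("adaptation_delta:", df["adaptation_delta"].iloc[-1])
```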
MB-MPO
rllib train -f mbmpo/halfcheetah-mbmpo.yaml
rllib train -f mbmpo/hopper-mbmpo.yaml
MBMPO uses additional metrics to measure performance. For each MBMPO iteration, MBMPO samples fake data from the transition-dynamics workers and steps through MAML for N iterations. MAMLIter$i$_DynaTrajInner_$j$_episode_reward_mean corresponds to the agent's performance across the dynamics models at the i-th MAML iteration and the j-th step of inner adaptation.
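A similar sketch can pull out these per-iteration, per-inner-step metrics, assuming they appear as top-level columns named with the pattern described above; the file path is a placeholder.

```python
# Minimal sketch: collect MAMLIter<i>_DynaTrajInner_<j>_episode_reward_mean columns.
import re
import pandas as pd

df = pd.read_csv("path/to/mbmpo_trial/progress.csv")  # placeholder path
pattern = re.compile(r"MAMLIter(\d+)_DynaTrajInner_(\d+)_episode_reward_mean")

for col in df.columns:
    m = pattern.fullmatch(col)
    if m:
        i, j = map(int, m.groups())
        print(f"MAML iter {i}, inner step {j}: {df[col].iloc[-1]:.1f}")
```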
RLlib MBMPO versus Clavera et al, 2018, benchmarked at 100K time-steps. Results reported below were run on RLlib and on the master branch of the original codebase, respectively.
env | RLlib MBMPO @100K | Clavera et al MBMPO @100K |
---|---|---|
HalfCheetah | 520 | ~550 |
Hopper | 620 | ~650 |
Dreamer
rllib train -f dreamer/dreamer-deepmind-control.yaml
RLlib Dreamer versus the Google reference implementation (Hafner et al, 2020), benchmarked at 100K and 1M time-steps.
env | RLlib Dreamer @100K | Hafner et al Dreamer @100K | RLlib Dreamer @1M | Hafner et al Dreamer @1M |
---|---|---|---|---|
Walker | 320 | ~250 | 920 | ~930 |
Cheetah | 300 | ~250 | 640 | ~800 |
RLlib Dreamer also logs gifs of Dreamer's imagined trajectories (Top: Ground truth, Middle: Model prediction, Bottom: Delta).
CQL
rllib train -f halfcheetah-cql/halfcheetah-cql.yaml
rllib train -f halfcheetah-cql/halfcheetah-bc.yaml
Since CQL is an offline RL algorithm, CQL's returns are evaluated only during the evaluation loop (once every 1000 gradient steps for MuJoCo-based envs).
RLlib CQL versus Behavior Cloning (BC), benchmarked at 1M gradient steps over datasets derived from the D4RL benchmark (Fu et al, 2020). Results reported below were run on RLlib. The only difference between the BC and CQL runs is the bc_iters parameter in CQL (the number of initial iterations trained with the pure BC loss).
RLlib's CQL is evaluated on four different environments: HalfCheetah-Random-v0 and Hopper-Random-v0 contain datasets collected by a random policy, while HalfCheetah-Medium-v0 and Hopper-Medium-v0 contain datasets collected by a policy trained 1/3 of the way through. In all envs, CQL does better than BC by a significant margin (especially HalfCheetah-Random-v0).
env | RLlib BC @1M | RLlib CQL @1M |
---|---|---|
HalfCheetah-Random-v0 | -320 | 3000 |
Hopper-Random-v0 | 290 | 320 |
HalfCheetah-Medium-v0 | 3450 | 3850 |
Hopper-Medium-v0 | 1000 | 2000 |
rllib train -f cql/halfcheetah-cql.yaml & rllib train -f cql/halfcheetah-bc.yaml
rllib train -f cql/hopper-cql.yaml & rllib train -f cql/hopper-bc.yaml
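A config sketch highlighting the bc_iters knob discussed above, assuming the Ray 1.x-era tune.run API; the env name, dataset path, and values are illustrative and not the exact halfcheetah-cql.yaml settings.

```python
# Minimal sketch of an offline CQL run. "bc_iters" is the number of initial
# iterations trained with the pure BC loss before the CQL objective kicks in;
# setting it very high effectively recovers the BC baseline in the table above.
from ray import tune

tune.run(
    "CQL",
    stop={"training_iteration": 1000},
    config={
        "env": "HalfCheetah-v3",                       # used only for evaluation rollouts
        "input": "/path/to/offline_dataset",           # placeholder offline data
        "bc_iters": 20000,
        "evaluation_interval": 1,                      # returns come from the eval loop only
        "evaluation_config": {"input": "sampler", "explore": False},
    },
)
```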
Transformers
rllib train -f vizdoom-attention/vizdoom-attention.yaml
RLlib's model catalog implements a variety of models for the policy and value networks, one of which supports using attention in RL. In particular, RLlib implements the Gated Transformer-XL (Parisotto et al, 2019), abbreviated as GTrXL.
GTrXL is benchmarked in the Vizdoom environment, where the goal is to shoot a monster as quickly as possible. With PPO as the algorithm and GTrXL as the model, RLlib can successfully solve the Vizdoom environment and reach human-level performance.
env | RLlib Transformer @2M |
---|---|
VizdoomBasic-v0 | ~75 |
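Enabling the GTrXL model is done through the model config in RLlib's model catalog. The snippet below is a minimal sketch, assuming the Ray 1.x-era attention options; the env and values are illustrative stand-ins rather than the tuned vizdoom-attention.yaml settings (the benchmark itself uses VizdoomBasic-v0, which requires the VizDoom dependencies).

```python
# Minimal sketch: PPO with the GTrXL attention model from RLlib's model catalog.
from ray import tune

tune.run(
    "PPO",
    stop={"timesteps_total": 2_000_000},
    config={
        "env": "CartPole-v0",                        # stand-in env for illustration
        "model": {
            "use_attention": True,                   # swap the default net for GTrXL
            "attention_num_transformer_units": 1,
            "attention_dim": 64,
            "attention_num_heads": 2,
            "attention_memory_inference": 50,
            "attention_memory_training": 50,
        },
    },
)
```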