Safe Policy Optimization (SafePO) is a comprehensive algorithm benchmark for Safe Reinforcement Learning (Safe RL). It provides the RL research community with a unified platform for developing and evaluating algorithms in various safe reinforcement learning environments. To better help the community study this problem, SafePO is developed with the following key features:
- Comprehensive Safe RL benchmark: We offer high-quality implementations of both single-agent safe reinforcement learning algorithms (CPO, PCPO, FOCOPS, PPO-Lag, TRPO-Lag, CUP, CPPO-PID, and RCPO) and multi-agent safe reinforcement learning algorithms (HAPPO, MAPPO-Lag, IPPO, MACPO, and MAPPO).
- Richer interfaces: In SafePO, you can modify algorithm parameters to suit your needs, passing the values you want to change via argparse at the terminal.
- Single-file style: SafePO adopts a single-file style to implement algorithms, aiming to function as an algorithm library that integrates tutorial and tool capabilities. This design choice prioritizes readability and extensibility, albeit at the expense of inheritance and code simplicity. Unlike modular frameworks, users can grasp the essence of an algorithm without needing to navigate the entire library.
- More information: We provide rich data visualization methods. Reinforcement learning algorithms typically involve a huge number of parameters. To better track how each parameter changes during training, we use log files, TensorBoard, and wandb to visualize them. We believe this will help developers tune each algorithm more efficiently (see the example commands after this list).
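As an example of both points, and assuming the ppo_lag.py training script and flags described in the Getting Started section below, a hyperparameter can be overridden at the terminal and the logged curves inspected with TensorBoard (installed separately); treat this as an illustrative sketch:
cd safepo/single_agent
# override the total training steps via an argparse flag
python ppo_lag.py --env-id SafetyPointGoal1-v0 --total-steps 2000000
# then point TensorBoard at the log directory (the default --log-dir is ../runs)
tensorboard --logdir ../runs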
- Overview of Algorithms
- Supported Environments
- Safety-Gymnasium
- Safe-Dexterous-Hands
- What's More
- Pre-requisites
- Conda-Environment
- Getting Started
- Machine Configuration
- Ethical and Responsible Use
- PKU-Alignment Team
Overview of Algorithms
Here we provide a table of Safe RL algorithms that the benchmark includes.
Algorithm | Proceedings & Citations | Official Code Repo |
---|---|---|
PPO-Lag | | Tensorflow 1 |
TRPO-Lag | | Tensorflow 1 |
CUP | NeurIPS 2022 (Cite: 6) | Pytorch |
FOCOPS | NeurIPS 2020 (Cite: 27) | Pytorch |
CPO | ICML 2017 (Cite: 663) | |
PCPO | ICLR 2020 (Cite: 67) | Theano |
RCPO | ICLR 2019 (Cite: 238) | |
CPPO-PID | NeurIPS 2020 (Cite: 71) | Pytorch |
MACPO | Preprint (Cite: 4) | Pytorch |
MAPPO-Lag | Preprint (Cite: 4) | Pytorch |
HAPPO (Purely reward optimisation) | ICLR 2022 (Cite: 10) | Pytorch |
MAPPO (Purely reward optimisation) | Preprint (Cite: 98) | Pytorch |
Supported Environments
Safety-Gymnasium
Here is a list of all the environments Safety-Gymnasium supports for now; some are being tested in our baselines, and we will gradually release them in later updates. For more details, please refer to Safety-Gymnasium.
Category | Task | Agent | Example |
---|---|---|---|
Safe Navigation | Goal[012] | Point, Car, Doggo, Racecar, Ant | SafetyPointGoal1-v0 |
Button[012] | |||
Push[012] | |||
Circle[012] | |||
Velocity | Velocity | HalfCheetah, Hopper, Swimmer, Walker2d, Ant, Humanoid | SafetyAntVelocity-v1 |
Note: Safe velocity tasks support both single-agent and multi-agent algorithms, while safe navigation tasks currently support only single-agent algorithms.
Safe-Dexterous-Hands
Note: These tasks currently support multi-agent algorithms only.
Prerequisites
We use Anaconda to create virtual environments. To install Anaconda, follow the instructions here.
Ensure that Isaac Gym works on your system by running one of the examples from the python/examples directory, such as joint_monkey.py. Please follow the troubleshooting steps described in the Isaac Gym Preview Release 3/4 install instructions if you have any trouble running the samples.
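For instance, a minimal sanity check could look like the following, assuming Isaac Gym is extracted at ~/isaacgym (adjust the path to your installation):
cd ~/isaacgym/python/examples
# run a bundled Isaac Gym example to verify the installation
python joint_monkey.py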
Selected Tasks
We implement several different constraints on top of the base environments, expanding the setting to both single-agent and multi-agent.
What's More
Our team has also designed a number of more interesting safety tasks for two-handed dexterous manipulation, and we will soon release the code for use by more Safe RL researchers.
Base Environments | Description |
---|---|
ShadowHandOverWall | None |
ShadowHandOverWallDown | None |
ShadowHandCatchOver2UnderarmWall | None |
ShadowHandCatchOver2UnderarmWallDown | None |
Pre-requisites
To use SafePO-Baselines, you need to install the environments. Please refer to MuJoCo and Safety-Gymnasium for more details on installation. Details regarding the installation of IsaacGym can be found here. We currently support the Preview Release 3 version of IsaacGym.
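For example, once the Conda environment described below is active, Safety-Gymnasium can typically be installed from PyPI; check the Safety-Gymnasium documentation for the currently recommended method:
# sketch only: install Safety-Gymnasium (pulls in MuJoCo bindings)
pip install safety-gymnasium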
Conda-Environment
conda create -n safe python=3.8
conda activate safe
# Because of the CUDA version, we recommend installing PyTorch manually.
pip install -e .
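As a sketch of that manual PyTorch step, you could install a CUDA build matching your driver; the exact command depends on your setup, so pick the right one from pytorch.org (the index URL below is only an example for CUDA 11.8):
# example only: choose the build matching your CUDA version on pytorch.org
pip install torch --index-url https://download.pytorch.org/whl/cu118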
For detailed instructions, please refer to Installation.md.
Getting Started
Single-Agent
Each algorithm file is the entry point. Running ALGO.py with arguments specifying the algorithm and environment starts training. For example, to run PPO-Lag in SafetyPointGoal1-v0 with seed 0, you can use the following command:
cd safepo/single_agent
python ppo_lag.py --env-id SafetyPointGoal1-v0 --seed 0
To run a benchmark in parallel, you can, for example, use the following command to run PPO-Lag and TRPO-Lag in SafetyAntVelocity-v1 and SafetyHalfCheetahVelocity-v1:
cd safepo/single_agent
python benchmark.py --env-id SafetyAntVelocity-v1 SafetyHalfCheetahVelocity-v1 --algo ppo_lag trpo_lag --workers 2
The command above runs two processes in parallel; each process runs one algorithm in one environment. The results are saved in ./runs/.
Here we provide the list of arguments:
Argument | Default | Info |
---|---|---|
--seed | 0 | the random seed of the experiment |
--device | cpu | the device (cpu or cuda) to run the code |
--torch-threads | 4 | number of threads for torch |
--total-steps | 10000000 | total timesteps of the experiments |
--env-id | SafetyPointGoal1-v0 | the id of the environment |
--use-eval | False | toggles evaluation |
--eval-episodes | 1 | the number of episodes for final evaluation |
--steps-per-epoch | 20000 | the number of steps to run in each environment per rollout |
--update-iters | 10 | the max iteration to update the policy |
--batch-size | 64/128 | the number of mini-batches |
--entropy-coef | 0.0 | coefficient of the entropy |
--target-kl | 0.01/0.02 | the target KL divergence threshold |
--max-grad-norm | 40.0 | the maximum norm for the gradient clipping |
--critic-norm-coef | 0.001 | the critic norm coefficient |
--gamma | 0.99 | the discount factor gamma |
--lam | 0.95 | the lambda for generalized advantage estimation (GAE) of rewards |
--lam-c | 0.95 | the lambda for generalized advantage estimation (GAE) of costs |
--standardized-adv-r | True | toggles reward advantages standardization |
--standardized-adv-c | True | toggles cost advantages standardization |
--critic-lr | 1e-3 | the learning rate of the critic network |
--actor-lr | 3e-4/None | the learning rate of the actor network |
--log-dir | ../runs | directory to save agent logs |
--write-terminal | True | toggles terminal logging |
--fvp-sample-freq | 1 | the sub-sampling rate of observations for Fisher-vector product computation |
--cg-damping | 0.1 | the damping value for conjugate gradient |
--cg-iters | 15 | the number of conjugate gradient iterations |
--backtrack-iters | 15 | the number of backtracking line search iterations |
--backtrack-coef | 0.8 | the coefficient for backtracking line search |
--safety-bound | 25.0 | the cost limit for the safety constraint |
Note: Some hyperparameters vary across algorithms. For more details, please refer to the corresponding code files.
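For instance, several of the flags above can be combined in one command. The sketch below assumes a trpo_lag.py entry file, as suggested by the benchmark command earlier; the exact set of accepted flags may differ slightly between algorithm files:
cd safepo/single_agent
# run TRPO-Lag on GPU with a custom seed, step budget, and cost limit
python trpo_lag.py --env-id SafetyAntVelocity-v1 --seed 5 --device cuda --total-steps 2000000 --safety-bound 25.0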
Multi-Agent
We also provide a safe MARL algorithm benchmark for safe MARL research on the challenging Safety DexterousHands and Safety-Gymnasium multi-agent velocity tasks. HAPPO, IPPO, MACPO, MAPPO-Lag, and MAPPO have already been implemented.
Safety DexterousHands
safepo/multi_agent/train_hand.py is the entry file. Running train_hand.py with arguments specifying the algorithm and task starts training. For example, you can use the following command:
cd safepo/multi_agent
# algo: macpo, mappolag, mappo, happo
python train_hand.py --task=ShadowHandOver --algo=macpo
Safety-Gymnasium Multi-agent Velocity
Note: These tasks are still under development. We will release the code as soon as possible.
safepo/multi_agent/train_vel.py is the entry file. Running train_vel.py with arguments specifying the algorithm and task starts training. For example, you can use the following command to run MACPO in Safety2x4AntVelocity-v0 with default arguments:
cd safepo/multi_agent
# algo: macpo, mappolag, mappo, ippo, happo
python train_vel.py --task=Safety2x4AntVelocity-v0 --algo=macpo
The SafePO multi-agent algorithms share almost all hyperparameters across the Safety DexterousHands and Safety-Gymnasium multi-agent velocity tasks. The hyperparameters that differ are listed below:
Argument | Info | Default (for Safety DexterousHands) | Default (for Safety-Gymnasium multi-agent velocity) |
---|---|---|---|
--episode-length | Episode length | 8 | 200 |
--num-env-steps | The number of total steps | 100000000 | 10000000 |
--n-rollout-threads | The number of parallel rollout environments | 80 | 32 |
--hidden-size | The size of hidden layers of neural network | 512 | 128 |
--entropy-coef | The coefficient of entropy | 0.00 | 0.01 |
--use-value-active-masks | Whether to use value active masks | False | True |
--use-policy-active-masks | Whether to use policy active masks | False | True |
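For example, the defaults above can be overridden on the command line in the same way as the task and algorithm are specified; the command below is only an illustrative sketch using flags from the table:
cd safepo/multi_agent
# override the step budget and entropy coefficient for a velocity task
python train_vel.py --task=Safety2x4AntVelocity-v0 --algo=happo --num-env-steps 20000000 --entropy-coef 0.0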
Multi-Agent Benchmark
To run a benchmark in parallel, you can, for example, use the following command to run MACPO and MAPPO in Safety2x4AntVelocity-v0 and Safety6x1HalfCheetahVelocity-v0:
cd safepo/multi_agent
# algo: macpo, mappo
python velocity_benchmark.py --algo macpo mappo --tasks Safety2x4AntVelocity-v0 Safety6x1HalfCheetahVelocity-v0 --workers 1 --exp-name benchmark
After running the benchmark, you can use the following command to plot the results:
cd safepo/multi_agent
python plot.py --logdir ./runs/benchmark
To get the evaluation results, you can use the following command:
cd safepo/multi_agent
python eval.py
Note: To run an evaluation, you need to modify the eval.py file and specify the basedir. For example:
basedir = './runs/benchmark'
Machine Configuration
We test all algorithms and experiments on a machine with an AMD Ryzen Threadripper PRO 3975WX 32-Core CPU and an NVIDIA GeForce RTX 3090 GPU (driver version 495.44).
Ethical and Responsible Use
SafePO aims to benefit the safe RL research community and is released under the Apache-2.0 license. Illegal usage or any violation of the license is not allowed.
PKU-Alignment Team
This baseline is a project contributed by the PKU-Alignment team at Peking University. We also thank the contributors of the following open-source repositories: Spinning Up, Bullet-Safety-Gym, and Safety-Gym.