SafePO is a benchmark repository for safe reinforcement learning algorithms.

Safe Policy Optimization (SafePO) is a comprehensive algorithm benchmark for Safe Reinforcement Learning (Safe RL). It provides the RL research community with a unified platform for implementing and evaluating algorithms in various safe reinforcement learning environments. To better help the community study this problem, SafePO is developed with the following key features:

  • Comprehensive Safe RL benchmark: We offer high-quality implementations of both single-agent safe reinforcement learning algorithms (CPO, PCPO, FOCOPS, PPO-Lag, TRPO-Lag, CUP, CPPO-PID, and RCPO) and multi-agent safe reinforcement learning algorithms (HAPPO, MAPPO-Lag, IPPO, MACPO, and MAPPO).
  • Richer interfaces: In SafePO, you can modify the parameters of each algorithm according to your requirements, passing the parameters you want to change via argparse at the terminal (see the sketch after this list).
  • Single-file style: SafePO adopts a single-file style to implement algorithms, aiming to function as an algorithm library that integrates tutorial and tool capabilities. This design choice prioritizes readability and extensibility, albeit at the expense of inheritance and code simplicity. Unlike modular frameworks, users can grasp the essence of an algorithm without having to navigate the entire library.
  • More information: We provide rich data-visualization methods. Reinforcement learning algorithms typically involve a huge number of parameters. To better understand how each parameter changes during training, we use log files, TensorBoard, and wandb to visualize them. We believe this will help developers tune each algorithm more efficiently.
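
As a small illustration of this argparse-style interface, the sketch below mirrors a few of the flags listed in the single-agent argument table later in this README; it is a simplified example, not the exact parser used in every algorithm file.

import argparse

def parse_args():
    # Sketch of a SafePO-style command-line parser; the real algorithm files
    # define many more flags (see the argument table under "Getting Started").
    parser = argparse.ArgumentParser(description="SafePO-style training arguments")
    parser.add_argument("--env-id", type=str, default="SafetyPointGoal1-v0",
                        help="the id of the environment")
    parser.add_argument("--seed", type=int, default=0,
                        help="the random seed of the experiment")
    parser.add_argument("--device", type=str, default="cpu",
                        help="the device (cpu or cuda) to run the code")
    parser.add_argument("--total-steps", type=int, default=10_000_000,
                        help="total timesteps of the experiments")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(args)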

Overview of Algorithms

Here we provide a table of Safe RL algorithms that the benchmark includes.

Algorithm | Proceedings & Cites | Official Code Repo
PPO-Lag | — | TensorFlow 1
TRPO-Lag | — | TensorFlow 1
CUP | NeurIPS 2022 (Cite: 6) | PyTorch
FOCOPS | NeurIPS 2020 (Cite: 27) | PyTorch
CPO | ICML 2017 (Cite: 663) | —
PCPO | ICLR 2020 (Cite: 67) | Theano
RCPO | ICLR 2019 (Cite: 238) | —
CPPO-PID | NeurIPS 2020 (Cite: 71) | PyTorch
MACPO | Preprint (Cite: 4) | PyTorch
MAPPO-Lag | Preprint (Cite: 4) | PyTorch
HAPPO (purely reward optimisation) | ICLR 2022 (Cite: 10) | PyTorch
MAPPO (purely reward optimisation) | Preprint (Cite: 98) | PyTorch

Supported Environments

Safety-Gymnasium

Here is a list of all the environments that Safety-Gymnasium supports for now; some are being tested in our baselines, and we will gradually release them in later updates. For more details, please refer to Safety-Gymnasium.

Category | Task | Agent | Example
Safe Navigation | Goal[012] | Point, Car, Doggo, Racecar, Ant | SafetyPointGoal1-v0
Safe Navigation | Button[012] | Point, Car, Doggo, Racecar, Ant |
Safe Navigation | Push[012] | Point, Car, Doggo, Racecar, Ant |
Safe Navigation | Circle[012] | Point, Car, Doggo, Racecar, Ant |
Velocity | Velocity | HalfCheetah, Hopper, Swimmer, Walker2d, Ant, Humanoid | SafetyAntVelocity-v1

Note: Safe velocity tasks support both single-agent and multi-agent algorithms, while safe navigation tasks currently support single-agent algorithms only.
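
For reference, interacting with one of these tasks typically looks like the snippet below. This is a sketch assuming the safety_gymnasium package is installed; note that, unlike standard Gymnasium environments, step() also returns a cost signal.

import safety_gymnasium

# Create a safe navigation task from the table above.
env = safety_gymnasium.make("SafetyPointGoal1-v0")
obs, info = env.reset(seed=0)

for _ in range(10):
    action = env.action_space.sample()
    # Safety-Gymnasium returns a separate cost signal alongside the reward.
    obs, reward, cost, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, info = env.reset()
env.close()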

Safe-Dexterous-Hands

Note: These tasks currently support multi-agent algorithms only.

Prerequisites

SafePO uses Anaconda to create virtual environments. To install Anaconda, follow the instructions here.

Ensure that Isaac Gym works on your system by running one of the examples from the python/examples directory, like joint_monkey.py. Please follow troubleshooting steps described in the Isaac Gym Preview Release 3/4 install instructions if you have any trouble running the samples.

Selected Tasks

Base Environment | Description
ShadowHandOver | These environments involve two fixed-position hands. The hand that starts with the object must find a way to hand it over to the second hand.
ShadowHandCatchOver2Underarm | This environment is made up of half ShadowHandCatchUnderarm and half ShadowHandCatchOverarm; the object needs to be thrown from the vertical hand to the palm-up hand.

We implement several different constraints on top of the base environments, expanding the setting to both single-agent and multi-agent training.

What's More

Our team has also designed a number of more interesting safety tasks for two-handed dexterous manipulation, and we will soon release the code for these tasks for use by more Safe RL researchers.

Base Environment | Description
ShadowHandOverWall | None
ShadowHandOverWallDown | None
ShadowHandCatchOver2UnderarmWall | None
ShadowHandCatchOver2UnderarmWallDown | None

Pre-requisites

To use SafePO-Baselines, you need to install the environments. Please refer to Mujoco and Safety-Gymnasium for more details on installation. Details regarding the installation of IsaacGym can be found here. We currently support the Preview Release 3 version of IsaacGym.

Conda-Environment

conda create -n safe python=3.8
conda activate safe
# Because of CUDA version differences, we recommend installing PyTorch manually.
pip install -e .
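
After installing PyTorch manually, a quick sanity check (not part of SafePO itself) confirms that the chosen build matches your CUDA setup:

import torch

# Print the manually installed PyTorch version and whether CUDA is usable.
print(torch.__version__)
print(torch.cuda.is_available())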

For detailed instructions, please refer to Installation.md.

Getting Started

Single-Agent

Each algorithm file is the entry point. Running ALGO.py with arguments specifying the algorithm and the environment starts training. For example, to run PPO-Lag in SafetyPointGoal1-v0 with seed 0, you can use the following command:

cd safepo/single_agent
python ppo_lag.py --env-id SafetyPointGoal1-v0 --seed 0

To run a benchmark in parallel, you can, for example, use the following command to run PPO-Lag and TRPO-Lag in SafetyAntVelocity-v1 and SafetyHalfCheetahVelocity-v1:

cd safepo/single_agent
python benchmark.py --env-id SafetyAntVelocity-v1 SafetyHalfCheetahVelocity-v1 --algo ppo_lag trpo_lag --workers 2

The command above will run two processes in parallel; each process runs one algorithm in one environment. The results will be saved in ./runs/.
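
Conceptually, the benchmark script launches one training process per (algorithm, environment) pair and caps the number of concurrent runs with --workers. The sketch below is purely illustrative of that idea and is not the actual benchmark.py:

import itertools
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_one(algo, env_id):
    # Each worker launches a single training run, mirroring the manual command above.
    subprocess.run(["python", f"{algo}.py", "--env-id", env_id, "--seed", "0"], check=True)

if __name__ == "__main__":
    algos = ["ppo_lag", "trpo_lag"]
    envs = ["SafetyAntVelocity-v1", "SafetyHalfCheetahVelocity-v1"]
    with ThreadPoolExecutor(max_workers=2) as pool:  # corresponds to --workers 2
        futures = [pool.submit(run_one, a, e) for a, e in itertools.product(algos, envs)]
        for f in futures:
            f.result()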

Here we provide the list of arguments:

Argument | Default | Info
--seed | 0 | the random seed of the experiment
--device | cpu | the device (cpu or cuda) to run the code
--torch-threads | 4 | number of threads for torch
--total-steps | 10000000 | total timesteps of the experiments
--env-id | SafetyPointGoal1-v0 | the id of the environment
--use-eval | False | toggles evaluation
--eval-episodes | 1 | the number of episodes for final evaluation
--steps-per-epoch | 20000 | the number of steps to run in each environment per rollout
--update-iters | 10 | the maximum number of iterations to update the policy
--batch-size | 64/128 | the number of mini-batches
--entropy-coef | 0.0 | coefficient of the entropy
--target-kl | 0.01/0.02 | the target KL divergence threshold
--max-grad-norm | 40.0 | the maximum norm for gradient clipping
--critic-norm-coef | 0.001 | the critic norm coefficient
--gamma | 0.99 | the discount factor gamma
--lam | 0.95 | the lambda for reward generalized advantage estimation
--lam-c | 0.95 | the lambda for cost generalized advantage estimation
--standardized-adv-r | True | toggles reward advantage standardization
--standardized-adv-c | True | toggles cost advantage standardization
--critic-lr | 1e-3 | the learning rate of the critic network
--actor-lr | 3e-4/None | the learning rate of the actor network
--log-dir | ../runs | directory to save agent logs
--write-terminal | True | toggles terminal logging
--fvp-sample-freq | 1 | the sub-sampling rate of the observations
--cg-damping | 0.1 | the damping value for conjugate gradient
--cg-iters | 15 | the number of conjugate gradient iterations
--backtrack-iters | 15 | the number of backtracking line search iterations
--backtrack-coef | 0.8 | the coefficient for backtracking line search
--safety-bound | 25.0 | the cost limit for the safety constraint

Note: Some hyper-parameters vary across algorithms. For more details, please refer to the corresponding code files.
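
Several of the arguments above (--gamma, --lam, and --lam-c) control generalized advantage estimation, which is applied to both the reward signal and the cost signal. The following is a minimal sketch of that computation for a single non-terminating rollout segment, not SafePO's exact implementation:

import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    # Generalized advantage estimation over one rollout segment.
    # The same routine is applied to costs, with --lam-c taking the place of --lam.
    values = np.append(values, last_value)
    advantages = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Example: reward advantages with --gamma 0.99 and --lam 0.95.
adv_r = gae(np.array([1.0, 0.5, 0.0]), np.array([0.4, 0.3, 0.2]), last_value=0.1)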

Multi-Agent

We also provide a safe MARL algorithm benchmark for safe MARL research on the challenging Safety DexterousHands and Safety-Gymnasium multi-agent velocity tasks. HAPPO, IPPO, MACPO, MAPPO-Lag, and MAPPO have already been implemented.

Safety DexterousHands

safepo/multi_agent/train_hand.py is the entry point. Running train_hand.py with arguments specifying the algorithm and the task starts training. For example, you can use the following command:

cd safepo/multi_agent
# algo: macpo, mappolag, mappo, happo
python train_hand.py --task=ShadowHandOver --algo=macpo

Safety-Gymnasium Multi-agent Velocity

Note: This task is still under development. We will release the code as soon as possible.

safepo/multi_agent/train_vel.py is the entry point. Running train_vel.py with arguments specifying the algorithm and the task starts training. For example, you can use the following command to run MACPO in Safety2x4AntVelocity-v0 with default arguments:

cd safepo/multi_agent
# algo: macpo, mappolag, mappo, ippo, happo
python train_vel.py --task=Safety2x4AntVelocity-v0 --algo=macpo

The SafePO multi-agent algorithms share almost all hyperparameters between the Safety DexterousHands and Safety-Gymnasium multi-agent velocity tasks. However, a few hyperparameters differ, as listed below:

Argument | Info | Default (Safety DexterousHands) | Default (Safety-Gymnasium multi-agent velocity)
--episode-length | episode length | 8 | 200
--num-env-steps | the number of total steps | 100000000 | 10000000
--n-rollout-threads | the number of episodes to run | 80 | 32
--hidden-size | the size of the hidden layers of the neural network | 512 | 128
--entropy-coef | the coefficient of entropy | 0.00 | 0.01
--use-value-active-masks | whether to use value active masks | False | True
--use-policy-active-masks | whether to use policy active masks | False | True

Multi-Agent Benchmark

To run a benchmark in parallel, you can, for example, use the following command to run MACPO and MAPPO in Safety2x4AntVelocity-v0 and Safety6x1HalfCheetahVelocity-v0:

cd safepo/multi_agent
# algo: macpo, mappo
python velocity_benchmark.py --algo macpo mappo --tasks Safety2x4AntVelocity-v0 Safety6x1HalfCheetahVelocity-v0 --workers 1 --exp-name benchmark

After running the benchmark, you can use the following command to plot the results:

cd safepo/multi_agent
python plot.py --logdir ./runs/benchmark

To get the evaluation results, you can use the following command:

cd safepo/multi_agent
python eval.py

Note: To run an evaluation, you need to modify the eval.py file and specify the basedir. For example:

basedir = './runs/benchmark'

Machine Configuration

We test all algorithms and experiments on a machine with an AMD Ryzen Threadripper PRO 3975WX 32-core CPU and an NVIDIA GeForce RTX 3090 GPU (driver version 495.44).

Ethical and Responsible Use

SafePO aims to benefit research in the safe RL community and is released under the Apache-2.0 license. Illegal usage or any violation of the license is not allowed.

PKU-Alignment Team

The baseline is a project contributed by the PKU-Alignment team at Peking University. We also thank the contributors of the following open-source repositories: Spinning Up, Bullet-Safety-Gym, and Safety-Gym.