  • Stars: 887
  • Rank 51,456 (Top 2 %)
  • Language: Jupyter Notebook
  • License: Apache License 2.0
  • Created over 5 years ago
  • Updated over 1 year ago

Repository Details

PyTorch implementation of Soft Actor-Critic (SAC), Twin Delayed DDPG (TD3), Actor-Critic (AC/A2C), Proximal Policy Optimization (PPO), QT-Opt, PointNet, etc.

Popular Model-free Reinforcement Learning Algorithms

PyTorch and TensorFlow 2.0 implementations of state-of-the-art model-free reinforcement learning algorithms on both OpenAI Gym environments and a self-implemented Reacher environment.

Algorithms include:

  • Actor-Critic (AC/A2C);
  • Soft Actor-Critic (SAC);
  • Deep Deterministic Policy Gradient (DDPG);
  • Twin Delayed DDPG (TD3);
  • Proximal Policy Optimization (PPO);
  • QT-Opt (including Cross-entropy (CE) Method);
  • PointNet;
  • Transporter;
  • Recurrent Policy Gradient;
  • Soft Decision Tree;
  • Probabilistic Mixture-of-Experts;
  • QMIX
  • etc.

Please note that this repo is a personal collection of algorithms I implemented and tested during my research and study, rather than an official open-source library or package. I think it could still be helpful to share it with others, and I welcome discussion of my implementations. I did not spend much time cleaning or structuring the code. You may notice several versions of the implementation for each algorithm; I intentionally show all of them here for you to refer to and compare. Also, this repo contains only PyTorch implementations.

For official libraries of RL algorithms, I provide the following two, built with TensorFlow 2.0 + TensorLayer 2.0:

  • RL Tutorial (Status: Released) contains RL algorithms implementation as tutorials with simple structures.

  • RLzoo (Status: Released) is a baseline implementation with high-level API supporting a variety of popular environments, with more hierarchical structures for simple usage.

For multi-agent RL, a new repository is built (PyTorch):

  • MARS (Status: WIP) is a library for multi-agent RL on games, like PettingZoo Atari, SlimeVolleyBall, etc.

Since TensorFlow 2.0 has adopted dynamic graph construction in place of static graphs, transferring RL code between TensorFlow and PyTorch is straightforward.

Usage:

python ***.py --train

python ***.py --test
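
A minimal sketch of how a script might expose these two switches, assuming they are plain boolean flags parsed with argparse. The function name build_parser and the help strings are hypothetical and not taken from the repo:

```python
import argparse


def build_parser():
    """Build a parser with the two mode switches shown in the usage lines."""
    parser = argparse.ArgumentParser(description="Train or test an RL agent.")
    parser.add_argument("--train", action="store_true", help="run the training loop")
    parser.add_argument("--test", action="store_true", help="evaluate a saved policy")
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--train"])
if args.train:
    print("training mode selected")
if args.test:
    print("testing mode selected")
```

With store_true flags, a switch that is not passed simply defaults to False, so a script can check either mode without extra validation.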

Troubleshooting:

If you encounter a "NotImplementedError", it may be due to the wrong gym version. The newest gym==0.14 won't work. Install gym==0.7 or gym==0.10 with pip install -r requirements.txt.

Undervalued Tricks:

As is well known, empirical RL implementations rely on various tricks to support performance in practice, including hyper-parameters, normalization, network architecture, or even the hidden activation function. I summarize here some that I met with the programs in this repo:

  • Environment specific:

    • For the Pendulum-v0 environment in Gym, a reward pre-processing of (r+8)/8 usually improves learning efficiency. This environment also needs a maximum episode length of at least 150 steps to learn well; episodes that are too short make it hard to learn.
    • The MountainCar-v0 environment in Gym has a very sparse reward (given only when reaching the flag), so learning curves will generally be noisy; some environment-specific processing may therefore also be needed.
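
As a rough illustration of the Pendulum-v0 points above, the sketch below applies the (r+8)/8 scaling while collecting one episode capped at 150 steps. The helper names are hypothetical, and the loop assumes the older Gym step API that returns (obs, reward, done, info):

```python
def scale_pendulum_reward(reward):
    """Shift and scale the raw reward with (r + 8) / 8.

    Pendulum's raw reward is roughly in [-16.3, 0], so this maps it to
    roughly [-1, 1].
    """
    return (reward + 8.0) / 8.0


def collect_episode(env, policy, max_steps=150):
    """Roll out one episode and store transitions with scaled rewards.

    Assumes env.reset() returns an observation and env.step(action)
    returns (obs, reward, done, info), as in older Gym versions.
    """
    transitions = []
    obs = env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        next_obs, reward, done, _ = env.step(action)
        transitions.append(
            (obs, action, scale_pendulum_reward(reward), next_obs, done)
        )
        obs = next_obs
        if done:
            break
    return transitions
```

Only the stored reward is scaled here; the environment itself is left untouched, so evaluation can still report the raw return.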
  • Normalization:

    • Reward normalization or advantage normalization within a batch can sometimes greatly improve performance (learning efficiency, stability), although in theory on-policy algorithms like PPO should not apply data normalization during training because of distribution shift. Looking at this problem in depth, we should treat two cases differently: (1) normalizing the direct input data such as observation, action, reward, etc.; (2) normalizing the estimates of values (state value, state-action value, advantage, etc.). For (1), a more reasonable approach is to keep a moving average of the previous mean and standard deviation, which achieves a similar effect to normalizing over the full dataset during RL agent learning (normalizing over the full dataset itself is not possible, since in RL the data comes from the interaction of agents and environments). For (2), we can simply normalize the value estimates directly (rather than keeping a historical average), since we do not want the estimated values to have distribution shift, so we treat them as a static distribution.
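
The two cases can be sketched as follows. This is a minimal scalar, pure-Python illustration under stated assumptions: case (1) is shown with cumulative running statistics (Welford's update) rather than an exponential moving average, and the names RunningMeanStd and normalize_batch are hypothetical rather than taken from the repo:

```python
import math


class RunningMeanStd:
    """Case (1): track mean/std of inputs across all data seen so far."""

    def __init__(self, eps=1e-8):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
        self.eps = eps

    def update(self, x):
        # Welford's online update for mean and variance.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        if self.count < 2:
            return 1.0  # not enough data yet; leave inputs unscaled
        return math.sqrt(self.m2 / self.count)

    def normalize(self, x):
        return (x - self.mean) / (self.std + self.eps)


def normalize_batch(values, eps=1e-8):
    """Case (2): normalize value or advantage estimates within one batch."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / (math.sqrt(var) + eps) for v in values]
```

In practice the running statistics would be updated on every observation (per dimension) as it arrives, while normalize_batch would be applied once per update to the advantages of the current batch.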
  • Multiprocessing:

    • Is the multiprocessing update based on torch.multiprocessing the right/safe way to parallelize the code? The official example (Hogwild) uses torch.multiprocessing without any explicit locks, which means it can be unsafe when multiple processes generate gradients and update the shared model at the same time. See the related discussions and some tests and answers for more. In general, the drawback of unsafe updates may be outweighed by the speed-up from multiprocessing (RL training itself also has huge variance and noise).

    • Although I provide multiprocessing versions of several algorithms (SAC, PPO, etc.), for small-scale environments in Gym this is usually unnecessary or even inefficient. A vectorized environment wrapper for parallel environment sampling may be a more suitable solution for these environments, since the bottleneck in learning efficiency lies mainly in the interaction with environments rather than in the model learning (back-propagation) process.
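
To make the vectorized-environment idea concrete, here is a minimal sketch of a wrapper that steps several environment copies in lockstep. It runs them sequentially in one process to keep the example short; real vectorized wrappers typically place each copy in its own worker process. The class name SimpleVectorEnv is hypothetical, and the older Gym step API returning (obs, reward, done, info) is assumed:

```python
class SimpleVectorEnv:
    """Batch several environments behind a single reset/step interface."""

    def __init__(self, env_fns):
        # env_fns is a list of zero-argument callables, one per environment.
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        observations, rewards, dones = [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done, _ = env.step(action)
            if done:
                # Auto-reset so the batch keeps a fixed size between steps.
                obs = env.reset()
            observations.append(obs)
            rewards.append(reward)
            dones.append(done)
        return observations, rewards, dones
```

The policy can then act on the whole batch of observations in a single forward pass, which is where most of the benefit comes from on small environments.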

  • PPO Details:

    • Here I summarize a list of implementation details for the PPO algorithm on continuous action spaces, corresponding to the scripts ppo_gae_continuous.py, ppo_gae_continuous2.py and ppo_gae_continuous3.py.
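
Going by their file names, the PPO scripts above use generalized advantage estimation (GAE). The sketch below shows the standard GAE recursion in pure Python; it is not copied from those scripts, and the function name, argument layout and default values for gamma and lam are assumptions:

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Compute GAE advantages and value targets for one rollout.

    rewards, values and dones are per-step lists of equal length;
    last_value is the value estimate of the state after the final step.
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # Do not bootstrap or propagate across an episode boundary.
        mask = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * mask - values[t]
        gae = delta + gamma * lam * mask * gae
        advantages[t] = gae
        next_value = values[t]
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns
```

The returned advantages are typically normalized within the batch before entering the PPO loss, and the returns serve as regression targets for the value network.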

For more discussion of implementation tricks, see the relevant chapter in our book.

Performance:

  • SAC for gym Pendulum-v0:

SAC with automatically updating variable alpha for entropy:

SAC without automatically updating variable alpha for entropy:

It shows that the automatic-entropy update helps the agent to learn faster.
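
For reference, the automatic update adjusts alpha so that the policy entropy stays near a target value. The sketch below applies one gradient step to log(alpha) using the temperature loss commonly used in SAC implementations, -log_alpha * mean(log_prob + target_entropy), with the gradient written out by hand instead of using autograd. The function name and learning rate are assumptions, and the target entropy is often set to the negative action dimension as a heuristic:

```python
import math


def update_log_alpha(log_alpha, log_probs, target_entropy, lr=3e-4):
    """One gradient-descent step on the SAC temperature loss.

    Loss:      L = -log_alpha * mean(log_prob + target_entropy)
    Gradient:  dL/dlog_alpha = -(mean(log_prob) + target_entropy)
    """
    mean_log_prob = sum(log_probs) / len(log_probs)
    grad = -(mean_log_prob + target_entropy)
    return log_alpha - lr * grad


# Example with a 1-D action space: the entropy estimate is -mean(log_prob).
log_alpha = update_log_alpha(0.0, [2.0, 2.0], target_entropy=-1.0, lr=0.1)
alpha = math.exp(log_alpha)  # entropy (-2) is below target (-1), so alpha grows
```

When the policy is more random than the target, the same step lowers alpha, which reduces the weight of the entropy bonus in the actor and critic losses.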

  • TD3 for gym Pendulum-v0:

TD3 with deterministic policy:

TD3 with non-deterministic/stochastic policy:

TD3 with a deterministic policy seems to work slightly better, but the two are broadly similar.

  • AC for gym CartPole-v0:

However, vanilla AC/A2C cannot handle the continuous case like gym Pendulum-v0 well.

  • PPO for gym LunarLanderContinuous-v2:

Use ppo_continuous_multiprocess2.py.

Citation:

To cite this repository:

@misc{rlalgorithms,
  author = {Zihan Ding},
  title = {Popular-RL-Algorithms},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/quantumiracle/Popular-RL-Algorithms}},
}

Other Resources:

Deep Reinforcement Learning: Fundamentals, Research and Applications (Springer Nature, 2020) is the book I edited with Dr. Hao Dong and Dr. Shanghang Zhang, which provides wide coverage of topics in deep reinforcement learning. For details, see the book website and the Springer webpage. To cite the book:

@book{deepRL-2020,
 title={Deep Reinforcement Learning: Fundamentals, Research, and Applications},
 editor={Hao Dong, Zihan Ding, Shanghang Zhang},
 author={Hao Dong, Zihan Ding, Shanghang Zhang, Hang Yuan, Hongming Zhang, Jingqing Zhang, Yanhua Huang, Tianyang Yu, Huaqing Zhang, Ruitong Huang},
 publisher={Springer Nature},
 note={\url{http://www.deepreinforcementlearningbook.org}},
 year={2020}
}

More Repositories

  1. Reinforcement_Learning_for_Traffic_Light_Control (Python, 66 stars)
     Apply deep reinforcement learning methods including DQN, DDPG for traffic light control in simulation (discrete environment), to prove the 'Green Wave' phenomenon in intelligent traffic system.
  2. QT_Opt (Jupyter Notebook, 43 stars)
     Q-network with cross-entropy (CE) method for reinforcement learning.
  3. MARS (Jupyter Notebook, 33 stars)
     MARS is short for Multi-Agent Research Studio, a library for multi-agent reinforcement learning research.
  4. Cascading-Decision-Tree (Jupyter Notebook, 32 stars)
     Open-source code for the paper "CDT: Cascading Decision Trees for Explainable Reinforcement Learning".
  5. Benchmark-Efficient-Reinforcement-Learning-with-Demonstrations (Python, 21 stars)
     Benchmark of present methods for efficient reinforcement learning. Methods include Reptile, MAML, Residual Policy, etc. RL algorithms include DDPG, PPO.
  6. Robotic_Door_Opening_with_Tactile_Simulation (Python, 15 stars)
     Official code (simulation part) for the paper "Sim-to-Real Transfer for Robotic Manipulation with Tactile Sensory" by Zihan Ding, Ya-Yen Tsai, Wang Wei Lee, Bidan Huang, International Conference on Intelligent Robots and Systems (IROS) 2021.
  7. nash-dqn (Python, 8 stars)
     Official code of Nash-DQN, an algorithm for two-player zero-sum Markov games. For details see our paper "A Deep Reinforcement Learning Approach for Finding Non-Exploitable Strategies in Two-Player Atari Games" by Zihan Ding, Dijia Su, Qinghua Liu, Chi Jin.
  8. RL_RLBench (Python, 6 stars)
     Reinforcement Learning for RLBench.
  9. On_board_FNN_qubit_discrimination (VHDL, 4 stars)
     Single qubit state discrimination with a machine learning method (neural networks); an on-board implementation with Vivado FPGA + ARM, for fast qubit discrimination and feedback control in a real physics system.
  10. RL-with-AutoEncoder-for-Learning-from-Image-Pixels (Python, 3 stars)
  11. RoboTinder (Jupyter Notebook, 3 stars)
  12. PointNet_Landmarks_from_Image (Python, 2 stars)
  13. Meta-Learning-for-Reinforcement-Learning (Python, 2 stars)
  14. UPESI (Jupyter Notebook, 2 stars)
     Code for the paper "Not Only Domain Randomization: Universal Policy with Embedding System Identification".
  15. Store (ASP, 1 star)
  16. marl_torch (Jupyter Notebook, 1 star)
  17. Robot_Learning (ASP, 1 star)
  18. ion-trap-tomography-experiment (Python, 1 star)
  19. Glints-detection (C++, 1 star)
  20. Robot_Learning2 (Python, 1 star)
  21. Robosuite-Panda-IK (Python, 1 star)
  22. Consistency_Model_For_Reinforcement_Learning (Python, 1 star)
     Official implementation for "Consistency Models as a Rich and Efficient Policy Class for Reinforcement Learning" (ICLR'24).