• Stars: 503
  • Rank: 87,705 (Top 2%)
  • Language: Python
  • License: MIT License
  • Created: over 7 years ago
  • Updated: over 1 year ago

Repository Details

Implementations of Reinforcement Learning and Planning algorithms

rl-agents

A collection of Reinforcement Learning agents

Installation

pip install --user git+https://github.com/eleurent/rl-agents

Usage

Most experiments can be started by moving to the scripts directory (cd scripts) and running python experiments.py

Usage:
  experiments evaluate <environment> <agent> (--train|--test)
                                             [--episodes <count>]
                                             [--seed <str>]
                                             [--analyze]
  experiments benchmark <benchmark> (--train|--test)
                                    [--processes <count>]
                                    [--episodes <count>]
                                    [--seed <str>]
  experiments -h | --help

Options:
  -h --help            Show this screen.
  --analyze            Automatically analyze the experiment results.
  --episodes <count>   Number of episodes [default: 5].
  --processes <count>  Number of running processes [default: 4].
  --seed <str>         Seed the environments and agents.
  --train              Train the agent.
  --test               Test the agent.

The evaluate command evaluates a given agent on a given environment. For instance,

# Train a DQN agent on the CartPole-v0 environment
$ python3 experiments.py evaluate configs/CartPoleEnv/env.json configs/CartPoleEnv/DQNAgent.json --train --episodes=200

Every agent interacts with the environment following a standard interface:

action = agent.act(state)
next_state, reward, done, info = env.step(action)
agent.record(state, action, reward, next_state, done, info)
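
For illustration only, a complete episode built on this interface might look as follows; the make_agent factory is hypothetical and stands in for loading one of the agent configurations described below.

import gym

env = gym.make("CartPole-v0")     # any registered gym environment
agent = make_agent(env)           # hypothetical factory standing in for the agent loading step

state = env.reset()
done, total_reward = False, 0.0
while not done:
    action = agent.act(state)                                      # the agent picks an action
    next_state, reward, done, info = env.step(action)              # the environment transitions
    agent.record(state, action, reward, next_state, done, info)    # the agent learns from the transition
    state = next_state
    total_reward += reward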

The environments are described by their gym id and the module to import for registration.

{
    "id": "CartPole-v0",
    "import_module": "gym"
}
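
As an illustrative sketch (not the repository's own loading code), such a file could be turned into an environment by importing the registration module and calling gym.make:

import importlib
import json

import gym

with open("configs/CartPoleEnv/env.json") as f:        # path taken from the example above
    env_config = json.load(f)

importlib.import_module(env_config["import_module"])   # makes sure the environment id is registered
env = gym.make(env_config["id"])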

And the agents are described by their class and a configuration dictionary.

{
    "__class__": "<class 'rl_agents.agents.deep_q_network.pytorch.DQNAgent'>",
    "model": {
        "type": "MultiLayerPerceptron",
        "layers": [512, 512]
    },
    "gamma": 0.99,
    "n_steps": 1,
    "batch_size": 32,
    "memory_capacity": 50000,
    "target_update": 1,
    "exploration": {
        "method": "EpsilonGreedy",
        "tau": 50000,
        "temperature": 1.0,
        "final_temperature": 0.1
    }
}

If keys are missing from these configurations, values in agent.default_config() will be used instead.
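
The merging itself is handled inside the agent classes; as a minimal illustration of the idea (not the repository's actual code), missing keys can be filled in with a recursive dictionary update:

def with_defaults(config, defaults):
    # Illustrative helper: complete `config` with the values of `defaults`, recursively.
    merged = dict(defaults)
    for key, value in config.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged

# e.g. with_defaults({"gamma": 0.99}, agent.default_config()) keeps every other default value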

Finally, a batch of experiments can be scheduled in a benchmark. All experiments are then executed in parallel on several processes.

# Run a benchmark of several agents interacting with environments
$ python3 experiments.py benchmark cartpole_benchmark.json --test --processes=4

A benchmark configuration file contains a list of environment configurations and a list of agent configurations.

{
    "environments": ["envs/cartpole.json"],
    "agents": ["agents/dqn.json", "agents/mcts.json"]
}
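
Conceptually, such a benchmark amounts to evaluating every (environment, agent) pair as an independent experiment; here is a rough sketch of that idea with the standard multiprocessing module (not the repository's actual scheduler, and run_experiment is a placeholder):

import itertools
import json
from multiprocessing import Pool

def run_experiment(pair):
    env_config, agent_config = pair
    # Placeholder worker: evaluate one agent on one environment, as `experiments.py evaluate` does.
    print("Evaluating", agent_config, "on", env_config)

if __name__ == "__main__":
    with open("cartpole_benchmark.json") as f:
        benchmark = json.load(f)
    pairs = list(itertools.product(benchmark["environments"], benchmark["agents"]))
    with Pool(processes=4) as pool:
        pool.map(run_experiment, pairs)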

Monitoring

Several tools are available to monitor agent performance:

  • Run metadata: for the sake of reproducibility, the environment and agent configurations used for the run are merged and saved to a metadata.*.json file.
  • Gym Monitor: the main statistics (episode rewards, lengths, seeds) of each run are logged to an episode_batch.*.stats.json file. They can be automatically visualised by running scripts/analyze.py
  • Logging: agents can send messages through the standard Python logging library. By default, all messages with log level INFO are saved to a logging.*.log file. Run scripts/experiments.py with the --verbose option to also save messages with log level DEBUG.
  • Tensorboard: by default, a tensorboard writer records useful scalars, images and model graphs to the run directory (see the sketch below). It can be visualized by running: tensorboard --logdir <path-to-runs-dir>
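
For reference, scalars of this kind can be written with the standard PyTorch SummaryWriter; a minimal sketch, where the run directory and the recorded values are only examples:

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="out/CartPole-v0/DQNAgent/run_1")   # example run directory
for episode, episode_reward in enumerate([10.0, 25.0, 200.0]):     # dummy episode returns
    writer.add_scalar("episode/total_reward", episode_reward, episode)
writer.close()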

Agents

The following agents are currently implemented:

Planning

VI Value Iteration

Performs Value Iteration to compute the state-action value function, and acts greedily with respect to it.

Only compatible with finite-mdp environments, or environments that handle an env.to_finite_mdp() conversion method.

Reference: Dynamic Programming, Bellman R., Princeton University Press (1957).
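
As a reminder of the algorithm, here is an illustrative sketch (not the agent's actual implementation) of value iteration on a finite MDP given as transition and reward arrays:

import numpy as np

def value_iteration(transition, reward, gamma=0.9, iterations=100):
    # transition: array of shape (S, A, S) holding P(s' | s, a); reward: array of shape (S, A).
    n_states = transition.shape[0]
    value = np.zeros(n_states)
    for _ in range(iterations):
        q_values = reward + gamma * transition @ value    # Bellman optimality backup, shape (S, A)
        value = q_values.max(axis=1)
    return q_values                                       # act greedily: q_values[state].argmax()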

CEM Cross-Entropy Method

A sampling-based planning algorithm, in which sequences of actions are drawn from a Gaussian prior distribution. This distribution is iteratively bootstrapped by minimizing its cross-entropy to a target distribution approximated by the top-k candidates.

Only compatible with continuous action spaces. The environment is used as an oracle dynamics and reward model.

Reference: A Tutorial on the Cross-Entropy Method, De Boer P-T., Kroese D.P, Mannor S. and Rubinstein R.Y. (2005).
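
A compact sketch of the idea, assuming the environment can be deep-copied to serve as the dynamics and reward oracle (illustrative only, with arbitrary hyperparameters):

import copy
import numpy as np

def cem_plan(env, horizon=10, iterations=5, population=100, top_k=10):
    # Plan a sequence of continuous actions with the cross-entropy method.
    action_dim = env.action_space.shape[0]
    mean, std = np.zeros((horizon, action_dim)), np.ones((horizon, action_dim))
    for _ in range(iterations):
        candidates = mean + std * np.random.randn(population, horizon, action_dim)
        returns = []
        for actions in candidates:
            oracle, total = copy.deepcopy(env), 0.0        # a copy of the env acts as the model
            for action in actions:
                _, reward, done, _ = oracle.step(action)
                total += reward
                if done:
                    break
            returns.append(total)
        elite = candidates[np.argsort(returns)[-top_k:]]   # keep the top-k action sequences
        mean, std = elite.mean(axis=0), elite.std(axis=0)  # refit the sampling distribution
    return mean[0]                                         # execute the first planned action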

MCTS Monte-Carlo Tree Search

A world transition model is leveraged for trajectory search. A look-ahead tree is expanded so as to explore the trajectory space and quickly focus around the most promising moves.

References:

UCT Upper Confidence bounds applied to Trees

The tree is traversed by iteratively applying an optimistic selection rule at each depth, and the value at the leaves is estimated by sampling. Empirical evidence shows that this popular algorithm performs well in many applications, but it has been proven theoretically to achieve much worse (doubly-exponential) performance than uniform planning on some problems.

References:
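
The optimistic rule used at each depth is typically a UCB1-style bound; a minimal sketch of such a rule (illustrative only, assuming each child node stores its visit count and mean value and has been visited at least once):

import math

def ucb_select(children, exploration=1.41):
    # children: dict mapping actions to {"value": mean return, "visits": visit count}.
    total_visits = sum(child["visits"] for child in children.values())
    def ucb(child):
        # Mean value plus an exploration bonus that shrinks as the child gets visited more often.
        return child["value"] + exploration * math.sqrt(math.log(total_visits) / child["visits"])
    return max(children, key=lambda action: ucb(children[action]))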

OPD Optimistic Planning for Deterministic systems

This algorithm is tailored for systems with deterministic dynamics and rewards. It exploits the reward structure to achieve a polynomial rate on regret, and behaves efficiently in numerical experiments with dense rewards.

Reference: Optimistic Planning for Deterministic Systems, Hren J., Munos R. (2008).

OLOP Open Loop Optimistic Planning

References:

Trailblazer

Reference: Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning, Grill J. B., Valko M., Munos R. (2017).

PlaTγPOOS

Reference: Scale-free adaptive planning for deterministic dynamics & discounted rewards, Bartlett P., Gabillon V., Healey J., Valko M. (2019).

Safe planning

RVI Robust Value Iteration

A list of possible finite-mdp models is provided in the agent configuration. The MDP ambiguity set is constrained to be rectangular: different models can be selected at every transition. The corresponding robust state-action value is computed so as to maximize the worst-case total reward.

References:
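
As an illustrative sketch (not the agent's code), the robust backup takes a worst case over the candidate models independently at every state-action pair, which is exactly what rectangularity permits:

import numpy as np

def robust_value_iteration(transitions, rewards, gamma=0.9, iterations=100):
    # transitions: list of (S, A, S) arrays, rewards: list of (S, A) arrays, one pair per candidate model.
    value = np.zeros(transitions[0].shape[0])
    for _ in range(iterations):
        # Worst-case Bellman backup: minimum over models at every state-action pair.
        q_values = np.min([r + gamma * p @ value for p, r in zip(transitions, rewards)], axis=0)
        value = q_values.max(axis=1)
    return q_values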

DROP Discrete Robust Optimistic Planning

The MDP ambiguity set is assumed to be finite, and is constructed from a list of modifiers to the true environment. The corresponding robust value is approximately computed by Deterministic Optimistic Planning so as to maximize the worst-case total reward.

References:

IRP Interval-based Robust Planning

We assume that the MDP is a parametrized dynamical system, whose parameter is uncertain and lies in a continuous ambiguity set. We use interval prediction to compute the set of states that can be reached at any time t, given that uncertainty, and leverage it to evaluate and improve a robust policy.

If the system is Linear Parameter-Varying (LPV) with polytopic uncertainty, a fast and stable interval predictor can be designed. Otherwise, sampling-based approaches can be used instead, at an increased computational cost.

References:

Value-based

DQN Deep Q-Network

A neural-network model is used to estimate the state-action value function and produce a greedy optimal policy.

Implemented variants:

  • Double DQN
  • Dueling architecture
  • N-step targets

References:
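
As an illustration of the value target behind this family of agents (a sketch only, in PyTorch like the DQNAgent configured above):

import torch

def dqn_targets(rewards, next_states, dones, target_net, gamma=0.99):
    # One-step targets: y = r + gamma * max_a Q_target(s', a), cut off at terminal states.
    with torch.no_grad():
        next_values = target_net(next_states).max(dim=1).values
    return rewards + gamma * next_values * (1 - dones.float())

The Double DQN variant instead selects the greedy action with the online network and evaluates it with the target network, and the N-step variant accumulates N discounted rewards before bootstrapping.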

FTQ Fitted-Q

A Q-function model is trained by performing each step of Value Iteration as a supervised learning procedure applied to a batch of transitions covering most of the state-action space.

Reference: Tree-Based Batch Mode Reinforcement Learning, Ernst D. et al (2005).
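
A sketch of the procedure on a batch of transitions (illustrative only; any regressor with fit/predict could play the role of the Q-function model, e.g. an extremely randomized trees regressor in the spirit of the reference above):

import numpy as np

def fitted_q_iteration(regressor, batch, n_actions, gamma=0.99, iterations=50):
    # batch: list of (state, action, reward, next_state, done) transitions covering the state-action space.
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    inputs = np.concatenate([states, actions.reshape(-1, 1)], axis=1)      # encode the action as a feature
    targets = rewards.astype(float)
    for _ in range(iterations):
        # One step of Value Iteration cast as supervised regression on the current targets.
        regressor.fit(inputs, targets)
        next_q = np.stack([regressor.predict(np.concatenate([next_states, np.full((len(batch), 1), a)], axis=1))
                           for a in range(n_actions)], axis=1)
        targets = rewards + gamma * next_q.max(axis=1) * (1 - dones)
    return regressor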

Safe Value-based

BFTQ Budgeted Fitted-Q

An adaptation of FTQ in the budgeted setting: we maximise the expected reward r of a policy π under the constraint that an expected cost c remains under a given budget β. The policy π(a | s, β) is conditioned on this cost budget β, which can be changed online.

To that end, the Q-function model is trained to predict both the expected reward Qr and the expected cost Qc of the optimal constrained policy π.

This agent can only be used with environments that provide a cost signal in their info field:

>>> obs, reward, done, info = env.step(action)
>>> info
{'cost': 1.0}

Reference: Budgeted Reinforcement Learning in Continuous State Space, Carrara N., Leurent E., Laroche R., Urvoy T., Maillard O-A., Pietquin O. (2019).

Citing

If you use this project in your work, please consider citing it with:

@misc{rl-agents,
  author = {Leurent, Edouard},
  title = {rl-agents: Implementations of Reinforcement Learning algorithms},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/eleurent/rl-agents}},
}

More Repositories

1. phd-bibliography: References on Optimal Control, Reinforcement Learning and Motion Planning (852 stars)
2. twitter-graph: Fetch and visualize the graph of your Twitter friends and followers. (Python, 362 stars)
3. KestrelFPV: Quadcopter racing simulator made with Unity3D (C#, 50 stars)
4. phd-defense (JavaScript, 42 stars)
5. obstacle-env: An environment for an obstacle avoidance task (Python, 33 stars)
6. make-lstm-great-again: Donald Trump's tweets generator (Python, 28 stars)
7. social-attention: Social Attention for Autonomous Decision-Making in Dense Traffic (TeX, 19 stars)
8. python-good-practices: Useful tools and practices for Python development (18 stars)
9. phd-thesis: My PhD thesis. I defended on the 30th of October, 2020! See https://github.com/eleurent/phd-defense/ (TeX, 13 stars)
10. inria-beamer: An INRIA beamer template (TeX, 11 stars)
11. robust-control: Approximate Robust Control of Uncertain Dynamical Systems (TeX, 10 stars)
12. spelling: Naive Bayes classifier for language detection and spelling correction (Java, 9 stars)
13. robust-beyond-quadratic: (NeurIPS 2020) (TeX, 8 stars)
14. monte-carlo-graph-search: (ACML 2020) (TeX, 8 stars)
15. sisyphe: Memorize poetry by hiding words progressively (HTML, 5 stars)
16. who-s-on-my-gpu: Crossmatch GPUs, processes and users (Python, 5 stars)
17. melodic-dictation: Automated melodic dictation (MATLAB, 4 stars)
18. interval-prediction: (CDC 2019) (TeX, 3 stars)
19. should-antoine-go-skying: Well, should he? (Jupyter Notebook, 3 stars)
20. planning-gap-complexity: (NeurIPS 2020) (Python, 2 stars)
21. prescription: Calculator for the statute of limitations on public prosecution of rape and sexual offences (JavaScript, 2 stars)
22. Arte-Plus7-Downloader: Download videos from Arte+7 (JavaScript, 2 stars)
23. kl-olop: Practical Open-Loop Optimistic Planning (PostScript, 1 star)
24. IGNite: IGN, in the end. (Python, 1 star)
25. hacktamine: A coding battle webapp, powered by Django. Inspired by the movie The Social Network. (JavaScript, 1 star)
26. finite-mdp: Gym environment for MDPs with finite state and action spaces (Python, 1 star)
27. latexdiff-workflow: A workflow to show the diff between two versions of a LaTeX document (TeX, 1 star)
28. MakerFaire2014: Psittacidae team source code for the 2014 Mission on Mars Robot Challenge (MATLAB, 1 star)