Grokking Deep Reinforcement Learning
Note: At the moment, running the code from the Docker container (below) is the only supported method. Docker makes it possible to create a single environment that is far more likely to work on all systems. In short, I install and configure all packages for you, except Docker itself, and you just run the code in a tested environment.
To install Docker, I recommend a web search for "installing Docker on <your os here>". To run the code on a GPU, you also need to install nvidia-docker, which allows containers to use the host's GPUs. After you have Docker (and nvidia-docker if you are using a GPU) installed, follow the steps below.
Running the code
- Clone this repo:
git clone --depth 1 https://github.com/mimoralea/gdrl.git && cd gdrl
- Pull the gdrl image with:
docker pull mimoralea/gdrl:v0.14
- Spin up a container:
- On Mac or Linux:
docker run -it --rm -p 8888:8888 -v "$PWD"/notebooks/:/mnt/notebooks/ mimoralea/gdrl:v0.14
- On Windows:
docker run -it --rm -p 8888:8888 -v %CD%/notebooks/:/mnt/notebooks/ mimoralea/gdrl:v0.14
- NOTE: If you are using a GPU, either use nvidia-docker or add --gpus all after --rm in the command above, as in the example below.
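For example, on Linux with NVIDIA's container runtime installed (assuming a Docker version that supports the --gpus flag), the GPU-enabled command looks like this:
docker run -it --rm --gpus all -p 8888:8888 -v "$PWD"/notebooks/:/mnt/notebooks/ mimoralea/gdrl:v0.14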
- Open a browser and go to the URL shown in the terminal (likely to be: http://localhost:8888). The password is:
gdrl
About the book
Book's website
https://www.manning.com/books/grokking-deep-reinforcement-learning
Table of contents
- Introduction to deep reinforcement learning
- Mathematical foundations of reinforcement learning
- Balancing immediate and long-term goals
- Balancing the gathering and utilization of information
- Evaluating agents' behaviors
- Improving agents' behaviors
- Achieving goals more effectively and efficiently
- Introduction to value-based deep reinforcement learning
- More stable value-based methods
- Sample-efficient value-based methods
- Policy-gradient and actor-critic methods
- Advanced actor-critic methods
- Towards artificial general intelligence
Detailed table of contents
1. Introduction to deep reinforcement learning
- (Livebook)
- (No Notebook)
2. Mathematical foundations of reinforcement learning
- (Livebook)
- (Notebook)
- Implementations of several MDPs:
- Bandit Walk
- Bandit Slippery Walk
- Slippery Walk Three
- Random Walk
- Russell and Norvig's Gridworld from AIMA
- FrozenLake
- FrozenLake8x8
3. Balancing immediate and long-term goals
- (Livebook)
- (Notebook)
- Implementations of methods for finding optimal policies (see the sketch below):
- Policy Evaluation
- Policy Improvement
- Policy Iteration
- Value Iteration
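For a quick taste of chapter 3's methods, here is a minimal value-iteration sketch. It is illustrative only (not the book's code) and assumes an MDP stored as P[s][a] = [(prob, next_state, reward, done), ...], the same layout Gym exposes through env.P; the tiny three-state walk below is a made-up stand-in in the spirit of the Bandit Walk.

```python
import numpy as np

# Illustrative MDP in the env.P format: P[s][a] -> [(prob, next_state, reward, done), ...].
# A made-up three-state walk: state 1 is the only decision state; moving right pays +1.
P = {
    0: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 0, 0.0, True)]},   # left terminal
    1: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 2, 1.0, True)]},   # decision state
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},   # right terminal
}

def value_iteration(P, gamma=0.99, theta=1e-10):
    V = np.zeros(len(P))
    while True:
        Q = np.zeros((len(P), len(P[0])))
        for s in P:
            for a in P[s]:
                for prob, s_next, reward, done in P[s][a]:
                    # expected one-step return; bootstrap on V unless the transition ends the episode
                    Q[s][a] += prob * (reward + gamma * V[s_next] * (not done))
        if np.max(np.abs(V - Q.max(axis=1))) < theta:
            break
        V = Q.max(axis=1)
    pi = {s: int(a) for s, a in enumerate(np.argmax(Q, axis=1))}  # greedy policy
    return V, pi

print(value_iteration(P))
```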
4. Balancing the gathering and utilization of information
- (Livebook)
- (Notebook)
- Implementations of exploration strategies for bandit problems (see the sketch below):
- Random
- Greedy
- E-greedy
- E-greedy with linearly decaying epsilon
- E-greedy with exponentially decaying epsilon
- Optimistic initialization
- SoftMax
- Upper Confidence Bound
- Bayesian
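As a flavor of chapter 4, here is a minimal epsilon-greedy sketch on a made-up Bernoulli bandit (illustrative only; the arm probabilities and hyperparameters are arbitrary):

```python
import numpy as np

def epsilon_greedy(true_probs, episodes=1000, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros(len(true_probs))   # estimated value of each arm
    N = np.zeros(len(true_probs))   # number of pulls per arm
    for _ in range(episodes):
        if rng.random() < epsilon:
            a = rng.integers(len(true_probs))   # explore: random arm
        else:
            a = int(np.argmax(Q))               # exploit: current best arm
        reward = float(rng.random() < true_probs[a])  # Bernoulli payout
        N[a] += 1
        Q[a] += (reward - Q[a]) / N[a]          # incremental mean update
    return Q

print(epsilon_greedy([0.2, 0.5, 0.75]))  # the most-pulled arm's estimate approaches 0.75
```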
5. Evaluating agents' behaviors
- (Livebook)
- (Notebook)
- Implementation of algorithms that solve the prediction problem (policy estimation); see the sketch below:
- On-policy first-visit Monte-Carlo prediction
- On-policy every-visit Monte-Carlo prediction
- Temporal-Difference prediction (TD)
- n-step Temporal-Difference prediction (n-step TD)
- TD(λ)
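As an illustration of chapter 5's prediction methods, here is a minimal TD(0) sketch that evaluates a uniformly random policy on a made-up five-state random walk (not the book's Random Walk implementation):

```python
import numpy as np

def td_prediction(episodes=1000, alpha=0.1, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(5)                          # states 0..4; 0 and 4 are terminal
    for _ in range(episodes):
        s = 2                                # every episode starts in the middle
        while s not in (0, 4):
            s_next = s + rng.choice([-1, 1])        # uniformly random policy
            reward = 1.0 if s_next == 4 else 0.0    # only the right terminal pays
            done = s_next in (0, 4)
            # TD(0): move V(s) toward the one-step bootstrapped target
            V[s] += alpha * (reward + gamma * V[s_next] * (not done) - V[s])
            s = s_next
    return V

print(td_prediction())   # interior values approach [0.25, 0.5, 0.75]
```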
6. Improving agents' behaviors
- (Livebook)
- (Notebook)
- Implementation of algorithms that solve the control problem (policy improvement); see the sketch below:
- On-policy first-visit Monte-Carlo control
- On-policy every-visit Monte-Carlo control
- On-policy TD control: SARSA
- Off-policy TD control: Q-Learning
- Double Q-Learning
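As an illustration of chapter 6's control methods, here is a minimal Q-learning sketch on a made-up five-state corridor (illustrative only; the environment and hyperparameters are not from the book):

```python
import numpy as np

# A tiny deterministic corridor: states 0..4, actions 0 = left, 1 = right.
# States 0 and 4 are terminal; reaching state 4 pays +1.
N_S, N_A, GOAL = 5, 2, 4

def step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_S - 1)
    return s_next, float(s_next == GOAL), s_next in (0, GOAL)

def q_learning(episodes=500, alpha=0.5, gamma=0.99, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_S, N_A))
    for _ in range(episodes):
        s, done = 2, False                    # always start in the middle
        while not done:
            a = rng.integers(N_A) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = step(s, a)
            # off-policy TD target: bootstrap on the max over next-state actions
            target = r + gamma * (0.0 if done else np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q

print(q_learning())
```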
7. Achieving goals more effectively and efficiently
- (Livebook)
- (Notebook)
- Implementation of more effective and efficient reinforcement learning algorithms (see the sketch below):
- SARSA(λ) with replacing traces
- SARSA(λ) with accumulating traces
- Q(λ) with replacing traces
- Q(λ) with accumulating traces
- Dyna-Q
- Trajectory Sampling
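For chapter 7, here is a minimal SARSA(λ) with replacing traces sketch, reusing the same kind of made-up five-state corridor (illustrative only, not the book's implementation):

```python
import numpy as np

# Deterministic corridor: states 0..4, actions 0 = left, 1 = right.
# States 0 and 4 are terminal; reaching state 4 pays +1.
N_S, N_A, GOAL = 5, 2, 4

def step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_S - 1)
    return s_next, float(s_next == GOAL), s_next in (0, GOAL)

def sarsa_lambda(episodes=500, alpha=0.5, gamma=0.99, lambda_=0.9, epsilon=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_S, N_A))
    def act(s):
        return rng.integers(N_A) if rng.random() < epsilon else int(np.argmax(Q[s]))
    for _ in range(episodes):
        E = np.zeros_like(Q)                  # eligibility traces
        s, a, done = 2, act(2), False
        while not done:
            s_next, r, done = step(s, a)
            a_next = act(s_next)
            td_error = r + gamma * Q[s_next, a_next] * (not done) - Q[s, a]
            E[s, a] = 1.0                     # replacing trace for the visited pair
            Q += alpha * td_error * E         # credit all recently visited pairs
            E *= gamma * lambda_              # decay every trace
            s, a = s_next, a_next
    return Q

print(sarsa_lambda())
```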
8. Introduction to value-based deep reinforcement learning
- (Livebook)
- (Notebook)
- Implementation of a value-based deep reinforcement learning baseline:
- Neural Fitted Q-iteration (NFQ)
9. More stable value-based methods
- (Livebook)
- (Notebook)
- Implementation of "classic" value-based deep reinforcement learning methods (see the sketch below):
- Deep Q-Networks (DQN)
- Double Deep Q-Networks (DDQN)
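To illustrate the key difference between chapter 9's two methods, here is a sketch of the DQN versus Double DQN (DDQN) bootstrap targets in PyTorch (the tiny networks and random batch are placeholders, not the book's architecture or code):

```python
import torch

obs_dim, n_actions, batch, gamma = 4, 2, 32, 0.99
online = torch.nn.Sequential(torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(),
                             torch.nn.Linear(64, n_actions))
target = torch.nn.Sequential(torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(),
                             torch.nn.Linear(64, n_actions))
target.load_state_dict(online.state_dict())   # target network starts as a copy

# Placeholder batch sampled from a replay buffer.
next_states = torch.randn(batch, obs_dim)
rewards = torch.randn(batch, 1)
dones = torch.zeros(batch, 1)

with torch.no_grad():
    # DQN: the target network both selects and evaluates the next action
    dqn_target = rewards + gamma * target(next_states).max(dim=1, keepdim=True)[0] * (1 - dones)
    # DDQN: the online network selects the action, the target network evaluates it
    best_a = online(next_states).argmax(dim=1, keepdim=True)
    ddqn_target = rewards + gamma * target(next_states).gather(1, best_a) * (1 - dones)
```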
10. Sample-efficient value-based methods
- (Livebook)
- (Notebook)
- Implementation of main improvements for value-based deep reinforcement learning methods (see the sketch below):
- Dueling Deep Q-Networks (Dueling DQN)
- Prioritized Experience Replay (PER)
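For chapter 10, here is a sketch of the dueling Q-network idea (an illustrative architecture, not the book's exact model): a shared feature layer feeds separate value and advantage streams, recombined with the mean-advantage convention.

```python
import torch

class DuelingQNet(torch.nn.Module):
    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.feature = torch.nn.Sequential(torch.nn.Linear(obs_dim, hidden), torch.nn.ReLU())
        self.value = torch.nn.Linear(hidden, 1)              # V(s) stream
        self.advantage = torch.nn.Linear(hidden, n_actions)  # A(s, a) stream

    def forward(self, x):
        h = self.feature(x)
        v, a = self.value(h), self.advantage(h)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)

q = DuelingQNet()
print(q(torch.randn(3, 4)).shape)   # torch.Size([3, 2])
```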
11. Policy-gradient and actor-critic methods
- (Livebook)
- (Notebook)
- Implementation of classic policy-based and actor-critic deep reinforcement learning methods (see the sketch below):
- Policy Gradients without a value function, using Monte-Carlo returns (REINFORCE)
- Policy Gradients with a value-function baseline trained with Monte-Carlo returns (VPG)
- Asynchronous Advantage Actor-Critic (A3C)
- Generalized Advantage Estimation (GAE)
- [Synchronous] Advantage Actor-Critic (A2C)
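For chapter 11, here is a minimal REINFORCE-style loss sketch in PyTorch (illustrative only; the episode rewards, logits, and actions below are made-up placeholders):

```python
import torch

gamma = 0.99
rewards = [0.0, 0.0, 1.0]                                   # placeholder episode rewards
logits = torch.randn(len(rewards), 2, requires_grad=True)   # placeholder policy outputs
actions = torch.tensor([0, 1, 1])                           # actions taken at each step

# Discounted Monte-Carlo return G_t at every timestep, computed backwards.
returns, g = [], 0.0
for r in reversed(rewards):
    g = r + gamma * g
    returns.append(g)
returns = torch.tensor(list(reversed(returns)))

# REINFORCE loss: negative log-probability of each action, weighted by its return.
log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
loss = -(log_probs * returns).sum()
loss.backward()
```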
12. Advanced actor-critic methods
- (Livebook)
- (Notebook)
- Implementation of advanced actor-critic methods (see the sketch below):
- Deep Deterministic Policy Gradient (DDPG)
- Twin Delayed Deep Deterministic Policy Gradient (TD3)
- Soft Actor-Critic (SAC)
- Proximal Policy Optimization (PPO)
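For chapter 12, here is a sketch of PPO's clipped surrogate objective in PyTorch (illustrative only; the log-probabilities and advantages below are random placeholders):

```python
import torch

eps = 0.2                                              # clipping range
advantages = torch.randn(8)                            # placeholder advantage estimates
old_log_probs = torch.randn(8)                         # log-probs under the data-collecting policy
new_log_probs = old_log_probs + 0.1 * torch.randn(8)   # log-probs under the current policy

ratio = (new_log_probs - old_log_probs).exp()          # probability ratio
clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)     # clipped ratio
# Pessimistic (minimum) of the clipped and unclipped objectives, negated for gradient descent.
loss = -torch.min(ratio * advantages, clipped * advantages).mean()
print(loss)
```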
13. Towards artificial general intelligence
- (Livebook)
- (No Notebook)