Contents:
- Update
- Purpose of this repository
- Example
- Dependencies
- Installation
- TODO
- Acknowledgements
- Contributing
- Disclaimer
Appendix:
Update:
See latest PR.
Purpose of this repository:
The purpose of this repository is to create a custom MARL (multi-agent reinforcement learning) environment where multiple agents trade against one another in a CDA (continuous double auction).
The environment doesn't use any external data: all data is generated by the agents themselves through self-play, via their interaction with the limit order book (LOB).
At each time step, the environment emits the top k rows of the aggregated order book as observations to the agents. Each agent then samples an action from the action space & all actions are randomly shuffled before they are executed within that time step.
Each time step is a snapshot of the limit order book, & a key assumption is that all traders (agents) suffer the same lag: each trader waits for all traders' orders to be executed before seeing the next LOB snapshot.
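As a rough illustration of this loop, here is a minimal sketch of the per-step interaction, assuming an RLlib-style multi-agent interface (observations, actions, rewards, dones & infos are dicts keyed by agent ID) & a shared action space. The variable names & the use of random actions are illustrative only, not the exact API of this repo:
# `env` is an already-constructed instance of this CDA environment (see the notebook).
obs = env.reset()                    # {agent_id: top-k rows of the aggregated LOB}
done = {"__all__": False}
while not done["__all__"]:
    # Each agent picks an action (random here; a trained policy in the notebook).
    actions = {agent_id: env.action_space.sample() for agent_id in obs}
    # Inside step(), the actions are shuffled, matched against the LOB, & every
    # trader then receives the next LOB snapshot at the same time (same lag).
    obs, rewards, done, infos = env.step(actions)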
Example:
The example is available in the Jupyter notebook CDA_NSP.ipynb, implemented with RLlib. This notebook is tested in Colab.
This example uses two trained agents & N random agents. All agents compete with one another in this zero-sum environment, regardless of whether they're trained or random.
Competitive self-play:
The policy weights of the winning trained agent (trader) are used to replace the policy weights of the other trained agents after each training iteration. Winning here is defined as having the highest reward per training iteration.
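A rough sketch of this weight-copying step with RLlib (the policy IDs "policy_0" & "policy_1" for the two trained agents & the surrounding loop are assumptions for illustration; see the notebook for the actual training workflow):
# Sketch: after each training iteration, copy the winning trained policy's
# weights into the other trained policy. Illustrative only.
result = trainer.train()
means = result["policy_reward_mean"]
# The winner is the trained policy with the highest mean reward this iteration.
winner = max(["policy_0", "policy_1"], key=lambda pid: means[pid])
winner_weights = trainer.get_policy(winner).get_weights()
for pid in ["policy_0", "policy_1"]:
    if pid != winner:
        trainer.get_policy(pid).set_weights(winner_weights)
# (In a full setup, the updated weights would also need to be synced to the
# rollout workers before the next iteration.)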
The reward function requires the agents to maximize profit while minimizing the number of trades made in an episode (trading session). As the number of trades accumulates in the later stages of a session, profits will be scaled down by the number of trades & losses will be magnified.
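For example, with the scaled reward shown in the appendix, a trader whose NAV increases by 100 after 4 trades receives 100 / (4 + 1) = 20, while a trader whose NAV drops by 100 after the same 4 trades receives -100 * (4 + 1) = -500.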
The trained agents are P0 & P1, both using separate PPO policy weights. The rest are random agents.
The results with 10 agents are shown in the figures below:
If you're running locally, you can run the following command & navigate to localhost:6006 in your browser to access the TensorBoard graphs:
$ tensorboard --logdir ~/ray_results
Other ways to run this environment:
Run the CDA_env_rand.py script, which runs a CDA simulator with dummy (non-learning) random agents.
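For example (assuming the script sits at the repository root; adjust the path to match its location in the repo):
$ python CDA_env_rand.py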
Dependencies:
- tensorflow
- ray[rllib]
- pandas
- sortedcontainers
- sklearn
For a full list of dependencies & versions, see requirements.txt in this repository.
Installation:
The environment is installable via pip.
$ cd gym-continuousDoubleAuction
$ pip install -e .
TODO:
- Better documentation.
- Generalize the environment to use more than 1 LOB.
- Parametric or hybrid action space (or experiment with different types of action space).
- Expose the limit orders (currently in the LOB or aggregated LOB) that belong to a particular trader as observations to that trader.
- Allow traders to have random starting capital.
- Instead of all traders (agents) having the same lag, introduce zero lag (each LOB snapshot at each t-step is visible to all traders) or random lag.
- Allow a distribution of previous winning policies to be selected for trained agents. (training)
- Custom RLlib workflow to include custom RND + PPO policies. (training)
- Update the current model (deprecated) or use the default from RLlib. (training)
- Move TODO to issues.
Acknowledgements:
The orderbook matching engine is adapted from https://github.com/dyn4mik3/OrderBook
Contributing:
Please see CONTRIBUTING.md.
Disclaimer:
This repository is only meant for research purposes & is never meant to be used in any form of trading. Past performance is no guarantee of future results. If you suffer losses from using this repository, you are the sole person responsible for the losses. The author will NOT be held responsible in any way.
Appendix:
Observation space:
Each obs is a snapshot of the aggregated LOB at each environment step.
obs = [array([1026., 2883., 1258., 1263., 3392., 1300., 1950., 1894., 2401., 4241.]),  # bid size list
       array([64., 63., 62., 61., 60., 59., 58., 57., 56., 55.]),                      # bid price list
       array([-519., -2108., -215., -1094., -1687., -2667., -3440., -2902., -1440., -3078.]),  # ask size list
       array([-65., -66., -67., -68., -69., -70., -71., -72., -73., -74.])]            # ask price list
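As a quick illustration, a minimal sketch of unpacking the sample observation above (variable names are illustrative; the sign convention on the ask side follows the comments above):
import numpy as np

# The sample observation from above: [bid sizes, bid prices, ask sizes, ask prices],
# one entry per LOB level. Ask sizes & prices are stored negated.
obs = [np.array([1026., 2883., 1258., 1263., 3392., 1300., 1950., 1894., 2401., 4241.]),
       np.array([64., 63., 62., 61., 60., 59., 58., 57., 56., 55.]),
       np.array([-519., -2108., -215., -1094., -1687., -2667., -3440., -2902., -1440., -3078.]),
       np.array([-65., -66., -67., -68., -69., -70., -71., -72., -73., -74.])]

bid_size, bid_price, ask_size, ask_price = obs
best_bid = bid_price[0]        # 64.0
best_ask = -ask_price[0]       # 65.0 (undo the negation)
spread = best_ask - best_bid   # 1.0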
Action space:
See PR 9 for the current action space.
Reward:
If NAV_chg is used as the reward, the episode_reward from the RLlib training output will be 0, indicating a zero-sum game.
NAV_chg = float(trader.acc.nav - trader.acc.prev_nav)
# maximize NAV
#rewards[trader.ID] = NAV_chg
However, if NAV_chg is scaled, then the episode_reward from the RLlib training output will NOT be 0.
# maximize NAV, minimize num of trades (more trades gets penalized).
if NAV_chg >= 0:
rewards[trader.ID] = NAV_chg / (trader.acc.num_trades + 1)
else:
rewards[trader.ID] = NAV_chg * (trader.acc.num_trades + 1)
trader.acc.reward = rewards[trader.ID]
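The scaling breaks the zero-sum property of the per-step rewards because a positive NAV change is divided by (num_trades + 1) while a negative NAV change of the same magnitude is multiplied by it, so the rewards of the winning & losing sides of a trade no longer cancel out.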
Making sense of the render output:
The step separator:
************************************************** t_step = 306 **************************************************
Actions:
Actions output from the model:
- Each column represents the action from each trader(agent).
- Row 1 represents the side: none, bid, ask (0 to 2).
- Row 2 represents the type: market, limit, modify, cancel.
- Row 3 represents the mean for size selection.
- Row 4 represents the sigma for size selection.
- Row 5 represents the price, based on LOB market depth (0 to 11).
Model actions:
-- -- -- --
1 1 1 2
1 0 0 1
39 29 6 17
19 89 13 0
7 4 9 10
-- -- -- --
Formatted actions acceptable by LOB:
- Column 1 represents the ID of each trader (agent).
- Column 2 represents the side: none, bid, ask (0 to 2).
- Column 3 represents the type: market, limit, modify, cancel.
- Column 4 represents the order size.
- Column 5 represents the order price.
- --- ------ ----- --
0 bid limit 38982 15
1 bid market 5779 0
2 bid market 999 0
3 ask limit 17001 47
- --- ------ ----- --
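A rough sketch of the kind of decoding that turns one raw action column into a formatted LOB order (the exact size scaling & price mapping used by this environment may differ; the helper below & its constants are purely illustrative):
import numpy as np

SIDES = ["none", "bid", "ask"]
TYPES = ["market", "limit", "modify", "cancel"]

def decode_action(raw, lob_prices, size_scale=1000, rng=None):
    # Illustrative decoding of one raw action column: [side, type, mean, sigma, price_idx].
    # lob_prices: price levels for the chosen side (e.g. from the aggregated LOB).
    # size_scale: a made-up scaling factor for this sketch only.
    rng = rng or np.random.default_rng()
    side_idx, type_idx, mean, sigma, price_idx = raw
    side = SIDES[side_idx]
    order_type = TYPES[type_idx]
    # Order size: sampled from a normal parameterised by the mean & sigma rows.
    size = max(1, int(abs(rng.normal(mean, sigma)) * size_scale))
    # Order price: a market order is submitted with price 0 here; a limit order
    # picks a level from the market depth by index.
    price = 0 if order_type == "market" else lob_prices[min(price_idx, len(lob_prices) - 1)]
    return side, order_type, size, price

# Example (illustrative): decode_action([1, 1, 39, 19, 7], [23, 22, 21, 20, 19, 15, 14, 12, 11, 10])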
Shuffled action queue sequence for LOB executions:
- --- ------ ----- --
3 ask limit 17001 47
2 bid market 999 0
1 bid market 5779 0
0 bid limit 38982 15
- --- ------ ----- --
Rewards, dones, & infos:
rewards:
{0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}
dones:
{'__all__': True}
infos:
{0: {}, 1: {}, 2: {}, 3: {}}
Aggregated LOB:
- The columns represent the 10 levels (1 to 10, left to right) of the market depth in the LOB.
- Row 1 represents the bid size.
- Row 2 represents the bid price.
- Row 3 represents the ask size.
- Row 4 represents the ask price.
agg LOB @ t-1
------ ----- ------ ------ ------ ------ ------ ------ ------ ------
7746 19011 126634 116130 43073 124055 74977 188096 139117 143968
23 22 21 20 19 15 14 12 11 10
-62448 -7224 -65989 -96940 -77985 -93987 -55942 -4173 -16998 -81011
-36 -37 -38 -39 -40 -41 -42 -43 -47 -48
------ ----- ------ ------ ------ ------ ------ ------ ------ ------
agg LOB @ t
------ ----- ------ ------ ------ ------ ------ ------ ------ ------
7746 19011 126634 116130 43073 163037 74977 188096 139117 143968
23 22 21 20 19 15 14 12 11 10
-56669 -7224 -65989 -96940 -77985 -93987 -55942 -4173 -33999 -81011
-36 -37 -38 -39 -40 -41 -42 -43 -47 -48
------ ----- ------ ------ ------ ------ ------ ------ ------ ------
LOB bids:
The current limit bid orders in the LOB.
LOB:
***Bids***
size price trade_id timestamp order_id
0 7746 23 0 345 265
1 19011 22 1 344 231
2 14553 21 2 107 99
3 63025 21 1 333 209
4 49056 21 3 349 268
5 89029 20 2 53 53
6 24060 20 0 201 46
7 3041 20 1 297 229
8 43073 19 1 35 35
9 42989 15 1 340 234
10 81066 15 3 336 259
11 38982 15 0 359 275
12 63003 14 0 253 201
13 11974 14 1 285 168
14 18089 12 3 351 105
15 91998 12 0 343 264
16 78009 12 1 352 40
17 45039 11 3 123 101
18 94078 11 0 204 172
19 97967 10 3 223 185
20 46001 10 1 313 243
21 45871 9 2 52 52
22 94993 9 3 209 176
LOB asks:
The current limit ask orders in the LOB.
***Asks***
size price trade_id timestamp order_id
0 40654 36 3 322 250
1 16015 36 0 323 251
2 7224 37 1 272 214
3 39980 38 3 299 190
4 26009 38 1 261 206
5 58977 39 0 231 188
6 37963 39 3 284 164
7 15995 40 0 305 235
8 61990 40 3 328 254
9 93987 41 0 353 143
10 55942 42 1 290 189
11 4173 43 0 112 104
12 16998 47 1 341 239
Tape (Time & sales):
***tape***
size price timestamp counter_party_ID init_party_ID init_party_side
0 5779 36 358 3 1 bid
1 5894 36 356 3 0 bid
2 13347 36 355 3 1 bid
3 2272 36 354 3 0 bid
4 894 23 350 0 1 ask
5 12874 23 347 0 0 ask
6 7501 23 346 0 1 ask
7 9405 22 342 1 3 ask
Trades:
Trades that took place when executing the action of a trader(agent) at t-step.
act_seq_num represents the sequence of the action. In this case, it's the 2nd action executed at t-step.
TRADES (act_seq_num): 2
seq_Trade_ID timestamp price size time counter_ID counter_side counter_order_ID counter_new_book_size init_ID init_side init_order_ID init_new_LOB_size
0 0 358 36 5779.0 358 3 ask 250 40654 1 bid None None
New order in LOB:
The new limit orders inserted into the LOB (including any unfilled leftover quantity from a previous order).
order_in_book (act_seq_num): 0
type side quantity price trade_id timestamp order_id
------ ------ ---------- ------- ---------- ----------- ----------
limit ask 17001 47 3 357 273
order_in_book (act_seq_num): 3
type side quantity price trade_id timestamp order_id
------ ------ ---------- ------- ---------- ----------- ----------
limit bid 38982 15 0 359 275
Mark to market profit @ t-step:
mark_to_mkt profit@t:
ID: 0; profit: 1491150.999999999999999999998
ID: 1; profit: 3583508.999999999999999999995
ID: 2; profit: -7421583.999999999999999999999
ID: 3; profit: -676658.0000000000000000000013
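These figures appear consistent with marking each trader's net position to the last traded price (36 in the tape above) against that trader's VWAP from the Accounts table below: e.g. trader 0 holds a net short position of 375119 at a VWAP of 39.9751, giving (39.9751 - 36) * 375119 ≈ 1.49e6, & trader 3 holds a net long position of 216428 at a VWAP of 39.1265, giving (36 - 39.1265) * 216428 ≈ -676658.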
Accounts info:
Accounts:
ID cash cash_on_hold position_val prev_nav nav net_position VWAP profit total_profit num_trades
---- ------------ -------------- -------------- ------------ ------------ -------------- ------- ----------------- -------------- ------------
0 -4.51044e+07 3.11089e+07 1.64866e+07 2.49115e+06 2.49115e+06 -375119 39.9751 1.49115e+06 1.49115e+06 74
1 -3.8919e+07 3.27787e+07 1.07237e+07 4.58351e+06 4.58351e+06 -98798 72.2711 3.58351e+06 3.58351e+06 78
2 -1.92421e+07 3.55094e+06 9.2696e+06 -6.42158e+06 -6.42158e+06 257489 64.8229 -7.42158e+06 -7.42158e+06 23
3 -4.46985e+07 4.0254e+07 7.79141e+06 3.34692e+06 3.34692e+06 216428 39.1265 -676658 2.34692e+06 79
- total_sys_profit (total profit of all agents at each step) should be equal to 0 (zero-sum game).
- total_sys_nav (total net asset value of all agents at each step) is the total sum of the beginning NAV of all traders (agents).
Note: Small random rounding errors are present.
total_sys_profit = -9E-21; total_sys_nav = 3999999.999999999999999999991
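For example, the nav column of the four accounts above sums to 2.49115e6 + 4.58351e6 - 6.42158e6 + 3.34692e6 ≈ 4e6, matching total_sys_nav, while the total_profit column sums to approximately 0, matching total_sys_profit.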
Sample output results for final training iteration:
- The episode_reward is zero (zero-sum game) for each episode.
episode_reward_max: 0.0
episode_reward_mean: 0.0
episode_reward_min: 0.0
- The mean reward of each policy is shown under policy_reward_mean.
...
Result for PPO_continuousDoubleAuction-v0_0:
custom_metrics: {}
date: 2019-09-30_21-16-20
done: true
episode_len_mean: 1001.0
episode_reward_max: 0.0
episode_reward_mean: 0.0
episode_reward_min: 0.0
episodes_this_iter: 4
episodes_total: 38
experiment_id: 56cbdad4389343eca5cfd49eadeb3554
hostname: Duality0.local
info:
grad_time_ms: 15007.219
learner:
policy_0:
cur_kl_coeff: 0.0003906250058207661
cur_lr: 4.999999873689376e-05
entropy: 10.819798469543457
entropy_coeff: 0.0
kl: 8.689265087014064e-06
model: {}
policy_loss: 153.9163055419922
total_loss: 843138688.0
vf_explained_var: 0.0
vf_loss: 843138496.0
num_steps_sampled: 40000
num_steps_trained: 40000
opt_peak_throughput: 266.538
opt_samples: 4000.0
sample_peak_throughput: 80.462
sample_time_ms: 49713.208
update_time_ms: 176.14
iterations_since_restore: 10
node_ip: 192.168.1.12
num_healthy_workers: 2
off_policy_estimator: {}
pid: 10220
policy_reward_mean:
policy_0: 12414.421052631578
policy_1: -301.39473684210526
policy_2: -952.1578947368421
policy_3: -11160.868421052632
sampler_perf:
mean_env_wait_ms: 18.1753569144153
mean_inference_ms: 4.126144958830859
mean_processing_ms: 1.5262831265657335
time_since_restore: 649.1416146755219
time_this_iter_s: 61.54709506034851
time_total_s: 649.1416146755219
timestamp: 1569849380
timesteps_since_restore: 40000
timesteps_this_iter: 4000
timesteps_total: 40000
training_iteration: 10
trial_id: ea67f638
2019-09-30 21:16:20,507 WARNING util.py:145 -- The `process_trial` operation took 0.4397752285003662 seconds to complete, which may be a performance bottleneck.
2019-09-30 21:16:21,407 WARNING util.py:145 -- The `experiment_checkpoint` operation took 0.899777889251709 seconds to complete, which may be a performance bottleneck.
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/4 CPUs, 0/0 GPUs
Memory usage on this node: 3.3/4.3 GB
Result logdir: /Users/hadron0/ray_results/PPO
Number of trials: 1 ({'TERMINATED': 1})
TERMINATED trials:
- PPO_continuousDoubleAuction-v0_0: TERMINATED, [3 CPUs, 0 GPUs], [pid=10220], 649 s, 10 iter, 40000 ts, 0 rew