HFT-price-prediction
A project using a tree-based machine learning model to predict whether an instrument's price will go up or down in high frequency trading.
Project Background
A data science hands-on exercise of a high frequency trading company.
Task
Build a model from the given data to predict whether the trading price will go up or down in the near future (a classification problem).
Data Explanation
Feature Columns
timestamp str, datetime string.
bid_price float, price of the current bid in the market.
bid_qty float, quantity currently available at the bid price.
ask_price float, price of the current ask in the market.
ask_qty float, quantity currently available at the ask price.
trade_price float, last traded price.
sum_trade_1s float, sum of quantity traded over the last second.
bid_advance_time float, seconds since bid price last advanced.
ask_advance_time float, seconds since ask price last advanced.
last_trade_time float, seconds since last trade.
Labels
_1s_side int
_3s_side int
_5s_side int
Labels indicate the type of the first event that occurs within the next x seconds, where:
0 -- No price change.
1 -- Bid price decreased.
2 -- Ask price increased.
Process
Preprocessing
data type conversion: preprocessing()
data check: check_null()
missing value handling: fill_null()
Based on the null check and basic logic, most sum_trade_1s null values occur when last_trade_time is larger than 1 second (in which case sum_trade_1s should be 0). We therefore assume that all sum_trade_1s nulls can be filled with 0. Under this assumption, a null last_trade_time can be filled with the previous record's last_trade_time plus the elapsed time, provided the record interval is smaller than 1 second.
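The filling logic above can be sketched as follows, assuming a pandas DataFrame with the listed columns; the body is illustrative, not the project's exact fill_null() implementation:

```python
import pandas as pd


def fill_null(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the missing-value handling described above."""
    df = df.copy()
    # Assumption: a null sum_trade_1s means no trade printed in the last second.
    df["sum_trade_1s"] = df["sum_trade_1s"].fillna(0.0)

    # last_trade_time: previous value plus the elapsed time between records,
    # applied only where the record interval is under 1 second.
    ts = pd.to_datetime(df["timestamp"])
    interval = ts.diff().dt.total_seconds()
    candidate = df["last_trade_time"].shift(1) + interval
    mask = df["last_trade_time"].isna() & (interval < 1.0)
    df.loc[mask, "last_trade_time"] = candidate[mask]
    return df
```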
Feature Engineering
correlation filter: correlation_filter.filter(), removes columns that are highly correlated, to reduce data redundancy.
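A pairwise-correlation filter of this kind can be sketched as below; the function name and the 0.95 threshold are illustrative assumptions, not the project's actual defaults:

```python
import pandas as pd


def correlation_filter(df: pd.DataFrame, threshold: float = 0.95) -> list:
    """Keep one column of every highly correlated pair.

    For each pair whose absolute Pearson correlation exceeds the
    threshold, the later column is dropped.
    """
    corr = df.corr().abs()
    cols = list(corr.columns)
    drop = set()
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold and cols[j] not in drop:
                drop.add(cols[j])
    return [c for c in cols if c not in drop]
```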
logical feature engineering: feature_eng.basic_features(), builds features based on trading logic.
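Typical order-book features of this kind can be sketched from the raw columns as below; the exact feature set of feature_eng.basic_features() may differ:

```python
import pandas as pd


def basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Illustrative trading-logic features built from the raw columns."""
    out = df.copy()
    out["spread"] = out["ask_price"] - out["bid_price"]
    out["mid_price"] = (out["ask_price"] + out["bid_price"]) / 2
    # Book imbalance: positive when the bid side is heavier.
    out["qty_imbalance"] = (out["bid_qty"] - out["ask_qty"]) / (
        out["bid_qty"] + out["ask_qty"]
    )
    # Where the last trade printed relative to the mid price.
    out["trade_vs_mid"] = out["trade_price"] - out["mid_price"]
    return out
```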
time-rolling feature engineering: feature_eng.lag_rolling_features(), builds features by lagging and rolling over the time series.
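Lag and rolling construction can be sketched as below; the parameter names and defaults are illustrative, not feature_eng.lag_rolling_features()'s actual signature:

```python
import pandas as pd


def lag_rolling_features(df, cols=("trade_price",), lags=(1, 2), windows=(5,)):
    """Shift each column by each lag and add rolling statistics.

    Adds, per column: one shifted copy per lag, plus a rolling mean and
    rolling standard deviation per window of records.
    """
    out = df.copy()
    for col in cols:
        for lag in lags:
            out[f"{col}_lag{lag}"] = out[col].shift(lag)
        for w in windows:
            out[f"{col}_rmean{w}"] = out[col].rolling(w).mean()
            out[f"{col}_rstd{w}"] = out[col].rolling(w).std()
    return out
```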
Feature Selection
feature_selection.select(), a hybrid approach of genetic-algorithm selection plus feature-importance selection.
genetic algorithm selection: feature_selection.GA_features()
feature importance selection: feature_selection.rf_imp_features()
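The importance-based half of the hybrid can be sketched as below (top_k is an assumed parameter): rank features by random-forest impurity importance and keep the top k; the genetic-algorithm half would then search subsets of these survivors.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def rf_imp_features(X, y, feature_names, top_k=10):
    """Illustrative importance-based selection, not the project's exact code."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X, y)
    # Indices of the top_k features by impurity importance, descending.
    order = np.argsort(rf.feature_importances_)[::-1][:top_k]
    return [feature_names[i] for i in order]
```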
Modelling
Ensemble of lightGBM and random forest model.
random forest: model.random_forest()
lightGBM: model.lightgbm()
Parameter Tuning
Depending on the size of the search space, either grid search or genetic search is used for tuning the lightGBM model's parameters.
grid search: model.GS_tune_lgbm()
genetic search: model.GA_tune_lgbm()
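The dispatch between the two tuners can be sketched as below; the max_grid_size cutoff is an assumed illustration of "based on search space", not the project's actual rule:

```python
def tune_lgbm(param_grid, max_grid_size=200):
    """Choose the tuning strategy from the search-space size.

    Exhaustive grid search when the number of parameter combinations is
    small (model.GS_tune_lgbm()); otherwise a genetic search that only
    samples the space (model.GA_tune_lgbm()).
    """
    n_combos = 1
    for values in param_grid.values():
        n_combos *= len(values)
    if n_combos <= max_grid_size:
        return "grid", n_combos
    return "genetic", n_combos
```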
Performance
Out-of-sample classification accuracy is roughly 76-78%, meaning its prediction of short-term future price movement is acceptable.