HFT-price-prediction

A project that uses tree-based machine learning models to predict whether an instrument's price will move up or down over the short term in high-frequency trading.

Project Background

A hands-on data science exercise from a high-frequency trading company.

Task

Build a model on the given data to predict whether the trading price will go up or down in the near future (a classification problem).

Data Explanation

Feature Columns

timestamp str, datetime string.
bid_price float, price of current bid in the market.
bid_qty float, quantity currently available at the bid price.
ask_price float, price of current ask in the market.
ask_qty float, quantity currently available at the ask price.
trade_price float, last traded price.
sum_trade_1s float, sum of quantity traded over the last second.
bid_advance_time float, seconds since bid price last advanced.
ask_advance_time float, seconds since ask price last advanced.
last_trade_time float, seconds since last trade.

Labels

_1s_side int
_3s_side int
_5s_side int
Each label indicates the type of the first event that occurs within the next x seconds (x = 1, 3, or 5), where:
0 -- No price change.
1 -- Bid price decreased.
2 -- Ask price increased.
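
The label definition above can be illustrated with a small sketch. The source does not say how the labels were generated, so everything here is an assumption: the hypothetical helper first_event_labels compares each later quote with the quote on the current row and records whichever of the two listed events occurs first inside the horizon.

```python
NO_CHANGE, BID_DOWN, ASK_UP = 0, 1, 2

def first_event_labels(times, bids, asks, horizon):
    """Label each row with the first quote event seen within `horizon` seconds.

    Assumption (not from the source): an event is measured against the
    quote on the current row, and only the two listed events count.
    """
    labels = []
    n = len(times)
    for i in range(n):
        label = NO_CHANGE
        for j in range(i + 1, n):
            if times[j] - times[i] > horizon:
                break  # left the look-ahead window without an event
            if bids[j] < bids[i]:
                label = BID_DOWN
                break
            if asks[j] > asks[i]:
                label = ASK_UP
                break
        labels.append(label)
    return labels

# Toy quotes: seconds since start, bid prices, ask prices.
times = [0.0, 0.4, 0.9, 2.5, 2.8]
bids = [100.0, 100.0, 99.0, 99.0, 99.0]
asks = [101.0, 101.0, 101.0, 101.0, 102.0]

print(first_event_labels(times, bids, asks, horizon=1.0))  # [1, 1, 0, 2, 0]
print(first_event_labels(times, bids, asks, horizon=3.0))  # [1, 1, 2, 2, 0]
```

The third row shows why the horizon matters: nothing happens within 1 second, but the ask rises within 3 seconds, so _1s_side would be 0 and _3s_side would be 2 under these assumptions.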

Process

Preprocessing

data type conversion: preprocessing()
data check: check_null()
missing value handling: fill_null(). The null check shows that most null values of sum_trade_1s occur when last_trade_time is larger than 1 second, in which case sum_trade_1s should be 0. We therefore assume that every null sum_trade_1s can be filled with 0. Under the same assumption, a null last_trade_time is filled with the previous record's last_trade_time plus the time elapsed since that record, provided the interval between the two records is shorter than 1 second.
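
The filling rules described above might look roughly like the following pandas sketch. It is not the repository's fill_null(); the function name fill_null_sketch and the toy data are made up for illustration, and rows whose gap to the previous record is 1 second or more are simply left null because the text does not say how they are handled.

```python
import pandas as pd

def fill_null_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the two filling assumptions described in the text."""
    out = df.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"])

    # Rule 1: a missing traded quantity is treated as "nothing traded".
    out["sum_trade_1s"] = out["sum_trade_1s"].fillna(0.0)

    # Rule 2: carry last_trade_time forward, adding the elapsed time,
    # but only when the gap to the previous record is under 1 second.
    gaps = out["timestamp"].diff().dt.total_seconds().tolist()
    values = out["last_trade_time"].tolist()
    for i in range(1, len(values)):
        if pd.isna(values[i]) and not pd.isna(values[i - 1]) and gaps[i] < 1.0:
            values[i] = values[i - 1] + gaps[i]
    out["last_trade_time"] = values
    return out

raw = pd.DataFrame({
    "timestamp": [
        "2020-01-01 09:00:00.000",
        "2020-01-01 09:00:00.500",
        "2020-01-01 09:00:00.750",
        "2020-01-01 09:00:03.000",
    ],
    "sum_trade_1s": [5.0, None, 2.0, None],
    "last_trade_time": [0.25, None, None, None],
})
filled = fill_null_sketch(raw)
print(filled[["sum_trade_1s", "last_trade_time"]])
```

The loop runs row by row on purpose: a freshly filled value can then seed the next null row, which a single vectorized shift would not do.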

Feature Engineering

correlation filter: correlation_filter.filter(), remove columns that are highly correlated to reduce data redundancy.
logical feature engineering: feature_eng.basic_features(), build up some features based on trading logic.
time-rolling feature engineering: feature_eng.lag_rolling_features(), build up features by lagging and rolling of time-series.
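
As a rough illustration of the three steps above, here is a minimal pandas sketch. The specific features (spread, mid price, quantity imbalance), the 0.95 correlation threshold, and the row-count rolling window are all assumptions chosen for the example; the repository's own functions may compute different features and may roll over time rather than over rows. The ask price column is assumed to be named ask_price.

```python
import numpy as np
import pandas as pd

def basic_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical quote-based features, not the repository's list."""
    out = df.copy()
    out["spread"] = out["ask_price"] - out["bid_price"]
    out["mid_price"] = (out["ask_price"] + out["bid_price"]) / 2
    total_qty = out["bid_qty"] + out["ask_qty"]
    out["qty_imbalance"] = (out["bid_qty"] - out["ask_qty"]) / total_qty
    return out

def lag_rolling_features(df, col, lags=(1, 2), window=3):
    """Add lagged copies and a rolling mean of one column (row-based window)."""
    out = df.copy()
    for lag in lags:
        out[f"{col}_lag{lag}"] = out[col].shift(lag)
    out[f"{col}_roll_mean{window}"] = out[col].rolling(window).mean()
    return out

def correlation_filter(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Drop the later column of every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=to_drop)

quotes = pd.DataFrame({
    "bid_price": [100.0, 100.0, 99.0, 99.0],
    "ask_price": [101.0, 101.0, 101.0, 102.0],
    "bid_qty": [10.0, 30.0, 20.0, 5.0],
    "ask_qty": [30.0, 10.0, 20.0, 15.0],
})
feats = basic_features(quotes)
lagged = lag_rolling_features(quotes, "bid_price", lags=(1,), window=2)

redundant = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [3.0, 5.0, 7.0, 9.0, 11.0],  # exactly 2 * a + 1
    "c": [2.0, 1.0, 4.0, 3.0, 6.0],
})
filtered = correlation_filter(redundant)
print(list(filtered.columns))  # ['a', 'c']
```

Dropping the later column of a correlated pair is only one possible convention; keeping the column that correlates better with the label would be another.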

Feature Selection

feature_selection.select(): a hybrid approach that combines genetic algorithm selection with feature importance selection.
genetic algorithm selection: feature_selection.GA_features()
feature importance selection: feature_selection.rf_imp_features()
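
The importance-based half of this step can be sketched with scikit-learn's RandomForestClassifier. This is an assumption about how rf_imp_features() might work, not its actual code: the helper name rf_importance_select, the top_k cutoff, and the synthetic data are invented for the example, and the genetic algorithm half is left out.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def rf_importance_select(X: pd.DataFrame, y, top_k: int = 2, seed: int = 0):
    """Return the top_k column names ranked by random forest importance."""
    model = RandomForestClassifier(n_estimators=100, random_state=seed)
    model.fit(X, y)
    ranked = pd.Series(model.feature_importances_, index=X.columns)
    ranked = ranked.sort_values(ascending=False)
    return ranked.index[:top_k].tolist()

# Synthetic data: only "signal" determines the label, the rest is noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(300, 4)),
    columns=["signal", "noise_a", "noise_b", "noise_c"],
)
y = (X["signal"] > 0).astype(int)

selected = rf_importance_select(X, y, top_k=1)
print(selected)
```

A hybrid scheme could then merge this list with the subset found by the genetic search, for example by taking their union, though the source does not state how the two results are combined.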

Modelling

An ensemble of LightGBM and random forest models.
random forest: model.random_forest()
LightGBM: model.lightgbm()
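
The source does not say how the two models are combined. One common option is soft voting, where the class probabilities of both models are averaged and the most probable class wins. The sketch below assumes that scheme and works on plain probability arrays, so the helper soft_vote and the numbers are illustrative only.

```python
import numpy as np

def soft_vote(prob_a, prob_b, weight_a: float = 0.5):
    """Blend two (n_samples, n_classes) probability arrays and pick a class."""
    prob_a = np.asarray(prob_a, dtype=float)
    prob_b = np.asarray(prob_b, dtype=float)
    blended = weight_a * prob_a + (1.0 - weight_a) * prob_b
    return blended.argmax(axis=1), blended

# Made-up class probabilities for labels 0, 1, 2 on two samples.
rf_probs = [[0.6, 0.3, 0.1],
            [0.2, 0.5, 0.3]]
lgbm_probs = [[0.1, 0.1, 0.8],
              [0.1, 0.6, 0.3]]

labels, blended = soft_vote(rf_probs, lgbm_probs)
print(labels.tolist())  # [2, 1]
```

On the first sample the random forest alone would predict class 0, but the more confident second model tips the blended vote to class 2. With fitted models, the two inputs would be the outputs of each model's predict_proba call.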

Parameter Tuning

Depending on the search space, either grid search or genetic search is used to tune the LightGBM model's parameters.
grid search: model.GS_tune_lgbm()
genetic search: model.GA_tune_lgbm()
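
A simple way to make that choice is to count the points in the grid and fall back to genetic search when exhaustive enumeration becomes too expensive. The sketch below assumes this rule; the cutoff of 200 points, the helper names, and the example search spaces are illustrative and do not come from the repository.

```python
from itertools import product
from math import prod

def choose_search(space: dict, max_grid_points: int = 200):
    """Pick 'grid' for small search spaces and 'genetic' for large ones."""
    n_points = prod(len(values) for values in space.values())
    strategy = "grid" if n_points <= max_grid_points else "genetic"
    return strategy, n_points

def grid_points(space: dict):
    """Yield every parameter combination of a grid search space."""
    keys = list(space)
    for combo in product(*(space[k] for k in keys)):
        yield dict(zip(keys, combo))

small_space = {
    "num_leaves": [15, 31],
    "learning_rate": [0.05, 0.1],
}
large_space = {
    "num_leaves": list(range(8, 128)),
    "learning_rate": [0.01, 0.02, 0.05, 0.1],
    "n_estimators": [100, 200, 400],
}

print(choose_search(small_space))  # ('grid', 4)
print(choose_search(large_space))  # ('genetic', 1440)
print(next(grid_points(small_space)))  # {'num_leaves': 15, 'learning_rate': 0.05}
```

Each grid point costs one full model fit (times the number of cross-validation folds), so the cutoff is best set from the available training budget.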

Performance

Out-of-sample classification accuracy is roughly 76-78%, which suggests the model predicts short-term price movement acceptably well.