  • Language
    Python
  • License
    Apache License 2.0

TalkingData AdTracking Fraud Detection Challenge

Models and scores

Model definitions can be found in scripts/model_lib.py.

  • model1: LGBM with 83 features (76 numerical, 7 categorical).

  • model2: Keras with 27 features (18 numerical, 9 categorical). The network structure is shown in model.png.

model   private score  public score
model1  0.9836325      0.9828896
model2  0.9830595      0.9822785
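
The competition ranked submissions by area under the ROC curve (AUC), so the scores above are AUC values. As a reminder of what that number measures, here is a minimal NumPy sketch. `roc_auc` is a hypothetical helper written for this note, not a function from this repository, and it compares every positive/negative pair, so it only suits small arrays.

```python
import numpy as np


def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    # Pairwise score differences: rows are positives, columns are negatives.
    diff = pos[:, None] - neg[None, :]
    correct = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return correct / (len(pos) * len(neg))


# Toy example with two negatives and two positives.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

For data at the scale of this competition, a library implementation would be the practical choice.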

Feature engineering and scripts

Most of these features have already been discussed on the Kaggle forum.

  • counting features

    • mk_feat_count.py
    • mk_feat_count_time.py
    • mk_feat_countRatio.py
  • cumulative count

    • mk_feat_cumcount.py
    • mk_feat_recumcount.py
    • mk_feat_cumratio.py
  • time to next click

    • mk_feat_nextClick_leak_day.py
    • mk_feat_nextClick_filter.py
  • time bucket count (split time into multiple intervals and count the number of buckets in which the IP appears)

    • mk_feat_rangecount.py
    • mk_feat_rangecount_minute.py
  • variance

    • mk_feat_var.py
  • common IP

    • mk_feat_common_ip.py
  • unique count

    • mk_feat_uniq_count2.py
  • target encoding: WoE (weight of evidence)

    • mk_feat_woe_all_prev.py
    • mk_feat_woe_bound.py
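
As a rough illustration of what several of the feature families above compute, the following pandas sketch builds them on a toy click log. The column names (`ip`, `app`, `click_time`) follow the competition's train.csv. The function name, the choice of `ip` as the only grouping key, and the one-minute bucket width are assumptions made for this example and need not match the mk_feat_*.py scripts.

```python
import pandas as pd


def add_basic_features(df):
    """Return a time-sorted copy of df with a few illustrative features."""
    df = df.sort_values("click_time").reset_index(drop=True)

    # counting feature: total number of clicks made by each IP
    df["count_ip"] = df.groupby("ip")["app"].transform("count")

    # cumulative count: number of earlier clicks by the same IP
    df["cumcount_ip"] = df.groupby("ip").cumcount()

    # time to next click: seconds until the same IP clicks again.
    # This looks into the future; it is NaN for an IP's last click.
    next_time = df.groupby("ip")["click_time"].shift(-1)
    df["next_click_ip"] = (next_time - df["click_time"]).dt.total_seconds()

    # unique count: number of distinct apps clicked by each IP
    df["uniq_app_by_ip"] = df.groupby("ip")["app"].transform("nunique")

    # time bucket count: number of distinct one-minute buckets
    # in which the IP appears (bucket width is an assumption)
    bucket = df["click_time"].dt.floor("1min").astype("int64")
    df["rangecount_ip"] = bucket.groupby(df["ip"]).transform("nunique")
    return df


# Toy click log, already in time order.
clicks = pd.DataFrame({
    "ip": [1, 2, 1, 2, 1],
    "app": [3, 3, 3, 3, 9],
    "click_time": pd.to_datetime([
        "2017-11-07 10:00:00",
        "2017-11-07 10:00:01",
        "2017-11-07 10:00:05",
        "2017-11-07 10:00:30",
        "2017-11-07 10:02:00",
    ]),
})
feats = add_basic_features(clicks)
```

The same pattern extends to other grouping keys by passing a list of columns to `groupby`.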

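Weight of evidence (WoE) replaces a categorical value with the log ratio of its share among positive rows to its share among negative rows. A minimal sketch follows, assuming additive smoothing of 0.5 and a positive-over-negative sign convention. `woe_table` is a hypothetical helper; the variants in mk_feat_woe_all_prev.py and mk_feat_woe_bound.py are not shown here and may differ.

```python
import numpy as np
import pandas as pd


def woe_table(df, col, target="is_attributed", smoothing=0.5):
    """Map each value of `col` to log(share of positives / share of negatives)."""
    stats = df.groupby(col)[target].agg(["sum", "count"])
    # Smoothing (an assumption) keeps the log finite for one-sided categories.
    pos = stats["sum"] + smoothing
    neg = stats["count"] - stats["sum"] + smoothing
    return np.log((pos / pos.sum()) / (neg / neg.sum()))


# Toy data: app 9 is always attributed, app 3 mostly is not.
train = pd.DataFrame({
    "app": [3, 3, 3, 9, 9],
    "is_attributed": [0, 0, 1, 1, 1],
})
woe = woe_table(train, "app")
encoded = train["app"].map(woe)
```

To avoid target leakage, a table like this is normally fitted on rows other than the ones being encoded, for example on earlier data only.
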
Features will be calculated once and saved to disk.
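
The compute-once pattern can be sketched as follows. This is only an illustration: `load_or_build`, the pickle format, and the file name are assumptions, and the feature scripts may store their output differently.

```python
import os
import tempfile

import pandas as pd


def load_or_build(path, build_fn):
    """Load a cached feature frame from `path`, or build and save it first."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    feat = build_fn()
    feat.to_pickle(path)
    return feat


calls = []  # records how often the (expensive) builder actually runs


def build_count_feature():
    calls.append(1)
    return pd.DataFrame({"count_ip": [3, 2, 3]})


cache_path = os.path.join(tempfile.mkdtemp(), "count_ip.pkl")
first = load_or_build(cache_path, build_count_feature)   # builds and saves
second = load_or_build(cache_path, build_count_feature)  # loads from disk
```

Because the second call finds the file on disk, the builder runs only once.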

Feature importance from LGBM can be found in importance.txt.

Requirements

I used the following environment:

Hardware:

  • Memory: 256GB RAM, 256GB SWAP
  • CPU: 20 core, 2.10GHz
  • GPU: 1080Ti

Python3 packages:

  • numpy==1.14.2
  • pandas==0.22.0
  • lightgbm==2.1.0
  • keras==2.1.5

How to run

First, put sample_submission.csv, test.csv, test_supplement.csv, and train.csv in the input directory.
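
Since feature extraction runs for a long time, it can be worth confirming that all four files are in place before starting. A small sketch of such a check is below; `missing_inputs` is a hypothetical helper that is not part of this repository, and the path to the input directory is left to the caller.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = (
    "sample_submission.csv",
    "test.csv",
    "test_supplement.csv",
    "train.csv",
)


def missing_inputs(input_dir):
    """Return the required file names that are absent from input_dir."""
    root = Path(input_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


# Demo on a temporary directory that holds only two of the four files.
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "train.csv").touch()
(demo_dir / "test.csv").touch()
missing = missing_inputs(demo_dir)
```

An empty list means every required file was found.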

Then run the shell scripts as follows:

$ cd scripts/

$ ./run_mk_feats.sh

$ ./run_mk_model1.sh

$ ./run_mk_model2.sh

Output prediction files will be written to the csv directory.

Feature extraction (run_mk_feats.sh) took about one day.

Building model1 (run_mk_model1.sh) needs a large amount of memory (~256GB), sorry.

A GPU is required to build model2 (run_mk_model2.sh).