
Alibaba Cloud Security Malicious Program Detection Competition

Machine Learning for Malware Detection

edited by Raymond Luo

Environment

Python 3
RTX 2080, i7-9700K, 16 GB RAM
Ubuntu 18.04

Requirement

TensorFlow==1.14, Keras==2.3.1

sudo pip3 install -r requirements.txt

Please put security_train.xlsx and security_test.xlsx in the corresponding folders.

Load file

python3 loadfile.py

Train multi-vision LSTM

python3 train_lstm.py

Train LSTM

python3 train_lstm2.py

Train TextCNN LSTM

python3 train_lstm3.py

Train TextCNN

python3 train_textcnn.py

Train TF-IDF model

python3 xgboost.py

Stack models

python3 stack_result.py

Result

Results will be written to the submit folder.

File Preprocessing

Data format

field     type      description
file_id   bigint    ID of the file.
label     bigint    File label: 0 = normal / 1 = ransomware / 2 = mining program / 3 = DDoS trojan / 4 = worm / 5 = file-infecting virus / 6 = backdoor / 7 = trojan.
api       string    The API call made by the file.
tid       bigint    Thread ID of the API call.
index     string    The order of the API call.

Alibaba Cloud Security Malicious Program Detection

Our method:

First, we group the records by ‘file_id’. For each file, we then group the API calls by ‘tid’ and sort them by ‘index’. Finally, we concatenate all threads into one line to obtain the final representation of one sample.

Sample: LdrLoadDll LdrGetProcedureAddress LdrGetProcedureAddress LdrGetProcedureAddress LdrGetProcedureAddress ... → label: 5
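
A minimal sketch of this preprocessing with pandas (the function name is illustrative; loadfile.py in the repository is presumably the authoritative version of this step):

import pandas as pd

def build_sequences(df):
    # df has the columns from the table above: file_id, label, api, tid, index
    # order API calls within each file: by thread, then by call index
    df = df.sort_values(['file_id', 'tid', 'index'])
    # one space-separated string of API names per thread
    threads = df.groupby(['file_id', 'tid'])['api'].apply(' '.join)
    # concatenate all threads of a file into a single line
    texts = threads.groupby(level='file_id').apply(' '.join).rename('text')
    labels = df.groupby('file_id')['label'].first()
    return pd.concat([texts, labels], axis=1).reset_index()

train_df = build_sequences(pd.read_excel('security_train.xlsx'))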

Models

For the best score, we use five models and stack them to obtain our final result.

TF-IDF model:

To capture the global information of each sample, we first extract TF-IDF features, computed on 1- to 5-grams of the preprocessed API sequences.

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(ngram_range=(1, 5), min_df=3, max_df=0.9)  # TF-IDF feature extraction, ngram_range=(1, 5)

Then we classify the samples with XGBoost.

import xgboost as xgb

param = {'max_depth': 6, 'eta': 0.1, 'eval_metric': 'mlogloss', 'silent': 1, 'objective': 'multi:softprob',
         'num_class': 8, 'subsample': 0.8,
         'colsample_bytree': 0.85}  # parameters

evallist = [(dtrain, 'train'), (dtest, 'val')]  # evaluation sets watched during training
num_round = 300  # number of boosting rounds
bst = xgb.train(param, dtrain, num_round, evallist, early_stopping_rounds=50)
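
The excerpt above does not show how dtrain and dtest are built; they would come from the TF-IDF features along these lines (a sketch; train_texts, train_labels, and the split ratio are illustrative, not taken from the repository):

from sklearn.model_selection import train_test_split

# TfidfVectorizer consumes the space-separated API strings directly
x = vectorizer.fit_transform(train_texts)
x_tr, x_val, y_tr, y_val = train_test_split(x, train_labels, test_size=0.1, stratify=train_labels)
dtrain = xgb.DMatrix(x_tr, label=y_tr)  # DMatrix accepts scipy sparse matrices
dtest = xgb.DMatrix(x_val, label=y_val)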

TextCNN model

In this model, we use a TextCNN to extract features and classify the samples.

Each sample is truncated to a maximum length of 6,000 tokens, and the labels are one-hot encoded, as in the sketch below.
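
A sketch of this step (texts and labels follow the preprocessing sketch above; the tokenizer settings are assumptions, not taken from the repository):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

maxlen = 6000
tokenizer = Tokenizer()                       # maps each API name to an integer id
tokenizer.fit_on_texts(texts)
seqs = tokenizer.texts_to_sequences(texts)
x = pad_sequences(seqs, maxlen=maxlen)        # truncate/pad every sample to length 6000
y = to_categorical(labels, num_classes=8)     # one-hot encode the 8 classes
# note: the models below use Embedding(304, ...), i.e. at most 304 distinct token ids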

Each sample goes through an embedding layer and then into the CNN layers.

In the variant shown below, we combine four kernel sizes (2, 3, 4, 5) with four dilation rates (1, 2, 3, 4), giving 16 convolutional branches with 64 filters each.

After the CNN layers, we concatenate the pooled features and classify them with fully connected layers.

from keras.layers import (Input, Embedding, SpatialDropout1D, Conv1D,
                          GlobalMaxPooling1D, Dense, Dropout, concatenate)
from keras.models import Model


def build_textcnn(maxlen):
    main_input = Input(shape=(maxlen,), dtype='float64')
    _embed = Embedding(304, 256, input_length=maxlen)(main_input)
    _embed = SpatialDropout1D(0.25)(_embed)
    wrappers = []  # one pooled feature vector per (kernel size, dilation rate) branch
    num_filters = 64
    kernel_sizes = [2, 3, 4, 5]
    conv_action = 'relu'
    for _kernel_size in kernel_sizes:
        for dilation_rate in [1, 2, 3, 4]:
            conv1d = Conv1D(filters=num_filters, kernel_size=_kernel_size, activation=conv_action,
                            dilation_rate=dilation_rate)(_embed)
            wrappers.append(GlobalMaxPooling1D()(conv1d))

    fc = concatenate(wrappers)
    fc = Dropout(0.5)(fc)
    # fc = BatchNormalization()(fc)
    fc = Dense(256, activation='relu')(fc)
    fc = Dropout(0.25)(fc)
    # fc = BatchNormalization()(fc)
    preds = Dense(8, activation='softmax')(fc)

    model = Model(inputs=main_input, outputs=preds)

    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])
    return model
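
A usage sketch under the same assumptions (x and y from the preprocessing sketch above; batch size, epochs, and validation split are illustrative):

model = build_textcnn(maxlen=6000)
model.fit(x, y, batch_size=128, epochs=10, validation_split=0.1)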

CNN_LSTM model

To capture the sequence information of the samples, we use three LSTM-based models.

The first one is a simple CNN+LSTM; all CNN layers share the same kernel size.

from keras.layers import Input, Embedding, Conv1D, MaxPool1D, CuDNNLSTM, Dense
from keras.models import Model


def build_cnn_lstm(maxlen):
    main_input = Input(shape=(maxlen,), dtype='float64')
    embedder = Embedding(304, 256, input_length=maxlen)
    embed = embedder(main_input)
    # avg = GlobalAveragePooling1D()(embed)
    # CNN block 1, kernel_size = 3
    conv1_1 = Conv1D(64, 3, padding='same', activation='relu')(embed)
    conv1_2 = Conv1D(64, 3, padding='same', activation='relu')(conv1_1)
    cnn1 = MaxPool1D(pool_size=2)(conv1_2)
    # CNN block 2, same kernel size
    conv1_1 = Conv1D(64, 3, padding='same', activation='relu')(cnn1)
    conv1_2 = Conv1D(64, 3, padding='same', activation='relu')(conv1_1)
    cnn1 = MaxPool1D(pool_size=2)(conv1_2)
    # CNN block 3, same kernel size
    conv1_1 = Conv1D(64, 3, padding='same', activation='relu')(cnn1)
    conv1_2 = Conv1D(64, 3, padding='same', activation='relu')(conv1_1)
    cnn1 = MaxPool1D(pool_size=2)(conv1_2)
    rl = CuDNNLSTM(256)(cnn1)
    # flat = Flatten()(cnn3)
    # drop = Dropout(0.5)(flat)
    # fc = Dense(256)(rl)  # unused in the original: the output layer below connects to rl directly
    main_output = Dense(8, activation='softmax')(rl)
    model = Model(inputs=main_input, outputs=main_output)
    return model

Multi-vision LSTM

Inspired by our TextCNN method, we reasoned that if different kernel sizes give different "visions" of the sequence, we could build a multi-vision LSTM model. We first use three different kernel sizes (3, 5, 7). Instead of arranging them in linear order, we use each of them to extract features from the embedding layer independently. Between layers we use average pooling rather than max pooling, to keep information about the whole sequence instead of only one API call. After the CNN layers we have three feature maps of the same size, and we take the element-wise maximum
$$ \max_{\text{element-wise}}(v_1, v_2, v_3) $$
to keep the most significant feature across the different visions. Finally, an LSTM layer analyzes the resulting sequence and classifies it.

from keras.layers import Input, Embedding, Conv1D, AveragePooling1D, Maximum, CuDNNLSTM, Dense
from keras.models import Model


def build_multi_vision_lstm(maxlen):
    embed_size = 256
    num_filters = 64
    kernel_sizes = [3, 5, 7]
    main_input = Input(shape=(maxlen,))
    emb = Embedding(304, embed_size, input_length=maxlen)(main_input)
    # _embed = SpatialDropout1D(0.15)(emb)
    wrappers = []   # outputs of the first CNN stage, one per kernel size
    wrappers2 = []  # 0.42
    wrappers3 = []
    for _kernel_size in kernel_sizes:
        conv1d = Conv1D(filters=num_filters, kernel_size=_kernel_size, activation='relu', padding='same')(emb)
        wrappers.append(AveragePooling1D(2)(conv1d))
    for (_kernel_size, cnn) in zip(kernel_sizes, wrappers):
        conv1d_2 = Conv1D(filters=num_filters, kernel_size=_kernel_size, activation='relu', padding='same')(cnn)
        wrappers2.append(AveragePooling1D(2)(conv1d_2))
    for (_kernel_size, cnn) in zip(kernel_sizes, wrappers2):
        conv1d_2 = Conv1D(filters=num_filters, kernel_size=_kernel_size, activation='relu', padding='same')(cnn)
        wrappers3.append(AveragePooling1D(2)(conv1d_2))
    fc = Maximum()(wrappers3)  # element-wise max over the three "visions"
    rl = CuDNNLSTM(512)(fc)
    main_output = Dense(8, activation='softmax')(rl)
    model = Model(inputs=main_input, outputs=main_output)
    return model

TextCNN LSTM

This model is mostly the same as the multi-vision LSTM, but we concatenate the three feature maps instead of taking their maximum. We believe this does not disturb the order of the sequence, since within each branch the order of API calls is unchanged.

from keras.layers import (Input, Embedding, Conv1D, BatchNormalization, Activation,
                          MaxPool1D, CuDNNLSTM, Dense, concatenate)
from keras.models import Model


def build_textcnn_lstm(maxlen):
    main_input = Input(shape=(maxlen,), dtype='float64')
    embedder = Embedding(304, 256, input_length=maxlen)
    embed = embedder(main_input)
    # CNN block 1, kernel_size = 3
    conv1_1 = Conv1D(16, 3, padding='same')(embed)
    bn1_1 = BatchNormalization()(conv1_1)
    relu1_1 = Activation('relu')(bn1_1)
    conv1_2 = Conv1D(32, 3, padding='same')(relu1_1)
    bn1_2 = BatchNormalization()(conv1_2)
    relu1_2 = Activation('relu')(bn1_2)
    cnn1 = MaxPool1D(pool_size=4)(relu1_2)
    # CNN block 2, kernel_size = 4
    conv2_1 = Conv1D(16, 4, padding='same')(embed)
    bn2_1 = BatchNormalization()(conv2_1)
    relu2_1 = Activation('relu')(bn2_1)
    conv2_2 = Conv1D(32, 4, padding='same')(relu2_1)
    bn2_2 = BatchNormalization()(conv2_2)
    relu2_2 = Activation('relu')(bn2_2)
    cnn2 = MaxPool1D(pool_size=4)(relu2_2)
    # CNN block 3, kernel_size = 5
    conv3_1 = Conv1D(16, 5, padding='same')(embed)
    bn3_1 = BatchNormalization()(conv3_1)
    relu3_1 = Activation('relu')(bn3_1)
    conv3_2 = Conv1D(32, 5, padding='same')(relu3_1)
    bn3_2 = BatchNormalization()(conv3_2)
    relu3_2 = Activation('relu')(bn3_2)
    cnn3 = MaxPool1D(pool_size=4)(relu3_2)
    # concatenate the three blocks
    cnn = concatenate([cnn1, cnn2, cnn3], axis=-1)
    lstm = CuDNNLSTM(256)(cnn)
    # f = Flatten()(cnn1)                     # unused leftover branch in the original:
    # fc = Dense(256, activation='relu')(f)   # the output layer below connects to
    # D = Dropout(0.5)(fc)                    # lstm directly
    main_output = Dense(8, activation='softmax')(lstm)
    model = Model(inputs=main_input, outputs=main_output)
    return model

Stacking results

We stack the predictions of the models described above and train an XGBoost meta-classifier on them with 5-fold cross-validation.

import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold

# each *_result array holds one base model's class-probability predictions (n_samples x 8);
# outfiles (test file ids) and labels (training labels) are defined earlier in the script
train = np.hstack([tfidf_train_result, textcnn_train_result, mulitl_version_lstm_train_result, cnn_train_result,
                   textcnn_lstm_train_result])
test = np.hstack(
    [tfidf_out_result, textcnn_out_result, mulitl_version_lstm_test_result, cnn_out_result, textcnn_lstm_test_result])
meta_test = np.zeros(shape=(len(outfiles), 8))
skf = StratifiedKFold(n_splits=5, random_state=4, shuffle=True)
dout = xgb.DMatrix(test)
for i, (tr_ind, te_ind) in enumerate(skf.split(train, labels)):
    print('FOLD: {}'.format(str(i)))
    X_train, X_train_label = train[tr_ind], labels[tr_ind]
    X_val, X_val_label = train[te_ind], labels[te_ind]
    dtrain = xgb.DMatrix(X_train, label=X_train_label)
    dtest = xgb.DMatrix(X_val, X_val_label)  # the label is optional here; it is kept to report validation loss

    param = {'max_depth': 6, 'eta': 0.01, 'eval_metric': 'mlogloss', 'silent': 1, 'objective': 'multi:softprob',
             'num_class': 8, 'subsample': 0.9,
             'colsample_bytree': 0.85}  # parameters
    evallist = [(dtrain, 'train'), (dtest, 'val')]  # evaluation sets watched during training
    num_round = 10000  # maximum number of boosting rounds (early stopping cuts this short)
    bst = xgb.train(param, dtrain, num_round, evallist, early_stopping_rounds=100)
    preds = bst.predict(dout)
    meta_test += preds

# average the five folds' predictions on the test set
meta_test /= 5.0
result = meta_test
