botnet-detection

MIT License Paper

Topological botnet detection datasets and automatic detection with graph neural networks.

A collection of different botnet topologies overlaid onto normal background network traffic, containing featureless graphs of relatively large scale for inductive learning.

Installation

From source

git clone https://github.com/harvardnlp/botnet-detection
cd botnet-detection
python setup.py install

To Load the Botnet Data

We provide a standard, easy-to-use dataset class and data loaders that automatically handle dataset downloading and standard data splitting. They are compatible with most graph learning libraries through the graph_format argument:

from botdet.data.dataset_botnet import BotnetDataset
from botdet.data.dataloader import GraphDataLoader

botnet_dataset_train = BotnetDataset(name='chord', split='train', graph_format='pyg')
botnet_dataset_val = BotnetDataset(name='chord', split='val', graph_format='pyg')
botnet_dataset_test = BotnetDataset(name='chord', split='test', graph_format='pyg')

train_loader = GraphDataLoader(botnet_dataset_train, batch_size=2, shuffle=False, num_workers=0)
val_loader = GraphDataLoader(botnet_dataset_val, batch_size=1, shuffle=False, num_workers=0)
test_loader = GraphDataLoader(botnet_dataset_test, batch_size=1, shuffle=False, num_workers=0)

The choices for dataset name are (indicating different botnet topologies):

  • 'chord' (synthetic, 10k botnet nodes)
  • 'debru' (synthetic, 10k botnet nodes)
  • 'kadem' (synthetic, 10k botnet nodes)
  • 'leet' (synthetic, 10k botnet nodes)
  • 'c2' (real, ~3k botnet nodes)
  • 'p2p' (real, ~3k botnet nodes)

The choices for dataset graph_format correspond to the graph data formats of different graph libraries; the examples in this document use 'pyg', for PyTorch Geometric.

Depending on the value of graph_format, indexing the botnet dataset object returns the corresponding graph data object defined by the specified graph library.

The data loader handles automatic batching and is agnostic to the specific graph learning library.
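To make the batching step concrete, the sketch below merges several graphs into one disjoint graph by offsetting node indices, which is a common way to batch graphs of different sizes. It is an illustration only, not the GraphDataLoader implementation; the function name collate_graphs and the dict keys ('num_nodes', 'edges', 'y', 'batch') are assumptions made for this example.

```python
from typing import Dict, List


def collate_graphs(graphs: List[Dict]) -> Dict:
    """Merge graphs into one disjoint graph (hypothetical helper).

    Each input graph is a dict with assumed keys:
      'num_nodes': int, 'edges': list of (src, dst), 'y': list of node labels.
    """
    edges, labels, graph_index = [], [], []
    offset = 0
    for gid, g in enumerate(graphs):
        # Shift node ids so they stay unique inside the merged graph.
        edges.extend((s + offset, d + offset) for s, d in g["edges"])
        labels.extend(g["y"])
        # Remember which original graph every node came from.
        graph_index.extend([gid] * g["num_nodes"])
        offset += g["num_nodes"]
    return {"num_nodes": offset, "edges": edges, "y": labels, "batch": graph_index}


g1 = {"num_nodes": 3, "edges": [(0, 1), (1, 2)], "y": [0, 1, 1]}
g2 = {"num_nodes": 2, "edges": [(0, 1)], "y": [0, 0]}
batch = collate_graphs([g1, g2])
print(batch["edges"])  # [(0, 1), (1, 2), (3, 4)]
print(batch["batch"])  # [0, 0, 0, 1, 1]
```

Because the merged graph has no edges between the original graphs, message passing on the batch gives the same node outputs as running each graph separately.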

To Evaluate a Model Predictor

We prepare a standardized evaluator for easy evaluation and comparison of different models. First load the dataset class with BotnetDataset and the evaluation function eval_predictor. Then define a simple wrapper of your model as a predictor function (see examples), which takes in a graph from the dataset and returns the prediction probabilities for the positive class (as well as the loss from the forward pass, optionally).

We mainly use the average F1 score to compare across models. For example, to get evaluations on the chord test set:

from botdet.data.dataset_botnet import BotnetDataset
from botdet.eval.evaluation import eval_predictor
from botdet.eval.evaluation import PygModelPredictor

botnet_dataset_test = BotnetDataset(name='chord', split='test', graph_format='pyg')
predictor = PygModelPredictor(model)    # 'model' is some graph learning model
result_dict_avg, loss_avg = eval_predictor(botnet_dataset_test, predictor)

print(f'Testing --- loss: {loss_avg:.5f}')
print(' ' * 10 + ', '.join(['{}: {:.5f}'.format(k, v) for k, v in result_dict_avg.items()]))

test_f1 = result_dict_avg['f1']
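
For intuition about the metric, here is a minimal sketch of an average F1 score, assuming that "average" means the F1 of the positive (botnet) class is computed per graph and then averaged over graphs, and that probabilities are thresholded at 0.5. Both points are assumptions for illustration; the function names are hypothetical and eval_predictor may compute its results differently.

```python
from typing import List, Sequence


def f1_from_probs(probs: Sequence[float], labels: Sequence[int],
                  threshold: float = 0.5) -> float:
    """F1 of the positive class on a single graph."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    # Define F1 as 0 when there are no positives and no positive predictions.
    return 2 * tp / denom if denom else 0.0


def average_f1(all_probs: List[Sequence[float]],
               all_labels: List[Sequence[int]],
               threshold: float = 0.5) -> float:
    """Unweighted mean of the per-graph F1 scores."""
    scores = [f1_from_probs(p, y, threshold) for p, y in zip(all_probs, all_labels)]
    return sum(scores) / len(scores)


# Two toy graphs: per-node positive-class probabilities and true labels.
probs = [[0.9, 0.2, 0.8, 0.1], [0.7, 0.6, 0.3]]
labels = [[1, 0, 1, 0], [1, 0, 0]]
avg = average_f1(probs, labels)
print(f"average F1: {avg:.3f}")  # average F1: 0.833
```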

To Train a Graph Neural Network for Topological Botnet Detection

We provide a set of graph convolutional neural network (GNN) models here with PyTorch Geometric, along with the corresponding training script (note: the training pipeline was tested with PyTorch 1.2 and torch-scatter 1.3.1). Various basic GNN models can be constructed and tested by specifying configuration arguments:

  • number of layers, hidden size
  • node updating model each layer (e.g. direct message passing, MLP, gated edges, or graph attention)
  • message normalization
  • residual hops
  • final layer type
  • etc. (check the model API and the training script)
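
As a rough illustration of two of the options above, the sketch below runs one round of message passing with random walk normalization (each node averages over its neighbors, i.e. D^-1 A x) followed by a residual connection. It is plain Python on scalar node features, treats edges as undirected, and omits the learnable weights and nonlinearity a real layer would have. The function name random_walk_layer is hypothetical and is not part of the botdet model API.

```python
from typing import List, Tuple


def random_walk_layer(x: List[float], edges: List[Tuple[int, int]],
                      residual: bool = True) -> List[float]:
    """One propagation step x' = D^-1 A x, plus x when residual is True."""
    n = len(x)
    agg = [0.0] * n
    deg = [0] * n
    for s, d in edges:
        # Assumption for this sketch: edges are undirected, messages flow both ways.
        agg[d] += x[s]
        agg[s] += x[d]
        deg[s] += 1
        deg[d] += 1
    out = []
    for i in range(n):
        msg = agg[i] / deg[i] if deg[i] else 0.0  # isolated nodes receive nothing
        out.append(msg + x[i] if residual else msg)
    return out


x = [1.0, 0.0, 0.0, 2.0]
edges = [(0, 1), (1, 2), (2, 3)]  # a path graph 0-1-2-3
print(random_walk_layer(x, edges, residual=False))  # [0.0, 0.5, 1.0, 0.0]
print(random_walk_layer(x, edges, residual=True))   # [1.0, 0.5, 1.0, 2.0]
```

Dividing by the node degree keeps the aggregated message on the same scale regardless of how many neighbors a node has, and the residual term preserves each node's previous state across layers.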

As an example, to train a GNN model on the topological botnet datasets, simply run:

bash run_botnet.sh

With the above configuration, we run graph neural network models (with 12 layers, a hidden dimension of 32, random walk normalization, and residual connections) on each of the topologies. The results are as below:

Topology      Test F1 (%)
Chord         99.061
de Bruijn     99.926
Kademlia      98.935
LEET-Chord    99.231
C2            98.992
P2P           98.692
Average       99.140
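
The reported average is consistent with the unweighted mean of the six per-topology scores, which the short check below recomputes from the numbers above (the variable names are only for this example):

```python
# Test F1 (%) per topology, copied from the results above.
f1_by_topology = {
    "Chord": 99.061,
    "de Bruijn": 99.926,
    "Kademlia": 98.935,
    "LEET-Chord": 99.231,
    "C2": 98.992,
    "P2P": 98.692,
}

# Unweighted mean over the six topologies.
average = sum(f1_by_topology.values()) / len(f1_by_topology)
print(f"mean test F1: {average:.4f}")
```

The mean agrees with the reported 99.140 up to rounding in the last digit.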

Note

We also provide labels on the edges under the name edge_y, which can be used for the complete botnet community recovery task or for interpretation.
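
One way to read the community recovery task is sketched below: keep only the edges labeled as botnet edges and group their endpoints into connected components. This assumes that a label of 1 in edge_y marks an edge between botnet nodes and that edges are given as (src, dst) pairs; both are assumptions for illustration, and the function name botnet_communities is hypothetical.

```python
from typing import Dict, List, Tuple


def botnet_communities(edges: List[Tuple[int, int]],
                       edge_y: List[int]) -> List[List[int]]:
    """Connected components of the subgraph formed by positively labeled edges."""
    parent: Dict[int, int] = {}

    def find(u: int) -> int:
        # Union-find lookup with path halving.
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for (s, d), label in zip(edges, edge_y):
        if label == 1:  # assumed: 1 marks a botnet edge
            parent[find(s)] = find(d)

    groups: Dict[int, List[int]] = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    # Sort for a deterministic result.
    return sorted(sorted(g) for g in groups.values())


edges = [(0, 1), (1, 2), (2, 3), (4, 5), (5, 6)]
edge_y = [1, 1, 0, 1, 0]
print(botnet_communities(edges, edge_y))  # [[0, 1, 2], [4, 5]]
```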

Citing

@article{zhou2020auto,
  title={Automating Botnet Detection with Graph Neural Networks},
  author={Jiawei Zhou*, Zhiying Xu*, Alexander M. Rush, and Minlan Yu},
  journal={AutoML for Networking and Systems Workshop of MLSys 2020 Conference},
  year={2020}
}
