  • Stars: 150
  • Rank: 245,912 (Top 5 %)
  • Language: Python
  • License: MIT License
  • Created over 7 years ago
  • Updated about 1 year ago

Repository Details

Machine learning algorithms applied on log analysis to detect intrusions and suspicious activities.

🦅 Webhawk 2.0

🔴 IMPORTANT: The unsupervised Webhawk is now available as an independent project. Check it out at https://github.com/slrbl/unsupervised-learning-attack-detection-webhawk-catch

Machine Learning based web attacks detection.

About

Webhawk is an open source, machine learning powered web attack detection tool. It uses your web logs as training data. Webhawk offers a REST API that makes it easy to integrate into your SOC ecosystem. To train a detection model and use it as an extra security layer in your organization, follow the steps below.

Setup

Using a Python virtual env

python -m venv webhawk_venv
source webhawk_venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

Create a settings.conf file

Copy settings_template.conf to settings.conf and fill in the required parameters as follows.

[MODEL]
model:MODELS/the_model_you_will_train.pkl
[FEATURES]
features:length,params_number,return_code,size,upper_cases,lower_cases,special_chars,url_depth
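As a quick sanity check of the file format, the settings above can be read with Python's standard configparser module, which accepts ":" as a key/value delimiter by default. This is only a minimal sketch: the load_settings helper is hypothetical, and how Webhawk itself reads settings.conf is not shown in this README.

```python
import configparser

# Same content as the settings.conf example above.
SETTINGS_TEXT = """
[MODEL]
model:MODELS/the_model_you_will_train.pkl
[FEATURES]
features:length,params_number,return_code,size,upper_cases,lower_cases,special_chars,url_depth
"""


def load_settings(text):
    """Hypothetical helper: return (model_path, feature_names) from settings text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    model_path = parser["MODEL"]["model"]
    # The feature list is a single comma-separated value.
    raw_features = parser["FEATURES"]["features"]
    feature_names = [name.strip() for name in raw_features.split(",")]
    return model_path, feature_names


model_path, feature_names = load_settings(SETTINGS_TEXT)
print(model_path)
print(feature_names)
```

To check a real file instead of the inline text, parser.read("settings.conf") can replace the read_string call.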

Unsupervised detection Usage

Run the unsupervised detection script

Encoding is automatic in unsupervised mode; you only need to run the catch.py script. For example:

python catch.py -l ./SAMPLE_DATA/raw-http-logs-samples/aug_sep_oct_2021.log -t apache -j 10000 -v -e 5000 -s 5

Supervised detection Usage

Encode your HTTP logs and save the supervised detection results to a CSV file

python encode.py -a -l ./SAMPLE_DATA/raw-http-logs-samples/aug_sep_oct_2021.log -d ./SAMPLE_DATA/labeled-encoded-data-samples/aug_sep_oct_2021.csv

Please note that two already encoded data files are available in ./SAMPLE_DATA/labeled-encoded-data-samples/, in case you would like to move directly to the next step.
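To give a feel for what encoding means here, the sketch below turns one Apache access log line into the eight features named in settings.conf. The feature names come from this README, but every definition in the code (for example, counting "/" in the path for url_depth, or treating every non-alphanumeric URL character as special) is an assumption made for illustration. The regular expression and the encode_log_line function are hypothetical and may differ from what encode.py actually does.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Leading part of an Apache access log line:
# host ident user [time] "METHOD URL PROTOCOL" status size ...
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def encode_log_line(line):
    """Hypothetical encoder: map a log line to a dict of numeric features."""
    match = LOG_PATTERN.match(line)
    if match is None:
        raise ValueError("not a recognizable Apache access log line")
    url = match["url"]
    parts = urlsplit(url)
    size = match["size"]
    return {
        "length": len(url),  # assumed: length of the requested URL
        "params_number": len(parse_qsl(parts.query, keep_blank_values=True)),
        "return_code": int(match["status"]),
        "size": 0 if size == "-" else int(size),  # "-" means no body was sent
        "upper_cases": sum(c.isupper() for c in url),
        "lower_cases": sum(c.islower() for c in url),
        "special_chars": sum(not c.isalnum() for c in url),  # assumed definition
        "url_depth": parts.path.count("/"),  # assumed definition
    }


sample = ('198.72.227.213 - - [16/Dec/2018:00:39:22 -0800] '
          '"GET /self.logs/access.log.2016-07-20.gz HTTP/1.1" 404 340 '
          '"-" "python-requests/2.18.4"')
features = encode_log_line(sample)
print(features)
```

Whatever the real definitions are, the point is the same: each log line becomes a fixed-length numeric row that a classifier can consume.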

Train a model and test the prediction

Use the HTTP log data from May to July 2021 to train a model, and test it with the data from August to October 2021.

python train.py -a 'dt' -t ./SAMPLE_DATA/labeled-encoded-data-samples/may_jun_jul_2021.csv -v ./SAMPLE_DATA/labeled-encoded-data-samples/aug_sep_oct_2021.csv

Make a prediction for a single log line

python predict.py -m 'MODELS/the_model_you_will_train.pkl' -l '198.72.227.213 - - [16/Dec/2018:00:39:22 -0800] "GET /self.logs/access.log.2016-07-20.gz HTTP/1.1" 404 340 "-" "python-requests/2.18.4"'
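Because a log line contains spaces, brackets and double quotes, quoting it by hand on the command line is error-prone. The sketch below builds the same predict.py invocation as an argument list, which avoids shell quoting altogether. The -m and -l flags are the ones used in the command above; the build_predict_command helper is hypothetical.

```python
import shlex


def build_predict_command(model_path, log_line):
    """Hypothetical helper: argument list for predict.py, one element per argument."""
    return ["python", "predict.py", "-m", model_path, "-l", log_line]


log_line = ('198.72.227.213 - - [16/Dec/2018:00:39:22 -0800] '
            '"GET /self.logs/access.log.2016-07-20.gz HTTP/1.1" 404 340 '
            '"-" "python-requests/2.18.4"')
args = build_predict_command("MODELS/the_model_you_will_train.pkl", log_line)

# shlex.join produces a shell-safe string, useful for logging or copy/paste.
print(shlex.join(args))

# To actually run the prediction from the repository root:
#   import subprocess
#   subprocess.run(args, check=True)
```

Passing the list directly to subprocess.run keeps the log line intact as a single argument, whatever characters it contains.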

REST API

Launch the API server

To use the API, first launch its server as follows:

python -m uvicorn api:app --reload --host 0.0.0.0 --port 8000

Make a prediction request

You can use the following code, which is based on the Python 'requests' library (the same as in test_api.py), to make a prediction using the REST API:

import json

import requests

# Send and expect JSON.
headers = {
    'accept': 'application/json',
    'Content-Type': 'application/json',
}
# The log type and the raw log line to classify.
data = {
    'log_type': 'apache',
    'http_log_line': '187.167.57.27 - - [15/Dec/2018:03:48:45 -0800] "GET /honeypot/Honeypot%20-%20Howto.pdf HTTP/1.1" 200 1279418 "http://www.secrepo.com/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/61.0.3163.128 Safari/534.24 XiaoMi/MiuiBrowser/9.6.0-Beta"'
}
# POST the request to the locally running API server.
response = requests.post('http://127.0.0.1:8000/predict', headers=headers, data=json.dumps(data))
print(response.text)

It will return the following:

{"prediction":"0","confidence":"0.9975490196078431","log_line":"187.167.57.27 - - [15/Dec/2018:03:48:45 -0800] \"GET /honeypot/Honeypot%20-%20Howto.pdf HTTP/1.1\" 200 1279418 \"http://www.secrepo.com/\" \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/61.0.3163.128 Safari/534.24 XiaoMi/MiuiBrowser/9.6.0-Beta\""}
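Note that in the response above both prediction and confidence are returned as JSON strings, so a client will usually want to convert them before comparing or thresholding. The following is a minimal sketch of that conversion using the exact response shown above; the parse_prediction helper is hypothetical and assumes the three fields always have this shape.

```python
import json

# The response body shown above, kept verbatim (raw string preserves the \" escapes).
RESPONSE_TEXT = r'''{"prediction":"0","confidence":"0.9975490196078431","log_line":"187.167.57.27 - - [15/Dec/2018:03:48:45 -0800] \"GET /honeypot/Honeypot%20-%20Howto.pdf HTTP/1.1\" 200 1279418 \"http://www.secrepo.com/\" \"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/61.0.3163.128 Safari/534.24 XiaoMi/MiuiBrowser/9.6.0-Beta\""}'''


def parse_prediction(response_text):
    """Hypothetical helper: decode the response and convert string fields to numbers."""
    payload = json.loads(response_text)
    return {
        "prediction": int(payload["prediction"]),
        "confidence": float(payload["confidence"]),
        "log_line": payload["log_line"],
    }


result = parse_prediction(RESPONSE_TEXT)
print(result["prediction"], result["confidence"])
```

With the requests example above, the same helper could be called as parse_prediction(response.text).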

Using Docker

Launch the API server (with Docker)

To launch the prediction server using Docker:

docker compose build
docker compose up

Used sample data

The data in the SAMPLE_DATA folder comes from https://www.secrepo.com.

Interesting data samples

https://www.kaggle.com/datasets/eliasdabbas/web-server-access-logs
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/3QBYB5

Documentation

Details on how this tool is built can be found at http://enigmater.blogspot.fr/2017/03/intrusion-detection-based-on-supervised.html

TODO

Nothing for now.

Reference

Silhouette Efficiency
https://bioinformatics-training.github.io/intro-machine-learning-2017/clustering.html

Optimal Value of Epsilon
https://towardsdatascience.com/machine-learning-clustering-dbscan-determine-the-optimal-value-for-epsilon-eps-python-example-3100091cfbc

Max curvature point
https://towardsdatascience.com/detecting-knee-elbow-points-in-a-graph-d13fc517a63c

Contribution

All feedback, testing, and contributions are very welcome! If you would like to contribute, fork the project, add your contribution, and open a pull request.
