  • Stars: 1,875
  • Rank: 23,814 (Top 0.5%)
  • Language: Python
  • Created over 6 years ago
  • Updated almost 3 years ago

Repository Details

Deep Learning model to analyze a large corpus of clear text passwords.

1.4 Billion Text Credentials Analysis (NLP)

Using deep learning and NLP to analyze a large corpus of clear text passwords.

Objectives:

  • Train a generative model.
  • Understand how people change their passwords over time: hello123 -> h@llo123 -> h@llo!23.

Disclaimer: for research purposes only.

Get the data

  • Download any Torrent client.
  • Here is a magnet link you can find on Reddit:
    • magnet:?xt=urn:btih:7ffbcd8cee06aba2ce6561688cf68ce2addca0a3&dn=BreachCompilation&tr=udp%3A%2F%2Ftracker.openbittorrent.com%3A80&tr=udp%3A%2F%2Ftracker.leechers-paradise.org%3A6969&tr=udp%3A%2F%2Ftracker.coppersurfer.tk%3A6969&tr=udp%3A%2F%2Fglotorrents.pw%3A6969&tr=udp%3A%2F%2Ftracker.opentrackr.org%3A1337
  • Checksum list is available here: checklist.chk
  • ./count_total.sh in BreachCompilation should display something like 1,400,553,870 rows.

Get started (processing + deep learning)

Process the data and run the first deep learning model:

# Install the Python dependencies first. A virtual environment is recommended:
# virtualenv -p python3 venv3; source venv3/bin/activate; pip install -r requirements.txt
# Remove "--max_num_files 100" to process the whole dataset (this takes a few hours and requires 50GB of free disk space).
./process_and_train.sh <BreachCompilation path>

Data (explanation)

INPUT:   BreachCompilation/
         BreachCompilation is organized as:

         - a/          - folder of emails starting with a
         - a/a         - file of emails starting with aa
         - a/b
         - a/d
         - ...
         - z/
         - ...
         - z/y
         - z/z
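The prefix layout above can be traversed with a short helper. This is a minimal sketch, not the repository's own code: the function name iter_breach_files is hypothetical, and its max_num_files parameter only mirrors the option name used in this README. The sketch assumes nothing beyond folders of files arranged by prefix.

```python
import os
from itertools import islice
from typing import Iterator, Optional


def iter_breach_files(root: str, max_num_files: Optional[int] = None) -> Iterator[str]:
    """Yield the path of every file under root in a deterministic order.

    Hypothetical helper: it relies only on the prefix layout described
    above (one folder per first character, files below it).
    """
    def walk() -> Iterator[str]:
        for dirpath, dirnames, filenames in os.walk(root):
            # Sorting dirnames in place makes os.walk visit folders in order.
            dirnames.sort()
            for name in sorted(filenames):
                yield os.path.join(dirpath, name)

    # islice(..., None) yields everything; an integer caps the number of files.
    return islice(walk(), max_num_files)
```

Calling iter_breach_files(path, max_num_files=100) would then visit at most 100 files, which is convenient for a quick trial run on a subset of the data.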

OUTPUT: - BreachCompilationAnalysis/edit-distance/1.csv
        - BreachCompilationAnalysis/edit-distance/2.csv
        - BreachCompilationAnalysis/edit-distance/3.csv
        [...]
        > cat 1.csv
            1 ||| samsung94 ||| samsung94@
            1 ||| 040384alexej ||| 040384alexey
            1 ||| HoiHalloDoeii14 ||| hoiHalloDoeii14
            1 ||| hoiHalloDoeii14 ||| hoiHalloDoeii13
            1 ||| hoiHalloDoeii13 ||| HoiHalloDoeii13
            1 ||| 8znachnuu ||| 7znachnuu
        EXPLANATION: edit-distance/ contains the password pairs grouped by edit distance.
        1.csv contains all pairs with edit distance = 1 (exactly one insertion, substitution or deletion).
        2.csv => edit distance = 2, and so on.

        - BreachCompilationAnalysis/reduce-passwords-on-similar-emails/99_per_user.json
        - BreachCompilationAnalysis/reduce-passwords-on-similar-emails/9j_per_user.json
        - BreachCompilationAnalysis/reduce-passwords-on-similar-emails/9a_per_user.json
        [...]
        > cat 96_per_user.json
        {
            "1.0": [
                {
                    "edit_distance": [
                        0,
                        1
                    ],
                    "email": "[email protected]",
                    "password": [
                        "090698d",
                        "090698D"
                    ]
                },
                {
                    "edit_distance": [
                        0,
                        1
                    ],
                    "email": "[email protected]",
                    "password": [
                        "5555555555q",
                        "5555555555Q"
                    ]
                }
        [...]
        EXPLANATION: reduce-passwords-on-similar-emails/ contains files named after the first 2 characters of
        the email address. For example, [email protected] will be located in 96_per_user.json.
        Each file lists all the passwords grouped by user and by edit distance.
        For example, [email protected] had 2 passwords: 090698d and 090698D. The edit distance between them is 1.
        The edit_distance and password arrays have the same length, which is why the edit_distance array starts with 0.
        Those files are useful to model how users change passwords over time.
        We can't recover which password came first, but a shortest Hamiltonian path algorithm is run
        to detect the most probable password ordering for a user. For example:
        hello => hello1 => hell@1 => hell@11 is the shortest path.
        We assume that users are lazy by nature and prefer to change as few characters as possible
        when they update a password.
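The two ideas above, edit distance and shortest Hamiltonian path ordering, can be illustrated with a small sketch. It is not the repository's implementation: the function names edit_distance and order_passwords are hypothetical, the distance is the standard Levenshtein dynamic program, and the ordering is found by brute force over all permutations, which is only practical for a handful of passwords per user.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def order_passwords(passwords: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Brute-force shortest Hamiltonian path over a user's passwords.

    Returns the ordering with the smallest total edit distance, plus the
    per-step distances with a leading 0 (same shape as the edit_distance
    array in the *_per_user.json files). Cost grows as n!, so this is an
    illustration for small inputs only.
    """
    if not passwords:
        return [], []
    best_order: List[str] = []
    best_steps: List[int] = []
    for candidate in permutations(passwords):
        steps = [0] + [edit_distance(x, y) for x, y in zip(candidate, candidate[1:])]
        if not best_order or sum(steps) < sum(best_steps):
            best_order, best_steps = list(candidate), steps
    return best_order, best_steps
```

Given the four passwords from the example in any order, order_passwords recovers the chain hello, hello1, hell@1, hell@11 or its reverse. Both directions have the same total cost, which is consistent with the remark that the first password cannot be recovered.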

Run the data processing alone:

python3 run_data_processing.py --breach_compilation_folder <BreachCompilation path> --output_folder ~/BreachCompilationAnalysis

If the dataset is too big for you, you can set max_num_files to something between 0 and 2000.

  • Make sure you have enough free memory (8GB should be enough).
  • It took 1h30m to run on an Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz (on a single thread).
  • Uncompressed output is around 45GB.
