  • Stars: 103
  • Rank: 331,922 (Top 7%)
  • Language: Jupyter Notebook
  • License: MIT License
  • Created over 7 years ago
  • Updated almost 4 years ago

Repository Details

My solution to Kaggle Quora Question Pairs competition (Top 2%, Private LB log loss 0.13497).

kaggle-quora-question-pairs

Overview

The solution uses a mixture of purely statistical features, classical NLP features, and deep learning. Almost 200 handcrafted features are combined with out-of-fold predictions from 4 neural networks having different architectures.
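Out-of-fold predictions can be illustrated with a minimal, dependency-free sketch. It is not the repository's code: `stratified_folds`, `out_of_fold_predictions`, and `prior_model` are hypothetical names, and the "model" is a trivial stand-in for a neural network. The point is only that every sample receives a prediction from a model that was not trained on it, so the resulting column can safely be used as a feature for a second-level model.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Assign each sample to a fold so that class ratios stay similar across folds."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for position, idx in enumerate(indices):
            fold_of[idx] = position % n_splits
    return fold_of


def out_of_fold_predictions(labels, fit_predict, n_splits=5, seed=42):
    """Return one prediction per sample, made by a model that never saw that sample."""
    fold_of = stratified_folds(labels, n_splits, seed)
    oof = [None] * len(labels)
    for fold in range(n_splits):
        train_idx = [i for i, f in enumerate(fold_of) if f != fold]
        valid_idx = [i for i, f in enumerate(fold_of) if f == fold]
        preds = fit_predict(train_idx, valid_idx)
        for i, p in zip(valid_idx, preds):
            oof[i] = p
    return oof


def prior_model(labels):
    """Hypothetical stand-in for a real model: predicts the training positive rate."""
    def fit_predict(train_idx, valid_idx):
        rate = sum(labels[i] for i in train_idx) / len(train_idx)
        return [rate] * len(valid_idx)
    return fit_predict


# Toy duplicate / non-duplicate flags (30% positives), not competition data.
labels = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0] * 10
oof_feature = out_of_fold_predictions(labels, prior_model(labels))
```

In a real pipeline, `fit_predict` would train a network on the training indices and score the validation indices; the `oof_feature` column would then sit next to the handcrafted features.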

The final model is a GBM (LightGBM), trained with early stopping and a very small learning rate, using stratified K-fold cross validation.
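The idea behind early stopping on a validation log loss can be sketched in plain Python. This is an illustration under assumptions, not the repository's training loop: `log_loss` and `early_stopping_round` are hypothetical helpers, and LightGBM ships its own early-stopping mechanism that would be used in practice.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def early_stopping_round(valid_losses, patience):
    """Return the 0-based index of the best round.

    Scanning stops once `patience` consecutive rounds have passed
    without improving on the best validation loss seen so far.
    """
    best_round, best_loss = 0, float("inf")
    for round_idx, loss in enumerate(valid_losses):
        if loss < best_loss:
            best_round, best_loss = round_idx, loss
        elif round_idx - best_round >= patience:
            break
    return best_round
```

With a very small learning rate, the validation loss tends to improve slowly over many rounds, so the patience value (an assumption here) is usually set generously.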

[Figure: overall solution structure]

Reproducing the Solution

Hardware Requirements

Almost all of the code (except for some third-party scripts) makes efficient use of multi-core machines. Some notebooks can be memory-hungry, however. All code has been tested on a machine with 64 GB of RAM. For all non-neural notebooks, a c4.8xlarge AWS instance should work very well.

For neural networks, a GPU is highly recommended. On a GTX 1080 Ti, it takes about 8-9 hours to complete all 4 "neural" notebooks.

You'll need about 30 GB of free disk space to store the pre-trained word embeddings and the extracted features.
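Before starting a long run, the disk and CPU figures above can be checked with a short standard-library script. This is only a sketch: `check_hardware` is a hypothetical helper, the 30 GB threshold comes from the text above, and the minimum core count is an arbitrary assumption.

```python
import os
import shutil


def check_hardware(path=".", min_free_gb=30, min_cores=2):
    """Report CPU cores and free disk space against illustrative thresholds."""
    cores = os.cpu_count() or 1  # cpu_count() may return None
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    return {
        "cores": cores,
        "free_gb": round(free_gb, 1),
        "enough_cores": cores >= min_cores,   # min_cores is an assumption
        "enough_disk": free_gb >= min_free_gb,
    }


print(check_hardware())
```

Pass the directory that will hold the embeddings and extracted features as `path`, since free space is measured on the file system containing it.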

Software Requirements

  1. Python >= 3.6.
  2. LightGBM (compiled from sources).
  3. FastText (compiled from sources).
  4. Python packages from requirements.txt.
  5. (Recommended) NVIDIA CUDA and a GPU version of TensorFlow.
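A quick way to verify these requirements is a small standard-library check. It is a sketch under assumptions: `check_software` is a hypothetical helper, `lightgbm` and `tensorflow` are the usual import names of those packages, and the FastText executable is assumed to be called `fasttext` and to be on the `PATH`.

```python
import importlib.util
import shutil
import sys


def check_software(modules=("lightgbm", "tensorflow"), binaries=("fasttext",)):
    """Return a dict mapping each requirement to True or False."""
    report = {"python>=3.6": sys.version_info >= (3, 6)}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        report["module:" + name] = importlib.util.find_spec(name) is not None
    for name in binaries:
        # Assumes the compiled executable is reachable through PATH.
        report["binary:" + name] = shutil.which(name) is not None
    return report


for requirement, ok in check_software().items():
    print(requirement, "OK" if ok else "MISSING")
```

The check does not tell a CPU build of TensorFlow from a GPU build; it only confirms that the module can be found.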

Environment Provisioning

You can spin up a fresh Ubuntu 16.04 AWS instance and use Ansible to install and configure all the necessary software (except the GPU-related components).

  1. Make sure to open the ports 22 and 8888 on the target machine.
  2. Navigate to the provisioning directory.
  3. Edit config.yml:
    • jupyter_plaintext_password: the password to set for the Jupyter server on the target machine.
    • kaggle_username, kaggle_password: your Kaggle credentials (required to download the competition datasets). Alternatively, download the datasets to the data folder manually.
  4. Edit inventory.ini and specify your instance DNS and the private key file (*.pem) to access it.
  5. Run:
    $ ansible-galaxy install -r requirements.yml
    $ ansible-playbook playbook.yml -i inventory.ini
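
Whether ports 22 and 8888 are actually reachable on the target machine can be checked from your workstation with a few lines of Python. This is a hedged sketch: `port_is_open` is a hypothetical helper, and the host name in the usage comment is a placeholder for your own instance DNS.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or host could not be resolved
        return False


# Usage (placeholder host name, replace with your instance DNS):
# for port in (22, 8888):
#     print(port, port_is_open("your-instance.example.com", port))
```

Port 8888 only accepts connections once the Jupyter server is running, so a closed result before provisioning completes does not necessarily indicate a firewall problem.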
    

Running the Code

Automatic

Run run-all.sh from the repository root. Check notebooks/output for execution progress and data/submissions for the final results.

Manual

Start a Jupyter server in the notebooks directory. If you used the Ansible playbook, the server will already be running on port 8888.

Run the notebooks in the following order:

  1. Preprocessing.

    1) preproc-tokenize-spellcheck.ipynb
    2) preproc-extract-unique-questions.ipynb
    3) preproc-embeddings-fasttext.ipynb
    4) preproc-nn-sequences-fasttext.ipynb
    
  2. Feature extraction.

    Run all feature-*.ipynb notebooks in any order.

    Note: for faster execution, run all feature-oofp-nn-*.ipynb notebooks on a machine with a GPU and NVIDIA CUDA.

  3. Prediction.

    Run classify-lightgbm-cv-pred.ipynb. The output file will be saved as DATETIME-submission-draft-CVSCORE.csv
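
The manual order above can be turned into an execution plan with a short script. This is a sketch, not a description of what run-all.sh does: `execution_plan` and `nbconvert_command` are hypothetical helpers, and running notebooks through `jupyter nbconvert --execute` is an assumption about how you might automate the steps.

```python
from pathlib import Path

# Preprocessing notebooks, in the fixed order listed above.
PREPROC_ORDER = [
    "preproc-tokenize-spellcheck.ipynb",
    "preproc-extract-unique-questions.ipynb",
    "preproc-embeddings-fasttext.ipynb",
    "preproc-nn-sequences-fasttext.ipynb",
]
FINAL_NOTEBOOK = "classify-lightgbm-cv-pred.ipynb"


def execution_plan(notebooks_dir):
    """Return notebook names: preprocessing, then features, then prediction."""
    root = Path(notebooks_dir)
    # Feature notebooks may run in any order; sorting just makes runs repeatable.
    features = sorted(p.name for p in root.glob("feature-*.ipynb"))
    return PREPROC_ORDER + features + [FINAL_NOTEBOOK]


def nbconvert_command(notebook):
    """Build a command line that executes a notebook in place."""
    return ["jupyter", "nbconvert", "--to", "notebook",
            "--execute", "--inplace", notebook]


# Usage: print the commands instead of running them.
# for name in execution_plan("notebooks"):
#     print(" ".join(nbconvert_command(name)))
```

Each command could be passed to `subprocess.run(..., check=True, cwd="notebooks")` so that a failing notebook stops the sequence before later steps consume its missing output.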

More Repositories

  1. snake-ai-reinforcement (Python, 158 stars): AI for Snake game trained from pixels using Deep Reinforcement Learning (DQN).
  2. midichlorian (C#, 69 stars): A Visual Studio extension that allows you to write code and automate the IDE using MIDI musical instruments.
  3. dechorder (Python, 63 stars): Automatic chord recognition application powered by machine learning.
  4. syno-plex-update (Shell, 47 stars): Automatically check for Plex Media Server updates on Synology NAS and install them. Compatible with DSM 6 and DSM 7, including DSM 7.2.2+.
  5. regex-builder (C#, 40 stars): .NET library for human-readable declaration of regular expressions without having to remember the regex syntax. Looks similar to Expression Trees in .NET.
  6. odsc-target-leakage-workshop (Jupyter Notebook, 36 stars): Workshop on Target Leakage in Machine Learning I taught at ODSC Europe 2018 (London) and ODSC East 2019, 2020 (Boston).
  7. persistent-touch-id-sudo (C, 28 stars): Configures PAM on macOS via a Launch Daemon so that Touch ID for sudo is always available and persists across OS upgrades.
  8. unicode-virtual-keyboard (C#, 26 stars): Windows utility that simplifies the input of Unicode characters by displaying a handy on-demand virtual keyboard with powerful character search functionality and global hotkey support.
  9. thrones2vec (Jupyter Notebook, 26 stars): Using Word2Vec to explore semantic similarities between the entities of "A Song of Ice and Fire" ("Game of Thrones").
  10. azure-cloud-ocr (JavaScript, 17 stars): A simple cloud OCR application that employs Windows Azure Web and Worker Roles, Blobs, Tables, Queues, and uses Google Tesseract for text recognition.
  11. cartpole-q-learning (Python, 13 stars): A cart pole balancing agent powered by Q-Learning.
  12. lits-algorithms-course (TeX, 10 stars): Notes and handouts from the Algorithms course I taught at Lviv IT School.
  13. pygoose (Python, 6 stars): A Python package used as a utility tool belt for Kaggle competitions and other Data Science experiments.
  14. dou-topic-modeling (Jupyter Notebook, 5 stars): Analyzing the topic structure of DOU.ua comments using Latent Dirichlet Allocation (LDA).
  15. ansible-role-jupyter (5 stars): An Ansible role to install and configure Jupyter for Python 3.
  16. pythonscript-namebatch (Python, 4 stars): Generates spells to summon Benedict Cumberbatch.
  17. dunedynasty-macos (C, 4 stars): A fork of Dune Dynasty (http://dunedynasty.sourceforge.net/) that can be built and run on modern Macs, including Apple Silicon (M1).
  18. datarobot-mlbench (Python, 4 stars): Evaluation of the DataRobot platform on the mlbench benchmark [H. Zhang et al., 2017].
  19. ucu-nlp-workshop (Jupyter Notebook, 3 stars): Supplementary resources for the NLP Summer Workshop I taught at UCU.
  20. enex2csv (Python, 3 stars): Convert Evernote ENEX files to CSV, optionally converting note content to Markdown.
  21. hammurabi (Python, 3 stars): An online judge for algorithmic contests. Strict, but fair.
  22. intel-8080-asm (C++, 3 stars): A very simple Win32 assembler for Intel 8080 that produces COM binaries for CP/M. I built this during my 2nd university year as a replacement for the tool we had at our lab, which often failed to compile large programs and produced misleading error messages.
  23. gdg-speech-classifier (MATLAB, 2 stars): A machine learning system that recognizes the word 'Google' in human speech (demo for my talk @ Lviv GDG meetup).
  24. ucu-ai-checkers (Python, 2 stars): Checkers game AI development tools for the CS301 AI class I teach at UCU.
  25. winforms-auto-taborder-vsaddin (C#, 2 stars): Visual Studio add-in that adds automatic TabOrder arrangement feature to Windows Forms designer.
  26. r-exercises (R, 1 star): Programming exercises for R: http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic-research/reed/rexercises.pdf
  27. libcheckers (Python, 1 star): International checkers gameplay library for the CS301 AI course I teach at UCU.
  28. ansible-role-anaconda (1 star): An Ansible role to install Anaconda on Linux, along with additional conda packages of your choice.
  29. filesystem-monitor-service (C#, 1 star): Client-server application (WinForms client + NT Service + MS Access DB) for monitoring changes to a remote file system [university project, 2009].
  30. streamlit-blackout-stats (Python, 1 star): Streamlit app for visualizing power outage statistics. Uses Google Sheets as the data source.