  • Stars
    193
  • Rank 201,081 (Top 4 %)
  • Language
    Jupyter Notebook
  • License
    MIT License
  • Created over 6 years ago
  • Updated almost 6 years ago

Repository Details

Top 1% solution to the Toxic Comment Classification Challenge on Kaggle.

DeepToxic

This is part of the 27th-place solution to the Toxic Comment Classification Challenge. To keep things easy to follow, I only uploaded what I used in the final stage and did not include any experimental or deprecated code.

Dataset and External pretrained embeddings

You can fetch the dataset here. I used 3 kinds of word embeddings, which appear in the results tables below as Fasttext, Glove and Twitter.

Overview

Preprocessing

We trained our models on 3 datasets with different preprocessing:

  • original dataset with spelling correction: done by comparing Levenshtein distances and applying a large number of regular expressions.
  • original dataset with POS tagging: we generate the part-of-speech (POS) tags for every comment with TextBlob and concatenate the word embedding and the POS embedding into a single vector. Since TextBlob drops some tokens and punctuation when generating the POS sequences, this gives our models another view of the data.
  • Riad's dataset: with very heavy data cleaning, spelling correction and translation.
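The Levenshtein-based part of the spelling correction can be sketched roughly as follows. This is a minimal illustration in plain Python, not the code used in the competition: the tiny vocabulary, the `max_distance` threshold and the function names are all assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute all cost 1)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def correct_token(token, vocabulary, max_distance=1):
    """Replace token with its closest vocabulary word if it is close enough."""
    if token in vocabulary:
        return token
    best = min(vocabulary, key=lambda word: levenshtein(token, word))
    # Leave the token untouched when nothing in the vocabulary is close.
    return best if levenshtein(token, best) <= max_distance else token


# Hypothetical vocabulary; a real one would come from the embedding file.
vocabulary = ["stupid", "idiot", "article"]
corrected = [correct_token(t, vocabulary) for t in "you stupidd idiott".split()]
```

A small `max_distance` keeps the correction conservative, so rare but valid words are less likely to be rewritten into something else.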

Models

In our case, the simpler, the better. I tried some more complicated architectures (RHN, DPCNN, HAN). Most of them performed very well locally but got a lower AUC on the leaderboard. The two models I kept working on during the final stage are the following:

Pooled RNN (public: 0.9862, private: 0.9858) [figure: pooledRNN]

Kmax text CNN (public: 0.9856, private: 0.9849) [figure: kmaxCNN]
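The name "Kmax text CNN" suggests k-max pooling over the convolution outputs. Assuming that reading, the pooling step itself can be sketched in plain Python as below. The function name and the example values are illustrative, and the actual layer in the repository may differ (for example, some implementations return the k values sorted by magnitude instead of keeping their original order).

```python
def kmax_pooling(sequence, k):
    """Keep the k largest values of a 1-D feature map, in their original order."""
    if k >= len(sequence):
        return list(sequence)
    # Indices of the k largest activations, then restored to positional order.
    top = sorted(range(len(sequence)), key=lambda i: sequence[i], reverse=True)[:k]
    return [sequence[i] for i in sorted(top)]


# One feature map produced by a convolution filter over a comment.
feature_map = [0.1, 0.9, 0.3, 0.7, 0.2]
pooled = kmax_pooling(feature_map, k=2)
```

Compared with plain max pooling (k = 1), keeping several activations lets the classifier see more than one strong match per filter.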

As many competitors pointed out, dropout and batch normalization are the keys to preventing overfitting. Applying dropout directly on the word embeddings and after the pooling layer regularizes the model well, on both the training set and the test set. Although a model with many dropout layers takes about 5 more epochs to converge, it boosts our scores significantly. For instance, my RNN improved from 0.9853 (private: 0.9850) to 0.9862 (private: 0.9858) after adding dropout layers.
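For readers unfamiliar with the mechanism, dropout on an embedding vector can be pictured as follows. This is a generic inverted-dropout sketch in plain Python, not the author's model code: whether the original models used element-wise or spatial dropout is not stated here, and the function name and the `rate` value are assumptions.

```python
import random


def dropout(vector, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(vector)  # dropout is disabled at inference time
    keep = 1.0 - rate
    # Dividing by `keep` preserves the expected value of each component.
    return [x / keep if rng.random() < keep else 0.0 for x in vector]


rng = random.Random(42)
embedded_comment = [[0.2, -0.1, 0.5], [0.0, 0.3, -0.4]]  # two tokens, 3 dimensions
regularized = [dropout(token, rate=0.2, rng=rng) for token in embedded_comment]
```

Because the noise is resampled at every step, the network cannot rely on any single embedding dimension, which is one plausible reason training needs a few more epochs to converge.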

To get the most out of these datasets, besides training on the original labels, we also add a meta-label, "bad_comment". If a comment has any label, it is considered a bad comment. The hypotheses learned from these two label sets are slightly different but reach almost the same LB score, which leaves us room for ensembling.
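Deriving the meta-label is a one-line rule. The sketch below assumes the six label columns of the competition data (`toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`), which are not listed in this README, and represents each comment as a plain dictionary for simplicity.

```python
# Assumed label columns; adjust to match the actual training file.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def add_bad_comment(row):
    """Return a copy of row with bad_comment = 1 if any original label is set."""
    labeled = dict(row)
    labeled["bad_comment"] = int(any(row[name] for name in LABELS))
    return labeled


clean = dict.fromkeys(LABELS, 0)
insulting = {**clean, "insult": 1}
rows = [add_bad_comment(clean), add_bad_comment(insulting)]
```

A model trained with this extra target sees one more, coarser output than a model trained on the original labels only, which is what makes the two sets of predictions slightly different.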

In order to increase diversity and to deal with some toxic typos, we trained the models at both the char level and the word level. The char-level models perform a bit worse (charRNN: 0.983 on LB, 0.982 on PB; charCNN: 0.9808 on LB, 0.9801 on PB), but their predictions have a pretty low correlation with those of the word-level models. Simply bagging my char-level and word-level results is enough to push me over 0.9869 on the private test set. By the way, the hyperparameters influence performance hugely in the char-based models. A large batch size (256) and a very long sequence length (1000) usually give a considerable result, even though they make K-fold validation take much longer. (My char-based models usually converge after 60~70 epochs, which is about 5 times more than my word-based models.)
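The two ideas in this paragraph, checking how correlated two models are and then averaging them, can be sketched as below. This is an illustrative plain-Python version with made-up prediction values; the function names are assumptions, and the actual ensemble may have used different weights.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long prediction lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def blend(*prediction_lists):
    """Equal-weight average of several models' predictions, row by row."""
    return [sum(column) / len(column) for column in zip(*prediction_lists)]


# Hypothetical probabilities for the same four comments.
word_level = [0.91, 0.10, 0.75, 0.05]
char_level = [0.80, 0.20, 0.40, 0.30]

correlation = pearson(word_level, char_level)
bagged = blend(word_level, char_level)
```

The lower the correlation between two reasonably strong models, the more their errors tend to cancel out when averaged, which is why a weaker char-level model can still improve the blend.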

Performance of Single models

Scored by AUC on the private test set.

Word level

Model         Fasttext   Glove     Twitter
AVRNN         0.9858     0.9855    0.9843
Meta-AVRNN    0.9850     0.9849    No data
Pos-AVRNN     0.9850     No data   0.9841
AVCNN         0.9846     0.9845    0.9841
Meta-AVCNN    0.9844     0.9844    No data
Pos-AVCNN     0.9850     No data   No data
KmaxTextCNN   0.9849     0.9845    0.9835
TextCNN       0.9837     No data   No data
RCNN          0.9847     0.9842    0.9832
RHN           0.9842     No data   No data

Char level

Model     AUC
AVRNN     0.9821
KmaxCNN   0.9801
AVCNN     0.9797
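For reference, the AUC reported in the tables above can be computed as sketched below. The sketch assumes the competition metric is the mean of the per-label ROC AUCs, which this README does not spell out, and uses a simple pairwise definition that is fine for small examples but slow on large data.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def mean_columnwise_auc(label_columns, score_columns):
    """Average the AUC over all label columns (one column per toxicity type)."""
    aucs = [roc_auc(y, s) for y, s in zip(label_columns, score_columns)]
    return sum(aucs) / len(aucs)


# Two hypothetical label columns with four comments each.
y_true = [[0, 0, 1, 1], [1, 0, 0, 1]]
y_pred = [[0.1, 0.4, 0.35, 0.8], [0.9, 0.2, 0.1, 0.7]]
score = mean_columnwise_auc(y_true, y_pred)
```

Since AUC depends only on the ranking of the scores, differences in the fourth decimal place, as in the tables, correspond to a small number of comment pairs being ordered differently.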

More Repositories

1. Chatbot (Python, 898 stars): A context-aware chatbot based on vector matching.
2. word2vec-tutorial (Python, 517 stars): A tutorial on training Chinese word vectors.
3. Gossiping-Chinese-Corpus (Jupyter Notebook, 236 stars): A Chinese question-answering corpus from the PTT Gossiping board.
4. PTT-Chat-Generator (Python, 221 stars): A PTT comment generator.
5. CIKM-AnalytiCup-2018 (Python, 76 stars): [ACM-CIKM] 2nd place solution at CIKM AnalytiCup 2018, a task for determining short text similarities.
6. Sequence-to-Sequence-101 (Python, 71 stars): A series of tutorials on sequence to sequence learning, implemented with PyTorch.
7. WSDM-Cup-2019 (Jupyter Notebook, 64 stars): [ACM-WSDM] 3rd place solution at WSDM Cup 2019, Fake News Classification on Kaggle.
8. PTT-Crawler (Python, 19 stars): A web crawler specifically for the PTT website.
9. Line-Chatbot (Python, 18 stars): Rule-based Line chatbot demo, built with Django.
10. Fill-the-GAP (Jupyter Notebook, 12 stars): [ACL-WS] 4th place solution to the gendered pronoun resolution challenge on Kaggle.
11. Luis-LineBot (Python, 7 stars): A chatbot published on Line, using LUIS for intent classification.
12. NCKU-Online-Judge (JavaScript, 4 stars): Demonstration of an Online Judge System.
13. Kyara (4 stars): Enhancing Chinese Understanding for LLM with Knowledge-Augmented Fine-Tuning.
14. TensorFlow-Study-Notes (HTML, 3 stars)
15. HS-Chess (Java, 2 stars): 🎲 A 2D chess game with the HearthStone theme.
16. zake7749.github.io (HTML, 2 stars): Personal blog.
17. Fantasy-Invision (C#, 2 stars): 🚀 A simple vertically scrolling shoot 'em up game.
18. SVM-with-Shiny (R, 2 stars): An example of support vector machine and Shiny usage.
19. MNIST (Python, 2 stars): A Keras CNN autoencoder to solve the Kaggle Competition MNIST.
20. AboutMe (CSS, 1 star): My brief history.