NSFW Data Scraper
Note: use with caution - the dataset is noisy
Description
This is a set of scripts that automatically collects tens of thousands of images for the following (loosely defined) categories, to be used later for training an image classifier:
- porn: pornography images
- hentai: hentai images, but also includes pornographic drawings
- sexy: sexually explicit images, but not pornography. Think nude photos, Playboy, bikini, etc.
- neutral: safe for work neutral images of everyday things and people
- drawings: safe for work drawings (including anime)
Here is what each script (located under the scripts directory) does:
- 1_get_urls_.sh: iterates through the text files under scripts/source_urls, downloading URLs of images for each of the 5 categories above. The ripme application performs all the heavy lifting. The source URLs are mostly links to various subreddits, but could be any website that Ripme supports. Note: I already ran this script for you, and its outputs are located in the raw_data directory. No need to rerun unless you edit files under scripts/source_urls.
- 2_download_from_urls_.sh: downloads the actual images for the URLs found in the text files in the raw_data directory.
- 3_optional_download_drawings_.sh: (optional) downloads SFW anime images from the Danbooru2018 database.
- 4_optional_download_neutral_.sh: (optional) downloads SFW neutral images from the Caltech256 dataset.
- 5_create_train_.sh: creates the data/train directory and copies all *.jpg and *.jpeg files into it from raw_data. Also removes corrupted images.
- 6_create_test_.sh: creates the data/test directory and moves N=2000 random files for each class from data/train to data/test (change this number inside the script if you need a different train/test split). Alternatively, you can run it multiple times; each time it will move N images for each class from data/train to data/test.
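The train/test split described for 6_create_test_.sh can be sketched roughly as follows. This is an illustration only, not the repository's actual script: move_random_to_test is a hypothetical helper, and it assumes GNU coreutils and findutils (shuf -z, xargs -0 -r, mv -t), as found on Ubuntu-based images.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the repository's 6_create_test_.sh.
# move_random_to_test is a hypothetical helper; it assumes GNU tools
# (shuf -z, xargs -0 -r, mv -t).

# Usage: move_random_to_test TRAIN_DIR TEST_DIR N
# Moves up to N randomly chosen files of every class sub-directory
# from TRAIN_DIR/<class>/ to TEST_DIR/<class>/.
move_random_to_test() {
  local train_dir="$1" test_dir="$2" n="$3"
  local class_dir class
  for class_dir in "$train_dir"/*/; do
    [ -d "$class_dir" ] || continue   # skip when the glob matched nothing
    class="$(basename "$class_dir")"
    mkdir -p "$test_dir/$class"
    # List the class's files, shuffle them, keep at most N, and move them.
    find "$class_dir" -maxdepth 1 -type f -print0 \
      | shuf -z -n "$n" \
      | xargs -0 -r mv -t "$test_dir/$class"
  done
}
```

Under these assumptions, `move_random_to_test data/train data/test 2000` would mirror the N=2000 behavior described above, and calling it a second time would move another N files per class.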
Prerequisites
- Docker
How to collect data
$ docker build . -t docker_nsfw_data_scraper
Sending build context to Docker daemon 426.3MB
Step 1/3 : FROM ubuntu:18.04
---> 775349758637
Step 2/3 : RUN apt update && apt upgrade -y && apt install wget rsync imagemagick default-jre -y
---> Using cache
---> b2129908e7e2
Step 3/3 : ENTRYPOINT ["/bin/bash"]
---> Using cache
---> d32c5ae5235b
Successfully built d32c5ae5235b
Successfully tagged docker_nsfw_data_scraper:latest
$ # Next command might run for several hours. It is recommended to leave it overnight
$ docker run -v $(pwd):/root/nsfw_data_scraper docker_nsfw_data_scraper scripts/runall.sh
Getting images for class: neutral
...
...
$ ls data
test train
$ ls data/train/
drawings hentai neutral porn sexy
$ ls data/test/
drawings hentai neutral porn sexy
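Once the run finishes, a quick sanity check is to count the images per class. A minimal sketch, assuming the data/train and data/test layout shown in the listing above; count_images is a hypothetical helper and not part of the repository's scripts:

```shell
#!/usr/bin/env bash
# Sketch only: count_images is a hypothetical helper.

# Usage: count_images ROOT_DIR
# Prints "<class> <count>" for every class sub-directory of ROOT_DIR,
# counting only *.jpg and *.jpeg files (case-insensitive).
count_images() {
  local root="$1" class_dir count
  for class_dir in "$root"/*/; do
    [ -d "$class_dir" ] || continue   # skip when the glob matched nothing
    count="$(find "$class_dir" -type f \( -iname '*.jpg' -o -iname '*.jpeg' \) \
      | wc -l | tr -d ' ')"
    printf '%s %s\n' "$(basename "$class_dir")" "$count"
  done
}
```

For example, `count_images data/train` and `count_images data/test` would each print one line per class, which makes a badly unbalanced or empty class easy to spot before training.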
How to train a CNN model
- Install fastai: conda install -c pytorch -c fastai fastai
- Run train_model.ipynb top to bottom
Results
I was able to train a CNN classifier to 91% accuracy with the following confusion matrix:
As expected, drawings and hentai are confused with each other more frequently than with other classes. The same holds for the porn and sexy categories.
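For reference, overall accuracy relates to a confusion matrix as the sum of its diagonal divided by the sum of all its cells. A minimal sketch of that arithmetic, assuming the matrix is supplied as plain text with one row per actual class and whitespace-separated counts; accuracy_from_confusion and this input format are assumptions for illustration, not something the notebook produces:

```shell
#!/usr/bin/env bash
# Sketch only: accuracy_from_confusion is a hypothetical helper.

# Reads a square confusion matrix from stdin: row k holds the counts for
# actual class k, and column k the counts predicted as class k.
# Prints accuracy = (sum of the diagonal) / (sum of all cells).
accuracy_from_confusion() {
  awk '{
         for (i = 1; i <= NF; i++) {
           total += $i
           if (i == NR) correct += $i   # diagonal cell of this row
         }
       }
       END { if (total > 0) printf "%.4f\n", correct / total }'
}
```

It could be used as `accuracy_from_confusion < matrix.txt`, where matrix.txt is a hypothetical file holding the five-by-five counts exported from the notebook.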