A plug-in ImageNet DataLoader for PyTorch. Uses Tensorpack DataFlow's sequential loading to load fast even if you're using an HDD.
Install
Requirements:
-
sudo apt-get install build-essential libcap-dev
pip install python-prctl
pip install --upgrade git+https://github.com/tensorpack/dataflow.git@4ac75d6b0
-
LMDB:
pip install lmdb
-
TQDM:
pip install tqdm
-
OpenCV:
pip install opencv-python
(If you install OpenCV with conda, it is likely to be slow because it is compiled with OpenMP. If you're not sure your OpenCV is fast enough, check using the script here.)
-
Protobuf:
conda install protobuf
If you use pip's editable install, you can tune the speed of the DataLoader on your system by modifying this code.
git clone https://github.com/BayesWatch/sequential-imagenet-dataloader.git
cd sequential-imagenet-dataloader
pip install -e .
Or install directly:
pip install git+https://github.com/BayesWatch/sequential-imagenet-dataloader.git
Preprocessing
Before you can train anything, you have to run the preprocessing script preprocess_sequential.py to create the LMDB binary files. They are written to the directory you specify and take up 140 GB for train, plus more for val. Use the script with these arguments:
python preprocess_sequential.py <imagenet directory> <directory to save lmdb files>
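Since the training set alone needs around 140 GB, it may be worth checking free space on the target drive before starting the script. The following is a minimal sketch using only the standard library; the helper name `has_free_space` and its default threshold are illustrative and not part of this repository:

```python
import shutil

GB = 10**9  # decimal gigabyte; if "140G" means GiB, leave some extra margin


def has_free_space(directory, required_gb=140):
    """Return True if the filesystem holding `directory` has at least
    `required_gb` gigabytes free."""
    free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= required_gb * GB


# Example: check the current directory before writing the LMDB files there.
print("enough space for train LMDB:", has_free_space("."))
```

The default only covers the train file; the val file needs additional space on top of that.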
Usage
Wherever the DataLoader is defined in your PyTorch code, replace it with imagenet_seq.data.Loader, although you can't call it with exactly the same arguments. For example, this would be the substitution in the PyTorch ImageNet example:
import torchvision.transforms as transforms
# Alias chosen to match the name used below.
from imagenet_seq.data import Loader as ImagenetLoader

normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
imagenet_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
])
# Original code from the PyTorch ImageNet example, replaced by the loader below:
# train_dataset = datasets.ImageFolder(traindir, imagenet_transforms)
# train_loader = torch.utils.data.DataLoader(
#     train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
#     num_workers=args.workers, pin_memory=True, sampler=train_sampler)
train_loader = ImagenetLoader(args.data, 'train', imagenet_transforms,
                              batch_size=args.batch_size, num_workers=args.workers, shuffle=True)
For a complete example, see the example ImageNet training script provided.
Experiments
Running the example ImageNet script on a workstation with a single Titan X using 4 workers.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:17:00.0 Off |                  N/A |
| 52%   80C    P2    85W / 250W |   8837MiB / 12196MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
Comparing estimated hours to completion, each with 4 threads dedicated to workers loading the dataset:
- Shared SSD NFS:
  - Loading from folders of images: 85 hours
- HDD:
  - Loading from folders of images: 670 hours
  - Loading from LMDB as DataFlow: 70 hours
Estimated times neglect the validation set. The real completion time for the experiment using LMDB as DataFlow with the code in this repository was 75 hours.
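The README does not say how these estimates were computed. One plausible way to extrapolate from a measured per-batch time is sketched below. The helper `estimated_hours` is hypothetical, and its defaults (90 epochs, as in the PyTorch ImageNet example, and 1,281,167 training images, the ILSVRC-2012 train set size) are assumptions about the setup rather than values stated here:

```python
import math


def estimated_hours(seconds_per_batch, batch_size, epochs=90,
                    num_images=1_281_167):
    """Extrapolate total training time from a measured per-batch time.

    Validation time is ignored, as in the estimates listed above.
    """
    batches_per_epoch = math.ceil(num_images / batch_size)
    total_seconds = seconds_per_batch * batches_per_epoch * epochs
    return total_seconds / 3600.0


# Relative speed-up on the HDD, taken directly from the estimates above.
hdd_speedup = 670 / 70  # roughly 9.6x in favour of LMDB as DataFlow
```

The gap between the estimated 70 hours and the measured 75 hours is consistent with the estimate leaving out validation, though other overheads could contribute as well.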
To check that this still converges to the benchmark accuracy (the shuffling is only local so may not match), I ran the experiment until completion using this DataLoader. The final validation top-1 accuracy was 69.76% and the charts detailing this experiment can be found here.
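To illustrate what "local" shuffling means: reading a file sequentially yields samples in storage order, and a bounded buffer can only randomize the order within a window. The sketch below is a generic buffer-based shuffle in plain Python, not the code used by this repository or by Tensorpack; `local_shuffle` and `buffer_size` are illustrative names.

```python
import random


def local_shuffle(stream, buffer_size, seed=None):
    """Yield items from `stream` in a locally shuffled order.

    At most `buffer_size` items are held in memory. Once the buffer is full,
    each incoming item replaces a randomly chosen buffered item, which is
    emitted. An item can be delayed arbitrarily, but it can never be emitted
    more than `buffer_size - 1` positions earlier than its source position.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered item
        buffer[idx] = item       # and put the new item in its slot
    rng.shuffle(buffer)          # drain what is left in random order
    yield from buffer


shuffled = list(local_shuffle(range(20), buffer_size=4, seed=0))
```

With a buffer much smaller than the dataset, consecutive batches are drawn from nearby records, which is why results may not exactly match a globally shuffled run.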