  • Stars: 403
  • Rank: 107,140 (Top 3 %)
  • Language: Python
  • License: MIT License
  • Created about 6 years ago
  • Updated over 1 year ago

Repository Details

🗂 Split folders with files (e.g. images) into training, validation and test (dataset) folders

split-folders

Split folders with files (e.g. images) into train, validation and test (dataset) folders.

The input folder should have the following format:

input/
    class1/
        img1.jpg
        img2.jpg
        ...
    class2/
        imgWhatever.jpg
        ...
    ...

In order to give you this:

output/
    train/
        class1/
            img1.jpg
            ...
        class2/
            imga.jpg
            ...
    val/
        class1/
            img2.jpg
            ...
        class2/
            imgb.jpg
            ...
    test/
        class1/
            img3.jpg
            ...
        class2/
            imgc.jpg
            ...

This should get you started on some serious deep learning with your data. Read here why it's a good idea to split your data into three different sets.

  • Split files into a training set and a validation set (and optionally a test set).
  • Works with any file type.
  • The files get shuffled.
  • A seed makes splits reproducible.
  • Allows randomized oversampling for imbalanced datasets.
  • Optionally group files by prefix.
  • (Should) work on all operating systems.
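
The shuffling and seeding described in the list above can be sketched as follows. This is a minimal illustration, not the package's actual implementation: the function name `split_by_ratio` and the choice to sort before shuffling and to give the rounding remainder to the last set are assumptions made for this example.

```python
import random


def split_by_ratio(files, ratio=(0.8, 0.1, 0.1), seed=1337):
    """Shuffle `files` reproducibly and cut them into len(ratio) parts.

    Hypothetical helper for illustration; not part of splitfolders.
    """
    files = sorted(files)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(files)  # private RNG, global random state untouched

    parts, start = [], 0
    for i, share in enumerate(ratio):
        # The last part takes whatever remains, so no file is lost to rounding.
        is_last = i == len(ratio) - 1
        end = len(files) if is_last else start + int(share * len(files))
        parts.append(files[start:end])
        start = end
    return parts


names = [f"img{i}.jpg" for i in range(10)]
train, val, test = split_by_ratio(names)
print(len(train), len(val), len(test))
```

Calling the function twice with the same seed returns the same split, which is what makes a seeded split reproducible.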

Install

This package is Python only and there are no external dependencies.

pip install split-folders

Optionally, you may install tqdm to get a progress bar when moving files.

pip install split-folders[full]

Usage

You can use split-folders as a Python module or as a Command Line Interface (CLI).

If your dataset is balanced (each class has the same number of samples), choose ratio; otherwise, choose fixed. NB: oversampling is turned off by default. Oversampling is only applied to the train folder, since having duplicates in val or test would be considered cheating.
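
To make the difference concrete, here is a rough sketch of a fixed split followed by oversampling of the training set only. The helpers `split_fixed` and `oversample` are hypothetical names for this example, and padding each class with random duplicates up to the size of the largest class is an assumption about how randomized oversampling might work, not a description of the package's internals.

```python
import random


def split_fixed(files, n_val, n_test=0, seed=1337):
    """Reserve a fixed number of files for val/test; the rest becomes train."""
    files = sorted(files)
    random.Random(seed).shuffle(files)
    if n_val + n_test >= len(files):
        raise ValueError("not enough files left for the training set")
    val = files[:n_val]
    test = files[n_val:n_val + n_test]
    train = files[n_val + n_test:]
    return train, val, test


def oversample(train_by_class, seed=1337):
    """Pad every (non-empty) class with random duplicates up to the largest class."""
    rng = random.Random(seed)
    target = max(len(items) for items in train_by_class.values())
    return {
        name: items + [rng.choice(items) for _ in range(target - len(items))]
        for name, items in train_by_class.items()
    }


# Imbalanced toy data: 50 cats, 20 dogs; val and test get 5 files per class.
cats, _, _ = split_fixed([f"cat{i}.jpg" for i in range(50)], n_val=5, n_test=5)
dogs, _, _ = split_fixed([f"dog{i}.jpg" for i in range(20)], n_val=5, n_test=5)
balanced = oversample({"cat": cats, "dog": dogs})
print({name: len(items) for name, items in balanced.items()})
```

Only the training lists are passed to `oversample`, so val and test never contain duplicates.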

Module

import splitfolders

# Split with a ratio.
# To split into only a training and a validation set, pass a two-element tuple as `ratio`, e.g. `(.8, .2)`.
splitfolders.ratio("input_folder", output="output",
    seed=1337, ratio=(.8, .1, .1), group_prefix=None, move=False) # default values

# Split val/test with a fixed number of items, e.g. `(100, 100)`, for each set.
# To split into only a training and a validation set, pass a single number as `fixed`, e.g. `10`.
# Pass 3 values, e.g. `(300, 100, 100)`, to also limit the number of training items.
splitfolders.fixed("input_folder", output="output",
    seed=1337, fixed=(100, 100), oversample=False, group_prefix=None, move=False) # default values

Occasionally, a sample comprises more than a single file (e.g. a picture (.png) plus its annotation (.txt)). splitfolders lets you split files into equally-sized groups based on their prefix. Set group_prefix to the length of the group (e.g. 2). Note that every file must then belong to a group.
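
One way to picture the grouping is sketched below. It assumes that "prefix" means the file name without its extension; that reading, the helper name `group_by_stem` and the error on incomplete groups are all assumptions for this example rather than the package's documented behaviour.

```python
from collections import defaultdict
from pathlib import Path


def group_by_stem(files, group_size):
    """Bundle files that share a stem (name without extension) into tuples."""
    groups = defaultdict(list)
    for name in sorted(files):
        groups[Path(name).stem].append(name)

    # Every file has to be part of a complete group of the expected size.
    for stem, members in groups.items():
        if len(members) != group_size:
            raise ValueError(f"group {stem!r} has {len(members)} files, expected {group_size}")
    return [tuple(members) for members in groups.values()]


pairs = group_by_stem(["img1.png", "img1.txt", "img2.png", "img2.txt"], group_size=2)
print(pairs)
```

The resulting tuples can then be shuffled and split as units, so a picture and its annotation always end up in the same set.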

Set move=True if you want to move the files instead of copying.
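
The difference between copying and moving can be sketched with the standard library. The helper `place_files` is a hypothetical name for this example; it only shows how files could be placed into the output/<split>/<class>/ layout shown earlier and is not the package's own code.

```python
import shutil
from pathlib import Path


def place_files(files, output, split, class_name, move=False):
    """Copy (default) or move `files` into output/<split>/<class_name>/."""
    target = Path(output) / split / class_name
    target.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing

    # shutil.copy2 keeps the source file; shutil.move removes it from the input.
    transfer = shutil.move if move else shutil.copy2
    for name in files:
        transfer(str(name), str(target / Path(name).name))
    return target
```

Copying leaves the input folder intact at the cost of extra disk space, while moving empties the input folder as it goes.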

CLI

Usage:
    splitfolders [--output] [--ratio] [--fixed] [--seed] [--oversample] [--group_prefix] [--move] folder_with_images
Options:
    --output        path to the output folder. Defaults to `output`. Created if it does not exist.
    --ratio         the ratio to split. e.g. for train/val/test `.8 .1 .1 --` or for train/val `.8 .2 --`.
    --fixed         set the absolute number of items per validation/test set. The remaining items constitute
                    the training set. e.g. for train/val/test `100 100` or for train/val `100`.
                    Set 3 values, e.g. `300 100 100`, to limit the number of training values.
    --seed          set seed value for shuffling the items. defaults to 1337.
    --oversample    enable oversampling of imbalanced datasets, works only with --fixed.
    --group_prefix  split files into equally-sized groups based on their prefix
    --move          move the files instead of copying
Example:
    splitfolders --ratio .8 .1 .1 -- folder_with_images

Because of some Python quirks, you have to add -- after the --ratio values, i.e. before the folder name.
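
The quirk can be reproduced with a small stand-alone parser. This sketch assumes the CLI is built on Python's argparse with a variable number of ratio values; the parser below is a simplified stand-in written for this example, not the package's real one.

```python
import argparse

# Simplified stand-in parser: one option taking one or more floats,
# followed by a positional folder argument.
parser = argparse.ArgumentParser(prog="demo")
parser.add_argument("--ratio", nargs="+", type=float)
parser.add_argument("folder")

# With `--`, argparse knows where the ratio values end.
ok = parser.parse_args(["--ratio", ".8", ".1", ".1", "--", "folder_with_images"])
print(ok.ratio, ok.folder)

# Without `--`, the folder name is consumed as another ratio value and
# fails the float conversion, so argparse exits with a usage error.
try:
    parser.parse_args(["--ratio", ".8", ".1", ".1", "folder_with_images"])
    failed = False
except SystemExit:
    failed = True
print("parsing without -- failed:", failed)
```

The `--` marker is the conventional way to tell argparse that everything after it is a positional argument.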

Instead of the command splitfolders you can also use split_folders or split-folders.

Development

Install and use poetry.

Contributing

If you have a question, found a bug or want to propose a new feature, have a look at the issues page.

Pull requests are especially welcome when they fix bugs or improve the code quality.

License

MIT
