# Multipack Sampler
The Multipack sampler is designed for padding-free distributed training of large language models. It uses an approximate solution to the identical-machines scheduling problem to pack variable-length sequences into batches, maximizing batch-processing efficiency. On the OpenChat V1 training set, it achieves >99% of the theoretical maximum efficiency, while the interleaved sampler achieves only ~75%.
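To give an intuition for what the sampler optimizes, here is a minimal greedy packing sketch in plain Python. This is an illustration only, not the library's actual algorithm: the hypothetical `first_fit_decreasing` helper below packs sequences into length-capped batches, whereas Multipack additionally balances work across distributed ranks.

```python
# Illustrative sketch only: first-fit-decreasing packing of sequence
# lengths into batches capped at batch_max_length. The real Multipack
# sampler approximates identical-machines scheduling and also balances
# batches across ranks; this toy version just shows the packing idea.
from typing import List

def first_fit_decreasing(lengths: List[int], batch_max_length: int) -> List[List[int]]:
    """Pack dataset indices into batches so each batch's total token
    count stays within batch_max_length. Returns lists of indices."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    batches: List[List[int]] = []
    remaining: List[int] = []  # remaining capacity of each open batch
    for i in order:
        for b, cap in enumerate(remaining):
            if lengths[i] <= cap:  # fits in an existing batch
                batches[b].append(i)
                remaining[b] -= lengths[i]
                break
        else:  # no open batch has room; start a new one
            batches.append([i])
            remaining.append(batch_max_length - lengths[i])
    return batches
```

The less capacity left unused across the resulting batches, the higher the efficiency reported below.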
## Benchmark

Please refer to `test_multipack.ipynb`.
OpenChat V1 (`testdata.json`):

| Sampler     | Overall efficiency |
|-------------|--------------------|
| Multipack   | 0.9963896327548557 |
| Interleaved | 0.756684939066569  |
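"Overall efficiency" can be read as the fraction of the total token budget actually filled with real tokens. A minimal sketch of that interpretation (the exact definition used in the notebook is assumed, not confirmed):

```python
import numpy as np

def overall_efficiency(lengths: np.ndarray, batch_max_length: int, num_batches: int) -> float:
    """Fraction of the token budget (num_batches * batch_max_length)
    occupied by real tokens; 1.0 would mean zero wasted capacity."""
    return lengths.sum() / (num_batches * batch_max_length)
```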
## Usage

Compatible with the PyTorch `DataLoader`:
```python
import numpy as np
from torch.utils.data import DataLoader

# Module path assumed; adjust to wherever MultipackDistributedBatchSampler lives.
from multipack_sampler import MultipackDistributedBatchSampler

batch_max_len = 16 * 2048  # batch size * max context length

# Per-sample token lengths, computed once over the tokenized dataset
lengths = np.array([len(tokens) for tokens in data])

sampler = MultipackDistributedBatchSampler(
    batch_max_length=batch_max_len,
    lengths=lengths,
    seed=0,
)
dataloader = DataLoader(data, batch_sampler=sampler)
```
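In a distributed training loop, you would typically reshuffle the packing each epoch. A sketch, assuming the sampler exposes a `set_epoch` method as PyTorch distributed samplers conventionally do:

```python
for epoch in range(num_epochs):
    sampler.set_epoch(epoch)  # assumed API; re-seeds the packing per epoch
    for batch in dataloader:
        # Each batch holds a variable number of samples whose total token
        # count stays under batch_max_len, so no padding is required.
        train_step(batch)  # hypothetical training step
```

Note that batches vary in the number of sequences they contain, so a padding-free collate function should concatenate samples rather than pad them to a fixed shape.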
## License
MIT