  • Stars
    3,732
  • Rank 11,823 (Top 0.3 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created over 2 years ago
  • Updated over 1 year ago

Repository Details

Pretrained language model with 100B parameters

YaLM 100B

YaLM 100B is a GPT-like neural network for generating and processing text. It can be used freely by developers and researchers from all over the world.

The model leverages 100 billion parameters. It took 65 days to train the model on a cluster of 800 A100 graphics cards and 1.7 TB of online texts, books, and countless other sources in both English and Russian.

Training details and best practices for acceleration and stabilization can be found in our articles on Medium (English) and Habr (Russian).

We used DeepSpeed to train the model and drew inspiration from the Megatron-LM example. However, the code in this repo is not the code that was used to train the model. Rather, it is the stock example from the DeepSpeed repo with the minimal changes needed to run inference with our model.

Setup

Make sure you have 200GB of free disk space before downloading the weights. The model (its code is based on microsoft/DeepSpeedExamples/Megatron-LM-v1.1.5-ZeRO3) is meant to run on multiple GPUs with tensor parallelism. It was tested on 4 GPUs (A100 80GB) and 8 GPUs (V100 32GB), but it should work with other configurations that provide ≈200GB of GPU memory in total and divide the weight dimensions evenly (e.g. 16, 64, 128).
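
The memory figures above can be sanity-checked with a little arithmetic. The sketch below is only an illustration: it assumes the weights are stored in a 2-byte format such as fp16 (the README does not state the precision), it ignores activation memory, and the weight dimension used in the divisibility check is a hypothetical value, not one taken from the model config.

```python
# Rough sizing sketch for a 100B-parameter model.
PARAMS = 100e9        # 100 billion parameters (from the README)
BYTES_PER_PARAM = 2   # ASSUMPTION: half-precision weights


def weights_gb(params=PARAMS, bytes_per_param=BYTES_PER_PARAM):
    """Approximate size of the raw weights in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def fits(num_gpus, gb_per_gpu, required_gb=None):
    """True if the combined GPU memory covers the weights alone."""
    if required_gb is None:
        required_gb = weights_gb()
    return num_gpus * gb_per_gpu >= required_gb


def divides_evenly(weight_dim, num_gpus):
    """Tensor parallelism splits a weight dimension across GPUs,
    so the dimension has to be a multiple of the GPU count."""
    return weight_dim % num_gpus == 0


if __name__ == "__main__":
    HYPOTHETICAL_DIM = 10240  # illustrative value only
    print("weights:", weights_gb(), "GB")
    for gpus, mem in [(4, 80), (8, 32), (4, 32)]:
        print(gpus, "x", mem, "GB ->",
              "fits" if fits(gpus, mem) else "too small",
              "| even split:", divides_evenly(HYPOTHETICAL_DIM, gpus))
```

Under these assumptions, the two tested configurations (4 × 80GB and 8 × 32GB) both exceed the ≈200GB needed for the weights, while 4 × 32GB would not.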

Downloading checkpoint

  • Run bash download/download.sh to download model weights and vocabulary.
  • By default, weights will be downloaded to ./yalm100b_checkpoint/weights/, and vocabulary will be downloaded to ./yalm100b_checkpoint/vocab/.
  • As another option, you can clone our HF repo and pull the checkpoint.
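
Before starting the download, it may help to confirm that the target disk has the 200GB the Setup section asks for. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and not part of this repo.

```python
import shutil

REQUIRED_GB = 200  # free space the README asks for before downloading


def free_gb(path="."):
    """Free space on the filesystem that contains `path`, in decimal GB."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path=".", required_gb=REQUIRED_GB):
    """True if `path` has at least `required_gb` of free space."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    # Checks the current directory; pass the checkpoint directory's
    # parent instead if the weights will live on another disk.
    print(f"{free_gb():.1f} GB free, enough for the checkpoint: {has_room()}")
```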

Docker

  • We published an image on Docker Hub; it can be pulled with docker/pull.sh. It is compatible with A100 and V100.
  • Alternatively, you can build the Docker image from source using docker/build.sh (which simply builds the image from docker/Dockerfile).
  • To run the container, use docker/run.sh (volumes, name, and other parameters can be changed).

Usage

You can start with the following scripts:

  • examples/generate_interactive.sh: interactive generation from command line, the simplest way to try the model.
  • examples/generate_conditional_sampling.sh: conditional generation with a sampling strategy. Top-p is used by default; feel free to change the temperature or use top-k. Input is in JSON Lines format (example: examples/example_cond_input.json); output is the same JSON Lines with a generated-text field added to each line.
  • examples/generate_conditional_greedy.sh: same as the previous script, but generation is greedy. Suitable for few-shot problem solving.
  • examples/generate_unconditional.sh: unconditional generation. No input is used, output will be jsonlines.
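
The decoding strategies named above (temperature, top-k, top-p, and greedy) can be illustrated on a toy distribution. This is a standalone sketch in plain Python, not the code the example scripts run; all function names here are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; a lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise their probabilities."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}


def sample(dist, rng=random):
    """Draw one token id from a {token_id: probability} mapping."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def greedy(probs):
    """Greedy decoding: always pick the most likely token."""
    return max(range(len(probs)), key=lambda i: probs[i])


if __name__ == "__main__":
    probs = softmax([2.0, 1.0, 0.1, -1.0], temperature=0.8)
    print("greedy pick:", greedy(probs))
    print("top-k (k=2) candidates:", sorted(top_k_filter(probs, 2)))
    print("top-p (p=0.9) sample:", sample(top_p_filter(probs, 0.9)))
```

Greedy decoding is deterministic, which is why it suits few-shot tasks with a single expected answer, while top-p and top-k trade determinism for more varied text.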

License

The model is published under the Apache 2.0 license, which permits both research and commercial use. Megatron-LM is licensed under the Megatron-LM license.

Training details

Dataset composition

The dataset used to train YaLM-100B consists of the following parts (rough percentages, measured in tokens seen by the model):

  • 25% The Pile — open English dataset by Eleuther AI team

  • 75% Texts in Russian collected by our team (percentages of the whole dataset are given)

    • 49% Russian web pages from the Yandex Search index, filtered from ~100 TB down to ~1 TB using the following heuristics:

      1. LSH Deduplication — clusters of similar texts were truncated to just one text each
      2. Length filtration — too short or too long texts or texts with too few natural sentences were discarded.
      3. Entropy filtration — texts with too high or too low entropy were discarded
      4. Domain filtration — domains with repetitive texts (like online retail) were discarded
      5. Classifier filtration — dataset of good texts was collected in a manner similar to WebText from pages linked in tweets in Russian that have at least one reply. Then a classifier was trained to distinguish those good texts from random pages from the dataset. Texts from the original crawled dataset with low classifier scores were then discarded
    • 12% News from various sources from Yandex Search index

    • 10% Books from the dataset used in Russian Distributional Thesaurus

    • 3% Misc texts from the Taiga Dataset

    • 1.5% Dialogues from social media preprocessed in a manner similar to how Reddit is processed in The Pile

    • 0.5% Russian portion of Wikipedia
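
Two of the heuristics listed above, length filtration and entropy filtration, can be sketched in a few lines. This is only an illustration: the thresholds, the function names, and the choice of character-level Shannon entropy are assumptions, not the values or method the team used, and the deduplication, domain, and classifier steps are not covered.

```python
import math
from collections import Counter


def char_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def keep_text(text, min_len=200, max_len=100_000,
              min_entropy=3.0, max_entropy=6.0):
    """Return True if the text passes both filters.

    All four thresholds are HYPOTHETICAL placeholders.
    """
    if not (min_len <= len(text) <= max_len):
        return False  # length filtration: too short or too long
    entropy = char_entropy(text)
    return min_entropy <= entropy <= max_entropy  # entropy filtration


if __name__ == "__main__":
    samples = {
        "repetitive": "ab" * 500,
        "too short": "hello",
        "ordinary": "The quick brown fox jumps over the lazy dog. " * 10,
    }
    for name, text in samples.items():
        print(f"{name}: entropy={char_entropy(text):.2f}, keep={keep_text(text)}")
```

Very low entropy tends to flag repetitive text, and very high entropy tends to flag noise such as encoded or binary data, which is the usual motivation for discarding both extremes.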

Some subsets were traversed up to 3 times during the training.

Training process

The model was trained on a cluster of 800 A100 GPUs for ~65 days, during which it consumed 300B tokens. TensorBoard logs with the LR and ramp-up schedule, training metrics, and our "thermometers" are available on the HF page.
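
The figures above imply an average throughput and a per-subset token budget that are easy to work out. The sketch below uses only numbers quoted in this README (300B tokens, 800 GPUs, ~65 days, and the rough dataset shares); the helper names are hypothetical. Since the listed shares are rough, they add up to about 101% rather than exactly 100%.

```python
TOTAL_TOKENS = 300e9   # tokens consumed during training
NUM_GPUS = 800         # A100 GPUs in the cluster
DAYS = 65              # approximate training time
SECONDS_PER_DAY = 24 * 3600

# Rough shares of tokens seen by the model, as listed under Dataset composition.
SHARES = {
    "The Pile": 0.25,
    "Russian web pages": 0.49,
    "News": 0.12,
    "Books": 0.10,
    "Taiga misc texts": 0.03,
    "Social media dialogues": 0.015,
    "Russian Wikipedia": 0.005,
}


def cluster_tokens_per_second(total_tokens=TOTAL_TOKENS, days=DAYS):
    """Average number of tokens processed per second by the whole cluster."""
    return total_tokens / (days * SECONDS_PER_DAY)


def tokens_for_share(share, total_tokens=TOTAL_TOKENS):
    """Approximate number of tokens seen from a subset with the given share."""
    return share * total_tokens


if __name__ == "__main__":
    rate = cluster_tokens_per_second()
    print(f"cluster: {rate:,.0f} tokens/s, per GPU: {rate / NUM_GPUS:.1f} tokens/s")
    for name, share in SHARES.items():
        print(f"{name}: ~{tokens_for_share(share) / 1e9:.1f}B tokens")
```

This works out to roughly 53 thousand tokens per second across the cluster, or about 67 tokens per second per GPU, averaged over the whole run.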

More Repositories

  1. gixy (Python, 8,242 stars): Nginx configuration static analyzer
  2. odyssey (C, 3,190 stars): Scalable PostgreSQL connection pooler
  3. yandex-tank (Python, 2,439 stars): Load and performance benchmark tool
  4. YaFSDP (Python, 821 stars): YaFSDP: Yet another Fully Sharded Data Parallel
  5. rep (Jupyter Notebook, 687 stars): Machine Learning toolbox for Humans
  6. pgmigrate (Python, 623 stars): Simple tool to evolve PostgreSQL schema easily.
  7. faster-rnnlm (C++, 562 stars): Faster Recurrent Neural Network Language Modeling Toolkit with Noise Contrastive Estimation and Hierarchical Softmax
  8. tomita-parser (C, 495 stars)
  9. pandora (Go, 401 stars): A load generator in Go language
  10. porto (C++, 397 stars): Yet another Linux container management system
  11. reshadow (JavaScript, 363 stars): Markup and styles that feel right
  12. django_replicated (Python, 351 stars): Django DB router for stateful master-slave replication
  13. pire (C++, 330 stars): Perl Incompatible Regular Expressions library
  14. metrica-tag (TypeScript, 257 stars): The client library of the web analytics tool. It is in the top 5 by popularity worldwide.
  15. yatagan (Kotlin, 231 stars): Dependency Injection framework based on Google's Dagger2 API, optimized for fast builds and for managing large graphs with optional dependencies
  16. ozo (C++, 226 stars): OZO is a C++17 Boost.Asio based header-only library for asynchronous communication with PostgreSQL DBMS.
  17. mapsapi-codestyle (JavaScript, 213 stars): JavaScript and TypeScript Style Guide
  18. zero-downtime-migrations (Python, 188 stars): Apply Django migrations on PostgreSQL without long locks on tables
  19. audio-js (JavaScript, 186 stars): Audio player library for the browser
  20. alice-skills (Python, 179 stars): Code examples of skills for the voice assistant created at Yandex
  21. burp-molly-scanner (Java, 154 stars): Turn your Burp suite into headless active web application vulnerability scanner
  22. yandex-taxi-testsuite (Python, 144 stars): testsuite: microservices testing framework
  23. burp-molly-pack (Java, 137 stars): Security checks pack for Burp Suite
  24. yatool (C++, 136 stars): Yatool is a cross-platform distribution, building, testing, and debugging toolkit focused on monorepositories
  25. go-hasql (Go, 133 stars): Go library for accessing multi-host SQL database installations
  26. mapsapi-modules (JavaScript, 132 stars): Async modular system
  27. mapkit-android-demo (Kotlin, 121 stars): MapKit Android demo
  28. yoctodb (Java, 106 stars): A tiny embedded Java-engine for extremely fast partitioned immutable-after-construction databases
  29. scout (Kotlin, 104 stars): A fast and safe manual dependency injector for Kotlin and Android.
  30. handystats (C++, 97 stars): C++ library for collecting user-defined in-process runtime statistics with low overhead
  31. speechkitcloud (JavaScript, 89 stars): Speechkit Cloud examples and SDK
  32. fastops (C++, 88 stars): This small library enables acceleration of bulk calls of certain math functions on AVX and AVX2 hardware. Currently supported operations are exp, log, sigmoid and tanh. The library is designed with extensibility in mind.
  33. mapkit-ios-demo (Swift, 82 stars): MapKit iOS demo
  34. argon2 (C++, 80 stars): Implementation of argon2 (i, d, id) algorithms with CPU dispatching
  35. mapsapi-heatmap (JavaScript, 77 stars): Heatmap: Yandex.Maps API plugin for data visualization
  36. mms (C++, 76 stars): Memory-mapped storage library
  37. geo-reviews-dataset-2023 (76 stars)
  38. tcplanz (Python, 74 stars): TCPDump latency analyzer
  39. balancer (C, 72 stars): http balancer
  40. securitygym (Python, 72 stars)
  41. NwSMTP (Shell, 72 stars): Asynchronous SMTP proxy server
  42. smart (C, 67 stars): SMT-aware Real-time scheduler for Linux
  43. mysync (Go, 64 stars): MySync is mysql high-availability and cluster configuration tool.
  44. YandexDriver (63 stars): YandexDriver is a WebDriver implementation
  45. rtv (JavaScript, 62 stars): Remote TV control for developers
  46. tex-renderer (JavaScript, 61 stars): Microservice for rendering TeX formulas into images
  47. yandex_tracker_client (Python, 55 stars): Python client for working with Yandex.Tracker Api
  48. csp-tester (JavaScript, 55 stars): This extension helps web masters to test web application behaviour with Content Security Policy (CSP) ver. 1.0 implemented.
  49. mapsapi-examples (JavaScript, 52 stars): Examples of using the Yandex.Maps API
  50. ofd (Python, 49 stars): Implementation of the protocol for communication between cash registers (KKT) and the fiscal data operator (OFD)
  51. csp-reporter (Python, 44 stars): Content Security Policy logs parser
  52. reselector (JavaScript, 44 stars): Use React Components in css selectors
  53. CMICOT (C++, 42 stars): Efficient feature selection method based on Conditional Mutual Information.
  54. dpt (JavaScript, 41 stars): BEM-based prototyping framework for large projects
  55. pgcheck (Go, 37 stars): Tool for monitoring backend databases from PL/Proxy hosts and changing plproxy.get_cluster_partitions() output
  56. root-2015-tasks (Python, 34 stars): Yandex.Root 2015 contest data
  57. deaf (Java, 33 stars): Android App for Deaf
  58. datasync-js (JavaScript, 33 stars): DataSync API allows for structured data storage and synchronization in Web services and mobile applications.
  59. inet64_tcp (Erlang, 32 stars): Magic thing to make old Erlang stuff work in IPv6-only networks
  60. browser-extensions (JavaScript, 32 stars)
  61. mongoz (C++, 32 stars): An alternative implementation of MongoDB sharding server aimed at high availability
  62. mapsapi-pie-chart-clusterer (JavaScript, 31 stars): Yandex Maps Plugin: Pie Chart Clusterer
  63. mlcup (Jupyter Notebook, 30 stars): Official baseline solutions to Yandex Cup ML challenge
  64. webmaster.api (28 stars)
  65. mapsapi-polylabeler (JavaScript, 25 stars): Plugin for setting labels inside polygons
  66. ch-backup (Python, 25 stars): Backup tool for ClickHouse DBMS
  67. mapsapi-round-controls (JavaScript, 24 stars): Plugin for Yandex.Maps JS API: rounded map controls theme
  68. ch-tools (Python, 23 stars): ClickHouse administration and diagnostics tools
  69. pgconsul (Python, 22 stars): PgConsul is a tool for maintaining High-Availability Postgresql cluster configurations. It is responsible for cluster recovery in case of emergencies.
  70. rdsync (Go, 22 stars)
  71. dep_tregex (Python, 21 stars): Stanford Tregex-inspired language for rule-based dependency tree manipulation.
  72. tartifacts (JavaScript, 20 stars): 📦 Create artifacts for your assemblies
  73. mastermind (Python, 19 stars): Smart control for a big storage
  74. cggen (Swift, 19 stars): Tool for generating Core Graphics code from vector image files
  75. sdch_module (C++, 18 stars)
  76. YNDX000SB_kernel (C, 18 stars): Yandex.Phone kernel sources
  77. openvpn-python-plugin (C, 18 stars): Runs python3 interpreter inside OpenVPN process in a persistent manner to answer its plug-in calls.
  78. yandex-ecom-search (18 stars): Beta version of the developer documentation for working with the Yandex Search product feed
  79. temporal-over-ydb (Go, 17 stars)
  80. evgen (TypeScript, 16 stars): Code generation for event logging
  81. vgsl (Swift, 15 stars): Very Good Swift Library
  82. miniapp-example (TypeScript, 15 stars): Example application for brand new platform of MiniApps inside the Yandex App
  83. cluster_metrics (C++, 13 stars)
  84. yamail (C++, 13 stars): YMail General Purpose Library
  85. agglomerative_clustering (C++, 13 stars)
  86. minishard (Erlang, 12 stars): Lightweight sharding for distributed erlang applications
  87. jsx-directives (TypeScript, 12 stars): Directives for JSX
  88. mapsapi-ios (Objective-C, 11 stars): Allows to easily add Yandex.Maps to your existing iOS project using Yandex.Maps JavaScript API
  89. miniapp-example-backend (TypeScript, 10 stars): Backend for Miniapp Example App for brand new platform of MiniApps inside the Yandex App
  90. erater (Erlang, 10 stars): Generic embedded distributed request rate limiting service for erlang applications
  91. storytests-cli (TypeScript, 10 stars): Framework agnostic CLI Utility to generate test files from Storybook
  92. mapsapi-area (JavaScript, 10 stars): util.calculateArea: plugin for calculating geodesic features area.
  93. zest (TypeScript, 9 stars): Library for interacting with the backend
  94. opentsdb-flume (Java, 9 stars): Module for flume, allows to write incoming events directly to OpenTSDB.
  95. protoc-gen-crd (Go, 8 stars): Protobuf plugin for generating k8s CRD
  96. mediastorage-proxy (C++, 8 stars): Mediastorage-proxy is a HTTP proxy for mediastorage based on elliptics
  97. toolchain-registry (Python, 8 stars): Toolchain Registry is a central registry for maintaining and versioning toolchains used in Yandex.
  98. erateserver (Erlang, 7 stars): Distributed rate limiting service with HTTP interface
  99. domestic-roots-patch (7 stars): A patch that adds support for the Russian domestic root certificate to the Chromium browser.
  100. libmastermind (C++, 6 stars): Client library for mastermind