Discover EleutherAI/hae-rae Open Source project by @EleutherAI

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries

8,224

gpt-neox

A framework for few-shot evaluation of language models.

6,829

lm-evaluation-harness

The hub for EleutherAI's work on interpretability and learning dynamics

6,268

pythia

the-pile

math-lm

cookbook

Deep learning for dummies. All the practical details and useful utilities that go into working with real models.

Polyglot: Large Language Models of Well-balanced Competence in Multi-languages

635

polyglot

471

DALLE-mtf

OpenAI's DALL-E for large-scale training in mesh-tensorflow.

vqgan-clip

sae

concept-erasure

Erasing concepts from neural representations with provable guarantees

Keeping language models honest by directly eliciting knowledge encoded in their activations.

207

elk

OSLO: Open Source for Large-scale Optimization

186

oslo

lm_perplexity

knowledge-neurons

A library for finding knowledge neurons in pretrained transformer models.

Python Research Framework

142

pyfra

Data processing system for polyglot

107

dps

openwebtext2

(Deprecated) A hub for onboarding & other information.

info

improved-t5

Experiments toward training a new and improved T5

Python tools for processing the Stack Exchange data dumps into a text dataset for language models

stackexchange-dataset

See the issue board for the current status of active and prospective projects!

project-menu

magiCARP

One-stop shop for all things CARP

sae-auto-interp

semantic-memorization

Using queues, tqdm-multiprocess supports multiple worker processes, each with multiple tqdm progress bars, and displays them cleanly through the main process. It offers similar functionality for Python logging.

tqdm-multiprocess

aria

Engineering the state of RNN language models (Mamba, RWKV, etc.)

rnngineering

Understanding how features learned by neural networks evolve throughout training

features-across-time

Massively-Parallel Natural Extension of Reference Frame

mp_nerf

Investigating the generalization behavior of LM probes trained to predict truth labels: (1) from one annotator to another, and (2) from easy questions to hard

elk-generalization

A script for collecting the PubMed Central dataset in a language modelling friendly format.

pile-pubmedcentral

URL downloader supporting checkpointing and continuous checksumming.

best-download

polyglot-data

Efficient and robust implementation of seq-to-seq automatic piano transcription.

aria-amt

Web app for demoing the EAI models

text-generation-testing-ui

exploring-contrastive-topology

Minimum Description Length probing for neural network representations

mdl

pile_dedupe

Pile Deduplication Code

w2s

pilev2

Experiments with distilling large language models.

distilling

Efficiently computing & storing token n-grams from large corpora

tokengrams

Rust

lm-eval2

A framework for implementing equivariant DL

equivariance

Adapting the "Radioactive Data" paper to work for text models

radioactive-lab

Download, parse, and filter data from Literotica, ready for inclusion in The Pile.

pile-literotica

hn-scraper

Part-of-Speech Tagging for the Pile and RedPajama

tagged-pile

multimodal-fid

A script for collecting the USPTO Backgrounds dataset in a language modelling friendly format.

pile-uspto

The code used to filter CC data for The Pile

pile-cc-filtering

Baseline agents for Minetest tasks.

minetest-baselines

Data collection pipeline for CodeCARP. Includes PyCharm plugins.

CodeCARP

pile-enron-emails

A script for collecting the Enron Emails dataset in a language modelling friendly format.

For exploring the data and documenting its limitations

pile-explorer

Jupyter notebook for the interpretability section of the minetester blog post

minetest-interpretabilty-notebook

This is the Hugo-generated website for eleuther.ai. The source of this build is the new-website repo.

thonkenizers

yes

eleutherai.github.io

Visually ground GPT-Neo 1.3B and 2.7B

visual-grounding

Project GitHub for the LLM Markov Chains project

LLM-Markov-Chains

architecture-experiments

Repository to host architecture experiments and development using Paxml and Praxis

Sample explorer tool for the Llemma models.

llemma-sample-explorer

lm-scope

latent-video-diffusion

Latent video diffusion

megatron-3d

New website for EleutherAI based on Hugo static site generator

website

Project repo for the Unpaired Image Generation project

Unpaired-Image-Generation

ccs

isaac-mchorse

EleutherAI's Discord bot

Scraper to gather poems from allpoetry.com

pile-allpoetry

A replication of "EvilModel 2.0: Bringing Neural Network Models into Malware Attacks"

EvilModel

eai-prompt-gallery

Library of interesting prompt generations

Studying the variance in neural net predictions across training time

variance-across-time

A script for collecting the Ubuntu IRC dataset in a language modelling friendly format.

pile-ubuntu-irc

reddit-comment-processing

A large instruct dataset for open-source models (WIP).

eleutherai-instruct-dataset

bucket-cleaner

A small utility to clear out old model checkpoints in Google Cloud Buckets whilst keeping tensorboard event files

groupoid-rl

Equinox implementation of Llama 3 and Llama 3.1

equinox-llama

Adds GaLore-style projection wrappers to optax optimizers

optax-galore

Filter text files or archives by language

lang-filter

Generated content for the EleutherAI blog. The source is the new-website repo.

eleuther-blog

prefix-free-tokenizer

A prefix-free tokenizer

Search and filter through alignment literature

alignment-reader

grouch

Central location for access to pretrained models for CLIP and variants, with a common API and an out-of-the-box differentiable weighted multi-perceptor

language-adaptation

perceptors

pd-books

classifier-latent-diffusion

common-llm-settings

Common LLM Settings App

Exactly what it says on the tin

bayesian-adam

A script for collecting the CORD-19 dataset in a language modelling friendly format.

pile-cord19

Applying LEACE to models during training

conceptual-constraints

ngrams-across-time

steering-llama3

Method-of-moments estimation and sampling for truncated multivariate Gaussian distributions

truncated-gaussian