Explore @bigscience-workshop Open Source projects

BigScience Workshop (@bigscience-workshop)

bigscience-workshop

Stars
16,274
Global Org. Rank 1,412 (Top 0.5 %)
Registered over 3 years ago
Most used languages

Python
60.9 %

Jupyter Notebook
21.7 %

TeX
4.3 %

HTML
4.3 %

Shell
4.3 %

Makefile 4.3 %

petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading

promptsource

Toolkit for creating, sharing and using natural language prompts.

Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

bigscience

Central place for the engineering/scaling WG: documentation, SLURM scripts and logs, compute environment and data.

xmtf

Crosslingual Generalization through Multitask Finetuning

Jupyter Notebook

t-zero

Reproduce results and replicate training fo T0 (Multitask Prompted Training Enables Zero-Shot Task Generalization)

biomedical

Tools for curating biomedical training data for large-scale language modeling

data-preparation

Code used for sourcing and cleaning the BigScience ROOTS corpus

Jupyter Notebook

lam

Libraries, Archives and Museums (LAM)

data_tooling

Tools for managing datasets for governance and training.

multilingual-modeling

BLOOM+1: Adapting BLOOM model to support a new unseen language

evaluation

Code and Data for Evaluation WG

data_sourcing

This directory gathers the tools developed by the Data Sourcing Working Group

metadata

Experiments on including metadata such as URLs, timestamps, website descriptions and HTML tags during pretraining.

model_card

tokenization

carbon-footprint

A repository for `codecarbon` logs.

Jupyter Notebook

bloom-dechonk

A repo for running model shrinking experiments

historical_texts

BigScience working group on language models for historical texts

Jupyter Notebook

catalogue_data

Scripts to prepare catalogue data

Jupyter Notebook

pii_processing

PII Processing code to detect and remediate PII in BigScience datasets. Reference implementation for the PII Hackathon

training_dynamics

bibliography

A list of BigScience publications

scaling-laws-tokenization

scaling-laws-tokenization

datasets_stats

Generate statistics over datasets used in the context of BS

evaluation-robustness-consistency

Tools for evaluating model robustness and consistency

interpretability-ideas

evaluation-results

Dump of results for bigscience.