evanmiller/LLM-Reading-List



LLM papers I'm reading, mostly on inference and model compression

Just helping myself keep track of LLM papers that I'm reading, with an emphasis on inference and model compression.

Transformer Architectures

Foundation Models

Position Encoding

KV Cache

Activation

Pruning

Optimal Brain Damage (1990)
Optimal Brain Surgeon (1993)
Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning (Jan. 2023) - Introduces Optimal Brain Quantization based on the Optimal Brain Surgeon
Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
A Simple and Effective Pruning Approach for Large Language Models - Introduces Wanda (pruning with Weights and Activations)
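
The Wanda entry above scores each weight by its magnitude combined with the size of the input activation it multiplies. Below is a minimal pure-Python sketch of that scoring idea, assuming a small dense weight matrix and a handful of calibration activations. The function name `wanda_prune` and the toy data are hypothetical, and this is an illustration rather than the paper's reference implementation.

```python
import math

def wanda_prune(weights, activations, sparsity):
    """Zero the lowest-scoring weights in each output row.

    weights:     list of rows, shape [out_features][in_features]
    activations: list of samples, shape [n_samples][in_features]
    sparsity:    fraction of weights to remove per row (0.0 to 1.0)
    """
    n_in = len(weights[0])
    # L2 norm of each input feature across the calibration samples.
    feat_norm = [
        math.sqrt(sum(sample[j] ** 2 for sample in activations))
        for j in range(n_in)
    ]
    n_drop = int(n_in * sparsity)
    pruned = []
    for row in weights:
        # Score = |weight| * norm of the input feature it multiplies.
        scores = [abs(w) * feat_norm[j] for j, w in enumerate(row)]
        # Indices of the n_drop smallest scores in this row.
        drop = set(sorted(range(n_in), key=lambda j: scores[j])[:n_drop])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned

# Hypothetical toy data: feature 1 is loud, feature 2 is nearly silent.
W = [[0.5, -0.1, 2.0, 0.3],
     [1.0, 0.2, -0.4, 0.05]]
X = [[1.0, 10.0, 0.1, 1.0],
     [1.0, 12.0, 0.2, 1.0]]
print(wanda_prune(W, X, sparsity=0.5))
```

In this toy case the largest weight in the first row (2.0) is removed because its input feature is nearly silent, while the small weight -0.1 survives because its input is large. Plain magnitude pruning would make the opposite choice.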

Quantization

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale - Quantization with outlier handling. Might be solving the wrong problem - see "Quantizable Transformers" below.
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - Another approach to quantization with outliers
Up or Down? Adaptive Rounding for Post-Training Quantization (Qualcomm 2020) - Introduces AdaRound
Understanding and Overcoming the Challenges of Efficient Transformer Quantization (Qualcomm 2021)
QuIP: 2-Bit Quantization of Large Language Models With Guarantees (Cornell Jul. 2023) - Introduces incoherence processing
SqueezeLLM: Dense-and-Sparse Quantization (Berkeley Jun. 2023)
Intriguing Properties of Quantization at Scale (Cohere May 2023)
Pruning vs Quantization: Which is Better? (Qualcomm Jul. 2023)
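
Several entries above deal with outliers during quantization. As a rough illustration of why outliers matter, here is a sketch of plain symmetric round-to-nearest quantization with a single shared scale. This is a generic textbook scheme, not the method of any paper listed here, and the function names and sample values are hypothetical.

```python
def quantize_absmax(values, bits=8):
    """Symmetric round-to-nearest quantization with one shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    # Fall back to a scale of 1.0 if every value is zero.
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate real values."""
    return [x * scale for x in q]

def max_abs_error(values, bits=8):
    """Worst-case round-trip error over the given values."""
    q, scale = quantize_absmax(values, bits)
    restored = dequantize(q, scale)
    return max(abs(a - b) for a, b in zip(values, restored))

normal = [0.12, -0.53, 0.98, -0.07]
with_outlier = normal + [60.0]

print(max_abs_error(normal))        # scale is set by 0.98
print(max_abs_error(with_outlier))  # scale is set by 60.0
```

A single large value stretches the shared scale, so the ordinary values collapse onto only a few integer levels and lose precision. Outlier-aware schemes try to avoid exactly this.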

Normalization

Root Mean Square Layer Normalization
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing - Introduces gated attention and argues that outliers are a consequence of normalization
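
For reference, root mean square normalization divides a vector by its RMS and applies a learned per-element gain, without subtracting the mean. Here is a minimal sketch, assuming a plain Python list as input. The `eps` term and its placement inside the square root are an implementation assumption for numerical safety.

```python
import math

def rms_norm(x, gain, eps=1e-8):
    """Scale x to unit root-mean-square, then apply a per-element gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    rms = math.sqrt(mean_sq + eps)
    return [g * v / rms for v, g in zip(x, gain)]

x = [3.0, 4.0]
y = rms_norm(x, gain=[1.0, 1.0])
print(y)
# With unit gain, the output has an RMS very close to 1.
print(math.sqrt(sum(v * v for v in y) / len(y)))
```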

Sparsity and rank compression

Compressing Pre-trained Language Models by Decomposition - Vanilla SVD decomposition to reduce matrix sizes
Language model compression with weighted low-rank factorization - Fisher information-weighted SVD
Numerical Optimizations for Weighted Low-rank Estimation on Language Model - Iterative implementation for the above
Weighted Low-Rank Approximation (2003)
Transformers learn through gradual rank increase
Pixelated Butterfly: Simple and Efficient Sparse training for Neural Network Models
Scatterbrain: Unifying Sparse and Low-rank Attention Approximation
LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
LadaBERT: Lightweight Adaptation of BERT through Hybrid Model Compression
KroneckerBERT: Learning Kronecker Decomposition for Pre-trained Language Models via Knowledge Distillation
TRP: Trained Rank Pruning for Efficient Deep Neural Networks - Introduces energy-pruning ratio
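
The factorization entries above replace an m x n weight matrix with two factors of shapes m x r and r x n. A short worked calculation shows when that saves parameters. The helper names and the 4096 x 4096 example size are hypothetical, chosen only for illustration.

```python
def dense_params(m, n):
    """Parameter count of a full m x n matrix."""
    return m * n

def low_rank_params(m, n, r):
    """Parameter count of an (m x r) factor times an (r x n) factor."""
    return r * (m + n)

def break_even_rank(m, n):
    """Largest rank r for which r * (m + n) is strictly less than m * n."""
    return (m * n - 1) // (m + n)

m, n, r = 4096, 4096, 256
print(dense_params(m, n))                              # 16777216
print(low_rank_params(m, n, r))                        # 2097152
print(low_rank_params(m, n, r) / dense_params(m, n))   # 0.125
print(break_even_rank(m, n))                           # 2047
```

For a square matrix the factorization only pays off below half the full rank, so the useful savings come from ranks well under that threshold.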

Fine-tuning

Sampling

Scaling

Efficiently Scaling Transformer Inference (Google Nov. 2022) - Pipeline and tensor parallelization for inference
Megatron-LM (Nvidia Mar. 2020) - Intra-layer parallelism for training

Mixture of Experts

Adaptive Mixtures of Local Experts (1991, remastered PDF)
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (Google 2017)
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Google 2022)
Go Wider Instead of Deeper
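
As a rough sketch of sparse gating in the spirit of the sparsely-gated layer listed above, the function below keeps the k highest router scores and applies a softmax over only those. It omits the noise term and any load-balancing loss, and the function name and sample logits are hypothetical.

```python
import math

def top_k_gate(logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts."""
    # Expert indices sorted by router score, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax restricted to the selected experts (max-shifted for stability).
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

router_logits = [1.0, 3.0, 2.0, 0.5]   # one score per expert
gates = top_k_gate(router_logits, k=2)
print(gates)  # only experts 1 and 2 receive nonzero weight
```

Only the selected experts need to run for a given token, which is how parameter count can grow without a matching growth in compute per token.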

Watermarking

A Watermark for Large Language Models

More

Efficient Deep Learning Systems: Week 9, Compression
The Transformer Family Version 2.0 (Lilian Weng)
Large Transformer Model Inference Optimization (Lilian Weng)
