Awesome-LLM-Compression
Awesome LLM compression research papers and tools to accelerate LLM training and inference.
Papers
Survey
- A Survey on Model Compression for Large Language Models
Arxiv 2023 [Paper]
Quantization
-
ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers
NeurIPS 2022 [Paper] [Code (DeepSpeed)] -
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
NeurIPS 2022 [Paper] [Code] -
LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models
Arxiv 2022 [Paper] -
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
ICML 2023 [Paper] [Code] -
FlexRound: Learnable Rounding based on Element-wise Division for Post-Training Quantization
ICML 2023 [Paper] [Code (DeepSpeed)] -
Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases
ICML 2023 [Paper] [Code] -
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
ICLR 2023 [Paper] [Code] -
RPTQ: Reorder-based Post-training Quantization for Large Language Models
Arxiv 2023 [Paper] [Code] -
PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models
ACL 2023 [Paper] -
Boost Transformer-based Language Models with GPU-Friendly Sparsity and Quantization
ACL 2023 [Paper] -
Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling
Arxiv 2023 [Paper] -
Quantized Distributed Training of Large Models with Convergence Guarantees
Arxiv 2023 [Paper] -
ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
Arxiv 2023 [Paper] [Code] -
QLoRA: Efficient Finetuning of Quantized LLMs
Arxiv 2023 [Paper] [Code] -
Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models
Arxiv 2023 [Paper] -
The Quantization Model of Neural Scaling
Arxiv 2023 [Paper] -
Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization
Arxiv 2023 [Paper] -
Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt
Arxiv 2023 [Paper] -
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Arxiv 2023 [Paper] [Code] -
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
Arxiv 2023 [Paper] [Code] -
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
Arxiv 2023 [Paper] [Code] -
OWQ: Lessons learned from activation outliers for weight quantization in large language models
Arxiv 2023 [Paper] -
SqueezeLLM: Dense-and-Sparse Quantization
Arxiv 2023 [Paper] [Code] -
INT2.1: Towards Fine-Tunable Quantized Large Language Models with Error Correction through Low-Rank Adaptation
Arxiv 2023 [Paper] -
INT-FP-QSim: Mixed Precision and Formats For Large Language Models and Vision Transformers
Arxiv 2023 [Paper] [Code] -
QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models
Arxiv 2023 [Paper] [Code] -
Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study
Arxiv 2023 [Paper] -
ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
Arxiv 2023 [Paper] [Code (DeepSpeed)] -
OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization
ISCA 2023 [Paper] -
QuIP: 2-Bit Quantization of Large Language Models With Guarantees
Arxiv 2023 [Paper] [Code] -
NUPES: Non-Uniform Post-Training Quantization via Power Exponent Search
Arxiv 2023 [Paper] -
GPT-Zip: Deep Compression of Finetuned Large Language Models
ICML 2023 Workshop ES-FoMO [Paper] -
Generating Efficient Kernels for Quantized Inference on Large Language Models
ICML 2023 Workshop ES-FoMO [Paper] -
Gradient-Based Post-Training Quantization: Challenging the Status Quo
Arxiv 2023 [Paper] -
FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs
Arxiv 2023 [Paper] -
OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
Arxiv 2023 [Paper] -
FPTQ: Fine-grained Post-Training Quantization for Large Language Models
Arxiv 2023 [Paper] -
eDKM: An Efficient and Accurate Train-time Weight Clustering for Large Language Models
Arxiv 2023 [Paper] -
QuantEase: Optimization-based Quantization for Language Models -- An Efficient and Intuitive Algorithm
Arxiv 2023 [Paper] -
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
Arxiv 2023 [Paper] -
Understanding the Impact of Post-Training Quantization on Large-scale Language Models
Arxiv 2023 [Paper] -
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
Arxiv 2023 [Paper] [Code]
Pruning and Sparsity
-
The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers
ICLR 2023 [Paper] -
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
ICML 2023 [Paper] [Code] -
LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
ICML 2023 [Paper] [Code] -
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
Arxiv 2023 [Paper] [Code] -
LLM-Pruner: On the Structural Pruning of Large Language Models
Arxiv 2023 [Paper] [Code] -
Prune and Tune: Improving Efficient Pruning Techniques for Massive Language Models
ICLR 2023 TinyPapers [Paper] -
Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering
Arxiv 2023 [Paper] [Code] -
Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale
Arxiv 2023 [Paper] [Code] -
A Simple and Effective Pruning Approach for Large Language Models
Arxiv 2023 [Paper] [Code] -
Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
Arxiv 2023 [Paper] -
Structural pruning of large language models via neural architecture search
AutoML 2023 [Paper]
Distillation
-
Lifting the Curse of Capacity Gap in Distilling Language Models
ACL 2023 [Paper] [Code] -
Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step
ACL 2023 [Paper] -
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
ACL 2023 [Paper] -
SCOTT: Self-Consistent Chain-of-Thought Distillation
ACL 2023 [Paper] -
DISCO: Distilling Counterfactuals with Large Language Models
ACL 2023 [Paper] [Code] -
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions
Arxiv 2023 [Paper] [Code] -
Large Language Model Distillation Doesn't Need a Teacher
Arxiv 2023 [Paper] [Code] -
The False Promise of Imitating Proprietary LLMs
Arxiv 2023 [Paper] -
GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
Arxiv 2023 [Paper] [Code] -
PaD: Program-aided Distillation Specializes Large Models in Reasoning
Arxiv 2023 [Paper] -
Knowledge Distillation of Large Language Models
Arxiv 2023 [Paper] [Code] -
GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models
Arxiv 2023 [Paper] -
Chain-of-Thought Prompt Distillation for Multimodal Named Entity and Multimodal Relation Extraction
Arxiv 2023 [Paper] -
Task-agnostic Distillation of Encoder-Decoder Language Models
Arxiv 2023 [Paper] -
Lion: Adversarial Distillation of Closed-Source Large Language Model
Arxiv 2023 [Paper] [Code]
Efficient Prompting
-
Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning
ACL 2023 [Paper] [Code] -
Efficient Prompting via Dynamic In-Context Learning
Arxiv 2023 [Paper] -
Learning to Compress Prompts with Gist Tokens
Arxiv 2023 [Paper] [Code] -
Batch Prompting: Efficient Inference with Large Language Model APIs
Arxiv 2023 [Paper] [Code] -
Adapting Language Models to Compress Contexts
Arxiv 2023 [Paper] [Code] -
In-context Autoencoder for Context Compression in a Large Language Model
Arxiv 2023 [Paper] -
Discrete Prompt Compression with Reinforcement Learning
Arxiv 2023 [Paper] -
BatchPrompt: Accomplish more with less
Arxiv 2023 [Paper]
Other
-
TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition
Arxiv 2023 [Paper] -
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Arxiv 2023 [Paper] -
SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference
Arxiv 2023 [Paper] -
Scaling In-Context Demonstrations with Structured Attention
Arxiv 2023 [Paper] -
Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline
Arxiv 2023 [Paper] [Code] -
Text Alignment Is An Efficient Unified Model for Massive NLP Tasks
Arxiv 2023 [Paper] [Code] -
CPET: Effective Parameter-Efficient Tuning for Compressed Large Language Models
Arxiv 2023 [Paper] -
Ternary Singular Value Decomposition as a Better Parameterized Form in Linear Mapping
Arxiv 2023 [Paper] -
LLMCad: Fast and Scalable On-device Large Language Model Inference
Arxiv 2023 [Paper]
Tools
-
BMCook: Model Compression for Big Models [Code]
-
llama.cpp: Inference of LLaMA model in pure C/C++ [Code]
-
LangChain: Building applications with LLMs through composability [Code]
-
GPTQ-for-LLaMA: 4-bit quantization of LLaMA using GPTQ [Code]
-
Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface [Code]
-
vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs [Code]
-
LLaMA Efficient Tuning: Fine-tuning LLaMA with PEFT (PT+SFT+RLHF with QLoRA) [Code]
-
Efficient-Tuning-LLMs: Efficient finetuning of LLMs with QLoRA. Covers LLaMA, BLOOM, baichuan-7B, and GLM. [Code]
-
bitsandbytes: 8-bit CUDA functions for PyTorch [Code]
-
ExLlama: A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights. [Code]
-
lit-gpt: Hackable implementation of state-of-the-art open-source LLMs based on nanoGPT. Supports flash attention, 4-bit and 8-bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. [Code]
-
Lit-LLaMA: Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4-bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. [Code]
-
llama.onnx: LLaMa/RWKV ONNX models, quantization, and test cases [Code]
-
fastLLaMa: An experimental high-performance framework for running Decoder-only LLMs with 4-bit quantization in Python using a C/C++ backend. [Code]
-
Sparsebit: A model compression and acceleration toolbox based on PyTorch. [Code]
-
llama2.c: Inference Llama 2 in one file of pure C [Code]
-
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads [Code]
-
Megatron-LM: Ongoing research training transformer models at scale [Code]