• Stars
    1,077
  • Rank 42,945 (Top 0.9 %)
  • Language
    Python
  • Created over 1 year ago
  • Updated 3 months ago

Repository Details

A curated list for Efficient Large Language Models

Awesome-Efficient-LLM

🚀 Updates

  • Sep 27, 2023: Add tag Publish for papers accepted at NeurIPS'23.
  • Sep 6, 2023: Add a new subdirectory project/ to organize those projects that are designed for developing a lightweight LLM.
  • July 11, 2023: In light of the numerous publications that currently conduct experiments using PLMs (such as BERT, BART), a new subdirectory efficient_plm/ is created to house papers that are applicable to PLMs but have yet to be verified for their effectiveness on LLMs (not implying that they are not suitable for LLMs).

💮 Contributing

If you'd like to include your paper, or need to update any details such as conference information or code URLs, please feel free to submit a pull request. You can generate the required markdown format for each paper by filling in the information in generate_item.py and running python generate_item.py. We warmly appreciate your contributions to this list. Alternatively, you can email me with the links to your paper and code, and I will add your paper to the list at my earliest convenience.
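As a rough illustration of the kind of entry such a script could emit, here is a minimal sketch. It is not the repository's actual generate_item.py: the function name format_item, its parameters, the exact row layout, and the example.com URLs are all assumptions made for this example, modeled only on the "Title & Authors / Introduction / Links" columns used in the tables below.

```python
from typing import Optional, Sequence


def format_item(
    title: str,
    authors: Sequence[str],
    paper_url: str,
    code_url: Optional[str] = None,
    venue: Optional[str] = None,
    image_path: Optional[str] = None,
) -> str:
    """Build one markdown table row: Title & Authors | Introduction | Links.

    Hypothetical helper; the real generate_item.py may use a different layout.
    """
    # First cell: optional venue tag, linked title, then the author list.
    tag = f"[{venue}] " if venue else ""
    first = f"{tag}[{title}]({paper_url}) <br> {', '.join(authors)}"
    # Second cell: an overview figure if one is supplied, otherwise empty.
    intro = f"![image]({image_path})" if image_path else ""
    # Third cell: code link (when available) followed by the paper link.
    links = []
    if code_url:
        links.append(f"[Github]({code_url})")
    links.append(f"[Paper]({paper_url})")
    return f"| {first} | {intro} | {' <br> '.join(links)} |"


if __name__ == "__main__":
    # Placeholder values only; replace them with your paper's details.
    print(
        format_item(
            title="An Example Paper on Efficient LLMs",
            authors=["First Author", "Second Author"],
            paper_url="https://example.com/paper",
            code_url="https://example.com/code",
            venue="NeurIPS'23",
        )
    )
```

The printed row can then be pasted under the matching section's table in a pull request; check it against the output of the real script before submitting.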

Knowledge Distillation

Title & Authors Introduction Links
Star Publish
Specializing Smaller Language Models towards Multi-Step Reasoning
Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, Tushar Khot
image Github
Paper
Star Publish
Distilling Script Knowledge from Large Language Models for Constrained Language Planning
Siyu Yuan, Jiangjie Chen, Ziquan Fu, Xuyang Ge, Soham Shah, Charles Robert Jankowski, Yanghua Xiao, Deqing Yang
image Github
Paper
Publish
SCOTT: Self-Consistent Chain-of-Thought Distillation
Peifeng Wang, Zhengyang Wang, Zheng Li, Yifan Gao, Bing Yin, Xiang Ren
image Paper
Star Publish
DISCO: Distilling Counterfactuals with Large Language Models
Zeming Chen, Qiyue Gao, Antoine Bosselut, Ashish Sabharwal, Kyle Richardson
image Github
Paper
Star Publish
I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation
Chandra Bhagavatula, Jena D. Hwang, Doug Downey, Ronan Le Bras, Ximing Lu, Lianhui Qin, Keisuke Sakaguchi, Swabha Swayamdipta, Peter West, Yejin Choi
image Github
Paper
Project
Star Publish
Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step
Liunian Harold Li, Jack Hessel, Youngjae Yu, Xiang Ren, Kai-Wei Chang, Yejin Choi
image Github
Paper
Star Publish
Can Language Models Teach? Teacher Explanations Improve Student Performance via Theory of Mind
Swarnadeep Saha, Peter Hase, Mohit Bansal
image Github
Paper
Publish
Dialogue Chain-of-Thought Distillation for Commonsense-aware Conversational Agents
Hyungjoo Chae, Yongho Song, Kai Tzu-iunn Ong, Taeyoon Kwon, Minjin Kim, Youngjae Yu, Dongha Lee, Dongyeop Kang, Jinyoung Yeo
image Paper
Star Publish
PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation
Gaurav Sahu, Olga Vechtomova, Dzmitry Bahdanau, Issam H. Laradji
image Github
Paper
Star Publish
Turning Dust into Gold: Distilling Complex Reasoning Capabilities from LLMs by Leveraging Negative Data
Yiwei Li, Peiwen Yuan, Shaoxiong Feng, Boyuan Pan, Bin Sun, Xinglin Wang, Heda Wang, Kan Li
image Github
Paper
Star Publish
GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model
Shicheng Tan, Weng Lam Tam, Yuanchun Wang, Wenwen Gong, Yang Yang, Hongyin Tang, Keqing He, Jiahao Liu, Jingang Wang, Shu Zhao, Peng Zhang, Jie Tang
image Github
Paper
Star Publish
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister
image Github
Paper
Publish
Retrieval-based Knowledge Transfer: An Effective Approach for Extreme Large Language Model Compression
Jiduan Liu, Jiahao Liu, Qifan Wang, Jingang Wang, Xunliang Cai, Dongyan Zhao, Ran Lucien Wang, Rui Yan
image Paper
Star Publish
Cache me if you Can: an Online Cost-aware Teacher-Student framework to Reduce the Calls to Large Language Models
Ilias Stogiannidis, Stavros Vassos, Prodromos Malakasiotis, Ion Androutsopoulos
image Github
Paper
Star
LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions
Minghao Wu, Abdul Waheed, Chiyu Zhang, Muhammad Abdul-Mageed, Alham Fikri Aji
image Github Paper
Knowledge Distillation of Large Language Models
Yuxian Gu, Li Dong, Furu Wei, Minlie Huang
image Github
Paper
Teaching Small Language Models to Reason
Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adamek, Eric Malmi, Aliaksei Severyn
image Paper
Star
Large Language Model Distillation Doesn't Need a Teacher
Ananya Harsh Jha, Dirk Groeneveld, Emma Strubell, Iz Beltagy
image Github Paper
The False Promise of Imitating Proprietary LLMs
Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, Dawn Song
image Paper
Star
Impossible Distillation: from Low-Quality Model to High-Quality Dataset & Model for Summarization and Paraphrasing
Jaehun Jung, Peter West, Liwei Jiang, Faeze Brahman, Ximing Lu, Jillian Fisher, Taylor Sorensen, Yejin Choi
image Github Paper
PaD: Program-aided Distillation Specializes Large Models in Reasoning
Xuekai Zhu, Biqing Qi, Kaiyan Zhang, Xingwei Long, Bowen Zhou
image Paper
RLCD: Reinforcement Learning from Contrast Distillation for Language Model Alignment
Kevin Yang, Dan Klein, Asli Celikyilmaz, Nanyun Peng, Yuandong Tian
image Paper
Sci-CoT: Leveraging Large Language Models for Enhanced Knowledge Distillation in Small Models for Scientific QA
Yuhan Ma, Haiqi Jiang, Chenyou Fan
image Paper
Star
UniversalNER: Targeted Distillation from Large Language Models for Open Named Entity Recognition
Wenxuan Zhou, Sheng Zhang, Yu Gu, Muhao Chen, Hoifung Poon
image Github
Paper
Project
Star
Baby Llama: knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty
Inar Timiryasov, Jean-Loup Tastet
image Github
Paper
DistillSpec: Improving Speculative Decoding via Knowledge Distillation
Yongchao Zhou, Kaifeng Lyu, Ankit Singh Rawat, Aditya Krishna Menon, Afshin Rostamizadeh, Sanjiv Kumar, Jean-François Kagy, Rishabh Agarwal
image Paper
Star
Zephyr: Direct Distillation of LM Alignment
Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, Thomas Wolf
image Github
Paper
Star
Towards the Law of Capacity Gap in Distilling Language Models
Chen Zhang, Dawei Song, Zheyu Ye, Yan Gao
image Github
Paper
Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models
Xinwei Li, Li Lin, Shuai Wang, Chen Qian
image Paper
Mixed Distillation Helps Smaller Language Model Better Reasoning
Li Chenglin, Chen Qianglong, Wang Caiyu, Zhang Yin
image Paper

Network Pruning

Title & Authors Introduction Links
Star Publish Type
SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot
Elias Frantar, Dan Alistarh
image Github Paper
Star Publish Type
LLM-Pruner: On the Structural Pruning of Large Language Models
Xinyin Ma, Gongfan Fang, Xinchao Wang
image Github Paper
Star Publish Type
The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter
Ajay Jaiswal, Shiwei Liu, Tianlong Chen, Zhangyang Wang
image Github
Paper
Star Publish Type
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song
image Github
Paper
Star Publish Type
NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models
Jongwoo Ko, Seungjoon Park, Yujin Kim, Sumyeong Ahn, Du-Seong Chang, Euijai Ahn, Se-Young Yun
image Github
Paper
Star Type
A Simple and Effective Pruning Approach for Large Language Models
Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter
image Github
Paper
Type
Pruning Large Language Models via Accuracy Predictor
Yupeng Ji, Yibo Cao, Jiucai Liu
image Paper
Type
Compressing LLMs: The Truth is Rarely Pure and Never Simple
Ajay Jaiswal, Zhe Gan, Xianzhi Du, Bowen Zhang, Zhangyang Wang, Yinfei Yang
image Paper
Star Type
Junk DNA Hypothesis: A Task-Centric Angle of LLM Pre-trained Weights through Sparsity
Lu Yin, Shiwei Liu, Ajay Jaiswal, Souvik Kundu, Zhangyang Wang
image Github
Paper
Star Type
Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Mykola Pechenizkiy, Yi Liang, Zhangyang Wang, Shiwei Liu
image Github
Paper
Type
Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models
Song Guo, Jiahang Xu, Li Lyna Zhang, Mao Yang
image Github
Paper
Star Type
Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen
image Github
Paper
Star Type
Sparse Finetuning for Inference Acceleration of Large Language Models
Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goin, Dan Alistarh
image Github
Paper
Type
ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar
image Paper
Type
The Cost of Down-Scaling Language Models: Fact Recall Deteriorates before In-Context Learning
Tian Jin, Nolan Clement, Xin Dong, Vaishnavh Nagarajan, Michael Carbin, Jonathan Ragan-Kelley, Gintare Karolina Dziugaite
image Paper
Type
One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models
Hang Shao, Bei Liu, Yanmin Qian
image Paper
Star Type
LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery
Tianyi Chen, Tianyu Ding, Badal Yadav, Ilya Zharkov, Luming Liang
image Github
Paper
Star Type
Divergent Token Metrics: Measuring degradation to prune away LLM components -- and optimize quantization
Björn Deiseroth, Max Meuer, Nikolas Gritsch, Constantin Eichenberg, Patrick Schramowski, Matthias Aßenmacher, Kristian Kersting
image Github
Paper
Star Type
Beyond Size: How Gradients Shape Pruning Decisions in Large Language Models
Rocktim Jyoti Das, Liqun Ma, Zhiqiang Shen
image Github
Paper
Star Type
Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs
Yuxin Zhang, Lirui Zhao, Mingbao Lin, Yunyun Sun, Yiwu Yao, Xingjia Han, Jared Tanner, Shiwei Liu, Rongrong Ji
image Github
Paper
Type
E-Sparse: Boosting the Large Language Model Inference through Entropy-based N:M Sparsity
Yun Li, Lin Niu, Xipeng Zhang, Kai Liu, Jianchen Zhu, Zhanhui Kang
image Paper
Star Type
PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs
Max Zimmer, Megi Andoni, Christoph Spiegel, Sebastian Pokutta
image Github
Paper
Star Type
Fast and Optimal Weight Update for Pruned Large Language Models
Vladimír Boža
image Github
Paper

Quantization

Title & Authors Introduction Links
Star Publish
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
image Github
Paper
Star Publish
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han
image Github
Paper
Star Publish
QLoRA: Efficient Finetuning of Quantized LLMs
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer
Github
Paper
Star Publish
QuIP: 2-Bit Quantization of Large Language Models With Guarantees
Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, Christopher De Sa
image Github
Paper
Publish
Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization
Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, Dongsoo Lee
image Paper
Star Publish
Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing
Yelysei Bondarenko, Markus Nagel, Tijmen Blankevoort
Github Paper
Star Publish
LLM-FP4: 4-Bit Floating-Point Quantized Transformers
Shih-yang Liu, Zechun Liu, Xijie Huang, Pingcheng Dong, Kwang-Ting Cheng
image Github
Paper
Publish
Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization
Jangwhan Lee, Minsoo Kim, Seungcheol Baek, Seok Joong Hwang, Wonyong Sung, Jungwook Choi
image Paper
Publish
Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge
Xuan Shen, Peiyan Dong, Lei Lu, Zhenglun Kong, Zhengang Li, Ming Lin, Chao Wu, Yanzhi Wang
image Paper
Publish
GPT-Zip: Deep Compression of Finetuned Large Language Models
Berivan Isik, Hermann Kumbong, Wanyi Ning, Xiaozhe Yao, Sanmi Koyejo, Ce Zhang
image Paper
Star Publish
Watermarking LLMs with Weight Quantization
Linyang Li, Botian Jiang, Pengyu Wang, Ke Ren, Hang Yan, Xipeng Qiu
image Github
Paper
Star
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Song Han
image Github
Paper
Star
RPTQ: Reorder-based Post-training Quantization for Large Language Models
Zhihang Yuan, Lin Niu, Jiawei Liu, Wenyu Liu, Xinggang Wang, Yuzhang Shang, Guangyu Sun, Qiang Wu, Jiaxiang Wu, Bingzhe Wu
Github
Paper
ZeroQuant-V2: Exploring Post-training Quantization in LLMs from Comprehensive Study to Low Rank Compensation
Zhewei Yao, Xiaoxia Wu, Cheng Li, Stephen Youn, Yuxiong He
image Paper
Star
SqueezeLLM: Dense-and-Sparse Quantization
Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, Kurt Keutzer
image Github
Paper
Outlier Suppression+: Accurate quantization of large language models by equivalent and optimal shifting and scaling
Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, Xianglong Liu
image Paper
Integer or Floating Point? New Outlooks for Low-Bit Quantization on Large Language Models
Yijia Zhang, Lingran Zhao, Shijie Cao, Wenqiang Wang, Ting Cao, Fan Yang, Mao Yang, Shanghang Zhang, Ningyi Xu
image Paper
LLM-QAT: Data-Free Quantization Aware Training for Large Language Models
Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, Vikas Chandra
image Paper
Star
SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression
Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, Dan Alistarh
image Github
Paper
Star
OWQ: Lessons learned from activation outliers for weight quantization in large language models
Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, Eunhyeok Park
image Github
Paper
Star
Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study
Peiyu Liu, Zikang Liu, Ze-Feng Gao, Dawei Gao, Wayne Xin Zhao, Yaliang Li, Bolin Ding, Ji-Rong Wen
image Github
Paper
ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats
Xiaoxia Wu, Zhewei Yao, Yuxiong He
image Paper
FPTQ: Fine-grained Post-Training Quantization for Large Language Models
Qingyuan Li, Yifan Zhang, Liang Li, Peng Yao, Bo Zhang, Xiangxiang Chu, Yerui Sun, Li Du, Yuchen Xie
image Paper
QuantEase: Optimization-based Quantization for Language Models - An Efficient and Intuitive Algorithm
Kayhan Behdin, Ayan Acharya, Aman Gupta, Sathiya Keerthi, Rahul Mazumder
image Paper
Norm Tweaking: High-performance Low-bit Quantization of Large Language Models
Liang Li, Qingyuan Li, Bo Zhang, Xiangxiang Chu
image Paper
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs
Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Kaokao Lv
image Github
Paper
Star
QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models
Yuhui Xu, Lingxi Xie, Xiaotao Gu, Xin Chen, Heng Chang, Hengheng Zhang, Zhensu Chen, Xiaopeng Zhang, Qi Tian
image Github
Paper
ModuLoRA: Finetuning 3-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers
Junjie Yin, Jiahao Dong, Yingheng Wang, Christopher De Sa, Volodymyr Kuleshov
image Paper
Star
PB-LLM: Partially Binarized Large Language Models
Yuzhang Shang, Zhihang Yuan, Qiang Wu, Zhen Dong
image Github
Paper
Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM
Luoming Zhang, Wen Fei, Weijia Wu, Yefei He, Zhenyu Lou, Hong Zhou
image Paper
QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources
Zhikai Li, Xiaoxuan Liu, Banghua Zhu, Zhen Dong, Qingyi Gu, Kurt Keutzer
image Paper
QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models
Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang
image Paper
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao
image Paper
TEQ: Trainable Equivalent Transformation for Quantization of LLMs
Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen
image Github
Paper
BitNet: Scaling 1-bit Transformers for Large Language Models
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei
image Paper
Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci
image Paper
AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models
Baisong Li, Xingwang Wang, Haixiao Xu
image Paper
Star
AFPQ: Asymmetric Floating Point Quantization for LLMs
Yijia Zhang, Sicheng Zhang, Shijie Cao, Dayou Du, Jianyu Wei, Ting Cao, Ningyi Xu
image Github
Paper
A Speed Odyssey for Deployable Quantization of LLMs
Qingyuan Li, Ran Meng, Yiduo Li, Bo Zhang, Liang Li, Yifan Lu, Xiangxiang Chu, Yerui Sun, Yuchen Xie
image Paper
Star
LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning
Han Guo, Philip Greengard, Eric P. Xing, Yoon Kim
image Github
Paper
Enabling Fast 2-bit LLM on GPUs: Memory Alignment, Sparse Outlier, and Asynchronous Dequantization
Jinhao Li, Shiyao Li, Jiaming Xu, Shan Huang, Yaoxiu Lian, Jun Liu, Yu Wang, Guohao Dai
image Paper
Star
SmoothQuant+: Accurate and Efficient 4-bit Post-Training WeightQuantization for LLM
Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng
image Github
Paper
ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, Arash Bakhtiari, Michael Wyatt, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao
image Github
Paper
Star
Extreme Compression of Large Language Models via Additive Quantization
Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, Dan Alistarh
image Github
Paper

Inference Acceleration

Title & Authors Introduction Links
Star Publish
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, Beidi Chen
image Github
Paper
Publish
Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, Anshumali Shrivastava
image Paper
Publish
Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurelien Lucchi, Thomas Hofmann
image Paper
Star Publish
H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, Beidi Chen
image Github
Paper
Star Publish
LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu
image Github
Paper
Star Publish
Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
Sangmin Bae, Jongwoo Ko, Hwanjun Song, Se-Young Yun
image Github
Paper
Star Publish
Compressing Context to Enhance Inference Efficiency of Large Language Models
Yucheng Li, Bo Dong, Chenghua Lin, Frank Guerin
image Github
Paper
Publish
ConsistentEE: A Consistent and Hardness-Guided Early Exiting Method for Accelerating Language Models Inference
Ziqian Zeng, Yihuai Hong, Hongliang Dai, Huiping Zhuang, Cen Chen
image Paper
Publish
Accelerating LLM Inference with Staged Speculative Decoding
Benjamin Spector, Chris Re
image Paper
Publish
TCRA-LLM: Token Compression Retrieval Augmented Large Language Model for Inference Cost Reduction
Junyi Liu, Liangzhi Li, Tong Xiang, Bowen Wang, Yiming Qian
image Paper
Inference with Reference: Lossless Acceleration of Large Language Models
Nan Yang, Tao Ge, Liang Wang, Binxing Jiao, Daxin Jiang, Linjun Yang, Rangan Majumder, Furu Wei
image Github
Paper
Star
SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia
image Github
Paper
SkipDecode: Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference
Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, Subhabrata Mukherjee
image Paper
Skeleton-of-Thought: Large Language Models Can Do Parallel Decoding
Xuefei Ning, Zinan Lin, Zixuan Zhou, Huazhong Yang, Yu Wang
image Paper
Star
Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding
Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Sharad Mehrotra
image Github
Paper
Star
Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis
image Github
Paper
(Dynamic) Prompting might be all you need to repair Compressed LLMs
Duc N.M Hoang, Minsik Cho, Thomas Merth, Mohammad Rastegari, Zhangyang Wang
image Paper
Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao
image Paper
Star
Large Language Model Cascades with Mixture of Thoughts Representations for Cost-efficient Reasoning
Murong Yue, Jie Zhao, Min Zhang, Liang Du, Ziyu Yao
image Github
Paper
Star
LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression
Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, Lili Qiu
image Github
Paper
CacheGen: Fast Context Loading for Language Model Applications
Yuhan Liu, Hanchen Li, Kuntai Du, Jiayi Yao, Yihua Cheng, Yuyang Huang, Shan Lu, Michael Maire, Henry Hoffmann, Ari Holtzman, Ganesh Ananthanarayanan, Junchen Jiang
image Paper
Star Publish
Context Compression for Auto-regressive Transformers with Sentinel Tokens
Siyu Ren, Qi Jia, Kenny Q. Zhu
image Github
Paper
Star
A Setwise Approach for Effective and Highly Efficient Zero-shot Ranking with Large Language Models
Shengyao Zhuang, Honglei Zhuang, Bevan Koopman, Guido Zuccon
image Github
Paper
SPEED: Speculative Pipelined Execution for Efficient Decoding
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Hasan Genc, Kurt Keutzer, Amir Gholami, Sophia Shao
image Paper
Accelerating LLM Inference by Enabling Intermediate Layer Decoding
Neeraj Varshney, Agneet Chatterjee, Mihir Parmar, Chitta Baral
image Paper
Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster
Hongxuan Zhang, Zhining Liu, Jiaqi Zheng, Chenyi Zhuang, Jinjie Gu, Guihai Chen
image Paper
Star
Compressed Context Memory For Online Language Model Interaction
Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song
image Github
Paper
SparQ Attention: Bandwidth-Efficient LLM Inference
Luka Ribar, Ivan Chelombiev, Luke Hudlass-Galley, Charlie Blake, Carlo Luschi, Douglas Orr
image Paper
Lookahead: An Inference Acceleration Framework for Large Language Model with Lossless Generation Accuracy
Yao Zhao, Zhitian Xie, Chenyi Zhuang, Jinjie Gu
image Paper
Cascade Speculative Drafting for Even Faster LLM Inference
Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Jie Huang, Kevin Chen-Chuan Chang
image Paper

Efficient MOE

Title & Authors Introduction Links
SiDA: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai "Helen" Li, Yiran Chen
image Paper
Star
Fast Inference of Mixture-of-Experts Language Models with Offloading
Artyom Eliseev, Denis Mazur
image Github
Paper
Star
SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention
Róbert Csordás, Piotr Piękos, Kazuki Irie, Jürgen Schmidhuber
image Github
Paper

Text Compression

Title & Authors Introduction Links
Publish
EntropyRank: Unsupervised Keyphrase Extraction via Side-Information Optimization for Language Model-based Text Compression
Alexander Tsvetkov, Alon Kipnis
image Paper
LLMZip: Lossless Text Compression using Large Language Models
Chandra Shekhara Kaushik Valmeekam, Krishna Narayanan, Dileep Kalathil, Jean-Francois Chamberland, Srinivas Shakkottai
image Paper | Unofficial Github
Star
Adapting Language Models to Compress Contexts
Alexis Chevalier, Alexander Wettig, Anirudh Ajith, Danqi Chen
image Github
Paper
In-context Autoencoder for Context Compression in a Large Language Model
Tao Ge, Jing Hu, Xun Wang, Si-Qing Chen, Furu Wei
image Paper
Nugget 2D: Dynamic Contextual Compression for Scaling Decoder-only Language Model
Guanghui Qin, Corby Rosset, Ethan C. Chau, Nikhil Rao, Benjamin Van Durme
image Paper
Boosting LLM Reasoning: Push the Limits of Few-shot Learning with Reinforced In-Context Pruning
Xijie Huang, Li Lyna Zhang, Kwang-Ting Cheng, Mao Yang
image Paper

Low-Rank Decomposition

Title & Authors Introduction Links
Star Publish
LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation
Yixiao Li, Yifan Yu, Qingru Zhang, Chen Liang, Pengcheng He, Weizhu Chen, Tuo Zhao
image Github
Paper
Star Publish
Matrix Compression via Randomized Low Rank and Low Precision Factorization
Rajarshi Saha, Varun Srivastava, Mert Pilanci
image Github
Paper
TensorGPT: Efficient Compression of the Embedding Layer in LLMs based on the Tensor-Train Decomposition
Mingxue Xu, Yao Lei Xu, Danilo P. Mandic
image Paper
LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression
Ayush Kaushal, Tejas Vaidhya, Irina Rish
image Paper
Project
Star
Rethinking Compression: Reduced Order Modelling of Latent Features in Large Language Models
Arnav Chavan, Nahush Lele, Deepak Gupta
image Github
Paper

Hardware

Tuning

Survey

Leaderboard

Platform Access
Huggingface LLM Perf Leaderboard [Source]
LLMPerf Leaderboard [Source]