Awesome LLM Security
A curation of awesome tools, documents and projects about LLM Security.
Contributions are always welcome. Please read the Contribution Guidelines before contributing.
Table of Contents
Papers
- Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
- Visual Adversarial Examples Jailbreak Large Language Models
- Jailbroken: How Does LLM Safety Training Fail?
- Are aligned neural networks adversarially aligned?
- Latent Jailbreak: A Benchmark for Evaluating Text Safety and Output Robustness of Large Language Models
- (Ab)using Images and Sounds for Indirect Instruction Injection in Multi-Modal LLMs
- Prompts Should not be Seen as Secrets: Systematically Measuring Prompt Extraction Attack Success
- BITE: Textual Backdoor Attacks with Iterative Trigger Injection
- Multi-step Jailbreaking Privacy Attacks on ChatGPT
- Prompt as Triggers for Backdoor Attack: Examining the Vulnerability in Language Models
- LLM Censorship: A Machine Learning Challenge or a Computer Security Problem?
- Universal and Transferable Adversarial Attacks on Aligned Language Models
- Plug and Pray: Exploiting off-the-shelf components of Multi-Modal Models
- Virtual Prompt Injection for Instruction-Tuned Large Language Models
- Jailbreaking chatgpt via prompt engineering: An empirical study
- Prompt Injection attack against LLM-integrated Applications
- Jailbreaker: Automated Jailbreak Across Multiple Large Language Model Chatbots
- GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher
- LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked
- Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities
- Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs
- Detecting Language Model Attacks with Perplexity
- Baseline Defenses for Adversarial Attacks Against Aligned Language Models
- Image Hijacking: Adversarial Images can Control Generative Models at Runtime
- Open Sesame! Universal Black Box Jailbreaking of Large Language Models
- LLM Platform Security: Applying a Systematic Evaluation Framework to OpenAI’s ChatGPT Plugins
- Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
- AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
- Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
- Multilingual Jailbreak Challenges in Large Language Models
- Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks
- Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation
- DeepInception: Hypnotize Large Language Model to Be Jailbreaker
- A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily
- AutoDAN: Automatic and Interpretable Adversarial Attacks on Large Language Models
- Language Model Inversion
Tools
- Rebuff: a self-hardening prompt injection detector
- Garak: a LLM vulnerability scanner
- LLMFuzzer: a fuzzing framework for LLMs
- LLM Guard: a security toolkit for LLM Interactions
- Vigil: a LLM prompt injection detection toolkit
Articles
- Hacking Auto-GPT and escaping its docker container
- Prompt Injection Cheat Sheet: How To Manipulate AI Language Models
- Indirect Prompt Injection Threats
- Prompt injection: What’s the worst that can happen?
- OWASP Top 10 for Large Language Model Applications
- PoisonGPT: How we hid a lobotomized LLM on Hugging Face to spread fake news
- ChatGPT Plugins: Data Exfiltration via Images & Cross Plugin Request Forgery
- Jailbreaking GPT-4's code interpreter
- Securing LLM Systems Against Prompt Injection
- The AI Attack Surface Map v1.0
- Adversarial Attacks on LLMs
Other Awesome Projects
- Gandalf: a prompt injection wargame
- LangChain vulnerable to code injection - CVE-2023-29374
- Jailbreak Chat
- Adversarial Prompting
- Epivolis: a prompt injection aware chatbot designed to mitigate adversarial efforts
- LLM Security Problems at DEFCON31 Quals: the world's top security competition
- PromptBounty.io
- PALLMs (Payloads for Attacking Large Language Models)
Other Useful Resources
- Twitter: @llm_sec
- Blog: LLM Security authored by @llm_sec
- Blog: Embrace The Red
- Blog: Kai's Blog
- Newsletter: AI safety takes
- Newsletter & Blog: Hackstery