Awesome-Foundation-Models

A foundation model is a large-scale pretrained model (e.g., BERT, DALL-E, GPT-3) that can be adapted to a wide range of downstream applications. This term was first popularized by the Stanford Institute for Human-Centered Artificial Intelligence. This repository maintains a curated list of foundation models for vision and language tasks. Research papers without code are not included.

Survey

[2023.08] Towards Generalist Foundation Model for Radiology (from SJTU)
[2023.07] Foundational Models Defining a New Era in Vision: A Survey and Outlook (from MBZ University of AI)
[2023.07] Towards Generalist Biomedical AI (from Google)
[2023.07] A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models (from University of Oxford)
[2023.06] Large Multimodal Models: Notes on CVPR 2023 Tutorial (from Chunyuan Li, Microsoft)
[2023.06] A Survey on Multimodal Large Language Models
[2023.04] Vision-Language Models for Vision Tasks: A Survey
[2023.04] Foundation Models for Generalist Medical Artificial Intelligence
[2023.03] A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT
[2023.03] A Comprehensive Survey of AI-Generated Content (AIGC): A History of Generative AI from GAN to ChatGPT
[2022.12] Vision-Language Pre-training: Basics, Recent Advances, and Future Trends
[2022.07] On the Opportunities and Risks of Foundation Models (this survey first popularized the concept of the foundation model; from Stanford)

Papers by Date

Papers by Topic

Vision-Language Pretraining

FLIP: Scaling Language-Image Pre-training via Masking (from Meta)
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (proposes a generic and efficient VLP strategy based on off-the-shelf frozen vision and language models; from Salesforce Research)
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation (from Salesforce Research)
SLIP: Self-supervision meets Language-Image Pre-training (ECCV, from UC Berkeley and Meta)
GLIP: Grounded Language-Image Pre-training (CVPR, from UCLA and Microsoft)
ALIGN: Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision (PMLR, from Google)
RegionCLIP: Region-Based Language-Image Pretraining
CLIP: Learning Transferable Visual Models From Natural Language Supervision (from OpenAI)

Perception Tasks: Segmentation and Detection

SEEM: Segment Everything Everywhere All at Once (from University of Wisconsin-Madison, HKUST, and Microsoft)
SAM: Segment Anything (the first foundation model for image segmentation; from Meta)
SegGPT: Segmenting Everything In Context (from BAAI, ZJU, and PKU)

Large Language Models

GPT-4 Technical Report (from OpenAI)
GPT-3: Language Models are Few-Shot Learners (175B parameters; enables in-context learning, an advance over GPT-2; from OpenAI)
GPT-2: Language Models are Unsupervised Multitask Learners (1.5B parameters; from OpenAI)
GPT: Improving Language Understanding by Generative Pre-Training (from OpenAI)

Training Efficiency

Green AI (introduces the concept of Red AI vs Green AI)
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks (the lottery ticket hypothesis, from MIT)
Other Challenges and Opportunities: trust, reliability, safe use, interpretability, self-improvement, adaptation, augmentation, specialization, and understanding/predicting capability.

Towards AGI

Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models (from Huawei)

Links to Similar Awesome Repositories

Awesome-CV-Foundational-Models (maintained by Muhammad Awais)
