  • Stars: 222
  • Rank: 173,327 (Top 4%)
  • License: MIT License
  • Created: about 1 year ago
  • Updated: 9 months ago

Repository Details

A collection of phenomena observed during the scaling of big foundation models, which may develop into consensuses, principles, or laws in the future

BM-Principles

🌟 Big models have proven their potential to lead to artificial general intelligence. However 😕, because they have developed so rapidly, people have not yet fully grasped the principles of understanding and training them. Therefore, in order to learn about big models together, we have decided to collect newly observed phenomena on big models and summarize them in this repository 📚 in the form of short entries. We hope this collection of phenomena observed during the scaling of big models may form future consensuses, principles, or patterns 📝.

The repository focuses on two aspects:

  • How: How to train powerful big models? 🚀
  • What: What properties are interesting for big models? 🤔

The repo is currently far from exhaustive. Let's work together to improve it! 💪

How: how to train a powerful big model?

  1. Scaling of Computation

    1. Training loss decreases predictably.

      • Training loss can be written as a smooth function of the number of model parameters and the amount of training data or computation; see the equation sketched below.

      Scaling Laws for Neural Language Models

      Scaling Laws for Autoregressive Generative Modeling
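
      As a concrete illustration, one common parametrization from the compute-optimal scaling literature writes the loss in terms of the parameter count $N$ and the number of training tokens $D$ (the symbols here are generic placeholders, not the papers' exact fitted values):

      $$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

      Here $E$ is the irreducible loss and $A$, $B$, $\alpha$, $\beta$ are constants fitted on smaller training runs.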

    2. Compute-optimal language models.

      • Given a fixed computational budget, if we train an excessively large model, we can only run a very limited number of steps; on the other hand, if we train a model that is too small, its final loss will not be as good as that of a larger model. Therefore, for a fixed budget there exists an optimal model size and an optimal number of training tokens.
      • From previous experience, the optimal number of training tokens is roughly $20 N$, where $N$ is the number of model parameters; see the sketch below.

      Training Compute-Optimal Large Language Models
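
      A minimal sketch of this rule of thumb in code, assuming the common approximation $C \approx 6ND$ for training FLOPs (the function name and numbers are illustrative, not an official recipe):

      ```python
      def compute_optimal_budget(n_params: float, tokens_per_param: float = 20.0):
          """Rough compute-optimal budget from the ~20 tokens-per-parameter rule of thumb."""
          n_tokens = tokens_per_param * n_params   # D ~= 20 * N
          train_flops = 6.0 * n_params * n_tokens  # C ~= 6 * N * D
          return n_tokens, train_flops

      # Example: a 7B-parameter model.
      tokens, flops = compute_optimal_budget(7e9)
      print(f"~{tokens / 1e9:.0f}B tokens, ~{flops:.1e} training FLOPs")
      # -> ~140B tokens, ~5.9e+21 training FLOPs
      ```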

    3. LLMs don't converge at the compute-optimal number of tokens.

      • An LLM may keep improving its loss after passing the compute-optimal number of tokens.
      • From the training loss of LLaMA-7B and LLaMA-13B, we can see that the loss continues to improve beyond roughly 140B and 260B tokens, respectively.

      LLaMA: Open and Efficient Foundation Language Models

  2. Optimal Hyperparameters.

    1. The best batch size is a function of loss.

      • To reach a given loss, a large batch size requires more computation, while a small batch size requires more training steps (i.e., more time). The best batch size is a trade-off between the two; see the relation sketched below.
      • In the figure from the paper (not reproduced here), each diagonal line of points represents one training run: the horizontal axis is the number of training steps, the vertical axis is the number of processed tokens, and the color encodes the loss. The optimal batch size can be read off as the inflection point of each loss contour.

      Scaling Laws for Neural Language Models
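
      For reference, the scaling-laws paper models this trade-off with a "critical batch size" that grows as the loss falls, roughly of the form below (constants omitted; treat this as a sketch of the functional form rather than an exact fit):

      $$B_{\mathrm{crit}}(L) \approx \frac{B_*}{L^{1/\alpha_B}}$$

      Training far below $B_{\mathrm{crit}}$ wastes wall-clock time (more steps), while training far above it wastes compute (more tokens per unit of loss improvement).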

    2. Large batch sizes allow large learning rates.

      • Generally, a larger batch size allows a larger learning rate, and the larger learning rate converges faster; a common heuristic is sketched below.

      Don't Decay the Learning Rate, Increase the Batch Size
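
      A minimal sketch of the common linear-scaling heuristic (the base learning rate and base batch size below are illustrative assumptions, not values from the paper):

      ```python
      def scaled_lr(batch_size: int, base_lr: float = 3e-4, base_batch: int = 512) -> float:
          """Scale the learning rate linearly with batch size (a common heuristic)."""
          return base_lr * batch_size / base_batch

      print(scaled_lr(4096))  # -> 0.0024
      ```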

    3. The cosine scheduler is prevalent.

      • The cosine scheduler is the prevalent choice; it is better than the Noam scheduler at the same peak learning rate, since Noam decays more sharply.
      • Below is our experiment for CPM (figure not reproduced here).
    4. The cosine schedule's period should be set to the end step.

      • From 2.3, you might wonder whether keeping the learning rate high is good for training. It's not.
      • When you plan to train for $T$ steps, it's best to set the scheduler's period to exactly $T$, neither larger nor smaller; see the sketch below.
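
      A minimal sketch of a warmup-plus-cosine schedule whose period equals the planned total number of steps (the peak/minimum learning rates and warmup length are illustrative assumptions):

      ```python
      import math

      def lr_at_step(step: int, total_steps: int, peak_lr: float = 3e-4,
                     warmup_steps: int = 2000, min_lr: float = 3e-5) -> float:
          """Linear warmup, then one cosine decay whose period is exactly total_steps."""
          if step < warmup_steps:
              return peak_lr * step / warmup_steps
          progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)  # in [0, 1]
          return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

      # With total_steps set to the planned end step, the learning rate reaches min_lr
      # exactly when training ends, neither earlier nor later.
      total = 100_000
      print(lr_at_step(0, total), lr_at_step(50_000, total), lr_at_step(100_000, total))
      ```
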
  3. Predictable Scaling.

    1. The pass rate on HumanEval can be predicted with 1/10,000 of the compute.

      • It's important to forecast a model's ability before it is trained. OpenAI's GPT-4 report proposed the first version of predictable scaling: it estimates GPT-4's HumanEval pass rate by extrapolating from much smaller models trained with a small fraction of the compute. A generic sketch of this kind of extrapolation is given below.
      • Currently, there is no other public result on predicting downstream metrics for large models.
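
      A hedged sketch of the general idea (not OpenAI's actual procedure): fit a simple power law to small-compute runs in log-log space and extrapolate to the target compute. The data points below are made up purely for illustration.

      ```python
      import numpy as np

      # Hypothetical small-scale runs: (training compute in FLOPs, loss-like proxy metric).
      compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
      metric = np.array([3.2, 2.9, 2.6, 2.4, 2.2])

      # Fit metric ~= a * compute^b via linear regression in log-log space.
      b, log_a = np.polyfit(np.log(compute), np.log(metric), 1)

      def predict(c: float) -> float:
          return np.exp(log_a) * c ** b

      # Extrapolate roughly four orders of magnitude to the target training run.
      print(predict(1e24))
      ```
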
  4. Model Architecture

    1. A diverse range of architectures achieves a similar pre-training loss.

      Scaling Laws for Neural Language Models

    2. For downstream metrics, a DeepNarrow architecture (deeper and narrower) is preferred.

      Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

    3. Normalization has not reached a consensus, but pre-norm has been more popular recently.

      • Here we list the normalization techniques of publicly known models; a minimal code sketch of the two placements is given below.

      Model      Normalization
      Llama      Pre-norm
      GLM        Post-norm + DeepNorm
      Pythia     Post-norm
      BLOOM      Pre-norm
      StarCoder  Pre-norm

      DeepNet: Scaling Transformers to 1,000 Layers
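
      For illustration, a minimal sketch of the two placements in a Transformer residual block (simplified; "sublayer" stands for attention or the FFN, and this is not any specific model's implementation):

      ```python
      import torch
      import torch.nn as nn

      class PreNormBlock(nn.Module):
          """Pre-norm: normalize the sublayer's input, then add the residual."""
          def __init__(self, d_model: int, sublayer: nn.Module):
              super().__init__()
              self.norm = nn.LayerNorm(d_model)
              self.sublayer = sublayer

          def forward(self, x):
              return x + self.sublayer(self.norm(x))

      class PostNormBlock(nn.Module):
          """Post-norm: add the residual first, then normalize the sum."""
          def __init__(self, d_model: int, sublayer: nn.Module):
              super().__init__()
              self.norm = nn.LayerNorm(d_model)
              self.sublayer = sublayer

          def forward(self, x):
              return self.norm(x + self.sublayer(x))

      # Example with an FFN sublayer (dimensions are illustrative).
      d = 16
      ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
      x = torch.randn(2, 8, d)
      print(PreNormBlock(d, ffn)(x).shape, PostNormBlock(d, ffn)(x).shape)
      ```

      Pre-norm keeps the residual path close to the identity, which is commonly credited with more stable training of deep stacks; DeepNorm (see the DeepNet paper above) instead rescales the residual in a post-norm block to make very deep post-norm models trainable.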

  5. Data Mixture

    1. Diversity improves zero-shot generalization.

      • Diverse cross-domain pretraining data combining web crawls with curated high-quality sources significantly improves zero-shot generalization over pretraining datasets constructed from Common Crawl only.

      What Language Model to Train if You Have One Million GPU Hours?

    2. Data proportions are important.

      • Re-mixing the domain proportions of the Pile boosts convergence speed and downstream performance; see the sketch below.

      DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
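
      A minimal sketch of applying a set of domain mixture weights when sampling pre-training examples (the domain names and weights below are illustrative, not the DoReMi-optimized values, which the paper learns with a small proxy model):

      ```python
      import random

      # Illustrative domain weights; they must sum to 1.
      mixture = {"web": 0.45, "books": 0.15, "code": 0.20, "papers": 0.20}
      rng = random.Random(0)

      def sample_domain() -> str:
          """Pick the domain to draw the next training example from."""
          domains, weights = zip(*mixture.items())
          return rng.choices(domains, weights=weights, k=1)[0]

      print([sample_domain() for _ in range(5)])
      ```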

    3. Code might contribute to reasoning ability.

      • There is a widespread belief that pre-training on code yields stronger reasoning capabilities, but there is currently no quantitative verification.

      How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources

What: what properties are interesting for large models?

  1. Emergent abilities

    1. Emergent abilities are observed in models of roughly 50B parameters or larger.

      Emergent Abilities of Large Language Models

    2. Popular methods only work on large models.

      • Prompt tuning and delta tuning work well for models larger than 1B parameters.
      • In-context learning and chain-of-thought reasoning work for even larger models.

      Emergent Abilities of Large Language Models

      Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models

    3. Inverse (U-shaped) scaling

      • Some tasks' scaling curves exhibit a U-shape.
      • Several factors might contribute: distractor tasks, memorization, and misleading few-shot prompting.

      Inverse scaling can become U-shaped

      Inverse Scaling: When Bigger Isn't Better

  2. Training Dynamics.

    1. The double-descent phenomenon is observed.

      • There is a regime in which increasing the model size harms performance.
      • This closely resembles the inverse-scaling phenomenon.

      Deep Double Descent: Where Bigger Models and More Data Hurt

    2. The grokking phenomenon might contribute to generalization.

      • Overparameterized neural networks can show a sudden improvement in generalization long after they have overfit the training data.

      Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    3. Modularity emerges in LLMs.

      • Sparse activation has been observed in big models; a sketch of how to measure it is given below.
      • The sparsity of modules tends to form at an early stage of training.
      • The sparsity of neurons tends to form later.

      Emergent Modularity in Pre-trained Transformers
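
      A minimal sketch of measuring activation sparsity in a ReLU feed-forward layer (the threshold, dimensions, and random inputs are illustrative; this is not the paper's measurement protocol):

      ```python
      import torch
      import torch.nn as nn

      def activation_sparsity(hidden: torch.Tensor, threshold: float = 0.0) -> float:
          """Fraction of intermediate FFN activations at or below the threshold."""
          return (hidden <= threshold).float().mean().item()

      # Toy example: a ReLU FFN on random inputs.
      d_model, d_ff = 16, 64
      ffn_in = nn.Linear(d_model, d_ff)
      x = torch.randn(2, 8, d_model)
      hidden = torch.relu(ffn_in(x))
      print(f"sparsity: {activation_sparsity(hidden):.2f}")  # ~0.5 for random inputs
      ```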

More Repositories

1. ChatDev — Create Customized Software using Natural Language Idea (through LLM-powered Multi-Agent Collaboration) (Shell, 20,167 stars)
2. XAgent — An Autonomous LLM Agent for Complex Task Solving (Python, 7,583 stars)
3. ToolBench — [ICLR'24 spotlight] An open platform for training, serving, and evaluating large language models for tool learning. (Python, 4,457 stars)
4. MiniCPM — MiniCPM-2B: An end-side LLM that outperforms Llama2-13B. (Jupyter Notebook, 3,871 stars)
5. AgentVerse — 🤖 AgentVerse 🪐 is designed to facilitate the deployment of multiple LLM-based agents in various applications, primarily providing two frameworks: task-solving and simulation. (JavaScript, 3,695 stars)
6. BMTools — Tool Learning for Big Models, Open-Source Solutions of ChatGPT-Plugins (Python, 2,848 stars)
7. CPM-Bee — A 10-billion-parameter Chinese-English bilingual foundation model (Python, 2,658 stars)
8. MiniCPM-V — MiniCPM-V 2.0: An Efficient End-side MLLM with Strong OCR and Understanding Capabilities (Python, 1,406 stars)
9. VisCPM — [ICLR'24 spotlight] Chinese and English Multimodal Large Model Series (Chat and Paint), based on the CPM foundation models (Python, 984 stars)
10. ProAgent — An LLM-based Agent for the New Automation Paradigm: Agentic Process Automation (Python, 674 stars)
11. BMInf — Efficient Inference for Big Models (Python, 565 stars)
12. CPM-Live — Live Training for Open-source Big Models (Python, 510 stars)
13. BMTrain — Efficient Training (including pre-training and fine-tuning) for Big Models (Python, 507 stars)
14. BMList — A List of Big Models (Python, 339 stars)
15. UltraFeedback — A large-scale, fine-grained, diverse preference dataset (and models). (Python, 247 stars)
16. ModelCenter — Efficient, Low-Resource, Distributed transformer implementation based on BMTrain (Python, 215 stars)
17. UltraEval — An open-source framework for evaluating foundation models. (Python, 155 stars)
18. RepoAgent — An LLM-powered repository agent designed to assist developers and teams in generating documentation and understanding repositories quickly. (Python, 148 stars)
19. InfiniteBench — A 100k+ long-context benchmark for large language models (paper upcoming) (Python, 105 stars)
20. DecT — Source code for the ACL 2023 paper "Decoder Tuning: Efficient Language Understanding as Decoding" (Python, 42 stars)
21. OlympiadBench — An Olympiad-level bilingual multimodal scientific benchmark featuring 8,952 questions from Olympiad-level mathematics and physics competitions, including the Chinese college entrance exam. (Python, 41 stars)
22. XAgent-doc — Documentation for XAgent. (19 stars)
23. BMInf-demos — BMInf demos. (JavaScript, 14 stars)
24. UltraLink — An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset (Python, 11 stars)
25. General-Model-License — (6 stars)