  • Stars: 222
  • Rank: 173,327 (Top 4%)
  • License: MIT License
  • Created: about 1 year ago
  • Updated: 9 months ago

Repository Details

A collection of phenomena observed during the scaling of big foundation models, which may develop into consensuses, principles, or laws in the future

BM-Principles

🌟 Big models have proven their potential to lead to artificial general intelligence. However 😕, because they have developed so rapidly, people have not yet fully grasped the principles of understanding and training them. Therefore, in order to learn about big models together, we have decided to collect newly observed phenomena on big models and summarize them in this repository 📚 in the form of short entries. We hope this collection of phenomena observed during the scaling of big models may form future consensuses, principles, or patterns 📝.

The repository focuses on two aspects:

  • How: How to train powerful big models? 🚀
  • What: What properties are interesting for big models? 🤔

The repo is currently far from exhaustive. Let's work together to improve it! 💪

How: how to train a powerful big model?

  1. Scaling of Computation

    1. Training loss decreases predictably.

      • Training loss can be written as a smooth function of the number of model parameters and the amount of training data or computation; see the equation sketched below.

      Scaling Laws for Neural Language Models

      Scaling Laws for Autoregressive Generative Modeling
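
      As a concrete illustration, one common parametrization from the compute-optimal scaling literature writes the loss in terms of the parameter count $N$ and the number of training tokens $D$ (the symbols here are generic placeholders, not the papers' exact fitted values):

      $$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$

      Here $E$ is the irreducible loss and $A$, $B$, $\alpha$, $\beta$ are constants fitted on smaller training runs.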

    2. Compute-optimal language models.

      • Given a fixed computational budget, if we train an excessively large model, we can only run a very limited number of steps; on the other hand, if we train a model that is too small, its final loss will not be as good as that of a larger model. Therefore, for a fixed budget there exists an optimal model size and an optimal number of training tokens.
      • From previous experience, the optimal number of training tokens is roughly $20 N$, where $N$ is the number of model parameters; see the sketch below.

      Training Compute-Optimal Large Language Models
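
      A minimal sketch of this rule of thumb in code, assuming the common approximation $C \approx 6ND$ for training FLOPs (the function name and numbers are illustrative, not an official recipe):

      ```python
      def compute_optimal_budget(n_params: float, tokens_per_param: float = 20.0):
          """Rough compute-optimal budget from the ~20 tokens-per-parameter rule of thumb."""
          n_tokens = tokens_per_param * n_params   # D ~= 20 * N
          train_flops = 6.0 * n_params * n_tokens  # C ~= 6 * N * D
          return n_tokens, train_flops

      # Example: a 7B-parameter model.
      tokens, flops = compute_optimal_budget(7e9)
      print(f"~{tokens / 1e9:.0f}B tokens, ~{flops:.1e} training FLOPs")
      # -> ~140B tokens, ~5.9e+21 training FLOPs
      ```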

    3. LLMs don't converge at the compute-optimal number of tokens.

      • An LLM may keep improving its loss after passing the compute-optimal number of tokens.
      • From the training loss of LLaMA-7B and LLaMA-13B, we can see that the loss continues to improve beyond roughly 140B and 260B tokens, respectively.

      LLaMA: Open and Efficient Foundation Language Models

  2. Optimal Hyperparameters.

    1. The best batch size is a function of loss.

      • To reach a given loss, a large batch size requires more computation, while a small batch size requires more training steps (i.e., more time). The best batch size is a trade-off between the two; see the relation sketched below.
      • In the figure from the paper (not reproduced here), each diagonal line of points represents one training run: the horizontal axis is the number of training steps, the vertical axis is the number of processed tokens, and the color encodes the loss. The optimal batch size can be read off as the inflection point of each loss contour.

      Scaling Laws for Neural Language Models
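
      For reference, the scaling-laws paper models this trade-off with a "critical batch size" that grows as the loss falls, roughly of the form below (constants omitted; treat this as a sketch of the functional form rather than an exact fit):

      $$B_{\mathrm{crit}}(L) \approx \frac{B_*}{L^{1/\alpha_B}}$$

      Training far below $B_{\mathrm{crit}}$ wastes wall-clock time (more steps), while training far above it wastes compute (more tokens per unit of loss improvement).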

    2. Large batch sizes allow large learning rates.

      • Generally, a larger batch size allows a larger learning rate, and the larger learning rate converges faster; a common heuristic is sketched below.

      Don't Decay the Learning Rate, Increase the Batch Size
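
      A minimal sketch of the common linear-scaling heuristic (the base learning rate and base batch size below are illustrative assumptions, not values from the paper):

      ```python
      def scaled_lr(batch_size: int, base_lr: float = 3e-4, base_batch: int = 512) -> float:
          """Scale the learning rate linearly with batch size (a common heuristic)."""
          return base_lr * batch_size / base_batch

      print(scaled_lr(4096))  # -> 0.0024
      ```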

    3. The cosine scheduler is prevalent.

      • The cosine scheduler is the prevalent choice; it is better than the Noam scheduler at the same peak learning rate, since Noam decays more sharply.
      • Below is our experiment for CPM (figure not reproduced here).
    4. The cosine schedule's period should be set to the end step.

      • From 2.3, you might wonder whether keeping the learning rate high is good for training. It's not.
      • When you plan to train for $T$ steps, it's best to set the scheduler's period to exactly $T$, neither larger nor smaller; see the sketch below.
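
      A minimal sketch of a warmup-plus-cosine schedule whose period equals the planned total number of steps (the peak/minimum learning rates and warmup length are illustrative assumptions):

      ```python
      import math

      def lr_at_step(step: int, total_steps: int, peak_lr: float = 3e-4,
                     warmup_steps: int = 2000, min_lr: float = 3e-5) -> float:
          """Linear warmup, then one cosine decay whose period is exactly total_steps."""
          if step < warmup_steps:
              return peak_lr * step / warmup_steps
          progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)  # in [0, 1]
          return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

      # With total_steps set to the planned end step, the learning rate reaches min_lr
      # exactly when training ends, neither earlier nor later.
      total = 100_000
      print(lr_at_step(0, total), lr_at_step(50_000, total), lr_at_step(100_000, total))
      ```
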
  3. Predictable Scaling.

    1. The pass rate on HumanEval can be predicted with 1/10,000 of the compute.

      • It's important to forecast a model's ability before it is trained. OpenAI's GPT-4 report proposed the first version of predictable scaling: it estimates GPT-4's HumanEval pass rate by extrapolating from much smaller models trained with a small fraction of the compute. A generic sketch of this kind of extrapolation is given below.
      • Currently, there is no other public result on predicting downstream metrics for large models.
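
      A hedged sketch of the general idea (not OpenAI's actual procedure): fit a simple power law to small-compute runs in log-log space and extrapolate to the target compute. The data points below are made up purely for illustration.

      ```python
      import numpy as np

      # Hypothetical small-scale runs: (training compute in FLOPs, loss-like proxy metric).
      compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
      metric = np.array([3.2, 2.9, 2.6, 2.4, 2.2])

      # Fit metric ~= a * compute^b via linear regression in log-log space.
      b, log_a = np.polyfit(np.log(compute), np.log(metric), 1)

      def predict(c: float) -> float:
          return np.exp(log_a) * c ** b

      # Extrapolate roughly four orders of magnitude to the target training run.
      print(predict(1e24))
      ```
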
  4. Model Architecture

    1. A diverse range of architectures achieves a similar pre-training loss.

      Scaling Laws for Neural Language Models

    2. For downstream metrics, a DeepNarrow architecture (deeper and narrower) is preferred.

      Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

    3. Normalization has not reached a consensus, but pre-norm has been more popular recently.

      • Here we list the normalization techniques of publicly known models; a minimal code sketch of the two placements is given below.

      Model      Normalization
      Llama      Pre-norm
      GLM        Post-norm + DeepNorm
      Pythia     Post-norm
      BLOOM      Pre-norm
      StarCoder  Pre-norm

      DeepNet: Scaling Transformers to 1,000 Layers
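
      For illustration, a minimal sketch of the two placements in a Transformer residual block (simplified; "sublayer" stands for attention or the FFN, and this is not any specific model's implementation):

      ```python
      import torch
      import torch.nn as nn

      class PreNormBlock(nn.Module):
          """Pre-norm: normalize the sublayer's input, then add the residual."""
          def __init__(self, d_model: int, sublayer: nn.Module):
              super().__init__()
              self.norm = nn.LayerNorm(d_model)
              self.sublayer = sublayer

          def forward(self, x):
              return x + self.sublayer(self.norm(x))

      class PostNormBlock(nn.Module):
          """Post-norm: add the residual first, then normalize the sum."""
          def __init__(self, d_model: int, sublayer: nn.Module):
              super().__init__()
              self.norm = nn.LayerNorm(d_model)
              self.sublayer = sublayer

          def forward(self, x):
              return self.norm(x + self.sublayer(x))

      # Example with an FFN sublayer (dimensions are illustrative).
      d = 16
      ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
      x = torch.randn(2, 8, d)
      print(PreNormBlock(d, ffn)(x).shape, PostNormBlock(d, ffn)(x).shape)
      ```

      Pre-norm keeps the residual path close to the identity, which is commonly credited with more stable training of deep stacks; DeepNorm (see the DeepNet paper above) instead rescales the residual in a post-norm block to make very deep post-norm models trainable.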

  5. Data Mixture

    1. Diversity improves zero-shot generalization.

      • Diverse cross-domain pretraining data combining web crawls with curated high-quality sources significantly improves zero-shot generalization over pretraining datasets constructed from Common Crawl only.

      What Language Model to Train if You Have One Million GPU Hours?

    2. Data proportions are important.

      • Re-mixing the domain proportions of the Pile boosts convergence speed and downstream performance; see the sketch below.

      DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
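
      A minimal sketch of applying a set of domain mixture weights when sampling pre-training examples (the domain names and weights below are illustrative, not the DoReMi-optimized values, which the paper learns with a small proxy model):

      ```python
      import random

      # Illustrative domain weights; they must sum to 1.
      mixture = {"web": 0.45, "books": 0.15, "code": 0.20, "papers": 0.20}
      rng = random.Random(0)

      def sample_domain() -> str:
          """Pick the domain to draw the next training example from."""
          domains, weights = zip(*mixture.items())
          return rng.choices(domains, weights=weights, k=1)[0]

      print([sample_domain() for _ in range(5)])
      ```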

    3. Code might contribute to reasoning ability.

      • There is a widespread belief that pre-training on code yields stronger reasoning capabilities, but there is currently no quantitative verification.

      How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources

What: what properties are interesting for large models?

  1. Emergent abilities

    1. Emergent abilities are observed in models of roughly 50B parameters or larger.

      Emergent Abilities of Large Language Models

    2. Popular methods only work on large models.

      • Prompt tuning and delta tuning work well for models larger than 1B parameters.
      • In-context learning and chain-of-thought reasoning work for even larger models.

      Emergent Abilities of Large Language Models

      Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models

    3. Inverse (U-shaped) scaling

      • Some tasks' scaling curves exhibit a U-shape.
      • Several factors might contribute: distractor tasks, memorization, and misleading few-shot prompting.

      Inverse scaling can become U-shaped

      Inverse Scaling: When Bigger Isn't Better

  2. Training Dynamics.

    1. The double-descent phenomenon is observed.

      • There is a regime in which increasing the model size harms performance.
      • This closely resembles the inverse-scaling phenomenon.

      Deep Double Descent: Where Bigger Models and More Data Hurt

    2. The grokking phenomenon might contribute to generalization.

      • Overparameterized neural networks can show a sudden improvement in generalization long after they have overfit the training data.

      Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    3. Modularity emerges in LLMs.

      • Sparse activation has been observed in big models; a sketch of how to measure it is given below.
      • The sparsity of modules tends to form at an early stage of training.
      • The sparsity of neurons tends to form later.

      Emergent Modularity in Pre-trained Transformers
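
      A minimal sketch of measuring activation sparsity in a ReLU feed-forward layer (the threshold, dimensions, and random inputs are illustrative; this is not the paper's measurement protocol):

      ```python
      import torch
      import torch.nn as nn

      def activation_sparsity(hidden: torch.Tensor, threshold: float = 0.0) -> float:
          """Fraction of intermediate FFN activations at or below the threshold."""
          return (hidden <= threshold).float().mean().item()

      # Toy example: a ReLU FFN on random inputs.
      d_model, d_ff = 16, 64
      ffn_in = nn.Linear(d_model, d_ff)
      x = torch.randn(2, 8, d_model)
      hidden = torch.relu(ffn_in(x))
      print(f"sparsity: {activation_sparsity(hidden):.2f}")  # ~0.5 for random inputs
      ```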

More Repositories

1. ChatDev — Create Customized Software using Natural Language Idea (through LLM-powered Multi-Agent Collaboration) (Shell, 20,167 stars)
2. XAgent — An Autonomous LLM Agent for Complex Task Solving (Python, 7,583 stars)
3. ToolBench — [ICLR'24 spotlight] An open platform for training, serving, and evaluating large language models for tool learning. (Python, 4,457 stars)
4. MiniCPM — MiniCPM-2B: An end-side LLM that outperforms Llama2-13B. (Jupyter Notebook, 3,871 stars)
5. AgentVerse — 🤖 AgentVerse 🪐 is designed to facilitate the deployment of multiple LLM-based agents in various applications, primarily providing two frameworks: task-solving and simulation. (JavaScript, 3,695 stars)
6. BMTools — Tool Learning for Big Models, Open-Source Solutions of ChatGPT-Plugins (Python, 2,848 stars)
7. CPM-Bee — A 10-billion-parameter Chinese-English bilingual foundation model (Python, 2,658 stars)
8. MiniCPM-V — MiniCPM-V 2.0: An Efficient End-side MLLM with Strong OCR and Understanding Capabilities (Python, 1,406 stars)
9. VisCPM — [ICLR'24 spotlight] Chinese and English Multimodal Large Model Series (Chat and Paint), based on the CPM foundation models (Python, 984 stars)
10. ProAgent — An LLM-based Agent for the New Automation Paradigm: Agentic Process Automation (Python, 674 stars)
11. BMInf — Efficient Inference for Big Models (Python, 565 stars)
12. CPM-Live — Live Training for Open-source Big Models (Python, 510 stars)
13. BMTrain — Efficient Training (including pre-training and fine-tuning) for Big Models (Python, 507 stars)
14. BMList — A List of Big Models (Python, 339 stars)
15. UltraFeedback — A large-scale, fine-grained, diverse preference dataset (and models). (Python, 247 stars)
16. ModelCenter — Efficient, Low-Resource, Distributed transformer implementation based on BMTrain (Python, 215 stars)
17. UltraEval — An open-source framework for evaluating foundation models. (Python, 155 stars)
18. RepoAgent — An LLM-powered repository agent designed to assist developers and teams in generating documentation and understanding repositories quickly. (Python, 148 stars)
19. InfiniteBench — A 100k+ long-context benchmark for large language models (paper upcoming) (Python, 105 stars)
20. DecT — Source code for the ACL 2023 paper "Decoder Tuning: Efficient Language Understanding as Decoding" (Python, 42 stars)
21. OlympiadBench — An Olympiad-level bilingual multimodal scientific benchmark featuring 8,952 questions from Olympiad-level mathematics and physics competitions, including the Chinese college entrance exam. (Python, 41 stars)
22. XAgent-doc — Documentation for XAgent. (19 stars)
23. BMInf-demos — BMInf demos. (JavaScript, 14 stars)
24. UltraLink — An Open-Source Knowledge-Enhanced Multilingual Supervised Fine-tuning Dataset (Python, 11 stars)
25. General-Model-License — (6 stars)