
OpenMoE

| Blog | Twitter | Discord |

OpenMoE is a project aimed at igniting the open-source MoE community! We are releasing a family of open-sourced Mixture-of-Experts (MoE) Large Language Models.

Since we are a small team working on a huge project, we cannot handle everything ourselves. Instead, we release intermediate checkpoints in this repo to invite more contributors to work on open-sourced MoE models together!

News

[2023/08] πŸ”₯ We released an intermediate OpenMoE-8B checkpoint (OpenMoE-v0.2) along with two other models. Check out the blog post.

TODO List

  • PyTorch Implementation with Colossal AI
  • More Evaluation
  • Continue Training to 1T tokens
  • Paper

Contents

  • Model Weights
  • Get Started
  • Approach
  • Evaluation
  • License
  • Authors
  • Citation

Model Weights

Currently, three models are released in total.

| Model Name | Description | #Param | GCS | Huggingface | Gin File |
| --- | --- | --- | --- | --- | --- |
| OpenMoE-base/16E | A small MoE model for debugging | 637M | gs://openmoe/openmoe-base/checkpoint_500000 | Link | Link |
| OpenLLaMA-base | A dense counterpart of OpenMoE-base | 310M | gs://openmoe/openllama-base/checkpoint_500000 | Link | Link |
| OpenMoE-8B/32E | An 8B MoE with FLOPs comparable to a 2B LLaMA | 8B | gs://openmoe/openmoe-8b/checkpoint_100000 | Link | Link |

We release all these checkpoints on Huggingface and Google Cloud Storage. For instance, you can download openmoe-8B with

gsutil cp -r gs://openmoe/openmoe-8b/checkpoint_100000 $YOUR_DIR

The base models are trained on 128B tokens. The OpenMoE-8B checkpoint, with 4 MoE layers and 32 experts, has been trained on 200B tokens so far. We are still training OpenMoE-8B, so if you are interested in the latest checkpoint, please feel free to drop Fuzhao an email ([email protected]). In addition, we are highly interested in training this model to saturation by performing multi-epoch training, which means we may train it on over 2T tokens (depending on the resources we can get in the coming months).

Note: downloading data from Google Cloud Storage is not free, but you can sign in to Google Cloud and get some credits.
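
If you prefer the Huggingface copies, the sketch below downloads a checkpoint with the huggingface_hub Python client; the repo id is a placeholder, so substitute the one linked in the model table above.

from huggingface_hub import snapshot_download

# Placeholder repo id: use the Huggingface link from the model table above.
local_path = snapshot_download(
    repo_id="openmoe/openmoe-8b",   # hypothetical id, not necessarily the real one
    local_dir="./openmoe-8b",       # where to store the checkpoint locally
)
print(local_path)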

Get Started

Training

Get a TPU VM and run the following code on all TPUs. Researchers can apply to the TPU Research Cloud (TRC) program to get TPU resources.

We are working on the PyTorch + GPU implementation with Colossal AI.

git clone https://github.com/XueFuzhao/OpenMoE.git
bash OpenMoE/script/run_pretrain.sh

Eval

Get a TPU VM and run the following code on all TPUs.

git clone https://github.com/XueFuzhao/OpenMoE.git
bash OpenMoE/script/run_eval.sh

Approach

Data

Our pretraining mixture is 50% RedPajama and 50% The Stack (deduplicated). We use a high ratio of code data to improve reasoning ability.
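
As a rough illustration of this 50/50 mixture (not the released T5x data pipeline), per-example source sampling could look like the sketch below; both iterator arguments are placeholders for tokenized example streams.

import random

# Illustration only: draw each pretraining example from a 50/50 mixture of
# RedPajama and The Stack (dedup). The two arguments are placeholder iterators
# over already-tokenized examples.
def mixed_stream(redpajama_iter, stack_dedup_iter, p_code=0.5, seed=0):
    rng = random.Random(seed)
    while True:
        source = stack_dedup_iter if rng.random() < p_code else redpajama_iter
        yield next(source)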

Tokenizer

We use the umT5 tokenizer to support multilingual continual learning in the future; it can be downloaded from Huggingface or Google Cloud.
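
For reference, the tokenizer can be loaded with the transformers library; "google/umt5-small" below is one public copy of the umT5 tokenizer and may differ slightly from the files we host, so treat it as an assumption.

from transformers import AutoTokenizer

# Requires a recent transformers release with umT5 support.
tokenizer = AutoTokenizer.from_pretrained("google/umt5-small")
ids = tokenizer("OpenMoE is a family of open MoE language models.").input_ids
print(ids)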

Model Architecture

OpenMoE is based on ST-MoE but uses a decoder-only architecture. The detailed implementation can be found in Fuzhao's T5x and Flaxformer repos.
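
For intuition only, here is a minimal PyTorch sketch of a token-choice top-2 MoE feed-forward block in the spirit of ST-MoE. This is not the released T5x/Flaxformer code: the top-2 routing, GELU experts, hidden sizes, and the absence of capacity limits and auxiliary losses are all simplifying assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Simplified top-2 token-choice MoE FFN (illustrative, not the released code)."""
    def __init__(self, d_model=768, d_ff=3072, num_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: [batch, seq, d_model]
        tokens = x.reshape(-1, x.shape[-1])     # flatten to [num_tokens, d_model]
        probs = F.softmax(self.router(tokens), dim=-1)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalize over chosen experts
        out = torch.zeros_like(tokens)
        # Send each token to its top-k experts and combine the outputs,
        # weighted by the renormalized router probabilities.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, k] == e
                if mask.any():
                    out[mask] += top_p[mask, k:k + 1] * expert(tokens[mask])
        return out.reshape_as(x)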

Training Objective

We use a modified UL2 training objective with a causal attention mask (we use more prefix LM and higher mask ratios because this saves computation); a small sketch of sampling this mixture follows the list:

  • 50% prefix LM
  • 10% span corruption, span length 3, mask ratio 0.15
  • 10% span corruption, span length 8, mask ratio 0.15
  • 10% span corruption, span length 3, mask ratio 0.5
  • 10% span corruption, span length 8, mask ratio 0.5
  • 10% span corruption, span length 64, mask ratio 0.5
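
As a small illustration of the mixture above (the real preprocessing lives in the T5x pipeline and is more involved), the per-example objective can be sampled like this:

import random

# Sample which UL2-style objective to apply to one training example,
# using the mixture ratios listed above.
OBJECTIVES = [
    ("prefix_lm", None, None, 0.50),
    ("span_corruption", 3, 0.15, 0.10),
    ("span_corruption", 8, 0.15, 0.10),
    ("span_corruption", 3, 0.50, 0.10),
    ("span_corruption", 8, 0.50, 0.10),
    ("span_corruption", 64, 0.50, 0.10),
]

def sample_objective(rng=random):
    r, acc = rng.random(), 0.0
    for name, span_len, mask_ratio, weight in OBJECTIVES:
        acc += weight
        if r < acc:
            return name, span_len, mask_ratio
    return OBJECTIVES[-1][:3]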

Other Designs

RoPE, SwiGLU activation, 2K context length. We will release a more detailed report soon.
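
For reference, a minimal PyTorch sketch of the SwiGLU feed-forward block is shown below; the dimensions are placeholders, not the released configuration.

import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU(x) = (SiLU(x W_gate) * (x W_up)) W_down, with placeholder sizes."""
    def __init__(self, d_model=768, d_ff=2048):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))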

Evaluation

We evaluate our model on BigBench-Lite as our first step. We plot the cost-effectiveness curve in the figure below.

Relative cost is approximated by multiplying the number of activated parameters by the number of training tokens. The size of each dot denotes the number of activated parameters per token, and the light-gray dots denote the total parameters of the MoE models. [Figure: BigBench-Lite cost-effectiveness plot]
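
As a back-of-the-envelope example of this metric, using approximate numbers from this README (roughly 2B activated parameters for OpenMoE-8B/32E and the 200B tokens seen by the released checkpoint):

# Rough illustration of the "relative cost" metric; not the exact values behind the plot.
activated_params = 2e9     # OpenMoE-8B/32E activates FLOPs comparable to a 2B dense LLaMA
training_tokens = 200e9    # tokens seen by the released openmoe-8b checkpoint
print(f"relative cost ~ {activated_params * training_tokens:.1e}")   # ~ 4.0e+20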

For more detailed results, please see our blog post.

License

Our code is released under the Apache 2.0 License.

Since the models are trained on the RedPajama and The Stack datasets, please check the licenses of these two datasets before using the models.

Authors

This project is currently contributed to by the following authors:

Citation

Please cite this repo if you use our model or code.

@misc{openmoe2023,
  author = {Fuzhao Xue and Zian Zheng and Yao Fu and Jinjie Ni and Zangwei Zheng and Wangchunshu Zhou and Yang You},
  title = {OpenMoE: Open Mixture-of-Experts Language Models},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/XueFuzhao/OpenMoE}},
}