NeMo Megatron launcher and tools

NeMo Framework

Open Beta

Scripts and code to provide end-to-end data preparation and training for NeMo Framework.

The most recent version of the README can be found at https://ngc.nvidia.com/containers/ea-bignlp:nemofw-training.

1. Model Overview

NeMo Framework allows developers to effectively train and scale language models to billions of parameters. With NeMo Framework, you can train different variants of GPT, Bert and T5 style models, and scale them to multiple nodes on NVIDIA DGX SuperPOD deployments. This deep learning (DL) software stack is optimized for DGX SuperPOD configurations using NVIDIA InfiniBand technology to provide efficient on-premises compute for training and inferring complex workloads.

The model parallelism techniques of NeMo Framework enable the efficient training of large models that do not fit in the memory of a single GPU. In the training tasks, tensor (intra-layer) and pipeline (inter-layer) model parallelism are adopted. Tensor model parallelism partitions individual transformer layers over multiple devices. Pipeline model parallelism stripes layers of a model over multiple devices. For more details, refer to this paper.
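To make the relationship between the parallelism degrees concrete, here is a minimal sketch. It assumes the usual Megatron-style layout, in which the total GPU count is the product of the tensor, pipeline, and data parallel sizes, and it assumes each pipeline stage holds a contiguous, equally sized block of layers. The function names are illustrative and are not part of NeMo Framework.

```python
def data_parallel_size(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Number of model replicas, assuming world_size = TP * PP * DP."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tensor_parallel * pipeline_parallel")
    return world_size // model_parallel


def layers_per_stage(num_layers: int, pipeline_parallel: int) -> list[range]:
    """Assign a contiguous block of layers to each pipeline stage (assumed layout)."""
    if num_layers % pipeline_parallel != 0:
        raise ValueError("num_layers must be divisible by pipeline_parallel")
    block = num_layers // pipeline_parallel
    return [range(stage * block, (stage + 1) * block) for stage in range(pipeline_parallel)]


# Example: 4 nodes with 8 GPUs each, tensor parallelism 2, pipeline parallelism 4.
dp = data_parallel_size(world_size=32, tensor_parallel=2, pipeline_parallel=4)
stages = layers_per_stage(num_layers=24, pipeline_parallel=4)
print("data parallel replicas:", dp)
for index, layers in enumerate(stages):
    print(f"stage {index}: layers {layers.start}-{layers.stop - 1}")
```

With these example values, each model replica spans 8 GPUs, so 32 GPUs hold 4 replicas, and each pipeline stage holds 6 of the 24 layers.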

Our latest techniques, sequence parallelism and selective activation recomputation, make training up to ~30% faster for GPT models ranging from 20B to 1T parameters. Sequence parallelism extends tensor-level model parallelism by noticing that the regions of a transformer layer that have not previously been parallelized are independent along the sequence dimension. Splitting these layers along the sequence dimension distributes the compute and, most importantly, the activation memory for these regions across the tensor parallel devices. Selective activation recomputation improves cases where memory constraints force us to recompute some, but not all, of the activations. For more details, refer to this paper.
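The independence along the sequence dimension can be illustrated with a small, framework-free sketch. It uses layer normalization as an example of an operation that acts on each token separately, so running it on sequence shards and concatenating the results gives the same output as running it on the full sequence. The helper names and the toy tensor are assumptions made for illustration, not NeMo Framework code.

```python
import math


def layer_norm(token, eps=1e-5):
    """Normalize one token vector over its hidden dimension."""
    mean = sum(token) / len(token)
    var = sum((x - mean) ** 2 for x in token) / len(token)
    return [(x - mean) / math.sqrt(var + eps) for x in token]


def layer_norm_sequence(tokens):
    """Apply layer norm to every token; no token depends on any other token."""
    return [layer_norm(token) for token in tokens]


def split_sequence(tokens, tp_size):
    """Split the sequence into equal contiguous shards, one per tensor parallel rank."""
    if len(tokens) % tp_size != 0:
        raise ValueError("sequence length must be divisible by tp_size")
    chunk = len(tokens) // tp_size
    return [tokens[rank * chunk:(rank + 1) * chunk] for rank in range(tp_size)]


# Toy activations: 8 tokens with a hidden size of 4.
tokens = [[float(i * j + j) for j in range(4)] for i in range(8)]

full = layer_norm_sequence(tokens)
shards = split_sequence(tokens, tp_size=2)
# Each rank normalizes only its own shard, then the shards are gathered in order.
gathered = [row for shard in shards for row in layer_norm_sequence(shard)]

print(gathered == full)
print("tokens held per rank:", len(shards[0]), "of", len(tokens))
```

Each rank stores activations for only its share of the sequence, which is where the activation memory saving comes from.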

GPT architecture

Figure 1: The GPT family architecture. The 5B variant includes 24 transformer layers, a hidden size of 4096, and 32 attention heads. The sequence length is 2048, and the optimizer is Adam. This variant uses tensor parallelism of 2.
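As a rough sanity check on the 5B figure in the caption, the sketch below estimates the parameter count from the listed hyperparameters. It relies on the common approximation of 12 * layers * hidden_size^2 weights in the transformer blocks (attention plus an MLP with 4x expansion, ignoring biases and layer norms). The vocabulary size of 50257 (the GPT-2 BPE vocabulary) and the use of learned position embeddings are assumptions, not values stated above.

```python
def gpt_parameter_estimate(num_layers, hidden_size, vocab_size, seq_length):
    """Approximate GPT parameter count; biases and layer norm weights are ignored."""
    # About 4*h^2 weights for attention and 8*h^2 for the MLP in every layer.
    transformer_blocks = 12 * num_layers * hidden_size ** 2
    token_embeddings = vocab_size * hidden_size
    position_embeddings = seq_length * hidden_size  # assumes learned position embeddings
    return transformer_blocks + token_embeddings + position_embeddings


total = gpt_parameter_estimate(
    num_layers=24,
    hidden_size=4096,
    vocab_size=50257,   # assumption: GPT-2 BPE vocabulary size
    seq_length=2048,
)
head_dim = 4096 // 32  # hidden size divided by the number of attention heads

print(f"approximately {total / 1e9:.2f}B parameters")
print("attention head dimension:", head_dim)
```

Under these assumptions the estimate comes out at about 5.05 billion parameters, which is consistent with the 5B label.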

2. Feature Matrix

2.1. GPT Models

Feature | Training | Inference
Data parallelism | Yes | N/A
Tensor parallelism | Yes | Yes
Pipeline parallelism | Yes | Yes (Megatron-LM checkpoints)
Interleaved Pipeline Parallelism Schedule | Yes | N/A
Sequence parallelism | Yes | No
Selective activation checkpointing | Yes | No
Gradient checkpointing | Yes | N/A
Partial gradient checkpointing | Yes | N/A
FP32/TF32 | Yes | Yes (FP16 enabled by default)
AMP/FP16 | No | Yes
BF16 | Yes | Yes
TransformerEngine/FP8 | Yes | No
Multi-GPU | Yes | Yes
Multi-Node | Yes | Yes
Inference deployment | N/A | NVIDIA Triton supported, Faster Transformer
SW stack support | Slurm DeepOps/Base Command Manager/Base Command Platform | Slurm DeepOps/Base Command Manager/Base Command Platform
Distributed data preprocessing | Yes (the Pile only) | N/A
NVfuser | No | N/A
P-Tuning and Prompt Tuning | Yes | N/A
IA3 and Adapter learning | Yes | N/A
Distributed Optimizer | Yes | N/A

2.2. T5 and mT5 Models

Feature | Training | Inference
Data parallelism | Yes | N/A
Tensor parallelism | Yes | No
Pipeline parallelism | Yes | No
Sequence parallelism | No | No
Selective activation checkpointing | No | No
Gradient checkpointing | Yes | N/A
Partial gradient checkpointing | Yes | N/A
FP32/TF32 | Yes | No
AMP/FP16 | No | No
BF16 | Yes | No
Multi-GPU | Yes | No
Multi-Node | Yes | No
Inference deployment | N/A | No
SW stack support | Slurm DeepOps/Base Command Manager/Base Command Platform | No
Distributed data preprocessing | Yes (the Pile dataset for T5, mC4 dataset for mT5) | N/A
NVfuser | No | N/A
P-Tuning and Prompt Tuning | Yes | N/A
IA3 and Adapter learning | Yes | N/A
AutoConfigurator | Yes | N/A
Distributed Optimizer | Yes | N/A
Mixture of Experts | Yes (no expert parallelism) | N/A

2.3. BERT Models

Feature | Training | Inference
Data parallelism | Yes | N/A
Tensor parallelism | Yes | N/A
Pipeline parallelism | Yes | N/A
Sequence parallelism | Yes | N/A
Selective activation checkpointing | Yes | N/A
Gradient checkpointing | Yes | N/A
Partial gradient checkpointing | Yes | N/A
FP32/TF32 | Yes | N/A
AMP/FP16 | No | N/A
BF16 | Yes | N/A
Multi-GPU | Yes | N/A
Multi-Node | Yes | N/A
Inference deployment | N/A | N/A
SW stack support | Slurm DeepOps/Base Command Manager/Base Command Platform | N/A
Distributed data preprocessing | Yes (the Pile only) | N/A
NVfuser | Yes | N/A
P-Tuning and Prompt Tuning | N/A | N/A
IA3 and Adapter learning | N/A | N/A
Distributed Optimizer | Yes | N/A

3. Setup

3.1. Support Matrix

Software | Version
NVIDIA Triton | 2.24.0
FasterTransformer | v5.3+c6e8f60
TransformerEngine | v0.8+8e5f00f
PyTorch | 2.1.0a0+fe05266
NeMo | 1.19.0+913e5e5
PyTorch Lightning | 1.9.4
Hydra | 1.2.0
CUDA | NVIDIA CUDA 12.1
cuBLAS | 12.1.3.1
cuDNN | 8.9.0.131
NCCL | 2.17.1
Container OS | Ubuntu 20.04
rdma-core | 36.0
GDRcopy | 2.3
HPC-X | 2.13
Base Command Manager | 1.0.0
DeepOps | 21.06

4. Cloud Service Providers

4.1. Cluster Bring-Up

4.1.1. Common

To set up a Slurm cluster for NeMo Framework, we recommend using Nephele. This cluster deployment tool has been tested on Azure, AWS, and Oracle Cloud. We recommend hosting Nephele on a new VM instance in the CSP of your choice. To get started:

  • Clone the Nephele repo
  • Install the dependencies
  • Modify nephele.conf
    • Add your CSP credentials
    • Change REPLICAS_x8a100 to the number of nodes in your desired cluster

You can then run ./nephele init and ./nephele create.

We also recommend mounting an external persistent NFS once the cluster is up and running (ensure it is mounted on all nodes) and using this to configure and run NeMo Framework.

The above steps apply to all CSPs, including Azure, AWS, and OCI. Some modifications are necessary for OCI and AWS and are detailed below. Note that for OCI, a custom image must be imported, which should be done before running ./nephele create.

4.1.2. OCI

NeMo Framework supports running training and inference containers on OCI. For more details about orchestration scripts, reach out to [email protected]

4.1.3. AWS

To launch jobs on AWS, the EFA driver and NCCL plugin first need to be installed on top of the training container. We recommend building a new container image with Docker, then creating an Enroot image.

On the scheduler node:

  • Install Docker
  • Build the image with EFA drivers and NCCL plugin from csp_tools/aws/Dockerfile
  • Run this command on the Docker image to create an Enroot image:
    enroot import --output nemo_megatron_training.sqsh dockerd://<image name>:<tag>
  • Move the .sqsh file to the root of NeMo-Megatron-Launcher
  • Set the container path in launcher_scripts/conf/config.yaml to the new Enroot image:
container: /path/to/nemo_megatron_launcher/nemo_megatron_training.sqsh

4.2. Cluster Validation

Before running the cluster validation script, ensure your NGC credentials have been added to ~/.config/enroot/.credentials on all nodes.

The cluster validation script at csp_tools/<csp>/cluster_validation.sh will run GPU diagnostics and test NCCL node-to-node bus bandwidth. The logs from these tests will be stored at results/cluster_validation. The script will list any nodes that fail these tests. These nodes should be replaced or restarted through the CSP UI.

4.2.1. Validation Script Usage

The script has 3 required parameters:

  • --nodes: the number of nodes
  • --nodelist: the list of node names
  • --partition: the Slurm partition the nodes are assigned to

The values for these parameters should use the same format as the output of sinfo. For example, given the following sinfo output:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
x8a100       up   infinite      8   idle x8a100-[0000-0007]

To test all 8 idle nodes, the script would be run as:

bash cluster_validation.sh --nodes=8 --nodelist=x8a100-[0000-0007] --partition=x8a100

By default, the script will run both the GPU diagnostics and the NCCL test. You can choose to run only one or the other by specifying:

  • --dcgm: run GPU diagnostics only
  • --nccl: run NCCL test only

See bash cluster_validation.sh -h for more information.
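The mapping from an sinfo row to the three required parameters can be sketched in a few lines of Python. The helper name is hypothetical, and the sketch assumes the default six-column sinfo layout shown in the example above.

```python
def validation_command(sinfo_row: str) -> list[str]:
    """Build the cluster_validation.sh argument list from one row of sinfo output."""
    # Default sinfo columns: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
    partition, _avail, _timelimit, nodes, _state, nodelist = sinfo_row.split()
    partition = partition.rstrip("*")  # sinfo marks the default partition with '*'
    return [
        "bash",
        "cluster_validation.sh",
        f"--nodes={nodes}",
        f"--nodelist={nodelist}",
        f"--partition={partition}",
    ]


row = "x8a100       up   infinite      8   idle x8a100-[0000-0007]"
command = validation_command(row)
print(" ".join(command))
```

Passing the resulting list to subprocess.run (without shell=True) avoids any shell interpretation of the square brackets in the node list.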

4.2.2. Running Tests Manually

The cluster_validation.sh script is essentially a wrapper of the 2 Slurm job scripts in the CSP directories. If you prefer, you can run these jobs manually. Make sure to use the Slurm job script in your corresponding CSP's path (csp_tools/<csp>/dcgmi_diag.sh and csp_tools/<csp>/nccl.sh).

For the GPU diagnostics job, provide these arguments when submitting the job to Slurm:

sbatch -p <partition> -w <node list> -o <job log file> dcgmi_diag.sh

For the NCCL test job, cluster_validation.sh performs a pair-wise sweep of the nodes, as this is a sufficient test, but you can test with a different number of nodes if desired.

First build the test binaries:

sbatch -N 1 build-nccl-tests.sh

Then, to run a 2-node all_reduce_perf job:

sbatch -w <node 1>,<node 2> -o <job log file> nccl.sh

To run the job with more nodes, simply add the node names to the -w flag in the same comma-separated list format.
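A pair-wise sweep can be scripted along the following lines. This is only a sketch: it pairs adjacent nodes, which is one plausible strategy and not necessarily the pairing cluster_validation.sh uses, and the log file naming is hypothetical. The range expansion handles only the simple prefix-[first-last] form shown earlier.

```python
import re


def expand_nodelist(nodelist: str) -> list[str]:
    """Expand a simple Slurm range such as 'x8a100-[0000-0007]' into node names."""
    match = re.fullmatch(r"(.+)\[(\d+)-(\d+)\]", nodelist)
    if match is None:
        return nodelist.split(",")  # treat the input as a comma-separated list of names
    prefix, first, last = match.groups()
    width = len(first)  # keep the zero padding of the original range
    return [f"{prefix}{index:0{width}d}" for index in range(int(first), int(last) + 1)]


def pairwise_nccl_jobs(nodes: list[str], log_dir: str = "results/cluster_validation") -> list[list[str]]:
    """Build one sbatch command per pair of adjacent nodes (an odd last node is left out)."""
    jobs = []
    for first, second in zip(nodes[0::2], nodes[1::2]):
        log_file = f"{log_dir}/nccl_{first}_{second}.log"  # hypothetical naming scheme
        jobs.append(["sbatch", "-w", f"{first},{second}", "-o", log_file, "nccl.sh"])
    return jobs


nodes = expand_nodelist("x8a100-[0000-0007]")
jobs = pairwise_nccl_jobs(nodes)
for job in jobs:
    print(" ".join(job))
```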

4.3. Config Modifications

Before launching jobs, make the following changes to the config.

4.3.1. Set NCCL Topology

The NCCL topology file is unique for each CSP and can be found in the corresponding folder (csp_tools/<csp>/topo.xml).

In launcher_scripts/conf/config.yaml, mount the directory containing the topology file:

container_mounts:
  - /path/to/nemo_megatron_launcher/csp_tools/<csp>/:/nccl

Then set the path of the file in the container:

env_vars:
    NCCL_TOPO_FILE: /nccl/topo.xml

4.3.2. Environment Variables

4.3.2.1. Azure Variables

Set these environment variables in config.yaml (these are only needed for Azure):

env_vars:
  UCX_IB_PCI_RELAXED_ORDERING: auto
  NCCL_IB_PCI_RELAXED_ORDERING: 2
  NCCL_IB_TIMEOUT: 22
  NCCL_DEBUG: INFO

4.3.2.2. AWS Variables

AWS recommends setting the following flag to avoid data corruption:

env_vars:
  NCCL_PROTO: simple

Setting this flag reduces training throughput by roughly 2%.
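The per-CSP settings from this section can be collected in one place, for example when generating config.yaml for several clusters. The sketch below only restates the values listed above. The helper names are hypothetical, and the empty entry for OCI reflects that this section lists no OCI-specific variables.

```python
# Set for every CSP once the topology directory is mounted at /nccl.
COMMON_ENV = {"NCCL_TOPO_FILE": "/nccl/topo.xml"}

# CSP-specific variables, as listed in this section.
CSP_ENV = {
    "azure": {
        "UCX_IB_PCI_RELAXED_ORDERING": "auto",
        "NCCL_IB_PCI_RELAXED_ORDERING": 2,
        "NCCL_IB_TIMEOUT": 22,
        "NCCL_DEBUG": "INFO",
    },
    "aws": {"NCCL_PROTO": "simple"},
    "oci": {},  # no OCI-specific variables are listed in this section
}


def env_vars_for(csp: str) -> dict:
    """Merge the common and CSP-specific environment variables."""
    if csp not in CSP_ENV:
        raise ValueError(f"unknown CSP: {csp}")
    return {**COMMON_ENV, **CSP_ENV[csp]}


def to_yaml(env: dict) -> str:
    """Render the variables as an env_vars block for config.yaml."""
    lines = ["env_vars:"] + [f"  {name}: {value}" for name, value in env.items()]
    return "\n".join(lines)


aws_block = to_yaml(env_vars_for("aws"))
print(aws_block)
```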

5. Quick Start Guide

5.1. Training NeMo Framework Models

5.1.1. Prepare Environment

NOTE: Ensure the high-speed filesystem is mounted on the job submission node(s) at the same path as on the compute nodes.

The whole solution uses a set of Docker containers executed on a Slurm cluster (using the pyxis plug-in) or a Base Command Platform cluster. The training container also includes conversion scripts. The inference container comprises the NVIDIA Triton Inference Server with the FasterTransformer backend installed.

5.1.1.1. Slurm

The NeMo Framework codebase is included as part of the training container. To use it on the cluster, first extract it from the container. The following command copies the code to a local directory named /path/to/local/dir. The NeMo Framework repository for Slurm has been verified on both Slurm-based DeepOps clusters and Base Command Manager.

srun -p [partition] -N 1 --container-mounts=/path/to/local/dir:/workspace/mount_dir --container-image=[container_tag] bash -c "cp -r /opt/NeMo-Megatron-Launcher/launcher_scripts /opt/NeMo-Megatron-Launcher/auto_configurator /opt/FasterTransformer /opt/nemo-data-curator /opt/nemo-rlhf /workspace/mount_dir/"

Install the NeMo Framework scripts dependencies on the head node of the cluster:

pip install -r requirements.txt

You can use virtualenv to prevent polluting your head node environment for other Python projects. If your configuration lacks pip, you can install it using get_pip.py with just python3.

5.1.1.2. Base Command Platform

The nemo_megatron_launcher codebase is included as part of the training container. Before starting, set up the ngc cli and configuration as described in the Base Command Platform User Guide. In this guide, we will mainly use two Base Command Platform workspaces, one for storing the training dataset, and another for storing the results, checkpoints and logs. Therefore, start by creating these workspaces (e.g. nemo_megatron_data_ws and nemo_megatron_results_ws). See the Base Command Platform User Guide for how to create and work with Base Command Platform workspaces.

5.1.1.3. General Configuration

The first parameter that must be set is the launcher_scripts_path parameter inside the conf/config.yaml file. This parameter must point to the absolute path where the nemo_megatron_launcher repository is stored in the file system.
Additionally, if using a Slurm-based cluster, the conf/cluster/bcm.yaml file contains the parameters for generic cluster information, such as the partition or account.

The NUMA mapping can also be configured from the conf/config.yaml file. The mapping should be automatic; the code will read the number of CPU cores available in your cluster, and provide the best possible mapping, to maximize performance. The mapping is enabled by default, but it can be disabled by setting enable: False in the numa_mapping section of the conf/config.yaml file. The type of mapping can also be configured using the same file. See the full config parameters below:

numa_mapping:
  enable: True  # Set to False to disable all mapping (performance will suffer).
  mode: unique_contiguous  # One of: all, single, single_unique, unique_interleaved or unique_contiguous.
  scope: node  # Either node or socket.
  cores: all_logical  # Either all_logical or single_logical.
  balanced: True  # Whether to assign an equal number of physical cores to each process.
  min_cores: 1  # Minimum number of physical cores per process.
  max_cores: 8  # Maximum number of physical cores per process. Can be null to use all available cores.

Interactive: In order to run the launcher in an interactive job or locally on a workstation, set cluster_type=interactive in conf/config.yaml.

Slurm: The launcher_scripts_path parameter will automatically be mounted to the container at the same path as in the local file system. Any additional directories that should be mounted must be specified using the container_mounts parameter. If the paths contain the colon character (:), the code will assume both the source and destination paths are provided. Otherwise, the given paths will be mounted to the same path inside the container. The data_dir parameter can also be modified to point to where the dataset will be loaded from or saved. The base_results_dir can also be modified to point to where the results, checkpoints and logs will be stored. These last two parameters will be automatically mounted into the container. The parameters cluster and cluster_type must be set to bcm for all the tasks.
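The mounting rules described in this paragraph can be sketched as follows. The function is a hypothetical illustration of the rules as stated, not the launcher's actual code, and the example paths are placeholders.

```python
def resolve_mounts(launcher_scripts_path, data_dir, base_results_dir, container_mounts):
    """Return (source, destination) pairs following the rules described above."""
    # These three paths are mounted automatically at the same path inside the container.
    mounts = [(path, path) for path in (launcher_scripts_path, data_dir, base_results_dir)]
    for entry in container_mounts:
        if ":" in entry:
            # A colon means both the source and the destination were provided.
            source, destination = entry.split(":", 1)
        else:
            # Otherwise the path is mounted to the same location in the container.
            source = destination = entry
        mounts.append((source, destination))
    return mounts


mounts = resolve_mounts(
    launcher_scripts_path="/path/to/launcher_scripts",
    data_dir="/path/to/data",
    base_results_dir="/path/to/results",
    container_mounts=["/path/to/csp_tools/aws/:/nccl", "/path/to/shared"],
)
for source, destination in mounts:
    print(f"{source} -> {destination}")
```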

Base Command Platform: The launcher_scripts_path should be set to /opt/NeMo-Megatron-Launcher/launcher_scripts , which is the default location where the scripts are located inside the container. The data_dir parameter can also be modified to point to where the dataset will be loaded from or saved. The base_results_dir can also be modified to point to where the results, checkpoints and logs will be stored. In the case of Base Command Platform, we recommend that data_dir points to one of the workspaces, and base_results_dir points to the other. They should both be mounted in read and write (RW) mode. The parameter cluster_type must be set to bcp for all the tasks.

main.py is the main file that needs to be executed to run the data preparation, training, conversion, fine-tuning, and evaluation pipelines. Each of these pipelines has a parameter in the conf/config.yaml file that decides whether to run that pipeline or not. In Slurm-based clusters, all of them can be set to True at the same time, and they will be executed in order. However, in Base Command Platform, only one of them should be set to True at a time.

Settings for GPT Models: Default settings for GPT models are in the conf/config.yaml file:

stages:
  - data_preparation
  - training
  - conversion
  - evaluation
  - export

Settings for T5 Models: Default settings for T5 models are in the conf/config.yaml file:

# default values:
cluster: bcm  # Leave it as bcm even if using bcp. It will be ignored for bcp.
data_preparation: t5/download_t5_pile
training: t5/220m
conversion: t5/convert_t5
fine_tuning: t5/squad
evaluation: t5/squad
export: t5/export_t5

stages:
  - data_preparation
  - training
  - conversion
  - fine_tuning
  - prompt_learning
  - evaluation
  - export

Settings for mT5 Models: Default settings for mT5 models are in the conf/config.yaml file:

# default values:
cluster: bcm  # Leave it as bcm even if using bcp. It will be ignored for bcp.
data_preparation: mt5/download_mc4
training: mt5/390m
conversion: mt5/convert_mt5
fine_tuning: mt5/xquad
evaluation: mt5/xquad
export: mt5/export_mt5

stages:
  - data_preparation
  - training
  - conversion
  - fine_tuning
  - prompt_learning
  - evaluation
  - export

Settings for BERT Models: Default settings for BERT models are in the conf/config.yaml file:

# default values:
cluster: bcm  # Leave it as bcm even if using bcp. It will be ignored for bcp.
data_preparation: bert/download_bert_pile
training: bert/4b

stages:
  - data_preparation
  - training

To run these pipelines execute:

python3 main.py

The entire repository uses hydra/omegaconf to handle job configuration using YAML files, so look at the documentation for those projects to learn more.
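Because the configuration is handled by Hydra, any value in the YAML files can also be overridden on the command line with key=value arguments. The sketch below assembles such a command line. The helper is hypothetical, and it assumes that stages should be listed in the pipeline order shown in the settings above.

```python
import shlex

# Stage names as they appear in the stages lists above, in pipeline order.
KNOWN_STAGES = [
    "data_preparation",
    "training",
    "conversion",
    "fine_tuning",
    "prompt_learning",
    "evaluation",
    "export",
]


def launcher_command(stages, overrides=None):
    """Build the main.py argument list with Hydra-style key=value overrides."""
    unknown = [stage for stage in stages if stage not in KNOWN_STAGES]
    if unknown:
        raise ValueError(f"unknown stages: {unknown}")
    ordered = [stage for stage in KNOWN_STAGES if stage in stages]
    args = ["python3", "main.py", f"stages=[{','.join(ordered)}]"]
    args += [f"{key}={value}" for key, value in (overrides or {}).items()]
    return args


command = launcher_command(
    ["training", "data_preparation"],
    {"cluster_type": "bcm", "data_preparation.file_numbers": "0"},
)
# shlex.join quotes the bracketed list so a shell does not interpret it.
print(shlex.join(command))
```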

5.1.2. Data Preparation

The Pile: We provide utilities to download and prepare the Pile dataset (mirror), which is formed by 22 smaller datasets. The dataset is already blended by using the mix described in their paper. It is recommended to store this repository and the datasets in a file system shared by all the nodes (gpfs) in the case of Slurm based clusters, and in a shared workspace with RW permissions in the case of Base Command Platform based clusters.

The Pile dataset consists of 30 shards. Downloading, extracting, and preprocessing each file takes approximately 1 hour assuming a 30 MB/s download speed. The data preparation can be parallelized by using up to 30 nodes.
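The estimate above translates into a simple wall-time calculation. The sketch assumes each node processes one shard at a time and that shards are independent, so the total time is the number of rounds multiplied by the roughly one hour per shard. The function name is illustrative.

```python
import math


def pile_prep_hours(nodes: int, num_shards: int = 30, hours_per_shard: float = 1.0) -> float:
    """Estimate wall time when each node handles one shard at a time."""
    if nodes < 1:
        raise ValueError("at least one node is required")
    usable_nodes = min(nodes, num_shards)  # nodes beyond the shard count stay idle
    rounds = math.ceil(num_shards / usable_nodes)
    return rounds * hours_per_shard


for node_count in (1, 8, 30):
    print(node_count, "nodes:", pile_prep_hours(node_count), "hours")
```

Under these assumptions, a single node needs about 30 hours, 8 nodes need about 4 hours, and 30 nodes finish in about 1 hour.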

mC4: We provide utilities to download and prepare mC4 dataset (allen-ai version). Multilingual C4 (mC4) has 101 languages and is generated from 71 Common Crawl dumps. It is recommended to store this repository and the datasets in a file system shared by all the nodes (gpfs) in the case of Slurm based clusters, and in a shared workspace with RW permissions in the case of Base Command Platform based clusters.

Our scripts let users choose any subset of the 101 languages to download and preprocess. We curated 24 languages as the default language list. The raw size of the default languages is around 5 TB. The download and preprocessing scripts are parallelized: they automatically distribute and balance the work on multi-node systems, which provides a significant speedup. Downloading and preprocessing the default language list takes approximately 7 hours, assuming a 30 MB/s download speed and parallelization across 20 nodes. The preprocessed dataset has a size of around 12 TB. We recommend using a file system with more than 20 TB of storage to prepare the data.

Currently, we don't support training with more than 25 languages, see [Known Issues].
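A language selection can be checked against this limit before launching a job. The sketch below is a hypothetical helper. It uses the 24-language default list from the mC4 data preparation config and treats 25 as the maximum number of languages supported for training, as stated above.

```python
# Default language list from the mC4 data preparation config.
DEFAULT_LANGUAGES = "cs,da,de,el,en,es,fi,fr,hi,hu,it,ja,ko,lt,lv,nl,no,pl,pt,ro,ru,sk,sv,zh"
MAX_TRAINING_LANGUAGES = 25


def parse_languages(value: str) -> list[str]:
    """Split a comma-separated language string and reject duplicates."""
    languages = [code.strip() for code in value.split(",") if code.strip()]
    if len(set(languages)) != len(languages):
        raise ValueError("duplicate language codes")
    return languages


def within_training_limit(languages: list[str]) -> bool:
    """True if the selection does not exceed the supported number of languages."""
    return len(languages) <= MAX_TRAINING_LANGUAGES


defaults = parse_languages(DEFAULT_LANGUAGES)
print(len(defaults), "languages, within limit:", within_training_limit(defaults))
```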

The configuration used for data preparation for the Pile dataset or mC4 dataset must be specified in the conf/config.yaml file and data_preparation must be included in stages to run it.

5.1.2.1. Data Preparation for GPT Models

The data_preparation parameter in conf/config.yaml specifies which file to use for data preparation configuration purposes. The default value is set to download_gpt3_pile, which can be found in conf/data_preparation/download_gpt3_pile.yaml. It is used to download, extract, and preprocess the Pile dataset for GPT models. The parameters can be modified to perform the different tasks and to decide where to store the datasets, vocab, and merge files.

To download a reduced portion of the dataset to run tests, the file_numbers parameter can be updated to download only one of the shards by changing “0-29” to “0” (the syntax must be a combination of numbers separated by dashes "-" or commas ","). For example, file_numbers="0,3,5-7" will download and prepare files 0, 3, 5, 6, and 7.
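The file_numbers syntax can be expanded with a few lines of Python. This is a sketch of the syntax as described above, not the launcher's own parser, and the function name is hypothetical.

```python
def parse_file_numbers(spec: str) -> list[int]:
    """Expand a file_numbers string such as '0,3,5-7' into a list of shard indices."""
    numbers = []
    for part in spec.split(","):
        if "-" in part:
            # A dash denotes an inclusive range, for example 5-7 means 5, 6 and 7.
            start, end = part.split("-")
            numbers.extend(range(int(start), int(end) + 1))
        else:
            numbers.append(int(part))
    return numbers


print(parse_file_numbers("0,3,5-7"))
print(len(parse_file_numbers("0-29")), "files in the full dataset")
```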

5.1.2.1.1. Slurm

First, ensure the cluster related configuration in the conf/cluster/bcm.yaml file is correct. The cluster and cluster_type parameters in conf/config.yaml must be set to bcm. Then, modify the time_limit or any other parameter related to the job in the download_gpt3_pile.yaml file for GPT models. The data preparation can be parallelized by using up to 30 nodes to download all 30 files in parallel.

Example:

To run only the data preparation pipeline and not the training, evaluation or inference pipelines, set the conf/config.yaml file to:

stages:
  - data_preparation

And then run:

python3 main.py

5.1.2.1.2. Base Command Platform

In order to run the data preparation script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. By default, the data preparation script will download the data into the data/ directory. We recommend that the data_dir parameter is set to a workspace, so that the data is visible across multiple jobs later on. The vocab and merge files should also be stored to the same workspace as the dataset, for later usage. The data preparation code must be launched in a multi-node job. It can be parallelized to use between 2 and 30 nodes for faster preparation of the dataset.

With Base Command Platform, the 700+ GB dataset can be downloaded once and then shared by multiple users in the same ACE by setting the permissions of the nemo_megatron_data_ws workspace.

To run the data preparation pipeline for GPT models, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results data_preparation.file_numbers='0-29' \
data_preparation.vocab_save_dir=/mount/data/bpe data_preparation.merges_save_dir=/mount/data/bpe >> /results/data_gpt3_log.txt 2>&1

The command above assumes you want to prepare the entire dataset (files 0-29), and you mounted the data workspace in /mount/data, and the results workspace in /mount/results. Stdout and stderr are redirected to the /results/data_gpt3_log.txt file, so it can be downloaded from NGC. Any other parameter can also be added to the command to modify its behavior.

5.1.2.1.3. Common

Set the configuration for the data preparation job for GPT models in the YAML file:

run:
  name: download_gpt3_pile
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"
  node_array_size: 30
  array: ${..file_numbers}
  bcp_preproc_npernode: 2 # 2 should be safe to use and is 2x faster.

dataset: pile
download_the_pile: True  # Whether to download the pile dataset from the internet.
the_pile_url: "https://mystic.the-eye.eu/public/AI/pile/train/"  # Source URL to download The Pile dataset from.
file_numbers: "0-29"  # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True  # True to preprocess the data from a jsonl file, False otherwise.
download_vocab_url: "https://huggingface.co/gpt2/resolve/main/vocab.json"  # URL to download the vocab from.
download_merges_url: "https://huggingface.co/gpt2/resolve/main/merges.txt"  # URL to download the merges from.
vocab_save_dir: ${data_dir}/bpe
merges_save_dir: ${data_dir}/bpe
tokenizer_type: GPT2BPETokenizer
rm_downloaded: True # Extract script will remove downloaded zst after extraction
rm_extracted: True # Preprocess script will remove extracted files after preproc.

5.1.2.2. Data Preparation for T5 Models

The data_preparation parameter in conf/config.yaml specifies which file to use for data preparation configuration purposes. The data_preparation parameter needs to be specified as t5/download_t5_pile for preparing the Pile dataset for T5 models. The config file can be found in conf/data_preparation/t5/download_t5_pile.yaml. GPT models and T5 models use different tokenizer and vocab files. The default parameters can be found in the corresponding config files.

To download a reduced portion of the dataset to run tests, the file_numbers parameter can be updated to download only one of the shards by changing “0-29” to “0” (the syntax must be a combination of numbers separated by dashes "-" or commas ","). For example, file_numbers="0,3,5-7" will download and prepare files 0, 3, 5, 6, and 7.

5.1.2.2.1. Slurm

First, ensure the cluster configuration settings in the conf/cluster/bcm.yaml file are correct. The cluster and cluster_type parameters in conf/config.yaml must be set to bcm. Then, modify the time_limit or any other parameter related to the job in the t5/download_t5_pile.yaml file for T5 models. The data preparation can be parallelized by using up to 30 nodes to download all 30 files in parallel.

Example:

To run only the data preparation pipeline and not the training, evaluation or inference pipelines, set the conf/config.yaml file to:

stages:
  - data_preparation

And then run:

python3 main.py

5.1.2.2.2. Base Command Platform

In order to run the data preparation script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. By default, the data preparation script will download the data into the data/ directory. We recommend that the data_dir parameter is set to a workspace, so that the data is visible across multiple jobs later on. The vocab and merge files should also be stored to the same workspace as the dataset. The data preparation code must be launched in a multi-node job, and can be parallelized to use between 2 and 30 nodes, for faster parallel preparation of the dataset.

With Base Command Platform, the 700+ GB dataset can be downloaded once and then shared by multiple users in the same ACE by setting the permissions of the nemo_megatron_data_ws workspace.

To run the data preparation pipeline for T5 models, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py data_preparation=t5/download_t5_pile \
stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results data_preparation.file_numbers='0-29' \
data_preparation.vocab_save_dir=/mount/data/bpe >> /results/data_t5_log.txt 2>&1

The command above assumes you want to prepare the entire dataset (files 0-29), and you mounted the data workspace in /mount/data, and the results workspace in /mount/results. Stdout and stderr are redirected to the /results/data_t5_log.txt file so that the logs can be downloaded from NGC. Any other parameter can also be added to the command to modify its behavior.

5.1.2.2.3. Common

Set the configuration for the data preparation job for T5 models in the YAML file:

dataset: pile
download_the_pile: True    # Whether to download the pile dataset from the internet.
the_pile_url: "https://mystic.the-eye.eu/public/AI/pile/train/"    # Source URL to download The Pile dataset from.
file_numbers: "0-29"    # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True    # True to preprocess the data from a jsonl file, False otherwise.
download_vocab_url: "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt"    # URL to download the vocab from.
download_merges_url: null
vocab_save_dir: ${data_dir}/bpe
merges_save_dir: ${data_dir}/bpe
tokenizer_type: BertWordPieceCase # T5 models use BertWordPieceCase tokenizer
log_dir: ${base_results_dir}/data_preparation/t5_pile_logs    # Where to save the logs
rm_downloaded: True # Extract script will remove downloaded zst after extraction
rm_extracted: True # Preprocess script will remove extracted files after preproc.
nodes: 30
time_limit: "4:00:00"
bcp_preproc_npernode: 2 # 2 should be safe to use and is 2x faster.

5.1.2.3. Data Preparation for mT5 Models

The data_preparation parameter in conf/config.yaml specifies which file to use for data preparation configuration purposes. The data_preparation parameter needs to be specified as download_mc4 for preparing the mC4 dataset for mT5 models. The config file can be found in conf/data_preparation/download_mc4.yaml. mT5 models use a multilingual SentencePiece tokenizer.

To download a reduced portion of the dataset to run tests, the languages parameter can be updated to download only one of the languages, for example by changing it to lv. The list of all 101 languages can be found in the mC4 dataset.

The data preparation can be parallelized by using multiple nodes (default 20 nodes) to download and preprocess all language files in parallel.

5.1.2.3.1. Slurm

First, ensure the cluster configuration settings in the conf/cluster/bcm.yaml file are correct. The cluster and cluster_type parameters in conf/config.yaml must be set to bcm. Then, modify the time_limit or any other parameter related to the job in the download_mc4.yaml file for mT5 models.

Example:

To run only the data preparation pipeline and not the training, evaluation or inference pipelines, set the conf/config.yaml file to:

stages:
  - data_preparation

And then run:

python3 main.py

5.1.2.3.2. Base Command Platform

In order to run the data preparation script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. By default, the data preparation script will download the data into the data/ directory. We recommend that the data_dir parameter is set to a workspace, so that the data is visible across multiple jobs later on. The tokenizer model file should also be stored to the same workspace as the dataset. The data preparation code must be launched in a multi-node job, and can be parallelized to use between 2 and 30 nodes, for faster parallel preparation of the dataset.

With Base Command Platform, the dataset can be downloaded once and then shared by multiple users in the same ACE by setting the permissions of the nemo_megatron_data_ws workspace.

To run the data preparation pipeline for mT5 models, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py data_preparation=mt5/download_mc4 \
stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results data_preparation.languages=\'cs,da,de,el,en,es,fi,fr,hi,hu,it,ja,ko,lt,lv,nl,no,pl,pt,ro,ru,sk,sv,zh\' \
data_preparation.run.node_array_size=20 data_preparation.run.workers_per_node=4 >> /results/data_mt5_log.txt 2>&1

The command above assumes you want to prepare the mC4 dataset with 24 languages, and you mounted the data workspace in /mount/data, and the results workspace in /mount/results. Stdout and stderr are redirected to the /results/data_mt5_log.txt file so that the logs can be downloaded from NGC. The full dataset may not fit into BCP workspaces. We recommend using a smaller subset of languages with a total size of 1 TB, e.g. cs,da,de,el,fr,hi. Any other parameter can also be added to the command to modify its behavior.

5.1.2.3.3. Common

Set the configuration for the data preparation job for mT5 models in the YAML file:

run:
  name: download_mc4
  results_dir: ${base_results_dir}/${.name}
  time_limit: "24:00:00"
  dependency: "singleton"
  node_array_size: 20
  cpus_per_node: 256
  workers_per_node: 4 # Number of workers per node in preprocessing step.
dataset: mc4
download_mc4: True  # Whether to download the mC4 dataset from the internet.
preprocess_data: True  # True to preprocess the data from a json.gz file, False otherwise.
mc4_dir: ${data_dir}/mc4 # Path to (m)C4 dataset repo.
git_lfs_dir: ${.mc4_dir}/lfs # Path to store git lfs files.
download_vocab_url: https://storage.googleapis.com/t5-data/vocabs/mc4.250000.100extra/sentencepiece.vocab # URL to download the vocab from.
download_tokenizer_url: https://storage.googleapis.com/t5-data/vocabs/mc4.250000.100extra/sentencepiece.model # URL to download tokenizer from
vocab_save_dir: ${.mc4_dir}/bpe
tokenizer_save_dir: ${.mc4_dir}/bpe
tokenizer_model: ${.tokenizer_save_dir}/mt5_tokenizer.model
languages: cs,da,de,el,en,es,fi,fr,hi,hu,it,ja,ko,lt,lv,nl,no,pl,pt,ro,ru,sk,sv,zh # language list in mC4 dataset to download and preprocess. Use `all` to download and preprocess all languages or specify language list as `en,es,ko,zh,...`
use_cleaned_english: True # whether to use cleaned version of english data
softlinks_dir: ${.mc4_dir}/softlinks # Path to languages soft links for preprocessing
preprocessed_dir: ${.mc4_dir}/preprocessed
max_split_size: 200 # (GB) Each split will be preprocessed individually. Tune this down to accommodate short wall time on clusters
download_worker_mapping: ${.mc4_dir}/download_mapping
preprocess_worker_mapping: ${.mc4_dir}/preprocess_mapping
rm_downloaded: False # Script will not remove downloaded after preprocessing
5.1.2.4. Data Preparation for BERT Models

The data_preparation parameter in conf/config.yaml specifies which file to use for data preparation configuration purposes. The default value is set to download_bert_pile, which can be found in conf/data_preparation/download_bert_pile.yaml. It is used to download, extract, and preprocess the Pile dataset for BERT models. The parameters can be modified to perform the different tasks and to decide where to store the datasets, vocab, etc.

To download a reduced portion of the dataset to run tests, the file_numbers parameter can be updated to download only one of the shards by changing "0-29" to "0". The syntax must be a combination of numbers separated by dashes "-" or commas ",". For example, file_numbers="0,3,5-7" will download and prepare files 0, 3, 5, 6, and 7.
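
The file_numbers syntax can be illustrated with a small parser. This is a minimal sketch of how such a specification expands into shard indices, not the launcher's own parsing code; the function name expand_file_numbers is hypothetical.

```python
def expand_file_numbers(spec):
    """Expand a spec such as "0,3,5-7" into a sorted list of shard indices."""
    numbers = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # A dash denotes an inclusive range, e.g. "5-7" -> 5, 6, 7.
            start, end = (int(value) for value in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            numbers.update(range(start, end + 1))
        else:
            numbers.add(int(part))
    return sorted(numbers)


print(expand_file_numbers("0,3,5-7"))  # [0, 3, 5, 6, 7]
print(len(expand_file_numbers("0-29")))  # 30
```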

5.1.2.4.1. Slurm

First, ensure the cluster related configuration in the conf/cluster/bcm.yaml file is correct. The cluster and cluster_type parameters in conf/config.yaml must be set to bcm. Then, modify the time_limit or any other parameter related to the job in the download_bert_pile.yaml file for BERT models. The data preparation can be parallelized by using up to 30 nodes to download all 30 files in parallel.
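
To see why 30 nodes is the useful upper bound, the sketch below spreads the 30 Pile files over a given number of nodes in round-robin order. The assignment scheme and the name assign_shards are assumptions made for illustration; the launcher may distribute files differently.

```python
def assign_shards(num_files, num_nodes):
    """Round-robin assignment of file indices to node indices (illustrative only)."""
    if not 1 <= num_nodes <= num_files:
        raise ValueError("use between 1 and num_files nodes")
    return {node: list(range(node, num_files, num_nodes)) for node in range(num_nodes)}


plan = assign_shards(30, 10)
print(plan[0])  # [0, 10, 20]

# With 30 nodes every node handles exactly one file, so extra nodes add nothing.
print(max(len(files) for files in assign_shards(30, 30).values()))  # 1
```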

Example:

To run only the data preparation pipeline and not the training, evaluation or inference pipelines, set the conf/config.yaml file to:

stages:
  - data_preparation

And then run:

python3 main.py
5.1.2.4.2. Base Command Platform

In order to run the data preparation script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. By default, the data preparation script will download the data into the data/ directory. We recommend that the data_dir parameter is set to a workspace, so that the data is visible across multiple jobs later on. The vocab and merge files should also be stored to the same workspace as the dataset, for later usage. The data preparation code must be launched in a multi-node job. It can be parallelized to use between 2 and 30 nodes for faster preparation of the dataset.

With Base Command Platform, the 700+ GB dataset can be downloaded once and then shared by multiple users in the same ACE by setting appropriate permissions on the nemo_megatron_data_ws workspace.

To run the data preparation pipeline for BERT models, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_bert \
base_results_dir=/mount/results data_preparation.file_numbers='0-29' \
data_preparation.vocab_save_dir=/mount/data/bpe data_preparation.merges_save_dir=/mount/data/bpe >> /results/data_bert_log.txt 2>&1

The command above assumes you want to prepare the entire dataset (files 0-29), and you mounted the data workspace in /mount/data, and the results workspace in /mount/results. Stdout and stderr are redirected to the /results/data_bert_log.txt file, so it can be downloaded from NGC. Any other parameter can also be added to the command to modify its behavior.

5.1.2.4.3. Common

Set the configuration for the data preparation job for BERT models in the YAML file:

run:
  name: download_bert_pile
  results_dir: ${base_results_dir}/${.name}
  time_limit: "4:00:00"
  dependency: "singleton"
  node_array_size: 30
  array: ${..file_numbers}
  bcp_preproc_npernode: 2 # 2 should be safe to use and x2 times faster.

dataset: pile
download_the_pile: True  # Whether to download the pile dataset from the internet.
the_pile_url: "https://mystic.the-eye.eu/public/AI/pile/train/"  # Source URL to download The Pile dataset from.
file_numbers: "0-29"  # The pile dataset consists of 30 files (0-29), choose which ones to download.
preprocess_data: True  # True to preprocess the data from a jsonl file, False otherwise.
download_vocab_url: "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased-vocab.txt"  # URL to download the vocab from.
vocab_save_dir: ${data_dir}/bpe
tokenizer_type: BertWordPieceLowerCase
rm_downloaded: True # Extract script will remove downloaded zst after extraction
rm_extracted: True # Preprocess script will remove extracted files after preproc.
5.1.2.4.4. LDDL

Language Datasets and Data Loaders (LDDL) is a utility library that minimizes friction during dataset retrieval, preprocessing, and loading for language models. LDDL provides dataset preprocessing and data loaders that allow efficient training of BERT with dynamic sequence lengths in order to maximize training performance. LDDL is currently not installed by default in the NeMo FW container. It can be installed with pip install git+https://github.com/NVIDIA/lddl.git. Directions for preprocessing data with binning into the LDDL format that can be used with NeMo are available at https://github.com/NVIDIA/LDDL#bert.

With the data preprocessed in binned LDDL format the LDDL dataset can be used with the following changes to the YAML file:

trainer:
  data:
    data_prefix: 
      - /path/to/train/LDDL/Dataset
      - /path/to/val/LDDL/Dataset
      - /path/to/test/LDDL/Dataset
    dataloader_type: LDDL

Note: NeMo FW currently only works with LDDL datasets that have been preprocessed with binning.

5.2. Training with Predefined Configurations

5.2.1. Predefined Configurations of GPT Models

We provide nine configurations for several different GPT model sizes: 126M, 400M_improved, 1B_improved, 5B, 7B_improved, 20B, 40B, 40B_improved, and 175B parameters. These configurations include carefully selected hyperparameters, which should be used as a guideline for any custom model configurations. All these configurations are provided in the conf/training/gpt3/ directory. The desired configuration can be chosen by selecting the training parameter in the conf/config.yaml file. For Base Command Platform, all jobs must be launched in multi-node mode.

126M configuration:

The 126M model uses the bf16 data type. It can be trained in about 20 hours using 8 nodes with 8 GPUs per node. The model includes 12 transformer layers, a hidden size of 768, and 12 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the gpt3/126m.yaml config file for parameter details.
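
The model-size names can be sanity-checked from the layer count and hidden size. The sketch below uses the common approximation of 12 x layers x hidden_size^2 parameters for the transformer stack plus the embedding tables. The vocabulary size of 50257 (GPT-2 BPE), the feedforward size of 4 x hidden_size, and learned position embeddings are assumptions, and biases and layer norms are ignored, so the results are estimates rather than exact counts.

```python
def approx_gpt_params(num_layers, hidden_size, vocab_size=50257, seq_length=2048):
    """Rough GPT parameter count; vocab_size and the 4x MLP width are assumed."""
    # Per layer: about 4*h^2 for attention plus 8*h^2 for an MLP of inner size 4h.
    transformer = 12 * num_layers * hidden_size ** 2
    # Token embeddings plus learned position embeddings.
    embeddings = (vocab_size + seq_length) * hidden_size
    return transformer + embeddings


# (name, transformer layers, hidden size) as listed for each predefined config.
CONFIGS = [
    ("126M", 12, 768),
    ("5B", 24, 4096),
    ("20B", 44, 6144),
    ("40B", 48, 8192),
    ("175B", 96, 12288),
]

for name, layers, hidden in CONFIGS:
    print(f"{name}: ~{approx_gpt_params(layers, hidden) / 1e9:.2f}B parameters")
```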

To train a 126M model on a Slurm cluster, modify the conf/config.yaml file to set:

- training: gpt3/126m
stages:
  - training

And run:

python3 main.py

To train a 126M GPT model on Base Command Platform cluster on 8 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/126m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively, and that $NGC_ARRAY_SIZE is set to the number of nodes selected when creating the job (number of replicas).

To train with a different number of nodes, the relevant parameters can be adjusted either in the YAML config file or from the command line. More on this in section 5.7. For Base Command Platform, all jobs must be launched in multi-node mode.

5B configuration:

The 5B model uses the bf16 data type. It can be trained in about 5 days using 16 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 4096, and 32 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 1. For the details on all the parameters, see the 5b.yaml config file.

To train a 5B GPT model, modify the conf/config.yaml file to set:

- training: gpt3/5b
stages:
  - training

And run:

python3 main.py

To train a 5B GPT model on Base Command Platform cluster on 16 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/5b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively, and that $NGC_ARRAY_SIZE is set to the number of nodes selected when creating the job (number of replicas).

20B configuration:

The 20B model uses the bf16 data type. It can be trained in about 6 days using 64 nodes with 8 GPUs per node. The model includes 44 transformer layers, a hidden size of 6144, and 48 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4 and pipeline parallelism of 1. For the details on all the parameters, see the 20b.yaml config file.

To train a 20B GPT model, modify the conf/config.yaml file to set:

- training: gpt3/20b
stages:
  - training

And run:

python3 main.py

To train a 20B GPT model on Base Command Platform cluster on 64 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/20b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively, and that $NGC_ARRAY_SIZE is set to the number of nodes selected when creating the job (number of replicas).

40B configuration:

The 40B model uses the bf16 data type. It can be trained in about 6 days using 128 nodes with 8 GPUs per node. The model includes 48 transformer layers, a hidden size of 8192, and 64 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 8 and pipeline parallelism of 1. For the details on all the parameters, see the 40b.yaml config file.

To train a 40B GPT model, modify the conf/config.yaml file to set:

- training: gpt3/40b
stages:
  - training

And run:

python3 main.py

To train a 40B GPT model on Base Command Platform cluster on 128 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/40b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively, and that $NGC_ARRAY_SIZE is set to the number of nodes selected when creating the job (number of replicas).

175B configuration:

The 175B model uses the bf16 data type. It can be trained in about 24 days using 128 nodes with 8 GPUs per node. The model includes 96 transformer layers, a hidden size of 12288, and 96 attention heads. The sequence length is 2048, and the optimizer is Distributed Adam. This model uses tensor parallelism of 8 and pipeline parallelism of 16. This model uses interleaved pipeline scheduling, with a virtual pipeline chunk size of 6. For the details on all the parameters, see the 175b.yaml config file.
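
The tensor and pipeline parallel sizes determine how many model replicas train in parallel. As a small worked example, the sketch below derives the data-parallel size from the node count and parallelism settings quoted for the predefined configurations; the function name is hypothetical.

```python
def data_parallel_size(num_nodes, gpus_per_node, tensor_parallel, pipeline_parallel):
    """Number of model replicas: world size divided by GPUs per model replica."""
    world_size = num_nodes * gpus_per_node
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tensor * pipeline parallel size")
    return world_size // model_parallel


print(data_parallel_size(128, 8, 8, 16))  # 175B: 1024 GPUs / 128 per replica = 8
print(data_parallel_size(128, 8, 8, 1))   # 40B: 1024 GPUs / 8 per replica = 128
print(data_parallel_size(64, 8, 4, 1))    # 20B: 512 GPUs / 4 per replica = 128
```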

To train a 175B GPT model, modify the conf/config.yaml file to set:

- training: gpt3/175b
stages:
  - training

And run:

python3 main.py

To train a 175B GPT model on Base Command Platform cluster on 128 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=gpt3/175b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.json \
training.model.tokenizer.merge_file=/mount/data/bpe/merges.txt cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively, and that $NGC_ARRAY_SIZE is set to the number of nodes selected when creating the job (number of replicas).

FP8 with Transformer Engine

Transformer Engine (TE) is a library for accelerating Transformer-based models on NVIDIA Hopper GPUs. It enables using 8-bit floating point (FP8) precision to provide better performance with lower memory utilization in both training and inference. NVIDIA open-sourced TE on GitHub.

In NeMo Framework, you can now use fp8 to pre-train GPT models. For example, to turn on fp8 when pre-training a GPT3 5B model, modify the gpt3/5b training config in the conf/training/gpt3/5b.yaml file as follows. To run a job with fp8, set transformer_engine=True and fp8=True. The other fp8-related knobs are already set in the baseline pre-training scripts and are ignored in bf16 training.

  ## Transformer Engine
  transformer_engine: True # turn on Transformer Engine
  fp8: True # enables fp8 in TransformerLayer forward
  fp8_e4m3: False # sets fp8_format = recipe.Format.E4M3
  fp8_hybrid: True # sets fp8_format = recipe.Format.HYBRID
  fp8_margin: 0 # scaling margin
  fp8_interval: 1 # scaling update interval
  fp8_amax_history_len: 1024 # Number of steps for which amax history is recorded per tensor
  fp8_amax_compute_algo: max # 'most_recent' or 'max'. Algorithm for computing amax from history
  use_emha: False

Comparing fp8 with bf16 precision, we observed similar convergence behavior and a significant speed-up with fp8.
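
To give a feel for what fp8_margin, fp8_amax_history_len, and fp8_amax_compute_algo control, the sketch below computes a delayed scaling factor of the form (fp8_max / amax) / 2^margin from a recorded amax history. It is a simplified illustration written in plain Python, not Transformer Engine code; the function name and the fallback value of 1.0 are assumptions. The hybrid format uses E4M3 in the forward pass and E5M2 for gradients.

```python
# Largest finite values representable in each FP8 format.
FP8_MAX = {"E4M3": 448.0, "E5M2": 57344.0}


def fp8_scale(amax_history, fp8_format="E4M3", margin=0, algo="max"):
    """Simplified scaling factor: (fp8_max / amax) / 2**margin."""
    if algo == "max":
        amax = max(amax_history)   # largest absolute value seen in the window
    elif algo == "most_recent":
        amax = amax_history[-1]    # only the latest step counts
    else:
        raise ValueError(f"unknown algo: {algo!r}")
    if amax <= 0.0:
        return 1.0  # assumed fallback when nothing has been observed yet
    return FP8_MAX[fp8_format] / amax / (2 ** margin)


history = [3.5, 7.0, 1.75]  # hypothetical per-step amax values for one tensor
print(fp8_scale(history))                      # 64.0
print(fp8_scale(history, algo="most_recent"))  # 256.0
print(fp8_scale(history, margin=1))            # 32.0
```

A longer history with the max algorithm reacts more slowly to outliers disappearing, while most_recent tracks the latest step at the cost of more overflow risk.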

5.2.2. Predefined Configurations of T5 Models

We provide configuration files for five T5 model sizes: 220M, 3B, 11B, 23B, and 41B parameters. These configurations include carefully selected hyperparameters, which should be used as guidelines for any custom model configurations. The configuration files are provided in the conf/training/t5 directory. The desired configuration can be chosen by selecting the training parameter in the conf/config.yaml file. For Base Command Platform, all jobs must be launched in multi-node mode.

220M configuration:

The 220M model uses the bf16 data type. It can be trained in about 3.5 days using 4 nodes with 8 GPUs per node. The model includes 12 transformer layers, a hidden size of 768, a feedforward network size of 2048, and 12 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the t5/220m.yaml config file for parameter details.

To train a 220M model on a Slurm cluster, modify the conf/config.yaml file to set:

training: t5/220m
stages:
  - training

And run:

python3 main.py

To train a 220M model on Base Command Platform cluster on 4 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=t5/220m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively. $NGC_ARRAY_SIZE is automatically set to the number of nodes that will be used when creating the job (number of replicas).

To train with a different number of nodes, the relevant parameters (e.g. micro_batch_size) can be adjusted either in the appropriate yaml config file or from the command line. More on this in section 5.7. For Base Command Platform, all jobs must be launched in multi-node mode.
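
When the node count changes, the batch-size settings have to stay consistent. The sketch below assumes the usual relation global_batch_size = micro_batch_size x data_parallel_size x gradient accumulation steps; the batch sizes used in the example calls are illustrative values, not taken from a specific config file.

```python
def grad_accumulation_steps(global_batch_size, micro_batch_size, data_parallel_size):
    """Accumulation steps needed so that one global step sees global_batch_size samples."""
    samples_per_micro_step = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_micro_step != 0:
        raise ValueError(
            "global_batch_size must be a multiple of micro_batch_size * data_parallel_size"
        )
    return global_batch_size // samples_per_micro_step


# 4 nodes x 8 GPUs without model parallelism gives a data-parallel size of 32.
print(grad_accumulation_steps(2048, 64, 32))  # 1
# Halving the node count doubles the accumulation steps for the same global batch.
print(grad_accumulation_steps(2048, 64, 16))  # 2
```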

3B configuration:

The 3B model uses the bf16 data type. It can be trained in about 7.5 days using 20 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 2048, a feedforward network size of 5120, and 32 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For the details on all the parameters, see the t5/3b.yaml config file.

To train a 3B model, modify the conf/config.yaml file to set:

training: t5/3b
stages:
  - training

And run:

python3 main.py

To train a 3B model on Base Command Platform cluster on 20 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=t5/3b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively. $NGC_ARRAY_SIZE is automatically set to the number of nodes that will be used when creating the job (number of replicas).

11B configuration:

The 11B model uses the bf16 data type. It can be trained in about 26.5 days using 20 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 4096, a feedforward network size of 10240, and 64 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4. For the details on all the parameters, see the t5/11b.yaml config file.

To train an 11B model, modify the conf/config.yaml file to set:

training: t5/11b
stages:
  - training

And run:

python3 main.py

To train an 11B model on Base Command Platform cluster on 20 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=t5/11b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively. $NGC_ARRAY_SIZE is automatically set to the number of nodes that will be used when creating the job (number of replicas).

23B configuration:

The 23B model uses the bf16 data type. It can be trained in about 36 days using 40 nodes with 8 GPUs per node. The model includes 36 transformer layers, a hidden size of 5120, a feedforward network size of 10880, and 64 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4 and pipeline parallelism of 2. For the details on all the parameters, see the t5/23b.yaml config file.

To train a 23B model, modify the conf/config.yaml file to set:

training: t5/23b
stages:
  - training

And run:

python3 main.py

To train a 23B model on Base Command Platform cluster on 40 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=t5/23b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively. $NGC_ARRAY_SIZE is automatically set to the number of nodes that will be used when creating the job (number of replicas).

41B configuration:

The 41B model uses the bf16 data type. It can be trained in about 60 days using 40 nodes with 8 GPUs per node. The model includes 36 transformer layers, a hidden size of 6144, a feedforward network size of 10880, and 96 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 4 and pipeline parallelism of 2. For the details on all the parameters, see the t5/41b.yaml config file.

To train a 41B model, modify the conf/config.yaml file to set:

training: t5/41b
stages:
  - training

And run:

python3 main.py

To train a 41B model on Base Command Platform cluster on 40 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=t5/41b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively. $NGC_ARRAY_SIZE is automatically set to the number of nodes that will be used when creating the job (number of replicas).
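
The training times quoted for the T5 configurations can be compared on a common basis by converting them to GPU-days. The sketch below only multiplies the days, node counts, and 8 GPUs per node stated above; it is simple arithmetic, not a measured benchmark.

```python
GPUS_PER_NODE = 8

# name -> (approximate training days, number of nodes), as quoted above.
T5_CONFIGS = {
    "220M": (3.5, 4),
    "3B": (7.5, 20),
    "11B": (26.5, 20),
    "23B": (36, 40),
    "41B": (60, 40),
}


def gpu_days(days, nodes, gpus_per_node=GPUS_PER_NODE):
    """Total GPU-days consumed by a run of `days` on `nodes` nodes."""
    return days * nodes * gpus_per_node


for name, (days, nodes) in T5_CONFIGS.items():
    print(f"{name}: {gpu_days(days, nodes):,.0f} GPU-days")
```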

5.2.3. Predefined Configurations of mT5 Models

We provide configuration files for three mT5 model sizes: 170M, 390M, and 3B parameters. These configurations include carefully selected hyperparameters, which should be used as guidelines for any custom model configurations. The configuration files are provided in the conf/training/mt5 directory. The desired configuration can be chosen by selecting the training parameter in the conf/config.yaml file. For Base Command Platform, all jobs must be launched in multi-node mode.

170M configuration:

The 170M model uses the bf16 data type. It can be trained in about 4 days using 4 nodes with 8 GPUs per node. The model includes 8 transformer layers, a hidden size of 512, a feedforward network size of 1024, and 6 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the mt5/170m.yaml config file for parameter details.

To train a 170M model on a Slurm cluster, modify the conf/config.yaml file to set:

training: mt5/170m
stages:
  - training

And run:

python3 main.py

To train a 170M model on Base Command Platform cluster on 4 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=mt5/170m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
training.trainer.num_nodes=\$NGC_ARRAY_SIZE cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively. $NGC_ARRAY_SIZE is automatically set to the number of nodes that will be used when creating the job (number of replicas).

To train with a different number of nodes, the relevant parameters (e.g. micro_batch_size) can be adjusted either in the appropriate yaml config file or from the command line. More on this in section 5.7. For Base Command Platform, all jobs must be launched in multi-node mode.

390M configuration:

The 390M model uses the bf16 data type. It can be trained in about 4 days using 8 nodes with 8 GPUs per node. The model includes 8 transformer layers, a hidden size of 512, a feedforward network size of 2048, and 12 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the mt5/390m.yaml config file for parameter details.

To train a 390M model on a Slurm cluster, modify the conf/config.yaml file to set:

training: mt5/390m
stages:
  - training

And run:

python3 main.py

To train a 390M model on Base Command Platform cluster on 8 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=mt5/390m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
training.trainer.num_nodes=\$NGC_ARRAY_SIZE cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively. $NGC_ARRAY_SIZE is automatically set to the number of nodes that will be used when creating the job (number of replicas).

3B configuration:

The 3B model uses the bf16 data type. It can be trained in about 14 days using 20 nodes with 8 GPUs per node. The model includes 24 transformer layers, a hidden size of 2048, a feedforward network size of 5120, and 32 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model uses tensor parallelism of 2. For the details on all the parameters, see the mt5/3b.yaml config file.

To train a 3B model, modify the conf/config.yaml file to set:

training: mt5/3b
stages:
  - training

And run:

python3 main.py

To train a 3B model on Base Command Platform cluster on 20 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=mt5/3b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
training.trainer.num_nodes=\$NGC_ARRAY_SIZE cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively. $NGC_ARRAY_SIZE is automatically set to the number of nodes that will be used when creating the job (number of replicas).

5.2.4. Training Logs with TensorBoard and Weights and Biases

The training code can log the model and system related metrics to both TensorBoard and Weights & Biases (W&B). The local files will be stored in the directory specified in the training.exp_manager.explicit_log_dir parameter. TensorBoard logs are saved by default.

However, W&B needs the API key to be specified to work properly. To upload the logs to W&B, the user must first store the W&B API key to a file (on the first line of the file), and select the path to the file that contains the key using the wandb_api_key_file parameter. For Base Command Platform, this file can be stored in a dataset or workspace mounted to the job. To enable the logging of the training metrics to W&B, the following training parameters must be set:

exp_manager:
        create_wandb_logger: True
        wandb_logger_kwargs:
            project: [W&B project name]
            name: [W&B run name]
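
The key file format described above (the key on the first line of the file) can be checked before launching a job. The helper below is a hypothetical sketch, not part of the launcher; it reads the first line and fails early if it is empty.

```python
import tempfile
from pathlib import Path


def read_wandb_api_key(key_file):
    """Return the API key stored on the first line of key_file."""
    with Path(key_file).open(encoding="utf-8") as handle:
        key = handle.readline().strip()
    if not key:
        raise ValueError(f"no API key found on the first line of {key_file}")
    return key


# Demonstration with a throwaway file and a fake key.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = Path(tmp_dir) / "wandb_key.txt"
    demo_file.write_text("not-a-real-key\n", encoding="utf-8")
    print(read_wandb_api_key(demo_file))  # not-a-real-key
```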

The logs show the reduced_train_loss, val_loss, train_step_timing (which is the best way to measure the time it takes to finish each global step), and other relevant metrics.
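
Because train_step_timing measures one global step, it can be turned into a throughput figure. The sketch below assumes the metric is reported in seconds per global step; the batch size, sequence length, and GPU count in the example are hypothetical.

```python
def tokens_per_second(global_batch_size, seq_length, train_step_timing):
    """Tokens processed per second, given the time of one global step in seconds."""
    if train_step_timing <= 0:
        raise ValueError("train_step_timing must be positive")
    return global_batch_size * seq_length / train_step_timing


def tokens_per_second_per_gpu(global_batch_size, seq_length, train_step_timing, num_gpus):
    """Per-GPU throughput, useful for comparing runs with different node counts."""
    return tokens_per_second(global_batch_size, seq_length, train_step_timing) / num_gpus


# Hypothetical run: 256 sequences of 2048 tokens per step, 2.0 s per step, 64 GPUs.
print(tokens_per_second(256, 2048, 2.0))              # 262144.0
print(tokens_per_second_per_gpu(256, 2048, 2.0, 64))  # 4096.0
```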

5.2.5. Predefined Configurations of BERT Models

We provide configuration files for four BERT model sizes: 110M, 4B, 20B, and 100B parameters. These configurations include carefully selected hyperparameters, which should be used as guidelines for any custom model configurations. The configuration files are provided in the conf/training/bert directory. The desired configuration can be chosen by selecting the training parameter in the conf/config.yaml file. For Base Command Platform, all jobs must be launched in multi-node mode.

110M configuration:

The 110M model uses the bf16 data type. The model includes 12 transformer layers, a hidden size of 768, a feedforward network size of 3072 and 12 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. This model does not use any model parallelism. See the bert/110m.yaml config file for parameter details.

To train a 110M model on a Slurm cluster, modify the conf/config.yaml file to set:

training: bert/110m
stages:
  - training

And run:

python3 main.py

To train a 110M model on Base Command Platform cluster on 4 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=bert/110m \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_bert \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively. $NGC_ARRAY_SIZE is automatically set to the number of nodes that will be used when creating the job (number of replicas).

To train with a different number of nodes, the relevant parameters (e.g. micro_batch_size) can be adjusted either in the appropriate yaml config file or from the command line. More on this in section 5.7. For Base Command Platform, all jobs must be launched in multi-node mode.

4B configuration:

The 4B model uses the bf16 data type. The model includes 48 transformer layers, a hidden size of 2560, a feedforward network size of 10240, and 40 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For the details on all the parameters, see the bert/4b.yaml config file.

To train a 4B model, modify the conf/config.yaml file to set:

training: bert/4b
stages:
  - training

And run:

python3 main.py

To train a 4B model on Base Command Platform cluster on 20 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=bert/4b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_bert \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively. $NGC_ARRAY_SIZE is automatically set to the number of nodes that will be used when creating the job (number of replicas).

20B configuration:

The 20B model uses the bf16 data type. The model includes 48 transformer layers, a hidden size of 6144, a feedforward network size of 24576, and 96 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For the details on all the parameters, see the bert/20b.yaml config file.

To train a 20B model, modify the conf/config.yaml file to set:

training: bert/20b
stages:
  - training

And run:

python3 main.py

To train a 20B model on Base Command Platform cluster on 20 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=bert/20b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively. $NGC_ARRAY_SIZE is automatically set to the number of nodes that will be used when creating the job (number of replicas).

100B configuration:

The 100B model uses the bf16 data type. The model includes 96 transformer layers, a hidden size of 9216, a feedforward network size of 36864, and 96 attention heads with GeGLU activation function. The sequence length is 512, and the optimizer is Distributed Adam. For the details on all the parameters, see the bert/100b.yaml config file.

To train a 100B model, modify the conf/config.yaml file to set:

training: bert/100b
stages:
  - training

And run:

python3 main.py

To train a 100B model on Base Command Platform cluster on 20 nodes, use the command:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py training=bert/100b \
stages=[training] \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results training.trainer.num_nodes=\$NGC_ARRAY_SIZE \
training.model.tokenizer.vocab_file=/mount/data/bpe/vocab.txt cluster_type=bcp

The command above assumes that the data and results workspaces are mounted in the /mount/data and /mount/results directories respectively. $NGC_ARRAY_SIZE is automatically set to the number of nodes that will be used when creating the job (number of replicas).
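As a rough sanity check on the 4B, 20B and 100B names, the layer counts and hidden sizes quoted above can be plugged into the common 12 * L * h^2 approximation for a transformer stack (4h^2 for attention plus 8h^2 for a two-matrix feedforward block with a 4x hidden size). This is only a sketch: it ignores embeddings, biases, layer norms and the extra projection that a gated activation such as GeGLU can add, so the exact counts in the yaml configs may differ.

```python
def approx_transformer_params(num_layers: int, hidden_size: int) -> int:
    """Rough parameter count: 12 * L * h^2 (attention 4h^2 + feedforward 8h^2 per layer)."""
    return 12 * num_layers * hidden_size ** 2


# (transformer layers, hidden size) as quoted for each BERT configuration above.
configs = {"4b": (48, 2560), "20b": (48, 6144), "100b": (96, 9216)}

for name, (layers, hidden) in configs.items():
    billions = approx_transformer_params(layers, hidden) / 1e9
    print(f"bert/{name}: ~{billions:.2f}B parameters")
```

Under these assumptions the estimates come out at about 3.77B, 21.74B and 97.84B, which is consistent with the nominal model sizes.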

5.3. Using AutoConfigurator to Find the Optimal Configuration

AutoConfigurator searches for the Hyper-Parameters (HPs) that achieve the highest throughput for training and inference for Large Language Models (LLMs) using NeMo-Megatron.

Note: The inference HP search is only available for GPT models.

5.3.1. AutoConfigurator Capabilities

AutoConfigurator is intended to quickly iterate over different model configurations and find the best configuration with minimal time and cost. To achieve that, it provides several capabilities, as shown in the table below:

| Feature | GPT | T5 | mT5 | Bert |
| --- | --- | --- | --- | --- |
| Model Size Recommendation | Yes | Yes | Yes | Yes |
| Base Config Generation | Yes | Yes | Yes | Yes |
| Training HP Search | Yes | Yes | Yes | Yes |
| Parallel Training HP Search | BCM Only | BCM Only | BCM Only | BCM Only |
| Inference HP Search | BCM Only | No | No | No |
| Parallel Inference HP Search | BCM Only | No | No | No |
| Slurm Based Clusters | Yes | Yes | Yes | Yes |
| Base Command Platform Based Clusters | Yes | Yes | Yes | Yes |

5.3.1.1. Model Size Recommendation

For users who do not know what model size they wish to train, AutoConfigurator is capable of recommending a model size, given the hardware and training constraints. If the number of GPUs, the TFLOPS per GPU, the maximum time to train, and the number of tokens to train for are known, then the tool can recommend a model size that can be trained with the specified hardware and time constraints.

For example, if the user has 20 NVIDIA DGX nodes available (80GB GPU memory), and wants to train a GPT model for a maximum of 5 days, AutoConfigurator will recommend using a 5B parameter GPT model.
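The exact heuristic AutoConfigurator uses is not spelled out here, but the idea can be sketched with the widely used approximation that training costs about 6 FLOPs per parameter per token. The function name and the `flops_per_param_token` factor below are assumptions for illustration only; the 8 GPUs per node, 140 TFLOPS per GPU and 300B tokens are taken from the default search config rather than from the example sentence.

```python
def recommend_model_size_in_b(num_nodes, gpus_per_node, tflops_per_gpu,
                              max_training_days, num_tokens_in_b,
                              flops_per_param_token=6.0):
    """Estimate the largest model (in billions of parameters) trainable in the time budget.

    Assumes total training FLOPs ~= flops_per_param_token * parameters * tokens.
    This is an illustrative approximation, not AutoConfigurator's actual formula.
    """
    seconds = max_training_days * 24 * 60 * 60
    total_flops = num_nodes * gpus_per_node * tflops_per_gpu * 1e12 * seconds
    params = total_flops / (flops_per_param_token * num_tokens_in_b * 1e9)
    return params / 1e9


size = recommend_model_size_in_b(
    num_nodes=20, gpus_per_node=8, tflops_per_gpu=140,
    max_training_days=5, num_tokens_in_b=300,
)
print(f"~{size:.1f}B parameters")
```

With these inputs the sketch yields roughly 5.4B parameters, in the same range as the 5B recommendation above. A larger factor (for example, to account for activation recomputation) would give a smaller model.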

5.3.1.2. Base Config Generation

If the model size is provided by the user, or after the model size is suggested, AutoConfigurator will generate a base configuration for the target model. This configuration will be a valid configuration in YAML format, which can be trained using NeMo-Megatron. However, the throughput optimization will happen at the next step (Training AutoConfigurator HP Search).

5.3.1.3. Training AutoConfigurator HP Search

Given the input model size and the base configuration, AutoConfigurator will now search over four critical hyper-parameters that have a great impact on training throughput but do not affect model convergence: Tensor Parallelism (TP), Pipeline Parallelism (PP), Micro Batch Size (MBS), and Activation Checkpointing Layers (ActCkpt).

First, AutoConfigurator will use heuristics to choose good candidates for those four parameters, and generate the grid of candidate configurations. All the candidate configurations will be saved to the results directory, and will include YAML files with the corresponding config. NOTE: some of these configurations might not work, due to high memory usage or for other reasons. The next step will determine which configurations are valid.

Once all the candidate configurations are generated, it will use heuristics to sort the most promising candidate configurations. Then, it will launch the most promising candidates in parallel, where the number of candidates can be set by the limit_search_runs parameter, to perform a grid search over the four training parameters. This search will launch the jobs using NeMo-Megatron, and it will train each config for a maximum of max_minutes_per_run minutes and a maximum of max_steps_per_run training steps, whichever is reached first on the target cluster. During this search, the jobs will run with the number of nodes specified in the configuration files, using the num_nodes parameter. Once all the jobs have finished running, the final result will be summarized in a CSV file.
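The flow described above (build a grid, drop invalid combinations, rank, then cap at limit_search_runs) can be sketched as follows. This is a minimal illustration, not the tool's code: the validity check and the ranking key are hypothetical stand-ins for AutoConfigurator's real heuristics, and the candidate values are examples.

```python
from itertools import product


def candidate_grid(num_gpus, tp_sizes, pp_sizes, micro_batch_sizes,
                   act_ckpt_layers, limit_search_runs):
    """Build, filter, rank and truncate a grid of training candidates."""
    candidates = []
    for tp, pp, mbs, act in product(tp_sizes, pp_sizes, micro_batch_sizes, act_ckpt_layers):
        # A model-parallel group of tp * pp GPUs must tile the cluster evenly.
        if num_gpus % (tp * pp) != 0:
            continue
        candidates.append({"tp": tp, "pp": pp, "mbs": mbs, "act_ckpt": act})
    # Hypothetical ranking: less model parallelism first, then larger micro batches.
    candidates.sort(key=lambda c: (c["tp"] * c["pp"], -c["mbs"], c["act_ckpt"]))
    return candidates[:limit_search_runs]


grid = candidate_grid(
    num_gpus=16 * 8,               # num_nodes * gpus_per_node
    tp_sizes=[1, 2, 4, 8],
    pp_sizes=[1, 2, 4, 10],        # pp=10 does not divide 128 GPUs and is dropped
    micro_batch_sizes=[1, 2, 4],
    act_ckpt_layers=[0, 1],
    limit_search_runs=30,
)
print(len(grid), grid[0])
```

In the real tool, each surviving candidate is written out as a YAML file and launched as a short training job. Only the measured throughput decides the final ranking.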

5.3.1.4. Inference AutoConfigurator HP Search

AutoConfigurator can also search for the best HPs for inference purposes. It will empirically measure the throughput and latency for each given configuration in the grid search space, and return a comprehensive table with all the numbers. It will search over three critical HPs, which have a great impact on the inference throughput and latency: Tensor Parallelism (TP), Pipeline Parallelism (PP), and Batch Size (BS).

Technically, AutoConfigurator is also capable of searching over different input/output sequence lengths. However, we do not recommend adding multiple different sequence lengths to the same search, since the model that uses the shortest sequence lengths will always achieve higher throughput and lower latency. Therefore, we recommend performing several different inference searches for different sequence lengths.

Once the search space has been defined, it will launch a job for each config in parallel, and measure the throughput and latency. This search will launch the jobs using NeMo-Megatron on the target cluster. Once all the jobs have finished running, the final result will be summarized in a CSV file.

5.3.2. Usage

In this section, we will explain how to run each of the stages described above.

5.3.2.1. General Configuration

5.3.2.1.1. Slurm

First, our configuration setup assumes that the /opt/NeMo-Megatron-Launcher/auto_configurator, /opt/NeMo-Megatron-Launcher/launcher_scripts and /opt/FasterTransformer directories have been copied from the container to the local file system.

The first parameter that must be set is the auto_configurator_path parameter inside the conf/config.yaml file. This parameter must point to the absolute path where the auto_configurator directory is stored in the file system. Additionally, if using a Slurm-based cluster, the config file in the conf/cluster/bcm.yaml subfolder has the parameters to set the generic cluster related information, such as the partition or account parameters.

The auto_configurator_path parameter will automatically be mounted to the container at the same path as in the local file system. Any additional directories that should be mounted must be specified using the container_mounts parameter. If the paths contain the colon character (:), the code will assume both the source and destination paths are provided. Otherwise, the given paths will be mounted to the same path inside the container.
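The mounting rule can be illustrated with a small helper. This is a sketch of the behavior described above, not the launcher's own code; the function name and the example /lustre paths are hypothetical.

```python
def resolve_mounts(auto_configurator_path, container_mounts):
    """Return 'source:destination' mount strings following the colon rule."""
    # auto_configurator_path is always mounted at the same path inside the container.
    mounts = [f"{auto_configurator_path}:{auto_configurator_path}"]
    for path in container_mounts or []:
        if path is None:  # the default config lists a single null entry
            continue
        # A colon means the user already gave source:destination.
        mounts.append(path if ":" in path else f"{path}:{path}")
    return mounts


mounts = resolve_mounts(
    "/opt/NeMo-Megatron-Launcher/auto_configurator",
    [None, "/lustre/data", "/lustre/ckpt:/checkpoints"],  # hypothetical paths
)
for m in mounts:
    print(m)
```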

The launcher_scripts_path and fastertransformer_path parameters must point to the paths where the launcher_scripts and FasterTransformer directories are located in the local file system. The locations specified in the default config should be valid if /opt was extracted correctly. Next, the data_dir value must point to the path where the training dataset is located. Note that the datasets for GPT, T5 and mT5 are different, so modify this parameter accordingly. Follow the data preparation steps to learn how to download and preprocess the datasets for each model. The dataset in this path does not need to be the full size dataset; only a small representative sample of the dataset is needed, since AutoConfigurator does not train the models to convergence. Finally, the base_results_dir parameter can be modified to point to the location where the results will be stored. See all the parameters for the conf/config.yaml file below:

defaults:
  - _self_
  - cluster: bcm
  - search_config: gpt3/5b
  - override hydra/job_logging: stdout

run_training_hp_search: True
run_inference_hp_search: True

cluster_type: bcm  # bcm or bcp
auto_configurator_path: ???  # Path to the location of auto_configurator codebase.
launcher_scripts_path: ${auto_configurator_path}/../launcher_scripts
fastertransformer_path: ${auto_configurator_path}/../FasterTransformer
base_results_dir: ${auto_configurator_path}/results
data_dir: ${launcher_scripts_path}/data
training_container: nvcr.io/ea-bignlp/nemofw-training:23.05-py3
container_mounts:
    - null
wandb:  # Weights and Biases (W&B) logging.
  enable: False  # Whether to save logs to W&B.
  api_key_file: null # Path to the file where the W&B API key is stored. Key must be on the first line.
  project: nemo-megatron-autoconfig # Name of the W&B project to store the logs in. The name of the run will be populated automatically.

5.3.2.1.2. Base Command Platform

In Base Command Platform, the dataset, vocabulary, and merge files used for the training HP search must be available as a dataset and mounted accordingly. This guide assumes the dataset will be mounted to /mount/data. The results of running AutoConfigurator will be stored in /mount/results/auto_configurator, so we recommend mounting a workspace to /mount/results.

The main configuration file can be found in conf/config.yaml. All the parameters can be overridden from the CLI, as we will show in the next section.

5.3.2.2. Running Predefined Configs

The predefined configs we provide have been well tested, and the outputs produced by AutoConfigurator have been verified manually. Running one of these configs will first generate a base config file for the specified model size. Then, it will launch the training and inference grid search jobs. When all the jobs have finished, a final recommendation will be produced for both training and inference, which will show the optimal hyper-parameters for the given model.

The predefined configs can be found in the conf/search_config directory. Each YAML file shows one model type (GPT, T5 or mT5) and one model size (up to 175B parameters for GPT and up to 42B parameters for T5 and mT5). To run the desired config, we will need to modify the search_config parameter in the conf/config.yaml file. For example, if we wish to run a 5B GPT model, we can set this value to gpt3/5b (the .yaml ending should not be included).

5.3.2.2.1. Model Config

To run the gpt3/5b config, we need to set up the conf/search_config/gpt3/5b.yaml file correctly.

train_settings:
  model_size_in_b: 5 # unit in billion parameters
  num_nodes: 16
  gpus_per_node: 8
  gpu_memory_gb: 80  # Memory per GPU, in GB. Currently 40GB and 80GB A100s supported.
  max_training_days: 5 # unit in days
  limit_search_runs: 100 # Max number of runs to be launched in parallel for grid search.
  output_top_n: 10  # The result will print the top N fastest training configs.
  max_steps_per_run: 50 # Max steps per run for the grid search.
  max_minutes_per_run: 10 # minutes per run for the grid search.
  tflops_per_gpu: 140  # Estimated tflops per GPU.
  num_tokens_in_b: 300  # Unit in billions, typically 300B for GPT3 models.
  vocab_size: 51200
  logs: ${base_results_dir}/${search_config_value}_${.gpu_memory_gb}gb  # Example base_results_dir/gpt3/126m
  tensor_parallel_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8]
  pipeline_parallel_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8, 10]
  min_model_parallel_size: auto  # auto to use our recommendation, or a value for the minimum desired parallelism
  max_model_parallel_size: auto  # auto to use our recommendation, or a value for the maximum desired parallelism
  micro_batch_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8, 16]
  act_ckpt_layers: auto  # auto to use our recommendation, or a list, such as [0, 1, 2, 3]
 
inference_settings:
  run:
    model_type: gpt3
    model_train_name: gpt3_5b
    gpus_per_node: 8
    data_type: "fp16" # fp32|fp16|bf16
    time_limit: 0:30:00
    results_dir: ${base_results_dir}/${search_config_value}_${search_config.train_settings.gpu_memory_gb}gb
    tensor_parallel_sizes: [1,2,4]
    pipeline_parallel_sizes: [1,2]
  benchmark:
    input_len: 60
    output_len: 20
    batch_sizes: [4,8,16,32,64,128,256]
    beam_width: 1
    topk: 4
    topp: 0.0

The training parameters are:

  • model_size_in_b: how many billions of parameters the model should contain. AutoConfigurator will provide a config and HPs for a model of that size.
  • num_nodes: how many nodes AutoConfigurator should use to run each training job.
  • gpus_per_node: how many GPUs are available in each node.
  • gpu_memory_gb: set to 40 or 80, depending on whether 40GB or 80GB A100 GPUs are available. This modifies the behavior of the heuristics, so that the recommended candidate configs are more suitable for each setting.
  • max_training_days: how many days the model will be trained for, when training to full convergence. It will be written to the final config YAML files. This parameter can also be used when model_size_in_b is set to null.
  • limit_search_runs: limits the number of configs that will be searched during the HP search stage. We recommend selecting a value between 30 and 100. AutoConfigurator will probably need to search at least 30 different configs to find the optimal one. However, if the computing resources are available in your cluster, we recommend increasing this parameter to a value close to 100.
  • output_top_n: how much detail the output summary file will include. By default, when set to 10, it will output the top 10 configurations.
  • max_steps_per_run: how many steps to train each configuration for.
  • max_minutes_per_run: how long to run each configuration for, in minutes. We recommend using at least 20 minutes per run for the smaller models, and increasing it to over 60 minutes for the larger models. The training run will be stopped when either max_steps_per_run or max_minutes_per_run is reached.
  • tflops_per_gpu: an estimate of the TFLOPS each GPU can achieve when training large language models with NeMo Framework. This value is only used to estimate how long the model will take to train to full convergence, so you can know the time to train before you begin training your model.
  • num_tokens_in_b: how many billions of tokens you will train your model for, when training to full convergence. It is used when estimating how long it will take to train the model to the desired number of tokens.
  • vocab_size: the vocabulary size that will be used during training.
  • logs: where the result logs will be saved. By default, this directory will be created inside the base_results_dir indicated in the conf/config.yaml file.
  • tensor_parallel_sizes, pipeline_parallel_sizes, min_model_parallel_size, max_model_parallel_size, micro_batch_sizes, and act_ckpt_layers: override the heuristics that choose the grid search space and the maximum and minimum parallelism allowed for each model. If these are left as auto, AutoConfigurator will select appropriate values. For example, if you only wish to search for configurations with Tensor Parallelism (TP) values of 1 and 2, you can set tensor_parallel_sizes: [1, 2] and leave the other parameters as auto.

In the inference parameters, gpus_per_node must be used to tell the system how many GPUs are available in each node. tensor_parallel_sizes is used to set the TP values to perform the HP search. pipeline_parallel_sizes is used to set the PP values to perform the HP search. batch_sizes is used to set all the possible batch sizes for the HP search. input_len can be set to the sequence length of the input that will be passed to the model. output_len can be set to the output length that will be produced by the model.
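To get a feel for the size of the inference search, the default values from the gpt3/5b config can be expanded into an explicit grid. This is only a sketch: whether batch sizes are swept inside one job or launched as separate jobs is not stated here, and the per-configuration node count simply assumes that a configuration needs TP * PP GPUs.

```python
from itertools import product
from math import ceil

# Values copied from the inference_settings of the gpt3/5b search config.
gpus_per_node = 8
tensor_parallel_sizes = [1, 2, 4]
pipeline_parallel_sizes = [1, 2]
batch_sizes = [4, 8, 16, 32, 64, 128, 256]

search_space = [
    # Assumption: one model replica occupies tp * pp GPUs.
    {"tp": tp, "pp": pp, "bs": bs, "nodes": ceil(tp * pp / gpus_per_node)}
    for tp, pp, bs in product(tensor_parallel_sizes, pipeline_parallel_sizes, batch_sizes)
]

print(len(search_space))                        # number of (TP, PP, BS) combinations
print(max(c["nodes"] for c in search_space))    # nodes needed by the largest combination
```

With the default lists this gives 42 combinations, and even the largest one (TP=4, PP=2) fits on a single 8-GPU node.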

5.3.2.2.2. Base Config Generation

Every time we call python3 main.py, a base configuration will be generated for the given model, and it will be saved to the logs directory indicated in your config files. The base configuration consists of a YAML file that can be run using the NeMo-Megatron training container. However, this base configuration has not yet been optimized to achieve the highest possible throughput; the optimization will take place in the next step (Training HP Search).

5.3.2.2.3. Training AutoConfigurator HP Search

5.3.2.2.3.1. Slurm

To run the training HP search pipeline, the parameter run_training_hp_search must be set to True in the conf/config.yaml file. The model used to search the best training HPs must be selected using the search_config parameter in conf/config.yaml. For example, by default, this parameter will be set to gpt3/5b, so AutoConfigurator will search the optimal training HPs for a 5B parameter GPT model. The configuration for this model can be found in the conf/search_config/gpt3/5b.yaml file. To configure the behavior of the HP search, modify the train_settings parameters in the corresponding YAML file. To run the training AutoConfigurator HP search after all the parameters are set, you should call python3 main.py.

5.3.2.2.3.2. Base Command Platform

To run AutoConfigurator in BCP, the cluster_type parameter must be set to bcp. All the parameters can be configured through CLI overrides. For example, to launch a training HP search for the 126m GPT model, run this command:

python3 /opt/NeMo-Megatron-Launcher/auto_configurator/main.py search_config=gpt3/0.126b run_inference_hp_search=False auto_configurator_path=/opt/NeMo-Megatron-Launcher/auto_configurator data_dir=/mount/data/the_pile_gpt3 base_results_dir=/mount/results/auto_configurator search_config.train_settings.num_nodes=$NGC_ARRAY_SIZE cluster_type=bcp

This command assumes that the dataset directory and the results directory are datasets and workspaces mounted correctly. The user can also override any training parameter in the search_config dictionary by passing search_config.train_settings.* as a Hydra override. The values that can be overridden are shown below:

train_settings:
  model_size_in_b: 5 # unit in billion parameters
  num_nodes: 16
  gpus_per_node: 8
  gpu_memory_gb: 80  # Memory per GPU, in GB. Currently 40GB and 80GB A100s supported.
  max_training_days: 5 # unit in days
  limit_search_runs: 100 # Max number of runs to be launched in parallel for grid search.
  output_top_n: 10  # The result will print the top N fastest training configs.
  max_steps_per_run: 50 # Max steps per run for the grid search.
  max_minutes_per_run: 10 # minutes per run for the grid search.
  tflops_per_gpu: 140  # Estimated tflops per GPU.
  num_tokens_in_b: 300  # Unit in billions, typically 300B for GPT3 models.
  vocab_size: 51200
  logs: ${base_results_dir}/${search_config_value}_${.gpu_memory_gb}gb  # Example base_results_dir/gpt3/126m
  tensor_parallel_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8]
  pipeline_parallel_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8, 10]
  min_model_parallel_size: auto  # auto to use our recommendation, or a value for the minimum desired parallelism
  max_model_parallel_size: auto  # auto to use our recommendation, or a value for the maximum desired parallelism
  micro_batch_sizes: auto  # auto to use our recommendation, or a list, such as [1, 2, 4, 8, 16]
  act_ckpt_layers: auto  # auto to use our recommendation, or a list, such as [0, 1, 2, 3]

5.3.2.2.4. Inference AutoConfigurator HP Search

To run the inference HP search pipeline, the parameter run_inference_hp_search must be set to True in the conf/config.yaml file. The model used to search the best inference HPs must be selected using the search_config parameter in conf/config.yaml. For example, by default, this parameter will be set to gpt3/5b, so AutoConfigurator will search the optimal inference HPs for a 5B parameter GPT model. The configuration for this model can be found in the conf/search_config/gpt3/5b.yaml file. To configure the behavior of the HP search, modify the inference_settings parameters in the corresponding YAML file.

5.3.2.3. Running Custom Model Size Configs

AutoConfigurator is capable of recommending a model size, based on your hardware and training time constraints. For instance, if you want to train a GPT model, but don't know what model size is appropriate, you can input the number of nodes (and GPUs per node) available in your cluster and the amount of time you want to spend training the model, and AutoConfigurator will recommend a model size that can be trained in that time with your hardware. To see an example of this, you can look at the conf/search_config/gpt3/unknown_size.yaml file. In this file, the model_size_in_b parameter is set to null, which tells AutoConfigurator to recommend a model size.

For the recommendation to work correctly, the num_nodes, gpus_per_node, and max_training_days parameters must indicate the number of nodes and GPUs per node available, and how long you wish to train the model for. Also, AutoConfigurator needs to know the vocabulary size, the number of tokens you will train the model for, and the estimated TFLOPS per GPU your hardware can achieve. These can be modified using the vocab_size, num_tokens_in_b, and tflops_per_gpu parameters, respectively.

Once all these parameters are set correctly, and after selecting gpt3/unknown_size as the config to run in the search_config parameter in the conf/config.yaml file, the training pipeline can be executed by calling python3 main.py. This will produce a base configuration for the suggested model size. If run_training_hp_search or run_inference_hp_search is set to True, it will also search for the HPs for training or inference, using the rest of the configuration yaml file as input. AutoConfigurator will behave the same way as when using a predefined config.

5.3.2.4. Interpreting the Results

When AutoConfigurator generates the base configuration for a model, it will be saved inside the directory specified in the logs parameter in your config files. By default, this will be .../results/<model_name>/<model_size>_<gpu_mem_size>/. As the default search_config value is set to gpt3/5b and the default gpu_memory_gb is set to 80, the results can be found in the .../results/gpt3/5b_80gb/ directory. The base config will be available inside that directory, with the name base_cfg_<model_size>.yaml.
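The directory layout described above can be reproduced with a few lines of path handling, which is convenient when scripting around the results. This is a sketch under the assumption that <model_size> in the file name is the same string as in the search_config value (for example 5b); the helper name and the base directory are illustrative.

```python
from pathlib import PurePosixPath


def base_config_path(base_results_dir, search_config, gpu_memory_gb):
    """Build the expected location of the generated base config."""
    model_name, model_size = search_config.split("/")   # e.g. "gpt3", "5b"
    logs_dir = PurePosixPath(base_results_dir) / model_name / f"{model_size}_{gpu_memory_gb}gb"
    return logs_dir / f"base_cfg_{model_size}.yaml"


path = base_config_path(
    "/opt/NeMo-Megatron-Launcher/auto_configurator/results",  # illustrative base_results_dir
    "gpt3/5b",
    80,
)
print(path)
```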

If the training HP search pipeline is run, the results will be in three different directories inside your logs directory. The candidate_configs directory contains all the YAML files with all the configurations generated by the HP search. The training_logs directory contains all the logs of training each of the individual configs AutoConfigurator generated. If limit_search_runs was set to 30, then there should be 30 different directories with the logs for each model.

Finally, after all the training runs have finished and the final run has analyzed the throughput of each configuration, the final model recommendation will be stored in the final_results directory. This directory will contain a log file which lists the output_top_n fastest configs, sorted from fastest to slowest. The directory will also contain a CSV file with all the results from every config that was run with AutoConfigurator for a given model size. The results will be sorted from highest to lowest throughput. The CSV file also includes information such as the samples per second achieved by each model, the time per global step, the TFLOPS per GPU achieved, and so on. The final_results directory will also contain a YAML file, which corresponds to the config with the lowest training time. This is the recommended model for training.

For the inference HP search, the results can be found inside the directory specified in the results_dir parameter of the YAML config file. Inside that directory, you will find: .../inference/final_summary/final_output.csv. This csv file will have the results of every model that was run by the AutoConfigurator HP search.
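Either CSV file can be post-processed with a few lines of standard-library Python, for example to re-rank configurations by a metric of your choice. The sketch below uses an in-memory CSV with hypothetical column names and synthetic placeholder values (not measurements); check the header of the real file and adjust the names accordingly.

```python
import csv
import io

# Synthetic placeholder data; the column names are assumptions, not the tool's real header.
sample_csv = """tp,pp,mbs,act_ckpt_layers,samples_per_s
1,1,4,0,10.0
2,1,4,0,12.5
2,2,2,1,8.0
"""


def fastest_configs(csv_file, top_n, metric="samples_per_s"):
    """Return the top_n rows with the highest value of the given metric."""
    rows = list(csv.DictReader(csv_file))
    rows.sort(key=lambda row: float(row[metric]), reverse=True)
    return rows[:top_n]


best = fastest_configs(io.StringIO(sample_csv), top_n=2)
for row in best:
    print(row["tp"], row["pp"], row["mbs"], row["samples_per_s"])
```

For a real run, replace io.StringIO(sample_csv) with an open file handle for the CSV in final_results or final_summary.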

Notes:

  • The result of the Training HP Search will vary when it is run with different numbers of nodes. This is mainly caused by the new distributed optimizer, which provides higher memory savings when using more nodes (i.e. higher data parallel value).

5.3.2.5. Logging Runs with Weights and Biases

Weights and Biases (W&B) can be used to log all the training search runs. To achieve this, the wandb parameters must be modified in the conf/config.yaml file. First, enable must be set to True. Then, api_key_file must point to the file that contains the W&B API key. The API key must be in the first line of that file. Finally, the project parameter must have the name of the W&B project where the metrics will be stored. The name of each run does not need to be provided. It will be automatically generated by AutoConfigurator, using the model name, model size, and hyper-parameters used for each specific run.

wandb:  # Weights and Biases (W&B) logging.
    enable: True 
    api_key_file: null
    project: nemo-megatron-autoconfig
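The "key on the first line" convention means the file can contain anything after that line. A minimal sketch of how such a file could be read is shown below; the function name is hypothetical and this is not the tool's own loader.

```python
import tempfile
from pathlib import Path


def read_wandb_api_key(api_key_file):
    """Return the API key stored on the first line of api_key_file."""
    with open(api_key_file, encoding="utf-8") as f:
        key = f.readline().strip()
    if not key:
        raise ValueError(f"No API key found on the first line of {api_key_file}")
    return key


# Demonstration with a throwaway file and a dummy key.
with tempfile.TemporaryDirectory() as tmp:
    key_file = Path(tmp) / "wandb_api_key.txt"
    key_file.write_text("dummy-key-for-illustration\nthis line is ignored\n", encoding="utf-8")
    api_key = read_wandb_api_key(key_file)

print(api_key)
```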

5.4. Training with Custom Configurations

The training config files can be modified, or other files can be created to be used for training. They should follow the same structure and guidelines as the existing model configurations.

5.4.1. Example: Changing Embedding Type for T5 Models

Here we show an example of changing the embedding type for T5 models. Assume you want to train a 220M T5 model and, instead of the default absolute learnable position embeddings, you want to use relative position embeddings.

First, check the training configuration file in conf/training/(model_type)/(model_size).yaml. In this case it will be conf/training/t5/220m.yaml. The configuration file lists all the options we support. The parameters of interest in this case are:

position_embedding_type: 'learned_absolute' # Position embedding type. Options ['learned_absolute', 'relative']
relative_attention_num_buckets: 32 # Relative position number of buckets for computing the bias
relative_attention_max_distance: 128 # max_distance to keep relative distance in the attention_num_buckets.

For Slurm based systems, you can directly modify the configuration file in place. In this case, you can change the three lines above to

position_embedding_type: 'relative' # Position embedding type. Options ['learned_absolute', 'relative']
relative_attention_num_buckets: 32 # Relative position number of buckets for computing the bias
relative_attention_max_distance: 128 # max_distance to keep relative distance in the attention_num_buckets.

and submit the training job.

For BCP, you can override the default configuration by adding the argument training.model.position_embedding_type='relative' when submitting the training job.

For more details of submitting training jobs on Slurm and BCP, please check Section 5.6.
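To make the relationship between the command-line override and the YAML file explicit, the sketch below shows how a dotted key maps onto the nested configuration. Hydra performs this step itself (with its own parsing and type handling); the apply_override helper here is purely illustrative and is not part of Hydra or the launcher.

```python
def apply_override(config, override):
    """Apply a single 'a.b.c=value' style override to a nested dict (illustration only)."""
    dotted_key, value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value.strip("'\"")   # drop the surrounding quotes, keep the value as a string
    return config


config = {
    "training": {
        "model": {
            "position_embedding_type": "learned_absolute",
            "relative_attention_num_buckets": 32,
            "relative_attention_max_distance": 128,
        }
    }
}

apply_override(config, "training.model.position_embedding_type='relative'")
print(config["training"]["model"]["position_embedding_type"])
```

The other two parameters keep their values from the YAML file, which matches the in-place edit shown for Slurm.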

5.5. Bring Your Own Dataset

If you want to train the GPT, T5, or mT5 models on your own dataset (which is already filtered and cleaned), you must first convert the dataset files to jsonl files.
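A jsonl file holds one JSON object per line. The sketch below writes a list of documents in that form. The "text" field name is an assumption (a common choice for preprocessing scripts); use whichever key your preprocessing configuration expects. The file name is illustrative.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(documents, output_path, text_key="text"):
    """Write one JSON object per line, storing each document under text_key."""
    with open(output_path, "w", encoding="utf-8") as f:
        for doc in documents:
            # json.dumps escapes newlines, so every document stays on a single line.
            f.write(json.dumps({text_key: doc}, ensure_ascii=False) + "\n")


with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "my_dataset_00.jsonl"
    write_jsonl(["First document.", "Second document,\nwith a line break."], out_path)
    records = [json.loads(line) for line in out_path.read_text(encoding="utf-8").splitlines()]

print(records)
```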

As discussed in previous sections, the data_preparation parameter in conf/config.yaml specifies which file to use for data preparation configuration purposes. The data_preparation parameter needs to be specified as generic/custom_dataset for bringing your own dataset and data_preparation must be included in stages to run it. The custom_dataset config file can be found in conf/data_preparation/generic/custom_dataset.yaml. With our scripts, you can train your own tokenizer and preprocess your own dataset into a format that can be consumed by our training scripts.

Custom dataset only supports SentencePiece tokenizers at the moment. You can either train a fresh SentencePiece tokenizer with our scripts or load existing ones.

The data preparation can be parallelized by using multiple nodes (by default 20 nodes) to preprocess all custom dataset files in parallel.

5.5.1. Slurm

First, ensure the cluster related configuration in the conf/cluster/bcm.yaml file is correct. The cluster and cluster_type parameters in conf/config.yaml must be set to bcm. Then, modify the time_limit or any other parameter related to the job in the custom_dataset.yaml file. The data preparation can be parallelized by using nodes * workers_per_node workers (up to one worker for each dataset file).
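The worker limit can be illustrated with a simple round-robin assignment. This is only a sketch of the idea; the launcher's own mapping logic may distribute files differently, and the shard file names are placeholders.

```python
def assign_files_to_workers(files, nodes, workers_per_node):
    """Distribute dataset files over at most one worker per file, round-robin."""
    num_workers = min(nodes * workers_per_node, len(files))
    assignment = {worker: [] for worker in range(num_workers)}
    for index, name in enumerate(files):
        assignment[index % num_workers].append(name)
    return assignment


files = [f"shard_{i:02d}.jsonl" for i in range(10)]  # placeholder file names

# 4 nodes * 4 workers = 16 workers, but only 10 files: 10 workers, one file each.
wide = assign_files_to_workers(files, nodes=4, workers_per_node=4)
print(len(wide))

# 1 node * 4 workers: each worker handles two or three files.
narrow = assign_files_to_workers(files, nodes=1, workers_per_node=4)
print(narrow[0])
```

Adding nodes beyond the number of dataset files therefore does not speed up preprocessing in this scheme.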

Example:

To run only the data preparation pipeline and not the training, evaluation or inference pipelines, set the conf/config.yaml file to:

stages:
  - data_preparation

And then run:

python3 main.py

5.5.2. Base Command Platform

In order to run the data preparation script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using Hydra. By default, the data preparation script will put the preprocessed data into the data/ directory. We recommend that the data_dir parameter is set to a workspace, so that the data is visible across multiple jobs later on. The tokenizer model files should also be stored in the same workspace as the dataset, for later usage. The data preparation code must be launched in a multi-node job. It can be parallelized across multiple nodes, up to a number of nodes equal to the number of custom dataset files, for faster preparation of the dataset.

To run the data preparation pipeline, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[data_preparation] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts \
data_dir=/mount/data \
base_results_dir=/mount/results data_preparation=generic/custom_dataset \
data_preparation.train_tokenizer_args.input=/path/to/text/file/for/training/tokenizer \
data_preparation.raw_dataset_files=[/path/to/custom_data_files] \
>> /results/data_custom_dataset_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. Stdout and stderr are redirected to the /results/data_custom_dataset_log.txt file, so it can be downloaded from NGC. Any other parameter can also be added to the command to modify its behavior.

5.5.3. Common

Set the configuration for the custom data preparation job in the YAML file:

run:
  name: custom_dataset
  results_dir: ${base_results_dir}/${.name}
  time_limit: "24:00:00"
  dependency: "singleton"
  node_array_size: 4
  cpus_per_node: 256
  workers_per_node: 4 # Number of workers per node in preprocessing step.
dataset: custom_dataset
custom_dataset_dir: ${data_dir}/custom_dataset
train_tokenizer: True # True to train a sentence piece tokenizer
train_tokenizer_args: # For all options please check: https://github.com/google/sentencepiece/blob/master/doc/options.md
   input: null # text file for training tokenizer
   input_format: "text" # text or tsv
   model_prefix: "custom_sp_tokenizer"
   model_type: "bpe" # model algorithm: unigram, bpe, word or char
   vocab_size: 8000 # Vocabulary size
   character_coverage: 0.9995 # character coverage to determine the minimum symbols
   unk_id: 1
   bos_id: 2
   eos_id: 3
   pad_id: 0
bpe_save_dir: ${.custom_dataset_dir}/bpe # Dir to save sentence piece tokenizer model and vocab files

preprocess_data: True  # True to preprocess the data from json, jsonl or json.gz files, False otherwise.
raw_dataset_files:
  - null # Each file should be input json, jsonl or json.gz file
tokenizer_model: ${.bpe_save_dir}/${data_preparation.train_tokenizer_args.model_prefix}.model # trained SentencePiece tokenizer model
preprocess_worker_mapping: ${.custom_dataset_dir}/preprocess_mapping
preprocessed_dir: ${.custom_dataset_dir}/preprocessed

Note: depending on the dataset and system, preprocessing very large dataset shard files can exhaust system memory (OOM). The solution is to reduce the dataset shard sizes. If you see this issue, consider breaking up the json, jsonl or json.gz files into smaller chunks before running preprocessing.
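One way to break up a large shard is sketched below for uncompressed jsonl input. The function name, the output naming scheme and the lines_per_shard parameter are assumptions made for illustration; the file is streamed line by line so the splitter itself does not need to hold the shard in memory.

```python
from pathlib import Path


def split_jsonl(src: str, out_dir: str, lines_per_shard: int) -> list:
    """Split a .jsonl file into shards of at most lines_per_shard lines each.

    Hypothetical helper; returns the list of shard paths that were written.
    """
    if lines_per_shard < 1:
        raise ValueError("lines_per_shard must be >= 1")
    src_path = Path(src)
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    shards = []
    out = None
    with src_path.open("r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i % lines_per_shard == 0:
                # Start a new shard every lines_per_shard lines.
                if out is not None:
                    out.close()
                shard = out_path / f"{src_path.stem}_{len(shards):05d}.jsonl"
                shards.append(shard)
                out = shard.open("w", encoding="utf-8")
            out.write(line)
    if out is not None:
        out.close()
    return shards
```

The resulting shard paths can then be listed under raw_dataset_files in place of the original file.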

5.6. Model Training

We provide an easy-to-use yet powerful pipeline to perform distributed training of GPT, T5, mT5 and BERT models across multiple nodes and GPUs. We also provide well-established recipes for models of different sizes, where the throughput has been maximized and the convergence properties of the models have been tested and confirmed.

5.6.1. GPT Training

The configuration used for the training pipeline must be specified in the conf/config.yaml file, by setting the training parameter to the file to use for training. The training stage must be included in stages to run the training pipeline. The default value is gpt3/5b, which can be found in conf/training/gpt3/5b.yaml. The parameters can be modified to adjust the hyperparameters of the training runs. All supported model types and sizes can be found in the conf/training folder.

We support global batch size rampup during training. It can be set by changing the rampup_batch_size parameter under the training config. It should be a list of 3 values: [<start_batch_size>, <batch_size_increment>, <rampup_samples>]. Example: rampup_batch_size=[256, 128, 50000000]. When batch size rampup is used, a nodes scheduler is created automatically. It allows a smaller number of nodes to be used for the stages with smaller batch sizes. The scheduler is built according to the training.trainer.num_nodes parameter, which corresponds to the maximum number of nodes you want to use for the maximum global batch size. Note that batch size rampup only works with the fused_adam optimizer for now.
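To make the three rampup values concrete, the sketch below computes the global batch size after a given number of consumed samples. It assumes the batch size grows in equal steps spread evenly over rampup_samples until it reaches the configured global batch size; the launcher's actual schedule may differ, and the target global batch size of 2048 is a hypothetical value chosen so the increments divide evenly.

```python
def rampup_global_batch_size(consumed_samples: int, start_batch_size: int,
                             batch_size_increment: int, rampup_samples: int,
                             global_batch_size: int) -> int:
    """Global batch size after consumed_samples, assuming a linear, stepwise ramp."""
    diff = global_batch_size - start_batch_size
    if diff < 0 or diff % batch_size_increment != 0:
        raise ValueError("global_batch_size must be start + k * increment, k >= 0")
    num_increments = diff // batch_size_increment
    if num_increments == 0 or consumed_samples >= rampup_samples:
        return global_batch_size
    # Integer arithmetic avoids floating-point rounding at step boundaries.
    steps = consumed_samples * num_increments // rampup_samples
    return start_batch_size + steps * batch_size_increment


# rampup_batch_size=[256, 128, 50000000] with a hypothetical final GBS of 2048
for samples in (0, 25_000_000, 50_000_000):
    print(samples, rampup_global_batch_size(samples, 256, 128, 50_000_000, 2048))
```

Under these assumptions, the early stages run with a fraction of the final batch size, which is why fewer nodes are sufficient at the start of the run.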

5.6.1.1. Slurm

Set configuration for your Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

And set the training job specific parameters in the conf/training/(model_type)/(model_size).yaml file, using the run section:

run:
    name: gpt3_126m
    results_dir: ${base_results_dir}/${.name}
    time_limit: "1-12:00:00"
    dependency: "singleton"

To run only the training pipeline and not the data preparation, evaluation or inference pipelines, set the conf/config.yaml file to:

stages:
  - training

And then run:

python3 main.py
5.6.1.2. Base Command Platform

Select the cluster related configuration following the NGC documentation. Then, use the python3 main.py command to launch the job and override the desired parameters from the training job parameters.

5.6.2. T5 Training

The configuration used for the training pipeline must be specified in the conf/config.yaml file, by setting the training parameter to the file to use for training. The training stage must be included in stages to run the training pipeline. The training parameter needs to be set to t5/(model_size) for T5 models. For example, one can use t5/220m, which can be found in conf/training/t5/220m.yaml. The parameters can be modified to adjust the hyperparameters of the training runs. All supported model types and sizes can be found in the conf/training folder.

5.6.2.1. Slurm

Set configuration for your Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

And set the training job specific parameters in the conf/training/(model_type)/(model_size).yaml file, using the run section:

run:
    name: t5_220m
    results_dir: ${base_results_dir}/${.name}
    time_limit: "7-00:00:00"
    dependency: "singleton"

To run only the training pipeline and not the data preparation, evaluation or inference pipelines, set the conf/config.yaml file to:

stages:
  - training

And then run:

python3 main.py
5.6.2.2. Base Command Platform

Select the cluster related configuration following the NGC documentation. Then, use the python3 main.py command to launch the job and override the desired parameters from the training job parameters.

5.6.3. mT5 Training

The configuration used for the training pipeline must be specified in the conf/config.yaml file, by setting the training parameter to the file to use for training. The training stage must be included in stages to run the training pipeline. The training parameter needs to be set to mt5/(model_size) for mT5 models. For example, one can use mt5/390m, which can be found in conf/training/mt5/390m.yaml. The parameters can be modified to adjust the hyperparameters of the training runs. All supported model types and sizes can be found in the conf/training folder.

5.6.3.1. Slurm

Set configuration for your Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

And set the training job specific parameters in the conf/training/(model_type)/(model_size).yaml file, using the run section:

run:
    name: mt5_390m
    results_dir: ${base_results_dir}/${.name}
    time_limit: "7-00:00:00"
    dependency: "singleton"

To run only the training pipeline and not the data preparation, evaluation or inference pipelines, set the conf/config.yaml file to:

stages:
  - training

And then run:

python3 main.py
5.6.3.2. Base Command Platform

Select the cluster related configuration following the NGC documentation. Then, use the python3 main.py command to launch the job and override the desired parameters from the training job parameters.

5.6.4. BERT Training

The configuration used for the training pipeline must be specified in the conf/config.yaml file, by setting the training parameter to the file to use for training. The training stage must be included in stages to run the training pipeline. The training parameter needs to be set to bert/(model_size) for BERT models. For example, one can use bert/110m, which can be found in conf/training/bert/110m.yaml. The parameters can be modified to adjust the hyperparameters of the training runs. All supported model types and sizes can be found in the conf/training folder.

5.6.4.1. Slurm

Set configuration for your Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

And set the training job specific parameters in the conf/training/(model_type)/(model_size).yaml file, using the run section:

run:
    name: bert_110m
    results_dir: ${base_results_dir}/${.name}
    time_limit: "7-00:00:00"
    dependency: "singleton"

To run only the training pipeline and not the data preparation, evaluation or inference pipelines, set the conf/config.yaml file to:

stages:
  - training

And then run:

python3 main.py
5.6.4.2. Base Command Platform

Select the cluster related configuration following the NGC documentation. Then, use the python3 main.py command to launch the job and override the desired parameters from the training job parameters.

5.7. Resuming Training with Different Number of Nodes

To resume a training run with a different number of nodes, we recommend keeping the global batch size (GBS) unchanged. This ensures that each training step will be almost identical, regardless of the number of nodes. The number of nodes selected must be compatible with the rest of the parameters: GBS must be a multiple of (MBS * num_gpus) / (tensor_parallelism * pipeline_parallelism)

where MBS is the micro batch size. For instance, the default GBS for the 5B GPT model is 1440; the MBS is 2; the number of GPUs is 20*8 = 160; the tensor_parallelism value is set to 2; and the pipeline_parallelism value is set to 1. Therefore, the GBS is set to a valid value:

1440 % ((2 * 160) / (2 * 1)) == 0
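The same check can be scripted to find which node counts are compatible with a given GBS. The sketch below is only an illustration of the rule stated above; the function name is hypothetical, and it additionally assumes that the number of GPUs must be divisible by tensor_parallelism * pipeline_parallelism.

```python
def is_valid_gbs(gbs: int, mbs: int, num_gpus: int,
                 tensor_parallelism: int, pipeline_parallelism: int) -> bool:
    """Check that GBS is a multiple of (MBS * num_gpus) / (TP * PP)."""
    model_parallel = tensor_parallelism * pipeline_parallelism
    if num_gpus % model_parallel != 0:
        # Assumption: GPUs must split evenly into model-parallel groups.
        return False
    data_parallel = num_gpus // model_parallel
    return gbs % (mbs * data_parallel) == 0


# 5B GPT defaults from the text: GBS=1440, MBS=2, TP=2, PP=1, 8 GPUs per node.
print(is_valid_gbs(1440, 2, 20 * 8, 2, 1))

valid_nodes = [n for n in range(1, 21) if is_valid_gbs(1440, 2, n * 8, 2, 1)]
print(valid_nodes)
```

For these defaults the condition reduces to 1440 being a multiple of 8 * nodes, so only node counts that divide 180 are usable.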

5.8. Checkpoint Conversion

We provide a simple tool to convert checkpoints from the .ckpt format to the .nemo format, which is later used for evaluation (for T5 models) and inference.

5.8.1. GPT Conversion

The configuration used for the checkpoint conversion needs to be specified in the conf/config.yaml file, specifying the conversion parameter, which specifies the file to use for conversion purposes. The default value is set to gpt3/convert_gpt3, which can be found in conf/conversion/gpt3/convert_gpt3.yaml for GPT models.

The conversion must be included in stages to run the conversion pipeline.

5.8.1.1. Common

To specify the input checkpoint to be used for conversion for GPT models, use the model parameters in conf/conversion/gpt3/convert_gpt3.yaml:

model:
    model_type: gpt # gpt or t5
    checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
    checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
    hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
    tensor_model_parallel_size: 2 # 1 for 126m, 2 for 5b, and 8 for 20b or larger models
    pipeline_model_parallel_size: 1 
    model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
    vocab_file: ${data_dir}/bpe/vocab.json
    merge_file: ${data_dir}/bpe/merges.txt

To specify the output location and file name of the converted .nemo file for GPT models, use the run parameters in conf/conversion/gpt3/convert_gpt3.yaml:

run:
    name: convert_${conversion.run.model_train_name}
    nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 gpus per node
    time_limit: "2:00:00"
    ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
    convert_name: convert_nemo
    model_train_name: gpt3_5b
    train_dir: ${base_results_dir}/${.model_train_name}
    results_dir: ${.train_dir}/${.convert_name}
    output_path: ${.train_dir}/${.convert_name}
    nemo_file_name: megatron_gpt.nemo # name of nemo checkpoint; must be .nemo file
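The nodes and ntasks_per_node entries above are derived from the model parallel size. The sketch below reproduces that arithmetic in plain Python as an illustration; it mimics the multiply and divide_ceil expressions from the config and is not the launcher's own resolver code.

```python
def divide_ceil(a: int, b: int) -> int:
    """Integer division rounded up, as the divide_ceil expression suggests."""
    return -(-a // b)


def conversion_layout(tensor_model_parallel_size: int,
                      pipeline_model_parallel_size: int,
                      gpus_per_node: int = 8) -> tuple:
    """Return (nodes, ntasks_per_node) for a conversion job."""
    model_parallel_size = tensor_model_parallel_size * pipeline_model_parallel_size
    nodes = divide_ceil(model_parallel_size, gpus_per_node)
    ntasks_per_node = divide_ceil(model_parallel_size, nodes)
    return nodes, ntasks_per_node


print(conversion_layout(1, 1))  # e.g. a 126m model
print(conversion_layout(2, 1))  # e.g. a 5b model
print(conversion_layout(8, 4))  # hypothetical larger layout spanning several nodes
```

A conversion job therefore needs one task per model-parallel rank, packed onto as few 8-GPU nodes as possible.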
5.8.1.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the conversion pipeline and not the data preparation, training, evaluation or inference pipelines set the conf/config.yaml file to:

stages:
  - conversion

then run:

python3 main.py
5.8.1.3. Base Command Platform

In order to run the conversion script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. The conversion script must be launched in a multi-node job.

To run the conversion pipeline to convert a 126M checkpoint stored in /mount/results/gpt3_126m/results/checkpoints, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[conversion] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results conversion.run.model_train_name=gpt3_126m conversion.model.vocab_file=/mount/data/bpe/vocab.json \
conversion.model.merge_file=/mount/data/bpe/merges.txt conversion.run.results_dir=/mount/results/gpt3_126m/convert_nemo \
conversion.model.checkpoint_folder=/mount/results/gpt3_126m/results/checkpoints conversion.model.tensor_model_parallel_size=1 \
>> /results/convert_gpt3_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/convert_gpt3_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.8.2. T5 Conversion

The configuration used for the checkpoint conversion needs to be specified in the conf/config.yaml file, specifying the conversion parameter, which specifies the file to use for conversion purposes. The conversion parameter needs to be set to t5/convert_t5 for T5 models, which can be found in conf/conversion/t5/convert_t5.yaml.

The conversion must be included in stages to run the conversion pipeline.

5.8.2.1. Common

To specify the input checkpoint to be used for conversion for T5 models, use the model parameters in conf/conversion/t5/convert_t5.yaml:

model:
    model_type: t5 # gpt or t5
    checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
    checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
    hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
    tensor_model_parallel_size: 1 # 1 for 220m, 2 for 3b
    pipeline_model_parallel_size: 1 
    model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
    vocab_file: ${data_dir}/bpe/vocab.txt
    merge_file: null

To specify the output location and file name of the converted .nemo file for T5 models, use the run parameters in conf/conversion/t5/convert_t5.yaml:

run:
    name: convert_${conversion.run.model_train_name}
    nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 gpus per node
    time_limit: "2:00:00"
    ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
    convert_name: convert_nemo
    model_train_name: t5_220m
    train_dir: ${base_results_dir}/${.model_train_name}
    results_dir: ${.train_dir}/${.convert_name}
    output_path: ${.train_dir}/${.convert_name}
    nemo_file_name: megatron_t5.nemo # name of nemo checkpoint; must be .nemo file
5.8.2.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the conversion pipeline and not the data preparation, training, evaluation or inference pipelines set the conf/config.yaml file to:

stages:
  - conversion

then run:

python3 main.py
5.8.2.3. Base Command Platform

In order to run the conversion script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. The conversion script must be launched in a multi-node job.

To run the conversion pipeline to convert a T5 220M checkpoint stored in /mount/results/t5_220m/results/checkpoints, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py conversion=t5/convert_t5 \
stages=[conversion] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results conversion.model.vocab_file=/mount/data/bpe/vocab.txt \
conversion.run.model_train_name=t5_220m conversion.run.results_dir=/mount/results/t5_220m/convert_nemo \
conversion.model.checkpoint_folder=/mount/results/t5_220m/results/checkpoints \
conversion.model.tensor_model_parallel_size=1 conversion.model.pipeline_model_parallel_size=1 \
>> /results/convert_t5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/convert_t5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.8.3. mT5 Conversion

The configuration used for the checkpoint conversion needs to be specified in the conf/config.yaml file, specifying the conversion parameter, which specifies the file to use for conversion purposes. The conversion parameter needs to be set to mt5/convert_mt5 for mT5 models, which can be found in conf/conversion/mt5/convert_mt5.yaml.

The conversion must be included in stages to run the conversion pipeline.

5.8.3.1. Common

To specify the input checkpoint to be used for conversion for mT5 models, use the model parameters in conf/conversion/mt5/convert_mt5.yaml:

model:
  model_type: t5 # gpt or t5, use t5 for mt5 as well
  checkpoint_folder: ${conversion.run.train_dir}/results/checkpoints
  checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
  hparams_file: ${conversion.run.train_dir}/results/hparams.yaml
  tensor_model_parallel_size: 1 
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  vocab_file: null
  merge_file: null
  tokenizer_model: ${data_dir}/mc4/bpe/mt5_tokenizer.model

To specify the output location and file name of the converted .nemo file for mT5 models, use the run parameters in conf/conversion/mt5/convert_mt5.yaml:

run:
  name: convert_${conversion.run.model_train_name}
  nodes: ${divide_ceil:${conversion.model.model_parallel_size}, 8} # 8 gpus per node
  time_limit: "2:00:00"
  ntasks_per_node: ${divide_ceil:${conversion.model.model_parallel_size}, ${.nodes}}
  convert_name: convert_nemo
  model_train_name: mt5_390m
  train_dir: ${base_results_dir}/${.model_train_name}
  results_dir: ${.train_dir}/${.convert_name}
  output_path: ${.train_dir}/${.convert_name}
  nemo_file_name: megatron_mt5.nemo # name of nemo checkpoint; must be .nemo file
5.8.3.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the conversion pipeline and not the data preparation, training, evaluation or inference pipelines set the conf/config.yaml file to:

stages:
  - conversion

then run:

python3 main.py
5.8.3.3. Base Command Platform

In order to run the conversion script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. The conversion script must be launched in a multi-node job.

To run the conversion pipeline to convert a mT5 390M checkpoint stored in /mount/results/mt5_390m/results/checkpoints, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py conversion=mt5/convert_mt5 \
stages=[conversion] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts \
data_dir=/mount/data \
conversion.run.model_train_name=mt5_390m \
base_results_dir=/mount/results conversion.run.results_dir=/mount/results/mt5_390m/convert_nemo \
conversion.model.checkpoint_folder=/mount/results/mt5_390m/results/checkpoints \
conversion.model.tensor_model_parallel_size=1 conversion.model.pipeline_model_parallel_size=1 \
>> /results/convert_mt5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/convert_mt5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.9. Model Fine-tuning

We also provide an easy-to-use tool to help fine-tune the trained checkpoints on SQuAD for T5 models and on XQuAD for mT5 models. Fine-tuning for GPT models is not supported.

5.9.1. T5 Fine-tuning

The configuration used for fine-tuning must be specified in the conf/config.yaml file, by setting the fine_tuning parameter to the file to use for fine-tuning. The fine_tuning stage must be included in stages to run the fine-tuning pipeline. To fine-tune a checkpoint on the squad task, set the fine_tuning parameter to t5/squad, which can be found in conf/fine_tuning/t5/squad.yaml. The parameters can be modified to adapt to different GLUE tasks and checkpoints in fine-tuning runs. The fine-tuning hyperparameters need to be tuned to reach the best accuracy for a specific GLUE task. The provided hyperparameters are only optimized for the T5 220M model on the squad task.

5.9.1.1. Common

To specify which task to run for fine-tuning, use the run.task_name parameter. Use the other run parameters to define the job-specific config:

run:
    name: ${.task_name}_${.model_train_name}
    time_limit: "04:00:00"
    dependency: "singleton"
    convert_name: convert_nemo
    model_train_name: t5_220m
    task_name: "squad"
    results_dir: ${base_results_dir}/${.model_train_name}/${.task_name}

To specify which model checkpoint to load and its definition, use the model parameter:

model:
    restore_from_path: ${base_results_dir}/${fine_tuning.run.model_train_name}/${fine_tuning.run.convert_name}/megatron_t5.nemo # Path to a trained T5 .nemo file
    tensor_model_parallel_size: 1
    pipeline_model_parallel_size: 1
5.9.1.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the fine-tuning pipeline and not the data preparation, training, conversion or inference pipelines, set the conf/config.yaml file to:

stages:
  - fine_tuning

then run:

python3 main.py
5.9.1.3. Base Command Platform

To run the fine-tuning script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using Hydra. The fine-tuning script must be launched in a multi-node job.

To run the fine-tuning pipeline to fine-tune a 220M T5 model converted checkpoint stored in /mount/results/t5_220m/convert_nemo, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py fine_tuning=t5/squad stages=[fine_tuning] \
 cluster_type=bcp \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
fine_tuning.run.model_train_name=t5_220m \
fine_tuning.model.restore_from_path=/mount/results/t5_220m/convert_nemo/results/megatron_t5.nemo \
>> /results/finetune_t5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/finetune_t5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.9.2. mT5 Fine-tuning

The XQuAD benchmark is supported for mT5 models.

The configuration used for fine-tuning must be specified in the conf/config.yaml file, by setting the fine_tuning parameter to the file to use for fine-tuning. The fine_tuning stage must be included in stages to run the fine-tuning pipeline. To fine-tune a checkpoint on the xquad task, set the fine_tuning parameter to mt5/xquad, which can be found in conf/fine_tuning/mt5/xquad.yaml.

5.9.2.1. Common

To specify which task to run for fine-tuning, use the run.task_name parameter. Use the other run parameters to define the job-specific config:

run:
  name: ${.task_name}_${.model_train_name}
  time_limit: "04:00:00"
  dependency: "singleton"
  convert_name: convert_nemo
  model_train_name: mt5_390m
  task_name: "xquad"
  results_dir: ${base_results_dir}/${.model_train_name}/${.task_name}

To specify which model checkpoint to load and its definition, use the model parameter:

model:
  restore_from_path: ${base_results_dir}/${fine_tuning.run.model_train_name}/${fine_tuning.run.convert_name}/megatron_mt5.nemo # Path to a trained mt5 .nemo file
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
5.9.2.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the fine-tuning pipeline and not the data preparation, training, conversion or inference pipelines, set the conf/config.yaml file to:

stages:
  - fine_tuning

then run:

python3 main.py
5.9.2.3. Base Command Platform

To run the fine-tuning script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using Hydra. The fine-tuning script must be launched in a multi-node job.

To run the fine-tuning pipeline to fine-tune a 390M mT5 model converted checkpoint stored in /mount/results/mt5_390m/convert_nemo, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py  fine_tuning=mt5/xquad stages=[fine_tuning] \
 cluster_type=bcp \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
fine_tuning.run.model_train_name=mt5_390m \
fine_tuning.model.restore_from_path=/mount/results/mt5_390m/convert_nemo/results/megatron_mt5.nemo \
>> /results/finetune_mt5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/finetune_mt5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.9.3. Fine-tuning on Custom Tasks

We also support fine-tuning on custom downstream tasks for T5 and mT5. To benchmark on your own dataset, you must split the original dataset into two files: a txt file containing the source (context) data, and a txt file containing the target data. Corresponding lines of the two files form one fine-tuning sample.

Custom fine-tuning configuration files can be found in conf/fine_tuning/t5/custom_task.yaml for T5 models and conf/fine_tuning/mt5/custom_task.yaml for mT5 models. The essential parameters are listed below. You need to specify the dataset paths and preferred benchmark metrics.

  data:
    train_ds:
      src_file_name: ??? # Path to the txt file corresponding to the source data.
      tgt_file_name: ??? # Path to the txt file corresponding to the target data.

    validation_ds:
      src_file_name: ??? # Path to the txt file corresponding to the source data.
      tgt_file_name: ??? # Path to the txt file corresponding to the target data.
      metric:
        name: "exact_string_match" # Name of the evaluation metric to use.
        average: null # Average the metric over the dataset. Options: ['macro', 'micro']. Works only for 'F1', 'accuracy' etc. Refer to torchmetrics for metrics where this is supported.
        num_classes: null

You can follow the instructions in T5 and mT5 fine-tuning sections to submit a custom task job.
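As an illustration of the two-file layout described above, the sketch below writes a list of (source, target) pairs to a pair of line-aligned txt files. The helper name and the whitespace normalization are assumptions; the only requirement taken from the text is that each sample occupies one line in each file.

```python
def write_parallel_files(pairs, src_path: str, tgt_path: str) -> int:
    """Write (source, target) pairs to two line-aligned txt files.

    Hypothetical helper; returns the number of samples written.
    """
    count = 0
    with open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for source, target in pairs:
            # Collapse internal newlines and runs of whitespace so that each
            # sample stays on exactly one line in both files.
            src.write(" ".join(source.split()) + "\n")
            tgt.write(" ".join(target.split()) + "\n")
            count += 1
    return count
```

The two output paths would then be used as src_file_name and tgt_file_name in the train_ds or validation_ds section of the custom task config.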

5.10. Model Prompt Learning

Within NeMo Framework we refer to p-tuning and prompt tuning methods collectively as prompt learning. Both methods are parameter-efficient alternatives to fine-tuning pretrained language models. Our NeMo implementation makes it possible to use one pretrained GPT, T5 or mT5 model on many downstream tasks without needing to tune the model's full set of parameters. It also allows for adding new tasks to your model without overwriting or disrupting previous tasks for which the model has already been p-tuned/prompt-tuned. Because the original model parameters are frozen and never altered by either method, p-tuning/prompt-tuning also avoid the catastrophic forgetting issues often encountered when fine-tuning models.

Instead of selecting discrete text prompts in a manual or automated fashion, prompt tuning and p-tuning utilize virtual prompt embeddings that can be optimized via gradient descent. The only difference between prompt tuning and p-tuning within NeMo-Megatron is the architecture used to tune the soft prompt tokens during training.
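The idea of optimizing virtual prompt embeddings while the model stays frozen can be shown with a deliberately tiny example. The sketch below is a toy in pure Python and is not NeMo's implementation: the "model" is a fixed linear head over mean-pooled embeddings, the loss is a squared error, and all names and values are made up for illustration. Only the prompt vectors are updated by gradient descent.

```python
def forward(prompt, inputs, w):
    """Frozen toy model: mean-pool the sequence, then apply a fixed linear head."""
    seq = prompt + inputs  # virtual prompt embeddings are prepended to the inputs
    dim = len(w)
    mean = [sum(vec[d] for vec in seq) / len(seq) for d in range(dim)]
    return sum(w[d] * mean[d] for d in range(dim))


def loss(prompt, inputs, w, target):
    return 0.5 * (forward(prompt, inputs, w) - target) ** 2


def tune_prompt_step(prompt, inputs, w, target, lr):
    """One gradient-descent step that updates only the virtual prompt embeddings."""
    n = len(prompt) + len(inputs)
    err = forward(prompt, inputs, w) - target
    # d(loss)/d(prompt[i][d]) = err * w[d] / n; w and inputs receive no update.
    return [[p[d] - lr * err * w[d] / n for d in range(len(w))] for p in prompt]


w = [1.0, -1.0]                     # frozen model weights
inputs = [[0.5, 0.2], [0.1, 0.4]]   # frozen input token embeddings
prompt = [[0.0, 0.0], [0.0, 0.0]]   # two trainable virtual tokens

initial_loss = loss(prompt, inputs, w, target=1.0)
for _ in range(20):
    prompt = tune_prompt_step(prompt, inputs, w, target=1.0, lr=1.0)
final_loss = loss(prompt, inputs, w, target=1.0)
print(initial_loss, final_loss)
```

Because a separate set of prompt vectors can be kept per task, new tasks can be added without touching the frozen weights or the prompts learned for earlier tasks.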

For more details of our implementation, please check Prompt Learning in NeMo.

5.10.1. GPT Prompt Learning

The SQuAD v1.1 benchmark is supported for prompt learning. With the default prompt learning config file, our scripts will download the original SQuAD v1.1 dataset and preprocess it into the prompt learning dataset format. You can also bring your own task dataset as long as it has been processed into the prompt learning dataset format.

The configuration used for prompt learning must be defined in the conf/config.yaml file by setting the prompt_learning parameter to the file to use for prompt learning. The prompt_learning stage must be included in stages to run the prompt learning pipeline. To run prompt learning on the squad task, set the prompt_learning parameter to gpt3/squad, which can be found in conf/prompt_learning/gpt3/squad.yaml. Optimizations from the base GPT model, such as sequence parallelism, can also be used during prompt learning. To enable this, set model.sequence_parallel=True.

5.10.1.1. Common

To specify the configuration for prompt learning, use all the run parameters to define the job specific config:

run:
  name: ${.task_name}_${.model_train_name}
  time_limit: "04:00:00"
  dependency: "singleton"
  convert_name: convert_nemo
  model_train_name: gpt3_5b
  task_name: "squad"
  results_dir: ${base_results_dir}/${.model_train_name}/prompt_learning_${.task_name}

To specify which language model checkpoint to load and its definition, use the model parameter:

model:
  language_model_path: ${base_results_dir}/${prompt_learning.run.model_train_name}/${prompt_learning.run.convert_name}/megatron_gpt.nemo # Restore language model from pre-trained .nemo checkpoint
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
5.10.1.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: 1
gpus_per_node: null
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the prompt learning pipeline and not the data preparation, training, conversion or other pipelines set the conf/config.yaml file to:

stages:
  - prompt_learning

then run:

python3 main.py
5.10.1.3. Base Command Platform

To run the prompt learning script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using Hydra. The prompt learning script must be launched in a multi-node job.

To run the prompt learning pipeline to prompt-learn a 5B GPT model converted checkpoint stored in /mount/results/gpt3_5b/convert_nemo, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py prompt_learning=gpt3/squad \
stages=[prompt_learning] cluster_type=bcp \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
prompt_learning.run.model_train_name=gpt3_5b \
prompt_learning.model.language_model_path=/mount/results/gpt3_5b/convert_nemo/results/megatron_gpt.nemo \
>> /results/prompt_learning_gpt3_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/prompt_learning_gpt3_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.
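
Long override lists like the one above are easy to mistype, so it can help to assemble them programmatically. The following is a minimal sketch, not part of the launcher: build_launcher_command is a hypothetical helper that joins key=value overrides into a shell command and quotes any argument containing characters the shell would otherwise interpret (such as the square brackets in stages).

```python
import shlex


def build_launcher_command(main_py, overrides, log_file=None):
    """Return a shell command string for main.py with key=value overrides.

    Hypothetical helper for illustration; it is not provided by the launcher.
    """
    parts = ["python3", main_py]
    parts += [f"{key}={value}" for key, value in overrides.items()]
    # shlex.quote leaves safe arguments untouched and single-quotes the rest.
    command = " ".join(shlex.quote(part) for part in parts)
    if log_file is not None:
        # Append stdout and stderr to the log file, as in the command above.
        command += f" >> {shlex.quote(log_file)} 2>&1"
    return command


launcher_dir = "/opt/NeMo-Megatron-Launcher/launcher_scripts"
overrides = {
    "prompt_learning": "gpt3/squad",
    "stages": "[prompt_learning]",
    "cluster_type": "bcp",
    "launcher_scripts_path": launcher_dir,
    "data_dir": "/mount/data",
    "base_results_dir": "/mount/results",
    "prompt_learning.run.model_train_name": "gpt3_5b",
    "prompt_learning.model.language_model_path":
        "/mount/results/gpt3_5b/convert_nemo/results/megatron_gpt.nemo",
}

command = build_launcher_command(
    f"{launcher_dir}/main.py",
    overrides,
    log_file="/results/prompt_learning_gpt3_log.txt",
)
print(command)
```

The same helper can be reused for the other stages by changing the keys in overrides.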

5.10.2. T5 and mT5 Prompt Learning

The prompt learning configuration is selected in the conf/config.yaml file through the prompt_learning parameter, which names the config file to use. The prompt_learning stage must also be included in stages to run the prompt learning pipeline. To run prompt learning on the SQuAD task, set the prompt_learning parameter to t5/squad, which refers to conf/prompt_learning/t5/squad.yaml, for T5 models (or to mt5/squad, which refers to conf/prompt_learning/mt5/squad.yaml, for mT5 models).

5.10.2.1. Common

To specify the configuration for prompt learning, use all the run parameters to define the job specific config:

run:
  name: ${.task_name}_${.model_train_name}
  time_limit: "04:00:00"
  dependency: "singleton"
  convert_name: convert_nemo
  model_train_name: t5_220m # or mt5_390m
  task_name: "squad"
  results_dir: ${base_results_dir}/${.model_train_name}/prompt_learning_${.task_name}

To specify which language model checkpoint to load and its definition, use the model parameter:

model:
  language_model_path: ${base_results_dir}/${prompt_learning.run.model_train_name}/${prompt_learning.run.convert_name}/megatron_t5.nemo # or megatron_mt5.nemo
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
5.10.2.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: 1
gpus_per_node: null
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the prompt learning pipeline and not the data preparation, training, conversion or other pipelines set the conf/config.yaml file to:

stages:
  - prompt_learning

then run:

python3 main.py
5.10.2.3. Base Command Platform

To run the prompt learning script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using Hydra. The prompt learning script must be launched in a multi-node job.

To run the prompt learning pipeline to prompt-learn a 220M T5 model converted checkpoint stored in /mount/results/t5_220m/convert_nemo, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py prompt_learning=t5/squad \
stages=[prompt_learning] cluster_type=bcp \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts  data_dir=/mount/data base_results_dir=/mount/results \
prompt_learning.run.model_train_name=t5_220m \
prompt_learning.model.language_model_path=/mount/results/t5_220m/convert_nemo/results/megatron_t5.nemo \
>> /results/prompt_learning_t5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/prompt_learning_t5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

To run the prompt learning pipeline to prompt-learn a 390M mT5 model converted checkpoint stored in /mount/results/mt5_390m/convert_nemo, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py prompt_learning=mt5/squad \
stages=[prompt_learning] cluster_type=bcp \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
prompt_learning.run.model_train_name=mt5_390m \
prompt_learning.model.language_model_path=/mount/results/mt5_390m/convert_nemo/results/megatron_mt5.nemo \
>> /results/prompt_learning_mt5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/prompt_learning_mt5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.11. Model Adapter Learning and IA3 Learning

NeMo Framework supports Adapter Learning and Infused Adapter by Inhibiting and Amplifying Inner Activations (IA3) learning. Both methods are parameter-efficient alternatives to fine-tuning pretrained language models. The NeMo implementation makes it possible to use one pretrained GPT or T5 model on many downstream tasks without tuning the model's full set of parameters. Because the original model parameters are frozen and never altered by either method, they also avoid the catastrophic forgetting issues often encountered when fine-tuning models.

Unlike prompt learning and p-tuning, Adapter learning and IA3 do not insert virtual prompts into the input. Adapter learning introduces feedforward layers within the core transformer architecture, which are updated for specific downstream tasks. IA3 adds even fewer parameters, which simply scale the hidden representations in the transformer layer. These scaling parameters can be trained for specific downstream tasks.
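
To make the difference in trainable parameters concrete, the sketch below counts parameters for one bottleneck-style adapter and for IA3-style scaling vectors. It is a rough illustration under stated assumptions, not a description of NeMo's implementation: the bottleneck layout (a down-projection followed by an up-projection, each with a bias), the hidden size, the bottleneck size, and the choice of which activations IA3 scales are all assumed values.

```python
def adapter_param_count(hidden_size, bottleneck_size):
    """Parameters in one bottleneck adapter: down- and up-projection with biases."""
    down = hidden_size * bottleneck_size + bottleneck_size
    up = bottleneck_size * hidden_size + hidden_size
    return down + up


def ia3_param_count(scaled_sizes):
    """Parameters for IA3-style scaling: one learned scale per element
    of each scaled activation."""
    return sum(scaled_sizes)


hidden_size = 4096     # assumed hidden size, for illustration only
bottleneck_size = 32   # assumed adapter bottleneck, for illustration only

adapter = adapter_param_count(hidden_size, bottleneck_size)
# Assumed IA3 layout: two hidden-size vectors and one feedforward-size vector.
ia3 = ia3_param_count([hidden_size, hidden_size, 4 * hidden_size])

print(adapter)  # 266272
print(ia3)      # 24576
```

Under these assumptions, the scaling vectors need roughly a tenth of the parameters of a single adapter, which matches the claim that IA3 adds fewer parameters.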

5.11.1. GPT Adapter Learning and IA3 Learning

The SQuAD v1.1 benchmark is supported for Adapter learning and IA3. With the default adapter learning and IA3 config files, the scripts download the original SQuAD v1.1 dataset and preprocess it into the adapter learning and IA3 dataset format (the same format as prompt learning). You can also bring your own task dataset.

The adapter learning configuration is selected in the conf/config.yaml file through the adapter_learning parameter, which names the config file to use. The adapter_learning stage must also be included in stages to run the adapter learning pipeline. To run adapter learning on the SQuAD task, set the adapter_learning parameter to gpt3/squad, which refers to conf/adapter_learning/gpt3/squad.yaml.

IA3 learning is configured in the same way in the conf/config.yaml file through the ia3_learning parameter, which names the config file to use. The ia3_learning stage must also be included in stages to run the IA3 learning pipeline. To run IA3 learning on the SQuAD task, set the ia3_learning parameter to gpt3/squad, which refers to conf/ia3_learning/gpt3/squad.yaml.

5.11.1.1. Common

To specify the configuration for adapter learning (or IA3 learning), use all the run parameters to define the job specific config:

run:
  name: ${.task_name}_${.model_train_name}
  time_limit: "04:00:00"
  dependency: "singleton"
  convert_name: convert_nemo
  model_train_name: gpt3_5b
  task_name: "squad"
  results_dir: ${base_results_dir}/${.model_train_name}/adapter_learning_${.task_name} # or ia3_learning

To specify which language model checkpoint to load and its definition, use the model parameter:

model:
  language_model_path: ${base_results_dir}/${adapter_learning.run.model_train_name}/${adapter_learning.run.convert_name}/megatron_gpt.nemo # or ia3_learning
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
5.11.1.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: 1
gpus_per_node: null
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the adapter learning pipeline and not the data preparation, training, conversion or other pipelines set the conf/config.yaml file to:

stages:
  - adapter_learning # or ia3_learning

then run:

python3 main.py
5.11.1.3. Base Command Platform

To run the adapter learning script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using Hydra. The adapter learning script must be launched in a multi-node job.

To run the adapter learning pipeline to adapter-learn a 5B GPT model converted checkpoint stored in /mount/results/gpt3_5b/convert_nemo, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py adapter_learning=gpt3/squad \
stages=[adapter_learning] cluster_type=bcp \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts  data_dir=/mount/data base_results_dir=/mount/results \
adapter_learning.run.model_train_name=gpt3_5b \
adapter_learning.model.language_model_path=/mount/results/gpt3_5b/convert_nemo/results/megatron_gpt.nemo \
>> /results/adapter_learning_gpt3_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/adapter_learning_gpt3_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

To run the IA3 learning pipeline to IA3-learn a 5B GPT model converted checkpoint stored in /mount/results/gpt3_5b/convert_nemo, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py ia3_learning=gpt3/squad \
stages=[ia3_learning] cluster_type=bcp \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
ia3_learning.run.model_train_name=gpt3_5b \
ia3_learning.model.language_model_path=/mount/results/gpt3_5b/convert_nemo/results/megatron_gpt.nemo \
>> /results/ia3_learning_gpt3_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/ia3_learning_gpt3_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.11.2. T5 Adapter Learning and IA3 Learning

The adapter learning configuration is selected in the conf/config.yaml file through the adapter_learning parameter, which names the config file to use. The adapter_learning stage must also be included in stages to run the adapter learning pipeline. To run adapter learning on the SQuAD task, set the adapter_learning parameter to t5/squad, which refers to conf/adapter_learning/t5/squad.yaml for T5 models.

IA3 learning is configured in the same way in the conf/config.yaml file through the ia3_learning parameter, which names the config file to use. The ia3_learning stage must also be included in stages to run the IA3 learning pipeline. To run IA3 learning on the SQuAD task, set the ia3_learning parameter to t5/squad, which refers to conf/ia3_learning/t5/squad.yaml for T5 models.

5.11.2.1. Common

To specify the configuration for adapter learning (or IA3 learning), use all the run parameters to define the job specific config:

run:
  name: ${.task_name}_${.model_train_name}
  time_limit: "04:00:00"
  dependency: "singleton"
  convert_name: convert_nemo
  model_train_name: t5_220m
  task_name: "squad"
  results_dir: ${base_results_dir}/${.model_train_name}/adapter_learning_${.task_name} # or ia3_learning

To specify which language model checkpoint to load and its definition, use the model parameter:

model:
  language_model_path: ${base_results_dir}/${adapter_learning.run.model_train_name}/${adapter_learning.run.convert_name}/megatron_t5.nemo # or ia3_learning
  tensor_model_parallel_size: 1
  pipeline_model_parallel_size: 1
5.11.2.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: 1
gpus_per_node: null
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the adapter learning pipeline and not the data preparation, training, conversion or other pipelines set the conf/config.yaml file to:

stages:
  - adapter_learning # or ia3_learning

then run:

python3 main.py
5.11.2.3. Base Command Platform

To run the adapter learning script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using Hydra. The adapter learning script must be launched in a multi-node job.

To run the adapter learning pipeline to adapter-learn a 220M T5 model converted checkpoint stored in /mount/results/t5_220m/convert_nemo, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py adapter_learning=t5/squad \
stages=[adapter_learning] cluster_type=bcp \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
adapter_learning.run.model_train_name=t5_220m \
adapter_learning.model.language_model_path=/mount/results/t5_220m/convert_nemo/results/megatron_t5.nemo \
>> /results/adapter_learning_t5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/adapter_learning_t5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

To run the IA3 learning pipeline to IA3-learn a 220M T5 model converted checkpoint stored in /mount/results/t5_220m/convert_nemo, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py ia3_learning=t5/squad \
stages=[ia3_learning] cluster_type=bcp \
launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data base_results_dir=/mount/results \
ia3_learning.run.model_train_name=t5_220m \
ia3_learning.model.language_model_path=/mount/results/t5_220m/convert_nemo/results/megatron_t5.nemo \
>> /results/ia3_learning_t5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/ia3_learning_t5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.12. LoRA Model and Generalized PEFT Framework

Many Parameter Efficient Fine-Tuning (PEFT) models have overlapping functionalities. In order to enhance NeMo's codebase, we have worked towards unifying the implementation of all supported PEFT methods, making it more streamlined. Furthermore, we have introduced the Low-rank Adapter PEFT model for GPT-style base models in NeMo.

The new PEFT framework is built upon the SFT models and datasets, thereby inheriting all the dataset preparation requirements from SFT. For more details, please refer to the SFT section below.

5.12.1. PEFT Training and Inference

We offer a training and inference script in NeMo. Below is an example of how to use the training script. The TRAIN_FILEs (and VALIDATION_FILEs) follow the same format as SFT.

Take note of the model.peft.peft_scheme argument. You can train a LoRA, P-tuning, Adapter, or IA3 model by setting this argument to the desired PEFT method.

# model.data.train_ds.concat_sampling_probabilities should sum to 1 and have one entry per training file.
# model.peft.peft_scheme can be 'lora', 'adapter', 'ptuning' or 'ia3'.
python3 /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_peft_tuning.py \
  model.restore_from_path=<BASE_GPT_MODEL> \
  model.data.train_ds.num_workers=0 \
  model.data.validation_ds.num_workers=0 \
  model.data.train_ds.file_names=[<TRAIN_FILE1>,<TRAIN_FILE2>,...] \
  model.data.train_ds.concat_sampling_probabilities=[0.3,0.2,..] \
  model.data.validation_ds.file_names=[<VALIDATION_FILE1>,<VALIDATION_FILE2>,...] \
  model.data.train_ds.prompt_template='{input} Answer: {output}' \
  model.peft.peft_scheme='lora' \
  model.answer_only_loss=True
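
The sampling probabilities must sum to 1 and match the number of training files, which is easy to get wrong when the lists are typed by hand. The sketch below shows one way to check both conditions before formatting the two overrides. format_train_overrides is a hypothetical helper and the file names are placeholders. Neither comes from NeMo.

```python
import math


def format_train_overrides(train_files, sampling_probabilities):
    """Validate the file and probability lists, then format both overrides.

    Hypothetical helper for illustration; it is not part of NeMo.
    """
    if len(train_files) != len(sampling_probabilities):
        raise ValueError("need exactly one sampling probability per training file")
    if not math.isclose(sum(sampling_probabilities), 1.0, abs_tol=1e-6):
        raise ValueError("sampling probabilities must sum to 1")
    files = ",".join(train_files)
    probs = ",".join(str(p) for p in sampling_probabilities)
    return [
        f"model.data.train_ds.file_names=[{files}]",
        f"model.data.train_ds.concat_sampling_probabilities=[{probs}]",
    ]


# Placeholder file names, used only to exercise the helper.
train_overrides = format_train_overrides(["a.jsonl", "b.jsonl"], [0.7, 0.3])
for override in train_overrides:
    print(override)
```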

At the end of training a '.nemo' model is generated which contains the parameters for the PEFT model. Similarly, the PEFT framework has a single inference script as well:

python3 /opt/NeMo/examples/nlp/language_modeling/tuning/megatron_gpt_peft_eval.py \
model.restore_from_path=<BASE_GPT_MODEL> \
model.peft.restore_from_path=<PEFT_MODEL> \
model.data.test_ds.file_names=[<TEST_FILE>] \
model.data.test_ds.names=['my_test_set'] \
model.data.test_ds.tokens_to_generate=30 \
inference.greedy=True \
inference.outfile_path=<OUTPUT_FILE>

Additionally, NeMo has a notebook which walks through the steps (which these scripts encapsulate) to train and run inference for PEFT models: https://github.com/NVIDIA/NeMo/blob/main/tutorials/nlp/lora.ipynb

5.13. Model Evaluation

5.13.1. GPT Evaluation

We also provide a simple tool to help evaluate the trained checkpoints. You can evaluate the capabilities of the GPT model on the following zero-shot downstream evaluation tasks: lambada, boolq, race, piqa, hellaswag, winogrande, wikitext2, and wikitext103.

The model evaluation must be performed using a training checkpoint (.ckpt format), not a converted checkpoint (.nemo format).

The evaluation configuration is selected in the conf/config.yaml file through the evaluation parameter, which names the config file to use. The evaluation stage must also be included in stages to run the evaluation pipeline. The default value is gpt3/evaluate_all, which refers to conf/evaluation/gpt3/evaluate_all.yaml. The parameters can be modified to adapt the evaluation runs to different tasks and checkpoints. For Base Command Platform, all these parameters should be overridden from the command line.

5.13.1.1. Common

To specify which tasks to run for evaluation, use the run.tasks parameter. Use the run parameters to define the job-specific config:

run:
    name: ${.eval_name}_${.model_train_name}
    time_limit: "4:00:00"
    nodes: ${divide_ceil:${evaluation.model.model_parallel_size}, 8} # 8 gpus per node
    ntasks_per_node: ${divide_ceil:${evaluation.model.model_parallel_size}, ${.nodes}}
    eval_name: eval_all
    model_train_name: gpt3_5b
    train_dir: ${base_results_dir}/${.model_train_name}
    tasks: all_tasks    # supported: lambada, boolq, race, piqa, hellaswag, winogrande, wikitext2, wikitext103 OR all_tasks
    results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}

To specify which model checkpoint to load and its definition, use the model parameter:

model:
    model_type: nemo-gpt3
    checkpoint_folder: ${evaluation.run.train_dir}/results/checkpoints
    checkpoint_name: latest # latest OR name pattern of a checkpoint (e.g. megatron_gpt-*last.ckpt)
    hparams_file: ${evaluation.run.train_dir}/results/hparams.yaml
    tensor_model_parallel_size: 2 #1 for 126m, 2 for 5b, 8 for 20b
    pipeline_model_parallel_size: 1
    model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
    precision: bf16 # must match training precision - 32, 16 or bf16
    eval_batch_size: 4
    vocab_file: ${data_dir}/bpe/vocab.json
    merge_file: ${data_dir}/bpe/merges.txt
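
The nodes, ntasks_per_node and model_parallel_size values in these two blocks are derived from the tensor and pipeline parallel sizes. The sketch below works through that arithmetic in plain Python. It assumes that divide_ceil is a ceiling division and multiply is a product, as their names suggest, and that each node has 8 GPUs as stated in the config comment.

```python
def divide_ceil(a, b):
    """Ceiling division on integers (assumed meaning of the divide_ceil resolver)."""
    return -(-a // b)


def evaluation_layout(tensor_mp, pipeline_mp, gpus_per_node=8):
    """Mirror the interpolations for model_parallel_size, nodes and ntasks_per_node."""
    model_parallel_size = tensor_mp * pipeline_mp
    nodes = divide_ceil(model_parallel_size, gpus_per_node)
    ntasks_per_node = divide_ceil(model_parallel_size, nodes)
    return model_parallel_size, nodes, ntasks_per_node


# Defaults from the model block: tensor parallel 2, pipeline parallel 1.
print(evaluation_layout(2, 1))  # (2, 1, 2)
# A larger layout that no longer fits on one 8-GPU node.
print(evaluation_layout(8, 2))  # (16, 2, 8)
```
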
5.13.1.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the evaluation pipeline and not the data preparation, training, conversion or inference pipelines set the conf/config.yaml file to:

stages:
  - evaluation

then run:

python3 main.py
5.13.1.3. Base Command Platform

In order to run the evaluation script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. The evaluation script must be launched in a multi-node job.

To run the evaluation pipeline to evaluate a 126M GPT model checkpoint stored in /mount/results/gpt3_126m/results/checkpoints, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[evaluation] \
 cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results evaluation.model.vocab_file=/mount/data/bpe/vocab.json \
evaluation.model.merge_file=/mount/data/bpe/merges.txt evaluation.run.results_dir=/mount/results/gpt3_126m/evaluation \
evaluation.model.checkpoint_folder=/mount/results/gpt3_126m/results/checkpoints evaluation.model.eval_batch_size=16 \
evaluation.model.tensor_model_parallel_size=1 \
>> /results/eval_gpt3_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/eval_gpt3_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.13.1.4. Interleaved Pipeline Parallelism

If your model was trained with interleaved pipeline parallelism, then the model must be converted to a non-interleaved model. To check whether your model used interleaved pipeline parallelism, inspect the training config and verify that model.virtual_pipeline_model_parallel_size > 0.
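
The check described above can be expressed as a small function over the training config once it has been loaded into a dictionary. This is a minimal sketch: uses_interleaved_pipeline is a hypothetical helper, and it assumes the config exposes a top-level model key and that a missing or null value means interleaving was not used.

```python
def uses_interleaved_pipeline(training_config):
    """Return True if model.virtual_pipeline_model_parallel_size > 0.

    Hypothetical helper; training_config is the training config as a dict.
    """
    model_cfg = training_config.get("model") or {}
    vp_size = model_cfg.get("virtual_pipeline_model_parallel_size")
    # A missing or null value is treated as "not interleaved" (assumption).
    return vp_size is not None and vp_size > 0


print(uses_interleaved_pipeline({"model": {"virtual_pipeline_model_parallel_size": 3}}))
print(uses_interleaved_pipeline({"model": {"virtual_pipeline_model_parallel_size": None}}))
```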

To convert the model, use the script from the NeMo Toolkit: examples/nlp/language_modeling/megatron_change_num_partitions.py

CUDA_VISIBLE_DEVICES=0 python3 -u /opt/NeMo/examples/nlp/language_modeling/megatron_change_num_partitions.py  \
  --num_gpu_per_node=1 \
  --model_extracted_dir=${RESULTS_DIR}/checkpoints \
  --target_file=${RESULTS_DIR}/checkpoints/megatron_gpt_converted.nemo \
  --ckpt_name='megatron_gpt--val_loss=2.59-step=9421-consumed_samples=2411520.0-last.ckpt' \
  --tensor_model_parallel_size=1 \
  --target_tensor_model_parallel_size=1 \
  --pipeline_model_parallel_size=4 \
  --target_pipeline_model_parallel_size=4 \
  --virtual_pipeline_model_parallel_size=3 \
  --hparams_file=${RESULTS_DIR}/hparams.yaml \
  --precision=bf16

Note the conversion script should only be run with a single GPU.

The output of the conversion script is a .nemo file. This file should be added to your evaluation config:

evaluation.model.nemo_model=/path/to/converted.nemo \
evaluation.model.checkpoint_folder=null \
evaluation.model.checkpoint_name=null \
evaluation.model.hparams_file=null \

5.13.2. T5 Evaluation

Given a fine-tuned checkpoint, you can run the evaluation scripts to evaluate the capabilities of the fine-tuned T5 model on SQuAD. The model evaluation must be performed with a fine-tuned checkpoint in .nemo format.

The evaluation configuration is selected in the conf/config.yaml file through the evaluation parameter, which names the config file to use. The evaluation stage must also be included in stages to run the evaluation pipeline. The default value is t5/squad, which refers to conf/evaluation/t5/squad.yaml. The parameters can be modified to adapt the evaluation runs to different tasks and checkpoints. For Base Command Platform, all these parameters should be overridden from the command line.

5.13.2.1. Common

To specify which task to run for evaluation, use the run.task_name parameter. Use the run parameters to define the job-specific config:

run:
    name: eval_${.task_name}_${.model_train_name}
    time_limit: "04:00:00"
    dependency: "singleton"
    model_train_name: t5_220m
    task_name: "squad"
    fine_tuning_results_dir: ${base_results_dir}/${.model_train_name}/${.task_name}
    results_dir: ${base_results_dir}/${.model_train_name}/${.task_name}_eval

To specify which fine-tuned checkpoint to load and its definition, use the model parameter:

model:
    restore_from_path: ${evaluation.run.fine_tuning_results_dir}/checkpoints/megatron_t5_glue.nemo # Path to a finetuned T5 .nemo file
    tensor_model_parallel_size: 1
    pipeline_model_parallel_size: 1
5.13.2.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the evaluation pipeline and not the data preparation, training, conversion or inference pipelines set the conf/config.yaml file to:

stages:
  - evaluation

then run:

python3 main.py
5.13.2.3. Base Command Platform

In order to run the evaluation script on Base Command Platform for T5 models, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. The evaluation script must be launched in a multi-node job.

To run the evaluation pipeline to evaluate a 220M T5 model which has been fine-tuned on squad task and checkpoint stored in /mount/results/t5_220m/squad/results/checkpoints, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py evaluation=t5/squad \
stages=[evaluation] \
 cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts  data_dir=/mount/data \
base_results_dir=/mount/results evaluation.run.model_train_name=t5_220m \
evaluation.model.restore_from_path=/mount/results/t5_220m/squad/results/checkpoints/megatron_t5_glue.nemo \
>> /results/eval_t5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/eval_t5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.13.3. mT5 Evaluation

Given a fine-tuned checkpoint, you can run the evaluation scripts to evaluate the capabilities of the fine-tuned mT5 model on the following downstream evaluation task: xquad. The fine-tuning task and the evaluation task should usually be the same.

The model evaluation must be performed with a fine-tuned checkpoint in .nemo format.

The evaluation configuration is selected in the conf/config.yaml file through the evaluation parameter, which names the config file to use. The evaluation stage must also be included in stages to run the evaluation pipeline. The default value is mt5/xquad, which refers to conf/evaluation/mt5/xquad.yaml. The parameters can be modified to adapt the evaluation runs to different tasks and checkpoints. For Base Command Platform, all these parameters should be overridden from the command line.

5.13.3.1. Common

To specify which task to run for evaluation, use the run.task_name parameter. Use the run parameters to define the job-specific config:

run:
    name: eval_${.task_name}_${.model_train_name}
    time_limit: "04:00:00"
    dependency: "singleton"
    model_train_name: mt5_390m
    task_name: "xquad"
    fine_tuning_results_dir: ${base_results_dir}/${.model_train_name}/${.task_name}
    results_dir: ${base_results_dir}/${.model_train_name}/${.task_name}_eval

To specify which fine-tuned checkpoint to load and its definition, use the model parameter:

model:
    restore_from_path: ${evaluation.run.fine_tuning_results_dir}/checkpoints/megatron_mt5_xquad.nemo # Path to a finetuned mT5 .nemo file
    tensor_model_parallel_size: 1
    pipeline_model_parallel_size: 1
5.13.3.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the evaluation pipeline and not the data preparation, training, conversion or inference pipelines set the conf/config.yaml file to:

stages:
  - evaluation

then run:

python3 main.py
5.13.3.3. Base Command Platform

In order to run the evaluation script on Base Command Platform for mT5 models, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. The evaluation script must be launched in a multi-node job.

To run the evaluation pipeline to evaluate a 390M mT5 model which has been fine-tuned on xquad task and checkpoint stored in /mount/results/mt5_390m/xquad/results/checkpoints, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py evaluation=mt5/xquad \
stages=[evaluation] cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results evaluation.run.model_train_name=mt5_390m \
evaluation.model.restore_from_path=/mount/results/mt5_390m/xquad/results/checkpoints/megatron_mt5_xquad.nemo \
>> /results/eval_mt5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/eval_mt5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.13.4. Prompt Learned GPT Evaluation

We also provide a simple tool to help evaluate the prompt learned GPT checkpoints. You can evaluate the capabilities of the prompt learned GPT model on a customized prompt learning test dataset. We provide an example to evaluate our checkpoint, which went through prompt learning on SQuAD v1.1, on the SQuAD v1.1 test dataset created in prompt learning step.

The evaluation configuration is selected in the conf/config.yaml file through the evaluation parameter, which names the config file to use. The evaluation stage must also be included in stages to run the evaluation pipeline. The value should be set to prompt_gpt3/squad, which refers to conf/evaluation/prompt_gpt3/squad.yaml. The parameters can be modified to adapt the evaluation runs to different tasks and checkpoints. For Base Command Platform, all these parameters should be overridden from the command line.

5.13.4.1. Common

To specify the configuration, use the run parameters to define the job-specific config (run.tasks must be set to prompt to run evaluation on prompt learning test tasks):

run:
  name: ${.eval_name}_${.model_train_name}
  time_limit: "4:00:00"
  nodes: ${divide_ceil:${evaluation.model.model_parallel_size}, 8} # 8 gpus per node
  ntasks_per_node: ${divide_ceil:${evaluation.model.model_parallel_size}, ${.nodes}}
  eval_name: eval_prompt_squad
  model_train_name: gpt3_5b
  tasks: "prompt" # general prompt task
  prompt_learn_dir: ${base_results_dir}/${.model_train_name}/prompt_learning_squad # assume prompt learning was on squad task
  results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}

To specify which model checkpoint to load and which prompt learning test dataset to evaluate, use the model parameter:

model:
  model_type: nemo-gpt3-prompt
  nemo_model: ${evaluation.run.prompt_learn_dir}/megatron_gpt_prompt.nemo
  tensor_model_parallel_size: 2 #1 for 126m, 2 for 5b, 8 for 20b
  pipeline_model_parallel_size: 1
  model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
  precision: bf16 # must match training precision - 32, 16 or bf16
  eval_batch_size: 4
  prompt_dataset_paths: ${data_dir}/prompt_data/v1.1/squad_test.jsonl
  disable_special_tokens: False # Whether to disable virtual tokens in prompt model evaluation. This is equivalent to evaluate without prompt-/p-tuning.
5.13.4.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: 1
gpus_per_node: null
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the evaluation pipeline, and not the data preparation, training, conversion or inference pipelines, set the stages in the conf/config.yaml file to:

stages:
  - evaluation

then run:

python3 main.py
5.13.4.3. Base Command Platform

In order to run the evaluation script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. The evaluation script must be launched in a multi-node job.

To run the evaluation pipeline to evaluate a prompt learned 5B GPT model checkpoint stored in /mount/results/gpt3_5b/checkpoints, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[evaluation] evaluation=prompt_gpt3/squad \
 cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results evaluation.run.results_dir=/mount/results/gpt3_5b/eval_prompt_squad \
evaluation.model.nemo_model=/mount/results/gpt3_5b/prompt_learning_squad/results/megatron_gpt_prompt.nemo \
evaluation.model.eval_batch_size=4 evaluation.model.tensor_model_parallel_size=2 \
>> /results/eval_prompt_gpt3_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/eval_prompt_gpt3_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.
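
When many parameters have to be overridden, it can help to assemble the command programmatically. The following is a minimal sketch, not part of the launcher: build_launcher_command is a hypothetical helper that joins key=value overrides into a single shell command string using only the Python standard library.

```python
import shlex


def build_launcher_command(main_py, overrides, log_file=None):
    """Join Hydra-style key=value overrides into one shell command string."""
    parts = ["python3", main_py] + [f"{key}={value}" for key, value in overrides.items()]
    # shlex.join quotes arguments such as stages=[evaluation] so that the
    # shell does not treat the brackets as a glob pattern.
    command = shlex.join(parts)
    if log_file is not None:
        # Append stdout and stderr to the log file, as in the command above.
        command += f" >> {shlex.quote(log_file)} 2>&1"
    return command


overrides = {
    "stages": "[evaluation]",
    "evaluation": "prompt_gpt3/squad",
    "cluster_type": "bcp",
    "base_results_dir": "/mount/results",
}
print(build_launcher_command(
    "/opt/NeMo-Megatron-Launcher/launcher_scripts/main.py",
    overrides,
    log_file="/results/eval_prompt_gpt3_log.txt",
))
```

The printed string can be pasted into the job command field; the override names are the ones used in the command above.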

5.13.5. Prompt Learned T5 and mT5 Evaluation

We also provide a simple tool to help evaluate the prompt learned T5 or mT5 checkpoints. You can evaluate the capabilities of the prompt learned models on a customized prompt learning test dataset. We provide an example to evaluate our checkpoint, which went through prompt learning on SQuAD v1.1, on the SQuAD v1.1 test dataset created in prompt learning step.

Define the evaluation configuration in the conf/config.yaml file by setting the evaluation parameter, which specifies the file to use for evaluation. The evaluation stage must also be listed under stages to run the evaluation pipeline. For T5 models, set the value to prompt_t5/squad.yaml, which can be found in conf/evaluation/prompt_t5/squad.yaml. For mT5 models, set it to prompt_mt5/squad.yaml, which can be found in conf/evaluation/prompt_mt5/squad.yaml. The parameters in these files can be modified to adapt to different evaluation tasks and checkpoints. For Base Command Platform, all these parameters should be overridden from the command line.

5.13.5.1. Common

To specify the configuration, use the run parameters to define the job-specific config (run.tasks has to be set to prompt to run evaluation on prompt learning test tasks):

run:
  name: eval_${.task_name}_${.model_train_name}
  time_limit: "04:00:00"
  dependency: "singleton"
  model_train_name: t5_220m # or mt5_390m
  task_name: "squad"
  prompt_learning_dir: ${base_results_dir}/${.model_train_name}/prompt_learning_squad # assume prompt learning was on squad task
  results_dir: ${base_results_dir}/${.model_train_name}/${.task_name}_eval

To specify which model checkpoint to load and which prompt learning test dataset to evaluate, use the following parameters:

data:
  test_ds:
    - ${data_dir}/prompt_data/v1.1/squad_test.jsonl
  num_workers: 4
  global_batch_size: 16
  micro_batch_size: 16
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
pipeline_model_parallel_split_rank: ${divide_floor:${.pipeline_model_parallel_size}, 2}
model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
language_model_path: ${base_results_dir}/${evaluation.run.model_train_name}/convert_nemo/results/megatron_t5.nemo  # or megatron_mt5.nemo
virtual_prompt_model_file: ${evaluation.run.prompt_learning_dir}/results/megatron_t5_prompt.nemo # or megatron_mt5_prompt.nemo
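
The two derived values in this block can be checked by hand. The sketch below assumes that divide_floor is floor division and multiply is multiplication, as the resolver names suggest; t5_parallel_layout is a hypothetical name used only for illustration.

```python
def t5_parallel_layout(tensor_parallel, pipeline_parallel):
    """Mirror the derived fields of the T5 evaluation config (assumed semantics)."""
    # model_parallel_size: ${multiply:tensor_model_parallel_size, pipeline_model_parallel_size}
    model_parallel_size = tensor_parallel * pipeline_parallel
    # pipeline_model_parallel_split_rank: ${divide_floor:pipeline_model_parallel_size, 2}
    split_rank = pipeline_parallel // 2
    return model_parallel_size, split_rank


# Values from the config above: TP=1, PP=1
print(t5_parallel_layout(1, 1))
```

With the default sizes of 1 the split rank is 0; with a hypothetical pipeline size of 4 it would be 2, which places the split at the middle pipeline stage.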
5.13.5.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: 1
gpus_per_node: null
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the evaluation pipeline, and not the data preparation, training, conversion or inference pipelines, set the stages in the conf/config.yaml file to:

stages:
  - evaluation

then run:

python3 main.py
5.13.5.3. Base Command Platform

In order to run the evaluation script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. The evaluation script must be launched in a multi-node job.

To run the evaluation pipeline to evaluate a prompt learned 220M T5 model checkpoint stored in /mount/results/t5_220m/prompt_learning_squad, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[evaluation] evaluation=prompt_t5/squad \
 cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results evaluation.run.results_dir=/mount/results/t5_220m/eval_prompt_squad \
evaluation.model.virtual_prompt_model_file=/mount/results/t5_220m/prompt_learning_squad/results/megatron_t5_prompt.nemo \
>> /results/eval_prompt_t5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/eval_prompt_t5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

To run the evaluation pipeline to evaluate a prompt learned 390M mT5 model checkpoint stored in /mount/results/mt5_390m/prompt_learning_squad, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[evaluation] evaluation=prompt_mt5/squad \
 cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results evaluation.run.results_dir=/mount/results/mt5_390m/eval_prompt_squad \
evaluation.model.virtual_prompt_model_file=/mount/results/mt5_390m/prompt_learning_squad/results/megatron_mt5_prompt.nemo \
>> /results/eval_prompt_mt5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/eval_prompt_mt5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.13.6. Adapter Learned and IA3 Learned GPT Evaluation

We also provide a simple tool to help evaluate the adapter and IA3 learned GPT checkpoints. You can evaluate the capabilities of the adapter learned GPT model on a customized adapter learning test dataset. We provide an example to evaluate our checkpoint, which went through adapter learning or IA3 learning on SQuAD v1.1.

Define the evaluation configuration in the conf/config.yaml file by setting the evaluation parameter, which specifies the file to use for evaluation. The evaluation stage must also be listed under stages to run the evaluation pipeline. For adapter learning, set the value to adapter_gpt3/squad.yaml, which can be found in conf/evaluation/adapter_gpt3/squad.yaml. For IA3 learning, set it to ia3_gpt3/squad.yaml, which can be found in conf/evaluation/ia3_gpt3/squad.yaml. The parameters in these files can be modified to adapt to different evaluation tasks and checkpoints. For Base Command Platform, all these parameters should be overridden from the command line.

5.13.6.1. Common

To specify the configuration, use the run parameters to define the job-specific config (run.tasks has to be set to adapter to run evaluation on adapter learning test tasks):

run:
  name: ${.eval_name}_${.model_train_name}
  time_limit: "4:00:00"
  nodes: ${divide_ceil:${evaluation.model.model_parallel_size}, 8} # 8 gpus per node
  ntasks_per_node: ${divide_ceil:${evaluation.model.model_parallel_size}, ${.nodes}}
  eval_name: eval_adapter_squad # or eval_ia3_squad
  model_train_name: gpt3_5b
  tasks: "adapter" # general adapter task
  adapter_learning_dir: ${base_results_dir}/${.model_train_name}/adapter_learning_squad # or ia3_learning_squad
  results_dir: ${base_results_dir}/${.model_train_name}/${.eval_name}

To specify which model checkpoint to load and which adapter learning test dataset to evaluate, use the following parameters:

data:
  test_ds:
    - ${data_dir}/prompt_data/v1.1/squad_test.jsonl
  num_workers: 4
  global_batch_size: 16
  micro_batch_size: 16
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
pipeline_model_parallel_split_rank: ${divide_floor:${.pipeline_model_parallel_size}, 2}
model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
language_model_path: ${base_results_dir}/${evaluation.run.model_train_name}/convert_nemo/results/megatron_gpt.nemo 
adapter_model_file: ${evaluation.run.adapter_learning_dir}/results/megatron_gpt_adapter.nemo # or megatron_gpt_ia3.nemo
5.13.6.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: 1
gpus_per_node: null
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the evaluation pipeline, and not the data preparation, training, conversion or inference pipelines, set the stages in the conf/config.yaml file to:

stages:
  - evaluation

then run:

python3 main.py
5.13.6.3. Base Command Platform

In order to run the evaluation script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. The evaluation script must be launched in a multi-node job.

To run the evaluation pipeline to evaluate an adapter learned 5B GPT model checkpoint stored in /mount/results/gpt3_5b/adapter_learning_squad, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[evaluation] evaluation=adapter_gpt3/squad \
 cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results evaluation.run.results_dir=/mount/results/gpt3_5b/eval_adapter_squad \
evaluation.model.adapter_model_file=/mount/results/gpt3_5b/adapter_learning_squad/results/megatron_gpt_adapter.nemo \
>> /results/eval_adapter_gpt3_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/eval_adapter_gpt3_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

To run the evaluation pipeline to evaluate an IA3 learned 5B GPT model checkpoint stored in /mount/results/gpt3_5b/ia3_learning_squad, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[evaluation] evaluation=ia3_gpt3/squad \
 cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results evaluation.run.results_dir=/mount/results/gpt3_5b/eval_ia3_squad \
evaluation.model.adapter_model_file=/mount/results/gpt3_5b/ia3_learning_squad/results/megatron_gpt_ia3.nemo \
>> /results/eval_ia3_gpt3_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/eval_ia3_gpt3_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.13.7. Adapter Learned and IA3 Learned T5 Evaluation

Define the evaluation configuration in the conf/config.yaml file by setting the evaluation parameter, which specifies the file to use for evaluation. The evaluation stage must also be listed under stages to run the evaluation pipeline. For adapter learned T5 models, set the value to adapter_t5/squad.yaml, which can be found in conf/evaluation/adapter_t5/squad.yaml. For IA3 learned models, set it to ia3_t5/squad.yaml, which can be found in conf/evaluation/ia3_t5/squad.yaml. The parameters in these files can be modified to adapt to different evaluation tasks and checkpoints. For Base Command Platform, all these parameters should be overridden from the command line.

5.13.7.1. Common

To specify the configuration, use all the run parameters to define the job specific config:

run:
  name: eval_${.task_name}_${.model_train_name}
  time_limit: "04:00:00"
  dependency: "singleton"
  model_train_name: t5_220m
  task_name: "squad"
  adapter_learning_dir: ${base_results_dir}/${.model_train_name}/adapter_learning_squad # or ia3_learning_squad
  results_dir: ${base_results_dir}/${.model_train_name}/${.task_name}_eval

To specify which model checkpoint to load and which test dataset to evaluate, use the following parameters:

data:
  test_ds:
    - ${data_dir}/prompt_data/v1.1/squad_test.jsonl
  num_workers: 4
  global_batch_size: 16
  micro_batch_size: 16
tensor_model_parallel_size: 1
pipeline_model_parallel_size: 1
pipeline_model_parallel_split_rank: ${divide_floor:${.pipeline_model_parallel_size}, 2}
model_parallel_size: ${multiply:${.tensor_model_parallel_size}, ${.pipeline_model_parallel_size}}
language_model_path: ${base_results_dir}/${evaluation.run.model_train_name}/convert_nemo/results/megatron_t5.nemo 
adapter_model_file: ${evaluation.run.adapter_learning_dir}/results/megatron_t5_adapter.nemo # or megatron_t5_ia3.nemo
5.13.7.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: 1
gpus_per_node: null
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the evaluation pipeline, and not the data preparation, training, conversion or inference pipelines, set the stages in the conf/config.yaml file to:

stages:
  - evaluation

then run:

python3 main.py
5.13.7.3. Base Command Platform

In order to run the evaluation script on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. The evaluation script must be launched in a multi-node job.

To run the evaluation pipeline to evaluate an adapter learned 220M T5 model checkpoint stored in /mount/results/t5_220m/adapter_learning_squad, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[evaluation] evaluation=adapter_t5/squad \
 cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results evaluation.run.results_dir=/mount/results/t5_220m/eval_adapter_squad \
evaluation.model.adapter_model_file=/mount/results/t5_220m/adapter_learning_squad/results/megatron_t5_adapter.nemo \
>> /results/eval_adapter_t5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/eval_adapter_t5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

To run the evaluation pipeline to evaluate an IA3 learned 220M T5 model checkpoint stored in /mount/results/t5_220m/ia3_learning_squad, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py stages=[evaluation] evaluation=ia3_t5/squad \
 cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data \
base_results_dir=/mount/results evaluation.run.results_dir=/mount/results/t5_220m/eval_ia3_squad \
evaluation.model.adapter_model_file=/mount/results/t5_220m/ia3_learning_squad/results/megatron_t5_ia3.nemo \
>> /results/eval_ia3_t5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/eval_ia3_t5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.14. Model Export

We also provide a tool to enable deployment of the NeMo Framework model on the NVIDIA Triton Inference Server with FasterTransformer Backend.

The export supports only GPT. You can check out the T5 and mT5 support in the FasterTransformer repository, but it is limited to older versions of NeMo and Megatron-LM.

5.14.1. GPT Export

The exported GPT model is evaluated on the lambada task, and the results can be compared with those from the evaluation stage.

The configuration used for the export needs to be specified in the conf/config.yaml file by setting the export parameter, which specifies the file to use for export purposes. The export stage must also be included in stages to run the export stage of the pipeline. The default value is set to gpt3/export_gpt3, which can be found in conf/export/gpt3/export_gpt3.yaml. The parameters can be modified to adapt to different export settings and to the set of tests run on the prepared Triton Model Repository. For Base Command Platform, all these parameters should be overridden from the command line.

5.14.1.1. Common

Use the run parameters to define the job-specific config:

run:
  name: export_${.model_train_name}
  time_limit: "2:00:00"
  model_train_name: "gpt3_5b"
  training_dir: ${base_results_dir}/${.model_train_name}
  config_summary: tp${export.model.tensor_model_parallel_size}_pp${export.triton_deployment.pipeline_model_parallel_size}_${export.model.weight_data_type}_${export.triton_deployment.data_type}
  results_dir: ${base_results_dir}/${.model_train_name}/export_${.config_summary}
  model_type: "gpt3"

To specify which trained model checkpoint to use as source model and parameters of conversion to the FasterTransformer format, use the model parameter:

model:
  checkpoint_path: ${export.run.training_dir}/checkpoints
  # FT checkpoint will be saved in ${.triton_model_dir}/1/${.tensor_model_parallel_size}-gpu
  tensor_model_parallel_size: 8
  weight_data_type: fp16   # fp32|fp16
  processes: 16
  load_checkpoints_to_cpu: False

To specify the NVIDIA Triton Inference Server model directory and FasterTransformer backend parameters, use the triton_deployment parameter.

triton_deployment:
  triton_model_dir: ${export.run.results_dir}/model_repo/${export.run.model_train_name}
  max_batch_size: 1
  pipeline_model_parallel_size: 1
  int8_mode: False
  enable_custom_all_reduce: False
  data_type: fp16  # fp32|fp16|bf16
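
The run, model and triton_deployment blocks above reference each other through interpolation, so the output locations depend on the parallelism and data type settings. The sketch below resolves those names by hand for illustration; export_paths is a hypothetical helper, and the FasterTransformer checkpoint location is taken from the comment in the model block.

```python
def export_paths(base_results_dir, model_train_name, tp, pp, weight_data_type, data_type):
    """Resolve the interpolated export paths by hand (illustrative only)."""
    # run.config_summary: tp<TP>_pp<PP>_<weight_data_type>_<data_type>
    config_summary = f"tp{tp}_pp{pp}_{weight_data_type}_{data_type}"
    # run.results_dir: <base_results_dir>/<model_train_name>/export_<config_summary>
    results_dir = f"{base_results_dir}/{model_train_name}/export_{config_summary}"
    # triton_deployment.triton_model_dir: <results_dir>/model_repo/<model_train_name>
    triton_model_dir = f"{results_dir}/model_repo/{model_train_name}"
    # FT checkpoint: <triton_model_dir>/1/<TP>-gpu (per the comment in the model block)
    ft_checkpoint_dir = f"{triton_model_dir}/1/{tp}-gpu"
    return {
        "config_summary": config_summary,
        "results_dir": results_dir,
        "triton_model_dir": triton_model_dir,
        "ft_checkpoint_dir": ft_checkpoint_dir,
    }


for name, value in export_paths("/mount/results", "gpt3_5b", 8, 1, "fp16", "fp16").items():
    print(f"{name}: {value}")
```

Changing tensor_model_parallel_size or either data type therefore changes the results directory, which keeps exports with different settings from overwriting each other.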
5.14.1.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the export pipeline, include export under stages in the conf/config.yaml:

stages:
  - export

then run:

python3 main.py
5.14.1.3. Base Command Platform

In order to run the export stage on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. The export scripts must be launched in a multi-node job.

To run the export pipeline to export a 126M GPT model checkpoint stored in /mount/results/gpt3_126m/checkpoints, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py \
stages=[export] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_gpt3 \
base_results_dir=/mount/results \
export.run.model_train_name=gpt3_126m \
export.model.tensor_model_parallel_size=2 \
export.triton_deployment.pipeline_model_parallel_size=1 \
>> /results/export_gpt3_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/export_gpt3_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.14.2. T5 Export

The exported T5 models are evaluated on the lambada task, and the results can be compared with those from the evaluation stage.

The configuration used for the export needs to be specified in the conf/config.yaml file by setting the export parameter, which specifies the file to use for export purposes. The export stage must also be included in stages to run the export stage of the pipeline. The value can be set to t5/export_t5, which can be found in conf/export/t5/export_t5.yaml. The parameters can be modified to adapt to different export settings and to the set of tests run on the prepared Triton Model Repository. For Base Command Platform, all these parameters should be overridden from the command line.

5.14.2.1. Common

Use the run parameters to define the job-specific config:

run:
  name: export_${.model_train_name}
  time_limit: "2:00:00"
  model_train_name: "t5_220m"
  training_dir: ${base_results_dir}/${.model_train_name}
  config_summary: tp${export.model.tensor_model_parallel_size}_pp${export.triton_deployment.pipeline_model_parallel_size}_${export.model.weight_data_type}_${export.triton_deployment.data_type}
  results_dir: ${base_results_dir}/${.model_train_name}/export_${.config_summary}
  model_type: "t5"

To specify which trained model checkpoint to use as source model and parameters of conversion to the FasterTransformer format, use the model parameter:

model:
  checkpoint_path: ${export.run.training_dir}/checkpoints
  # FT checkpoint will be saved in ${.triton_model_dir}/1/${.tensor_model_parallel_size}-gpu
  tensor_model_parallel_size: 8
  weight_data_type: fp16   # fp32|fp16
  processes: 16
  load_checkpoints_to_cpu: False

To specify the NVIDIA Triton Inference Server model directory and FasterTransformer backend parameters, use the triton_deployment parameter.

triton_deployment:
  triton_model_dir: ${export.run.results_dir}/model_repo/${export.run.model_train_name}
  max_batch_size: 1
  pipeline_model_parallel_size: 1
  int8_mode: False
  enable_custom_all_reduce: False
  data_type: fp16  # fp32|fp16|bf16
5.14.2.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the export pipeline, include export under stages in the conf/config.yaml:

stages:
  - export

then run:

python3 main.py
5.14.2.3. Base Command Platform

In order to run the export stage on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. The export scripts must be launched in a multi-node job.

To run the export pipeline to export a 220M T5 model checkpoint stored in /mount/results/t5_220m/checkpoints, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py \
stages=[export] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_t5 \
base_results_dir=/mount/results \
export.run.model_train_name=t5_220m \
export.model.tensor_model_parallel_size=1 \
export.triton_deployment.pipeline_model_parallel_size=1 \
>> /results/export_t5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/export_t5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.14.3. mT5 Export

The exported mT5 models are evaluated on the lambada task, and the results can be compared with those from the evaluation stage.

The configuration used for the export needs to be specified in the conf/config.yaml file by setting the export parameter, which specifies the file to use for export purposes. The export stage must also be included in stages to run the export stage of the pipeline. The value can be set to mt5/export_mt5, which can be found in conf/export/mt5/export_mt5.yaml. The parameters can be modified to adapt to different export settings and to the set of tests run on the prepared Triton Model Repository. For Base Command Platform, all these parameters should be overridden from the command line.

5.14.3.1. Common

Use the run parameters to define the job-specific config:

run:
  name: export_${.model_train_name}
  time_limit: "2:00:00"
  model_train_name: "mt5_125m"
  training_dir: ${base_results_dir}/${.model_train_name}
  config_summary: tp${export.model.tensor_model_parallel_size}_pp${export.triton_deployment.pipeline_model_parallel_size}_${export.model.weight_data_type}_${export.triton_deployment.data_type}
  results_dir: ${base_results_dir}/${.model_train_name}/export_${.config_summary}
  model_type: "mt5"

To specify which trained model checkpoint to use as source model and parameters of conversion to the FasterTransformer format, use the model parameter:

model:
  checkpoint_path: ${export.run.training_dir}/checkpoints
  # FT checkpoint will be saved in ${.triton_model_dir}/1/${.tensor_model_parallel_size}-gpu
  tensor_model_parallel_size: 8
  weight_data_type: fp16   # fp32|fp16
  processes: 16
  load_checkpoints_to_cpu: False

To specify the NVIDIA Triton Inference Server model directory and FasterTransformer backend parameters, use the triton_deployment parameter.

triton_deployment:
  triton_model_dir: ${export.run.results_dir}/model_repo/${export.run.model_train_name}
  max_batch_size: 1
  pipeline_model_parallel_size: 1
  int8_mode: False
  enable_custom_all_reduce: False
  data_type: fp16  # fp32|fp16|bf16
5.14.3.2. Slurm

Set configuration for a Slurm cluster in the conf/cluster/bcm.yaml file:

partition: null
account: null
exclusive: True
gpus_per_task: null
gpus_per_node: 8
mem: 0
overcommit: False
job_name_prefix: "nemo-megatron-"

Example:

To run only the export pipeline, include export under stages in the conf/config.yaml:

stages:
  - export

then run:

python3 main.py
5.14.3.3. Base Command Platform

In order to run the export stage on Base Command Platform, set the cluster_type parameter in conf/config.yaml to bcp. This can also be overridden from the command line, using hydra. The export scripts must be launched in a multi-node job.

To run the export pipeline to export a 125M mT5 model checkpoint stored in /mount/results/mt5_125m/checkpoints, run:

python3 /opt/NeMo-Megatron-Launcher/launcher_scripts/main.py \
stages=[export] \
cluster_type=bcp launcher_scripts_path=/opt/NeMo-Megatron-Launcher/launcher_scripts data_dir=/mount/data/the_pile_mt5 \
base_results_dir=/mount/results \
export.run.model_train_name=mt5_125m \
export.model.tensor_model_parallel_size=1 \
export.triton_deployment.pipeline_model_parallel_size=1 \
>> /results/export_mt5_log.txt 2>&1

The command above assumes you mounted the data workspace in /mount/data, and the results workspace in /mount/results. The stdout and stderr outputs will also be redirected to the /results/export_mt5_log.txt file, to be able to download the logs from NGC. Any other parameter can also be added to the command to modify its behavior.

5.15. Instruction Following via Supervised Finetuning (SFT)

SFT is the process of finetuning all of the model's parameters on supervised data of inputs and outputs that teaches the model how to follow user specified instructions. It is typically done after model pre-training. This section describes the steps involved in finetuning a GPT model for instruction following. In the subsequent sections, we will describe how to format your data and run training.

5.15.1. SFT Data Formatting

To demonstrate how to format your SFT data, we'll take the Dolly dataset (https://github.com/databrickslabs/dolly) as an example, which consists of 15k instruction-context-response triples.

First, to download the data, run python launcher_scripts/nemo_launcher/collections/dataprep_scripts/dolly_dataprep/download.py --path_to_save /path/to/save/data.jsonl

The downloaded data /path/to/save/data.jsonl is a JSONL file in which each line has the following format:

{
    "instruction": "When did Virgin Australia start operating?",

    "context": "Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.[3] It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.[4]",

    "response": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.",

    "category": "closed_qa"
}

The example above has no explicit "input" and "output" fields, which SFT requires. An example of how to process this data format into a JSONL file that contains "input" and "output" fields is at launcher_scripts/nemo_launcher/collections/dataprep_scripts/dolly_dataprep/preprocess.py. The script converts the "instruction", "context" and "response" fields into "input" and "output". It concatenates the "instruction" and "context" fields with a \n\n separator, randomizing the order in which they appear in the input, and writes the result to a new JSONL file.

Running python launcher_scripts/nemo_launcher/collections/dataprep_scripts/dolly_dataprep/preprocess.py --input /path/to/save/data.jsonl generates a file /path/to/save/data-output.jsonl that can be provided to the SFT training described below.
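
The conversion described above can be sketched in a few lines of Python. This is an illustration of the idea and not the code of preprocess.py: the function names are hypothetical, and the handling of records with an empty context (the instruction is used on its own) is an assumption.

```python
import json
import random


def to_sft_record(example, rng):
    """Convert one Dolly-style record into an {"input", "output"} record."""
    instruction = example["instruction"]
    context = example.get("context", "").strip()
    if context:
        parts = [instruction, context]
        # Randomize whether the instruction or the context comes first.
        rng.shuffle(parts)
        text = "\n\n".join(parts)
    else:
        # Assumption: without a context, the instruction alone is the input.
        text = instruction
    return {"input": text, "output": example["response"]}


def convert_file(src_path, dst_path, seed=0):
    """Read a JSONL file of Dolly-style records and write input/output records."""
    rng = random.Random(seed)
    with open(src_path, encoding="utf-8") as src, open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():
                dst.write(json.dumps(to_sft_record(json.loads(line), rng)) + "\n")
```

Seeding the random generator makes the shuffled order reproducible between runs, which is convenient when comparing finetuning experiments.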

A dialogue dataset is a JSONL file in which each line has the following format:

{
  "mask": "User",
  "system": "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.\n\n",
  "conversations": [
    {
      "from": "User",
      "value": "Who are you?"
    },
    {
      "from": "Assistant",
      "value": "I am NV Assistant, a language model trained by researchers from NVIDIA NeMo team."
    },
    {
      "from": "User",
      "value": "What can you do?"
    },
    {
      "from": "Assistant",
      "value": "I can chat with you."
    }
  ]
}

Here, the system field defines the system prompt for the conversation. The conversations field is a list of conversation turns, where from is the name of the speaker and value is the text of the turn. The mask field names the speaker whose turns are masked during SFT, so those turns are not used to compute the cross-entropy loss.

It is important to ensure that the dialogue length is within the model's maximum sequence length. Otherwise, the entire dialogue may be masked out because it is truncated inside the dataset. In this case, you will see a 'NaN' error during training. To avoid this issue, split long dialogues into shorter segments or use a model that can handle longer sequences.
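
A quick pre-check of dialogue records can catch over-long dialogues before training. The sketch below is illustrative only: the function names are hypothetical, and whitespace splitting stands in for the model's real tokenizer, so the count is a rough lower bound that also ignores any tokens added by the chat template.

```python
def loss_turns(record):
    """Return the turns that contribute to the loss: those not spoken by the masked role."""
    return [turn for turn in record["conversations"] if turn["from"] != record["mask"]]


def rough_length(record, count_tokens=lambda text: len(text.split())):
    """Approximate dialogue length; count_tokens is a stand-in for a real tokenizer."""
    total = count_tokens(record["system"])
    for turn in record["conversations"]:
        total += count_tokens(turn["value"])
    return total


def fits(record, max_seq_length):
    """Flag dialogues that are certainly too long under the rough estimate."""
    return rough_length(record) <= max_seq_length


record = {
    "mask": "User",
    "system": "Be brief.\n\n",
    "conversations": [
        {"from": "User", "value": "Who are you?"},
        {"from": "Assistant", "value": "I am an assistant."},
    ],
}
print(len(loss_turns(record)), rough_length(record), fits(record, 2048))
```

For a real check, pass a count_tokens function backed by the tokenizer of the model being finetuned, and split any dialogue that does not fit.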

5.15.2. SFT Training

Once you have one or more datasets you would like to finetune on, you can run the finetuning script from NeMo as follows:

TRAIN="[/path/to/dataset_1.jsonl,/path/to/dataset_2.jsonl]"

VALID="[/path/to/validation_data.jsonl]"

VALID_NAMES="[your-validation-dataset-name]"

CONCAT_SAMPLING_PROBS="[0.3,0.7]"

TP_SIZE=2

PP_SIZE=1

python /opt/NeMo/examples/nlp/language_modeling/megatron_gpt_sft.py \
  trainer.precision=bf16 \
  trainer.max_steps=1000 \
  trainer.devices=8 \
  trainer.val_check_interval=200 \
  model.megatron_amp_O2=True \
  model.restore_from_path=/path/to/your/gpt.nemo \
  model.tensor_model_parallel_size=${TP_SIZE} \
  model.pipeline_model_parallel_size=${PP_SIZE} \
  model.optim.lr=5e-6 \
  model.answer_only_loss=True \
  model.data.train_ds.micro_batch_size=1 \
  model.data.train_ds.global_batch_size=128 \
  model.data.train_ds.file_names=${TRAIN} \
  model.data.train_ds.concat_sampling_probabilities=${CONCAT_SAMPLING_PROBS} \
  model.data.validation_ds.micro_batch_size=1 \
  model.data.validation_ds.global_batch_size=128 \
  model.data.validation_ds.file_names=${VALID} \
  model.data.validation_ds.names=${VALID_NAMES} \
  model.data.test_ds.micro_batch_size=1 \
  model.data.test_ds.global_batch_size=128 \
  model.data.train_ds.num_workers=0 \
  model.data.validation_ds.num_workers=0 \
  model.data.test_ds.num_workers=0 \
  model.data.validation_ds.metric.name=loss \
  model.data.test_ds.metric.name=loss \
  exp_manager.create_wandb_logger=True \
  exp_manager.explicit_log_dir=/results \
  exp_manager.resume_if_exists=True \
  exp_manager.resume_ignore_no_checkpoint=True \
  exp_manager.create_checkpoint_callback=True \
  exp_manager.checkpoint_callback_params.monitor=validation_loss

The ${TP_SIZE} and ${PP_SIZE} above should correspond to the Tensor and Pipeline model parallel sizes the /path/to/your/gpt.nemo model was saved with.
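The parallelism and batch-size settings in the command above are tied together. The sketch below works through that arithmetic under two assumptions that are not stated in this README: that the global batch size must be divisible by the micro batch size times the data-parallel size (the usual Megatron-style constraint), and that the sampling probabilities must match the training files one-to-one and sum to 1. The function names are hypothetical.

```python
def sft_batch_plan(devices, num_nodes, tp_size, pp_size,
                   micro_batch_size, global_batch_size):
    """Derive the data-parallel size and gradient accumulation steps."""
    world_size = devices * num_nodes
    model_parallel = tp_size * pp_size
    if world_size % model_parallel:
        raise ValueError("devices * num_nodes must be divisible by TP * PP")
    data_parallel = world_size // model_parallel
    samples_per_step = micro_batch_size * data_parallel
    if global_batch_size % samples_per_step:
        raise ValueError("global batch size must be divisible by "
                         "micro batch size * data-parallel size")
    return {
        "data_parallel_size": data_parallel,
        "grad_accumulation_steps": global_batch_size // samples_per_step,
    }


def check_sampling_probs(file_names, probs, tol=1e-6):
    """Check that there is one probability per file and that they sum to 1."""
    if len(file_names) != len(probs):
        raise ValueError("need exactly one sampling probability per training file")
    if abs(sum(probs) - 1.0) > tol:
        raise ValueError("sampling probabilities should sum to 1")
    return True
```

For the command above (8 devices on one node, TP_SIZE=2, PP_SIZE=1, micro batch size 1, global batch size 128), this gives a data-parallel size of 4 and 32 gradient accumulation steps.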

To finetune on a dialogue dataset, add one extra configuration line to indicate that the dataset type is dialogue:

  model.data.chat=True

5.16. Reinforcement Learning from Human Feedback

NeMo-RLHF is a library to fine-tune LLMs using Reinforcement Learning from Human Feedback (RLHF) in a scalable and fully distributed manner.

NeMo-RLHF supports only GPT models and implements the Proximal Policy Optimization (PPO) algorithm. Support for other models and RL algorithms will be added in future releases. Furthermore, NeMo-RLHF is not currently integrated into NeMo-Megatron-Launcher, so the RLHF jobs must be launched directly from the NeMo-RLHF repository in /opt/nemo-rlhf, which should be copied to the local file system in the login node.

We provide configurations to try RLHF on the newly released 2B GPT model with 4096 sequence length available on HuggingFace. We recommend users use the Anthropic HH-RLHF or the Stack Exchange Preferences datasets to get started.

5.16.1. Reward Model Training

NeMo-RLHF can be used to train your own reward model. The reward model is trained using a pairwise comparison loss and therefore needs a dataset with response pairs, where one response in the pair is ranked better than the other. A good reward model is crucial for the success of the PPO training in the next stage.
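This README does not spell out the exact loss. A widely used pairwise comparison loss is the negative log-sigmoid of the reward margin between the preferred and the rejected response; the sketch below shows that formulation as an illustration only, and it is an assumption that NeMo-RLHF uses the same form.

```python
import math


def pairwise_ranking_loss(reward_good, reward_bad):
    """Return -log(sigmoid(reward_good - reward_bad)) in a stable form."""
    margin = reward_good - reward_bad
    # log(1 + exp(-margin)), rewritten to avoid overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss shrinks as the reward model scores the preferred response higher than the rejected one, and equals log(2) when both receive the same score.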

5.16.1.1 Data preprocessing

With your own or publicly available data, start by processing them into a jsonl format. This is where prefixes should be inserted. Then use the preprocess_data_for_megatron.py script to convert this jsonl format into the NeMo format. Format your pairwise comparison dataset with the following structure:

{"text": prompt1+good_response_1}
{"text": prompt1+bad_response_1}
{"text": prompt2+good_response_2}
{"text": prompt2+bad_response_2}
...

where prompt1 and prompt2 are different prompts. Note that for the same prompt, prompt+good_response must come before prompt+bad_response in the dataset.
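A small helper can enforce this ordering when writing the jsonl file. This is a minimal sketch; the function name pairwise_jsonl_lines and the (prompt, good_response, bad_response) tuple layout are hypothetical choices for illustration.

```python
import json


def pairwise_jsonl_lines(examples):
    """Yield JSONL lines; for each prompt the preferred response comes first."""
    for prompt, good_response, bad_response in examples:
        yield json.dumps({"text": prompt + good_response}, ensure_ascii=False)
        yield json.dumps({"text": prompt + bad_response}, ensure_ascii=False)
```

Writing the yielded lines to a file, one per line, produces input suitable for the preprocessing step.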

For reference, we used the following command to preprocess the dataset with the SentencePiece tokenizer:

python3 /opt/NeMo/scripts/nlp_language_modeling/preprocess_data_for_megatron.py \
    --input "test.jsonl" \
    --output-prefix "./output" \
    --tokenizer-model sp_tokenizer_256k.model \
    --tokenizer-library=sentencepiece \
    --json-keys text \
    --dataset-impl mmap \
    --workers 30 \
    --chunk_size=100 \
    --append-eod

This generates the files output_document.bin and output_document.idx, which are used for reward model training.

5.16.1.2 Reward Model Training

To launch reward model training, we first need a pre-trained or fine-tuned NeMo checkpoint. Our training_rm.yaml file has default configurations for the 2B model, but any model can be used. An example command to begin training is:

cd /opt/nemo-rlhf \
&& export PYTHONPATH="/opt/nemo-rlhf:${PYTHONPATH}" \
&& python -u rlhf/reward_models/train_reward_model.py \
    --config-path=rlhf/reward_models/conf \
    --config-name=training_rm \
    model.pretrained_checkpoint.restore_from_path='model.nemo' \
    "model.data.data_prefix={train: [${train_output_document}], validation: [${val_output_document}], test: [${test_output_document}]}"

5.16.1.3 Reward Model Evaluation

To learn how to serve the reward model for evaluation, see the section "Launching the Reward Model inference server" below.

5.16.2. PPO Training

After fine-tuning a GPT model using Supervised Finetuning (SFT) and training a Reward Model as explained in the previous sections, NeMo-RLHF can be used to launch PPO jobs to fine-tune the SFT model using RLHF. During PPO training, four different models interact with each other:

  1. The PPO Actor Network (also known as the Policy Network): This is the model we are training, and it should start from an SFT model trained as explained in the SFT section.
  2. The Reward Model (RM) Network (also known as a Preference Model (PM)): This model will take a prompt and a response as inputs, and it will provide a single scalar value as output. This scalar value will be the reward, which the PPO algorithm will try to maximize. The RM should be a model trained as described in the RM Training section.
  3. The PPO Critic Network (also known as the Value Network): Since PPO is an actor-critic algorithm, we need a critic to help our actor learn more effectively. The critic will provide Value estimates to each token in the responses provided by the actor. These values can be seen as an estimate of the amount of reward the actor will receive after generating all the remaining tokens. The critic is loaded from the same RM we trained as described in the RM training section. Note: The RM generates a single reward for the entire sequence, whereas the Critic generates a value for each token.
  4. The Initial Policy Network (also known as the Reference Model): We use this model to compute a KL Divergence penalty term that ensures that the PPO Actor does not diverge too much from the Initial Policy. This way, we prevent the PPO Actor from overfitting to the rewards given by the RM, and ensure it does not forget the knowledge it acquired during pretraining and SFT. This model should be the same model the PPO Actor Network is initialized from.
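To make the interaction between the RM score and the KL penalty concrete, the sketch below shows one common way of combining them: each token is penalized by a scaled log-probability difference between the actor and the initial policy, and the RM's single scalar score is added at the last token. This is an illustrative assumption; the source does not give NeMo-RLHF's exact formula, and the function and parameter names are hypothetical.

```python
def kl_penalized_rewards(rm_score, actor_logprobs, init_logprobs, kl_coef):
    """Return per-token rewards: KL penalty everywhere, RM score at the end."""
    if len(actor_logprobs) != len(init_logprobs):
        raise ValueError("both models must score the same response tokens")
    # Per-token sample estimate of KL(actor || initial policy).
    rewards = [-kl_coef * (actor - init)
               for actor, init in zip(actor_logprobs, init_logprobs)]
    if rewards:
        # The RM produces a single reward for the entire sequence.
        rewards[-1] += rm_score
    return rewards
```

When the actor and the initial policy assign identical log-probabilities, the penalty vanishes and only the RM score remains.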

To launch a full PPO training job, we need to launch the RM and the Initial Policy as inference servers. These two models are not trained, so they only need to perform inference and share their result with the PPO Actor. However, the PPO Actor and PPO Critic need to be trained.

Our architecture is designed to launch all four models completely separately. Therefore, we will launch two inference servers (one for the RM and one for the initial policy), one server that can do inference and training (the PPO Critic), and one master job to do training (the PPO Actor). Next we will look at how to launch each of those four jobs.

5.16.2.1 Launching the Reward Model Inference Server

To launch the Reward Model inference server in a Linux system, this command can be run inside the container:

cd /opt/nemo-rlhf \
&& export PYTHONPATH="/opt/nemo-rlhf:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python rlhf/reward_models/serve_reward_model.py \
    --config-path=/opt/nemo-rlhf/rlhf/reward_models/conf \
    --config-name=inference_rm \
    gpt_rm_model_file=/path/to/model.nemo \
    port=5555

This command will launch the RM inference server on the local computer, using port 5555. All the configuration parameters can be modified in the inference_rm.yaml file, or by overriding them through the CLI command. Ensure server=True is set in the configuration of this job to correctly launch the inference server.

5.16.2.2 Launching the Initial Policy Inference Server

To launch the Initial Policy inference server in a Linux system, this command can be run inside the container:

cd /opt/nemo-rlhf \
&& export PYTHONPATH="/opt/nemo-rlhf:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python rlhf/rlhf_nemo/serve_initial_policy.py \
    --config-path=/opt/nemo-rlhf/rlhf/rlhf_nemo/conf \
    --config-name=inference_initial_policy \
    gpt_model_file=/path/to/model.nemo \
    port=5556

This command will launch the Initial Policy inference server on the local computer, using port 5556. All the configuration parameters can be modified in the inference_initial_policy.yaml file, or by overriding them through the CLI command. Ensure server=True is set in the configuration of this job to correctly launch the inference server.

5.16.2.3 Launching the PPO Critic Training and Inference Server

The PPO Critic has to perform both training and inference. We designed the Critic to have both capabilities. To launch the PPO Critic server in a Linux system, this command can be run inside the container:

cd /opt/nemo-rlhf \
&& export PYTHONPATH="/opt/nemo-rlhf:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python rlhf/rlhf_nemo/serve_ppo_critic.py \
    --config-path=/opt/nemo-rlhf/rlhf/rlhf_nemo/conf \
    --config-name=gpt_ppo_critic \
    model.pretrained_checkpoint.restore_from_path=/path/to/trained_rm.nemo \
    inference.port=5557

This command will launch the PPO Critic server on the local computer, using port 5557. All the configuration parameters can be modified in the gpt_ppo_critic.yaml file, or by overriding them through the CLI command. Ensure inference.server=True is set in the configuration of this job to correctly launch the server.

5.16.2.4 Launching the PPO Actor Training

The PPO Actor training job contains the master HTTP controller that makes the HTTP calls to all three servers when needed. To launch the PPO Actor training job in a Linux system, this command can be run inside the container:

cd /opt/nemo-rlhf \
&& export PYTHONPATH="/opt/nemo-rlhf:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python rlhf/rlhf_nemo/train_gpt_ppo_actor.py \
    --config-path=/opt/nemo-rlhf/rlhf/rlhf_nemo/conf \
    --config-name=gpt_ppo_actor \
    "model.data.data_prefix={train: [/path/to/train_data], validation: [/path/to/val_data], test: [/path/to/test_data]}" \
    model.pretrained_checkpoint.restore_from_path=/path/to/model.nemo

This command will launch the PPO Actor job on the local computer. All the configuration parameters can be modified in the gpt_ppo_actor.yaml file, or by overriding them through the CLI command.

5.16.2.5 Launching every job at once with SLURM

Heterogeneous jobs can be used to launch all four jobs simultaneously in different nodes, using a script like the one shown next:

#!/bin/bash
#SBATCH -N 1 --ntasks-per-node 8 -t 4:00:00 --exclusive
#SBATCH hetjob
#SBATCH -N 1 --ntasks-per-node 8 -t 4:00:00 --exclusive
#SBATCH hetjob
#SBATCH -N 1 --ntasks-per-node 8 -t 4:00:00 --exclusive
#SBATCH hetjob
#SBATCH -N 8 --ntasks-per-node 8 -t 4:00:00 --exclusive

RM_MODEL=/path/to/reward_model.nemo
ACTOR_MODEL=/path/to/sft_model.nemo

DIR=/opt/nemo-rlhf
CONTAINER="nvcr.io/ea-bignlp/nemofw-training:23.05-py3"

# START HETEROGENEOUS JOB 0

read -r -d '' cmd_rm_inference <<EOF
cd ${DIR} \
&& export PYTHONPATH="${DIR}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python rlhf/reward_models/serve_reward_model.py \
    --config-path=/opt/nemo-rlhf/rlhf/reward_models/conf \
    --config-name=inference_rm \
    gpt_rm_model_file=${RM_MODEL} \
    port=${RM_PORT=5555}
EOF

srun --het-group=0 --container-image=${CONTAINER} bash -c "${cmd_rm_inference}" &

# END HETEROGENEOUS JOB 0

####################################################

# START HETEROGENEOUS JOB 1

read -r -d '' cmd_init_policy_inference <<EOF
cd ${DIR} \
&& export PYTHONPATH="${DIR}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python rlhf/rlhf_nemo/serve_initial_policy.py \
    --config-path=/opt/nemo-rlhf/rlhf/rlhf_nemo/conf \
    --config-name=inference_initial_policy \
    gpt_model_file=${ACTOR_MODEL} \
    port=${INIT_POLICY_PORT=5556}
EOF

srun --het-group=1 --container-image=${CONTAINER} bash -c "${cmd_init_policy_inference}" &

# END HETEROGENEOUS JOB 1

sleep 30
######################################################

# START HETEROGENEOUS JOB 2

read -r -d '' cmd_critic_inference <<EOF
cd ${DIR} \
&& export PYTHONPATH="${DIR}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u rlhf/rlhf_nemo/serve_ppo_critic.py \
    --config-path=/opt/nemo-rlhf/rlhf/rlhf_nemo/conf \
    --config-name=gpt_ppo_critic \
    model.pretrained_checkpoint.restore_from_path=${RM_MODEL} \
    inference.port=${CRITIC_PORT=5557}
EOF

srun --het-group=2 --container-image=${CONTAINER} bash -c "${cmd_critic_inference}" &

# END HETEROGENEOUS JOB 2

sleep 30
####################################################

# START HETEROGENEOUS JOB 3

TRAIN_DATA_PATH=/path/to/train_data
VALID_DATA_PATH=/path/to/val_data
TEST_DATA_PATH=/path/to/test_data

host_rm="$(scontrol show hostnames=$SLURM_JOB_NODELIST_HET_GROUP_0 | head -n1)"
host_init_policy="$(scontrol show hostnames=$SLURM_JOB_NODELIST_HET_GROUP_1 | head -n1)"
host_critic="$(scontrol show hostnames=$SLURM_JOB_NODELIST_HET_GROUP_2 | head -n1)"

read -r -d '' cmd_ppo <<EOF
cd ${DIR} \
&& export PYTHONPATH="${DIR}:${PYTHONPATH}" \
&& export HYDRA_FULL_ERROR=1 \
&& python -u rlhf/rlhf_nemo/train_gpt_ppo_actor.py \
    --config-path=/opt/nemo-rlhf/rlhf/rlhf_nemo/conf \
    --config-name=gpt_ppo_actor \
    trainer.num_nodes=8 \
    "model.data.data_prefix={train: [${TRAIN_DATA_PATH}], validation: [${VALID_DATA_PATH}], test: [${TEST_DATA_PATH}]}" \
    model.pretrained_checkpoint.restore_from_path=${ACTOR_MODEL} \
    model.rlhf.reward_model.ip=${host_rm} \
    model.rlhf.reward_model.port=${RM_PORT} \
    model.rlhf.initial_policy.ip=${host_init_policy} \
    model.rlhf.initial_policy.port=${INIT_POLICY_PORT} \
    model.rlhf.critic.ip=${host_critic} \
    model.rlhf.critic.port=${CRITIC_PORT}
EOF

srun --het-group=3 --container-image=${CONTAINER} bash -c "${cmd_ppo}" &

# END HETEROGENEOUS JOB 3

wait

It is important to end each srun command with &, so that each job does not block the next one. The wait statement at the end of the script ensures that the entire job does not exit until each individual job is finished.

Note: the three servers do not support data parallelism. Therefore, the SLURM --ntasks-per-node value should be set to the model parallelism value (tensor parallelism * pipeline parallelism) for that same job, and the trainer.devices value must be set to that same value. However, the PPO actor supports data parallelism, so --ntasks-per-node can be set to the number of GPUs in each node.

5.16.2.6 PPO Hyper-parameters

All the model related parameters can be controlled the same way as in other NeMo training jobs. However, we also provide full control of the behavior of PPO during training, with a section in the config yaml files inside model.rlhf. These are the descriptions of the available hyper-parameters:

  • rlhf.reward_model: Provide the ip address and the port where the Reward Model will be running, to enable communication with it.
  • rlhf.critic: Provide the ip address and the port where the PPO Critic will be running, to enable communication with it.
  • rlhf.initial_policy: Provide the ip address and the port where the Initial Policy will be running, to enable communication with it.
  • rlhf.ppo.entropy_penalty: Control the effect of the entropy term in PPO.
  • rlhf.ppo.initial_policy_kl_penalty: Control the effect of the initial policy KL Divergence term in PPO.
  • rlhf.ppo.use_absolute_kl: Whether to use the absolute value of the initial policy KL Divergence or not.
  • rlhf.ppo.epochs: Number of epochs the actor and critic will perform on the data stored in the rollout buffer each time.
  • rlhf.ppo.num_rollout_samples: Number of samples that will be generated during the rollout stage before moving to the training stage.
  • rlhf.ppo.rollout_micro_batch_size: Micro batch size for the rollout phase. Each GPU will load this many prompts and generate responses for them.
  • rlhf.ppo.ratio_eps: epsilon value for clipping the PPO ratio during training.
  • rlhf.ppo.discount: discount factor for calculating the returns and advantages.
  • rlhf.ppo.gae_lambda: lambda value for the Generalized Advantage Estimation (GAE) calculation.
  • rlhf.ppo.normalize_advantage: whether to normalize the advantages to have a mean of zero and standard deviation of one.
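To illustrate what discount, gae_lambda, normalize_advantage and ratio_eps control, the sketch below implements the textbook forms of Generalized Advantage Estimation, advantage normalization and the clipped PPO surrogate. These are standard formulas written for illustration, not code taken from NeMo-RLHF, and the function names are hypothetical.

```python
def gae_advantages(rewards, values, discount, gae_lambda):
    """Compute GAE advantages; the value after the last token is taken as 0."""
    advantages = [0.0] * len(rewards)
    next_value = 0.0
    running = 0.0
    for t in reversed(range(len(rewards))):
        # Temporal-difference error at token t.
        delta = rewards[t] + discount * next_value - values[t]
        running = delta + discount * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def normalize(advantages, eps=1e-8):
    """Shift and scale advantages to zero mean and unit standard deviation."""
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    return [(a - mean) / (var ** 0.5 + eps) for a in advantages]


def clipped_surrogate(ratio, advantage, ratio_eps):
    """PPO objective for one token: the pessimistic of raw and clipped terms."""
    clipped = min(max(ratio, 1.0 - ratio_eps), 1.0 + ratio_eps)
    return min(ratio * advantage, clipped * advantage)
```

With a reward only at the last of three tokens, a zero critic, discount 0.5 and gae_lambda 1.0, the advantages decay geometrically towards the start of the response, which shows how discount spreads a sequence-level reward over earlier tokens.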

During the rollout phase, the sampling parameters for the model can also be modified, by using the parameters in model.sampling_params.

5.16.3. Future Work

  • The reward model training only supports datasets with two responses per prompt. We will add support for training with datasets that have more than 2 responses per prompt in future releases.
  • The throughput of PPO will be greatly increased in future releases.
  • The stability of the PPO learning process is not good enough. We will continue working to improve the PPO learning for our models.

5.17 Curating pretraining datasets with the NeMo Data Curator

The NeMo Data Curator is a Python library that consists of a collection of scalable data-mining modules for curating NLP data for training LLMs. The modules within the NeMo Data Curator enable NLP researchers to mine high-quality text at scale from massive uncurated web corpora.

Currently, within the NeMo Data Curator, we support the following data-curation modules:

  • Configurable data download and text extraction:
    • Default implementations of download and extraction of Common Crawl, Wikipedia, and ArXiv data
    • Users can easily customize the download and extraction and extend to other datasets (see NeMo Data Curator internal documentation available in the container for more information)
  • Text reformatting and cleaning via ftfy
  • Quality filtering:
    • Multilingual heuristic-based filtering
    • Classifier-based filtering via fastText
  • Document-level deduplication
    • Exact deduplication
    • Fuzzy deduplication. Our implementation of fuzzy deduplication builds off of the following existing libraries:
      • For computing MinHash signatures we use a modified version of the MinHasher class provided in pyLSH
      • For the locality sensitive hashing, we extended the Redis-based implementation found in datasketch beyond a single Redis server to a Redis Cluster. This enables this module to efficiently deduplicate large datasets that do not fit in memory of a single node (e.g., several TB of text)
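The idea behind MinHash-based fuzzy deduplication can be shown in a few lines of standard-library Python. This is an illustrative sketch only: it does not use pyLSH or datasketch, it is not the NeMo Data Curator implementation, and the function names, the 3-word shingle size and the 64-hash signature length are all assumptions.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of n-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def _hash(shingle, seed):
    # blake2b with a per-seed salt stands in for a family of hash functions.
    digest = hashlib.blake2b(
        shingle.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text, num_hashes=64):
    """One minimum hash value per seed; similar documents share many of them."""
    items = shingles(text)
    return [min(_hash(s, seed) for s in items) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Locality sensitive hashing then groups documents whose signatures agree on whole bands of slots, so that only likely near-duplicates need to be compared directly.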

The modules are implemented in a scalable manner using Message Passing Interface (MPI) for Python (mpi4py), and we use Dask for creating balanced input jsonl files. With the scalable modules within the NeMo Data Curator, we have been able to fully process a Common Crawl Snapshot (consisting of 60 TB of compressed WARC files) in approximately two days using 30 CPU nodes (with hardware similar to the c5.24xlarge Amazon AWS C5 instance). Please note that the core functions used within the NeMo Data Curator (e.g., html extraction, text cleaning, heuristic filtering, etc.) have not been fully optimized. The main goal of the NeMo Data Curator is to provide users the capability to apply these functions to their large datasets using many compute nodes.

If users desire to use the NeMo Data Curator to curate their own pretraining datasets, they should copy it out of the container using the command provided in the environment preparation section of the quick start guide. Within the nemo-data-curator directory, they can use the example SLURM scripts and the additional documentation provided in the docs sub-directory and the README of that directory.

6. Deploying the NeMo Megatron Model

This section describes the deployment of the NeMo Megatron model on the NVIDIA Triton Inference Server with the FasterTransformer backend in both single-node and multi-node environments. NVIDIA Triton Inference Server supports many inference scenarios, of which the two most important are:

  • Offline inference scenario: the goal is to maximize throughput regardless of latency, usually achieved by increasing the batch size and using the server's static batching feature.
  • Online inference scenario: the goal is to maximize throughput within a given latency budget, usually achieved with small batch sizes and an increasing number of concurrent requests to the server, using the dynamic batching feature.

6.1. Run NVIDIA Triton Server with Generated Model Repository

The inputs:

  • NVIDIA Triton model repository with FasterTransformer checkpoint ready for inference at production.
  • Docker image with NVIDIA Triton and FasterTransformer backend.

The outputs:

  • Running NVIDIA Triton model instance serving model in cluster.

To run the FasterTransformer backend on Slurm, do the following:

srun \
     --nodes=<NUMBER OF NODES>\
     --partition=<SLURM PARTITION>\
     --mpi <MPI MODE>\
     --container-image <NEMO_LAUNCHER INFERENCE CONTAINER>\
     --container-mounts <TRITON MODEL REPOSITORY>:<TRITON MODEL REPOSITORY> \
     bash -c "export CUDA_VISIBLE_DEVICES=<LIST OF CUDA DEVICES> && tritonserver --model-repository <TRITON MODEL REPOSITORY>"

Parameters:

  • NUMBER OF NODES: Number of machines in cluster, which should be used to run inference.
  • SLURM PARTITION: Slurm partition with DGX machines for inference.
  • MPI MODE: FasterTransformer uses MPI for interprocess communication, so set the MPI implementation that Slurm should use, for example pmix.
  • NEMO_LAUNCHER INFERENCE CONTAINER: Separate docker container streamlined for just inference.
  • TRITON MODEL REPOSITORY: Triton model repository created by FasterTransformer export stage.
  • LIST OF CUDA DEVICES: List of CUDA devices, which should be used by inference like 0,1,2,3.

When you run inference, the number of machines and GPUs must match the configuration set during the FasterTransformer export. At export time you set the tensor parallel (TP) and pipeline parallel (PP) configuration, which creates weight files divided between GPUs and machines. The tensor parallel configuration determines how many GPUs are used to process one transformer layer. If you set TP to 16 but your cluster contains only 8-GPU machines, then you need 2 nodes to run inference. FasterTransformer consumes all GPUs accessible to the Triton process. If you set TP to 4 but your machines contain 8 GPUs, then you must hide some GPUs from the process. The CUDA_VISIBLE_DEVICES environment variable lists the devices accessible to the CUDA library for a process, so you can use it to limit the number of GPUs used by the Triton instance. The example configuration for 126m can't be run with tensor parallel set to 8 because the number of heads in a transformer layer must be divisible by the tensor parallel value.

The table below contains example configurations for DGX machines with 8 GPUs:

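| TP | PP | #GPUs | #Nodes | CUDA DEVICES |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 0 |
| 2 | 1 | 2 | 1 | 0,1 |
| 4 | 1 | 4 | 1 | 0,1,2,3 |
| 8 | 1 | 8 | 1 | Not necessary |
| 8 | 2 | 16 | 2 | Not necessary |
| 16 | 1 | 16 | 2 | Not necessary |
| 8 | 3 | 24 | 3 | Not necessary |
| 8 | 4 | 32 | 4 | Not necessary |
| 16 | 2 | 32 | 4 | Not necessary |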
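The rows of this table follow from simple arithmetic on TP and PP. The sketch below reproduces them; the function name triton_launch_plan is hypothetical, and the value of 12 attention heads used for the 126m model in the usage note is an assumption, not something stated in this README.

```python
def triton_launch_plan(tp, pp, gpus_per_node=8, num_heads=None):
    """Return (total GPUs, nodes, CUDA_VISIBLE_DEVICES value or None)."""
    if num_heads is not None and num_heads % tp:
        raise ValueError("number of heads must be divisible by the TP value")
    total_gpus = tp * pp
    nodes = -(-total_gpus // gpus_per_node)  # ceiling division
    if total_gpus < gpus_per_node:
        # Hide the unused GPUs from the Triton process.
        visible = ",".join(str(i) for i in range(total_gpus))
    elif total_gpus % gpus_per_node == 0:
        visible = None  # every GPU on every node is used
    else:
        raise ValueError("TP * PP must fill whole nodes in this sketch")
    return total_gpus, nodes, visible
```

For example, triton_launch_plan(4, 1) returns 4 GPUs on 1 node with devices 0,1,2,3, while triton_launch_plan(8, 1, num_heads=12) raises an error, which mirrors the 126m restriction if that model has 12 heads.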

The script saves NVIDIA Triton logs so you can verify what happens when FasterTransformer loads a checkpoint. The command above starts the server, so that users can test it with other tools created later. You can use this script to demo inference. The job does not stop on its own; if you don't stop it manually, it will stop when the time limit is reached on the cluster.

FasterTransformer backend ignores missing files for weights and uses random tensors in such a scenario. You should make sure that your NVIDIA Triton instance is serving requests with real weights by inspecting logs.

If you notice warnings about missing files, double-check your model:

[WARNING] file /triton-model-repository/model_name/1/1-gpu/model.wpe.bin cannot be opened, loading model fails!
[WARNING] file /triton-model-repository/model_name/1/1-gpu/model.wte.bin cannot be opened, loading model fails!
[WARNING] file /triton-model-repository/model_name/1/1-gpu/model.final_layernorm.bias.bin cannot be opened, loading model fails!
[WARNING] file /triton-model-repository/model_name/1/1-gpu/model.final_layernorm.weight.bin cannot be opened, loading model fails!
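Because the server keeps running with random tensors, it can help to scan the saved logs automatically. The following is a minimal sketch that assumes the warning lines have exactly the shape shown above; the function name missing_weight_files is hypothetical.

```python
import re

# Matches the FasterTransformer warning format shown in the log excerpt above.
MISSING_WEIGHT = re.compile(r"\[WARNING\] file (\S+) cannot be opened")


def missing_weight_files(log_lines):
    """Return the paths of weight files the backend reported as unreadable."""
    return [m.group(1) for line in log_lines if (m := MISSING_WEIGHT.search(line))]
```

An empty result suggests that all weight files were found; any returned path points at a file to check in the model repository.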

6.2. GPT Text Generation with Ensemble

FasterTransformer for GPT implements only a part of the whole text generation application.

An ensemble model represents a pipeline of models and the connection of input and output tensors between those models. Ensemble models are intended to be used to encapsulate a procedure that involves multiple models, such as "data preprocessing -> inference -> data postprocessing". Using ensemble models for this purpose can avoid the overhead of transferring intermediate tensors and minimize the number of requests that must be sent to Triton.

A text generation example for GPT is implemented as an ensemble example in the gpt folder. This example contains four folders:

  • ensemble: ensemble definition folder.
  • fastertransformer: FasterTransformer backend folder.
  • postprocessing: Detokenizer to generate text.
  • preprocessing: Tokenizer to translate text into token IDs.

Replace the fastertransformer folder with the model store generated by the FasterTransformer export described above. The ensemble expects the model to be named fastertransformer, so make sure that your generated configuration uses that model name.

The inference container doesn't contain PyTorch, so you need to install the dependencies for the ensemble. You can start your compute node for Triton in interactive mode to access the terminal directly.

On the machine running the Triton Inference Server container, install the PyTorch and regex packages:

pip install torch regex

Start the Triton Inference Server as described in section 6.1. You can run the process in the background:

CUDA_VISIBLE_DEVICES=0 mpirun -n 1 --allow-run-as-root tritonserver --model-store /your/folders/fastertransformer_backend/all_models/gpt &

Install Triton client:

pip install tritonclient[all]

Execute end_to_end_test.py example:

python3 /your/folders/fastertransformer_backend/tools/end_to_end_test.py

The end_to_end_test.py script contains example strings, which you can replace with your own text.

6.3. UL2 Checkpoint Deployment

You can deploy UL2 T5 checkpoints by following the README provided by FasterTransformer.

You can use the Hugging Face T5 conversion script, as shown below:

python3 FasterTransformer/examples/pytorch/t5/utils/huggingface_t5_ckpt_convert.py \
        -in_file <UL2 checkpoint folder from training> \
        -saved_dir <FasterTransformer destination folder> \
        -inference_tensor_para_size <tensor parallel size> \
        -weight_data_type <data type>

The Triton FasterTransformer backend repository contains an example configuration, config.pbtxt.

You can use the Triton configuration script prepare_triton_model_config.py to modify config.pbtxt to match the configuration of your UL2 checkpoint and your cluster.

7. Performance

7.1. GPT Results

7.1.1. Training Accuracy Results

Training Accuracy: NVIDIA DGX SuperPOD (8 x 8 x A100 80GB for 126M GPT Model; 20 x 8 x A100 80GB for 5B GPT Model)

We evaluated the 126M parameter and 5B parameter models on 8 different language tasks. The results can be found in the table below. All the tasks are provided as part of the evaluation harness, so the user can evaluate any .nemo checkpoint file on all these tasks.

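| Task | Metric | 126M | 5B |
|---|---|---|---|
| Lambada | Accuracy | 38.70% | 68.93% |
| Lambada | PPL | 25.8 | 4.22 |
| Boolq | Accuracy | 56.94% | 65.29% |
| Race | Accuracy | 28.71% | 38.66% |
| Race | Accuracy Norm | 34.74% | 41.62% |
| Piqa | Accuracy | 61.21% | 73.88% |
| Piqa | Accuracy Norm | 61.97% | 75.40% |
| Hellaswag | Accuracy | 28.48% | 46.45% |
| Hellaswag | Accuracy Norm | 29.54% | 60.85% |
| Winogrande | Accuracy | 50.43% | 60.77% |
| Wikitext2 | Word PPL | 31.35 | 12.36 |
| Wikitext2 | Byte PPL | 1.9 | 1.6 |
| Wikitext2 | Bits per Byte PPL | 0.64 | 0.47 |
| Wikitext103 | Word PPL | 31.35 | 12.36 |
| Wikitext103 | Byte PPL | 1.9 | 1.6 |
| Wikitext103 | Bits per Byte PPL | 0.64 | 0.47 |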

Training the 5B GPT model to convergence takes 4.8 days, and the loss curve can be seen in the figure below:

The table below shows the converged training loss, the throughput, and the total time to train for the 5B GPT model, using a given number of GPUs and a given Global Batch Size (GBS).

| #GPUs | GBS | Seq Length | #Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 160 | 1440 | 2048 | 300B | 1.685 | 726,384 | 4.8 |
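The Time to Train value in this row can be reproduced from the token count and the throughput. The helper below is a small sketch of that arithmetic; it assumes a constant throughput and ignores restarts, evaluation and checkpointing overhead, and the function name is hypothetical.

```python
def days_to_train(num_tokens, tokens_per_second):
    """Ideal training time in days at a constant throughput."""
    seconds_per_day = 86_400
    return num_tokens / tokens_per_second / seconds_per_day
```

For 300B tokens at 726,384 tokens per second, this gives roughly 4.8 days, matching the last column of the table.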

7.1.2. Training Performance Results

Training performance:

  • NVIDIA DGX SuperPOD (16 x 8 x A100 80GB for 5B GPT model)
  • NVIDIA DGX SuperPODs (128 x 8 x A100 80GB for 175B GPT model)

We measured the throughput of training 5B and 175B parameter GPT models on different numbers of DGX nodes, and we achieved near-linear scaling. For example, when scaling from 1 node to 32 nodes with a 5B model, we achieve a 28.73x speed-up. When scaling from 8 nodes to 128 (16x more nodes) nodes with a 175B model, we achieve 14.62x speed-up. The tables and charts below show the performance results.

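| Nodes (5B model) | 1 | 2 | 4 | 8 | 16 | 32 |
|---|---|---|---|---|---|---|
| Tokens per Second | 40345 | 79815 | 161754 | 312774 | 659481 | 1159288 |
| Perfect Linear Scaling (Tokens) | 40345 | 80690 | 161380 | 322760 | 645520 | 1291040 |
| Speed-up | 1x | 1.98x | 4.01x | 7.75x | 16.35x | 28.73x |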

| Nodes (175B model) | 8 | 16 | 32 | 64 | 128 |
|---|---|---|---|---|---|
| Tokens per Second | 7500 | 14950 | 29537 | 58211 | 109684 |
| Perfect Linear Scaling (Tokens) | 7500 | 15000 | 30000 | 60000 | 120000 |
| Speed-up | 1x | 1.99x | 3.94x | 7.76x | 14.62x |
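The speed-up figures are the measured throughput divided by the throughput of the smallest configuration, and comparing them with the node ratio gives the scaling efficiency. A minimal sketch of that calculation follows; the function name scaling_report is hypothetical.

```python
def scaling_report(nodes, tokens_per_second):
    """Return (nodes, speed-up, efficiency in percent) for each measurement."""
    base_nodes, base_tps = nodes[0], tokens_per_second[0]
    report = []
    for n, tps in zip(nodes, tokens_per_second):
        speedup = tps / base_tps
        ideal = n / base_nodes  # perfect linear scaling
        report.append((n, round(speedup, 2), round(100 * speedup / ideal, 1)))
    return report
```

Applied to the 5B measurements, the 32-node entry shows a 28.73x speed-up, which is a little under 90% of perfect linear scaling.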

7.1.3. Inference Performance

Inference performance was measured for NVIDIA DGX SuperPOD (1 x 8 x A100 80GB).

Inference parameters:

  • batch size: 1
  • input tokens length: 60
  • output tokens length: 20

| GPT Model size | Average latency [ms] | TP | PP | GPUs |
|---|---|---|---|---|
| 5B | 87 | 8 | 4 | 32 |
| 20B | 202 | 8 | 4 | 32 |
| 175B | 893 | 8 | 4 | 32 |
| 530B | 977 | 32 | 1 | 32 |

7.2. T5 Results

7.2.1. Training Accuracy Results

The user can also prompt-learn on top of any .nemo trained checkpoint file on the SQuAD task mentioned in the T5 prompt-learning section. The results can be found in the table below.

| Task | Metric | 220M | 3B |
|---|---|---|---|
| SQuAD | Exact Match | 74.20 | 78.52 |
| SQuAD | F1 | 84.54 | 87.17 |

Training the 220M T5 model to convergence takes 4 days, and the loss curve can be seen in the figure below:

The table below shows the converged training loss, the throughput, and the total time to train for the 220M T5 model, using a given number of GPUs and a given Global Batch Size (GBS).

| #GPUs | GBS | Seq Length | #Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 32 | 2048 | 512 | 1T | 1.501 | 3,273,728 | 4 |

Training the 3B T5 model to convergence takes 11 days, and the loss curve of a fully trained model can be seen in the figure below:

The table below shows the converged training loss, the throughput, and the total time to train for the 3B T5 model, using a given number of GPUs and a given Global Batch Size (GBS).

| #GPUs | GBS | Seq Length | #Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 160 | 2160 | 512 | 1T | 1.147 | 1,395,131 | 11 |

7.2.2. Training Performance Results

Training Performance: NVIDIA DGX SuperPOD (20 x 8 x A100 80GB for 3B T5 Model)

We measured the throughput of training a 3B parameter T5 model on NVIDIA DGX SuperPOD using different numbers of nodes. When scaling from 1 node to 20 nodes, we achieve a 14.68x speed-up. We are actively working on improving the scaling performance for T5 models. The table and chart below show the performance results.

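| Nodes (3B model) | 1 | 2 | 4 | 5 | 10 | 20 |
|---|---|---|---|---|---|---|
| Tokens per Second | 110769 | 215579 | 417644 | 515100 | 957506 | 1626353 |
| Perfect Linear Scaling (Tokens) | 110769 | 221538 | 443077 | 553846 | 1107692 | 2215385 |
| Speed-up | 1x | 1.95x | 3.77x | 4.65x | 8.64x | 14.68x |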

7.2.3. Inference Performance

Inference performance was measured for NVIDIA DGX SuperPOD (1 x 8 x A100 80GB).

Inference parameters:

  • batch size: 1
  • input tokens length: 60
  • output tokens length: 20

| T5 Model size | Average latency [ms] | TP | PP | GPUs |
|---|---|---|---|---|
| 3B | 94 | 2 | 1 | 2 |
| 11B | 123 | 4 | 1 | 4 |
| 23B | 213 | 4 | 1 | 4 |
| 41B | 332 | 8 | 1 | 8 |

7.3. mT5 Results

7.3.1. Training Accuracy Results

Training Accuracy: NVIDIA DGX SuperPOD (4 x 8 x A100 80GB for 170M mT5 Model; 8 x 8 x A100 80GB for 390M mT5 Model; 20 x 8 x A100 80GB for 3B mT5 Model)

We evaluated our mT5 models on the XQuAD task. The results can be found in the table below. The user can fine-tune on top of any .nemo trained checkpoint file on the XQuAD task mentioned in the mT5 fine-tuning section.

| Task-Language | Metric | 170M | 390M |
|---|---|---|---|
| XQuAD-de | Exact Match | 43.0 | 54.7 |
| XQuAD-en | Exact Match | 63.8 | 68.8 |
| XQuAD-es | Exact Match | 47.0 | 55.3 |
| XQuAD-hi | Exact Match | 34.5 | 47.1 |
| XQuAD-zh | Exact Match | 46.8 | 56.1 |

The user can also apply prompt learning to any trained .nemo checkpoint file on the SQuAD task, as described in the mT5 prompt-learning section. The results can be found in the table below.

| Task | Metric | 390M | 3B |
|---|---|---|---|
| SQuAD | Exact Match | 76.86 | 81.55 |
| SQuAD | F1 | 84.67 | 89.34 |

Training the 170M mT5 model to convergence takes 4 days, and the loss curve can be seen in the figure below:

The table below shows the converged training loss, the throughput, and the total time to train for the 170M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).

| #GPUs | GBS | Seq Length | #Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 32 | 2048 | 512 | 1T | 1.980 | 4,112,062 | 4 |

Training the 390M mT5 model to convergence takes 4 days, and the loss curve can be seen in the figure below:

The table below shows the converged training loss, the throughput, and the total time to train for the 390M mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).

| #GPUs | GBS | Seq Length | #Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 64 | 2048 | 512 | 1T | 1.584 | 3,744,914 | 4 |

Training the 3B mT5 model to convergence takes 14 days, and the loss curve of a fully trained model can be seen in the figure below:

The table below shows the converged training loss, the throughput, and the total time to train for the 3B mT5 model, using a given number of GPUs and a given Global Batch Size (GBS).

| #GPUs | GBS | Seq Length | #Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 160 | 1920 | 512 | 1T | 1.134 | 911,065 | 14 |

7.3.2. Training Performance Results

Training Performance: NVIDIA DGX SuperPOD (20 x 8 x A100 80GB for 3B mT5 Model)

We measured the throughput of training a 3B parameter mT5 model on NVIDIA DGX SuperPOD using different numbers of nodes. When scaling from 1 node to 20 nodes, we achieve a 14.3x speed-up. We are actively working on improving the scaling performance for mT5 models. The table and chart below show the performance results.

| Nodes | 1 | 2 | 4 | 5 | 10 | 20 |
|---|---|---|---|---|---|---|
| Tokens per Second | 91166 | 179583 | 346263 | 429088 | 798570 | 1303767 |
| 3B Perfect Linear Scaling (Tokens) | 91166 | 182331 | 364663 | 455829 | 911657 | 1823314 |
| Speed-up | 1x | 1.97x | 3.8x | 4.71x | 8.76x | 14.3x |

7.3.3. Inference Performance

Inference performance was measured for NVIDIA DGX SuperPOD (1 x 8 x A100 80GB).

Inference parameters:

  • batch size: 1
  • input tokens length: 60
  • output tokens length: 20

| mT5 Model size | Average latency [ms] | TP | PP | GPUs |
|---|---|---|---|---|
| 380M | 35 | 1 | 1 | 1 |
| 3B | 102 | 2 | 1 | 2 |
| 11B | 134 | 4 | 1 | 4 |
| 23B | 230 | 4 | 1 | 4 |

7.4. BERT Results

7.4.1. Training Accuracy Results

Training Accuracy: NVIDIA DGX SuperPOD (16 x 8 x A100 80GB for 4B BERT Model)

Training the 4B BERT model for 95 billion tokens takes 1.5 days, and the loss curve can be seen in the figure below:

The table below shows the converged training loss, the throughput, and the total time to train for the 4B BERT model, using a given number of GPUs and a given Global Batch Size (GBS).

| #GPUs | GBS | Seq Length | #Tokens | Loss | Throughput (Tokens/sec) | Time to Train (days) |
|---|---|---|---|---|---|---|
| 16 | 2048 | 512 | 217B | 1.44 | 728178 | 1.5 |

7.4.2. Training Performance Results

Training Performance: NVIDIA DGX SuperPOD (20 x 8 x A100 80GB for 4B BERT Model)

We measured the throughput of training a 4B parameter BERT model on NVIDIA DGX SuperPOD using different numbers of nodes. When scaling from 1 node to 16 nodes, we achieve a 12.71x speed-up. The table and chart below show the performance results.

| Nodes | 1 | 2 | 4 | 8 | 16 |
|---|---|---|---|---|---|
| Tokens per Second | 57287 | 108695 | 215358 | 393167 | 728178 |
| 4B Perfect Linear Scaling (Tokens) | 57287 | 114574 | 229148 | 458296 | 916592 |
| Speed-up | 1x | 1.89x | 3.75x | 6.86x | 12.71x |

7.4.3. Training Performance Results (LDDL)

We measured the performance of different BERT configurations with and without LDDL and saw an average 25% reduction in training time. The table and chart below show the performance results.

| BERT Config | Train time without LDDL | Train time with LDDL | Model Spec | TFLOPS w/o LDDL | TFLOPS (LDDL) | Speedup (%) |
|---|---|---|---|---|---|---|
| 110m | 0.078 | 0.076 | 8 Nodes TP1 PP1 GBS 256 | 18.280 | 18.900 | 2.63% |
| 4b | 1.794 | 1.393 | 16 Nodes TP1 PP1 GBS 2048 | 108.900 | 140.400 | 28.79% |
| 20b | 7.956 | 6.79 | 32 Nodes TP4 PP4 GBS 4096 | 137.300 | 160.870 | 17.17% |
| 100b | 9.743 | 7.54 | 128 Nodes TP4 PP16 GBS 4096 | 124.88 | 162.83 | 29.22% |
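
The Speedup (%) column appears to be derived from the two train-time columns as `(time without LDDL / time with LDDL - 1) * 100`; this formula is inferred from the numbers and is not stated in the source. The sketch below, with hypothetical function names, reproduces the column and also shows the corresponding reduction in training time, which is a different quantity.

```python
# (train time without LDDL, train time with LDDL), copied from the table above.
runs = {
    "110m": (0.078, 0.076),
    "4b": (1.794, 1.393),
    "20b": (7.956, 6.79),
    "100b": (9.743, 7.54),
}


def speedup_percent(t_without: float, t_with: float) -> float:
    """How much faster the LDDL run is, relative to the LDDL run time."""
    return (t_without / t_with - 1.0) * 100.0


def time_reduction_percent(t_without: float, t_with: float) -> float:
    """How much training time is saved, relative to the non-LDDL run time."""
    return (1.0 - t_with / t_without) * 100.0


for config, (t_without, t_with) in runs.items():
    print(
        f"{config:>5}: speed-up {speedup_percent(t_without, t_with):.2f}%, "
        f"time reduction {time_reduction_percent(t_without, t_with):.2f}%"
    )
```

Keeping the two definitions separate matters when comparing figures: a speed-up percentage is always larger than the matching time-reduction percentage for the same pair of runs.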

8. Changelog

NeMo Framework 23.05

  • Low-Rank Adaptation (LoRA) Support for GPT
  • LDDL (Language Datasets and Data Loaders) for BERT on 100B model resulting in a 30% performance speedup
  • Unify dataset and model classes for all PEFT (p-tuning, adapters, IA3) with SFT model class as parent for GPT
  • Converter from Interleaved PP to non-Interleaved PP
  • Dialog dataset guidance for SFT to help create better chat models
  • Support Dynamic Sequence Length Batches with GPT SFT
  • Data parallelism enabled for RLHF servers, providing a 2x end-to-end speedup in most jobs

NeMo Framework 23.04.1

  • Addressed issue in RLHF which prevented some jobs from running in Slurm clusters
  • Corrections related to the renaming of NeMo Megatron to NeMo Framework
  • Modified run.name in the *_improved configuration files to match the correct parameter count

NeMo Framework 23.04

  • NeMo Data Curator - a scalable Python library for curating large-scale datasets required for training large language foundation models
  • Enable Continued Training for P-Tuning
  • Switch to Megatron Core for Model Parallelism in NeMo Framework
  • Extend the Data Validation Tool to provide P-Tuning GPU Runtime Estimates
  • Tensor and Pipeline Parallelism Conversion Support for GPT and T5
  • Supervised Fine-Tuning Support for GPT
  • RLHF (Reinforcement Learning from Human Feedback) for GPT
  • New GPT model sizes - 400M_improved, 1B_improved, 7B_improved, 40B_improved based on new and improved model configurations
  • List of GPT model configuration changes
| Configuration | Previous | New |
|---|---|---|
| Activation | GeLU | Fast-SwiGLU |
| Position Embedding | Learned Absolute | RoPE |
| Dropout | 0.1 | 0 |
| Embeddings and Output Layer | Tied | Untied |
| Bias terms | Yes | No |
| Normalization | LayerNorm | LayerNorm1p |

NeMo Framework 23.03

  • Per micro-batch data loader for GPT and BERT
  • SquaredReLU and SwiGLU activation function support for GPT and T5
  • Rotary Position Embedding (RoPE) for GPT and RETRO
  • Early stopping support when P-Tuning/Prompt Tuning GPT, T5, and mT5
  • Refactored Adapter learning implementation to mimic the Parameter-Efficient Transfer Learning for NLP approach
  • Flash Attention for GPT models in Transformer Engine

Announcement

Coming soon: Prospector-LM, a data curation module. It is a scalable Python library for curating large-scale datasets and can be leveraged for training large language foundation models.

NeMo Framework 23.01

  • BERT with tensor parallelism support (training only)
  • BERT with pipeline parallelism support (training only)
  • Sequence Parallelism and Selective Activation Checkpointing for BERT (training only)
  • Interleaved Pipeline Scheduling for BERT
  • Distributed Adam Optimizer for BERT
  • AutoConfigurator for BERT
  • 110M, 4B, 20B, and 100B BERT training configurations
  • Support for the Mixture of Experts for T5 (no expert parallelism, training only)
  • Performance improvement for GPT P-Tuning (20% - 25% speed-up)
  • ALiBi Position Embeddings for T5 and mT5 (training only)
  • Log total model size (across model parallel ranks) for GPT, T5, mT5, and BERT

NeMo Framework 22.11

  • Interleaved Pipeline Scheduling for GPT (training only)
  • FP8 support using Transformer Engine (training only)
  • Distributed Adam Optimizer for T5 and mT5
  • P-Tuning and Prompt Tuning for GPT with Sequence Parallelism
  • Training configurations improved throughput by 7.9% (5B GPT), 9.6% (3B T5), 4.3% (11B T5), 52.4% (23B T5), and 26.6% (41B T5)

NeMo Framework 22.09

  • NeMo Framework supports training and inference containers on OCI. For more details about orchestration scripts, reach out to [email protected]
  • P-Tuning and Prompt Tuning for T5 and mT5 with pipeline parallelism (training only)
  • Adapter learning for GPT and T5 with tensor parallelism and pipeline parallelism (training only)
  • IA3 learning for GPT and T5 with tensor parallelism and pipeline parallelism (training only)
  • AutoConfigurator to find the highest throughput configs for training on Base Command Platform
  • AutoConfigurator: parallel inference hyperparameter search for GPT on Base Command Manager

NeMo Framework 22.08.01

  • Cloud service providers: support for Amazon Web Services (performance validated up to 20 p4d.24xlarge instances)
  • Cloud service providers: switched orchestration from Azure CycleCloud to NVIDIA Nephele for Microsoft Azure

NeMo Framework 22.08

  • Distributed Adam Optimizer for GPT
  • Asymmetric encoder-decoder configuration for T5 and mT5
  • Support for untying embeddings from the classifier layer for T5 and mT5
  • Relative Position Embeddings for T5 and mT5 (pipeline parallelism>=3)
  • P-Tuning and Prompt Tuning for T5 and mT5 with tensor parallelism (training only)
  • Code refactor - improved consistency and readability of configurations and logs
  • SQuAD fine-tuning and evaluation support for T5 with pipeline parallelism <=2
  • XQuAD fine-tuning and evaluation support for mT5 with pipeline parallelism <=2

NeMo Framework 22.06-hotfix.01

  • Fix: AutoConfigurator for T5 and mT5
  • Fix: Evaluation harness in GPT
  • Fix: Prompt learning in GPT
  • Fix: Out of memory when pretraining GPT with Sequence Parallelism

NeMo Framework 22.06

  • Sequence Parallelism and Selective Activation Checkpointing for GPT
  • Relative Position Embeddings for T5
    • We used mC4 dataset (24 Languages) for pretraining the mT5 and verified our results on KNLI, KorQuAD, KLUE-STS, and XNLI tasks
  • AutoConfigurator update with Sequence Parallelism and Selective Activation Checkpointing for GPT
  • AutoConfigurator: support for DGX A100 40GB configurations for GPT, T5, and mT5
  • P-Tuning and Prompt Tuning for GPT with pipeline parallelism (training only)
  • Operation fusions for higher training throughput (2%-7% speed-up)
  • Default GPT configurations changed to include Sequence Parallelism and Selective Activation Checkpointing: 20B (speed-up: 14%), 40B (speed-up: 9%), 175B (speed-up: 15%)

NeMo Framework 22.05.01

  • Cloud service providers: support for Microsoft Azure (performance validated up to 36 Standard_ND96amsr_A100_v4 instances)
  • Cluster validation tools (DCGM, NCCL)
  • 20B GPT training configuration improved by 2.7% for higher throughput

NeMo Framework 22.05

  • Asynchronous gradient all-reduce for GPT, T5, mT5 models with pipeline parallel size equal to 1
  • P-Tuning and Prompt Tuning for GPT with tensor parallelism (training only)
  • AutoConfigurator to find the highest throughput configs for training and inference on Base Command Manager
  • Custom tokenizer support (training only)
  • GPT with pipeline parallelism support on Base Command Manager (inference)
  • Hyperparameters for text generation: top-p, top-k, and temperature
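
To illustrate what the three text-generation hyperparameters control, here is a minimal, framework-independent sketch of sampling one token from raw logits. It is not the NeMo implementation; `sample_token` and its argument names are hypothetical, and it assumes `temperature > 0`.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick a token index from raw logits using temperature, top-k and top-p."""
    # Temperature rescales logits: values below 1 sharpen the distribution,
    # values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    # Softmax with max-subtraction for numerical stability.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank token indices from most to least likely.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # top-k: keep only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]
    # top-p: keep the smallest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Sample among the surviving tokens; choices() renormalizes the weights.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# With top_k=1 the call is deterministic and returns the most likely token.
greedy = sample_token([0.1, 5.0, 0.2], top_k=1)
print(greedy)
```

Passing a seeded `random.Random` instance as `rng` makes the sampled output reproducible.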

NeMo Framework 22.04

  • T5 with pipeline parallelism support (training only)
  • Switched from GeLU to GeGLU as activation function for T5
  • mT5 with tensor parallelism and pipeline parallelism support (training only)
  • 11B, 23B, and 41B T5 training configurations
  • 170M, 390M, and 3B mT5 training configurations
  • Automatic and configurable Non-Uniform Memory Access (NUMA) mapping

NeMo Framework 22.03

  • T5 with tensor parallelism support (optimized for <20B parameters, training only)
  • 220M and 3B T5 training configurations
  • GLUE fine-tuning and evaluation support for T5

NeMo Framework 22.02

  • GPT with pipeline parallelism support (training only)
  • 40B and 175B GPT training configurations

NeMo Framework 22.01

  • GPT with tensor parallelism support on Base Command Platform
  • O2-style AMP (accelerated training of larger models)
  • Chatbot sample application using your trained GPT model
  • Training metric monitoring and visualization with Weights & Biases

9. Known Issues

Fixes for the following issues will be released shortly:

  • The inference hyperparameter search is not available in this release for T5 and mT5.
  • Accuracy and performance measurement for GPT-3 is currently not supported. Please use the NeMo Megatron 22.05 inference container to use this feature.
  • The fine-tuning SQuAD results for T5 are lower than expected.
  • There has been a slight regression in T5 performance and this will be addressed in an upcoming release.
  • Evaluation for GPT has been tested for PP <=2 and may have issues for PP >2. It is recommended to convert to TP only for Evaluation.
  • Transformer Engine (TE)-based GPT models are currently not supported for any Parameter Efficient Fine Tuning (PEFT) techniques - this will be added soon.
  • TE-based GPT Eval will take more memory than non-TE-based GPT Eval.
