Wednesday, June 12, 2024

Speed up Mixtral 8x7B pre-training with expert parallelism on Amazon SageMaker


Mixture of Experts (MoE) architectures for large language models (LLMs) have recently gained popularity due to their ability to increase model capacity and computational efficiency compared to fully dense models. By employing sparse expert subnetworks that process different subsets of tokens, MoE models can effectively increase the number of parameters while requiring less computation per token during training and inference. This enables more cost-effective training of larger models within fixed compute budgets compared to dense architectures.

Despite their computational benefits, training and fine-tuning large MoE models efficiently presents some challenges. MoE models can struggle with load balancing if tokens aren't evenly distributed across experts during training, and some experts may become overloaded while others are under-utilized. MoE models also have high memory requirements, because all expert parameters need to be loaded into memory even though only a subset is used for each input.

In this post, we highlight new features of the Amazon SageMaker model parallelism library (SMP) that enable efficient training of MoE models using expert parallelism. Expert parallelism is a type of parallelism that splits the experts of an MoE model across separate workers or devices, similar to how tensor parallelism can partition dense model layers. We demonstrate how to use these new features of SMP by pre-training the 47 billion parameter Mixtral 8x7B MoE model using expert parallelism. To learn more, refer to our GitHub repo and Expert parallelism.

Expert parallelism

The Mixtral 8x7B model has a sparse MoE architecture, containing eight expert subnetworks with around 7 billion parameters each. A trainable gate network called a router determines which input tokens are sent to which expert. With this architecture, the experts specialize in processing different aspects of the input data. The complete Mixtral 8x7B model has a total of 47 billion parameters, but only around 12.9 billion (two experts, for this model architecture) are activated for any given input token; this results in improved computational efficiency relative to a dense model of the same total size. To learn more about the MoE architecture in general, refer to Applying Mixture of Experts in LLM Architectures.
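To make the sparse activation concrete, here is a small standalone sketch of how a top-2 router selects which experts process a single token. This is our own illustration, not Mixtral's actual router code, and the logit values are made up:

```python
import math

def top2_route(router_logits):
    """Pick the two highest-scoring experts for one token and
    renormalize their softmax probabilities into routing weights."""
    # Softmax over all expert logits (numerically stabilized).
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep only the top-2 experts; the other six stay inactive.
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    weight_sum = sum(probs[i] for i in top2)
    return {i: probs[i] / weight_sum for i in top2}

# Example: 8 experts, one token; only experts 3 and 5 are activated.
weights = top2_route([0.1, -1.2, 0.3, 2.0, 0.0, 1.5, -0.5, 0.2])
print(sorted(weights))                   # [3, 5]
print(round(sum(weights.values()), 6))   # 1.0
```

Only the parameters of the two selected experts participate in the forward pass for this token, which is where the 12.9 billion active-parameter figure comes from.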

SMP adds support for expert parallelism

SMP now supports expert parallelism, which is essential to performant MoE model training. With expert parallelism, the different expert subnetworks that comprise the MoE layers are placed on separate devices. During training, different data is routed to the different devices, with each device handling the computation for the experts it contains. By distributing experts across workers, expert parallelism addresses the high memory requirements of loading all experts on a single device and enables MoE training on a larger cluster. The following figure offers a simplified look at how expert parallelism works on a multi-GPU cluster.

The SMP library uses NVIDIA Megatron to implement expert parallelism and support training MoE models, and runs on top of PyTorch Fully Sharded Data Parallel (FSDP) APIs. You can keep using your PyTorch FSDP training code as is and activate SMP expert parallelism for training MoE models. SMP offers a simplified workflow where you need to specify the expert_parallel_degree parameter, which will evenly divide experts across the number of GPUs in your cluster. For example, to shard your model while using an instance with 8 GPUs, you can set the expert_parallel_degree to 2, 4, or 8. We recommend that you start with a small number and gradually increase it until the model fits in the GPU memory.
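As a quick sanity check before launching a job, you can enumerate which degrees are even possible. The following helper is our own illustration (not an SMP API); it lists the values of expert_parallel_degree that evenly divide both the GPU count and the expert count:

```python
def valid_expert_parallel_degrees(num_gpus, num_experts):
    """Return the degrees that evenly divide both the GPU count
    and the expert count, so experts split cleanly across devices."""
    return [d for d in range(1, num_gpus + 1)
            if num_gpus % d == 0 and num_experts % d == 0]

# One ml.p4d.24xlarge instance (8 GPUs) training Mixtral 8x7B (8 experts):
print(valid_expert_parallel_degrees(8, 8))  # [1, 2, 4, 8]
```

Starting from the smallest candidate and increasing until the model fits, as recommended above, keeps the all-to-all communication footprint as small as possible.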

SMP's expert parallelism is compatible with sharded data parallelism

SMP's expert parallel implementation is compatible with sharded data parallelism, enabling more memory-efficient and faster training. To understand how this works, consider an MoE model in the following example with eight experts (N=8) training on a simple cluster with one node containing 4 GPUs.

SMP's expert parallelism splits the MoE experts across GPUs. You control how many experts are instantiated on each device by using the expert_parallel_degree parameter. For example, if you set the degree to 2, SMP will assign half of the eight experts to each data parallel group. The degree value must be a factor of the number of GPUs in your cluster and of the number of experts in your model. Data is dynamically routed to and from the GPU or GPUs hosting the selected expert using all-to-all GPU communication.
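The all-to-all exchange can be pictured in plain Python: each rank groups its tokens by the rank that hosts the assigned expert, sends each bucket to that rank, and receives back the tokens whose experts it hosts. This toy, single-process sketch (our own illustration, not SMP internals) shows only the bucketing step:

```python
def bucket_tokens_by_rank(token_to_expert, experts_per_rank):
    """Group token ids by the expert-parallel rank hosting their expert.
    Each bucket is what one rank would send to another in the all-to-all."""
    buckets = {}
    for token_id, expert_id in token_to_expert.items():
        rank = expert_id // experts_per_rank
        buckets.setdefault(rank, []).append(token_id)
    return buckets

# 8 experts over 4 expert-parallel ranks -> 2 experts hosted per rank.
assignments = {0: 1, 1: 6, 2: 3, 3: 0, 4: 7}  # token id -> expert id
print(bucket_tokens_by_rank(assignments, experts_per_rank=2))
# {0: [0, 3], 3: [1, 4], 1: [2]}
```

After the experts run, a second all-to-all returns each token's output to the rank that originally held it.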

Next, sharded data parallelism partitions and distributes the experts as well as the non-MoE layers of the model, such as attention layers or routers, across your cluster to reduce the memory footprint of the model. The hybrid_shard_degree parameter controls this. For example, a hybrid_shard_degree of 2 will shard the model states (including experts and non-MoE layers) across half of the GPUs in our cluster. The product of expert_parallel_degree and hybrid_shard_degree should not exceed the world size of the cluster. In the following example, hybrid_shard_degree * expert_parallel_degree = 4 is a valid configuration.
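The product constraint is easy to check up front. This short helper is our own illustration (SMP validates the configuration itself when the job starts):

```python
def is_valid_config(expert_parallel_degree, hybrid_shard_degree, world_size):
    """The product of the two degrees must not exceed the world size."""
    return expert_parallel_degree * hybrid_shard_degree <= world_size

# One node with 4 GPUs (world size 4), as in the example above:
print(is_valid_config(2, 2, 4))  # True: 2 * 2 = 4 fits exactly
print(is_valid_config(4, 2, 4))  # False: 4 * 2 = 8 exceeds the world size
```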

Solution overview

With the background out of the way, let's dig into the components of our distributed training architecture. The following diagram illustrates the solution architecture.

In this example, we use SageMaker training jobs. With SageMaker training jobs, you can launch and manage clusters of high-performance instances with simple API calls. For example, you can use the SageMaker Estimator to specify the type and quantity of instances to use in your distributed systems with just a few lines of code. Later in this post, we use a cluster of two ml.p4d.24xlarge instances to train our model by specifying these parameters in our Estimator. To learn about SageMaker training jobs, see Train a Model with Amazon SageMaker.

In this post, we use the SMP library to efficiently distribute the workload across the cluster using hybrid sharded data parallelism and expert parallelism. In addition to these implementations, SMP offers many other performance-improving and memory-saving techniques, such as:

  • Mixed precision training and FP8 support for dense Llama models (which accelerates distributed training and takes advantage of the performance improvements on P5 instances)
  • Tensor parallelism composable with sharded data parallelism
  • Delayed parameter initialization
  • Activation checkpointing (a technique to reduce memory usage by clearing the activations of certain layers and recomputing them during the backward pass)

For the latest updates, refer to SageMaker model parallelism library v2.

Along with SMP, this example also uses the SageMaker distributed data parallel library (SMDDP). As you scale your workload and add instances to your cluster, the overhead of communication between instances also increases, which can lead to a drop in overall computational performance and training efficiency. This is where SMDDP helps. SMDDP includes optimized communication collectives such as AllGather that are designed for AWS network infrastructure. Because of this, SMDDP can outperform other more general communications libraries such as NCCL when training on SageMaker.

Together, the SMP and SMDDP libraries can accelerate large distributed training workloads by up to 20%. Additionally, these libraries are compatible with standard PyTorch APIs and capabilities, which makes it convenient to adapt any existing PyTorch FSDP training script to the SageMaker training platform and take advantage of the performance improvements that SMP and SMDDP provide. To learn more, see SageMaker model parallelism library v2 and Run distributed training with the SageMaker distributed data parallelism library.

In the following sections, we showcase how you can accelerate distributed training of the Hugging Face Transformers Mixtral 8x7B model on P4 instances using SMP and SMDDP.


Prerequisites

You need to complete some prerequisites before you can run the Mixtral notebook.

First, make sure you have created a Hugging Face access token so you can download the Hugging Face tokenizer used later. After you have the access token, you need to make a few quota increase requests for SageMaker. You need to request a minimum of 2 P4d instances, up to a maximum of 8 P4d instances (depending on the time-to-train and cost-to-train trade-offs for your use case).

On the Service Quotas console, request the following SageMaker quotas:

  • P4 instances (ml.p4d.24xlarge) for training job usage: 2–8

It may take up to 24 hours for the quota increase to get approved.

Now that you're ready to begin the process of pre-training the Mixtral model, we start with dataset preparation in the next step.

Prepare the dataset

We begin our tutorial by preparing the dataset. This covers loading the GLUE/SST2 dataset, tokenizing and chunking the dataset, and configuring the data channels for SageMaker training on Amazon Simple Storage Service (Amazon S3). Complete the following steps:

  1. You first need to load the GLUE/SST2 dataset and split it into training and validation datasets. (The original code listing was truncated; the load_dataset() arguments below are filled in from the hyperparameters dictionary.)
    hyperparameters = {
        "cache_dir": "tmp",
        "dataset_config_name": "sst2",
        "dataset_name": "glue",
        "do_train": True,
        "do_eval": True,
    }
    raw_datasets = load_dataset(
        hyperparameters["dataset_name"],
        hyperparameters["dataset_config_name"],
    )
    del raw_datasets["validation"]
    if "validation" not in raw_datasets.keys():
        validation_percentage = "10%"
        raw_datasets["validation"] = load_dataset(
            hyperparameters["dataset_name"],
            hyperparameters["dataset_config_name"],
            split=f"train[:{validation_percentage}]",
        )
        raw_datasets["train"] = load_dataset(
            hyperparameters["dataset_name"],
            hyperparameters["dataset_config_name"],
            split=f"train[{validation_percentage}:]",
        )

  2. Load the Mixtral-8x7B tokenizer from the Hugging Face Transformers library:
    tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1", **tokenizer_kwargs)

Next, you define two utility functions: tokenize_function() and group_texts(). The tokenize_function() runs the tokenizer on the text data. The group_texts() function concatenates all texts from the dataset and generates chunks of a block size that corresponds to the model's input length (2048) for this example. By chunking the text data into smaller pieces, you make sure the model can process the entire dataset during training, even when some text examples are longer than the input length (2048).
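The effect of this concatenate-and-chunk step can be seen with a toy example. This standalone sketch (our own, using a tiny block size instead of the tutorial's 2048) mirrors the arithmetic that group_texts() performs:

```python
def chunk_token_ids(sequences, block_size):
    """Concatenate tokenized sequences and split them into fixed-size
    blocks, dropping the remainder that doesn't fill a full block."""
    flat = [tok for seq in sequences for tok in seq]
    usable = (len(flat) // block_size) * block_size
    return [flat[i : i + block_size] for i in range(0, usable, block_size)]

# Three "tokenized" examples of uneven length, block size 4:
chunks = chunk_token_ids([[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]], block_size=4)
print(chunks)  # [[1, 2, 3, 4], [5, 6, 7, 8]] -- tokens 9 and 10 are dropped
```

Note that example boundaries disappear: the model trains on fixed-length windows over the concatenated stream, which is standard practice for causal language model pre-training.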

  1. Define the functions with the following code:
    def tokenize_function(examples):
        output = tokenizer(examples[text_column_name])
        return output

    def group_texts(examples):
        # Concatenate all texts.
        concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        # Drop the small remainder so every chunk has exactly block_size tokens.
        if total_length >= block_size:
            total_length = (total_length // block_size) * block_size
        # Split by chunks of max_len.
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }
        result["labels"] = result["input_ids"].copy()
        return result

  2. Call the preceding utility functions on your dataset to tokenize it and generate chunks suitable for the model:
    tokenized_datasets = raw_datasets.map(
        tokenize_function, batched=True, num_proc=1, remove_columns=column_names
    )
    lm_datasets = tokenized_datasets.map(group_texts, batched=True)

  3. Prepare the training and validation datasets for SageMaker training by saving them as JSON files and constructing the S3 paths where these files will be uploaded:
    train_dataset = lm_datasets["train"]
    training_dataset_location = f"s3://{default_bucket}/dataset/train/"
    eval_dataset = lm_datasets["validation"]
    validation_dataset_location = f"s3://{default_bucket}/dataset/validation/"

  4. Finally, set up the data channels for SageMaker training by creating TrainingInput objects from the provided S3 bucket paths for the training and test/validation datasets:
    train = sagemaker.inputs.TrainingInput(
        s3_train_bucket, distribution="FullyReplicated",
    )
    data_channels = {"train": train}
    test = sagemaker.inputs.TrainingInput(
        s3_test_bucket, distribution="FullyReplicated",
    )
    data_channels["test"] = test

You're now ready to run pre-training or fine-tuning on the dataset.

Pre-train Mixtral 8x7B with expert parallelism on SMP

To pre-train the Mixtral 8x7B model, complete the following steps:

  1. Initialize the script with torch.sagemaker.init() to activate the SMP library:
    import torch.sagemaker as tsm
    tsm.init()

  2. Import the MoEConfig class from the torch.sagemaker.transform API. We use the MoEConfig class to enable the model to use the SMP implementation of MoE:
    from torch.sagemaker.transform import MoEConfig

  3. Create a model configuration for the Mixtral 8x7B model. This will be passed to AutoModelForCausalLM.from_config(model_config, attn_implementation="flash_attention_2") from the Hugging Face Transformers library to initialize the model with random weights. If you want to fine-tune, you can provide the path to the pre-trained weights instead of the model configuration.
    model_config = MixtralConfig(
        vocab_size=args.vocab_size,                      # 32000
        hidden_size=args.hidden_width,                   # 4096
        intermediate_size=args.intermediate_size,        # 14336
        num_hidden_layers=args.num_layers,               # 32
        num_attention_heads=args.num_heads,              # 32
        num_key_value_heads=args.num_key_value_heads,    # 8
        max_position_embeddings=args.max_context_width,  # 4096 * 32
        initializer_range=args.initializer_range,        # 0.02
        sliding_window=args.sliding_window,              # None
        num_experts_per_tok=args.num_experts_per_tok,    # 2
        num_local_experts=args.num_local_experts,        # 8
    )
    model = AutoModelForCausalLM.from_config(
        model_config, dtype=dtype, attn_implementation="flash_attention_2"
    )

In the example Jupyter notebook, you use a create_model() function that invokes the AutoModelForCausalLM.from_config() function.

  1. Create the SMP MoE configuration class. In the following code, you specify parameters that are passed to the training estimator in the subsequent steps. To learn more about the SMP MoEConfig class, refer to the SMP documentation.
    moe_config = MoEConfig(
        smp_moe=args.use_smp_implementation > 0,  # Whether to use the SMP implementation of MoE. The default value is True.
        random_seed=args.seed,  # A seed number for the random operations in expert-parallel distributed modules. This seed is added to the expert parallel rank to set the actual seed for each rank, so it is unique for each expert parallel rank. The default value is 12345.
        moe_load_balancing=args.moe_load_balancing,  # Specify the load balancing type of the MoE router. Valid options are aux_loss, sinkhorn, balanced, and none. The default value is sinkhorn.
        global_token_shuffle=args.global_token_shuffle > 0,  # Whether to shuffle tokens across EP ranks within the same expert parallel group. The default value is False.
        moe_all_to_all_dispatcher=args.moe_all_to_all_dispatcher > 0,  # Whether to use the all-to-all dispatcher for the communications in MoE. The default value is True.
    )

  2. With the model and MoE configuration ready, you wrap the model with the SMP transform API and pass the MoE configuration. Here, the tsm.transform method adapts the model from Hugging Face format to SMP format. For more information, refer to torch.sagemaker.transform.
    model = tsm.transform(model, config=moe_config)

  3. Define the training hyperparameters, including the MoE configuration and other settings specific to the model and training setup:
    hyperparameters = {
        # MoE config
        "moe": 1,
        "moe_load_balancing": "sinkhorn",
        "moe_all_to_all_dispatcher": 1,
        "seed": 12345,
        # Rest of the hyperparameters
        "model_type": "mixtral",
        "sharding_strategy": "hybrid_shard",
        "delayed_param": 1,
        "epochs": 100,
        "activation_checkpointing": 1,
        "beta1": 0.9,
        "bf16": 1,
        "fp8": 0,
        "checkpoint_dir": "/opt/ml/checkpoints",
    }

We enable delayed parameter initialization in SMP, which allows initializing large models on a meta device without attaching data. This can resolve limited GPU memory issues when you first load the model. This approach is particularly useful for training LLMs with tens of billions of parameters, where even CPU memory might not be sufficient for initialization.
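Conceptually, delayed initialization records only parameter metadata (shape and dtype) up front and defers allocating real storage until the model has been partitioned, so each worker materializes only its own shard. The following pure-Python sketch is our own analogy for the pattern, not the SMP or PyTorch meta-device implementation:

```python
class DelayedParam:
    """Record only the shape at construction; allocate storage on demand."""
    def __init__(self, shape):
        self.shape = shape  # metadata only, like a tensor on the "meta" device
        self.data = None    # no memory committed yet

    def materialize(self):
        # Allocate and initialize real storage (zeros here, for simplicity).
        rows, cols = self.shape
        self.data = [[0.0] * cols for _ in range(rows)]
        return self.data

# Declaring the parameter costs almost nothing...
p = DelayedParam((4, 8))
print(p.data is None)               # True: nothing allocated yet
# ...memory is only committed once this worker actually owns the shard.
p.materialize()
print(len(p.data), len(p.data[0]))  # 4 8
```

In SMP, setting "delayed_param": 1 in the hyperparameters enables this behavior for the whole model.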

SMP supports various routing strategies, including sinkhorn, balanced, and aux_loss. Each offers a distinct load balancing approach to achieve equitable token assignment among the experts, thereby maintaining a balanced workload distribution.

  1. Specify the parameters for expert_parallel_degree and hybrid_shard_degree:
    expert_parallel_degree = 2  # An integer in [1, world_size]
    hybrid_shard_degree = (
        8  # An integer in [0, world_size // expert_parallel_degree]; the default value is 0
    )

Hybrid sharding is a memory saving technique between `FULL_SHARD` and `NO_SHARD`, with `FULL_SHARD` saving the most memory and `NO_SHARD` not saving any. This technique shards parameters within the hybrid shard degree (HSD) group and replicates parameters across groups. The HSD controls sharding across GPUs and can be set to an integer from 0 to `world_size`.

An HSD of 8 applies `FULL_SHARD` within a node and then replicates parameters across nodes, because there are 8 GPUs in the nodes we are using. This results in reduced communication volume because expensive all-gathers and reduce-scatters are only done within a node, which can be more performant for medium-sized models. Generally, you want to use the smallest HSD that doesn't cause out of memory (OOM) errors. If you're experiencing OOM, try increasing the hybrid shard degree to reduce memory usage on each node.
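To see how the HSD carves up a cluster, this small helper (our own illustration, not an SMP API) lists the sharding groups for our 2-node, 16-GPU cluster with HSD = 8; parameters are fully sharded within each group and replicated across the two groups:

```python
def shard_groups(world_size, hybrid_shard_degree):
    """Partition global ranks into groups of size hybrid_shard_degree.
    Model states are sharded within each group and replicated across groups."""
    assert world_size % hybrid_shard_degree == 0
    return [list(range(start, start + hybrid_shard_degree))
            for start in range(0, world_size, hybrid_shard_degree)]

# Two ml.p4d.24xlarge instances (8 GPUs each) with HSD = 8: one group per node.
groups = shard_groups(world_size=16, hybrid_shard_degree=8)
print(len(groups))  # 2 replication groups
print(groups[0])    # [0, 1, 2, 3, 4, 5, 6, 7]
```

With this layout, all-gathers and reduce-scatters stay on the fast intra-node NVLink fabric, and only gradient reduction crosses the inter-node network.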

  1. With all the necessary configurations in place, you now create the PyTorch estimator to encapsulate the training setup and launch the training job. We run the pre-training on two ml.p4d.24xlarge instances, where each instance contains 8 NVIDIA A100 GPUs:
    smp_estimator = PyTorch(
        # ... other Estimator arguments elided ...
        distribution={
            "torch_distributed": {
                "enabled": True,
            },
            "smdistributed": {
                "modelparallel": {
                    "enabled": True,
                    "parameters": {
                        "activation_loading_horizon": activation_loading_horizon,
                        "hybrid_shard_degree": hybrid_shard_degree,
                        "sm_activation_offloading": offload_activations,
                        "expert_parallel_degree": expert_parallel_degree,
                    },
                },
            },
        },
    )

  2. Finally, launch the pre-training workload:
    smp_estimator.fit(inputs=data_channels)

Clean up

As part of cleanup, you can delete the SageMaker default bucket created to host the GLUE/SST2 dataset.


Conclusion

Training large MoE language models like the 47 billion parameter Mixtral 8x7B can be challenging due to high computational and memory requirements. By using expert parallelism and sharded data parallelism from the SageMaker model parallelism library, you can effectively scale these MoE architectures across multiple GPUs and workers.

SMP's expert parallelism implementation seamlessly integrates with PyTorch and the Hugging Face Transformers library, allowing you to enable MoE training with simple configuration flags without changing your existing model code. Additionally, SMP provides performance optimizations like hybrid sharding, delayed parameter initialization, and activation offloading and recomputation to further improve training efficiency.

For the complete sample to pre-train and fine-tune Mixtral 8x7B, see the GitHub repo.

Special thanks

Special thanks to Rahul Huilgol, Gautam Kumar, and Luis Quintela for their guidance and engineering leadership in developing this new capability.

About the Authors

Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS based in Munich, Germany. Roy helps AWS customers, from small startups to large enterprises, train and deploy large language models efficiently on AWS. Roy is passionate about computational optimization problems and improving the performance of AI workloads.

Kanwaljit Khurmi is a Principal Solutions Architect at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance, helping them improve the value of their solutions when using AWS. Kanwaljit specializes in helping customers with containerized and machine learning applications.

Robert Van Dusen is a Senior Product Manager with Amazon SageMaker. He leads frameworks, compilers, and optimization techniques for deep learning training.

Teng Xu is a Software Development Engineer in the Distributed Training group in AWS AI. He enjoys reading.

Suhit Kodgule is a Software Development Engineer with the AWS Artificial Intelligence group working on deep learning frameworks. In his spare time, he enjoys hiking, traveling, and cooking.
