Tutorial: Scaling FLAN-T5 Model Training

This article shows how to scale FLAN-T5 training with DeepSpeed ZeRO, from the base version up to the 3B and 11B parameter versions, using model parallelism, multiple GPUs, and mixed-precision training to improve efficiency. The tutorial covers environment setup, data preprocessing, DeepSpeed configuration, and the training script, and compares the performance and cost of different hardware setups. The experiments show that mixed-precision training (bf16) significantly lowers memory requirements: an 8x A100 40GB setup trains FLAN-T5-XXL efficiently, and a 4x A10G 24GB setup can handle it with offloading. In addition, configurations without model offloading are faster and cheaper once the batch size exceeds 4.

📊 DeepSpeed ZeRO reduces memory consumption by partitioning optimizer states, gradients, and model parameters, enabling training of large Transformer models. It is natively integrated into the Hugging Face Transformers Trainer: a DeepSpeed config file is all that is needed to enable ZeRO, including ZeRO-Offload, which moves the optimizer and model to the CPU so that even larger models can be trained.

🚀 Mixed-precision training (bf16) roughly halves the memory footprint compared to fp32 and is the key to training FLAN-T5-XXL. The experiments show that an 8x A100 40GB setup trains efficiently with bf16 and that 4x A10G 24GB is sufficient with offloading, while fp32 runs out of memory on several of the tested setups. Pick a precision strategy that matches your hardware.

📈 Training cost depends heavily on hardware and strategy. Configurations without model offloading are roughly 2x faster and cheaper once the batch size exceeds 4. The comparison shows that bf16 training of FLAN-T5-XXL on 8x A100 takes about 19 hours (~$613) with offloading and about 10 hours (~$322) without, while the fp32 configurations take significantly longer or run out of memory.

🔧 Setting up the environment requires installing torch (cu116), transformers (4.26.0), datasets (2.9.0), deepspeed (0.8.0), and other dependencies, and preparing the CNN Dailymail dataset. Preprocessing involves designing a prompt template (prompt_template) and computing document lengths so that input and output sequences do not exceed the model's maximum length.

📖 The training script supports the DeepSpeed config, is launched with deepspeed --num_gpus to set the number of GPUs, and receives the model ID, dataset path, batch size, and other hyperparameters. The experiments use the ds_flan_t5_z3_config_bf16.json config file, which sets the ZeRO stage, gradient accumulation, and other key parameters.

FLAN-T5, released with the Scaling Instruction-Finetuned Language Models paper, is an enhanced version of T5 that has been fine-tuned on a mixture of tasks, or, in simple words, a better T5 model in every aspect. FLAN-T5 outperforms T5 by double-digit improvements for the same number of parameters. Google has open-sourced 5 checkpoints on Hugging Face, ranging from 80M parameters up to 11B parameters.

In a previous blog post, we already learned how to “Fine-tune FLAN-T5 for chat & dialogue summarization” using the base version (250M parameter) of the model. In this blog post, we look into how we can scale the training from the Base version to the XL (3B) or XXL (11B).

This means we will learn how to fine-tune FLAN-T5 XL & XXL using model parallelism, multiple GPUs, and DeepSpeed ZeRO.

You will learn about the following:

    What is DeepSpeed ZeRO?
    Fine-tune FLAN-T5-XXL using Deepspeed
    Results & Experiments

In addition to the tutorial, we have run a series of experiments to help you choose the right hardware setup. You can find the details in the Results & Experiments section.

Let's get started! 🚀

1. What is DeepSpeed ZeRO?

DeepSpeed ZeRO is part of the DeepSpeed Training Pillar, which focuses on efficient large-scale training of Transformer models. DeepSpeed ZeRO, or Zero Redundancy Optimizer, is a method to reduce the memory footprint of training. Compared to basic data parallelism, ZeRO partitions optimizer states, gradients, and model parameters to save significant memory across multiple devices.

If you want to learn more about DeepSpeed ZeRO, check out: ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters

DeepSpeed ZeRO is natively integrated into the Hugging Face Transformers Trainer. The integration enables leveraging ZeRO by simply providing a DeepSpeed config file, and the Trainer takes care of the rest.
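In practice, the only Trainer-side change is pointing the TrainingArguments at that config file. Here is a minimal sketch, assuming a placeholder file name ds_config.json (the actual config used in this tutorial is shown in the next section):

from transformers import TrainingArguments

# Minimal sketch: enabling DeepSpeed ZeRO only requires passing a config file path.
# "ds_config.json" is a placeholder name for illustration.
training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=8,
    deepspeed="ds_config.json",  # path to the DeepSpeed config file; the Trainer handles the rest
)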

Excerpt: DeepSpeed ZeRO-offload

DeepSpeed ZeRO not only allows us to parallelize our models across multiple GPUs, it also implements offloading. ZeRO-Offload implements optimizations that offload the optimizer states and model parameters to the CPU, making it possible to train larger models on the given GPUs, e.g. a 10B-parameter GPT-2 on a single V100 GPU. We used ZeRO-Offload for the experiments but will not use it in the tutorial.
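For reference, enabling offloading is purely a configuration change. The snippet below is an illustrative sketch of what the zero_optimization section looks like with CPU offloading switched on, based on DeepSpeed's documented offload_optimizer and offload_param options; it is not the exact file used in the experiments:

{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "cpu", "pin_memory": true }
  }
}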

2. Fine-tune FLAN-T5-XXL using Deepspeed

We now know that we can use DeepSpeed ZeRO together with Hugging Face Transformers to easily scale our hardware in cases where the model no longer fits on a single GPU. That's exactly what we need, since the FLAN-T5-XXL weights alone are already 44GB in fp32. This makes it almost impossible to fit the model on a single GPU once activations and optimizer states are added.
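To get an intuition for why, here is a rough back-of-the-envelope estimate (our own approximation, not a measurement from the experiments) of the per-GPU memory that plain data parallelism with Adam in fp32 would need, before even counting activations:

# Rough fp32 training memory estimate for FLAN-T5-XXL under plain data parallelism.
# Illustration only; real usage also includes activations, buffers, and fragmentation.
params = 11e9                      # ~11B parameters
weights_gb = params * 4 / 1e9      # fp32 weights              -> ~44 GB
gradients_gb = params * 4 / 1e9    # fp32 gradients            -> ~44 GB
optimizer_gb = params * 8 / 1e9    # Adam first/second moments -> ~88 GB
print(f"~{weights_gb + gradients_gb + optimizer_gb:.0f} GB per GPU")  # ~176 GB

ZeRO stage 3 partitions exactly these three components across the available GPUs, which is what brings the per-GPU footprint back into the range of a 40GB A100.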

In this tutorial, we cover how to fine-tune FLAN-T5-XXL (11B version) on the CNN Dailymail Dataset for news summarization. The provided script and pre-processing can easily be adjusted to fine-tune FLAN-T5-XL and use a different dataset.

Note: This tutorial was created and run on a p4dn.24xlarge AWS EC2 Instance including 8x NVIDIA A100 40GB.

Setup Development Environment

The first step is to install the Hugging Face libraries, including transformers and datasets, and DeepSpeed. Running the following cell will install all the required packages.

# install torch with the correct cuda version, check nvcc --version
pip install torch --extra-index-url https://download.pytorch.org/whl/cu116 --upgrade
# install Hugging Face Libraries
pip install "transformers==4.26.0" "datasets==2.9.0" "accelerate==0.16.0" "evaluate==0.4.0" --upgrade
# install deepspeed and ninja for jit compilations of kernels
pip install "deepspeed==0.8.0" ninja --upgrade
# install additional dependencies needed for training
pip install rouge-score nltk py7zr tensorboard

Load and prepare dataset

Similar to the “Fine-tune FLAN-T5 for chat & dialogue summarization” post, we need to prepare a dataset to fine-tune our model. As mentioned in the beginning, we will fine-tune FLAN-T5-XXL on the CNN Dailymail Dataset. This blog post does not go into detail about the dataset generation; if you want to learn the detailed steps, check out the previous post.

We define some parameters, which we use throughout the whole example; feel free to adjust them to your needs.

# experiment config
model_id = "google/flan-t5-xxl" # Hugging Face Model Id
dataset_id = "cnn_dailymail" # Hugging Face Dataset Id
dataset_config = "3.0.0" # config/version of the dataset
save_dataset_path = "data" # local path to save processed dataset
text_column = "article" # column of input text
summary_column = "highlights" # column of the output text
# custom instruct prompt start
prompt_template = f"Summarize the following news article:\n{{input}}\nSummary:\n"

Compared to the previous example, we are splitting the processing and training into two separate paths. This allows you to run the preprocessing outside of the GPU instance. We process (tokenize) the dataset, save it to disk, and then load it from disk again in our training script.

from datasets import load_dataset
from transformers import AutoTokenizer
import numpy as np

# Load dataset from the hub
dataset = load_dataset(dataset_id, name=dataset_config)
# Load tokenizer of FLAN-T5
tokenizer = AutoTokenizer.from_pretrained(model_id)

print(f"Train dataset size: {len(dataset['train'])}")
print(f"Test dataset size: {len(dataset['test'])}")

# Train dataset size: 287113
# Test dataset size: 11490

We defined a prompt_template in our config, which we will use to construct an instruction prompt for better model performance. The prompt_template has a “fixed” start and end, and our document goes in the middle. This means we need to ensure that the “fixed” template parts plus the document do not exceed the max input length of the model. Therefore we calculate the maximum length of our document, which we will later use for padding and truncation.

prompt_length = len(tokenizer(prompt_template.format(input=""))["input_ids"])
max_sample_length = tokenizer.model_max_length - prompt_length
print(f"Prompt length: {prompt_length}")
print(f"Max input length: {max_sample_length}")

# Prompt length: 12
# Max input length: 500

We now know that our documents can be up to 500 tokens long and still fit our prompt_template. In addition to the input, we also need to understand our “target” sequence length, i.e. how long the summaries in our dataset are. Therefore we iterate over the dataset and calculate the max input length (capped at 500) and the max target length (this takes a few minutes).

from datasets import concatenate_datasets
import numpy as np

# The maximum total input sequence length after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_inputs = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x[text_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])
max_source_length = max([len(x) for x in tokenized_inputs["input_ids"]])
max_source_length = min(max_source_length, max_sample_length)
print(f"Max source length: {max_source_length}")

# The maximum total sequence length for target text after tokenization.
# Sequences longer than this will be truncated, sequences shorter will be padded.
tokenized_targets = concatenate_datasets([dataset["train"], dataset["test"]]).map(lambda x: tokenizer(x[summary_column], truncation=True), batched=True, remove_columns=[text_column, summary_column])
target_lengths = [len(x) for x in tokenized_targets["input_ids"]]
# use 95th percentile as max target length
max_target_length = int(np.percentile(target_lengths, 95))
print(f"Max target length: {max_target_length}")

We now have everything needed to process our dataset.

import os

def preprocess_function(sample, padding="max_length"):
    # create prompted input
    inputs = [prompt_template.format(input=item) for item in sample[text_column]]

    # tokenize inputs
    model_inputs = tokenizer(inputs, max_length=max_source_length, padding=padding, truncation=True)

    # Tokenize targets with the `text_target` keyword argument
    labels = tokenizer(text_target=sample[summary_column], max_length=max_target_length, padding=padding, truncation=True)

    # If we are padding here, replace all tokenizer.pad_token_id in the labels by -100 when we want to ignore
    # padding in the loss.
    if padding == "max_length":
        labels["input_ids"] = [
            [(l if l != tokenizer.pad_token_id else -100) for l in label] for label in labels["input_ids"]
        ]

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# process dataset
tokenized_dataset = dataset.map(preprocess_function, batched=True, remove_columns=list(dataset["train"].features))

# save dataset to disk
tokenized_dataset["train"].save_to_disk(os.path.join(save_dataset_path, "train"))
tokenized_dataset["test"].save_to_disk(os.path.join(save_dataset_path, "eval"))
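The training script then only needs to read these processed splits back from disk. Here is a minimal sketch of that loading step, with paths following the save_dataset_path defined above:

from datasets import load_from_disk

# Load the tokenized splits saved above; no tokenizer or GPU is needed at this point.
train_dataset = load_from_disk("data/train")
eval_dataset = load_from_disk("data/eval")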

Fine-tune model using deepspeed

Done! We can now start training our model! We learned in the introduction that we would leverage the DeepSpeed integration with the Hugging Face Trainer. Therefore we need to create a deepspeed_config.json file. In the DeepSpeed configuration, we define the ZeRO strategy we want to use and whether we want to use mixed-precision training. The Hugging Face Trainer allows us to inherit values from the TrainingArguments in our deepspeed_config.json to avoid duplicating values; check the documentation for more information.

We created 4 DeepSpeed configurations for the experiments we ran, covering different combinations of CPU offloading and mixed precision.

Depending on your setup, you can use one of those, e.g. if you are running on NVIDIA V100s, you have to use the config without bf16, since V100s do not support the bfloat16 data type.
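If you are unsure whether your GPUs support bfloat16, recent PyTorch versions provide a quick runtime check:

import torch

# True on Ampere GPUs such as the A100 or A10G, False on V100s.
print(torch.cuda.is_bf16_supported())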

When fine-tuning T5 models we cannot use fp16 since it leads to overflow issues, see: #4586, #10830, #10956

As mentioned in the beginning, we are using a p4dn.24xlarge AWS EC2 Instance with 8x NVIDIA A100 40GB. This means we can leverage bf16, which reduces the memory footprint of the model by almost 2x and allows us to train efficiently without offloading.

We are going to use the ds_flan_t5_z3_config_bf16.json. If you are irritated by the auto values, check the documentation.

{  "bf16": {    "enabled": "auto"  },  "optimizer": {    "type": "AdamW",    "params": {      "lr": "auto",      "betas": "auto",      "eps": "auto",      "weight_decay": "auto"    }  },  "scheduler": {    "type": "WarmupLR",    "params": {      "warmup_min_lr": "auto",      "warmup_max_lr": "auto",      "warmup_num_steps": "auto"    }  },  "zero_optimization": {    "stage": 3,    "overlap_comm": true,    "contiguous_gradients": true,    "sub_group_size": 1e9,    "reduce_bucket_size": "auto",    "stage3_prefetch_bucket_size": "auto",    "stage3_param_persistence_threshold": "auto",    "stage3_max_live_parameters": 1e9,    "stage3_max_reuse_distance": 1e9,    "stage3_gather_16bit_weights_on_model_save": false  },  "gradient_accumulation_steps": "auto",  "gradient_clipping": "auto",  "steps_per_print": 2000,  "train_batch_size": "auto",  "train_micro_batch_size_per_gpu": "auto",  "wall_clock_breakdown": false}

Now, we need our training script. We prepared a run_seq2seq_deepspeed.py training script based on the previous blog post, which supports our deepspeed config and all other hyperparameters.
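The script itself is not reproduced here, but its core is a standard Seq2SeqTrainer setup. The following condensed sketch (hyperparameters hard-coded instead of parsed from the command line, so not the verbatim run_seq2seq_deepspeed.py) shows the pieces that matter for DeepSpeed:

from datasets import load_from_disk
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Condensed sketch of the training script.
model_id = "google/flan-t5-xxl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# Load the tokenized dataset prepared earlier.
train_dataset = load_from_disk("data/train")

# label_pad_token_id=-100 keeps padded label tokens out of the loss, matching the preprocessing above.
data_collator = DataCollatorForSeq2Seq(
    tokenizer, model=model, label_pad_token_id=-100, pad_to_multiple_of=8
)

training_args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-xxl-cnn",
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=3,
    bf16=True,                                           # resolves the "auto" bf16 flag in the config
    deepspeed="configs/ds_flan_t5_z3_config_bf16.json",  # the ZeRO stage 3 config shown above
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()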

We can start our training with the deepspeed launcher providing the number of GPUs, the deepspeed config, and our hyperparameters, including our model id for google/flan-t5-xxl.

deepspeed --num_gpus=8 scripts/run_seq2seq_deepspeed.py \
    --model_id google/flan-t5-xxl \
    --dataset_path data \
    --epochs 3 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --generation_max_length 129 \
    --lr 1e-4 \
    --deepspeed configs/ds_flan_t5_z3_config_bf16.json

DeepSpeed now loads our model on the CPU, splits it across our 8x A100 GPUs, and starts the training. The training on the CNN Dailymail Dataset takes roughly 10 hours and costs ~$322.

3. Results & Experiments

During the creation of the tutorial, and to get a better understanding of the hardware requirements, we ran a series of experiments for FLAN-T5 XL & XXL, which should help us evaluate and understand the hardware requirements and cost of training those models. We ran the experiments only for ~20% of the training without evaluation and extrapolated the duration from this estimate.

Below you'll find a table of the experiments and more information about the setup.

Dataset: CNN Dailymail Dataset, with a train dataset size of 287,113 samples and a sequence length of 512

Hyperparameters: 3 epochs

Setup and instance types:

    4x V100 16GB: p3.8xlarge
    4x A10G 24GB: g5.24xlarge
    8x V100 16GB: p3.16xlarge
    8x A100 40GB: p4dn.24xlarge

| Model | DS ZeRO offload | Hardware | batch size per GPU | precision | duration | cost |
| --- | --- | --- | --- | --- | --- | --- |
| FLAN-T5-XL (3B) | No | 4x V100 16GB | OOM | fp32 | - | - |
| FLAN-T5-XL (3B) | No | 8x V100 16GB | 1 | fp32 | 105h | ~$2570 |
| FLAN-T5-XL (3B) | No | 8x A100 40GB | 72 | bf16 | 2.5h | ~$81 |
| FLAN-T5-XL (3B) | Yes | 4x V100 16GB | 8 | fp32 | 69h | ~$828 |
| FLAN-T5-XL (3B) | Yes | 8x V100 16GB | 8 | fp32 | 32h | ~$768 |
| FLAN-T5-XXL (11B) | Yes | 4x V100 16GB | OOM | fp32 | - | - |
| FLAN-T5-XXL (11B) | Yes | 8x V100 16GB | OOM | fp32 | - | - |
| FLAN-T5-XXL (11B) | Yes | 4x A10G 24GB | 24 | bf16 | 90h | ~$732 |
| FLAN-T5-XXL (11B) | Yes | 8x A100 40GB | 48 | bf16 | 19h | ~$613 |
| FLAN-T5-XXL (11B) | No | 8x A100 40GB | 8 | bf16 | 10h | ~$322 |

We can see that bf16 provides significant advantages over fp32. We could fit FLAN-T5-XXL on 4x A10G (24GB) but not on 8x V100 16GB.

We also learned that if the model fits on the GPUs with a batch size > 4 without offloading, we are ~2x faster and more cost-effective than offloading the model and scaling the batch size.


Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.
