Fine-tune Falcon 180B with QLoRA on Amazon SageMaker

This tutorial walks through how to use Amazon SageMaker together with QLoRA to efficiently fine-tune Falcon 180B, a 180B-parameter open-source large language model. QLoRA quantizes the model to 4 bits and attaches low-rank adapters (LoRA), which drastically reduces the memory required to fine-tune large language models while matching the performance of full-precision fine-tuning. The tutorial covers setting up the development environment, preparing and formatting the dataset (using the Dolly dataset), and launching a QLoRA fine-tuning job on SageMaker. After training, the LoRA weights are merged back into the model so it can be used directly as a regular model. The article also compares the cost of fine-tuning with the cost of pretraining and gives recommendations for deploying the model afterwards.

💡 **Efficient fine-tuning with QLoRA**: QLoRA is a fine-tuning technique that quantizes the pretrained model to 4 bits and freezes it, then attaches trainable low-rank adapter (LoRA) layers and fine-tunes only those adapters. This dramatically lowers the memory needed to fine-tune large language models such as Falcon 180B, makes efficient fine-tuning possible even on a single GPU, and matches the performance of full-precision fine-tuning while achieving state-of-the-art results on language tasks.

📊 **SageMaker plus the Dolly dataset**: The tutorial uses Amazon SageMaker as the training platform together with the Hugging Face Transformers, Accelerate, and PEFT libraries. For data, it uses the open-source Dolly dataset, which contains instruction-following records written by Databricks employees across several behavioral categories. The data is converted into an instruction-response prompt format, then chunked and tokenized to make training more efficient.

🚀 **Configuring and running the SageMaker training job**: The tutorial shows how to configure and launch a HuggingFace Estimator on SageMaker. Key hyperparameters include the model ID (tiiuae/falcon-180B), the dataset path, the number of epochs, the batch size, the learning rate, and a Hugging Face token. The training script run_clm.py implements QLoRA with Flash Attention and automatically merges the LoRA weights after training. The job runs on an ml.p4d.24xlarge instance, and the training cost is estimated.

💰 **Cost comparison and next steps**: The article compares the pretraining cost of Falcon 180B with the cost of this fine-tuning run and shows that fine-tuning is far cheaper: one epoch took about 5.8 hours and cost roughly $256. It also suggests deploying the fine-tuned Falcon 180B model to a SageMaker endpoint for inference as a next step.

In this Amazon SageMaker example, we are going to learn how to fine-tune tiiuae/falcon-180B using QLoRA: Efficient Finetuning of Quantized LLMs with Flash Attention. Falcon 180B is the newest member of the Falcon LLM family. It is the biggest open source model, with 180B parameters, and was trained on more data - 3.5T tokens with a context window of up to 4K tokens.

QLoRA is an efficient finetuning technique that quantizes a pretrained language model to 4 bits and attaches small “Low-Rank Adapters” which are fine-tuned. This enables fine-tuning of models with up to 65 billion parameters on a single GPU; despite its efficiency, QLoRA matches the performance of full-precision fine-tuning and achieves state-of-the-art results on language tasks.

In our example, we are going to leverage Hugging Face Transformers, Accelerate, and PEFT.

In Detail you will learn how to:

    1. Setup Development Environment
    2. Load and prepare the dataset
    3. Fine-Tune Falcon 180B with QLoRA on Amazon SageMaker

Access Falcon 180B

Before we can start training, we have to make sure that we accepted the license for tiiuae/falcon-180B to be able to use it. You can accept the license by clicking on the Agree and access repository button on the model page at https://huggingface.co/tiiuae/falcon-180B.

1. Setup Development Environment

!pip install "transformers==4.31.0" "datasets[s3]==2.13.0" sagemaker --upgrade --quiet

To access any Falcon 180B asset, we need to log in to our Hugging Face account. We can do this by running the following command:

!huggingface-cli login --token YOUR_TOKEN
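
To quickly check that your token actually grants access to the gated repository, you can, for example, try to download the model's config file. This is just a hedged convenience sketch, not part of the original tutorial; the call raises an error if the license has not been accepted yet:

from huggingface_hub import hf_hub_download

# fails with a gated-repo / authorization error if the Falcon 180B license
# has not been accepted for the logged-in account
hf_hub_download(repo_id="tiiuae/falcon-180B", filename="config.json")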

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find out more about it here.

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Load and prepare the dataset

We will use the Dolly dataset, an open source dataset of instruction-following records generated by thousands of Databricks employees in several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

{  "instruction": "What is world of warcraft",  "context": "",  "response": "World of warcraft is a massive online multi player role playing game. It was released in 2004 by bizarre entertainment"}

To load the databricks/databricks-dolly-15k dataset, we use the load_dataset() method from the 🤗 Datasets library.

from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
# dataset size: 15011

To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a formatting function that takes a sample and returns a string with our format instruction.

def format_dolly(sample):
    instruction = f"### Instruction\n{sample['instruction']}"
    context = f"### Context\n{sample['context']}" if len(sample["context"]) > 0 else None
    response = f"### Answer\n{sample['response']}"
    # join all the parts together
    prompt = "\n\n".join([i for i in [instruction, context, response] if i is not None])
    return prompt

Let's test our formatting function on a random example.

from random import randrange

print(format_dolly(dataset[randrange(len(dataset))]))

In addition to formatting our samples, we also want to pack multiple samples into one sequence for more efficient training.

from transformers import AutoTokenizer

model_id = "tiiuae/falcon-180B"  # sharded weights
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

We define some helper functions to pack our samples into sequences of a given length and then tokenize them.

from random import randint
from itertools import chain
from functools import partial


# template dataset to add prompt to each sample
def template_dataset(sample):
    sample["text"] = f"{format_dolly(sample)}{tokenizer.eos_token}"
    return sample


# apply prompt template per sample
dataset = dataset.map(template_dataset, remove_columns=list(dataset.features))
# print random sample
print(dataset[randint(0, len(dataset) - 1)]["text"])

# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}

def chunk(sample, chunk_length=2048):
    # define global remainder variable to save remainder from batches to use in next batch
    global remainder
    # Concatenate all texts and add remainder from previous batch
    concatenated_examples = {k: list(chain(*sample[k])) for k in sample.keys()}
    concatenated_examples = {k: remainder[k] + concatenated_examples[k] for k in concatenated_examples.keys()}
    # get total number of tokens for batch
    batch_total_length = len(concatenated_examples[list(sample.keys())[0]])

    # get max number of chunks for batch
    if batch_total_length >= chunk_length:
        batch_chunk_length = (batch_total_length // chunk_length) * chunk_length

    # Split by chunks of max_len.
    result = {
        k: [t[i : i + chunk_length] for i in range(0, batch_chunk_length, chunk_length)]
        for k, t in concatenated_examples.items()
    }
    # add remainder to global variable for next batch
    remainder = {k: concatenated_examples[k][batch_chunk_length:] for k in concatenated_examples.keys()}
    # prepare labels
    result["labels"] = result["input_ids"].copy()
    return result


# tokenize and chunk dataset
lm_dataset = dataset.map(
    lambda sample: tokenizer(sample["text"]), batched=True, remove_columns=list(dataset.features)
).map(
    partial(chunk, chunk_length=2048),
    batched=True,
)

# Print total number of samples
print(f"Total number of samples: {len(lm_dataset)}")

After we have processed the dataset, we are going to use the new FileSystem integration to upload our dataset to S3. We are using sess.default_bucket(); adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.

# save train_dataset to s3
training_input_path = f's3://{sess.default_bucket()}/processed/falcon/dolly/train'
lm_dataset.save_to_disk(training_input_path)

print("uploaded data to:")
print(f"training dataset to: {training_input_path}")

3. Fine-Tune Falcon 180B with QLoRA on Amazon SageMaker

We are going to use the method introduced in the paper "QLoRA: Efficient Finetuning of Quantized LLMs" by Tim Dettmers et al. QLoRA is a technique to reduce the memory footprint of large language models during finetuning without sacrificing performance. The TL;DR of how QLoRA works is (see the sketch after this list):

    1. Quantize the pretrained model to 4 bits and freeze it.
    2. Attach small, trainable adapter layers (LoRA).
    3. Finetune only the adapter layers, while using the frozen quantized model for context.
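
To make these three steps concrete, here is a minimal sketch of how a 4-bit base model with LoRA adapters is typically set up using bitsandbytes and PEFT. This is only an illustration, not the contents of the training script used below; the smaller falcon-7b model ID and the LoRA hyperparameters (rank, alpha, target modules) are placeholder assumptions:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization config (NF4 with double quantization, as used by QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# load the frozen, quantized base model (placeholder model id; trust_remote_code
# may be needed depending on the transformers version)
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

# attach small trainable LoRA adapters; rank, alpha and target modules are illustrative
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()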

We prepared a run_clm.py script, which implements QLoRA using PEFT and Flash Attention 2 for efficient training. The script also merges the LoRA weights into the model weights after training. That way you can use the model as a normal model without any additional code.
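
For reference, merging the trained LoRA weights back into the base model with PEFT usually boils down to something like the following sketch. This is an illustration of the merge step, not the actual code in run_clm.py; the adapter and output paths are placeholder assumptions:

import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

# load the base model in half precision (merging is not done on the 4-bit weights)
base_model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-180B", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True
)

# load the trained LoRA adapter on top of the base model (placeholder adapter path)
model = PeftModel.from_pretrained(base_model, "/opt/ml/model/adapter")

# merge the adapter weights into the base weights and drop the PEFT wrappers
merged_model = model.merge_and_unload()
merged_model.save_pretrained("/opt/ml/model", safe_serialization=True)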

Make sure that you copy the whole scripts folder, which includes the requirements.txt used to install the additional packages needed for QLoRA and Flash Attention.
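
As an illustration, such a requirements.txt lists the extra libraries installed on top of the training container, roughly along these lines. The exact packages and version pins come from the scripts folder in the repository; the unpinned entries below are only an assumption:

peft
bitsandbytes
accelerate
safetensors
ninja
packaging
flash-attn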

Hardware requirements

We have only run experiments on p4d.24xlarge so far; based on heuristics, it should be possible to run on a g5.48xlarge as well, but it will be slower.

import time
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define Training Job Name
job_name = f'huggingface-qlora-{time.strftime("%Y-%m-%d-%H-%M-%S", time.localtime())}'

# hyperparameters, which are passed into the training job
hyperparameters = {
  'model_id': model_id,                           # pre-trained model
  'dataset_path': '/opt/ml/input/data/training',  # path where sagemaker will save training dataset
  'epochs': 1,                                    # number of training epochs
  'per_device_train_batch_size': 4,               # batch size for training
  'lr': 2e-4,                                     # learning rate used during training
  'hf_token': HfFolder.get_token(),               # huggingface token to access Falcon 180B
}

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_clm.py',      # train script
    source_dir           = 'scripts',         # directory which includes all the files needed for training
    instance_type        = 'ml.p4d.24xlarge', # instance type used for the training job
    instance_count       = 1,                 # the number of instances used for training
    max_run              = 2*24*60*60,        # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name        = job_name,          # the name of the training job
    role                 = role,              # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 300,               # the size of the EBS volume in GB
    transformers_version = '4.28',            # the transformers version used in the training job
    pytorch_version      = '2.0',             # the pytorch version used in the training job
    py_version           = 'py310',           # the python version used in the training job
    hyperparameters      = hyperparameters,   # the hyperparameters passed to the training job
    environment          = { "HUGGINGFACE_HUB_CACHE": "/tmp/.cache" }, # set env variable to cache models in /tmp
    disable_output_compression = True         # do not compress the output to save training time and cost
)

We can now start our training job by calling the .fit() method and passing our S3 path to the training script.

# define a data input dictionary with our uploaded s3 uris
data = {'training': training_input_path}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)

In our example for Falcon 180B, the SageMaker training job took 348 minutes, or 5.8 hours, for 1 epoch including merging the weights. The ml.p4d.24xlarge instance we used costs $37.688 per hour for on-demand usage. As a result, the total cost for training was ~$256.

For comparison, the pretraining of Falcon 180B took ~7,000,000 GPU hours, which is roughly 300,000 times more than fine-tuning for 3 epochs.

Next Steps

You can deploy your fine-tuned Falcon 180B model to a SageMaker endpoint and use it for inference. Check out the Deploy Falcon 180B on Amazon SageMaker and Securely deploy LLMs inside VPCs with Hugging Face and Amazon SageMaker blog posts for more details.
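
As a rough sketch of what such a deployment can look like with the Hugging Face LLM inference container (TGI) on SageMaker: the container version, environment settings, and endpoint configuration below are placeholder assumptions, not values from this tutorial, and the model data is taken from the estimator output above.

from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# retrieve the Hugging Face LLM (TGI) inference container image (version is an assumption)
llm_image = get_huggingface_llm_image_uri("huggingface", version="1.0.3")

# point the model at the training job output (uncompressed output from the estimator above)
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    model_data=huggingface_estimator.model_data,
    env={
        "HF_MODEL_ID": "/opt/ml/model",  # load the merged weights from the mounted model directory
        "SM_NUM_GPUS": "8",              # number of GPUs on the instance
        "MAX_INPUT_LENGTH": "1024",      # max input tokens (placeholder)
        "MAX_TOTAL_TOKENS": "2048",      # max total tokens, input + output (placeholder)
    },
)

# deploy to a real-time endpoint; large models need a generous startup health check timeout
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.p4d.24xlarge",
    container_startup_health_check_timeout=900,
)

print(llm.predict({"inputs": "### Instruction\nWhat is Amazon SageMaker?\n\n### Answer\n"}))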


Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.
