This blog post walks you through how to fine-tune Llama 3 using PyTorch FSDP and Q-Lora with the help of Hugging Face TRL, Transformers, PEFT & Datasets on Amazon SageMaker. In addition to FSDP we will use the Flash Attention v2 implementation.
This blog is an extension of my Efficiently fine-tune Llama 3 with PyTorch FSDP and Q-Lora post, specifically tailored to run on Amazon SageMaker.
- Setup development environment
- Create and prepare the dataset
- Fine-tune Llama 3 on Amazon SageMaker
- Deploy & Test fine-tuned Llama 3 on Amazon SageMaker
Note: This blog was created and validated on ml.p4d.24xlarge and ml.g5.48xlarge instances. The configurations and code are optimized for ml.p4d.24xlarge with 8x A100 GPUs, each with 40GB of memory. We tried ml.g5.12xlarge, but Amazon SageMaker reserves more memory than EC2. We plan to add support for trn1 in the coming weeks.
FSDP + Q-Lora Background
In a collaboration between Answer.AI, Tim Dettmers (creator of Q-Lora) and Hugging Face, we are proud to share support for Q-Lora and PyTorch FSDP (Fully Sharded Data Parallel). FSDP and Q-Lora now allow you to fine-tune Llama 2 70B or Mixtral 8x7B on 2x consumer GPUs (24GB each). If you want to learn more about the background of this collaboration, take a look at You can now train a 70b language model at home. Hugging Face PEFT is where the magic happens; read more about it in the PEFT documentation.
- PyTorch FSDP is a data/model parallelism technique that shards the model across GPUs, reducing memory requirements and enabling the training of larger models more efficiently.
- Q-LoRA is a fine-tuning method that leverages quantization and Low-Rank Adapters to efficiently reduce computational requirements and memory footprint. A minimal configuration sketch is shown below.
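To make these two techniques more concrete, here is a minimal, hedged sketch of how a Q-LoRA setup is typically wired together with bitsandbytes and peft. The model id and LoRA hyperparameters are illustrative assumptions, not the exact values used by the training script later in this post, and recent versions of transformers, peft and bitsandbytes are assumed.

```python
# Minimal Q-LoRA sketch: 4-bit quantized base model + trainable low-rank adapters.
# Values are illustrative, not the exact configuration used later in this post.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization keeps the frozen base weights small in GPU memory
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # needed so FSDP can shard the quantized weights
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-70B",
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)

# The low-rank adapters are the only trainable parameters
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # prints how few parameters are actually trained
```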
1. Setup Development Environment
Our first step is to install the Hugging Face libraries we need on the client to correctly prepare our dataset and start our training/evaluation jobs.
```
!pip install transformers "datasets[s3]==2.18.0" "sagemaker>=2.190.0" "huggingface_hub[cli]" --upgrade --quiet
```

Next we need to log in to Hugging Face to access the Llama 3 70B model and store our trained model on Hugging Face. If you don't have an account yet, you can create one here; make sure you have also accepted the model terms.
```
!huggingface-cli login --token YOUR_TOKEN
```

If you are going to use SageMaker in a local environment, you need access to an IAM role with the required permissions for SageMaker. You can find more about it here.
```python
import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker bucket: {sess.default_bucket()}")
print(f"sagemaker session region: {sess.boto_region_name}")
```

2. Create and prepare the dataset
After our environment is set up, we can start creating and preparing our dataset. A fine-tuning dataset should have a diverse set of demonstrations of the task you want to solve. If you want to learn more about how to create a dataset, take a look at the How to Fine-Tune LLMs in 2024 with Hugging Face blog post.
We will use the HuggingFaceH4/no_robots dataset, a high-quality dataset of 10,000 instructions and demonstrations created by skilled human annotators. This data can be used for supervised fine-tuning (SFT) to make language models follow instructions better. No Robots was modelled after the instruction dataset described in OpenAI's InstructGPT paper, and is comprised mostly of single-turn instructions.
{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}{"messages": [{"role": "system", "content": "You are..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}The no_robots dataset has 10,000 split into 9,500 training and 500 test examples. Some samples are not including a system message. We will load the dataset with the datasets library, add a missing system message and save them to separate json files.
```python
from datasets import load_dataset

# Convert dataset to OAI messages
system_message = """You are Llama, an AI assistant created by Philipp to be helpful and honest. Your knowledge spans a wide range of topics, allowing you to engage in substantive conversations and provide analysis on complex subjects."""

def create_conversation(sample):
    if sample["messages"][0]["role"] == "system":
        return sample
    else:
        sample["messages"] = [{"role": "system", "content": system_message}] + sample["messages"]
        return sample

# Load dataset from the hub
dataset = load_dataset("HuggingFaceH4/no_robots")

# Add system message to each conversation
columns_to_remove = list(dataset["train"].features)
columns_to_remove.remove("messages")
dataset = dataset.map(create_conversation, remove_columns=columns_to_remove, batched=False)

# Filter out conversations which are corrupted with wrong turns, keep those with an even number of turns after adding the system message
dataset["train"] = dataset["train"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)
dataset["test"] = dataset["test"].filter(lambda x: len(x["messages"][1:]) % 2 == 0)
```

After we have processed the datasets we are going to use the FileSystem integration to upload our dataset to S3. We are using sess.default_bucket(); adjust this if you want to store the dataset in a different S3 bucket. We will use the S3 path later in our training script.
```python
# save train_dataset to s3 using our SageMaker session
input_path = f's3://{sess.default_bucket()}/datasets/llama3'

# save datasets to s3
dataset["train"].to_json(f"{input_path}/train/dataset.json", orient="records")
train_dataset_s3_path = f"{input_path}/train/dataset.json"
dataset["test"].to_json(f"{input_path}/test/dataset.json", orient="records")
test_dataset_s3_path = f"{input_path}/test/dataset.json"

print(f"Training data uploaded to:")
print(train_dataset_s3_path)
print(test_dataset_s3_path)
print(f"https://s3.console.aws.amazon.com/s3/buckets/{sess.default_bucket()}/?region={sess.boto_region_name}&prefix={input_path.split('/', 3)[-1]}/")
```

3. Fine-tune Llama 3 on Amazon SageMaker
We are now ready to fine-tune our model. We will use the SFTTrainer from trl, a subclass of the Trainer from transformers that makes supervised fine-tuning of open LLMs straightforward. We prepared a script, run_fsdp_qlora.py, which loads the dataset from disk, prepares the model and tokenizer, and starts the training (a condensed sketch of its core logic is shown after the list below). The SFTTrainer supports:
- Dataset formatting, including conversational and instruction format (✅ used)
- Training on completions only, ignoring prompts (❌ not used)
- Packing datasets for more efficient training (✅ used)
- PEFT (parameter-efficient fine-tuning) support including Q-LoRA (✅ used)
- Preparing the model and tokenizer for conversational fine-tuning (❌ not used, see below)
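The snippet below is a condensed, illustrative sketch of what the core of such a training script looks like. The real run_fsdp_qlora.py in the repository additionally handles quantization, the chat template and merging the adapter after training, and the exact keyword arguments depend on the installed trl version (newer releases move several of them into SFTConfig).

```python
# Condensed sketch of the core training logic of a script like run_fsdp_qlora.py
# (illustrative only; hyperparameters mirror the yaml config but are not exhaustive).
from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments
from trl import SFTTrainer
from peft import LoraConfig

# SageMaker mounts the S3 channels under /opt/ml/input/data/<channel>/
train_dataset = load_dataset("json", data_files="/opt/ml/input/data/train/dataset.json", split="train")

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70b")
tokenizer.pad_token = tokenizer.eos_token

peft_config = LoraConfig(r=16, lora_alpha=16, lora_dropout=0.1,
                         target_modules="all-linear", task_type="CAUSAL_LM")

training_args = TrainingArguments(output_dir="/tmp/llama3", bf16=True,
                                  gradient_checkpointing=True, num_train_epochs=2)

trainer = SFTTrainer(
    model="meta-llama/Meta-Llama-3-70b",  # SFTTrainer can also load the model from a hub id
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
    max_seq_length=3072,
    packing=True,                         # pack short samples into full-length sequences
)
trainer.train()
```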
For configuration we use the new TrlParser, which allows us to provide hyperparameters in a yaml file. This yaml will be uploaded and provided to Amazon SageMaker similar to our datasets. Below is the config file for fine-tuning Llama 3 70B on 8x A100 GPUs or 4x 24GB GPUs. We save the config file as llama_3_70b_fsdp_qlora.yaml and upload it to S3.
For the chat template we use the Anthropic/Vicuna template, not the official Llama 3 one, since using the official template would require training and saving the embedding layer as well, leading to higher memory requirements. If you want to use the official Llama 3 template, comment in the LLAMA_3_CHAT_TEMPLATE in the run_fsdp_qlora.py script and make sure to add modules_to_save. The template we use looks like this:
```
You are a helpful Assistant.

Human: What is 2+2?

Assistant: 2+2 equals 4.
```

```yaml
%%writefile llama_3_70b_fsdp_qlora.yaml
# script parameters
model_id: "meta-llama/Meta-Llama-3-70b" # Hugging Face model id
max_seq_len: 3072 # 2048               # max sequence length for model and packing of the dataset
# sagemaker specific parameters
train_dataset_path: "/opt/ml/input/data/train/" # path to where SageMaker saves train dataset
test_dataset_path: "/opt/ml/input/data/test/"   # path to where SageMaker saves test dataset
# output_dir: "/opt/ml/model"          # path to where SageMaker will upload the model
output_dir: "/tmp/llama3"              # path to where SageMaker will upload the model
# training parameters
report_to: "tensorboard"               # report metrics to tensorboard
learning_rate: 0.0002                  # learning rate 2e-4
lr_scheduler_type: "constant"          # learning rate scheduler
num_train_epochs: 2                    # number of training epochs
per_device_train_batch_size: 8         # batch size per device during training
per_device_eval_batch_size: 1          # batch size for evaluation
gradient_accumulation_steps: 2         # number of steps before performing a backward/update pass
optim: adamw_torch                     # use torch adamw optimizer
logging_steps: 10                      # log every 10 steps
save_strategy: epoch                   # save checkpoint every epoch
evaluation_strategy: epoch             # evaluate every epoch
max_grad_norm: 0.3                     # max gradient norm
warmup_ratio: 0.03                     # warmup ratio
bf16: true                             # use bfloat16 precision
tf32: true                             # use tf32 precision
gradient_checkpointing: true           # use gradient checkpointing to save memory
# FSDP parameters: https://huggingface.co/docs/transformers/main/en/fsdp
fsdp: "full_shard auto_wrap offload"   # remove offload if enough GPU memory
fsdp_config:
  backward_prefetch: "backward_pre"
  forward_prefetch: "false"
  use_orig_params: "false"
```

Let's upload the config file to S3.
```python
from sagemaker.s3 import S3Uploader

# upload the model yaml file to s3
model_yaml = "llama_3_70b_fsdp_qlora.yaml"
train_config_s3_path = S3Uploader.upload(local_path=model_yaml, desired_s3_uri=f"{input_path}/config")

print(f"Training config uploaded to:")
print(train_config_s3_path)
```

In order to create a SageMaker training job we need a HuggingFace Estimator. The Estimator handles end-to-end Amazon SageMaker training and deployment tasks and manages the infrastructure for us. Amazon SageMaker takes care of starting and managing all the required EC2 instances, provides the correct Hugging Face container, uploads the provided scripts and downloads the data from our S3 bucket into the container at /opt/ml/input/data. Then, it starts the training job.
Note: Make sure that you include the requirements.txt in the source_dir if you are using a custom training script. We recommend just cloning the whole repository.
To use torchrun to execute our scripts, we only have to define the distribution parameter in our Estimator and set it to {"torch_distributed": {"enabled": True}}. This tells SageMaker to launch our training job with:
```
torchrun --nnodes 2 --nproc_per_node 8 --master_addr algo-1 --master_port 7777 --node_rank 1 run_fsdp_qlora.py --config /opt/ml/input/data/config/config.yaml
```

The HuggingFace configuration below will start a training job on 1x p4d.24xlarge using 8x A100 GPUs. The amazing part about SageMaker is that you can easily scale up to 2x p4d.24xlarge by modifying the instance_count. SageMaker will take care of the rest for you.
```python
from sagemaker.huggingface import HuggingFace
from huggingface_hub import HfFolder

# define Training Job Name
job_name = f'llama3-70b-exp1'

# create the Estimator
huggingface_estimator = HuggingFace(
    entry_point          = 'run_fsdp_qlora.py',  # train script
    source_dir           = '../scripts/fsdp',    # directory which includes all the files needed for training
    instance_type        = 'ml.p4d.24xlarge',    # instance type used for the training job
    instance_count       = 1,                    # the number of instances used for training
    max_run              = 2*24*60*60,           # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name        = job_name,             # the name of the training job
    role                 = role,                 # IAM role used in training job to access AWS resources, e.g. S3
    volume_size          = 500,                  # the size of the EBS volume in GB
    transformers_version = '4.36.0',             # the transformers version used in the training job
    pytorch_version      = '2.1.0',              # the pytorch version used in the training job
    py_version           = 'py310',              # the python version used in the training job
    hyperparameters      = {
        "config": "/opt/ml/input/data/config/llama_3_70b_fsdp_qlora.yaml" # path to TRL config which was uploaded to s3
    },
    disable_output_compression = True,           # don't compress output to save training time and cost
    distribution={"torch_distributed": {"enabled": True}},  # enables torchrun
    environment = {
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache",  # set env variable to cache models in /tmp
        "HF_TOKEN": HfFolder.get_token(),        # huggingface token to access gated models, e.g. llama 3
        "ACCELERATE_USE_FSDP": "1",              # enable FSDP
        "FSDP_CPU_RAM_EFFICIENT_LOADING": "1"    # enable CPU RAM efficient loading
    },
)
```

Note: When using QLoRA, we only train adapters and not the full model. The run_fsdp_qlora.py script merges the base_model with the adapter at the end of the training, so the result can be directly deployed to Amazon SageMaker.
We can now start our training job by calling the .fit() method and passing our S3 paths to the training script.
```python
# define a data input dictionary with our uploaded s3 uris
data = {
  'train': train_dataset_s3_path,
  'test': test_dataset_s3_path,
  'config': train_config_s3_path
}

# starting the train job with our uploaded datasets as input
huggingface_estimator.fit(data, wait=True)
```

In our example, training Llama 3 70B with Flash Attention for 2 epochs on a dataset of 10k samples takes 5,052 seconds (~84 minutes) on an ml.p4d.24xlarge, or roughly $50.
4. Deploy & Test fine-tuned Llama 3 on Amazon SageMaker
Evaluating LLMs is crucial for understanding their capabilities and limitations, yet it poses significant challenges due to their complex and opaque nature. There are multiple ways to evaluate a fine-tuned model. You could either use an additional training job to evaluate the model, as we demonstrated in Evaluate LLMs with Hugging Face Lighteval on Amazon SageMaker, or deploy the model to an endpoint and test it interactively. We are going to use the latter approach in this example: we will load our eval dataset and evaluate the model on those samples, using a simple loop and accuracy as our metric (a sketch of such a loop is shown at the end of this section).
Note: Evaluating generative AI models is not a trivial task, since one input can have multiple correct outputs. If you want to learn more about evaluating generative models, check out the Evaluate LLMs and RAG: a practical example using Langchain and Hugging Face blog post.
We are going to use the Hugging Face LLM Inference DLC, a purpose-built inference container to easily deploy LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), a solution for deploying and serving Large Language Models (LLMs).
```python
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  session=sess,
  version="2.0.2",
)

# print ecr image uri
print(f"llm image uri: {llm_image}")
```

We can now create a HuggingFaceModel using the container uri and the S3 path to our model. We also need to set our TGI configuration, including the number of GPUs and the max input tokens. You can find a full list of configuration options here.
```python
from huggingface_hub import HfFolder
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.p4d.24xlarge"
health_check_timeout = 1200 # 20 minutes

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "/opt/ml/model",       # Path to the model in the container
  'SM_NUM_GPUS': "8",                   # Number of GPUs used per replica
  'MAX_INPUT_LENGTH': "8000",           # Max length of input text
  'MAX_TOTAL_TOKENS': "8096",           # Max length of the generation (including input text)
  'MAX_BATCH_PREFILL_TOKENS': "16182",  # Limits the number of tokens that can be processed in parallel during the generation
  'MESSAGES_API_ENABLED': "true",       # Enable the OpenAI Messages API
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  # path to s3 bucket with model, we are not using a compressed model
  # {'S3DataSource':{'S3Uri': "s3://...",'S3DataType': 'S3Prefix','CompressionType': 'None'}},
  model_data=huggingface_estimator.model_data,
  image_uri=llm_image,
  env=config
)
```

After we have created the HuggingFaceModel we can deploy it to Amazon SageMaker using the deploy method.
```python
# Deploy model to an endpoint
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout, # 20 minutes to give SageMaker the time to download and merge model
)
```

SageMaker will now create our endpoint and deploy the model to it. This can take 15-20 minutes. Afterwards, we can test our model by sending some example inputs to the endpoint. We will use the predict method of the predictor to send the input to the model and get the output.
```python
# Prompt to generate
messages=[
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "Tell me something about Amazon SageMaker?" },
  ]

# Generation arguments
parameters = {
    "model": "meta-llama-3-fine-tuned", # placeholder, needed
    "top_p": 0.6,
    "temperature": 0.9,
    "max_tokens": 512,
    "stop": ["<|eot_id|>"],
}

chat = llm.predict({"messages": messages, **parameters})

print(chat["choices"][0]["message"]["content"].strip())
```

Note: If you want to learn more about streaming responses or benchmarking your endpoint, check out Deploy Llama 3 on Amazon SageMaker and Deploy Mixtral 8x7B on Amazon SageMaker.
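As mentioned earlier, we can now run a simple evaluation loop over the test split. The snippet below is a hedged sketch of such a loop, reusing the llm predictor and the parameters dictionary from above; the exact-match accuracy is only an illustrative proxy metric and will undercount answers that are correct but worded differently.

```python
# Illustrative evaluation sketch: replay a subset of the test conversations
# against the endpoint and compare the generations to the reference answers.
from tqdm import tqdm

def predict_sample(sample):
    prompt_messages = sample["messages"][:-1]        # drop the reference assistant answer
    reference = sample["messages"][-1]["content"]
    response = llm.predict({"messages": prompt_messages, **parameters})
    prediction = response["choices"][0]["message"]["content"].strip()
    return prediction, reference

number_of_samples = 50                               # evaluate a small subset to keep cost low
correct = 0
for sample in tqdm(dataset["test"].select(range(number_of_samples))):
    prediction, reference = predict_sample(sample)
    correct += int(prediction.strip() == reference.strip())

print(f"Exact-match accuracy: {correct / number_of_samples:.2%}")
```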
To clean up, we can delete the model and endpoint.
```python
llm.delete_model()
llm.delete_endpoint()
```

Thanks for reading! If you have any questions or feedback, please let me know on Twitter or LinkedIn.
