Instruction-Tuning Llama 2 Tutorial

 


📌 This post walks through instruction-tuning Meta AI's Llama 2, with a focus on building an instruction dataset so the model learns to generate instructions from a given input. It covers defining the use case, creating a prompt template, building the instruction dataset, and fine-tuning with trl and the SFTTrainer, followed by testing the model, running inference, and speeding up training with Flash Attention. The walkthrough is based on the databricks/databricks-dolly-15k dataset and includes detailed code examples and a discussion of the results.

🔍 In the use-case definition stage, the post explains what an instruction is and what it does: it guides a large language model (such as Llama, GPT-4, or Claude) to generate a response, allowing humans to steer the conversation and make the output more natural, useful, and aligned with the user's goals. Several instruction types (brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization) illustrate concrete use cases.

📊 For creating the instruction dataset, the post compares several approaches: converting existing datasets (e.g., FLAN), generating synthetic data with existing LLMs (e.g., Alpaca), and having humans write instructions (e.g., Dolly). It uses the Dolly dataset and provides code showing how to load samples and format them with the instruction template.

🚀 For the fine-tuning itself, the post uses QLoRA, which quantizes the pre-trained model to 4 bits, freezes it, and attaches small trainable adapter layers, reducing the memory footprint of fine-tuning without sacrificing performance. It also covers Flash Attention, which reorders the attention computation and uses classical techniques such as tiling and recomputation to significantly speed up training and reduce memory usage.

🔧 For testing and inference, the post shows how to load the LoRA adapter into the model with peft and transformers, generate instructions to verify the model works, and merge the adapter weights into the base model to simplify deployment.

This blog post is an extended guide on instruction-tuning Llama 2 from Meta AI. The idea of the blog post is to focus on creating the instruction dataset, which we can then use to fine-tune the base model of Llama 2 to follow our instructions.

The goal is to create a model which can generate instructions based on an input. The idea is that this model can then be used by others to create instruction data from their own inputs. That's especially helpful if you want to personalize models for, e.g., tweeting or email writing: you could generate an instruction dataset from your emails and then train a model to mimic your email writing.

Okay, let's get started. In this blog, we are going to:

1. Define the use case and create a prompt template for instructions
2. Create an instruction dataset
3. Instruction-tune Llama 2 using trl and the SFTTrainer
4. Test the model and run inference

Note: This tutorial was created and run on a g5.2xlarge AWS EC2 Instance, including an NVIDIA A10G GPU.

1. Define the use case and create a prompt template for instructions

Before we describe our use case, we need to better understand what an instruction actually is.

An instruction is a piece of text or prompt that is provided to an LLM, like Llama, GPT-4, or Claude, to guide it to generate a response. Instructions allow humans to steer the conversation and constrain the language model's output to be more natural, useful, and aligned with the user's goals. Crafting clear, well-formulated instructions is key to productive conversations.

Examples of instructions are listed below in the table.

| Capability | Example Instruction |
| --- | --- |
| Brainstorming | Provide a diverse set of creative ideas for new flavors of ice cream. |
| Classification | Categorize these movies as either comedy, drama, or horror based on the plot summary. |
| Closed QA | Answer the question 'What is the capital of France?' with a single word. |
| Generation | Write a poem in the style of Robert Frost about nature and the changing seasons. |
| Information Extraction | Extract the names of the main characters from this short story. |
| Open QA | Why do leaves change color in autumn? Explain the scientific reasons. |
| Summarization | Summarize this article on recent advancements in renewable energy in 2-3 sentences. |

As described in the beginning, we want to fine-tune a model to be able to generate instructions based on an input. We want to use this as a way to create synthetic datasets to personalize LLMs and Agents.

Converting the idea into a basic prompt template following the Alpaca format, we get:

```
### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
Dear [boss name],

I'm writing to request next week, August 1st through August 4th, off as paid time off. I have some personal matters to attend to that week that require me to be out of the office. I wanted to give you as much advance notice as possible so you can plan accordingly while I am away.

Please let me know if you need any additional information from me or have any concerns with me taking next week off. I appreciate you considering this request.

Thank you, [Your name]

### Response:
Write an email to my boss that I need next week 08/01 - 08/04 off.
```

2. Create an instruction dataset

After we have defined our use case and prompt template, we need to create our instruction dataset. Creating a high-quality instruction dataset is key to a good-performing model. Research such as "Less Is More for Alignment" shows that a high-quality, low-quantity (~1,000 samples) dataset can achieve the same performance as lower-quality, high-quantity datasets.

There are several ways to create an instruction dataset, including:

- Using an existing dataset and converting it into an instruction dataset, e.g., FLAN
- Using existing LLMs to create synthetic instruction datasets, e.g., Alpaca
- Using humans to create instruction datasets, e.g., Dolly

Each of these methods has its own advantages and disadvantages; the right choice depends on budget, time, and quality requirements. For example, using an existing dataset is the easiest but might not be tailored to your specific use case, while using humans might be the most accurate but can be time-consuming and expensive. It is also possible to combine several methods to create an instruction dataset, as shown in Orca: Progressive Learning from Complex Explanation Traces of GPT-4.

To keep it simple, we are going to use Dolly, an open-source dataset of instruction-following records generated by thousands of Databricks employees across several of the behavioral categories outlined in the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization.

Let's start coding, but first, let's install our dependencies.

```bash
!pip install "transformers==4.34.0" "datasets==2.13.0" "peft==0.4.0" "accelerate==0.23.0" "bitsandbytes==0.41.1" "trl==0.4.7" "safetensors>=0.3.1" --upgrade
```

To load the databricks/databricks-dolly-15k dataset, we use the load_dataset() method from the 🤗 Datasets library.

```python
from datasets import load_dataset
from random import randrange

# Load dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

print(f"dataset size: {len(dataset)}")
print(dataset[randrange(len(dataset))])
# dataset size: 15011
```

To instruction-tune our model, we need to convert our structured examples into a collection of tasks described via instructions. We define a format_instruction function that takes a sample and returns a string with our formatted instruction.

```python
def format_instruction(sample):
    return f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
{sample['response']}

### Response:
{sample['instruction']}"""
```

Let's test our formatting function on a random example.

```python
from random import randrange

print(format_instruction(dataset[randrange(len(dataset))]))
```

```
### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
22nd July 1947

### Response:
When was the Indian National Flag adopted
```

3. Instruction-tune Llama 2 using trl and the SFTTrainer

We will use the recently introduced method from the paper "QLoRA: Efficient Finetuning of Quantized LLMs" by Tim Dettmers et al. QLoRA is a new technique to reduce the memory footprint of large language models during fine-tuning without sacrificing performance. The TL;DR of how QLoRA works is:

1. Quantize the pre-trained model to 4 bits and freeze it.
2. Attach small, trainable adapter layers (LoRA).
3. Finetune only the adapter layers while using the frozen, quantized model for context.

If you want to learn more about QLoRA and how it works, I recommend reading the Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA blog post.
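To make the adapter idea from the list above a bit more concrete, here is a minimal, self-contained sketch of a low-rank adapter in plain PyTorch. It only illustrates the LoRA concept (frozen base weights plus a small trainable low-rank update); the 4-bit quantization of the frozen base and the actual peft/bitsandbytes implementation used later in this post are intentionally omitted, and the layer size, rank, and alpha below are arbitrary example values.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Conceptual sketch: a frozen linear layer plus a trainable low-rank update."""

    def __init__(self, base_layer: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base_layer
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pre-trained weights
        # Low-rank factors: A (r x in_features) and B (out_features x r)
        self.lora_A = nn.Parameter(torch.randn(r, base_layer.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base_layer.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus the scaled low-rank update B(A(x))
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Toy example: only the adapter parameters are trainable
layer = LoRALinear(nn.Linear(512, 512), r=8, alpha=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable adapter parameters: {trainable}")  # 2 * 8 * 512 = 8192
```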

Flash Attention

Flash Attention is a method that reorders the attention computation and leverages classical techniques (tiling, recomputation) to significantly speed it up and reduce memory usage from quadratic to linear in sequence length. It is based on the paper "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". The TL;DR: it accelerates training by up to 3x. Learn more at FlashAttention. Flash Attention is currently only available for Ampere (A10, A40, A100, ...) and Hopper (H100, ...) GPUs. You can check if your GPU is supported and install it using the following commands:

Note: If your machine has less than 96GB of RAM and lots of CPU cores, reduce the number of MAX_JOBS. On the g5.2xlarge we used 4.

python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'"pip install ninja packagingMAX_JOBS=4 pip install flash-attn --no-build-isolation

Installing flash attention can take quite a bit of time (10-45 minutes).

The example supports the use of Flash Attention for all Llama checkpoints, but it is not enabled by default. To use Flash Attention, change the value of use_flash_attention to True.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

use_flash_attention = False

# Hugging Face model id
model_id = "NousResearch/Llama-2-7b-hf"  # non-gated
# model_id = "meta-llama/Llama-2-7b-hf" # gated

# BitsAndBytesConfig int-4 config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    use_cache=False,
    use_flash_attention_2=use_flash_attention,
    device_map="auto",
)
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
```

The SFTTrainer supports a native integration with peft, which makes it super easy to efficiently instruction-tune LLMs. We only need to create our LoraConfig and provide it to the trainer.

```python
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# LoRA config based on QLoRA paper
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
```
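As an optional sanity check (not part of the original walkthrough), the wrapped peft model can report how many parameters the LoRA adapters actually make trainable:

```python
# Optional: show how small the trainable LoRA adapter is compared to the full model
model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...
```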

Before we can start our training we need to define the hyperparameters (TrainingArguments) we want to use.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama-7-int4-dolly",
    num_train_epochs=3,
    per_device_train_batch_size=6 if use_flash_attention else 4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    disable_tqdm=True,  # disable tqdm since with packing values are incorrect
)
```

We now have every building block we need to create our SFTTrainer and start training our model.

```python
from trl import SFTTrainer

max_seq_length = 2048  # max sequence length for model and packing of the dataset

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_instruction,
    args=args,
)
```

Start training our model by calling the train() method on our Trainer instance.

```python
# train
trainer.train()  # there will not be a progress bar since tqdm is disabled

# save model
trainer.save_model()
```

The training without Flash Attention took 03:08:00 on a g5.2xlarge. The instance costs $1.212/h, which brings the total cost to about $3.7. The training with Flash Attention took 02:08:00 on a g5.2xlarge, which brings the total cost to about $2.6.

The results using Flash Attention are impressive: training is roughly 1.5x faster and about 30% cheaper.
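Those two figures follow directly from the reported wall-clock times. A quick back-of-the-envelope check (using only the training durations from above):

```python
# Speedup and cost savings implied by the reported training times
hours_no_fa = 3 + 8 / 60   # 03:08:00 without Flash Attention
hours_fa = 2 + 8 / 60      # 02:08:00 with Flash Attention

speedup = hours_no_fa / hours_fa        # ~1.47x faster
savings = 1 - hours_fa / hours_no_fa    # ~32% cheaper at the same hourly rate

print(f"speedup: {speedup:.2f}x, savings: {savings:.0%}")
```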

4. Test Model and run Inference

After the training is done, we want to run and test our model. We will use peft and transformers to load our LoRA adapter into our model.

```python
if use_flash_attention:
    # unpatch flash attention
    from utils.llama_patch import unplace_flash_attn_with_attn
    unplace_flash_attn_with_attn()

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

args.output_dir = "llama-7-int4-dolly"

# load base LLM model and tokenizer
model = AutoPeftModelForCausalLM.from_pretrained(
    args.output_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
)
tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
```

Let’s load the dataset again with a random sample to try to generate an instruction.

```python
from datasets import load_dataset
from random import randrange

# Load dataset from the hub and get a sample
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
sample = dataset[randrange(len(dataset))]

prompt = f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
{sample['response']}

### Response:
"""

input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
# with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9, temperature=0.9)

print(f"Prompt:\n{sample['response']}\n")
print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
print(f"Ground truth:\n{sample['instruction']}")
```

Result:

```
Prompt:
Jack Dorsey, Noah Glass, Biz Stone, Evan Williams

Generated instruction:
Extract the founders of Twitter from the passage. Display the results in a comma separated format.

Ground truth:
List the founders of Twitter from the above passage in a comma separated format.
```

Nice, our model works! If we want to accelerate our model, we can deploy it with Text Generation Inference. To do so, we need to merge our adapter weights into the base model.

```python
from peft import AutoPeftModelForCausalLM

model = AutoPeftModelForCausalLM.from_pretrained(
    args.output_dir,
    low_cpu_mem_usage=True,
)

# Merge LoRA and base model
merged_model = model.merge_and_unload()

# Save the merged model
merged_model.save_pretrained("merged_model", safe_serialization=True)
tokenizer.save_pretrained("merged_model")

# push merged model to the hub
# merged_model.push_to_hub("user/repo")
# tokenizer.push_to_hub("user/repo")
```
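Once merged, the checkpoint in merged_model behaves like a regular transformers model and can be loaded without peft. Here is a minimal sketch of how that could look, assuming the local merged_model directory created above (the example input is hypothetical):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged checkpoint like any standard transformers model (no peft required)
model = AutoModelForCausalLM.from_pretrained(
    "merged_model",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("merged_model")

# Hypothetical input, using the same prompt template as during training
prompt = """### Instruction:
Use the Input below to create an instruction, which could have been used to generate the input using an LLM.

### Input:
Paris is the capital of France.

### Response:
"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids=input_ids, max_new_tokens=50, do_sample=True, top_p=0.9, temperature=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prompt):])
```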

Thanks for reading! If you have any questions, feel free to contact me on Twitter or LinkedIn.
