Optimizing GPT-J Inference with DeepSpeed-Inference

 

This tutorial shows how to optimize the inference performance of GPT-2/GPT-J models using Hugging Face Transformers and DeepSpeed-Inference. It focuses on single-GPU inference and applies advanced optimization techniques through the DeepSpeed-Inference engine, such as tensor and pipeline parallelism and custom CUDA kernels, to significantly speed up model inference. The tutorial walks through setting up the development environment, loading the baseline GPT-J model, optimizing it with the DeepSpeed-Inference engine, and evaluating the resulting performance and speed. Using GPT-J-6B as an example, generation latency drops from 8.9 seconds to 6.5 seconds, a 1.3x throughput improvement, demonstrating the effectiveness of DeepSpeed-Inference for accelerating Transformer model inference.

📚 This tutorial gives a systematic introduction to optimizing the inference performance of GPT-2/GPT-J models from Hugging Face Transformers with the DeepSpeed-Inference framework. Step by step, it shows how tensor parallelism, pipeline parallelism, and custom CUDA kernel injection can be combined to significantly improve inference speed and efficiency on a single GPU.

⚙️ The tutorial first walks through setting up a development environment with the required libraries (PyTorch, Transformers, DeepSpeed) and verifying the installation with an environment check. It then loads EleutherAI's GPT-J-6B with sharded fp16 weights as the baseline model and measures its pre-optimization performance.

💻 The core section explains how to optimize the GPT-J model with DeepSpeed-Inference's `init_inference` method, covering key parameters such as the model, number of GPUs, data type, and kernel injection, and shows how the GPTJLayer in the optimized model graph is replaced by an HFGPTJLayer containing the DeepSpeedTransformerInference module.

📈 The performance evaluation compares latency before and after optimization. The measurements show that the optimized DeepSpeed model cuts the latency of generating 128 tokens from 8.9 seconds to 6.5 seconds and the per-token time from 69ms to 50ms, a 1.3x speedup, confirming the effectiveness of the optimization.

🔧 The tutorial stresses that DeepSpeed-Inference is not a universal solution: model compatibility must be checked, and while the optimization itself is simple (a single initialization call), it still needs to be adapted to the specific model, task, and data.

In this session, you will learn how to optimize GPT-2/GPT-J for inference using Hugging Face Transformers and DeepSpeed-Inference. The session shows you how to apply state-of-the-art optimization techniques using DeepSpeed-Inference and focuses on single-GPU inference for GPT-2, GPT-NEO, and GPT-J like models. By the end of this session, you will know how to optimize your Hugging Face Transformers models (GPT-2, GPT-J) using DeepSpeed-Inference. We are going to optimize GPT-J 6B for text generation.

You will learn how to:

    1. Setup Development Environment
    2. Load vanilla GPT-J model and set baseline
    3. Optimize GPT-J for GPU using DeepSpeed's InferenceEngine
    4. Evaluate the performance and speed

Let's get started! 🚀

This tutorial was created and run on a g4dn.2xlarge AWS EC2 Instance including an NVIDIA T4.


Quick Intro: What is DeepSpeed-Inference

DeepSpeed-Inference is an extension of the DeepSpeed framework focused on inference workloads. DeepSpeed-Inference combines model-parallelism techniques, such as tensor and pipeline parallelism, with custom optimized CUDA kernels. DeepSpeed provides a seamless inference mode for compatible transformer-based models trained using DeepSpeed, Megatron, and HuggingFace. For a list of compatible models, please see here. As mentioned, DeepSpeed-Inference integrates model-parallelism techniques that allow you to run multi-GPU inference for LLMs, like BLOOM with 176 billion parameters. If you want to learn more about DeepSpeed-Inference, see the DeepSpeed documentation and their inference blog post.
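To make the model-parallel path concrete, here is a minimal sketch (not part of the original tutorial) of how a multi-GPU setup could look. The two-GPU count, the script name, and the launch command are illustrative assumptions; the rest of this post only uses a single GPU with mp_size=1.

# hedged sketch: tensor-parallel inference across 2 GPUs
# launched with the DeepSpeed launcher, e.g.: deepspeed --num_gpus 2 run_inference.py (script name is a placeholder)
import torch
import deepspeed
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "philschmid/gpt-j-6B-fp16-sharded"  # same checkpoint used later in this post
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# mp_size=2 splits the model weights (tensor parallelism) across two GPUs
ds_model = deepspeed.init_inference(
    model=model,
    mp_size=2,
    dtype=torch.float16,
    replace_with_kernel_inject=True,
)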

1. Setup Development Environment

Our first step is to install DeepSpeed, along with PyTorch, Transformers, and some other libraries. Running the following cell will install all the required packages.

Note: You need a machine with a GPU and a compatible CUDA installed. You can check this by running nvidia-smi in your terminal. If your setup is correct, you should get statistics about your GPU.

!pip install torch==1.11.0 torchvision==0.12.0 --extra-index-url https://download.pytorch.org/whl/cu113 --upgrade -q
# !pip install deepspeed==0.7.2 --upgrade -q
!pip install git+https://github.com/microsoft/DeepSpeed.git@ds-inference/support-large-token-length --upgrade
!pip install transformers[sentencepiece]==4.21.2 accelerate --upgrade -q

Before we start, let's make sure all packages are installed correctly.

import re
import torch

# check deepspeed installation
report = !python3 -m deepspeed.env_report
r = re.compile('.*ninja.*OKAY.*')
assert any(r.match(line) for line in report) == True, "DeepSpeed Inference not correctly installed"

# check cuda and torch version
torch_version, cuda_version = torch.__version__.split("+")
torch_version = ".".join(torch_version.split(".")[:2])
cuda_version = f"{cuda_version[2:4]}.{cuda_version[4:]}"
r = re.compile(f'.*torch.*{torch_version}.*')
assert any(r.match(line) for line in report) == True, "Wrong Torch version"
r = re.compile(f'.*cuda.*{cuda_version}.*')
assert any(r.match(line) for line in report) == True, "Wrong Cuda version"

2. Load vanilla GPT-J model and set baseline

After we set up our environment, we create a baseline for our model. We use EleutherAI/gpt-j-6B, a GPT-J 6B model trained on the Pile, a large-scale curated dataset created by EleutherAI. The model was trained for 402 billion tokens over 383,500 steps on a TPU v3-256 pod. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.

To create our baseline, we load the model with transformers and run inference.

Note: We created a separate repository containing sharded fp16 weights to make it easier to load the models on smaller CPUs by using the device_map feature to automatically place sharded checkpoints on GPU. Learn more here

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

# Model Repository on huggingface.co
model_id = "philschmid/gpt-j-6B-fp16-sharded"

# Load Model and Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
# we use device_map auto to automatically place all shards on the GPU to save CPU memory
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
print(f"model is loaded on device {model.device.type}")
# model is loaded on device cuda

Let's run some inference.

payload = "Hello my name is Philipp. I am getting in touch with you because i didn't get a response from you. What do I need to do to get my new card which I have requested 2 weeks ago? Please help me and answer this email in the next 7 days. Best regards and have a nice weekend but it" input_ids = tokenizer(payload,return_tensors="pt").input_ids.to(model.device)print(f"input payload: \n \n{payload}")logits = model.generate(input_ids, do_sample=True, num_beams=1, min_length=128, max_new_tokens=128) print(f"prediction: \n \n {tokenizer.decode(logits[0].tolist()[len(input_ids[0]):])}")#    input payload:#    Hello my name is Philipp. I am getting in touch with you because i didn't get a response from you. What do I need to do to get my new card which I have requested 2 weeks ago? Please help me and answer this email in the next 7 days. Best regards and have a nice weekend but it#    prediction:#     's Friday evening for the British and you can feel that coming in on top of a Friday, please try to spend a quiet time tonight. Thankyou, Philipp

To create a latency baseline, we use the measure_latency function, which implements a simple Python loop to run inference and calculates the average, standard deviation, and p95 latency for our model.

from time import perf_counter
import numpy as np
import transformers

# hide generation warnings
transformers.logging.set_verbosity_error()

def measure_latency(model, tokenizer, payload, generation_args={}, device=model.device):
    input_ids = tokenizer(payload, return_tensors="pt").input_ids.to(device)
    latencies = []
    # warm up
    for _ in range(2):
        _ = model.generate(input_ids, **generation_args)
    # Timed run
    for _ in range(10):
        start_time = perf_counter()
        _ = model.generate(input_ids, **generation_args)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    time_p95_ms = 1000 * np.percentile(latencies, 95)
    return f"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f};", time_p95_ms

We are going to use greedy search as the decoding strategy and will generate 128 new tokens, with 128 tokens as input.

payload="Hello my name is Philipp. I am getting in touch with you because i didn't get a response from you. What do I need to do to get my new card which I have requested 2 weeks ago? Please help me and answer this email in the next 7 days. Best regards and have a nice weekend but it"*2print(f'Payload sequence length is: {len(tokenizer(payload)["input_ids"])}') # generation argumentsgeneration_args = dict(  do_sample=False,  num_beams=1,  min_length=128,  max_new_tokens=128)vanilla_results = measure_latency(model,tokenizer,payload,generation_args) print(f"Vanilla model: {vanilla_results[0]}")#  Payload sequence length is: 128#  Vanilla model: P95 latency (ms) - 8985.898722249989; Average latency (ms) - 8955.07 +\- 24.34;

Our model achieves a latency of 8.9s for 128 tokens, or 69ms/token.
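As a quick back-of-the-envelope check (using the measured average from above and assuming the latency is dominated by the 128 newly generated tokens):

avg_latency_ms = 8955.07   # measured average latency from the run above
new_tokens = 128           # number of generated tokens
print(f"{avg_latency_ms / new_tokens:.0f} ms/token")  # ~70 ms/token, i.e. roughly the 69ms/token quoted above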

3. Optimize GPT-J for GPU using DeepSpeed's InferenceEngine

The next and most important step is to optimize our model for GPU inference. This is done using the DeepSpeed InferenceEngine. The InferenceEngine is initialized using the init_inference method, which expects at least the following parameters:

    model: The model to optimize.
    mp_size: The number of GPUs to use.
    dtype: The data type to use.
    replace_with_kernel_inject: Whether to inject custom kernels.

You can find more information about the init_inference method in the DeepSpeed documentation or their inference blog post.

Note: You might need to restart your kernel if you are running into a CUDA OOM error.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import deepspeed

# Model Repository on huggingface.co
model_id = "philschmid/gpt-j-6B-fp16-sharded"

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# init deepspeed inference engine
ds_model = deepspeed.init_inference(
    model=model,                      # Transformers model
    mp_size=1,                        # Number of GPUs
    dtype=torch.float16,              # dtype of the weights (fp16)
    replace_method="auto",            # Let DeepSpeed automatically identify the layers to replace
    replace_with_kernel_inject=True,  # replace the model with the kernel injector
)
print(f"model is loaded on device {ds_model.module.device}")

We can now inspect our model graph to see that the vanilla GPTJLayer has been replaced with an HFGPTJLayer, which includes the DeepSpeedTransformerInference module, a custom nn.Module that is optimized for inference by the DeepSpeed Team.

InferenceEngine(
  (module): GPTJForCausalLM(
    (transformer): GPTJModel(
      (wte): Embedding(50400, 4096)
      (drop): Dropout(p=0.0, inplace=False)
      (h): ModuleList(
        (0): DeepSpeedTransformerInference(
          (attention): DeepSpeedSelfAttention()
          (mlp): DeepSpeedMLP()
        )
from deepspeed.ops.transformer.inference import DeepSpeedTransformerInference

assert isinstance(ds_model.module.transformer.h[0], DeepSpeedTransformerInference) == True, "Model not successfully initialized"
# Test model
example = "My name is Philipp and I"
input_ids = tokenizer(example, return_tensors="pt").input_ids.to(model.device)
logits = ds_model.generate(input_ids, do_sample=True, max_length=100)
tokenizer.decode(logits[0].tolist())
#     'My name is Philipp and I live in Freiburg in Germany and I have a project called Cenapen. After three months in development already it is finally finished – and it is a Linux based device / operating system on an ARM Cortex A9 processor on a Raspberry Pi.\n\nAt the moment it offers the possibility to store data locally, it can retrieve data from a local, networked or web based Sqlite database (I’m writing this tutorial while I’'

4. Evaluate the performance and speed

As the last step, we want to take a detailed look at the performance of our optimized model. Applying optimization techniques, like graph optimizations or mixed precision, not only impacts performance (latency) but can also affect the accuracy of the model. So accelerating your model comes with a trade-off.
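One way to spot-check this trade-off, sketched below under the assumption that you can keep (or reload) an unoptimized copy of the checkpoint as vanilla_model alongside the DeepSpeed engine, is to compare deterministic greedy generations from both models; this check is not part of the original tutorial.

# hedged sketch: compare greedy outputs of the vanilla and the optimized model
prompt = "My name is Philipp and I"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(ds_model.module.device)

ds_output = ds_model.generate(input_ids, do_sample=False, max_new_tokens=32)
# vanilla_model is assumed to be a freshly loaded, unoptimized copy of the same checkpoint
vanilla_output = vanilla_model.generate(input_ids.to(vanilla_model.device), do_sample=False, max_new_tokens=32)

print(tokenizer.decode(ds_output[0], skip_special_tokens=True))
print(tokenizer.decode(vanilla_output[0], skip_special_tokens=True))
# with fp16 kernels, tiny numerical differences can flip a few tokens; identical or
# near-identical greedy outputs are a good sign the optimization preserved quality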

Let's test the performance (latency) of our optimized model. We will use the same generation args as for our vanilla model.

payload = (
    "Hello my name is Philipp. I am getting in touch with you because i didn't get a response from you. What do I need to do to get my new card which I have requested 2 weeks ago? Please help me and answer this email in the next 7 days. Best regards and have a nice weekend but it"
    * 2
)
print(f'Payload sequence length is: {len(tokenizer(payload)["input_ids"])}')

# generation arguments
generation_args = dict(do_sample=False, num_beams=1, min_length=128, max_new_tokens=128)
ds_results = measure_latency(ds_model, tokenizer, payload, generation_args, ds_model.module.device)

print(f"DeepSpeed model: {ds_results[0]}")
# Payload sequence length is: 128
# DeepSpeed model: P95 latency (ms) - 6577.044982599967; Average latency (ms) - 6569.11 +\- 6.57;

Our optimized DeepSpeed model achieves a latency of 6.5s for 128 tokens, or 50ms/token.

We managed to accelerate the GPT-J-6B model latency from 8.9s to 6.5s for generating 128 tokens. This results in an improvement from 69ms/token to 50ms/token, or 1.38x.
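The speedup follows directly from the two measured average latencies and the rounded per-token numbers:

vanilla_ms, ds_ms = 8955.07, 6569.11                       # measured averages from above
print(f"end-to-end speedup: {vanilla_ms / ds_ms:.2f}x")    # ~1.36x
print(f"per-token speedup:  {69 / 50:.2f}x")               # 1.38x from the rounded ms/token figures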

Conclusion

We successfully optimized our GPT-J Transformers model with DeepSpeed-Inference and managed to decrease our model latency from 69ms/token to 50ms/token, or 1.3x. Those are good results considering that applying the optimization was as easy as adding one additional call to deepspeed.init_inference. But I have to say that this isn't a plug-and-play process you can transfer to any Transformers model, task, or dataset. Also, make sure to check if your model is compatible with DeepSpeed-Inference.


Thanks for reading! If you have any questions, feel free to contact me through Github or on the forum. You can also connect with me on Twitter or LinkedIn.
