Optimizing Model Inference Speed on GPUs with Hugging Face Optimum

This tutorial demonstrates how to optimize Hugging Face Transformers models on GPUs with Hugging Face Optimum. By converting the model weights to FP16 and combining them with ONNX Runtime, the inference latency of a DistilBERT model can be reduced significantly on suitable hardware: a DistilBERT question-answering model fine-tuned on the SQuAD dataset drops from 7ms to 3ms (sequence length 128). The tutorial covers environment setup, conversion to ONNX, GPU optimization, and evaluation of performance and accuracy, showing large gains in throughput and latency without sacrificing model accuracy.

🚀 **Significant performance gains**: With Hugging Face Optimum and ONNX Runtime, the GPU inference latency of the DistilBERT model drops from 7ms to 3ms, a 2.3x improvement. This matters for latency-sensitive AI applications, delivering much faster processing without affecting model accuracy.

🔧 **Optimization techniques**: The tutorial details two key optimizations: converting model weights from FP32 to FP16 (for V100, T4, or newer GPUs) and applying ONNX Runtime graph optimizations (such as constant folding, redundant node elimination, and operator fusion). Together they maximize the model's efficiency on the target hardware.

✅ **Accuracy preserved**: The optimization does not sacrifice accuracy. Evaluation on the SQuAD dataset shows the optimized model's F1 score is essentially identical to that of the original FP32 model (86.85% vs. 86.85%), demonstrating a good balance between performance and precision.

🛠️ **Hands-on steps**: The tutorial provides a step-by-step walkthrough: installing the required libraries (such as optimum[onnxruntime-gpu]), converting a Hugging Face Transformers model to ONNX, configuring GPU optimization with the ORTOptimizer, and loading the optimized model for inference and performance evaluation, so readers can reproduce the process themselves.

In this session, you will learn how to optimize Hugging Face Transformers models for GPUs using Optimum. The session will show you how to convert your weights to fp16 and optimize a DistilBERT model using Hugging Face Optimum and ONNX Runtime. Hugging Face Optimum is an extension of 🤗 Transformers, providing a set of performance optimization tools enabling maximum efficiency to train and run models on targeted hardware. We are going to optimize a DistilBERT model for question answering, which was fine-tuned on the SQuAD dataset, to decrease the latency from 7ms to 3ms for a sequence length of 128.

Note: int8 quantization is currently only supported for CPUs. We plan to add GPU support in the near future using TensorRT.
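For context, the CPU-side path already exists via the ORTQuantizer. The snippet below is only a minimal sketch of what dynamic int8 quantization looked like in Optimum around version 1.3 and is not part of the GPU workflow in this post; treat the exact class names and arguments as assumptions, and note that it expects the exported onnx/model.onnx from section 2 to already exist and an AVX512-VNNI capable CPU.

from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig
from pathlib import Path

# sketch only: dynamic int8 quantization on CPU (assumed API, optimum[onnxruntime]~=1.3)
quantizer = ORTQuantizer.from_pretrained("distilbert-base-cased-distilled-squad", feature="question-answering")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

quantizer.export(
    onnx_model_path=Path("onnx") / "model.onnx",
    onnx_quantized_model_output_path=Path("onnx") / "model-quantized.onnx",
    quantization_config=qconfig,
)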

By the end of this session, you will know how GPU optimization with Hugging Face Optimum can significantly decrease model latency and increase throughput while keeping 100% of the full-precision model's accuracy.

You will learn how to:

    1. Setup Development Environment
    2. Convert a Hugging Face Transformers model to ONNX for inference
    3. Optimize model for GPU using ORTOptimizer
    4. Evaluate the performance and speed

Let's get started! 🚀

This tutorial was created and run on a g4dn.xlarge AWS EC2 instance with an NVIDIA T4.


1. Setup Development Environment

Our first step is to install Optimum, along with Evaluate and some other libraries. Running the following cell will install all the required packages for us including Transformers, PyTorch, and ONNX Runtime utilities:

Note: You need a machine with a GPU and CUDA installed. You can check this by running nvidia-smi in your terminal. If your environment is set up correctly, you should see statistics about your GPU.

%pip install "optimum[onnxruntime-gpu]==1.3.0" --upgrade
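If you are working in a notebook (as the %pip magic above suggests), you can run the GPU check from the note above inline:

!nvidia-smi  # should list your GPU, driver and CUDA version; if it fails, the GPU environment is not set up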

Before we start, let's make sure we have the CUDAExecutionProvider for ONNX Runtime available.

from onnxruntime import get_available_providers, get_device
import onnxruntime

# check available providers
assert 'CUDAExecutionProvider' in get_available_providers(), "ONNX Runtime GPU provider not found. Make sure onnxruntime-gpu is installed and onnxruntime is uninstalled."
assert "GPU" == get_device()

# assert version due to bug in 1.11.1
assert onnxruntime.__version__ > "1.11.1", "you need a newer version of ONNX Runtime"

If you want to run inference on a CPU, you can install 🤗 Optimum with pip install optimum[onnxruntime].
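A hedged example of that CPU-only install, pinning the same version used for the GPU setup above (the pin is an assumption, not a requirement), would be:

%pip install "optimum[onnxruntime]==1.3.0" --upgrade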

2. Convert a Hugging Face Transformers model to ONNX for inference

Before we can start optimizing our model we need to convert our vanilla transformers model to the ONNX format. To do this we will use the new ORTModelForQuestionAnswering class, calling the from_pretrained() method with the from_transformers argument. The model we are using is distilbert-base-cased-distilled-squad, a DistilBERT model fine-tuned on the SQuAD dataset achieving an F1 score of 87.1, with question-answering as the feature (task).

from optimum.onnxruntime import ORTModelForQuestionAnswering
from transformers import AutoTokenizer
from pathlib import Path

model_id = "distilbert-base-cased-distilled-squad"
onnx_path = Path("onnx")

# load vanilla transformers and convert to onnx
model = ORTModelForQuestionAnswering.from_pretrained(model_id, from_transformers=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# save onnx checkpoint and tokenizer
model.save_pretrained(onnx_path)
tokenizer.save_pretrained(onnx_path)

Before we jump into the optimization of the model, let's first evaluate its current performance. For this we can use the pipeline() function from 🤗 Transformers, meaning we will measure the end-to-end latency including the pre- and post-processing steps.

context="Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."question="As what is Philipp working?"

After we prepared our payload we can create the inference pipeline.

from transformers import pipeline

vanilla_qa = pipeline("question-answering", model=model, tokenizer=tokenizer, device=0)

print(f"pipeline is loaded on device {vanilla_qa.model.device}")
print(vanilla_qa(question=question, context=context))
#    pipeline is loaded on device cuda:0
#    {'score': 0.6575328707695007, 'start': 88, 'end': 102, 'answer': 'Technical Lead'}

If you are seeing a CreateExecutionProviderInstance error, you don't have a compatible CUDA version installed. Check the documentation for which CUDA version you need.
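A quick way to inspect the versions involved (PyTorch is available here since the setup above installs it alongside Optimum) is the sketch below:

import onnxruntime
import torch

print("onnxruntime:", onnxruntime.__version__)          # the GPU build is required for CUDAExecutionProvider
print("torch CUDA runtime:", torch.version.cuda)        # CUDA version PyTorch was built against
print("CUDA device visible:", torch.cuda.is_available())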

If you want to learn more about exporting Transformers models, check out the Convert Transformers to ONNX with Hugging Face Optimum blog post.

3. Optimize model for GPU using ORTOptimizer

The ORTOptimizer allows you to apply ONNX Runtime optimizations to our Transformers models. In addition to the ORTOptimizer, Optimum offers an OptimizationConfig, a configuration class that handles all ONNX Runtime optimization parameters. There are several techniques to optimize our model for GPUs, including graph optimizations and converting our model weights from fp32 to fp16.

Graph optimizations are essentially graph-level transformations, ranging from small graph simplifications and node eliminations to more complex node fusions and layout optimizations. Examples of graph optimizations include:

    Constant folding: evaluate constant expressions at compile time instead of runtime
    Redundant node elimination: remove redundant nodes without changing graph structure
    Operator fusion: merge one node (i.e. operator) into another so they can be executed together

If you want to learn more about graph optimization you can take a look at the ONNX Runtime documentation.
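One way to see these transformations in action is to compare the operators in the exported graph before and after optimization. This is only an illustrative sketch: it assumes the onnx package is installed and uses the file paths produced later in this post (onnx/model.onnx and onnx/model-optimized.onnx); in the fused graph you should see ONNX Runtime contrib operators such as Attention replacing many smaller nodes.

import onnx
from collections import Counter

# count operator types in the original vs. the optimized graph (paths assumed from the sections below)
for path in ["onnx/model.onnx", "onnx/model-optimized.onnx"]:
    graph = onnx.load(path).graph
    print(path, Counter(node.op_type for node in graph.node).most_common(10))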

To achieve the best performance we will apply the following optimization parameters in our OptimizationConfig:

    optimization_level=99: to enable all the optimizations. Note: switching hardware after optimization can lead to issues.
    optimize_for_gpu=True: to enable GPU optimizations.
    fp16=True: to convert model computation from fp32 to fp16. Note: only for V100 and T4 or newer.

from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# create ORTOptimizer and define optimization configuration
optimizer = ORTOptimizer.from_pretrained(model_id, feature=model.pipeline_task)
optimization_config = OptimizationConfig(optimization_level=99,
                                         optimize_for_gpu=True,
                                         fp16=True)

# apply the optimization configuration to the model
optimizer.export(
    onnx_model_path=onnx_path / "model.onnx",
    onnx_optimized_model_output_path=onnx_path / "model-optimized.onnx",
    optimization_config=optimization_config,
)

To test performance we can use the ORTModelForQuestionAnswering class again and provide an additional file_name parameter to load our optimized model (this also works for models available on the Hub).

from transformers import pipeline

# load optimized model
model = ORTModelForQuestionAnswering.from_pretrained(onnx_path, file_name="model-optimized.onnx")

# create optimized pipeline
optimized_qa = pipeline("question-answering", model=model, tokenizer=tokenizer, device=0)
print(optimized_qa(question=question, context=context))

4. Evaluate the performance and speed

As the last step, we want to take a detailed look at the performance and accuracy of our model. Applying optimization techniques like graph optimizations or mixed precision not only impacts performance (latency) but can also affect the accuracy of the model, so accelerating your model comes with a trade-off.

Let's evaluate our models. Our transformers model distilbert-base-cased-distilled-squad was fine-tuned on the SQuAD dataset.

from transformers import pipeline

trfs_qa = pipeline("question-answering", model=model_id, device=0)
from datasets import load_metric, load_dataset

metric = load_metric("squad")
eval_dataset = load_dataset("squad")["validation"]

# creating a subset for faster evaluation
# COMMENT IN to run evaluation on a subset of the dataset
# eval_dataset = eval_dataset.select(range(1000))

We can now leverage the map function of datasets to iterate over the validation set of SQuAD and run a prediction for each data point. For this, we write an evaluate helper method which uses our pipelines and applies some transformations to work with the SQuAD metric.

def evaluate(example):
  default = vanilla_qa(question=example["question"], context=example["context"])
  optimized = optimized_qa(question=example["question"], context=example["context"])
  return {
      'reference': {'id': example['id'], 'answers': example['answers']},
      'default': {'id': example['id'], 'prediction_text': default['answer']},
      'optimized': {'id': example['id'], 'prediction_text': optimized['answer']},
      }

result = eval_dataset.map(evaluate)

default_acc = metric.compute(predictions=result["default"], references=result["reference"])
optimized = metric.compute(predictions=result["optimized"], references=result["reference"])

print(f"vanilla model: f1={default_acc['f1']}%")
print(f"optimized model: f1={optimized['f1']}%")
print(f"The optimized model achieves {round(optimized['f1']/default_acc['f1'],2)*100:.2f}% accuracy of the fp32 model")

# vanilla model: f1=86.84859514665654%
# optimized model: f1=86.8536859246896%
# The optimized model achieves 100.00% accuracy of the fp32 model

Okay, now let's test the performance (latency) of our optimized model. We are going to use a payload with a sequence length of 128 for the benchmark. To keep it simple, we are going to use a Python loop and calculate the average and p95 latency for our vanilla model and for the optimized model.

from time import perf_counter
import numpy as np

context = "Hello, my name is Philipp and I live in Nuremberg, Germany. Currently I am working as a Technical Lead at Hugging Face to democratize artificial intelligence through open source and open science. In the past I designed and implemented cloud-native machine learning architectures for fin-tech and insurance companies. I found my passion for cloud concepts and machine learning 5 years ago. Since then I never stopped learning. Currently, I am focusing myself in the area NLP and how to leverage models like BERT, Roberta, T5, ViT, and GPT2 to generate business value."
question = "As what is Philipp working?"

def measure_latency(pipe):
    latencies = []
    # warm up
    for _ in range(10):
        _ = pipe(question=question, context=context)
    # Timed run
    for _ in range(300):
        start_time = perf_counter()
        _ = pipe(question=question, context=context)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_ms = 1000 * np.mean(latencies)
    time_std_ms = 1000 * np.std(latencies)
    time_p95_ms = 1000 * np.percentile(latencies, 95)
    return f"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +\- {time_std_ms:.2f};", time_p95_ms

vanilla_model = measure_latency(vanilla_qa)
optimized_model = measure_latency(optimized_qa)

print(f"Vanilla model: {vanilla_model[0]}")
print(f"Optimized model: {optimized_model[0]}")
print(f"Improvement through optimization: {round(vanilla_model[1]/optimized_model[1],2)}x")

# Vanilla model: P95 latency (ms) - 7.784631400227273; Average latency (ms) - 6.87 +\- 1.20;
# Optimized model: P95 latency (ms) - 3.392388850079442; Average latency (ms) - 3.32 +\- 0.03;
# Improvement through optimization: 2.29x

We managed to accelerate our model latency from 7.8ms to 3.4ms or 2.3x while keeping 100.00% of the accuracy.

[Figure: performance comparison (https://www.philschmid.de/static/blog/optimizing-transformers-with-optimum-gpu/performance.png)]

Conclusion

We successfully optimized our vanilla Transformers model with Hugging Face Optimum and managed to accelerate our model latency from 7.8ms to 3.4ms or 2.3x while keeping 100.00% of the accuracy.

But I have to say that this isn't a plug-and-play process you can transfer to any Transformers model, task, or dataset.


Thanks for reading. If you have any questions, feel free to contact me through GitHub or on the forum. You can also connect with me on Twitter or LinkedIn.
