MarkTechPost@AI · Sep 24, 07:51
Optimizing Transformer Models with Hugging Face Optimum

This tutorial walks through how to use Hugging Face Optimum to optimize Transformer models, improving inference speed while preserving accuracy. It sets up DistilBERT on the SST-2 dataset and compares several execution engines: plain PyTorch, torch.compile, ONNX Runtime, and quantized ONNX. The entire workflow runs in Google Colab and covers model export, optimization, quantization, and benchmarking, giving readers first-hand experience of the performance gains.

🚀 **Environment setup**: The tutorial first installs the required libraries and configures Hugging Face Optimum with ONNX Runtime. This includes setting the model ID, output directories, device (CPU/GPU), batch size, and iteration counts, laying the groundwork for the optimization and benchmarking steps that follow.

📊 **Baseline benchmarking**: The `run_eval` and `bench` helpers make it possible to compare accuracy and inference speed fairly across execution engines. The plain PyTorch model is benchmarked first, then `torch.compile` is applied for just-in-time graph optimization and its metrics are recorded.

⚡ **ONNX Runtime and quantization**: The model is then exported to ONNX and run with ONNX Runtime to compare speed and accuracy against the PyTorch baseline. Optimum's `ORTQuantizer` then applies dynamic quantization, and the quantized model is benchmarked again to measure the speedup and how well accuracy is preserved.

💡 **Engine comparison and practical use**: Finally, the results for PyTorch, torch.compile, ONNX Runtime, and quantized ONNX are collected in a Pandas DataFrame comparing inference speed and accuracy. Sample predictions show how the different optimization strategies behave in practice.

In this tutorial, we walk through how we use Hugging Face Optimum to optimize Transformer models and make them faster while maintaining accuracy. We begin by setting up DistilBERT on the SST-2 dataset, and then we compare different execution engines: plain PyTorch, torch.compile, ONNX Runtime, and quantized ONNX. Working step by step, we get hands-on experience with model export, optimization, quantization, and benchmarking, all inside a Google Colab environment.

!pip -q install "transformers>=4.49" "optimum[onnxruntime]>=1.20.0" "datasets>=2.20" "evaluate>=0.4" accelerate

from pathlib import Path
import os, time, numpy as np, torch
from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Keep BLAS thread counts deterministic so CPU benchmarks are more stable.
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")

MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
ORT_DIR  = Path("onnx-distilbert")        # exported FP32 ONNX model
Q_DIR    = Path("onnx-distilbert-quant")  # dynamically quantized ONNX model
DEVICE   = "cuda" if torch.cuda.is_available() else "cpu"
BATCH    = 16
MAXLEN   = 128
N_WARM   = 3   # warm-up passes before timing
N_ITERS  = 8   # timed passes per engine

print(f"Device: {DEVICE} | torch={torch.__version__}")

We begin by installing the required libraries and setting up our environment for Hugging Face Optimum with ONNX Runtime. We configure paths, batch size, and iteration settings, and we confirm whether we run on CPU or GPU.

ds = load_dataset("glue", "sst2", split="validation[:20%]")
texts, labels = ds["sentence"], ds["label"]
metric = evaluate.load("accuracy")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def make_batches(texts, max_len=MAXLEN, batch=BATCH):
    for i in range(0, len(texts), batch):
        yield tokenizer(texts[i:i+batch], padding=True, truncation=True,
                        max_length=max_len, return_tensors="pt")

def run_eval(predict_fn, texts, labels):
    preds = []
    for toks in make_batches(texts):
        preds.extend(predict_fn(toks))
    return metric.compute(predictions=preds, references=labels)["accuracy"]

def bench(predict_fn, texts, n_warm=N_WARM, n_iters=N_ITERS):
    for _ in range(n_warm):
        for toks in make_batches(texts[:BATCH*2]):
            predict_fn(toks)
    times = []
    for _ in range(n_iters):
        t0 = time.time()
        for toks in make_batches(texts):
            predict_fn(toks)
        times.append((time.time() - t0) * 1000)
    return float(np.mean(times)), float(np.std(times))

We load an SST-2 validation slice and prepare tokenization, an accuracy metric, and batching. We define run_eval to compute accuracy from any predictor and bench to warm up and time end-to-end inference. With these helpers, we fairly compare different engines using identical data and batching.

torch_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE).eval()

@torch.no_grad()
def pt_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = torch_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()

pt_ms, pt_sd = bench(pt_predict, texts)
pt_acc = run_eval(pt_predict, texts, labels)
print(f"[PyTorch eager]   {pt_ms:.1f}±{pt_sd:.1f} ms | acc={pt_acc:.4f}")

compiled_model = torch_model
compile_ok = False
try:
    compiled_model = torch.compile(torch_model, mode="reduce-overhead", fullgraph=False)
    compile_ok = True
except Exception as e:
    print("torch.compile unavailable or failed -> skipping:", repr(e))

@torch.no_grad()
def ptc_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = compiled_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()

if compile_ok:
    ptc_ms, ptc_sd = bench(ptc_predict, texts)
    ptc_acc = run_eval(ptc_predict, texts, labels)
    print(f"[torch.compile]   {ptc_ms:.1f}±{ptc_sd:.1f} ms | acc={ptc_acc:.4f}")

We load the baseline PyTorch classifier, define a pt_predict helper, and benchmark/score it on SST-2. We then attempt torch.compile for just-in-time graph optimizations and, if successful, run the same benchmarks to compare speed and accuracy under an identical setup.

provider = "CUDAExecutionProvider" if DEVICE == "cuda" else "CPUExecutionProvider"ort_model = ORTModelForSequenceClassification.from_pretrained(   MODEL_ID, export=True, provider=provider, cache_dir=ORT_DIR)@torch.no_grad()def ort_predict(toks):   logits = ort_model(**{k: v.cpu() for k, v in toks.items()}).logits   return logits.argmax(-1).cpu().tolist()ort_ms, ort_sd = bench(ort_predict, texts)ort_acc = run_eval(ort_predict, texts, labels)print(f"[ONNX Runtime]    {ort_ms:.1f}±{ort_sd:.1f} ms | acc={ort_acc:.4f}")Q_DIR.mkdir(parents=True, exist_ok=True)quantizer = ORTQuantizer.from_pretrained(ORT_DIR)qconfig = QuantizationConfig(approach="dynamic", per_channel=False, reduce_range=True)quantizer.quantize(model_input=ORT_DIR, quantization_config=qconfig, save_dir=Q_DIR)ort_quant = ORTModelForSequenceClassification.from_pretrained(Q_DIR, provider=provider)@torch.no_grad()def ortq_predict(toks):   logits = ort_quant(**{k: v.cpu() for k, v in toks.items()}).logits   return logits.argmax(-1).cpu().tolist()oq_ms, oq_sd = bench(ortq_predict, texts)oq_acc = run_eval(ortq_predict, texts, labels)print(f"[ORT Quantized]   {oq_ms:.1f}±{oq_sd:.1f} ms | acc={oq_acc:.4f}")

We export the model to ONNX, run it with ONNX Runtime, then apply dynamic quantization with Optimum's ORTQuantizer and benchmark both to see how latency improves while accuracy stays comparable.
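
As an optional extra check, we can also compare the on-disk footprint of the FP32 export against the quantized graph; dynamic INT8 weights usually shrink the file substantially. This is a minimal sketch that simply reuses the ORT_DIR and Q_DIR folders produced above (the dir_size_mb helper is ours, not part of the tutorial):

from pathlib import Path

def dir_size_mb(path: Path) -> float:
    # Sum the sizes of all ONNX files saved under this directory.
    return sum(f.stat().st_size for f in path.rglob("*.onnx")) / 1e6

print(f"FP32 ONNX export: {dir_size_mb(ORT_DIR):.1f} MB")
print(f"Quantized ONNX:   {dir_size_mb(Q_DIR):.1f} MB")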

pt_pipe  = pipeline("sentiment-analysis", model=torch_model, tokenizer=tokenizer,
                    device=0 if DEVICE == "cuda" else -1)
ort_pipe = pipeline("sentiment-analysis", model=ort_model, tokenizer=tokenizer, device=-1)

samples = [
    "What a fantastic movie—performed brilliantly!",
    "This was a complete waste of time.",
    "I’m not sure how I feel about this one."
]

print("\nSample predictions (PT | ORT):")
for s in samples:
    a = pt_pipe(s)[0]["label"]
    b = ort_pipe(s)[0]["label"]
    print(f"- {s}\n  PT={a} | ORT={b}")

import pandas as pd

rows = [["PyTorch eager", pt_ms, pt_sd, pt_acc],
        ["ONNX Runtime",  ort_ms, ort_sd, ort_acc],
        ["ORT Quantized", oq_ms, oq_sd, oq_acc]]
if compile_ok:
    rows.insert(1, ["torch.compile", ptc_ms, ptc_sd, ptc_acc])

df = pd.DataFrame(rows, columns=["Engine", "Mean ms (↓)", "Std ms", "Accuracy"])
display(df)

print("""
Notes:
- BetterTransformer is deprecated on transformers>=4.49, hence omitted.
- For larger gains on GPU, also try FlashAttention2 models or FP8 with TensorRT-LLM.
- For CPU, tune threads: set OMP_NUM_THREADS/MKL_NUM_THREADS; try NUMA pinning.
- For static (calibrated) quantization, use a static quantization config (is_static=True) with a calibration set.
""")

We sanity-check predictions with quick sentiment pipelines and print PyTorch vs ONNX labels side by side. We then assemble a summary table to compare latency and accuracy across engines, inserting the torch.compile results when available. We conclude with practical notes on extending the workflow to other backends and quantization modes.
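
The final note mentions static (calibrated) quantization. As a rough illustration of how that path differs from the dynamic case, here is a minimal sketch; it assumes the installed Optimum version exposes the get_calibration_dataset / fit workflow documented for ORTQuantizer, and the output folder name and the 100-sample calibration split are illustrative choices, not tuned values:

from functools import partial
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig, AutoCalibrationConfig

STATIC_Q_DIR = Path("onnx-distilbert-static-quant")  # illustrative output folder

# Static quantization needs calibration data to estimate activation ranges.
static_quantizer = ORTQuantizer.from_pretrained(ORT_DIR)
static_qconfig = AutoQuantizationConfig.avx512_vnni(is_static=True, per_channel=False)

def preprocess_fn(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length",
                     truncation=True, max_length=MAXLEN)

calibration_dataset = static_quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)

# Estimate activation ranges with a min-max calibrator, then quantize using them.
calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)
ranges = static_quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    operators_to_quantize=static_qconfig.operators_to_quantize,
)
static_quantizer.quantize(
    save_dir=STATIC_Q_DIR,
    calibration_tensors_range=ranges,
    quantization_config=static_qconfig,
)

The statically quantized model could then be loaded and benchmarked with the same bench and run_eval helpers as the other engines.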

In conclusion, we can clearly see how Optimum helps us bridge the gap between standard PyTorch models and production-ready, optimized deployments. We achieve speedups with ONNX Runtime and quantization while retaining accuracy, and we also explore how torch.compile provides gains directly within PyTorch. This workflow demonstrates a practical approach to balancing performance and efficiency for Transformer models, providing a foundation that can be further extended with advanced backends, such as OpenVINO or TensorRT.
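
For readers who want to try the OpenVINO route mentioned above, the sketch below shows how the same checkpoint could be loaded through optimum-intel instead of optimum-onnxruntime. It assumes an environment with the optimum[openvino] extra installed and is meant as a starting point rather than part of this tutorial's measured comparison:

# Hypothetical extension: run the same classifier on the OpenVINO backend.
# Requires: pip install "optimum[openvino]"
from optimum.intel import OVModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"

# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly.
ov_model = OVModelForSequenceClassification.from_pretrained(MODEL_ID, export=True)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

ov_pipe = pipeline("sentiment-analysis", model=ov_model, tokenizer=tokenizer)
print(ov_pipe("The pacing dragged, but the performances were superb."))

Wrapping ov_pipe (or a small predict function around ov_model) in the same bench and run_eval helpers would let us extend the comparison table with an OpenVINO row.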





