The TensorFlow Blog, September 12
XNNPack Adds Dynamic Range Quantization, Improving TensorFlow Lite CPU Inference Performance

XNNPack’s Fully Connected and Convolution 2D operators now support dynamic range quantization, significantly improving CPU inference performance in the TensorFlow Lite backend: a four-fold speed-up over the single-precision baseline. This means more AI features can be deployed to older and lower-power devices. Dynamic range quantization preserves high accuracy while improving compute efficiency by calculating quantization parameters dynamically during inference.

🔹 Dynamic range quantization is implemented in the Fully Connected and Convolution 2D operators, quadrupling CPU inference performance in the TensorFlow Lite backend compared to the single-precision baseline.

🔹 Unlike full integer quantization, dynamic range quantization quantizes only the weights of the Fully Connected and Convolution operators to 8-bit integers at model conversion time; all other tensors stay float32, and the quantization parameters of the activation tensors are computed dynamically during inference.

🔹 Dynamic range quantization combines most of the performance benefits of full quantization with higher accuracy; its outputs are 32-bit floats, whereas full quantization outputs 8-bit integers.

🔹 All XNNPack architectures are supported (ARM, ARM64, x86 SSE/AVX/AVX512 and WebAssembly), including the latest ArmV9 processors.

🔹 It is simple to use: no representative dataset is required, unsupported operators do not block conversion, and it is therefore better suited to non-expert users.

🔹 It can be combined with half-precision (fp16) inference for the best performance on devices with hardware fp16 support, such as most modern phones.

🔹 Benchmarks show dynamic range quantization matches full integer quantization on several models and in some cases outperforms it; for example, Stable Diffusion’s diffusion model runs up to 6.2x faster.

🔹 Under mixed-precision inference, model accuracy is essentially unchanged, and the quality of generated images is unaffected by fp16 inference.

🔹 XNNPack’s dynamic range quantization is already used in Gemini, Google Meet and Chrome OS audio denoising, and will be rolled out to more products.

Posted by Alan Kelly, Software Engineer

We are excited to announce that XNNPack’s Fully Connected and Convolution 2D operators now support dynamic range quantization. XNNPack is TensorFlow Lite’s CPU backend; CPUs deliver the widest reach for ML inference and remain the default target for TensorFlow Lite. Consequently, improving CPU inference performance is a top priority. We quadrupled inference performance in TensorFlow Lite’s XNNPack backend compared to the single precision baseline by adding support for dynamic range quantization to the Fully Connected and Convolution operators. This means that more AI-powered features may be deployed to older and lower-tier devices.

Previously, XNNPack offered users the choice between either full integer quantization, where the weights and activations are stored as signed 8-bit integers, or half-precision (fp16) or single-precision (fp32) floating-point inference. In this article we demonstrate the benefits of dynamic range quantization.

Dynamic Range Quantization

Dynamically quantized models are similar to fully-quantized models in that the weights for the Fully Connected and Convolution operators are quantized to 8-bit integers during model conversion. All other tensors are not quantized; they remain float32 tensors. During model inference, the floating-point layer activations are converted to 8-bit integers before being passed to the Fully Connected and Convolution operators. The quantization parameters (the zero point and scale) for each row of the activation tensor are calculated dynamically based on the observed range of activations. This maximizes the accuracy of the quantization process, as the activations make full use of all 8 quantized bits. In fully-quantized models, these parameters are fixed during model conversion, based on the range of the activation values observed using a representative dataset. The second difference between full quantization and dynamic range quantization is that the output of the Fully Connected and Convolution operators is in 32-bit floating-point format, as opposed to 8-bit integer for fully-quantized operators. With dynamic range quantization, we get most of the performance gains of full quantization, yet with higher overall accuracy.
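To make the mechanism concrete, here is a minimal NumPy sketch of per-row dynamic quantization of an activation tensor. It illustrates the idea rather than XNNPack’s actual implementation; the function name and constants are ours.

```python
import numpy as np

def quantize_rows(activations: np.ndarray):
    """Dynamically quantize each row of a float32 activation matrix to int8.

    The scale and zero point are derived from the observed min/max of each
    row, so the full int8 range is used regardless of the activation range.
    """
    a_min = activations.min(axis=1, keepdims=True)
    a_max = activations.max(axis=1, keepdims=True)
    # Make sure zero is representable and avoid a zero-width range.
    a_min = np.minimum(a_min, 0.0)
    a_max = np.maximum(a_max, 1e-6)
    scale = (a_max - a_min) / 255.0              # one scale per row
    zero_point = np.round(-128.0 - a_min / scale)  # maps a_min to -128, a_max to 127
    q = np.clip(np.round(activations / scale + zero_point), -128, 127)
    return q.astype(np.int8), scale, zero_point

# Weights were quantized to int8 at conversion time; activations arrive as float32.
activations = np.random.randn(4, 64).astype(np.float32)
q_act, scale, zero_point = quantize_rows(activations)

# A real kernel would multiply q_act against the int8 weights using integer
# arithmetic and rescale the accumulators back to float32. Here we simply
# dequantize to show that the round-trip error is small.
dequantized = (q_act.astype(np.float32) - zero_point) * scale
print(np.abs(dequantized - activations).max())
```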

Traditionally, the inference of such models was done using TensorFlow Lite’s native operators. Now dynamically quantized models can benefit from XNNPack’s highly-optimized per-architecture implementations of the Fully Connected and Convolution 2D operators. These operators are optimized for all architectures supported by XNNPack (ARM, ARM64, x86 SSE/AVX/AVX512 and WebAssembly), including the latest ArmV9 processors such as the Pixel 8’s Tensor G3 CPU or the OnePlus 11’s Snapdragon 8 Gen 2 CPU.

How can you use it?

Two steps are required to use dynamic range quantization. First, convert your model from TensorFlow with dynamic range quantization enabled; existing models already converted with dynamic range quantization do not need to be reconverted. Dynamic range quantization is enabled during model conversion by setting the converter.optimizations = [tf.lite.Optimize.DEFAULT] converter flag. Unlike full integer quantization, no representative dataset is required and unsupported operators do not prevent conversion from succeeding. Dynamic range quantization is therefore far more accessible to non-expert users than full integer quantization.
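For reference, a minimal conversion sketch follows. The converter API and the tf.lite.Optimize.DEFAULT flag are the standard TensorFlow Lite ones mentioned above; the toy model and output file name are placeholders.

```python
import tensorflow as tf

# Build or load the Keras model to be converted (placeholder model here).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
# Enables dynamic range quantization: the weights of Fully Connected and
# Convolution operators are stored as int8, everything else stays float32.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

with open("model_dynamic_range.tflite", "wb") as f:
    f.write(tflite_model)
```

Because no representative dataset is needed, the same snippet works unchanged for models containing operators that full integer quantization cannot handle.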

From TensorFlow 2.17, dynamically quantized XNNPack inference will be enabled by default in prebuilt binaries. If you want to use it sooner, the nightly TensorFlow builds may be used.

Mixed Precision Inference

In our previous article we presented the impressive performance gains from using half precision inference. Half-precision and dynamic range quantization may now be combined within XNNPack to get the best possible on-device CPU inference performance on devices which have hardware fp16 support (most phones on sale today do). The Fully Connected and Convolution 2D operators can output fp16 data instead of fp32. The Pixel 3, released in 2018, was the first Pixel model with fp16 support. fp16 uses half as many bits to store a floating-point value compared to fp32, meaning that the relative accuracy of each value is reduced due to the significantly shorter mantissa (10 vs 23 bits). Not all models support fp16 inference, but if a model supports it, the computational cost of vectorized floating-point operators can be reduced by half as the CPU can process twice as much data per instruction. Dynamically quantized models with compute-intensive floating point operators, such as Batch Matrix Multiply and Softmax, can benefit from fp16 inference as well.
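To put the mantissa difference into numbers, the short snippet below reads NumPy’s type metadata for fp16 and fp32; it illustrates the precision trade-off and is not part of the TensorFlow Lite API.

```python
import numpy as np

for dtype in (np.float16, np.float32):
    info = np.finfo(dtype)
    # nmant is the number of stored mantissa bits (10 for fp16, 23 for fp32);
    # eps is the relative spacing between representable values near 1.0.
    print(f"{info.dtype}: mantissa bits = {info.nmant}, machine epsilon = {info.eps:g}")

# fp16 gives roughly 3 decimal digits of precision versus about 7 for fp32,
# but halves memory traffic and lets the CPU process twice as much data per
# vector instruction on hardware with native fp16 arithmetic.
```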

Performance Improvements

Below, we present benchmarks on four public models covering common computer vision tasks:

    EfficientNetV2 - image classification and feature extraction
    Inception-v3 - image classification
    Deeplab-v3 - semantic segmentation
    Stable Diffusion - image generation (diffusion model)

Each model was converted three times where possible: full float, full 8-bit signed integer quantization, and dynamic range quantization. Stable Diffusion’s diffusion model could not be converted using full integer quantization due to unsupported operators. The speed-up versus the original float32 model using TFLite’s kernels is shown below.
