Nvidia Developer · September 3
NVFP4 technology transforms training efficiency for large AI models

In recent years, AI workloads have grown exponentially, and the demand for token processing during the deployment and pretraining/post-training of large language models (LLMs) keeps rising. As organizations scale compute infrastructure to train and deploy multi-billion-parameter foundation models, sustaining higher token throughput has become mission critical. NVIDIA has introduced NVFP4, a 4-bit format aimed at a breakthrough in efficiency and scale during pretraining. Through micro-block scaling, high-precision block encoding, tensor-distribution reshaping, quantization-fidelity techniques, and stochastic rounding, NVFP4 maintains accuracy on par with FP8/BF16 while significantly raising throughput, accelerating convergence cycles, and supporting frontier model scales not reached before. The technology is being developed in collaboration with leading organizations such as Amazon Web Services, Cohere, and Google Cloud, marking a new, high-efficiency era for large-model training.

🔬 NVFP4 uses 4-bit quantization: micro-block scaling (groups of 16 elements share one scaling factor) and high-precision block encoding (E4M3 scale factors instead of power-of-two scaling) sharply cut memory requirements while keeping training accuracy on par with FP8/BF16, addressing the dynamic-range, gradient-volatility, and numerical-stability problems of narrow-precision quantization.

🚀 The NVFP4 pretraining recipe combines random Hadamard transforms to reshape tensor distributions, selective 2D block quantization to keep the forward and backward passes consistent, and stochastic rounding to reduce rounding bias, ensuring stable convergence at 4-bit precision, avoiding the instability and divergence common in low-precision training, and validating the feasibility of trillion-token-scale model training.

💡 Experiments on a 12B Hybrid Mamba-Transformer model (similar to NVIDIA Nemotron Nano 2) show that a model pretrained with NVFP4 on a massive 10T-token dataset produces a validation-loss curve closely matching the FP8 baseline and performs on par with the higher-precision model on downstream tasks, demonstrating the technique's potential to raise training efficiency while preserving accuracy.

🌐 NVFP4 lets AI factories scale faster and more sustainably: by improving compute utilization and communication efficiency it makes larger-scale experiments possible, accelerates progress in generative AI, and lays the groundwork for intelligent systems built on more advanced architectures and trained on more tokens.

🤝 NVIDIA is advancing NVFP4 together with partners including Amazon Web Services, Cohere, Google Cloud, Kimi AI, and Microsoft AI, aiming to turn this breakthrough quantization scheme into an AI training solution that can be deployed at scale and to move the industry toward high-efficiency, low-power training.

In recent years, AI workloads have grown exponentially—not only in the deployment of large language models (LLMs) but also in the demand to process ever more tokens during pretraining and post-training. As organizations scale up compute infrastructure to train and deploy multi-billion-parameter foundation models, the ability to sustain higher token throughput has become mission critical. Progress is increasingly defined not just by efficiency, but by how many tokens an AI factory can push through to unlock the next wave of model capabilities.

AI-optimized data formats have emerged as a key innovation in this effort. Narrow-precision computation has already transformed inference, with NVIDIA’s introduction of NVFP4, a 4-bit format purpose-built to deliver exceptional inference latency, throughput, and efficiency—all while maintaining production-grade accuracy.

Now, NVIDIA is extending this innovation to the pretraining phase, marking a major leap forward in LLM development. Using NVFP4 for pretraining unlocks huge improvements in training LLMs at scale and overall infrastructure efficiency. This isn’t just an incremental optimization—it’s a foundational shift in how large models can be trained at scale. 

In the era of AI factories, where compute is the engine of progress, precision is no longer a backend detail—it’s a strategic advantage. NVFP4 4-bit pretraining redefines the boundaries of efficiency and scalability, setting a new standard for high-performance AI model development. 

NVFP4 training is still in the research phase, exploring and validating the potential of 4-bit precision in large-scale model pretraining. Active engagements and continued collaboration around NVFP4 are ongoing with leading organizations such as Amazon Web Services, Cohere, Google Cloud, Kimi AI, Microsoft AI, Mistral, OpenAI, Perplexity, Reflection, and Runway.

What is 4-bit quantization?

4-bit quantization refers to the process of reducing the precision of model weights and activations to just 4 bits—a dramatic drop from the typical 16-bit or 32-bit floating-point formats. 

Pretraining with 4 bits is challenging because gradients and updates must be handled very carefully to preserve accuracy while improving the overall training speed. Specialized techniques and recipes are required to maintain effectiveness while mapping high-precision tensors to a much smaller set of quantized values. 
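
To make this concrete, here is a minimal NumPy sketch of the general idea: plain symmetric 4-bit integer quantization with a single per-tensor scale. It is illustrative only, not the NVFP4 format itself (described later in this post), and the function names are our own.

    import numpy as np

    def quantize_4bit_symmetric(x: np.ndarray):
        """Map float values onto 16 signed integer levels (-8..7) with one scale per tensor."""
        scale = max(np.abs(x).max() / 7.0, 1e-12)                 # largest magnitude maps to the top level
        q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)   # the 4-bit codes, stored here in int8
        return q, scale

    def dequantize_4bit(q: np.ndarray, scale: float) -> np.ndarray:
        """Recover approximate float values from the 4-bit codes."""
        return q.astype(np.float32) * scale

    x = np.random.default_rng(0).standard_normal(8).astype(np.float32)
    q, scale = quantize_4bit_symmetric(x)
    x_hat = dequantize_4bit(q, scale)
    print("original :", np.round(x, 3))
    print("recovered:", np.round(x_hat, 3))   # close, but coarse: only 16 representable levels remain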

How fewer bits unlock more capability for AI factories

In recent years, AI workloads have grown exponentially—not just in the deployment of large language models (LLMs) but also in the scale of foundation model pretraining and post-training. As organizations expand compute infrastructure to handle training and deployment of multi-billion-parameter models, progress is increasingly defined by how much token throughput an AI factory can sustain to unlock new capabilities.

Inference has already undergone multiple waves of innovation, from FP32 and FP16 down to FP8 and most recently, NVIDIA’s release of NVFP4 for AI inference. While methods like post-training quantization (PTQ) have shown NVFP4 to be a force multiplier in increasing inference throughput while maintaining accuracy, a remaining challenge lies upstream in pretraining—where foundation models still rely on BF16 or FP8 for stability and convergence.

Training is where AI factories can spend the bulk of their compute, power, and time. Power budgets are fixed and GPU cycles are scarce, so developers must account for every bit, token, and epoch. Throughput isn’t an abstract metric here—it directly determines what scale of models can be built, how many experiments can be run, and how quickly breakthroughs arrive.

This is where 4-bit precision becomes transformative. By cutting memory needs, boosting arithmetic throughput, and optimizing communication, 4-bit pretraining allows factories to push significantly more tokens through the same hardware. With the right quantization recipe, it can deliver accuracy on par with FP8/BF16 while dramatically raising throughput—unlocking faster convergence cycles, more experiments per unit of compute, and scaling to unprecedented frontier models. In other words, fewer bits don’t just save money—they expand the frontier of what AI factories can achieve.
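
As a rough illustration of the memory side of this argument, the back-of-envelope sketch below compares weight storage for a 12-billion-parameter model (the size used in the experiments later in this post) at BF16, FP8, and NVFP4. It assumes one 8-bit E4M3 scale factor per 16-element micro-block and ignores optimizer state, activations, and per-tensor scales.

    # Back-of-envelope weight memory for a 12-billion-parameter model at different precisions.
    # NVFP4 stores one 8-bit (E4M3) scale factor per 16-element micro-block on top of the 4-bit values.
    params = 12e9

    bf16_gb  = params * 16 / 8 / 1e9            # 16 bits per weight
    fp8_gb   = params * 8 / 8 / 1e9             #  8 bits per weight
    nvfp4_gb = params * (4 + 8 / 16) / 8 / 1e9  #  4 bits per weight + 0.5 bits of shared-scale overhead

    print(f"BF16 : {bf16_gb:5.1f} GB")          # 24.0 GB
    print(f"FP8  : {fp8_gb:5.1f} GB")           # 12.0 GB
    print(f"NVFP4: {nvfp4_gb:5.1f} GB")         #  6.8 GB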

The NVFP4 quantization recipe for pretraining

To enable pretraining at 4-bit precision, we’ve developed a purpose-built NVFP4 pretraining recipe that addresses the core challenges of dynamic range, gradient volatility, and numerical stability in large-scale training. 

Blackwell was the first architecture from NVIDIA to natively support FP4 formats. The massive FP4 FLOPs throughput on GB200 and GB300 enables efficient 4-bit training by accelerating narrow-precision matrix operations while maintaining the scale and parallelism needed for large model convergence—making them ideal for next-generation AI factories deploying FP4-based pretraining.

Figure 1 below shows measured GEMM performance with Blackwell Ultra, revealing a 7x speedup over the Hopper generation. Modern LLMs fundamentally rely on matrix multiplication, particularly within their fully-connected or linear layers, as a core computational element. This makes the efficiency of these operations crucial. With FP4 precision enabling faster and more efficient execution of these operations, the observed GEMM acceleration means the entire pretraining process—from forward propagation to gradient updates—runs significantly faster, reducing time-to-train and enabling faster development of larger-scale models.

 Figure 1. Measured GEMM performance shows GB300 delivering a 7x speedup over Hopper, accelerating core LLM training operations through faster FP4-optimized matrix multiplications.

To enable efficient narrow-precision training, NVIDIA’s NVFP4 pretraining recipe leverages several key techniques which have been chosen based on their performance and accuracy. These include:

    Enhanced value representation with NVFP4’s micro-block scaling: Blackwell introduces native Tensor Core support for NVFP4, a 4-bit numerical format for both weights and activations that uses micro-block scaling—where each group of sixteen 4-bit elements shares a common scaling factor. By reducing the block size from 32 to 16 elements compared to MXFP4, NVFP4 minimizes the influence of outliers and enables more precise scaling. This finer granularity reduces quantization error and improves overall model accuracy. (A minimal sketch combining this micro-block scheme with the stochastic rounding described below follows Figure 2.)
    NVFP4 high-precision block encoding with E4M3 scale factors: Scale factor precision plays a critical role in quantization quality and accuracy. Unlike MXFP4, which is limited to power-of-two scale factors (E8M0) and prone to high rounding errors, NVFP4 uses higher-precision E4M3 scale factors with additional mantissa bits. This allows finer-grain scaling, better utilization of the limited quantization bins, and more accurate representation of values within a block.
    Reshaping tensor distributions to fit narrow formats: Gradients and activations during LLM pretraining tend to have large outliers that can impact narrow-precision quantization. Applying Hadamard transforms to GEMM inputs helps reshape their distribution to be more Gaussian-like, which smooths outliers and makes tensors easier to represent accurately. These transformations are transparent to the model architecture and can be applied to linear layers in the forward and backward pass. (A sketch of this transform also follows Figure 2.)
    Maintaining fidelity with quantization techniques: To ensure stable and efficient training, we employ quantization methods that preserve consistency between the forward and backward passes. Techniques such as selective 2D block-based quantization help maintain alignment in tensor representations throughout the training cycle. This consistency is key to minimizing signal distortion, improving convergence behavior, and enhancing overall robustness—especially when operating under narrow-precision formats like NVFP4.
    Reducing bias with stochastic rounding: Unlike traditional (deterministic) rounding where gradients are always rounded toward the nearest representable number, stochastic rounding ensures that gradients are rounded up or down randomly, with probabilities proportional to how close a number lies between two representable values. This step is essential for reducing rounding bias, maintaining gradient flow during training, and ultimately improving model accuracy.
Figure 2. NVIDIA’s NVFP4 pretraining techniques for efficient low-precision training
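
The following NumPy sketch illustrates how the micro-block scaling and stochastic rounding pieces of the recipe fit together. It is a simplified stand-in, not NVIDIA's hardware kernels: the 16-element block size follows the description above, the value grid {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6} is the commonly cited FP4 (E2M1) grid, and the plain float scale (standing in for the E4M3 block scale) and the helper names are our own assumptions.

    import numpy as np

    # Representable FP4 (E2M1) magnitudes; with a sign bit these give the sixteen 4-bit values.
    FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
    FP4_MAX = 6.0
    BLOCK = 16          # NVFP4 micro-block size: 16 elements share one scale factor

    def stochastic_round_to_grid(mag, rng):
        """Round magnitudes to a neighboring grid point, with probability proportional to proximity."""
        hi_idx = np.clip(np.searchsorted(FP4_GRID, mag), 1, len(FP4_GRID) - 1)
        lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
        p_up = (mag - lo) / np.maximum(hi - lo, 1e-12)   # closer to hi -> more likely to round up
        return np.where(rng.random(mag.shape) < p_up, hi, lo)

    def quantize_block(block, rng):
        """Quantize one micro-block: shared scale + signed FP4 grid + stochastic rounding."""
        scale = max(np.abs(block).max() / FP4_MAX, 1e-12)   # stand-in for the E4M3 block scale factor
        mag = np.clip(np.abs(block) / scale, 0.0, FP4_MAX)
        return np.sign(block) * stochastic_round_to_grid(mag, rng) * scale   # dequantized for inspection

    rng = np.random.default_rng(0)
    x = rng.standard_normal(BLOCK).astype(np.float32)
    x_hat = quantize_block(x, rng)
    print("max abs error within the block:", float(np.abs(x - x_hat).max()))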
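
And here is a minimal sketch of the tensor-reshaping idea: a randomized Hadamard transform (Sylvester construction plus a random sign flip) applied to an outlier-heavy activation vector, with a simple peak-to-standard-deviation ratio showing how the spikes get spread out. The dimensions, the sign-flip variant, and the metric are illustrative choices rather than the exact recipe; because the scaled Hadamard matrix is orthonormal, the transform can be folded into the surrounding GEMM without changing the model architecture.

    import numpy as np

    def hadamard(n: int) -> np.ndarray:
        """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
        H = np.array([[1.0]])
        while H.shape[0] < n:
            H = np.block([[H, H], [H, -H]])
        return H

    n = 256
    H = hadamard(n) / np.sqrt(n)          # scaled to be orthonormal, so the transform is invertible
    rng = np.random.default_rng(0)
    signs = np.where(rng.random(n) < 0.5, -1.0, 1.0)   # random sign flip for the randomized variant

    # An outlier-heavy activation row: mostly small values plus a few large spikes.
    x = rng.standard_normal(n) * 0.1
    x[::64] = 8.0

    x_rot = (x * signs) @ H               # randomized Hadamard transform spreads the spikes out
    print("peak / std before:", np.abs(x).max() / x.std())
    print("peak / std after :", np.abs(x_rot).max() / x_rot.std())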

NVFP4 makes 4-bit pretraining real: accuracy and stability at trillion-token scale

For narrow-precision formats to be practical in large-scale pretraining, they must ensure both model accuracy and stable convergence. To assess the viability of 4-bit precision in large-scale model training, experiments were conducted with FP8 and NVFP4 on a 12-billion parameter model based on a combined Mamba-Transformer architecture (12B Hybrid Mamba-Transformer model)—similar to NVIDIA Nemotron Nano 2. This model was trained on a massive dataset of 10 trillion tokens using a phased data-blending approach, switching to a different dataset mix for the second phase of training at 70% of pretraining and for the third phase at 90%.

A version of the 12B Hybrid Mamba-Transformer model was initially trained with 8-bit precision—FP8, which has been shown in previous studies to closely match 16-bit precision, and hence served as our baseline for comparison. We then successfully trained this same 12B model from scratch using NVFP4, demonstrating that this new low-precision format can support full pretraining at trillion-token scale. The NVFP4 run exhibited stable convergence without the training instabilities or divergence issues that typically plague ultra-low precision training.

Figure 3 below shows that NVFP4’s validation loss curve closely matches the loss curves from the higher-precision baseline (i.e., FP8) throughout the entire duration of training. The quantization techniques outlined above ensure that even with aggressive bit-width reduction, the 4-bit pretraining dynamics closely resemble those of higher-precision runs.

We then took the 12B Hybrid Mamba-Transformer model pretrained using NVFP4 and compared it to the higher precision FP8 baseline across a range of downstream tasks and intelligence domains. Figure 4 illustrates that across all domains, NVFP4 matches the performance of FP8, highlighting its effectiveness. This finding strengthens the initial hypothesis: NVFP4 is a robust choice for pretraining LLMs even at the trillion-token scale—highlighting its potential for efficient large-scale frontier model training.

Figure 4. Benchmarking downstream task accuracy scores on pretraining the 12B Hybrid Mamba-Transformer model using FP8 precision (baseline) and NVFP4 precision. Pretraining with NVFP4 achieves accuracy comparable with higher precision formats.

Train smarter, not just harder

NVIDIA’s NVFP4 format is redefining the landscape of AI training—setting a new benchmark for speed, efficiency, and purposeful innovation. By enabling 4-bit pretraining, NVFP4 empowers AI factories to scale more rapidly and sustainably, paving the way for the next era of generative AI. As a dynamic and evolving technology, NVFP4 continues to unlock new opportunities for teams building frontier models, driving progress in energy-efficient, high-performance AI. With its breakthrough in compute efficiency, 4-bit pretraining opens the door to more advanced architectures, larger training runs, and significantly more tokens—fueling the future of intelligent systems.
