Nvidia Developer · September 3
NVFP4 technology transforms training efficiency for large AI models

In recent years, AI workloads have grown exponentially, and the demand for token processing during the deployment and pretraining/post-training of large language models (LLMs) keeps rising. As organizations scale compute infrastructure to train and deploy multi-billion-parameter foundation models, sustaining higher token throughput has become mission critical. NVIDIA has introduced NVFP4, a 4-bit format aimed at a breakthrough in efficiency and scale during pretraining. Through micro-block scaling, high-precision block encoding, tensor-distribution reshaping, quantization-fidelity techniques, and stochastic rounding, NVFP4 maintains accuracy on par with FP8/BF16 while significantly raising throughput, accelerating convergence cycles, and supporting frontier model scales not reached before. The technology is being developed in collaboration with leading organizations such as Amazon Web Services, Cohere, and Google Cloud, marking a new, high-efficiency era for large-model training.

🔬 NVFP4 uses 4-bit quantization: micro-block scaling (groups of 16 elements share one scaling factor) and high-precision block encoding (E4M3 scale factors instead of power-of-two scaling) sharply cut memory requirements while keeping training accuracy on par with FP8/BF16, addressing the dynamic-range, gradient-volatility, and numerical-stability problems of narrow-precision quantization.

🚀 The NVFP4 pretraining recipe combines random Hadamard transforms to reshape tensor distributions, selective 2D block quantization to keep the forward and backward passes consistent, and stochastic rounding to reduce rounding bias, ensuring stable convergence at 4-bit precision, avoiding the instability and divergence common in low-precision training, and validating the feasibility of trillion-token-scale model training.

💡 Experiments on a 12B Hybrid Mamba-Transformer model (similar to NVIDIA Nemotron Nano 2) show that a model pretrained with NVFP4 on a massive 10T-token dataset produces a validation-loss curve closely matching the FP8 baseline and performs on par with the higher-precision model on downstream tasks, demonstrating the technique's potential to raise training efficiency while preserving accuracy.

🌐 NVFP4 lets AI factories scale faster and more sustainably: by improving compute utilization and communication efficiency it makes larger-scale experiments possible, accelerates progress in generative AI, and lays the groundwork for intelligent systems built on more advanced architectures and trained on more tokens.

🤝 NVIDIA is advancing NVFP4 together with partners including Amazon Web Services, Cohere, Google Cloud, Kimi AI, and Microsoft AI, aiming to turn this breakthrough quantization scheme into an AI training solution that can be deployed at scale and to move the industry toward high-efficiency, low-power training.

In recent years, AI workloads have grown exponentially—not only in the deployment of large language models (LLMs) but also in the demand to process ever more tokens during pretraining and post-training. As organizations scale up compute infrastructure to train and deploy multi-billion-parameter foundation models, the ability to sustain higher token throughput has become mission critical. Progress is increasingly defined not just by efficiency, but by how many tokens an AI factory can push through to unlock the next wave of model capabilities.

AI-optimized data formats have emerged as a key innovation in this effort. Narrow-precision computation has already transformed inference, with NVIDIA’s introduction of NVFP4, a 4-bit format purpose-built to deliver exceptional inference latency, throughput, and efficiency—all while maintaining production-grade accuracy.

Now, NVIDIA is extending this innovation to the pretraining phase, marking a major leap forward in LLM development. Using NVFP4 for pretraining unlocks huge improvements in training LLMs at scale and overall infrastructure efficiency. This isn’t just an incremental optimization—it’s a foundational shift in how large models can be trained at scale. 

In the era of AI factories, where compute is the engine of progress, precision is no longer a backend detail—it’s a strategic advantage. NVFP4 4-bit pretraining redefines the boundaries of efficiency and scalability, setting a new standard for high-performance AI model development. 

NVFP4 training is still in the research phase, exploring and validating the potential of 4-bit precision in large-scale model pretraining. Active engagements and continued collaboration around NVFP4 are ongoing with leading organizations such as Amazon Web Services, Cohere, Google Cloud, Kimi AI, Microsoft AI, Mistral, OpenAI, Perplexity, Reflection, and Runway.

What is 4-bit quantization?

4-bit quantization refers to the process of reducing the precision of model weights and activations to just 4 bits—a dramatic drop from the typical 16-bit or 32-bit floating-point formats. 

Pretraining with 4 bits is challenging because gradients and updates must be handled very carefully to preserve accuracy while improving the overall training speed. Specialized techniques and recipes are required to maintain effectiveness while mapping high-precision tensors to a much smaller set of quantized values. 
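
To make this concrete, here is a minimal NumPy sketch of the general idea: plain symmetric 4-bit integer quantization with a single per-tensor scale. It is illustrative only, not the NVFP4 format itself (described later in this post), and the function names are our own.

    import numpy as np

    def quantize_4bit_symmetric(x: np.ndarray):
        """Map float values onto 16 signed integer levels (-8..7) with one scale per tensor."""
        scale = max(np.abs(x).max() / 7.0, 1e-12)                 # largest magnitude maps to the top level
        q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)   # the 4-bit codes, stored here in int8
        return q, scale

    def dequantize_4bit(q: np.ndarray, scale: float) -> np.ndarray:
        """Recover approximate float values from the 4-bit codes."""
        return q.astype(np.float32) * scale

    x = np.random.default_rng(0).standard_normal(8).astype(np.float32)
    q, scale = quantize_4bit_symmetric(x)
    x_hat = dequantize_4bit(q, scale)
    print("original :", np.round(x, 3))
    print("recovered:", np.round(x_hat, 3))   # close, but coarse: only 16 representable levels remain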

How fewer bits unlock more capability for AI factories

In recent years, AI workloads have grown exponentially—not just in the deployment of large language models (LLMs) but also in the scale of foundation model pretraining and post-training. As organizations expand compute infrastructure to handle training and deployment of multi-billion-parameter models, progress is increasingly defined by how much token throughput an AI factory can sustain to unlock new capabilities.

Inference has already undergone multiple waves of innovation, from FP32 and FP16 down to FP8 and most recently, NVIDIA’s release of NVFP4 for AI inference. While methods like post-training quantization (PTQ) have shown NVFP4 to be a force multiplier in increasing inference throughput while maintaining accuracy, a remaining challenge lies upstream in pretraining—where foundation models still rely on BF16 or FP8 for stability and convergence.

Training is where AI factories can spend the bulk of their compute, power, and time. Power budgets are fixed and GPU cycles are scarce, so developers must account for every bit, token, and epoch. Throughput isn’t an abstract metric here—it directly determines what scale of models can be built, how many experiments can be run, and how quickly breakthroughs arrive.

This is where 4-bit precision becomes transformative. By cutting memory needs, boosting arithmetic throughput, and optimizing communication, 4-bit pretraining allows factories to push significantly more tokens through the same hardware. With the right quantization recipe, it can deliver accuracy on par with FP8/BF16 while dramatically raising throughput—unlocking faster convergence cycles, more experiments per unit of compute, and scaling to unprecedented frontier models. In other words, fewer bits don’t just save money—they expand the frontier of what AI factories can achieve.
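
As a rough illustration of the memory side of this argument, the back-of-envelope sketch below compares weight storage for a 12-billion-parameter model (the size used in the experiments later in this post) at BF16, FP8, and NVFP4. It assumes one 8-bit E4M3 scale factor per 16-element micro-block and ignores optimizer state, activations, and per-tensor scales.

    # Back-of-envelope weight memory for a 12-billion-parameter model at different precisions.
    # NVFP4 stores one 8-bit (E4M3) scale factor per 16-element micro-block on top of the 4-bit values.
    params = 12e9

    bf16_gb  = params * 16 / 8 / 1e9            # 16 bits per weight
    fp8_gb   = params * 8 / 8 / 1e9             #  8 bits per weight
    nvfp4_gb = params * (4 + 8 / 16) / 8 / 1e9  #  4 bits per weight + 0.5 bits of shared-scale overhead

    print(f"BF16 : {bf16_gb:5.1f} GB")          # 24.0 GB
    print(f"FP8  : {fp8_gb:5.1f} GB")           # 12.0 GB
    print(f"NVFP4: {nvfp4_gb:5.1f} GB")         #  6.8 GB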

The NVFP4 quantization recipe for pretraining

To enable pretraining at 4-bit precision, we’ve developed a purpose-built NVFP4 pretraining recipe that addresses the core challenges of dynamic range, gradient volatility, and numerical stability in large-scale training. 

Blackwell was the first architecture from NVIDIA to natively support FP4 formats. The massive FP4 FLOPs throughput on GB200 and GB300 enables efficient 4-bit training by accelerating narrow-precision matrix operations while maintaining the scale and parallelism needed for large model convergence—making them ideal for next-generation AI factories deploying FP4-based pretraining.

Figure 1 below shows measured GEMM performance with Blackwell Ultra, revealing a 7x speedup over the Hopper generation. Modern LLMs fundamentally rely on matrix multiplication, particularly within their fully-connected or linear layers, as a core computational element. This makes the efficiency of these operations crucial. With FP4 precision enabling faster and more efficient execution of these operations, the observed GEMM acceleration means the entire pretraining process—from forward propagation to gradient updates—runs significantly faster, reducing time-to-train and enabling faster development of larger-scale models.

 Figure 1. Measured GEMM performance shows GB300 delivering a 7x speedup over Hopper, accelerating core LLM training operations through faster FP4-optimized matrix multiplications.

To enable efficient narrow-precision training, NVIDIA’s NVFP4 pretraining recipe leverages several key techniques which have been chosen based on their performance and accuracy. These include:

    Enhanced value representation with NVFP4’s micro-block scaling: Blackwell introduces native Tensor Core support for NVFP4, a 4-bit numerical format for both weights and activations that uses micro-block scaling—where each group of sixteen 4-bit elements shares a common scaling factor. By reducing the block size from 32 to 16 elements compared to MXFP4, NVFP4 minimizes the influence of outliers and enables more precise scaling. This finer granularity reduces quantization error and improves overall model accuracy. (A minimal sketch combining this micro-block scheme with the stochastic rounding described below follows Figure 2.)
    NVFP4 high-precision block encoding with E4M3 scale factors: Scale factor precision plays a critical role in quantization quality and accuracy. Unlike MXFP4, which is limited to power-of-two scale factors (E8M0) and prone to high rounding errors, NVFP4 uses higher-precision E4M3 scale factors with additional mantissa bits. This allows finer-grain scaling, better utilization of the limited quantization bins, and more accurate representation of values within a block.
    Reshaping tensor distributions to fit narrow formats: Gradients and activations during LLM pretraining tend to have large outliers that can impact narrow-precision quantization. Applying Hadamard transforms to GEMM inputs helps reshape their distribution to be more Gaussian-like, which smooths outliers and makes tensors easier to represent accurately. These transformations are transparent to the model architecture and can be applied to linear layers in the forward and backward pass. (A sketch of this transform also follows Figure 2.)
    Maintaining fidelity with quantization techniques: To ensure stable and efficient training, we employ quantization methods that preserve consistency between the forward and backward passes. Techniques such as selective 2D block-based quantization help maintain alignment in tensor representations throughout the training cycle. This consistency is key to minimizing signal distortion, improving convergence behavior, and enhancing overall robustness—especially when operating under narrow-precision formats like NVFP4.
    Reducing bias with stochastic rounding: Unlike traditional (deterministic) rounding where gradients are always rounded toward the nearest representable number, stochastic rounding ensures that gradients are rounded up or down randomly, with probabilities proportional to how close a number lies between two representable values. This step is essential for reducing rounding bias, maintaining gradient flow during training, and ultimately improving model accuracy.
Figure 2. NVIDIA’s NVFP4 pretraining techniques for efficient low-precision training
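
The following NumPy sketch illustrates how the micro-block scaling and stochastic rounding pieces of the recipe fit together. It is a simplified stand-in, not NVIDIA's hardware kernels: the 16-element block size follows the description above, the value grid {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6} is the commonly cited FP4 (E2M1) grid, and the plain float scale (standing in for the E4M3 block scale) and the helper names are our own assumptions.

    import numpy as np

    # Representable FP4 (E2M1) magnitudes; with a sign bit these give the sixteen 4-bit values.
    FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
    FP4_MAX = 6.0
    BLOCK = 16          # NVFP4 micro-block size: 16 elements share one scale factor

    def stochastic_round_to_grid(mag, rng):
        """Round magnitudes to a neighboring grid point, with probability proportional to proximity."""
        hi_idx = np.clip(np.searchsorted(FP4_GRID, mag), 1, len(FP4_GRID) - 1)
        lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
        p_up = (mag - lo) / np.maximum(hi - lo, 1e-12)   # closer to hi -> more likely to round up
        return np.where(rng.random(mag.shape) < p_up, hi, lo)

    def quantize_block(block, rng):
        """Quantize one micro-block: shared scale + signed FP4 grid + stochastic rounding."""
        scale = max(np.abs(block).max() / FP4_MAX, 1e-12)   # stand-in for the E4M3 block scale factor
        mag = np.clip(np.abs(block) / scale, 0.0, FP4_MAX)
        return np.sign(block) * stochastic_round_to_grid(mag, rng) * scale   # dequantized for inspection

    rng = np.random.default_rng(0)
    x = rng.standard_normal(BLOCK).astype(np.float32)
    x_hat = quantize_block(x, rng)
    print("max abs error within the block:", float(np.abs(x - x_hat).max()))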
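
And here is a minimal sketch of the tensor-reshaping idea: a randomized Hadamard transform (Sylvester construction plus a random sign flip) applied to an outlier-heavy activation vector, with a simple peak-to-standard-deviation ratio showing how the spikes get spread out. The dimensions, the sign-flip variant, and the metric are illustrative choices rather than the exact recipe; because the scaled Hadamard matrix is orthonormal, the transform can be folded into the surrounding GEMM without changing the model architecture.

    import numpy as np

    def hadamard(n: int) -> np.ndarray:
        """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
        H = np.array([[1.0]])
        while H.shape[0] < n:
            H = np.block([[H, H], [H, -H]])
        return H

    n = 256
    H = hadamard(n) / np.sqrt(n)          # scaled to be orthonormal, so the transform is invertible
    rng = np.random.default_rng(0)
    signs = np.where(rng.random(n) < 0.5, -1.0, 1.0)   # random sign flip for the randomized variant

    # An outlier-heavy activation row: mostly small values plus a few large spikes.
    x = rng.standard_normal(n) * 0.1
    x[::64] = 8.0

    x_rot = (x * signs) @ H               # randomized Hadamard transform spreads the spikes out
    print("peak / std before:", np.abs(x).max() / x.std())
    print("peak / std after :", np.abs(x_rot).max() / x_rot.std())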

NVFP4 makes 4-bit pretraining real: accuracy and stability at trillion-token scale

For narrow-precision formats to be practical in large-scale pretraining, they must ensure both model accuracy and stable convergence. To assess the viability of 4-bit precision in large-scale model training, experiments were conducted with FP8 and NVFP4 on a 12-billion parameter model based on a combined Mamba-Transformer architecture (12B Hybrid Mamba-Transformer model)—similar to NVIDIA Nemotron Nano 2. This model was trained on a massive dataset of 10 trillion tokens using a phased data-blending approach, switching to a different dataset mix for the second phase of training at 70% of pretraining and for the third phase at 90%.

A version of the 12B Hybrid Mamba-Transformer model was initially trained with 8-bit precision—FP8, which has been shown in previous studies to closely match 16-bit precision, and hence served as our baseline for comparison. We then successfully trained this same 12B model from scratch using NVFP4, demonstrating that this new low-precision format can support full pretraining at trillion-token scale. The NVFP4 run exhibited stable convergence without the training instabilities or divergence issues that typically plague ultra-low precision training.

Figure 3 below shows that NVFP4’s validation loss curve closely matches the loss curves from the higher-precision baseline (i.e., FP8) throughout the entire duration of training. The quantization techniques outlined above ensure that even with aggressive bit-width reduction, the 4-bit pretraining dynamics closely resemble those of higher-precision runs.

We then took the 12B Hybrid Mamba-Transformer model pretrained using NVFP4 and compared it to the higher precision FP8 baseline across a range of downstream tasks and intelligence domains. Figure 4 illustrates that across all domains, NVFP4 matches the performance of FP8, highlighting its effectiveness. This finding strengthens the initial hypothesis: NVFP4 is a robust choice for pretraining LLMs even at the trillion-token scale—highlighting its potential for efficient large-scale frontier model training.

Figure 4. Benchmarking downstream task accuracy scores on pretraining the 12B Hybrid Mamba-Transformer model using FP8 precision (baseline) and NVFP4 precision. Pretraining with NVFP4 achieves accuracy comparable with higher precision formats.

Train smarter, not just harder

NVIDIA’s NVFP4 format is redefining the landscape of AI training—setting a new benchmark for speed, efficiency, and purposeful innovation. By enabling 4-bit pretraining, NVFP4 empowers AI factories to scale more rapidly and sustainably, paving the way for the next era of generative AI. As a dynamic and evolving technology, NVFP4 continues to unlock new opportunities for teams building frontier models, driving progress in energy-efficient, high-performance AI. With its breakthrough in compute efficiency, 4-bit pretraining opens the door to more advanced architectures, larger training runs, and significantly more tokens—fueling the future of intelligent systems.
