NVIDIA Developer, September 3
NVIDIA Blackwell Ultra GPU: A Powerful Engine for a New Era of AI

As the newest member of the Blackwell architecture family, the NVIDIA Blackwell Ultra GPU significantly improves the performance, scalability, and efficiency of AI training and inference through an innovative dual-die design, the advanced NVFP4 precision format, and fifth-generation Tensor Cores. With up to 288 GB of HBM3E memory and 8 TB/s of memory bandwidth, it supports trillion-parameter models and accelerates the attention mechanisms at the core of AI workloads, providing a powerful computing foundation for AI factories and large-scale, real-time AI services and pushing AI applications to new heights.

🚀 **Architectural innovation and a performance leap**: The Blackwell Ultra GPU uses an innovative dual-die design connected by the NVIDIA High-Bandwidth Interface (NV-HBI) and integrates 208 billion transistors, 2.6x more than the Hopper architecture. It carries 160 Streaming Multiprocessors (SMs) and 640 fifth-generation Tensor Cores that support the NVFP4 precision format, delivering up to 15 PetaFLOPS of dense NVFP4 compute, a 7.5x increase over the Hopper H100/H200 GPUs and an unprecedented performance uplift for AI workloads.

💡 **NVFP4 and attention acceleration**: Blackwell Ultra introduces NVIDIA NVFP4, a new 4-bit floating-point format that delivers near-FP8 accuracy while significantly reducing memory footprint. It also accelerates the attention layers of AI models by doubling SFU throughput, speeding up AI inference by up to 2x, especially for models with long sequences and large context windows, while lowering compute cost and latency.

💾 **Massive memory and high bandwidth**: The GPU is equipped with up to 288 GB of HBM3E memory and 8 TB/s of memory bandwidth, 3.6x that of the H100 GPU. This capacity and bandwidth are critical for running trillion-parameter models, supporting longer context lengths without KV-cache offloading and markedly improving the efficiency and economics of high-concurrency inference in AI factories.

🔗 **Powerful interconnect and ecosystem compatibility**: Blackwell Ultra supports fifth-generation NVLink with 1.8 TB/s of bidirectional GPU-to-GPU bandwidth, plus memory-coherent communication with the Grace CPU over NVLink-C2C. At the same time, it maintains full backward compatibility with the entire CUDA ecosystem and provides native support for next-generation AI frameworks such as TensorRT-LLM and vLLM, so developers can transition smoothly and take full advantage of the new hardware.

⚙️ **Enterprise-grade features and multimodal processing**: Beyond raw compute, Blackwell Ultra integrates enterprise-grade features such as an enhanced GigaThread Engine and Multi-Instance GPU (MIG) technology to improve scheduling efficiency and resource utilization. It also includes dedicated video and JPEG decode engines supporting modern codecs such as AV1 and HEVC, integrated with the NVIDIA DALI library to accelerate data preprocessing and model-input preparation for multimodal AI workloads.

As the latest member of the NVIDIA Blackwell architecture family, the NVIDIA Blackwell Ultra GPU builds on core innovations to accelerate training and AI reasoning. It fuses silicon innovations with new levels of system-level integration, delivering next-level performance, scalability, and efficiency for AI factories and the large-scale, real-time AI services they power.

With its energy-efficient dual-reticle design, high bandwidth and large-capacity HBM3E memory subsystem, fifth-generation Tensor Cores, and breakthrough NVFP4 precision format, Blackwell Ultra is raising the bar for accelerated computing. This in-depth look explains the architectural advances, why they matter, and how they translate into measurable gains for AI workloads.

Dual-reticle design: one GPU

Blackwell Ultra is composed of two reticle-sized dies connected using NVIDIA High-Bandwidth Interface (NV-HBI), a custom, power-efficient die-to-die interconnect technology that provides 10 TB/s of bandwidth. Blackwell Ultra is manufactured using TSMC 4NP and features 208B transistors (2.6x more than the NVIDIA Hopper GPU), all while functioning as a single, NVIDIA CUDA-programmed accelerator. This enables a large increase in performance while also maintaining the familiar CUDA programming model that developers have enjoyed for nearly two decades.

Benefits

- Unified compute domain: 160 Streaming Multiprocessors (SMs) across two dies, providing 640 fifth-generation Tensor Cores with 15 PetaFLOPS of dense NVFP4 compute.
- Full coherence: shared L2 cache with fully coherent memory accesses.
- Maximum silicon utilization: peak performance per square millimeter.

Figure 1. NVIDIA Blackwell Ultra GPU chip explained

Streaming multiprocessors: compute engines for the AI Factory

As shown in Figure 1, the heart of Blackwell Ultra is its 160 Streaming Multiprocessors (SMs) organized into eight Graphics Processing Clusters (GPCs) in the full GPU implementation. Every SM, shown in Figure 2, is a self-contained compute engine housing:

- 128 CUDA Cores for FP32 and INT32 operations, as well as FP16/BF16 and other precisions.
- 4 fifth-generation Tensor Cores with the NVIDIA second-generation Transformer Engine, optimized for FP8, FP6, and NVFP4.
- 256 KB of Tensor Memory (TMEM) for warp-synchronous storage of intermediate results, enabling higher reuse and reduced off-chip memory traffic.
- Special Function Units (SFUs) for transcendental math and special operations used in AI kernels.

NVIDIA Tensor Cores, AI compute powerhouses

When NVIDIA first introduced Tensor Cores in the Volta architecture, they fundamentally changed what GPUs could do for deep learning. Instead of executing scalar or vector operations one element at a time, Tensor Cores operate directly on small matrices—performing matrix multiply-accumulate (MMA) in a single instruction. This was a perfect match for neural networks, where the vast majority of computation comes down to multiplying and summing large grids of numbers.
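
To make the MMA primitive concrete, here is a minimal NumPy sketch of the D = A x B + C tile operation that a single MMA instruction performs. It is illustrative only; tile shapes vary by generation and precision, and nothing here runs on Tensor Cores:

```python
import numpy as np

M, N, K = 16, 16, 16  # example tile shape

A = np.random.randn(M, K).astype(np.float16)  # low-precision input tile
B = np.random.randn(K, N).astype(np.float16)  # low-precision input tile
C = np.zeros((M, N), dtype=np.float32)        # FP32 accumulator tile

# One matrix multiply-accumulate: the multiply-and-sum is carried out
# in FP32 so accuracy survives the FP16 inputs.
D = A.astype(np.float32) @ B.astype(np.float32) + C
```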

Over successive generations, Tensor Cores have expanded in capability, precision formats, and parallelism:

- NVIDIA Volta: 8-thread MMA units, FP16 with FP32 accumulation for training.
- NVIDIA Ampere: full warp-wide MMA, BF16, and TensorFloat-32 formats.
- NVIDIA Hopper: warp-group MMA across 128 threads, Transformer Engine with FP8 support.

Blackwell and Blackwell Ultra take this to the next level with their fifth-generation Tensor Cores and second-generation Transformer Engine, delivering higher throughput and lower latency for both dense and sparse AI workloads. In Blackwell Ultra, each of the 160 Streaming Multiprocessors (SMs) contains four Tensor Cores, for a total of 640 Tensor Cores, all upgraded to handle the newest precision format, NVFP4.

These enhancements aren’t just about raw FLOPS. The new Tensor Cores are tightly integrated with 256 KB of Tensor Memory (TMEM) per SM, optimized to keep data close to the compute units. They also support dual-thread-block MMA, where paired SMs cooperate on a single MMA operation, sharing operands and reducing redundant memory traffic.

The result is higher sustained throughput, better memory efficiency, and faster large-batch pre-training, reinforcement learning for post-training, and low-batch, high-interactivity inference.

Ultra-charged NVFP4 performance

The introduction of NVIDIA NVFP4, the new 4‑bit floating‑point format in the Blackwell GPU architecture, combines two-level scaling—an FP8 (E4M3) micro-block scale applied to 16‑value blocks plus a tensor-level FP32 scale—enabling hardware‑accelerated quantization with markedly lower error rates than standard FP4. This Tensor Core capability delivers nearly FP8‑equivalent accuracy (with often less than ~1% difference), while reducing memory footprint by ~1.8x compared to FP8 and up to ~3.5x vs. FP16. NVFP4 strikes an optimal balance of accuracy, efficiency, and performance for low‑precision AI inference.
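
A rough NumPy sketch of the two-level scaling idea follows. It is a simplification: real NVFP4 stores block scales as FP8 (E4M3), uses hardware rounding, and runs inside the Tensor Core datapath:

```python
import numpy as np

# FP4 (E2M1) representable magnitudes: 2 exponent bits, 1 mantissa bit.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def quantize_nvfp4_like(x, block=16):
    """Toy two-level-scaled FP4 quantize/dequantize round trip."""
    x = np.asarray(x, dtype=np.float32)
    out = np.empty_like(x)
    # Tensor-level FP32 scale maps the largest magnitude onto the grid.
    tensor_scale = max(float(np.abs(x).max()) / 6.0, 1e-12)
    for i in range(0, x.size, block):
        blk = x.flat[i:i + block] / tensor_scale
        # Per-16-value micro-block scale (stored as FP8 E4M3 in real NVFP4).
        block_scale = max(float(np.abs(blk).max()) / 6.0, 1e-12)
        mag = np.abs(blk) / block_scale
        # Snap each value to the nearest representable FP4 magnitude.
        q = FP4_GRID[np.abs(mag[:, None] - FP4_GRID).argmin(axis=1)]
        out.flat[i:i + block] = np.sign(blk) * q * block_scale * tensor_scale
    return out

w = np.random.randn(4096).astype(np.float32)
print(f"mean abs error: {np.abs(quantize_nvfp4_like(w) - w).mean():.4f}")
```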

The Blackwell Ultra dense NVFP4 compute capability provides a substantial performance uplift over the original Blackwell GPU. While the base architecture delivers 10 petaFLOPS of NVFP4 performance, Ultra pushes that to 15 petaFLOPS—a 1.5x increase compared to Blackwell GPU and 7.5x increase from NVIDIA Hopper H100 and H200 GPUs, as shown in Figure 3. This boost directly benefits large-scale inference, enabling more concurrent model instances, faster response times, and lower costs per token generated.

Figure 3. Blackwell Ultra GPU delivers 1.5x more dense NVFP4 throughput compared to Blackwell

Accelerated softmax in the attention layer

Modern AI workloads rely heavily on attention processing with long input contexts and long output sequences for “thinking”. Transformer attention layers, in turn, stress exponentials, divisions, and other transcendental operations executed by the SM’s SFUs.

In Blackwell Ultra, SFU throughput has been doubled for key instructions used in attention, delivering up to 2x faster attention-layer compute compared to Blackwell GPUs. This improvement accelerates both short and long-sequence attention, but is especially impactful for reasoning models with large context windows—where the softmax stage can become a latency bottleneck.
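
The exponential-heavy stage is easy to see in a standard numerically stable softmax. This short NumPy sketch marks the operations, exponentials and divisions, that map onto the SFUs:

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a row of attention scores."""
    shifted = scores - scores.max(axis=-1, keepdims=True)  # avoid overflow
    exps = np.exp(shifted)  # SFU-bound: one exponential per score
    return exps / exps.sum(axis=-1, keepdims=True)  # SFU-bound division

# One attention row at a 128K-token context: 131,072 exponentials per
# query, repeated for every query position and head, which is why
# doubling SFU throughput for these instructions matters.
probs = softmax(np.random.randn(131072))
```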

By accelerating the attention mechanism within transformer models, Blackwell Ultra enables:

- Faster AI reasoning with lower time-to-first-token in interactive applications.
- Lower compute costs by reducing total processing cycles per query.
- Higher system efficiency: more attention sequences processed per watt.

As depicted in Figure 4, the performance gains from the accelerated attention-layer instructions in Blackwell Ultra compound with NVFP4 precision, resulting in a step-function improvement for LLM and multimodal inference.

Figure 4. Blackwell Ultra attention-layer acceleration

Memory: high capacity and bandwidth for multi-trillion-parameter models

Blackwell Ultra doesn't just scale compute; it scales memory capacity to meet the demands of the largest AI models. With 288 GB of HBM3E per GPU, it offers 3.6x more on-package memory than H100 and 50% more than Blackwell, as shown in Figure 5. This capacity is critical for hosting trillion-parameter models, extending context length without KV-cache offloading, and enabling high-concurrency inference in AI factories.

High bandwidth memory features

- Max capacity: 288 GB, a 3.6x increase over H100
- HBM configuration: 8 stacks, 16 × 512-bit controllers (8,192-bit total width)
- Bandwidth: 8 TB/s per GPU, a 2.4x improvement over H100 (3.35 TB/s)

Figure 5. HBM capacity scaling across GPU generations

This massive memory footprint enables:

- Complete model residence: 300B+ parameter models without memory offloading.
- Extended context lengths: larger KV-cache capacity for transformer models (sized in the sketch below).
- Improved compute efficiency: higher compute-to-memory ratios for diverse workloads.
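
To see why this capacity matters, here is a back-of-envelope KV-cache estimate. The model shape below is hypothetical, chosen only for illustration; real deployments vary with attention variant, quantization, and cache paging:

```python
def kv_cache_gb(num_layers, num_kv_heads, head_dim,
                context_len, batch_size, bytes_per_elem=2):
    """KV-cache size in GB: keys + values for every layer, head, and token."""
    per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token_bytes * context_len * batch_size / 1e9

# Hypothetical 70B-class model: 80 layers, 8 KV heads (GQA), head_dim 128,
# FP16 cache. Each concurrent 128K-token request then needs:
print(f"{kv_cache_gb(80, 8, 128, 131072, 1):.1f} GB")  # ~42.9 GB per request
# With the model weights also resident in HBM, the remaining capacity
# directly bounds how many long-context requests fit without KV offload.
```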

Interconnect: built for scale

Blackwell and Blackwell Ultra support fifth-generation NVIDIA NVLink for GPU-to-GPU communication over NVLink Switch, NVLink-C2C for coherent interconnect to an NVIDIA Grace CPU, and an x16 PCIe Gen 6 interface for connection to host CPUs.

- Per-GPU bandwidth: 1.8 TB/s bidirectional (18 links × 100 GB/s; sanity-checked in the sketch below)
- Performance scaling: 2x improvement over NVLink 4 (Hopper GPU)
- Maximum topology: 576 GPUs in a non-blocking compute fabric
- Rack-scale integration: 72-GPU NVL72 configurations with 130 TB/s of aggregate bandwidth
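
The headline numbers compose straightforwardly; here is a quick sanity check of the per-GPU and NVL72 aggregate figures:

```python
# NVLink 5: 18 links per GPU at 100 GB/s each (bidirectional).
links, gbs_per_link = 18, 100
per_gpu_gbs = links * gbs_per_link       # 1,800 GB/s = 1.8 TB/s per GPU

# NVL72 rack: 72 GPUs behind NVLink Switch.
aggregate_tbs = 72 * per_gpu_gbs / 1000  # 129.6, i.e. ~130 TB/s aggregate
print(per_gpu_gbs, aggregate_tbs)
```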

Host connectivity:

- PCIe interface: Gen 6 × 16 lanes (256 GB/s bidirectional)
- NVLink-C2C: Grace CPU-GPU communication with memory coherency (900 GB/s)

Table 1 provides a comparison of the interconnects across generations.

| Interconnect | Hopper GPU | Blackwell GPU | Blackwell Ultra GPU |
|---|---|---|---|
| NVLink (GPU-GPU) | 900 | 1,800 | 1,800 |
| NVLink-C2C (CPU-GPU) | 900 | 900 | 900 |
| PCIe interface | 128 (Gen 5) | 256 (Gen 6) | 256 (Gen 6) |

Table 1. Interconnect comparison of Hopper, Blackwell, and Blackwell Ultra (in bidirectional GB/s)

Advancing performance-efficiency

Blackwell Ultra delivers a decisive leap over Blackwell by adding 50% more NVFP4 compute and 50% more HBM capacity per chip, enabling larger models and faster throughput without compromising efficiency. Accelerated softmax execution further boosts real-world inference speeds, driving up tokens per second per user (TPS/user) while improving data center tokens per second per megawatt (TPS/MW). Every architectural enhancement was purpose-built to push both user experience and operational efficiency to the next level.
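
A toy calculation with invented numbers, purely to fix the units of these two metrics:

```python
# Invented numbers, only to illustrate TPS/user vs. TPS/MW.
factory_tokens_per_sec = 1_000_000   # total decode throughput (tokens/s)
concurrent_users = 20_000
facility_power_mw = 1.2              # power drawn by the deployment (MW)

tps_per_user = factory_tokens_per_sec / concurrent_users  # experience: 50.0
tps_per_mw = factory_tokens_per_sec / facility_power_mw   # economics: ~833K
print(tps_per_user, tps_per_mw)
```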

As shown in Figure 6, plotting these two metrics for the NVIDIA Hopper HGX H100 NVL8 system, NVIDIA Blackwell HGX B200 NVL8 system, NVIDIA Blackwell GB200 NVL72 system, and NVIDIA Blackwell Ultra GB300 NVL72 system reveals a generational leap. The curve starts with Hopper NVL8 at FP8 precision and ends with Blackwell Ultra NVL72 at NVFP4 precision—showing how each architectural advance pushes the Pareto frontier up and to the right.

Figure 6. AI factory output evolution from Hopper to Blackwell Ultra

These architectural innovations improve the economics of AI inference and redefine what’s possible in AI factory design—delivering more model instances, faster responses, and higher output per megawatt than any previous NVIDIA platform.

To see firsthand how innovations in hardware and deployment configurations impact data center efficiency and user experience, check out our interactive Pareto Frontier explainer.

Enterprise-grade features

Blackwell Ultra isn’t just about raw performance—it’s designed with enterprise-grade features that simplify operations, strengthen security, and deliver reliable performance at scale.

Advanced scheduling and management

- Enhanced GigaThread Engine: next-generation work scheduler providing improved context-switching performance and optimized workload distribution across all 160 SMs.
- Multi-Instance GPU (MIG): Blackwell Ultra GPUs can be partitioned into different-sized MIG instances. For example, an administrator can create two instances with 140 GB of memory each, four instances with 70 GB each, or seven instances with 34 GB each, enabling secure multi-tenancy with predictable performance isolation.

Security and reliability

- Confidential computing and secure AI: secure, performant protection for sensitive AI models and data, extending the hardware-based Trusted Execution Environment (TEE) to GPUs with industry-first TEE-I/O capabilities in the Blackwell architecture and inline NVLink protection for near-identical throughput compared to unencrypted modes.
- Advanced NVIDIA Reliability, Availability, and Serviceability (RAS) engine: an AI-powered reliability system monitoring thousands of parameters to predict failures, optimize maintenance schedules, and maximize system uptime in large-scale deployments.

AI video and data processing enhancements

Blackwell Ultra also integrates specialized engines for modern AI workloads requiring multimodal data processing:

- Video and JPEG decoding: The NVIDIA Video Decoder (NVDEC) and NVIDIA JPEG Decoder (NVJPEG) engines are specialized fixed-function hardware units for high-throughput image and video processing. NVDEC supports modern codecs like AV1, HEVC, and H.264, enabling batch or real-time video decoding directly on the GPU without using CUDA Cores. NVJPEG accelerates JPEG decompression in hardware, making large-scale image pipelines dramatically faster. Both engines are leveraged by NVIDIA DALI (Data Loading Library), which integrates them into AI training and inference workflows for tasks like image augmentation, dataset preprocessing, and multimodal model input preparation (see the DALI sketch below).
- Decompression engine: hardware-accelerated data decompression at 800 GB/s throughput, reducing CPU overhead and accelerating compressed-dataset loading for analytics workloads. NVIDIA nvCOMP enables portable programming of the decompression engine.
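
A minimal NVIDIA DALI pipeline exercising these decoders might look like the following sketch; the file path is a placeholder, and `device="mixed"` is what routes JPEG decoding through the GPU decode hardware instead of CUDA Cores:

```python
from nvidia.dali import fn, pipeline_def

@pipeline_def(batch_size=64, num_threads=4, device_id=0)
def image_pipeline():
    # Read encoded JPEG bytes from disk (placeholder dataset root).
    jpegs, labels = fn.readers.file(file_root="/data/train")
    # "mixed" decodes on the GPU, leveraging the hardware JPEG engine.
    images = fn.decoders.image(jpegs, device="mixed")
    # GPU-side resize keeps preprocessing off the CPU critical path.
    images = fn.resize(images, resize_x=224, resize_y=224)
    return images, labels

pipe = image_pipeline()
pipe.build()
images, labels = pipe.run()  # one preprocessed batch, resident on the GPU
```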

NVIDIA GPU chip summary comparison

To put Blackwell Ultra’s advances in perspective, Table 2 compares key chip specifications across Hopper, Blackwell, and Blackwell Ultra. It highlights the generational leap in transistor count, memory capacity, interconnect bandwidth, and precision compute throughput—as well as the architectural enhancements like attention acceleration and NVFP4. This side-by-side view shows how Blackwell Ultra scales up performance and extends capabilities critical for AI factory deployments at both node and rack scale.

| Feature | Hopper | Blackwell | Blackwell Ultra |
|---|---|---|---|
| Manufacturing process | TSMC 4N | TSMC 4NP | TSMC 4NP |
| Transistors | 80B | 208B | 208B |
| Dies per GPU | 1 | 2 | 2 |
| NVFP4 performance (dense / sparse) | N/A | 10 / 20 PetaFLOPS | 15 / 20 PetaFLOPS |
| FP8 performance (dense / sparse) | 2 / 4 PetaFLOPS | 5 / 10 PetaFLOPS | 5 / 10 PetaFLOPS |
| Attention acceleration (SFU EX2) | 4.5 TeraExponentials/s | 5 TeraExponentials/s | 10.7 TeraExponentials/s |
| Max HBM capacity | 80 GB HBM (H100), 141 GB HBM3E (H200) | 192 GB HBM3E | 288 GB HBM3E |
| Max HBM bandwidth | 3.35 TB/s (H100), 4.8 TB/s (H200) | 8 TB/s | 8 TB/s |
| NVLink bandwidth | 900 GB/s | 1,800 GB/s | 1,800 GB/s |
| Max power (TGP) | Up to 700 W | Up to 1,200 W | Up to 1,400 W |
Table 2. NVIDIA GPU chip comparison

From chip to AI factory

Blackwell Ultra GPUs form the backbone of NVIDIA’s next-generation AI infrastructure—delivering transformative performance from desktop superchips to full AI factory racks.  

NVIDIA Grace Blackwell Ultra Superchip

This superchip couples one Grace CPU with two Blackwell Ultra GPUs through NVLink-C2C, offering up to 30 PFLOPS of dense (40 PFLOPS sparse) NVFP4 AI compute, and boasts 1 TB of unified memory combining HBM3E and LPDDR5X for unprecedented on-node capacity. ConnectX-8 SuperNICs provide 800 Gb/s of high-speed network connectivity (see Figure 7). The NVIDIA Grace Blackwell Ultra Superchip is the foundational computing component of the GB300 NVL72 rack-scale system.

Figure 7. NVIDIA Grace Blackwell Ultra Superchip with ConnectX-8 SuperNICs

- NVIDIA GB300 NVL72 rack-scale system: This liquid-cooled rack integrates 36 Grace Blackwell Ultra Superchips, interconnected through NVLink 5 and NVLink Switching, to achieve 1.1 exaFLOPS of dense FP4 compute. The GB300 NVL72 also enables 50x higher AI factory output, combining 10x better latency (TPS per user) with 5x higher throughput per megawatt relative to Hopper platforms. GB300 systems also redefine rack power management: they rely on multiple power-shelf configurations to handle synchronous GPU load ramps, and NVIDIA power-smoothing innovations, including energy storage and burn mechanisms, help stabilize power draw across training workloads.
- NVIDIA HGX and DGX B300 systems: Standardized 8-GPU Blackwell Ultra configurations. NVIDIA HGX B300 and NVIDIA DGX B300 systems continue to support flexible deployment models for AI infrastructure while maintaining full CUDA and NVLink compatibility.

Complete CUDA compatibility

Blackwell Ultra maintains full backward compatibility with the entire CUDA ecosystem while introducing optimizations for next-generation AI frameworks:

- Framework integration: native support in SGLang, TensorRT-LLM, and vLLM with optimized kernels for NVFP4 precision and the dual-die architecture.
- NVIDIA Dynamo: a distributed inference and scheduling framework that intelligently orchestrates workloads across thousands of GPUs, delivering up to 30x higher throughput for large-scale deployments.
- NVIDIA AI Enterprise: an end-to-end, cloud-native AI software platform delivering optimized frameworks, SDKs, microservices, and enterprise-grade tools for developing, deploying, and managing AI workloads at scale.
- NVIDIA development tools and CUDA libraries:
  - CUTLASS for custom kernel development
  - Nsight Systems and Nsight Compute for profiling and tuning
  - Model Optimizer for precision-aware graph optimization
  - cuDNN for deep learning primitives
  - NCCL for multi-GPU communication
  - CUDA Graphs for reducing launch overhead (see the sketch below)
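
As one example from the list above, here is a minimal PyTorch CUDA Graphs sketch (the model and shapes are placeholders) that captures a static forward pass once and replays it to amortize kernel-launch overhead:

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda().eval()  # placeholder model
static_input = torch.randn(8, 4096, device="cuda")

# Warm up outside capture so lazy initialization isn't recorded.
with torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.synchronize()

# Capture one forward pass into a graph.
graph = torch.cuda.CUDAGraph()
with torch.no_grad(), torch.cuda.graph(graph):
    static_output = model(static_input)

# Replay: refill the captured input buffer, then relaunch the whole
# recorded kernel sequence with a single graph launch.
static_input.copy_(torch.randn(8, 4096, device="cuda"))
graph.replay()
result = static_output.clone()
```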

The bottom line

NVIDIA Blackwell Ultra establishes the foundation for AI factories to train and deploy intelligence at unprecedented scale and efficiency. With breakthrough innovations in dual-die integration, NVFP4 acceleration, massive memory capacity, and advanced interconnect technology, Blackwell Ultra enables AI applications that were previously computationally impossible.

As the industry transitions from proof-of-concept AI to production AI factories, Blackwell Ultra provides the computational foundation to turn AI ambitions into reality with unmatched performance, efficiency, and scale.

Learn more

Dive deeper into the innovations powering the trillion-token era. Download the Blackwell Architecture Technical Brief to explore the full silicon-to-system story.

Acknowledgments

We’d like to thank Manas Mandal, Ronny Krashinsky, Vishal Mehta, Greg Palmer, Michael Andersch, Eduardo Alvarez, Ashraf Eassa, Joe DeLaere, and many other NVIDIA GPU architects, engineers, and product leaders who contributed to this post.
