MarkTechPost@AI · August 27
NVIDIA Releases Jet-Nemotron: 53× Faster LLM Inference at a 98% Lower Cost

NVIDIA researchers have released the Jet-Nemotron family of models (2B and 4B), which uses a new technique called PostNAS to deliver up to 53.6× higher generation throughput while matching or even exceeding the accuracy of today's leading models. The technique retrofits existing pre-trained models rather than retraining them from scratch. Jet-Nemotron tackles a long-standing bottleneck in large language model (LLM) inference efficiency: the O(n²) cost of self-attention, which is especially expensive on long-context tasks. PostNAS performs a "surgical" upgrade of an existing model by freezing its MLP layers, introducing the hardware-efficient JetBlock linear-attention module, and applying hardware-aware design. The result is a hybrid-architecture LLM with significantly lower latency and memory footprint, making it practical to deploy SOTA models on edge devices and opening up broad applications for enterprises, practitioners, and researchers.

🚀 **Efficiency leap and cost reduction:** Through PostNAS, the Jet-Nemotron models (2B and 4B) achieve up to 53.6× higher generation throughput while preserving high accuracy, which translates to serving roughly 53× more users at the same cost, or cutting inference cost by about 98%.

💡 **Core innovation of PostNAS:** PostNAS is a neural architecture search pipeline built specifically for efficiently retrofitting pre-trained models. It freezes the model's knowledge (the MLP layers), replaces compute-heavy full-attention layers with the hardware-efficient JetBlock linear-attention module, and applies hardware-aware design, preserving the model's intelligence while dramatically improving efficiency.

🔄 **JetBlock's distinctive design:** JetBlock introduces convolution kernels that are dynamically conditioned on the input (as opposed to static kernels) and removes redundant convolutions for streamlined efficiency. Combined with hardware-aware hyperparameter search, it matches prior linear-attention designs in throughput while actually improving accuracy.

📱 **Enabling edge devices and broad applications:** Jet-Nemotron's tiny KV cache (154MB) and 2B parameters let it run comfortably on Jetson Orin, RTX 3090, and even mobile chips, with no cloud dependency. This makes applications such as real-time document AI, long-context agents, and on-device assistants possible.

🔬 **Accelerating LLM research and innovation:** PostNAS dramatically lowers the cost and time of LLM architecture innovation. Researchers can quickly test new attention modules and prioritize hardware-aware design, accelerating iteration across the AI ecosystem.

NVIDIA researchers have shattered the longstanding efficiency hurdle in large language model (LLM) inference, releasing Jet-Nemotron—a family of models (2B and 4B) that delivers up to 53.6× higher generation throughput than leading full-attention LLMs while matching, or even surpassing, their accuracy. Most importantly, this breakthrough isn’t the result of a new pre-training run from scratch, but rather a retrofit of existing, pre-trained models using a novel technique called Post Neural Architecture Search (PostNAS). The implications are transformative for businesses, practitioners, and researchers alike.

The Need for Speed in Modern LLMs

While today’s state-of-the-art (SOTA) LLMs, like Qwen3, Llama3.2, and Gemma3, have set new benchmarks for accuracy and flexibility, their O(n²) self-attention mechanism incurs exorbitant costs—both in compute and memory—especially for long-context tasks. This makes them expensive to deploy at scale and nearly impossible to run on edge or memory-constrained devices. Efforts to replace full-attention Transformers with more efficient architectures (Mamba2, GLA, RWKV, etc.) have struggled to close the accuracy gap, until now.
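To make the contrast concrete, here is a minimal PyTorch sketch (illustrative only, not NVIDIA's code) of why decoding with full softmax attention does work proportional to the context length for every new token, on top of a KV cache that keeps growing, while a linear-attention update does constant work per token against a fixed-size state:

```python
import torch

def full_attention_step(q_t, K_cache, V_cache):
    # One decode step with softmax attention: work and memory grow with context length t.
    scores = (q_t @ K_cache.T) / K_cache.shape[-1] ** 0.5   # (1, t) -- O(t) work per step
    return torch.softmax(scores, dim=-1) @ V_cache           # KV cache itself is O(t) memory

def linear_attention_step(q_t, k_t, v_t, state):
    # One decode step with (unnormalized) linear attention: constant work, constant memory.
    state = state + k_t.T @ v_t                               # fixed (d, d) state, independent of context
    return q_t @ state, state

d = 64
q_t, k_t, v_t = (torch.randn(1, d) for _ in range(3))
K_cache, V_cache = torch.randn(1000, d), torch.randn(1000, d) # grows by one row every step
state = torch.zeros(d, d)                                     # never grows

out_full = full_attention_step(q_t, K_cache, V_cache)
out_linear, state = linear_attention_step(q_t, k_t, v_t, state)
```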

Paper: https://arxiv.org/abs/2508.15884v1

PostNAS: A Surgical, Capital-Efficient Overhaul

The core innovation is PostNAS: a neural architecture search pipeline designed specifically for efficiently retrofitting pre-trained models. Here’s how it works:

- **Freeze the knowledge:** the pre-trained model's MLP layers are kept frozen, preserving what the model already knows.
- **Replace the bottleneck:** compute-heavy full-attention layers are swapped for JetBlock, a hardware-efficient linear-attention module.
- **Search with hardware in the loop:** the resulting architecture and its hyperparameters are selected with a hardware-aware search that optimizes for real GPU throughput.

A conceptual sketch of this retrofit step follows.
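The sketch below illustrates the idea in PyTorch; the module and attribute names (`model.layers`, `layer.attn`, `LinearAttentionBlock`) are hypothetical stand-ins, not the released Jet-Nemotron code:

```python
import torch.nn as nn

class LinearAttentionBlock(nn.Module):
    """Stand-in for a JetBlock-style linear-attention module (greatly simplified)."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        state = k.transpose(-2, -1) @ v        # (batch, d, d), simplified non-causal linear attention
        return self.out(q @ state)

def retrofit(model, layers_to_convert, d_model):
    for p in model.parameters():
        p.requires_grad = False                # 1) freeze everything, including the MLPs ("the knowledge")
    for i, layer in enumerate(model.layers):   # hypothetical attribute names
        if i in layers_to_convert:             # 2) layers picked by the hardware-aware search
            layer.attn = LinearAttentionBlock(d_model)   # 3) replace only the attention module
            for p in layer.attn.parameters():
                p.requires_grad = True         # only the new blocks are trained
    return model
```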

JetBlock is particularly noteworthy: it introduces dynamic causal convolution kernels conditioned on input (unlike static kernels in prior linear attention blocks) and removes redundant convolutions for streamlined efficiency. With hardware-aware hyperparameter search, it not only keeps pace with prior linear attention designs in throughput, but actually boosts accuracy.
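The following toy sketch shows the idea of an input-conditioned (dynamic) causal convolution, in contrast to a static learned kernel; it is a simplification for intuition, not the actual JetBlock implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCausalConv(nn.Module):
    """Depthwise causal conv whose kernel is predicted from the input (toy version)."""
    def __init__(self, d_model, kernel_size=4):
        super().__init__()
        self.kernel_size = kernel_size
        # Predicts one depthwise kernel per channel from a pooled summary of x.
        self.kernel_gen = nn.Linear(d_model, d_model * kernel_size)

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, t, d = x.shape
        k = self.kernel_gen(x.mean(dim=1))      # (b, d * kernel_size), conditioned on the input
        k = k.view(b * d, 1, self.kernel_size)
        xp = F.pad(x.transpose(1, 2), (self.kernel_size - 1, 0))  # left-pad => causal
        xp = xp.reshape(1, b * d, -1)           # grouped conv: one group per (batch, channel)
        y = F.conv1d(xp, k, groups=b * d)
        return y.view(b, d, t).transpose(1, 2)  # back to (batch, seq, d_model)

x = torch.randn(2, 16, 64)
y = DynamicCausalConv(64)(x)                    # same shape as x: (2, 16, 64)
```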


Jet-Nemotron: Performance by the Numbers

The key metrics from NVIDIA’s technical paper are staggering:

| Model | MMLU-Pro Acc. | Generation Throughput (tokens/s, H100) | KV Cache Size (MB, 64K context) | Notes |
| --- | --- | --- | --- | --- |
| Qwen3-1.7B-Base | 37.8 | 61 | 7,168 | Full-attention baseline |
| Jet-Nemotron-2B | 39.0 | 2,885 | 154 | 47× throughput, 47× smaller cache |
| Jet-Nemotron-4B | 44.2 | 1,271 | 258 | 21× throughput, still SOTA acc. |
| Mamba2-2.7B | 8.6 | 2,507 | 80 | All-linear, much lower accuracy |
| RWKV7-1.5B | 13.4 | 3,050 | 24 | All-linear, much lower accuracy |
| DeepSeek-V3-Small (MoE) | — | — | — | 2.2B activated, 15B total, lower acc. |

Jet-Nemotron-2B matches or exceeds Qwen3-1.7B-Base on every major benchmark—math, commonsense, coding, retrieval, long-context—while delivering 47× higher generation throughput.

This isn’t a small gain: a 53.6× speedup in decoding at 256K context length means a 98% reduction in inference cost for the same volume of tokens. Prefilling speedups are also dramatic: 6.14× faster at 256K context.

Memory footprint shrinks by 47× (154MB cache vs. 7,168MB for Qwen3-1.7B-Base). This is a game-changer for edge deployment: Jet-Nemotron-2B is 8.84× and 6.5× faster than Qwen2.5-1.5B on Jetson Orin and RTX 3090, respectively.
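As a quick sanity check, the headline ratios follow directly from the numbers reported above:

```python
# Figures taken from the paper / table above.
speedup = 53.6                              # decode throughput gain at 256K context
print(f"inference cost reduction: {1 - 1 / speedup:.1%}")          # ~98.1%

qwen_cache_mb, jet_cache_mb = 7168, 154     # KV cache at 64K context
print(f"KV cache reduction: {qwen_cache_mb / jet_cache_mb:.1f}x")  # ~46.5x
```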


Applications

For Business Leaders: Better ROI

A 53.6× generation throughput gain translates into serving roughly 53× more users on the same hardware budget, or cutting inference cost by about 98% for the same volume of tokens.

For Practitioners: SOTA on the Edge

With a 154MB KV cache and 2B parameters, Jet-Nemotron-2B runs on Jetson Orin, RTX 3090, and even mobile-class chips with no cloud dependency, enabling on-device assistants, real-time document AI, and long-context agents.

For Researchers: Lower Barrier, Higher Innovation

PostNAS slashes the cost and time of testing new attention modules on top of existing pre-trained models, making hardware-aware architecture research far cheaper to iterate on.

Summary

The open-sourcing of Jet-Nemotron and JetBlock (code on GitHub) means the broader AI ecosystem can now retrofit its own models for unprecedented efficiency. PostNAS is not a one-off trick: it is a general-purpose framework for accelerating any Transformer, lowering the cost of future breakthroughs.


Check out the Paper and the GitHub Page.

