MarkTechPost@AI · August 27
NVIDIA Releases Jet-Nemotron: 53× Faster LLM Inference at a 98% Lower Cost

NVIDIA researchers have released the Jet-Nemotron family of models (2B and 4B), which uses a new technique called PostNAS to deliver up to 53.6× higher generation throughput while matching or even exceeding the accuracy of today's leading models. The technique retrofits existing pre-trained models rather than retraining them from scratch. Jet-Nemotron tackles a long-standing bottleneck in large language model (LLM) inference efficiency: the O(n²) cost of self-attention, which is especially expensive on long-context tasks. PostNAS performs a "surgical" upgrade of an existing model by freezing its MLP layers, introducing the hardware-efficient JetBlock linear-attention module, and applying hardware-aware design. The result is a hybrid-architecture LLM with significantly lower latency and memory footprint, making it practical to deploy SOTA models on edge devices and opening up broad applications for enterprises, practitioners, and researchers.

🚀 **Efficiency leap and cost reduction:** Through PostNAS, the Jet-Nemotron models (2B and 4B) achieve up to 53.6× higher generation throughput while preserving high accuracy, which translates to serving roughly 53× more users at the same cost, or cutting inference cost by about 98%.

💡 **Core innovation of PostNAS:** PostNAS is a neural architecture search pipeline built specifically for efficiently retrofitting pre-trained models. It freezes the model's knowledge (the MLP layers), replaces compute-heavy full-attention layers with the hardware-efficient JetBlock linear-attention module, and applies hardware-aware design, preserving the model's intelligence while dramatically improving efficiency.

🔄 **JetBlock's distinctive design:** JetBlock introduces convolution kernels that are dynamically conditioned on the input (as opposed to static kernels) and removes redundant convolutions for streamlined efficiency. Combined with hardware-aware hyperparameter search, it matches prior linear-attention designs in throughput while actually improving accuracy.

📱 **Enabling edge devices and broad applications:** Jet-Nemotron's tiny KV cache (154MB) and 2B parameters let it run comfortably on Jetson Orin, RTX 3090, and even mobile chips, with no cloud dependency. This makes applications such as real-time document AI, long-context agents, and on-device assistants possible.

🔬 **Accelerating LLM research and innovation:** PostNAS dramatically lowers the cost and time of LLM architecture innovation. Researchers can quickly test new attention modules and prioritize hardware-aware design, accelerating iteration across the AI ecosystem.

NVIDIA researchers have shattered the longstanding efficiency hurdle in large language model (LLM) inference, releasing Jet-Nemotron—a family of models (2B and 4B) that delivers up to 53.6× higher generation throughput than leading full-attention LLMs while matching, or even surpassing, their accuracy. Most importantly, this breakthrough isn’t the result of a new pre-training run from scratch, but rather a retrofit of existing, pre-trained models using a novel technique called Post Neural Architecture Search (PostNAS). The implications are transformative for businesses, practitioners, and researchers alike.

The Need for Speed in Modern LLMs

While today’s state-of-the-art (SOTA) LLMs, like Qwen3, Llama3.2, and Gemma3, have set new benchmarks for accuracy and flexibility, their O(n²) self-attention mechanism incurs exorbitant costs—both in compute and memory—especially for long-context tasks. This makes them expensive to deploy at scale and nearly impossible to run on edge or memory-constrained devices. Efforts to replace full-attention Transformers with more efficient architectures (Mamba2, GLA, RWKV, etc.) have struggled to close the accuracy gap, until now.
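To make the contrast concrete, here is a minimal PyTorch sketch (illustrative only, not NVIDIA's code) of why decoding with full softmax attention does work proportional to the context length for every new token, on top of a KV cache that keeps growing, while a linear-attention update does constant work per token against a fixed-size state:

```python
import torch

def full_attention_step(q_t, K_cache, V_cache):
    # One decode step with softmax attention: work and memory grow with context length t.
    scores = (q_t @ K_cache.T) / K_cache.shape[-1] ** 0.5   # (1, t) -- O(t) work per step
    return torch.softmax(scores, dim=-1) @ V_cache           # KV cache itself is O(t) memory

def linear_attention_step(q_t, k_t, v_t, state):
    # One decode step with (unnormalized) linear attention: constant work, constant memory.
    state = state + k_t.T @ v_t                               # fixed (d, d) state, independent of context
    return q_t @ state, state

d = 64
q_t, k_t, v_t = (torch.randn(1, d) for _ in range(3))
K_cache, V_cache = torch.randn(1000, d), torch.randn(1000, d) # grows by one row every step
state = torch.zeros(d, d)                                     # never grows

out_full = full_attention_step(q_t, K_cache, V_cache)
out_linear, state = linear_attention_step(q_t, k_t, v_t, state)
```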

Paper: https://arxiv.org/abs/2508.15884v1

PostNAS: A Surgical, Capital-Efficient Overhaul

The core innovation is PostNAS: a neural architecture search pipeline designed specifically for efficiently retrofitting pre-trained models. Here’s how it works:

- **Freeze the knowledge:** the pre-trained model's MLP layers are kept frozen, preserving what the model already knows.
- **Replace the bottleneck:** compute-heavy full-attention layers are swapped for JetBlock, a hardware-efficient linear-attention module.
- **Search with hardware in the loop:** the resulting architecture and its hyperparameters are selected with a hardware-aware search that optimizes for real GPU throughput.

A conceptual sketch of this retrofit step follows.
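The sketch below illustrates the idea in PyTorch; the module and attribute names (`model.layers`, `layer.attn`, `LinearAttentionBlock`) are hypothetical stand-ins, not the released Jet-Nemotron code:

```python
import torch.nn as nn

class LinearAttentionBlock(nn.Module):
    """Stand-in for a JetBlock-style linear-attention module (greatly simplified)."""
    def __init__(self, d_model):
        super().__init__()
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                      # x: (batch, seq, d_model)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        state = k.transpose(-2, -1) @ v        # (batch, d, d), simplified non-causal linear attention
        return self.out(q @ state)

def retrofit(model, layers_to_convert, d_model):
    for p in model.parameters():
        p.requires_grad = False                # 1) freeze everything, including the MLPs ("the knowledge")
    for i, layer in enumerate(model.layers):   # hypothetical attribute names
        if i in layers_to_convert:             # 2) layers picked by the hardware-aware search
            layer.attn = LinearAttentionBlock(d_model)   # 3) replace only the attention module
            for p in layer.attn.parameters():
                p.requires_grad = True         # only the new blocks are trained
    return model
```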

JetBlock is particularly noteworthy: it introduces dynamic causal convolution kernels conditioned on input (unlike static kernels in prior linear attention blocks) and removes redundant convolutions for streamlined efficiency. With hardware-aware hyperparameter search, it not only keeps pace with prior linear attention designs in throughput, but actually boosts accuracy.
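The following toy sketch shows the idea of an input-conditioned (dynamic) causal convolution, in contrast to a static learned kernel; it is a simplification for intuition, not the actual JetBlock implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCausalConv(nn.Module):
    """Depthwise causal conv whose kernel is predicted from the input (toy version)."""
    def __init__(self, d_model, kernel_size=4):
        super().__init__()
        self.kernel_size = kernel_size
        # Predicts one depthwise kernel per channel from a pooled summary of x.
        self.kernel_gen = nn.Linear(d_model, d_model * kernel_size)

    def forward(self, x):                       # x: (batch, seq, d_model)
        b, t, d = x.shape
        k = self.kernel_gen(x.mean(dim=1))      # (b, d * kernel_size), conditioned on the input
        k = k.view(b * d, 1, self.kernel_size)
        xp = F.pad(x.transpose(1, 2), (self.kernel_size - 1, 0))  # left-pad => causal
        xp = xp.reshape(1, b * d, -1)           # grouped conv: one group per (batch, channel)
        y = F.conv1d(xp, k, groups=b * d)
        return y.view(b, d, t).transpose(1, 2)  # back to (batch, seq, d_model)

x = torch.randn(2, 16, 64)
y = DynamicCausalConv(64)(x)                    # same shape as x: (2, 16, 64)
```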


Jet-Nemotron: Performance by the Numbers

The key metrics from NVIDIA’s technical paper are staggering:

| Model | MMLU-Pro Acc. | Generation Throughput (tokens/s, H100) | KV Cache Size (MB, 64K context) | Notes |
| --- | --- | --- | --- | --- |
| Qwen3-1.7B-Base | 37.8 | 61 | 7,168 | Full-attention baseline |
| Jet-Nemotron-2B | 39.0 | 2,885 | 154 | 47× throughput, 47× smaller cache |
| Jet-Nemotron-4B | 44.2 | 1,271 | 258 | 21× throughput, still SOTA acc. |
| Mamba2-2.7B | 8.6 | 2,507 | 80 | All-linear, much lower accuracy |
| RWKV7-1.5B | 13.4 | 3,050 | 24 | All-linear, much lower accuracy |
| DeepSeek-V3-Small (MoE) | — | — | — | 2.2B activated, 15B total, lower acc. |

Jet-Nemotron-2B matches or exceeds Qwen3-1.7B-Base on every major benchmark—math, commonsense, coding, retrieval, long-context—while delivering 47× higher generation throughput.

This isn’t a small gain: a 53.6× speedup in decoding at 256K context length means a 98% reduction in inference cost for the same volume of tokens. Prefilling speedups are also dramatic: 6.14× faster at 256K context.

Memory footprint shrinks by 47× (154MB cache vs. 7,168MB for Qwen3-1.7B-Base). This is a game-changer for edge deployment: Jet-Nemotron-2B is 8.84× and 6.5× faster than Qwen2.5-1.5B on Jetson Orin and RTX 3090, respectively.
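As a quick sanity check, the headline ratios follow directly from the numbers reported above:

```python
# Figures taken from the paper / table above.
speedup = 53.6                              # decode throughput gain at 256K context
print(f"inference cost reduction: {1 - 1 / speedup:.1%}")          # ~98.1%

qwen_cache_mb, jet_cache_mb = 7168, 154     # KV cache at 64K context
print(f"KV cache reduction: {qwen_cache_mb / jet_cache_mb:.1f}x")  # ~46.5x
```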


Applications

For Business Leaders: Better ROI

A 53.6× generation throughput gain translates into serving roughly 53× more users on the same hardware budget, or cutting inference cost by about 98% for the same volume of tokens.

For Practitioners: SOTA on the Edge

With a 154MB KV cache and 2B parameters, Jet-Nemotron-2B runs on Jetson Orin, RTX 3090, and even mobile-class chips with no cloud dependency, enabling on-device assistants, real-time document AI, and long-context agents.

For Researchers: Lower Barrier, Higher Innovation

PostNAS slashes the cost and time of testing new attention modules on top of existing pre-trained models, making hardware-aware architecture research far cheaper to iterate on.

Summary

The open-sourcing of Jet-Nemotron and JetBlock (code on GitHub) means the broader AI ecosystem can now retrofit its own models for unprecedented efficiency. PostNAS is not a one-off trick: it is a general-purpose framework for accelerating any Transformer, lowering the cost of future breakthroughs.


Check out the Paper and the GitHub Page.

