MarkTechPost@AI October 3, 06:51
IBM Releases Granite 4.0: Hybrid-Architecture LLM Slashes Memory Footprint

IBM has launched Granite 4.0, an open-source large language model family that replaces the conventional Transformer design with a hybrid Mamba-2/Transformer architecture, significantly reducing memory footprint while maintaining high-quality performance. The series spans multiple sizes, from the 3B-parameter "Micro" to the 32B-parameter "H-Small" (~9B active parameters). All models are Apache-2.0 licensed and covered by an ISO/IEC 42001:2023 AI management system certification. By interleaving Mamba-2 state-space layers with a small number of self-attention layers, the hybrid design can cut RAM use by more than 70% for long-context and multi-session inference, lowering GPU costs. The models perform strongly on enterprise benchmarks such as instruction following and tool use, and are already live on multiple platforms.

💡 **Innovative hybrid architecture, significantly lower memory needs**: The Granite 4.0 models use a novel hybrid Mamba-2/Transformer architecture that combines Mamba-2 state-space layers with a small number of self-attention layers. Compared with a conventional Transformer, this design can cut RAM use by more than 70% for long-context and multi-session inference, lowering GPU costs and improving compute efficiency while keeping quality high, making the models markedly more cost-effective.

🚀 **Diverse model sizes under an open-source license**: IBM released several Granite 4.0 variants, including the 3B-parameter "Micro" and "H-Micro", the 7B-parameter "H-Tiny" (~1B active parameters), and the 32B-parameter "H-Small" (~9B active parameters). All models are Apache-2.0 licensed and cryptographically signed, giving developers and enterprises both openness and supply-chain assurance for broad application and deployment.

✅ **Enterprise-grade performance and compliance certification**: Granite 4.0 performs strongly on key enterprise benchmarks for instruction following (IFEval), function calling (BFCLv3), and retrieval-augmented generation (MTRAG), demonstrating its practical utility. Notably, IBM states that Granite is the first open model family covered by an accredited ISO/IEC 42001:2023 AI management system certification, giving enterprises an extra layer of compliance and trust when deploying the models.

🌐 **Broad multi-platform availability**: Granite 4.0 models are accessible through many channels, including IBM watsonx.ai, Dell Pro AI Studio/Enterprise Hub, Docker Hub, Hugging Face, Kaggle, LM Studio, NVIDIA NIM, Ollama, and Replicate. IBM is also actively enabling support in popular inference frameworks such as vLLM and llama.cpp, further simplifying integration and deployment.

IBM just released Granite 4.0, an open-source LLM family that swaps monolithic Transformers for a hybrid Mamba-2/Transformer stack to cut serving memory while keeping quality. Sizes span a 3B dense “Micro,” a 3B hybrid “H-Micro,” a 7B hybrid MoE “H-Tiny” (~1B active), and a 32B hybrid MoE “H-Small” (~9B active). The models are Apache-2.0, cryptographically signed, and—per IBM—the first open models covered by an accredited ISO/IEC 42001:2023 AI management system certification. They are available on watsonx.ai and via Docker Hub, Hugging Face, LM Studio, NVIDIA NIM, Ollama, Replicate, Dell Pro AI Studio/Enterprise Hub, Kaggle, with Azure AI Foundry…

So, what is new?

Granite 4.0 introduces a hybrid design that interleaves a small fraction of self-attention blocks with a majority of Mamba-2 state-space layers (a 9:1 ratio). Per IBM's technical blog, relative to conventional Transformer LLMs, Granite 4.0-H can reduce RAM by more than 70% for long-context and multi-session inference, translating into lower GPU cost at a given throughput/latency target. IBM's internal comparisons also show the smallest Granite 4.0 models outperforming Granite 3.3-8B despite using fewer parameters.
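To make the interleaving concrete, here is a minimal sketch (my own illustration, not IBM's code) of how a 9:1 Mamba-2/attention layer schedule could be laid out; the 40-layer depth and the exact placement of the attention blocks are assumptions:

```python
# Illustrative only: build a layer-type schedule that places one
# self-attention block for every nine Mamba-2 state-space blocks.
def hybrid_schedule(n_layers: int, mamba_per_attn: int = 9) -> list[str]:
    period = mamba_per_attn + 1  # one attention block per 10-layer period
    return [
        "attention" if i % period == period - 1 else "mamba2"
        for i in range(n_layers)
    ]

schedule = hybrid_schedule(40)  # hypothetical depth
print(schedule.count("mamba2"), schedule.count("attention"))  # -> 36 4
```

Because Mamba-2 layers carry a fixed-size recurrent state rather than a KV cache that grows with sequence length, only the few attention layers pay cache costs as context grows, which is where the long-context memory savings come from.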

What are the released variants?

IBM is shipping both Base and Instruct variants across four initial models:

Granite-4.0-Micro: 3B dense Transformer
Granite-4.0-H-Micro: 3B hybrid
Granite-4.0-H-Tiny: 7B hybrid MoE, ~1B active parameters
Granite-4.0-H-Small: 32B hybrid MoE, ~9B active parameters

All are Apache-2.0 and cryptographically signed; IBM states Granite is the first open model family with accredited ISO/IEC 42001 coverage for its AI management system (AIMS). Reasoning-optimized (“Thinking”) variants are planned later in 2025.
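A quick back-of-envelope illustration (mine, using only the published parameter counts) of why "active parameters" matter: in an MoE model all expert weights must be resident in memory, but per-token compute scales roughly with the active parameters only:

```python
# Back-of-envelope sketch, not IBM's numbers beyond the published
# parameter counts: BF16 weight memory scales with *total* params,
# while per-token compute scales with *active* params (~2 FLOPs each).
BYTES_PER_PARAM_BF16 = 2

variants = {
    # name: (total_params, active_params)
    "H-Tiny":  (7e9, 1e9),
    "H-Small": (32e9, 9e9),
}

for name, (total, active) in variants.items():
    weight_gb = total * BYTES_PER_PARAM_BF16 / 1e9
    gflops_per_token = 2 * active / 1e9
    print(f"{name}: ~{weight_gb:.0f} GB weights (BF16), "
          f"~{gflops_per_token:.0f} GFLOPs/token")
```

So H-Small needs roughly 64 GB of resident BF16 weights but does compute closer to a dense ~9B model per token.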

How was it trained, and what about context length and dtype?

Granite 4.0 was trained on samples up to 512K tokens and evaluated up to 128K tokens. Public checkpoints on Hugging Face are BF16 (quantized and GGUF conversions are also published), while FP8 is an execution option on supported hardware—not the format of the released weights.
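A minimal sketch for loading one of the BF16 checkpoints with Hugging Face transformers; the model id is assumed from IBM's naming on Hugging Face, and the hybrid architecture may require a recent transformers release:

```python
# Hedged example: load an assumed Granite 4.0 checkpoint in BF16
# and run a short generation. Adjust model_id to the variant you want.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ibm-granite/granite-4.0-h-micro"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the published checkpoint dtype
    device_map="auto",
)

inputs = tokenizer(
    "Summarize Granite 4.0 in one sentence.", return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```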

Let's understand its performance signals (enterprise-relevant)

IBM highlights instruction-following and tool-use benchmarks (details in IBM's announcement: https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models):

IFEval (HELM): Granite-4.0-H-Small leads most open-weights models, trailing only Llama 4 Maverick at a far larger scale.

BFCLv3 (Function Calling): H-Small is competitive with larger open and closed models at lower price points.

MTRAG (multi-turn RAG): improved reliability on complex retrieval workflows.

How can I get access?

Granite 4.0 is live on IBM watsonx.ai and distributed via Dell Pro AI Studio/Enterprise Hub, Docker Hub, Hugging Face, Kaggle, LM Studio, NVIDIA NIM, Ollama, OPAQUE, Replicate. IBM notes ongoing enablement for vLLM, llama.cpp, NexaML, and MLX for hybrid serving.
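For local serving, here is a hedged sketch using vLLM's offline API, assuming the hybrid-architecture enablement IBM mentions has landed in your vLLM build (the model id is likewise assumed):

```python
# Hedged example: offline batch generation with vLLM. Requires a vLLM
# version that supports Granite 4.0's hybrid Mamba-2/Transformer stack.
from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-4.0-h-tiny")  # assumed model id
params = SamplingParams(max_tokens=64, temperature=0.2)

outputs = llm.generate(["What tools can you call?"], params)
print(outputs[0].outputs[0].text)
```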

My thoughts/comments

I see Granite 4.0’s hybrid Mamba-2/Transformer stack and active-parameter MoE as a practical path to lower TCO: >70% memory reduction and long-context throughput gains translate directly into smaller GPU fleets without sacrificing instruction-following or tool-use accuracy (IFEval, BFCLv3, MTRAG). The BF16 checkpoints with GGUF conversions simplify local evaluation pipelines, and ISO/IEC 42001 plus signed artifacts address provenance/compliance gaps that typically stall enterprise deployment. Net result: a lean, auditable base model family (1B–9B active) that’s easier to productionize than prior 8B-class Transformers.


Check out the Hugging Face Model Card and Technical details.

The post IBM Released new Granite 4.0 Models with a Novel Hybrid Mamba-2/Transformer Architecture: Drastically Reducing Memory Use without Sacrificing Performance appeared first on MarkTechPost.
