Microsoft AI News, October 15
Microsoft contributes new standards to advance cloud and AI infrastructure innovation

Microsoft is contributing new standards across power, cooling, sustainability, security, networking, and fleet resiliency to advance innovation.

In the transition from building computing infrastructure for cloud scale to building cloud and AI infrastructure for frontier scale, the world of computing has experienced tectonic shifts in innovation. Throughout this journey, Microsoft has shared its learnings and best practices, optimizing our cloud infrastructure stack in cross-industry forums such as the Open Compute Project (OCP) Global Foundation.

Today, we see that the next phase of cloud infrastructure innovation is poised to be the most consequential period of transformation yet. In just the last year, Microsoft has added more than 2 gigawatts of new capacity and launched the world’s most powerful AI datacenter, which delivers 10x the performance of the world’s fastest supercomputer today. Yet, this is just the beginning.

Delivering AI infrastructure at the highest performance and lowest cost requires a systems approach, with optimizations across the stack to drive quality, speed, and resiliency at a level that can provide a consistent experience to our customers. In the quest to supply resilient, sustainable, secure, and widely scalable technology to handle the breadth of AI workloads, we’re embarking on an ambitious new journey: one of not just redefining infrastructure innovation at every layer of execution, from silicon to systems, but of tightly integrated industry alignment on standards that offer a model for global interoperability and standardization.

At this year’s OCP Global Summit, Microsoft is contributing new standards across power, cooling, sustainability, security, networking, and fleet resiliency to further advance innovation in the industry.

Redefining power distribution for the AI era

As AI workloads scale globally, hyperscale datacenters are experiencing unprecedented power density and distribution challenges.

Last year, at the OCP Global Summit, we partnered with Meta and Google in the development of Mt. Diablo, a disaggregated power architecture. This year, we’re building on this innovation with the next step of our full-stack transformation of datacenter power systems: solid-state transformers. Solid-state transformers simplify the power chain with new conversion technologies and protection schemes that can accommodate future rack voltage requirements.

Training large models across thousands of GPUs also introduces variable and intense power draw patterns that can strain the grid, the utility, and traditional power delivery systems. These fluctuations not only risk hardware reliability and operational efficiency but also create challenges for capacity planning and sustainability goals.

Together with key industry partners, Microsoft is leading a power stabilization initiative to address this challenge. In a recently published paper with OpenAI and NVIDIA, Power Stabilization for AI Training Datacenters, we address how full-stack innovations spanning rack-level hardware, firmware orchestration, predictive telemetry, and facility integration can smooth power spikes, reduce power overshoot by 40%, and mitigate operational risk and costs to enable predictable and scalable power delivery for AI training clusters.
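The paper's actual mechanisms span hardware, firmware, and facility layers and are not reproduced here; as a rough illustration of the idea, a slew-rate-limited power cap smooths the bursty draw profile that synchronous training produces. All numbers below are hypothetical:

```python
# Illustrative sketch: slew-rate limiting a bursty GPU power profile.
# Values are hypothetical; the real initiative combines rack hardware,
# firmware orchestration, predictive telemetry, and facility integration.

def smooth(profile_kw, max_step_kw):
    """Limit how fast delivered power may rise or fall per interval."""
    out = [profile_kw[0]]
    for target in profile_kw[1:]:
        prev = out[-1]
        step = max(-max_step_kw, min(max_step_kw, target - prev))
        out.append(prev + step)
    return out

# A synthetic train-step pattern: compute bursts alternate with sync stalls.
raw = [100, 900, 120, 880, 110, 900, 100]
smoothed = smooth(raw, max_step_kw=300)

swing_raw = max(abs(b - a) for a, b in zip(raw, raw[1:]))
swing_smoothed = max(abs(b - a) for a, b in zip(smoothed, smoothed[1:]))
print(swing_raw, swing_smoothed)  # interval-to-interval swing drops from 800 to 300
```

In practice such limiting has to be coordinated with the training loop (e.g., by shaping the workload rather than clipping delivered power), which is part of what the full-stack approach in the paper addresses.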

This year, at the OCP Global Summit, Microsoft is joining forces with industry partners to launch a dedicated power stabilization workgroup. Our goal is to foster open collaboration across hyperscalers and hardware partners, sharing our learnings from full-stack innovation and inviting the community to co-develop new methodologies that address the unique power challenges of AI training datacenters. By building on the insights from our recently published white paper, we aim to accelerate industry-wide adoption of resilient, scalable power delivery solutions for the next generation of AI infrastructure. Read more about our power stabilization efforts.

Cooling innovations for resiliency

As the power profile for AI infrastructure changes, we are also continuing to rearchitect our cooling infrastructure to support evolving needs around energy consumption, space optimization, and overall datacenter sustainability. Various cooling solutions must be implemented to support the scale of our expansion—as we seek to build new AI-scale datacenters, we are also utilizing Heat Exchanger Unit (HXU)-based liquid cooling to rapidly deploy new AI capacity within our existing air-cooled datacenter footprint.

Microsoft’s next generation HXU is an upcoming OCP contribution that enables liquid cooling for high-performance AI systems in air-cooled datacenters, supporting global scalability and rapid deployment. The modular HXU design delivers 2X the performance of current models and maintains >99.9% cooling service availability for AI workloads. No datacenter modifications are required, allowing seamless integration and expansion. Learn more about the next generation HXU here. 
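The HXU contribution itself concerns modular deployment rather than thermodynamics, but the sizing intuition behind any liquid-cooling loop reduces to the standard heat balance Q = ṁ·cp·ΔT. A back-of-the-envelope sketch, with a hypothetical rack power and coolant temperature rise (not HXU specifications):

```python
# Back-of-envelope coolant flow for a liquid-cooled rack.
# Q = m_dot * cp * dT  =>  m_dot = Q / (cp * dT)
# Numbers below are hypothetical, not HXU specifications.

CP_WATER = 4186.0  # specific heat of water, J/(kg*K)

def coolant_flow_kg_per_s(rack_power_w, delta_t_k, cp=CP_WATER):
    """Mass flow needed to carry away rack_power_w at a coolant
    temperature rise of delta_t_k."""
    return rack_power_w / (cp * delta_t_k)

# A 100 kW rack with a 10 K coolant temperature rise:
flow = coolant_flow_kg_per_s(100_000, 10)
print(f"{flow:.2f} kg/s")  # ~2.39 kg/s, roughly 143 L/min of water
```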

Meanwhile, we’re continuing to innovate across multiple layers of the stack to address changes in power and heat dissipation—utilizing facility water cooling at datacenter-scale, circulating liquid in closed-loops from server to chiller; and exploring on-chip cooling innovations like microfluidics to efficiently remove heat directly from the silicon.

Unified networking solutions for growing infrastructure demands 

Scaling hundreds of thousands of GPUs to operate as a single, coherent system brings significant challenges: creating rack-scale interconnects that deliver low-latency, high-bandwidth fabrics that are both efficient and interoperable. As AI workloads grow exponentially and infrastructure demands intensify, we are exploring networking optimizations that can support these needs. To that end, we have developed scale-up, scale-out, and Wide Area Network (WAN) solutions to enable large-scale distributed training.
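A first-order model makes clear why interconnect bandwidth dominates at this scale: the classic ring all-reduce bound, T ≈ 2(N−1)/N · S/B, ignores latency and compute overlap but shows how gradient-sync time scales with payload size and link bandwidth. The parameter values below are hypothetical:

```python
# First-order estimate of gradient-sync time using the ring all-reduce
# transfer bound: T ~= 2 * (N - 1) / N * S / B.
# Ignores link latency and compute/communication overlap; parameter
# values are hypothetical, for illustration only.

def ring_allreduce_seconds(num_gpus, payload_bytes, link_bytes_per_s):
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes / link_bytes_per_s

# 1 GB of gradients over a 400 Gb/s (50 GB/s) link, across 1,024 GPUs:
t = ring_allreduce_seconds(1024, 1e9, 50e9)
print(f"{t * 1000:.1f} ms")  # ~40 ms per sync step at these assumptions
```

Note the N-dependence nearly vanishes for large N, which is why per-link bandwidth, rather than GPU count, sets the floor for synchronization time in bandwidth-bound regimes.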

We partner closely with standards bodies, like the Ultra Ethernet Consortium (UEC) and UALink, focused on innovation in networking technologies for this critical element of AI systems. We are also driving forward adoption of Ethernet for scale-up networking across the ecosystem and are excited to see the Ethernet for Scale-up Networking (ESUN) workstream launch under the OCP Networking Project. We look forward to promoting adoption of cutting-edge networking solutions and enabling a multi-vendor ecosystem based on open standards.

Security, sustainability, and quality: Fundamental pillars for resilient AI operations

Defense in depth: Trust at every layer

Our comprehensive approach to scaling AI systems responsibly includes embedding trust and security into every layer of our platform. This year, we are introducing new security contributions that build on our existing body of work in hardware security and introduce new protocols that are uniquely fit to support new scientific breakthroughs that have been accelerated with the introduction of AI:

  • Building on past years’ contributions and Microsoft’s collaboration with AMD, Google, and NVIDIA, we have further enhanced Caliptra, our open-source silicon root of trust. The introduction of Caliptra 2.1 extends the hardware root of trust to a full security subsystem. Learn more about Caliptra 2.1 here.
  • We have also added Adams Bridge 2.0 to Caliptra to extend support for quantum-resilient cryptographic algorithms to the root-of-trust.
  • Finally, we are contributing OCP Layered Open-source Cryptographic Key Management (L.O.C.K)—a key management block for storage devices that secures media encryption keys in hardware. L.O.C.K was developed through collaboration between Google, Kioxia, Microsoft, Samsung, and Solidigm.
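Caliptra's internals are not reproduced here, but the core idea behind any silicon root of trust, each boot stage measuring the next before handing off control, can be sketched as a hash chain. This is an illustration of the general measured-boot pattern, not the Caliptra design:

```python
import hashlib

# Illustrative measurement chain in the spirit of a hardware root of
# trust: each stage extends a running digest with the hash of the next
# component before executing it. A sketch, not Caliptra's actual design.

def extend(state: bytes, component: bytes) -> bytes:
    """Fold a component's measurement into the chain state."""
    return hashlib.sha256(state + hashlib.sha256(component).digest()).digest()

RESET = b"\x00" * 32  # initial value anchored in silicon

def measure_chain(components):
    state = RESET
    for component in components:
        state = extend(state, component)
    return state

good = measure_chain([b"rom", b"firmware-v2", b"os-loader"])
evil = measure_chain([b"rom", b"firmware-EVIL", b"os-loader"])

# Tampering anywhere in the chain yields a different final digest, so a
# single attestation of the final state covers the entire boot path.
assert good != evil
```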

Advancing datacenter-scale sustainability 

Sustainability continues to be a major area of opportunity for industry collaboration and standardization through communities such as the Open Compute Project. Working collaboratively as an ecosystem of hyperscalers and hardware partners is one catalyst to address the need for sustainable datacenter infrastructure that can effectively scale as compute demands continue to evolve. This year, we are pleased to continue our collaborations as part of OCP’s Sustainability workgroup across areas such as carbon reporting, accounting, and circularity:

  • Announced at this year’s Global Summit, we are partnering with AWS, Google, and Meta to fund the Product Category Rule initiative under the OCP Sustainability workgroup, with the goal of standardizing carbon measurement methodology for devices and datacenter equipment.
  • Together with Google, Meta, OCP, Schneider Electric, and the iMasons Climate Accord, we are establishing the Embodied Carbon Disclosure Base Specification to establish a common framework for reporting the carbon impact of datacenter equipment.
  • Microsoft is advancing the adoption of waste heat reuse (WHR). In partnership with the NetZero Innovation Hub, NREL, and EU and US collaborators, Microsoft has published heat reuse reference designs and is developing an economic modeling tool that gives datacenter operators and waste heat off-takers the cost of building WHR infrastructure under conditions such as the size and capacity of the WHR system, season, location, and the WHR mandates and subsidies in place. These region-specific solutions help operators convert excess heat into usable energy, meeting regulatory requirements and unlocking new capacity, especially in regions like Europe where heat reuse is becoming mandatory.
  • We have developed an open methodology for Life Cycle Assessment (LCA) at scale across large-scale IT hardware fleets to drive towards a “gold standard” in sustainable cloud infrastructure.
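The economic modeling tool's actual methodology is not detailed here; a toy payback calculation shows the shape of the question it answers, weighing WHR build-out cost against recovered-heat revenue. All figures below are hypothetical:

```python
# Toy waste-heat-reuse payback model. All figures are hypothetical and
# do not reflect the referenced economic modeling tool's methodology.

def payback_years(capex_eur, recovered_mwh_per_year,
                  heat_price_eur_per_mwh, subsidy_eur=0.0):
    """Years until recovered-heat revenue covers the WHR build-out cost."""
    annual_revenue = recovered_mwh_per_year * heat_price_eur_per_mwh
    return (capex_eur - subsidy_eur) / annual_revenue

# 5M EUR of heat-export infrastructure, 20 GWh/yr of recoverable heat
# sold at 40 EUR/MWh, with a 1M EUR subsidy:
print(payback_years(5_000_000, 20_000, 40, subsidy_eur=1_000_000))  # 5.0
```

A real model would also account for seasonal heat demand, pumping energy, and discounting, which is why region-specific inputs matter.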

Rethinking node management: Fleet operational resiliency for the frontier era

As AI infrastructure scales at an unprecedented pace, Microsoft is investing in standardizing how diverse compute nodes are deployed, updated, monitored, and serviced across hyperscale datacenters. In collaboration with AMD, Arm, Google, Intel, Meta, and NVIDIA, we are driving a series of Open Compute Project (OCP) contributions focused on streamlining fleet operations, unifying firmware management and manageability interfaces, and enhancing diagnostics, debug, and RAS (Reliability, Availability, and Serviceability) capabilities. This standardized approach to lifecycle management lays the foundation for consistent, scalable node operations during this period of rapid expansion. Read more about our approach to resilient fleet operations.
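One way to see why a standardized lifecycle matters at fleet scale is as a shared state machine: every vendor's node moves through the same states with the same allowed transitions, so tooling generalizes across hardware. The states and transitions below are hypothetical, not the OCP contribution's actual model:

```python
# Minimal node-lifecycle state machine. States and transitions are
# hypothetical illustrations, not the OCP contribution's model.

ALLOWED = {
    "provisioned": {"in_service"},
    "in_service": {"updating", "diagnosing"},
    "updating": {"in_service", "diagnosing"},
    "diagnosing": {"in_service", "retired"},
    "retired": set(),
}

class Node:
    def __init__(self):
        self.state = "provisioned"

    def transition(self, new_state: str):
        # Reject transitions the lifecycle does not permit, so every
        # node in the fleet follows the same auditable path.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.state = new_state

n = Node()
n.transition("in_service")
n.transition("updating")     # firmware rollout begins
n.transition("in_service")   # node returns to the serving pool
```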

Paving the way for frontier-scale AI computing 

As we enter a new era of frontier-scale AI development, Microsoft takes pride in leading the advancement of standards that will drive the future of globally deployable AI supercomputing. Our commitment is reflected in our active role in shaping the ecosystem that enables scalable, secure, and reliable AI infrastructure across the globe. We invite attendees of this year’s OCP Global Summit to connect with Microsoft at booth #B53 to discover our latest cloud hardware demonstrations. These demonstrations showcase our ongoing collaborations with partners throughout the OCP community, highlighting innovations that support the evolution of AI and cloud technologies.

Connect with Microsoft at the OCP Global Summit 2025 and beyond
