
Lessons from the June 2025 Google Cloud Incident

On June 12, 2025, a seemingly minor misconfiguration in Google Cloud’s quota system had a catastrophic global impact, resulting in a massive outage that disrupted a significant portion of the internet. The root cause? A quota policy that contained blank fields.

What Happened

- A quota policy was pushed with blank fields.
- This triggered a null pointer crash loop inside Google’s Service Control, the system that enforces API quotas and access checks.
- The invalid policy was replicated globally instead of being caught or contained.
- There was no feature flag and no staged rollout.
- Remediation retries overwhelmed the system without exponential backoff, creating a thundering herd that made recovery more difficult.
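
The published details stop at “null pointer crash loop,” so the types below are entirely hypothetical, but the failure mode is a familiar one. Here is a minimal Go sketch of how an optional field left blank becomes a nil pointer dereference inside an enforcement path, and the guard that would reject the record instead of crashing:

```go
package main

import "fmt"

// QuotaPolicy is a hypothetical stand-in for a quota policy record.
// Limit is optional, so a policy pushed with blank fields arrives with it unset (nil).
type QuotaPolicy struct {
	Service string
	Limit   *int
}

// enforce mimics a quota check that assumes Limit is always present.
// Dereferencing p.Limit without a nil check panics on a blank policy;
// if the process restarts and reloads the same record, it panics again,
// which is how one bad row becomes a crash loop.
func enforce(p QuotaPolicy, used int) bool {
	return used < *p.Limit // panics if Limit is nil
}

// enforceSafely adds the guard that keeps a malformed policy from taking
// the enforcement path down: reject the record instead of crashing.
func enforceSafely(p QuotaPolicy, used int) (bool, error) {
	if p.Limit == nil {
		return false, fmt.Errorf("policy for %q has no limit set; refusing to enforce it", p.Service)
	}
	return used < *p.Limit, nil
}

func main() {
	bad := QuotaPolicy{Service: "example.googleapis.com"} // blank Limit

	if _, err := enforceSafely(bad, 10); err != nil {
		fmt.Println("rejected:", err)
	}
	// enforce(bad, 10) would panic with a nil pointer dereference.
}
```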

A minor oversight, amplified by scale and a lack of safeguards, led to a global service disruption.

The Ripple Effect

Because so many Google Cloud services rely on Service Control for basic API authentication and quota checks, the crash loop quickly blocked legitimate traffic. Applications that depended on Google’s APIs suddenly began returning 503 errors, affecting both internal workloads and customer-facing services.

This had a domino effect. For example, Cloudflare’s Workers KV storage backend, which is built on Google Cloud, became unavailable. This disrupted not only Cloudflare’s services but also the sites and APIs that depend on Cloudflare, spreading the impact far beyond Google’s platform.

In total, this single misconfiguration degraded critical services worldwide for over two hours, with lingering effects as systems slowly recovered.

Treat Configuration as Code and Expect Failure

In distributed systems, even small mistakes can have major consequences if they go unnoticed or are hard to contain.

Configuration must be handled with the same rigor as application code. It requires version control, automated validation, safe deployment procedures, robust monitoring, and the ability to roll back immediately.
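
As a small illustration of that rigor, here is a Go sketch of validating configuration at ingestion the way a compiler would treat code; the schema and field names are invented for the example, not Google’s:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// Config is an invented quota-policy schema used only for this example.
type Config struct {
	Service  string `json:"service"`
	Region   string `json:"region"`
	LimitQPS int    `json:"limit_qps"`
}

// parseAndValidate treats configuration with the same rigor as code:
// unknown fields are rejected, and blank or nonsensical values fail
// fast at ingestion, long before anything could be replicated.
func parseAndValidate(raw []byte) (*Config, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields() // typos and stray fields are errors, not surprises
	var c Config
	if err := dec.Decode(&c); err != nil {
		return nil, fmt.Errorf("config does not match schema: %w", err)
	}
	if c.Service == "" || c.Region == "" {
		return nil, fmt.Errorf("config has blank required fields: %+v", c)
	}
	if c.LimitQPS <= 0 {
		return nil, fmt.Errorf("limit_qps must be positive, got %d", c.LimitQPS)
	}
	return &c, nil
}

func main() {
	// A policy with blank fields should never get past this point.
	if _, err := parseAndValidate([]byte(`{"service": "", "region": "", "limit_qps": 0}`)); err != nil {
		fmt.Println(err)
	}
}
```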

Teams should assume failures will happen and design systems to limit the damage. Every change, no matter how small, should be easy to test, quick to disable, and impossible to replicate globally if broken.
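
One way to make “impossible to replicate globally if broken” concrete is a staged rollout gate that stops and rolls back as soon as monitoring degrades. The stages, threshold, and callbacks below are illustrative assumptions, not any particular platform’s API:

```go
package main

import "fmt"

// Stage describes one phase of a gradual rollout: which fraction of
// regions receives the new configuration before we re-evaluate.
type Stage struct {
	Name    string
	Percent int
}

// Illustrative stages and ordering; real values depend on the service.
var stages = []Stage{
	{"canary", 1},
	{"early", 10},
	{"half", 50},
	{"global", 100},
}

// rollout pushes a change stage by stage and stops as soon as the
// observed error rate crosses the threshold, so a broken config is
// never replicated to every region at once.
func rollout(apply func(percent int) error, errorRate func() float64, threshold float64) error {
	for _, s := range stages {
		if err := apply(s.Percent); err != nil {
			return fmt.Errorf("roll back: apply failed at stage %s: %w", s.Name, err)
		}
		if r := errorRate(); r > threshold {
			return fmt.Errorf("roll back: error rate %.1f%% exceeded %.1f%% at stage %s", r*100, threshold*100, s.Name)
		}
		fmt.Printf("stage %s (%d%%) looks healthy, continuing\n", s.Name, s.Percent)
	}
	return nil
}

func main() {
	// Simulated rollout: monitoring immediately reports elevated errors,
	// so the change never gets past the 1% canary.
	err := rollout(
		func(percent int) error { return nil }, // pretend apply succeeds
		func() float64 { return 0.08 },         // pretend 8% of requests fail
		0.01,                                   // 1% error budget
	)
	fmt.Println(err)
}
```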

This checklist reflects core SRE and reliability engineering principles: layered validation, minimizing the blast radius, and rapid rollback are essential for building resilient systems.

Checklist for Safe Configuration Rollouts

Before rolling out any configuration change, consider the following safeguards:

- Use feature flags to wrap new behaviors, allowing them to be disabled instantly.
- Validate configuration changes in a safe, production-like staging environment before implementing them in the live environment.
- Prevent invalid states by enforcing strong schemas, type checks, and validation when changes are ingested.
- Roll out gradually with canaries or phased deployments to catch problems early.
- Never overwrite a valid state with an invalid state. Adopt a “fail open” approach so that if a new configuration fails validation, the system preserves the last known good state.
- Limit the blast radius and do not replicate the configuration globally until it has been thoroughly tested and proven safe.
- Use exponential backoff, circuit breakers, and safe retries to avoid cascading failures (a sketch of this retry discipline follows the list).
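
To make the last checklist item concrete, here is a minimal Go sketch of retries with exponential backoff, full jitter, and a hard attempt cap; the function and its parameters are illustrative, not taken from a specific library:

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op with exponential backoff, full jitter,
// and a hard cap on attempts, so a failing dependency is not hammered
// by a synchronized thundering herd of retries.
func retryWithBackoff(op func() error, maxAttempts int, base, maxDelay time.Duration) error {
	delay := base
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := op()
		if err == nil {
			return nil
		}
		if attempt == maxAttempts {
			return fmt.Errorf("giving up after %d attempts: %w", attempt, err)
		}
		// Full jitter: sleep a random duration in [0, delay).
		time.Sleep(time.Duration(rand.Int63n(int64(delay))))
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return nil // unreachable, but required by the compiler
}

func main() {
	attempts := 0
	err := retryWithBackoff(func() error {
		attempts++
		return errors.New("503 from quota backend") // stand-in for a failing call
	}, 5, 100*time.Millisecond, 2*time.Second)
	fmt.Println(err, "after", attempts, "attempts")
}
```

A circuit breaker adds one more layer: after enough consecutive failures, the caller stops sending requests for a cooling-off period instead of continuing to retry at all.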

How This Could Have Been Prevented

Many of the checklist items above, if implemented, would have prevented this outage. In particular:

- Feature flag: Even if pushed, the bad policy should have been easy to disable immediately.
- Validation: The policy with blank fields should have failed schema checks before deployment.
- Staged rollout: A gradual rollout would have caught the issue before it was replicated everywhere.
- Safe retries: Backoff and circuit breakers would have reduced the load on an already failing system.
- Fail open: The system should have refused to apply the invalid policy, preserving the last known good state and preventing a global outage (see the sketch after this list).
- Monitoring: Continuous monitoring and anomaly alerts could have detected the crash loop or surges in retries before they cascaded into a global outage.
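
The feature-flag and fail-open points can be sketched together. Below is a hedged Go example, again with hypothetical types, of a policy store that refuses to let an invalid update overwrite the last known good state and exposes a flag that can switch the new enforcement path off instantly:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Policy is a hypothetical config record.
type Policy struct {
	Service string
	Limit   int
}

func (p Policy) valid() bool { return p.Service != "" && p.Limit > 0 }

// PolicyStore keeps the last known good policy. Apply refuses to let an
// invalid update overwrite it, and the enabled flag acts as a kill
// switch for the new enforcement behaviour.
type PolicyStore struct {
	current atomic.Pointer[Policy]
	enabled atomic.Bool
}

// Apply validates the incoming policy and only then swaps it in.
func (s *PolicyStore) Apply(p Policy) error {
	if !p.valid() {
		return fmt.Errorf("rejecting invalid policy %+v; keeping last known good state", p)
	}
	s.current.Store(&p)
	return nil
}

// Enforce uses the current policy only while the feature flag is on;
// turning the flag off instantly falls back to the old (allow) behaviour.
func (s *PolicyStore) Enforce(used int) bool {
	if !s.enabled.Load() {
		return true
	}
	p := s.current.Load()
	return p != nil && used < p.Limit
}

func main() {
	var store PolicyStore
	store.enabled.Store(true)

	_ = store.Apply(Policy{Service: "example.googleapis.com", Limit: 100}) // good update
	if err := store.Apply(Policy{}); err != nil {                          // blank fields
		fmt.Println(err)
	}

	fmt.Println("still serving:", *store.current.Load())
	fmt.Println("allowed:", store.Enforce(10))

	store.enabled.Store(false) // kill switch: disable the new behaviour instantly
	fmt.Println("allowed with flag off:", store.Enforce(10_000))
}
```

The two mechanisms cover different failure modes: the flag turns off a misbehaving code path, while the store keeps a bad record from ever becoming the serving state.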

Incidents like this also highlight the importance of clear ownership, thorough change reviews, and effective cross-team communication, particularly for shared infrastructure.

Universal Takeaways

- Small configuration mistakes can cause global failures.
- Operational basics, such as validation, feature flags, and gradual rollouts, are essential when operating at scale.
- Systems should be designed to fail safely and locally, not catastrophically and globally.
- Resilience is not magic. It comes from sound engineering practices and careful daily operations.

Use this incident as a prompt to audit your config pipelines. How easily could a bad change slip through today?

Final Thought

Distributed systems can quickly turn tiny mistakes into massive outages. Stay humble, validate everything, roll out carefully, and always design for safe failure.
