Anthropic publishes incident postmortem: three infrastructure issues affected Claude's performance

https://simonwillison.net/atom/everything September 30, 19:10


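Three infrastructure bugs recently made Claude's response quality unstable, and Anthropic has published an explanation of what went wrong. Claude is deployed across AWS Trainium, NVIDIA GPUs, and Google TPUs, and because each platform has different characteristics, each needs its own optimizations. The company acknowledges that its privacy practices limit engineers' access to user interactions, which hindered the investigation. Anthropic also states that it never reduces model quality due to demand, time of day, or server load, and that the problems users reported were due to infrastructure bugs alone.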

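🔍 Anthropic recently hit three infrastructure bugs that made Claude's output unstable between August and early September and degraded the quality of user interactions.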

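🌐 The problems arose in a deployment that spans AWS Trainium, NVIDIA GPUs, and Google TPUs; differences between the platforms complicate optimization, and the three bugs landed close together.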

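🚫 Anthropic stresses that it never reduces model quality due to demand, time of day, or server load; the problems users reported came from infrastructure bugs alone, not from deliberate degradation.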

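🔐 The company acknowledges that its privacy practices limit engineers' access to user interactions, which hindered the investigation. Claude also often recovers well from isolated mistakes, so Anthropic's evaluations did not capture the degradation users were reporting.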

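💡 By publishing the technical details, Anthropic acknowledges the complexity of multi-platform deployment and includes code examples for a TPU-specific bug, though its reputation for reliable serving has taken a notable hit.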

Anthropic: A postmortem of three recent issues. Anthropic had a very bad month in terms of model reliability:

Between August and early September, three infrastructure bugs intermittently degraded Claude's response quality. We've now resolved these issues and want to explain what happened. [...]
To state it plainly: We never reduce model quality due to demand, time of day, or server load. The problems our users reported were due to infrastructure bugs alone. [...]
We don't typically share this level of technical detail about our infrastructure, but the scope and complexity of these issues justified a more comprehensive explanation.

I'm really glad Anthropic are publishing this in so much detail. Their reputation for serving their models reliably has taken a notable hit.

I hadn't appreciated the additional complexity caused by their mixture of different serving platforms:

We deploy Claude across multiple hardware platforms, namely AWS Trainium, NVIDIA GPUs, and Google TPUs. [...] Each hardware platform has different characteristics and requires specific optimizations.

It sounds like the problems came down to three separate bugs which unfortunately came along very close to each other.

Anthropic also note that their privacy practices made investigating the issues particularly difficult:

The evaluations we ran simply didn't capture the degradation users were reporting, in part because Claude often recovers well from isolated mistakes. Our own privacy practices also created challenges in investigating reports. Our internal privacy and security controls limit how and when engineers can access user interactions with Claude, in particular when those interactions are not reported to us as feedback. This protects user privacy but prevents engineers from examining the problematic interactions needed to identify or reproduce bugs.

The code examples they provide to illustrate a TPU-specific bug show that they use Python and JAX as part of their serving layer.
