Communications of the ACM - Artificial Intelligence | October 4, 16:15
Generative AI Reshapes Software Development: The Shift from Speed to Trust

Generative AI has markedly increased the speed of code writing within the software development lifecycle (SDLC), producing syntactically correct code fragments in seconds. The core challenge it raises, however, has shifted from speed and syntactic correctness to semantic correctness. As the cost of producing code approaches zero, value shifts to trust in the system: its ability to meet service-level objectives (SLOs) in chaotic, real-world operation. Site reliability engineers (SREs) play a pivotal role in this transition, no longer mere users of AI tools but independent validators of AI-generated reliability. AI shows great promise in testing, migration, and observability, helping SREs manage and verify the reliability of complex systems through intelligent test generation, risk assessment, and signal extraction. Ultimately, AI does not automate reliability engineering; it refactors it, freeing engineers to focus on model curation and validation so that systems are not only fast to develop but fundamentally, verifiably, and sustainably trustworthy.

🚀 **AI accelerates coding, but the core challenge shifts to semantic correctness**: Generative AI dramatically speeds up code writing and readily produces syntactically valid code. The primary concern with AI-generated code, however, has moved from speed and syntax to semantic correctness: whether the code actually implements the intended functionality and logic. This pushes the SDLC bottleneck to non-functional requirements such as latency, resilience, fault tolerance, and security.

💡 **SREs become the key validators of reliability in the AI era**: As the cost of producing code falls, engineering value shifts from development time to trust that a system will meet its service-level objectives (SLOs) in complex, real-world conditions. Site reliability engineers (SREs) sit at the center of this shift, acting not merely as consumers of AI-generated code but as AI-agnostic validators of what it produces.

🧪 **AI enables intelligent testing and risk management**: AI can generate test cases well beyond unit tests, such as generative and structure-aware fuzzing, surfacing latent problems by simulating system failures and edge conditions. SREs can use AI to build risk models that weight test priority by code churn, complexity, and other factors, concentrating scarce compute on the most critical, highest-risk code paths.

🔄 **AI de-risks migrations by exposing risk beneath API abstractions**: During platform migrations, reliability risk can creep in even when the API surface stays stable, because the underlying implementation changes. AI tooling can flag sensitive patterns through static analysis and, via workload synthesis, build high-fidelity load-testing suites from real production traffic, surfacing "semantic drift" early: cases where API contracts are satisfied but performance characteristics diverge.

📊 **AI strengthens observability and speeds root-cause diagnosis**: In complex microservices environments, AI models excel at extracting signal from high-cardinality metrics, distributed traces, and structured logs, and at spotting anomalous correlations. AI can surface likely problems (for example, correlating API gateway errors with OOM events in the caching tier) and give SREs a starting point, but final root-cause diagnosis still requires an engineer's understanding of the architecture and business logic, with the resulting findings fed back to improve the AI model.

Generative AI has unquestionably increased raw coding velocity within the software development lifecycle (SDLC). Tools can churn out syntactically correct boilerplate, REST endpoints, and data transformation logic in a matter of seconds. That velocity, however, puts unearned trust ahead of reliability, and it exposes a more fundamental problem. The issue is no longer the speed or syntactic correctness of the code produced by AI; it is semantic correctness. The SDLC bottleneck now resides in non-functional requirements such as latency, resilience, fault tolerance, and security.3

As producing code approaches near-zero marginal cost, the economic and operational value shifts from a consideration of how much engineering time it takes to develop a new application feature to the notion of trust: confidence that a system will meet its SLO (service-level objective) in chaotic, real-world operation. The scarce resource in the SDLC is no longer engineering hours developing features, but rather the trust that a particular expert must stake to underwrite the reliability of a system. The site reliability engineer (SRE) is positioned at the heart of the AI revolution in the SDLC, not as a utility user of AI-generated outputs, but rather as a designer-agnostic validator of AI-generated reliability.2

Testing
AI-generated tests are likely to move well beyond unit tests. They can and will incorporate sophisticated techniques such as generative and structure-aware fuzzing.6 Given protocol definitions (such as Protobufs or OpenAPI specs), AI can generate valid but unexpected inputs that surface deep serialization bugs undetectable with standard testing practices. AI could also orchestrate more complex forms of chaos engineering1 by simulating a whole range of system outages and operational failures that do not destroy a system outright but quietly degrade its reliability, such as mismatched server clocks or nodes in a distributed system sending incorrect data.
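To make structure-aware fuzzing concrete, here is a minimal sketch in Python using the Hypothesis property-based testing library. The `OrderMessage` schema and its `encode`/`decode` helpers are hypothetical stand-ins for a real Protobuf or OpenAPI-defined message; the point is that every generated input is schema-valid, yet the generator explores the awkward corners (boundary integers, unusual Unicode, empty collections) where serialization bugs hide.

```python
# A minimal structure-aware fuzzing sketch using Hypothesis (property-based testing).
# The OrderMessage schema and its encode/decode helpers are hypothetical stand-ins
# for a real Protobuf- or OpenAPI-defined message.
import json
from dataclasses import dataclass, asdict

from hypothesis import given, strategies as st


@dataclass
class OrderMessage:
    order_id: int
    currency: str
    amount_cents: int
    tags: list


def encode(msg: OrderMessage) -> bytes:
    """Serialize the message; in a real system this would be generated codec code."""
    return json.dumps(asdict(msg)).encode("utf-8")


def decode(raw: bytes) -> OrderMessage:
    """Deserialize the message; a real decoder may choke on edge-case values."""
    return OrderMessage(**json.loads(raw.decode("utf-8")))


# Structure-aware strategy: every generated value conforms to the schema,
# but Hypothesis explores edge cases (64-bit boundaries, odd Unicode, empty lists).
order_strategy = st.builds(
    OrderMessage,
    order_id=st.integers(min_value=0, max_value=2**63 - 1),
    currency=st.text(min_size=1, max_size=8),
    amount_cents=st.integers(min_value=-(2**31), max_value=2**31 - 1),
    tags=st.lists(st.text(max_size=32), max_size=16),
)


@given(order_strategy)
def test_serialization_round_trip(msg):
    # The property under test: decode(encode(x)) == x for every schema-valid input.
    assert decode(encode(msg)) == msg
```

Run with `pytest`; Hypothesis will generate and shrink failing examples automatically, which is exactly the kind of scaled exploration an AI-driven test generator can feed.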

But we are not looking for test coverage; we are looking to mitigate risk. For an SRE, the basic job becomes guiding the AI system. This means establishing a risk model, perhaps itself assisted by AI, that weights testing priority by criteria such as code churn, cyclomatic complexity, and centrality in the service dependency graph. A test that puts the authentication service through its paces against an SLO of P99(latency) < 150ms is infinitely more valuable than a thousand tests on an asynchronous batch job. The AI provides the engine for simulation at massive scale; the SRE provides the objective function and decides where to spend expensive computational resources validating the finest-grained, most critical tests in the system architecture.

Figure 1


Figure 1 represents an AI-driven testing pipeline in which an LLM generates an exhaustive set of tests that are then filtered and prioritized by a risk-weighting engine. The engine's model can be trained on production incident data, dependency graphs, and SLO definitions to focus testing on the code paths that matter most.
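As a rough illustration of such a risk-weighting engine, the sketch below scores candidate tests by normalized code churn, cyclomatic complexity, and dependency-graph centrality, then spends a limited compute budget on the highest-risk paths. The weights, test names, and per-module metrics are illustrative assumptions, not values from any real system; tuning them is precisely the SRE's objective-function job described above.

```python
# A sketch of a risk-weighting engine that prioritizes AI-generated tests.
# Weights, candidate tests, and per-module metrics are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ModuleRisk:
    churn: float        # recent commits touching the module, normalized to [0, 1]
    complexity: float   # cyclomatic complexity, normalized to [0, 1]
    centrality: float   # position in the service dependency graph, normalized to [0, 1]


# Hypothetical weights an SRE might assign; tuning these is the human's job.
WEIGHTS = {"churn": 0.4, "complexity": 0.25, "centrality": 0.35}


def risk_score(m: ModuleRisk) -> float:
    """Composite risk score in [0, 1]; higher means test this path first."""
    return (WEIGHTS["churn"] * m.churn
            + WEIGHTS["complexity"] * m.complexity
            + WEIGHTS["centrality"] * m.centrality)


def prioritize(tests: dict[str, ModuleRisk], budget: int) -> list[str]:
    """Spend the limited compute budget on the highest-risk code paths."""
    ranked = sorted(tests, key=lambda name: risk_score(tests[name]), reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    candidate_tests = {
        "auth_service_p99_latency": ModuleRisk(churn=0.9, complexity=0.7, centrality=0.95),
        "batch_export_retry":       ModuleRisk(churn=0.2, complexity=0.4, centrality=0.1),
        "cache_eviction_fuzz":      ModuleRisk(churn=0.6, complexity=0.8, centrality=0.7),
    }
    print(prioritize(candidate_tests, budget=2))
```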

Migration
While testing addresses reliability at the code and system level, reliability risks also emerge during platform shifts. Migration scenarios such as Redis-to-Valkey4,7 reveal how trust must be revalidated even when APIs remain stable.

A Redis-to-Valkey migration is a canonical example of reliability risk lurking below the abstraction of API compatibility. The command sets are essentially identical, but differences in implementation can have first-order effects on production systems. For example, Valkey's addition of slot-based multithreading7 fundamentally changes performance characteristics relative to Redis' single-threaded-per-shard model, potentially introducing new race conditions or simply shifting latency distributions for complex scripts.

AI tooling can provide an intelligent way of de-risking this process. An LLM can perform static analysis to identify not only deprecated commands, but also idioms and patterns that may be sensitive to the change in threading model (for example, heavy use of KEYS inside scripts). More importantly, AI can perform workload synthesis: by ingesting and analyzing the output of the MONITOR command from a production Redis instance, it can create a high-fidelity load-testing suite that reproduces the production command mix, key distribution, and concurrency. This lets engineers discover "semantic drift," cases where the API contract is satisfied but performance characteristics diverge, before anything downstream of the Redis instance, or any user, is affected. The SRE then takes the histograms and HDR reports and validates that P99.9 latencies and memory allocation patterns remain within the system's error budget under the synthetic load.
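The sketch below shows the shape of such workload synthesis, assuming a captured MONITOR log (the file name `monitor.log`, the target host name, and the replay volume are hypothetical). It extracts the production command mix and replays a sample against a candidate instance using the redis-py client; a production-grade tool would also preserve timing, concurrency, and key distributions rather than just frequencies.

```python
# A sketch of workload synthesis from Redis MONITOR output.
# "monitor.log" and "valkey-candidate" are hypothetical; a production-grade tool
# would also preserve timing, concurrency, and key distributions.
import re
import random
from collections import Counter

import redis  # pip install redis

# Typical MONITOR line: 1700000000.123456 [0 10.0.0.5:53224] "HGETALL" "session:abc"
MONITOR_LINE = re.compile(r'^\d+\.\d+ \[\d+ \S+\] (.+)$')


def parse_commands(path: str) -> list[list[str]]:
    """Extract the command and its arguments from each MONITOR line."""
    commands = []
    with open(path) as fh:
        for line in fh:
            match = MONITOR_LINE.match(line.strip())
            if match:
                commands.append(re.findall(r'"((?:[^"\\]|\\.)*)"', match.group(1)))
    return commands


def replay(commands: list[list[str]], target_host: str, n: int = 10_000) -> Counter:
    """Replay a sample of the production command mix against a candidate instance."""
    client = redis.Redis(host=target_host, port=6379)
    mix = Counter(cmd[0].upper() for cmd in commands)
    print("Synthesized command mix:", mix.most_common(5))
    errors = Counter()
    for cmd in random.choices(commands, k=n):
        try:
            client.execute_command(*cmd)
        except redis.RedisError as exc:
            errors[type(exc).__name__] += 1
    return errors


if __name__ == "__main__":
    cmds = parse_commands("monitor.log")  # hypothetical capture from production
    print(replay(cmds, target_host="valkey-candidate"))
```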

Figure 2


Figure 2 shows a comparative dashboard of latency and memory-fragmentation metrics for Redis vs. Valkey, each running an AI-synthesized workload based on production traffic. It highlights not just API compatibility but the latency and memory-fragmentation differences that can redefine downstream system reliability.
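The validation step the SRE performs on those results can be as simple as a tail-latency gate. The following sketch compares P99.9 latencies from the incumbent and the candidate against the SLO; the 150ms threshold, the 5% tolerance, and the synthetic gamma-distributed samples are illustrative assumptions standing in for real load-test measurements.

```python
# A sketch of the migration gate: compare tail latencies from the incumbent
# and the candidate against the SLO. Threshold, tolerance, and sample data
# are illustrative assumptions, not real measurements.
import numpy as np


def p999(samples_ms: np.ndarray) -> float:
    """P99.9 latency of a run, in milliseconds."""
    return float(np.percentile(samples_ms, 99.9))


def within_error_budget(redis_ms: np.ndarray, valkey_ms: np.ndarray,
                        slo_ms: float = 150.0, tolerance: float = 0.05) -> bool:
    """Candidate passes if its P99.9 meets the SLO and stays within 5% of the incumbent."""
    baseline, candidate = p999(redis_ms), p999(valkey_ms)
    print(f"P99.9 redis={baseline:.2f}ms valkey={candidate:.2f}ms slo={slo_ms}ms")
    return candidate <= slo_ms and candidate <= baseline * (1 + tolerance)


if __name__ == "__main__":
    # Stand-in latency samples; in practice these come from the synthetic load run.
    rng = np.random.default_rng(42)
    redis_samples = rng.gamma(shape=2.0, scale=10.0, size=100_000)
    valkey_samples = rng.gamma(shape=2.0, scale=10.5, size=100_000)
    print("migration gate passed:", within_error_budget(redis_samples, valkey_samples))
```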

Observability
Even with careful testing, systems change in unexpected ways once they are in production. This is why observability, the ability to extract useful information from complex systems, is essential to keeping AI-assisted systems reliable.

In a complex microservices environment, observability is not a data problem; it is a signal extraction problem. AI models are particularly good at identifying correlations across high-cardinality metrics, distributed traces,5 and structured logs. If an AI model notices a spike in 5xx errors on the API gateway and correlates it with increased cache misses and a sequence of pod OOMKilled events in the caching tier, it can point an engineer further up the service dependency chain. That correlation, however, is the beginning of the investigation, not the end; the SRE still has to supply the reasoning chain. In this case, the SRE investigates and finds that a recent deployment replaced the efficient, paginated HSCAN operation with a full-keyspace HGETALL inside a function. The side effect was unbounded memory utilization on the cache server, eventually leading to ungraceful termination of the cache container with OOMKilled events, which in turn caused the API's cache-dependent calls to fail with errors.
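To make that code-level root cause concrete, here is a small sketch contrasting the two access patterns with the redis-py client. The hash name, host, and page size are hypothetical; the point is that HGETALL materializes the entire hash in one reply, while HSCAN pages through it with bounded memory.

```python
# A sketch of the code-level difference behind the incident described above:
# HGETALL materializes an entire hash in one reply, while HSCAN pages through it.
# The hash name, host, and page size are illustrative; the client is redis-py.
import redis

client = redis.Redis(host="cache", port=6379)


def load_sessions_unbounded(hash_key: str) -> dict:
    """The regressed version: one HGETALL pulls the whole hash into memory at once,
    so reply size (and memory use) grows without bound as the hash grows."""
    return client.hgetall(hash_key)


def load_sessions_paginated(hash_key: str, page_size: int = 500):
    """The original version: HSCAN iterates in bounded pages, keeping memory flat."""
    for field, value in client.hscan_iter(hash_key, count=page_size):
        yield field, value


if __name__ == "__main__":
    # Processing incrementally keeps peak memory proportional to page_size,
    # not to the total number of session entries.
    for field, value in load_sessions_paginated("sessions:active"):
        pass  # process each session entry
```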

The AI can identify the symptoms of the failure, architecturally and operationally, across the distributed system; the underlying pathology, however, is best diagnosed by an engineer. Moving forward, it remains the engineer's responsibility to supply that contextual rationale, which is ingested as feedback into the models over time.

Figure 3

Figure 3 presents an incident workflow. The "Before" state shows an incident and a storm of disparate alerts that all point to the same OOMKilled event(s). The "After" state shows the identical OOMKilled failures occurring in the same service, but with the AI correlation engine grouping these alerts together and providing the SRE with a root-cause hypothesis. The engineer maps this hypothesis onto a specific code-level finding (the misuse of HGETALL), and the confirmed finding becomes new data that the engineer feeds back to the model.

Curating the Model
The use of AI systems does not mean that the discipline of reliability engineering has become automated; it has been refactored. The traditional work of creating and analyzing becomes curation and validation: instead of simply building a system, the engineer must also build and refine the AI models that help operate that service in production. The engineer becomes the curator of a model, responsible for applying architectural context, business logic, and the first principles of distributed systems to turn the generic models produced by AI into a legitimate, high-fidelity representation of their production environment.

In a world where the principles of resilience engineering are more pertinent than ever, the engineer must remember that AI's capacity to provide leverage at scale does not abrogate the human engineer's responsibility to apply deep expertise, so that the systems we build are not just fast to develop but fundamentally, verifiably, and sustainably trusted at their core.

References

1. Basiri, A., Behnam, N., de Rooij, R., Hochstein, L., Kosewski, L., Reynolds, J., and Rosenthal, C. (2016). Chaos engineering. IEEE Software, 33(3), 35-41.

2. Beyer, B., Jones, C., Petoff, J., and Murphy, N.R. (2016). Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media.

3. Forsgren, N., Humble, J., and Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps. IT Revolution Press.

4. Redis Documentation. (2024). Redis Commands and Architecture, https://redis.io/docs/

5. Sigelman, B. H., Barroso, L. A., Burrows, M., Stephenson, P. et al. (2010). Dapper, a large-scale distributed systems tracing infrastructure. Technical Report, Google.

6. Sutton, M., Greene, A., and Amini, P. (2007). Fuzzing: Brute Force Vulnerability Discovery. Addison-Wesley.

7. Valkey Project. (2024). Valkey: An Open Source, High-Performance Data Store, https://valkey.io/

Rahul Chandel is an engineering leader with 15+ years of experience building high-performance systems across fintech, blockchain, and cloud platforms at companies like Coinbase, Twilio, and Citrix. He specializes in scalable, resilient architectures and has presented at AWS re:Invent. See https://www.linkedin.com/in/chandelrahul/.
