ByteByteGo, October 15, 00:36
Salesforce Speeds Up Test Failure Triage with AI

 

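Salesforce runs about 6 million tests per day, which produces a large volume of failures. To replace slow, labor-intensive manual triage, its engineering team built an AI-powered Test Failure (TF) Triage Agent. The system combines FAISS semantic search, contextual embeddings, and LLM reasoning to deliver concrete recommendations within 30 seconds. With asynchronous pipelines and development accelerated by Cursor, the team cut test failure resolution time by 30 percent while maintaining high product quality.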

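🔍 Salesforce runs about 6 million tests per day, which generate roughly 150,000 failures per month, alongside about 27,000 code changelists per day. Manual triage took about seven days per failure, creating a backlog and slowing development.

🤖 The AI-powered Test Failure (TF) Triage Agent combines FAISS semantic search, contextual embeddings, and LLM reasoning to return concrete recommendations within 30 seconds, sharply reducing triage time.

⚙️ The system uses an asynchronous, decoupled architecture that runs independently of CI/CD and does not block it. Using Cursor shortened the development cycle from several months to four to six weeks and helped the system scale.

📈 After rollout, test failures were resolved 30 percent faster, which raised developer productivity while maintaining high product quality and catching problems that appear when multiple products are integrated.

🔗 Cursor integrates deeply with the codebase and surfaces contextually relevant code references in real time, helping the team reuse existing patterns, work through scaling challenges, and focus on the core logic.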



Disclaimer: The details in this post have been derived from the details shared online by the Salesforce Engineering Team. All credit for the technical details goes to the Salesforce Engineering Team. The links to the original articles and sources are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.

Modern software systems run on millions of automated tests every single day. At Salesforce, this testing ecosystem operates at an enormous scale.

The company runs about 6 million tests daily, covering more than 78 billion possible test combinations. Every month, these tests generate around 150,000 failures, and there are more than 27,000 code changelists submitted each day.

Before automation, dealing with these failures was a slow and tiring process. Developers had to spend hours going through error logs, changelists, and internal tracking systems like GUS to figure out what went wrong. Integration failures were especially difficult because any of the 30,000 engineers across the company could be responsible for a given issue. This made it hard to find the root cause and fix the problem quickly.

The result was a growing backlog of unresolved failures, increasing developer frustration, and long delays. On average, it took about seven days to resolve a single test failure. The Salesforce engineering team recognized that this was not sustainable. They needed a faster, more reliable way to handle failures and keep the development process moving smoothly. This challenge set the stage for building an AI-powered solution to remove these bottlenecks.

In this article, we will look at how Salesforce developed such a system and the key takeaways from their journey.




The Goal of the System

Salesforce has a dedicated Platform Quality Engineering team that plays a critical role in the software development process.

This team acts as the final line of defense before any code is released to customers. While individual scrum teams focus on testing their own products in isolation, the Platform Quality Engineering team goes a step further. They run integration tests across multiple products to make sure everything works well together as one unified system.

This focus on integration is important because customers often use several Salesforce products in combination. A product might work perfectly on its own, but when used together with others, unexpected problems can appear. Customers sometimes describe this as the products feeling like they come from different companies. The Platform Quality Engineering team exists to catch these integration bugs early, before they ever reach customers. Fixing bugs after deployment is expensive and time-consuming, so identifying them early is a major priority for Salesforce.

Beyond this core testing role, the team is always looking for ways to automate engineering workflows and make developers more productive. One of the most time-consuming parts of their work was triaging large numbers of test failures.

To address this, the Salesforce engineering team set clear goals for faster, more reliable triage.

To meet these goals, Salesforce built the Test Failure (TF) Triage Agent, an AI-powered system that provides concrete recommendations within seconds of a failure occurring.

The TF Triage Agent is designed to transform what used to be a slow, manual triage process into a fast and reliable automated workflow. This system fits directly into the team’s mission of maintaining high product quality at scale while keeping the development process efficient.

AI and Automation Architecture

To build an AI-powered system that could process millions of test results quickly and accurately, the Salesforce engineering team designed a specialized AI and automation architecture.

This architecture had to work with massive amounts of noisy, unstructured error data while keeping response times under 30 seconds. Achieving this required a combination of intelligent data processing, search techniques, and careful system design.

Here are the main technical components of the architecture:

1 - Semantic Search with FAISS

Salesforce used FAISS (Facebook AI Similarity Search) to create a semantic search index of historical test failures and their resolutions. FAISS is a library that allows very fast similarity searches between data represented as vectors.

Every time a new test failure occurs, the system performs a vector similarity search against this index to find past failures that look similar. This makes it possible to match a new error with previously fixed problems and suggest likely solutions. Using FAISS replaced older methods that relied on SQL databases, which were too slow for real-time lookups at Salesforce’s scale.
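The lookup described above can be sketched in a few lines. This is a minimal illustration, not Salesforce's implementation: a brute-force cosine-similarity index written in plain Python stands in for FAISS (a flat inner-product index such as `faiss.IndexFlatIP` performs the same kind of search far faster), and the class name `FlatIndex`, the three-dimensional vectors, and the resolution strings are all hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


class FlatIndex:
    """Brute-force similarity index; a stand-in for a FAISS flat index."""

    def __init__(self) -> None:
        self.vectors: List[Vector] = []
        self.resolutions: List[str] = []

    def add(self, vector: Vector, resolution: str) -> None:
        # Store the embedding of a past failure next to the fix that resolved it.
        self.vectors.append(normalize(vector))
        self.resolutions.append(resolution)

    def search(self, query: Vector, k: int) -> List[Tuple[float, str]]:
        # Score every stored failure against the query and keep the top k.
        q = normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), resolution)
            for v, resolution in zip(self.vectors, self.resolutions)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


# Hypothetical embeddings of three historical failures.
index = FlatIndex()
index.add([0.9, 0.1, 0.0], "Increase connection pool size")
index.add([0.0, 1.0, 0.1], "Regenerate stale test fixture")
index.add([0.1, 0.0, 1.0], "Pin dependency version")

# A new failure whose embedding is close to the first historical one.
hits = index.search([0.8, 0.2, 0.0], k=2)
```

A brute-force scan like this costs time proportional to the number of stored failures, which is why a dedicated vector library matters at the volumes described in this article.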

2 - Contextual Embeddings and Parsing Pipelines

Error logs and code snippets are often messy and inconsistent. To make them useful for semantic search, the Salesforce engineering team built parsing pipelines that clean and structure the data before it is processed.

Once the data is cleaned, the system generates contextual embeddings, which are mathematical representations that capture the meaning of code snippets and error messages. By embedding both error stacks and historical fixes, the system can compare them in a meaningful way and identify the most probable solutions for a new failure.
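To make the cleaning step concrete, here is a rough sketch of what such a parsing stage might look like. The regular expressions, placeholder tokens, and function names are assumptions made for illustration. The `toy_embed` function is only a hashed bag-of-words stand-in so the example runs without a model; it is not a contextual embedding, which would come from a learned model.

```python
import hashlib
import re
from typing import List

# Volatile details that differ between runs of the same underlying failure.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?")
HEX_ADDR = re.compile(r"0x[0-9a-fA-F]+")
NUMBER = re.compile(r"\b\d+\b")


def clean_log_line(line: str) -> str:
    """Replace run-specific values with placeholders and collapse whitespace."""
    line = TIMESTAMP.sub("<TS>", line)
    line = HEX_ADDR.sub("<ADDR>", line)
    line = NUMBER.sub("<NUM>", line)
    return " ".join(line.split())


def toy_embed(text: str, dim: int = 16) -> List[float]:
    """Hash each token into one of `dim` buckets (toy stand-in for a model)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    return vec


# Two hypothetical log lines from the same failure on different runs.
run_a = "2024-05-01 12:30:45,123 ERROR  NullPointerException at 0x7ffe12ab line 42"
run_b = "2024-06-17 03:02:11,900 ERROR NullPointerException at 0x1c00beef line 57"

cleaned_a = clean_log_line(run_a)
cleaned_b = clean_log_line(run_b)
same_vector = toy_embed(cleaned_a) == toy_embed(cleaned_b)
```

After cleaning, both lines reduce to the same text, so they map to the same vector. Without the cleaning step, timestamps and memory addresses would make identical failures look different to the search index.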

3 - Asynchronous and Decoupled Pipelines

The team designed the pipelines to work asynchronously and to be decoupled from the main CI/CD workflows. This means that the AI triage process runs in parallel, without slowing down code integration or testing activities.

This design choice is critical for speed. Instead of making developers wait for the AI system to finish, the pipelines process failures independently and return recommendations quickly, keeping overall latency low.
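One common way to get this kind of decoupling is a queue between the CI run and the triage worker. The sketch below uses Python's `asyncio` to show the idea. The function names, the failure identifiers, and the short sleep that stands in for search and LLM latency are all hypothetical, and the source does not say which queueing mechanism Salesforce uses.

```python
import asyncio
from typing import List


async def triage(failure_id: str) -> str:
    # Placeholder for the similarity search and LLM call.
    await asyncio.sleep(0.01)
    return f"recommendation for {failure_id}"


async def triage_worker(queue: asyncio.Queue, results: List[str]) -> None:
    # Runs independently of the CI run and drains failures as they arrive.
    while True:
        failure_id = await queue.get()
        try:
            results.append(await triage(failure_id))
        finally:
            queue.task_done()


async def ci_run(queue: asyncio.Queue, log: List[str]) -> None:
    # The CI side only enqueues failures; it never waits for triage.
    for failure_id in ["fail_1", "fail_2", "fail_3"]:
        log.append(f"reported {failure_id}")
        queue.put_nowait(failure_id)
    log.append("ci finished")


async def main():
    queue: asyncio.Queue = asyncio.Queue()
    results: List[str] = []
    log: List[str] = []
    worker = asyncio.create_task(triage_worker(queue, results))
    await ci_run(queue, log)
    ci_finished_first = len(results) == 0  # no triage result was needed yet
    await queue.join()  # wait for the worker only so the demo can inspect results
    worker.cancel()
    return log, results, ci_finished_first


log, results, ci_finished_first = asyncio.run(main())
```

The CI run completes before any recommendation exists, and the recommendations arrive afterward. In a production system the in-process queue would typically be a durable message broker, but the separation of concerns is the same.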

4 - Hybrid of LLM Reasoning and Semantic Search

The Salesforce engineering team combined semantic search with large language model (LLM) reasoning to get the best of both worlds.

The semantic search step finds the most relevant historical examples, while the LLM then interprets and refines these results to produce clear and specific guidance. This approach ensures that developers receive precise recommendations instead of vague or generic answers. It also helps avoid speculative outputs that can reduce developer trust in AI tools.
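The hybrid step can be pictured as retrieval followed by a grounded prompt. The following is a hedged sketch under several assumptions: the record shape, the prompt wording, the `min_score` threshold, and the `fake_llm` stub are all invented for illustration. The threshold is one plausible way to avoid speculative answers when nothing similar exists, not a documented detail of the TF Triage Agent.

```python
from typing import Callable, List, Tuple

# (similarity score, past error text, past fix): a hypothetical record shape.
Hit = Tuple[float, str, str]


def build_prompt(new_error: str, hits: List[Hit]) -> str:
    """Ground the model in retrieved examples instead of asking it to guess."""
    examples = "\n".join(
        f"- Past error: {err}\n  Fix that worked: {fix}" for _, err, fix in hits
    )
    return (
        "A test failed with this error:\n"
        f"{new_error}\n\n"
        "Similar failures that were already resolved:\n"
        f"{examples}\n\n"
        "Recommend one concrete fix, citing the past failure it is based on."
    )


def recommend(
    new_error: str,
    hits: List[Hit],
    llm: Callable[[str], str],
    min_score: float = 0.75,
) -> str:
    # Drop weak matches so the model is never asked to speculate from noise.
    relevant = [hit for hit in hits if hit[0] >= min_score]
    if not relevant:
        return "No sufficiently similar past failure found; manual triage needed."
    return llm(build_prompt(new_error, relevant))


def fake_llm(prompt: str) -> str:
    # Stub so the sketch runs without any model or network access.
    return f"[stub answer based on {prompt.count('Past error:')} example(s)]"


hits = [
    (0.91, "TimeoutError in OrderSyncTest", "Raise the mock server timeout"),
    (0.40, "AssertionError in LoginTest", "Update expected redirect URL"),
]
answer = recommend("TimeoutError in InvoiceSyncTest", hits, fake_llm)
```

Passing the model in as a plain callable keeps the retrieval and prompting logic testable on its own, and the explicit fallback makes it visible when the system has no evidence to offer.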

Development Approach with Cursor

To build the TF Triage Agent quickly and effectively, the Salesforce engineering team decided to use Cursor, an AI-powered pair programming and code-retrieval tool. This decision played a major role in speeding up development and reducing unnecessary engineering effort.

Normally, building a system like this would have taken several months of manual work. By using Cursor, the Salesforce engineering team was able to complete the project in just four to six weeks. Cursor’s strength lies in its deep integration with the codebase and its ability to provide real-time, contextually relevant code references while engineers are working.

During development, when the team needed to add a new similarity engine to the TF Triage Agent, Cursor made it easy to find existing code patterns that were already implemented elsewhere in the system. This meant engineers did not have to reinvent the wheel for every new component. Instead, they could quickly understand and reuse proven approaches.

Cursor was also valuable when the team faced scaling challenges. Instead of relying on trial and error, engineers could explore multiple architectural options suggested by Cursor and make informed decisions quickly. This ability to iterate fast helped the team build a more reliable and scalable system in a shorter time.

Another key benefit was that Cursor allowed Salesforce engineers to focus their time and energy on the core failure triage logic, which was the most complex and valuable part of the project. Tasks like searching through legacy code or writing repetitive boilerplate were handled much more efficiently with Cursor’s assistance.

Conclusion

By building and deploying the TF Triage Agent, the Salesforce engineering team was able to transform a slow, manual process into a fast and reliable automated workflow.

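The key lessons from this project mirror its main design decisions: semantic search with FAISS over historical failures, parsing pipelines and contextual embeddings for noisy logs, asynchronous pipelines decoupled from CI/CD, LLM reasoning grounded in retrieved examples, and AI-assisted development with Cursor.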

Together, these decisions led to a 30 percent faster test failure resolution time and a significant boost in developer productivity.

Reference:


