Salesforce Blog AI Research · September 28, 23:41
Challenges and Solutions for LLMs on CRM Tasks

Large language models (LLMs) struggle with CRM tasks, largely because of limited domain-specific training data and insufficient understanding of the business context. This shows up as query syntax errors, data model confusion, mishandling of underspecified tasks, and unfamiliarity with business workflows. Through a series of studies and experiments, Salesforce AI Research explored different ways to equip LLM agents with domain-specific tools, technical workflow descriptions, or function implementations to improve their performance on CRM tasks. The research finds that telling agents how to perform a task, rather than only what the task is, significantly improves performance, even without sophisticated function-calling abilities: raw SOQL/SOSL access yields roughly 31% accuracy, while full function implementations or detailed technical workflow descriptions raise it to as much as 72%. Even without technical workflows, non-technical workflow descriptions and human-written function implementation details remain useful. Future work will explore how agents can learn more effectively from humans or improve themselves by learning from past mistakes.

Main Takeaways

📊 CRM tasks are hard for current LLM agents, mainly because of limited training-data coverage and unfamiliarity with the business context, leading to query syntax errors, data model confusion, mishandling of underspecified tasks, and unfamiliarity with business workflows.

🔧 Supplying LLM agents with domain-specific knowledge, via prompts or tools, helps substantially: task-specific functions (TSF), refactored subroutines, or full workflow descriptions are far more effective than relying on raw SOQL/SOSL queries alone.

🗺️ Workflow descriptions are especially valuable. Even non-technical descriptions let the agent learn and generalize, raising accuracy significantly without sophisticated function-calling abilities, up to 54% (non-technical) and 72% (technical).

💡 Telling LLM agents how to complete a task, not just what the goal is, tends to work better than stating the goal alone: providing full function implementations (i.e., showing the source code) raises accuracy to 48%, well above what function headers alone achieve.

⚖️ In real-world deployments, agent autonomy must be balanced against the human effort required for setup (prompt engineering, coding, integration) so that LLM agents remain practical, scalable, and trustworthy in enterprise settings.

Background

LLM agents are seeing more and more applications in real life, from being personal assistants to helping software engineers write code and even working side by side with scientists on their research. With Agentforce, Salesforce’s trusted platform, we pioneer LLM agents for CRM applications like helping customers with their return and refund requests, coming up with the best pitch for sales representatives tailored towards their clients, and generating insights about employee productivity and roadblocks for managers.

While models such as GPT, Claude, and Gemini show impressive general abilities, CRM tasks are a different story. Their specialized nature and limited data coverage make it hard for LLMs to perform reliably. Furthermore, many of the resulting failures are "noob mistakes" that stem from an insufficient understanding of the business context and of specialized domain knowledge. (Check our blog about Why Generic LLM Agents Fall Short in Enterprise Environments for more details.)

To bridge the gap between LLMs' high general capability and low specialized capability, while accounting for the human-in-the-loop effort and trade-offs involved, we conducted a series of investigations and identified various tips and tricks to better unleash their performance on realistic CRM tasks.

Agentic Simulation Environment with CRMArena-Pro

Our benchmark of choice is the newly released CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions, developed by researchers at Salesforce AI Research. It consists of 22 tasks and 2,140 task instances spanning diverse categories such as workflow execution, policy compliance, and information retrieval. The main finding is the suboptimal performance of even state-of-the-art LLMs with tried-and-true agentic frameworks such as ReAct. For example, GPT-4o solves fewer than 30% of all tasks, while its reasoning-model counterpart, o1, still fails on just over 50% of them. The best-performing of the 9 flagship models from various providers, Gemini-2.5-pro, struggles to reach a 60% completion rate. (Check our blog on how to Evaluate LLM Agents for Enterprise Applications with CRMArena-Pro.)
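To make the agentic setup concrete: ReAct-style agents interleave model reasoning with tool calls in a loop. The sketch below is an illustrative, simplified loop, not the CRMArena-Pro harness itself; llm (a prompt-to-text callable) and tools (a name-to-callable map) are hypothetical stand-ins.

import re

def react_loop(task, llm, tools, max_steps=10):
    """Minimal ReAct-style loop: the model alternates reasoning ("Thought"),
    tool calls ("Action"), and tool results ("Observation") until it answers.
    `llm` is any callable prompt -> text; `tools` maps action names to callables."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(transcript + "\nThought:")
        transcript += "\nThought:" + step
        answer = re.search(r"Final Answer:\s*(.*)", step, re.S)
        if answer:
            return answer.group(1).strip()
        action = re.search(r"Action:\s*(\w+)\[(.*?)\]", step, re.S)
        if action:
            name, arg = action.group(1), action.group(2)
            result = tools[name](arg) if name in tools else f"Unknown action: {name}"
            transcript += f"\nObservation: {result}"
    return "No answer within step budget"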

After a thorough analysis of agents’ execution, we identified several potential reasons for the lack of performance.

    Query Syntax Limitations: Manipulating data on the Salesforce platform requires writing queries in the SOQL and SOSL languages. While these are similar to SQL, there are key differences, so the agent sometimes produces queries with illegal syntax (see the illustrative queries after this list). The agent can correct some mistakes after observing the error message, but for others, the attempted correction introduces further mistakes.

    Data Model/Schema Confusion: A hallmark feature of CRMArena-Pro is its intricate, inter-connected schemas, representing real-life business entities such as account manager, pricebook, order, lead, and voice call transcript. Agents often confuse related concepts, such as an order item vs. a pricebook entry, or a lead vs. an opportunity. As a result, they sometimes look up information in the wrong table, producing failed executions or wrong results.

    Ambiguity in Underspecified Tasks: Some task details are underspecified, such as whether a case that has been transferred from one customer service representative to another should count toward either of them (e.g., when calculating the average handling time), both, or neither. Agents often assume a particular answer outright, failing to realize that there is ambiguity to be clarified.

    Unfamiliarity with Business Workflow: Finally, even if the agent is clear on the data schema and task specification, it may still fail due to unfamiliarity with the business workflow. For example, while SOQL has fuzzy-search ability, most search tasks are better implemented with SOSL. Because the agent is generally unfamiliar with these fine-grained differences, it sometimes fails to use the correct tool, leading to excessively long outputs and very inefficient executions.
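To make the first failure mode concrete, the snippet below contrasts a SQL habit that SOQL rejects with valid SOQL and SOSL counterparts. The object and field names are standard Salesforce ones, but the specific queries are illustrative rather than taken from the benchmark.

# SQL habit that fails in SOQL: there is no "SELECT *"; fields must be listed explicitly.
invalid_query = "SELECT * FROM Case WHERE Status = 'Closed'"

# Valid SOQL: explicit fields, relationship traversal (Owner.Name), and date literals.
soql_query = (
    "SELECT Id, Subject, Owner.Name FROM Case "
    "WHERE Status = 'Closed' AND ClosedDate = LAST_N_DAYS:30"
)

# SOSL is usually the better fit for fuzzy, cross-object text search.
sosl_query = (
    "FIND {battery overheating} IN ALL FIELDS "
    "RETURNING Case(Id, Subject), Contact(Id, Name)"
)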

In the next few sections, we describe various ideas that we explored for augmenting the agent with additional information and tools. We focus on the "Structured Data Querying & Numerical Computation" skill group in CRMArena-Pro, as these tasks embody agentic procedural execution the most. The table below summarizes our main findings, which are explained in detail in the subsequent sections.

| | SOQL/SOSL Only (Original CRMArena-Pro Setup) | + Function Header Only: Task-Specific Functions (TSF) | + Function Header Only: TSF + Refactored Subroutines | + Full Function Implementation: TSF | + Full Function Implementation: TSF + Refactored Subroutines | + Ground Truth Workflow: Technical Description | + Ground Truth Workflow: Non-Technical Description |
|---|---|---|---|---|---|---|---|
| 5 Tasks | 0.33 | ~0.91 | ~0.91 | ~0.91 | ~0.91 | Did Not Evaluate | Did Not Evaluate |
| 3 Unseen Tasks | 0.31 | 0.31 | 0.32 | 0.48 | 0.34 | 0.72 | 0.54 |
| Human Effort | None | High | High | High | High | Medium | Low |

Beyond Raw SOQL and SOSL

In the CRMArena-Pro benchmark, by default LLM agents are restricted to using only two functions: SOQL and SOSL. Despite their versatility in theory, the agents must handle tasks completely autonomously, starting from scratch and relying solely on these two query languages. Human setup time is intentionally kept close to zero, simulating a hands-off, fully self-reasoning agent.

By comparison, in the real world, teams can provide LLM agents with additional custom actions tailored to the tasks they care about. These can include domain-specific tools, scripts, or workflows. Teams may even ask LLMs to generate new actions on the fly — though today, this often still requires human validation or expert-level coding to make them reliable.
With platforms like Agentforce, builders can accelerate this process by leveraging default action libraries and accessing existing metadata from their org. However, there’s an important tradeoff:

Finding the right balance between autonomy and setup effort is key to making LLM agents practical, scalable, and trustworthy in enterprise settings. To investigate the best ways to improve agent performance, we carefully study the characteristics of five tasks (handle time, transfer count, top issue identification, best region identification, and conversion rate comprehension), while leaving three others (monthly trend analysis, sales amount understanding, and sales cycle understanding) as challenge tasks to test agent generalization. As the first result column shows, with raw SOQL/SOSL access (i.e., the original CRMArena-Pro setup), the agent achieves a performance of 33% on the former five tasks and 31% on the latter three.
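In this default setup, the agent's entire action space is essentially two callables, one per query language. A minimal sketch of such a tool pair is shown below; it uses the simple_salesforce client as one possible backend with placeholder credentials, and is not the exact CRMArena-Pro wiring.

from simple_salesforce import Salesforce  # one possible backend; the benchmark harness may differ

sf = Salesforce(username="...", password="...", security_token="...")  # placeholder credentials

def execute_soql(query: str) -> list[dict]:
    """The agent's first action: run a SOQL query and return the matching records."""
    return sf.query_all(query)["records"]

def execute_sosl(search: str):
    """The agent's second action: run a SOSL full-text search via the REST search endpoint."""
    return sf.search(search)

TOOLS = {"execute_soql": execute_soql, "execute_sosl": execute_sosl}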

Higher-Level Functions Can Help, but With a Caveat

Our first exploration is to provide task-specific functions for agents to call. Writing these functions is time-consuming and requires programming expertise, so we expect that in most situations they will be provided for only a few tasks. At the same time, however, we would like the model to grasp the high-level goals conveyed by these functions. Therefore, our main evaluation is on a set of tasks that are not directly covered by them.

The most natural approach to providing functions to agents is to expose the function headers, as in the example below. This function finds the agent with the minimum or maximum average handle time over their assigned cases in a given period.

def find_agent_with_handle_time(start_date, end_date, min_cases, find_min=True):
    """
    Finds the agent with the specified handle time criteria.

    Parameters:
        start_date (str): Start date in 'YYYY-MM-DD' format.
        end_date (str): End date in 'YYYY-MM-DD' format.
        min_cases (int): Minimum number of cases the agent must have managed. All agents who handle (min_cases - 1) or fewer non-transferred cases will be excluded.
        find_min (bool): If True, find the agent with the minimum handle time. If False, find the maximum.

    Returns:
        str: The Id of the agent.
    """
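The blog does not specify exactly how such headers are surfaced to the agent; one common option, shown here as a hypothetical illustration, is to translate the signature and docstring into a function-calling tool schema (a Python dict in the JSON format used by several LLM APIs).

# Hypothetical tool specification derived from the header and docstring above.
find_agent_tool_spec = {
    "name": "find_agent_with_handle_time",
    "description": "Finds the agent with the specified handle time criteria.",
    "parameters": {
        "type": "object",
        "properties": {
            "start_date": {"type": "string", "description": "Start date in 'YYYY-MM-DD' format."},
            "end_date": {"type": "string", "description": "End date in 'YYYY-MM-DD' format."},
            "min_cases": {"type": "integer", "description": "Minimum number of cases the agent must have managed."},
            "find_min": {"type": "boolean", "description": "If True, find the minimum handle time; otherwise the maximum."},
        },
        "required": ["start_date", "end_date", "min_cases"],
    },
}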

We write one function for each of the five tasks that we studied, and the agent using them can achieve a very high performance of 91%.

Things are quite different, however, on the three unseen tasks. When we provide only the function headers of these task-specific functions (TSF), the agent achieves a performance of 31%. This is the same as the agent with only raw SOQL/SOSL access in the original CRMArena-Pro setup, suggesting that directly exposing the headers of these highly specialized functions is not helpful.

Given the monolithic nature of these functions, we hypothesize that providing more atomic subroutines may be beneficial. Thus, we ask GPT-4o (the LLM underlying our agent) to generate reusable subroutines from these high-level functions, with analogous header documentation. Then, we provide the agent with the headers of both the high-level functions and the subroutines. An example of such a subroutine is shown below.

def query_accounts_by_ids(account_ids):
    """
    Fetches account details for a list of account IDs.

    Parameters:
        account_ids (list): A list of account IDs.

    Returns:
        dict: A dictionary mapping account IDs to account details.
    """

Interestingly, we observe only a very slight increase in performance, to 32%, when giving the agent both types of function headers. Further analysis shows that while the agent sometimes uses these subroutines correctly, their implementations (generated by GPT-4o) can be problematic, leading to incorrect results or program crashes. Furthermore, since the source code is not exposed to the agent, it has extremely limited insight into why these errors occur and how to correct them. Thus, we conclude that providing subroutines via header documentation alone does not improve agent performance.

Motivated by the findings above, we next hypothesize that showing the full source code implementation could be beneficial, since the source code tells the agent not only what the functions do, but how they work. Note that the agent is still not allowed to execute arbitrary code — only the provided (high-level or subroutine) functions and raw SOQL/SOSL.
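The post does not reproduce a full implementation, but to illustrate what "full implementation" means here, a heavily simplified sketch of find_agent_with_handle_time might look like the following. It reuses the execute_soql tool sketched earlier and standard Case fields (OwnerId, CreatedDate, ClosedDate); the real function is more involved, for example in how it treats transferred cases.

from datetime import datetime

def find_agent_with_handle_time(start_date, end_date, min_cases, find_min=True):
    """Illustrative sketch only: average handle time per agent over cases closed in a window."""
    records = execute_soql(  # the raw SOQL tool sketched earlier
        "SELECT OwnerId, CreatedDate, ClosedDate FROM Case "
        f"WHERE ClosedDate >= {start_date}T00:00:00Z AND ClosedDate < {end_date}T00:00:00Z"
    )
    per_agent = {}  # OwnerId -> list of handle times in hours
    for rec in records:
        created = datetime.fromisoformat(rec["CreatedDate"][:19])  # keep only seconds precision
        closed = datetime.fromisoformat(rec["ClosedDate"][:19])
        per_agent.setdefault(rec["OwnerId"], []).append((closed - created).total_seconds() / 3600)
    # Exclude agents below the minimum case count, then pick the min or max average.
    averages = {
        owner: sum(times) / len(times)
        for owner, times in per_agent.items()
        if len(times) >= min_cases
    }
    return (min if find_min else max)(averages, key=averages.get)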

This turns out to be very helpful: the agent reaches 48% accuracy when provided with the full implementations of the high-level task-specific functions. By contrast, due to bugs introduced in the refactoring process, the agent that is given the (buggy) refactored implementations, and takes them as the source of truth, regresses to 34%, though this is still slightly higher than the two header-only setups. The significantly higher 48% accuracy suggests the value of providing correct, detailed, and actionable guidance to agents, especially outside their "natural habitats", i.e., in unfamiliar domains.

Workflow Description Is Very Helpful

Can we further improve performance on the unseen tasks? A natural idea, motivated by how new human employees are trained on the job, is to let the agent observe the workflow of a particular, representative task and ask it to extrapolate and generalize. We experiment with two types of workflows. The first is a technical workflow, in which we fully describe the procedure required for a task. Below is the beginning of an example workflow.

Suppose that we want to answer the following query: Today's date: 2021-05-09. Determine the agent with the quickest average time to close opportunities in the last 6 weeks.

We use the following workflow to answer this query:

Today's date is 2021-05-09, so six weeks ago is 2021-03-28. When we talk about the time it takes to close or sign an opportunity, we are interested in all opportunities whose corresponding contract has a company signed date falling within the interval of interest. Therefore, we first get all contracts with a company signed date within this time interval. We want to retrieve the company signed date and the contract ID (which will be linked to the opportunity). So we execute the following SOQL query:

SELECT Id, CompanySignedDate FROM Contract WHERE CompanySignedDate != NULL AND CompanySignedDate >= 2021-03-28 AND CompanySignedDate < 2021-05-09

This query results in the following records:

{'Id': '800Wt00000DDfifIAD', 'CompanySignedDate': '2021-04-27'}
{'Id': '800Wt00000DE1T0IAL', 'CompanySignedDate': '2021-04-15'}
{'Id': '800Wt00000DE42gIAD', 'CompanySignedDate': '2021-04-29'}

Then, for each contract ID, we need to find the corresponding opportunity with this ContractId__c. We need to retrieve the OwnerId (which corresponds to the agent), and the created date of the opportunity. We use the following SOQL query:

(additional text omitted)

Writing this one requires a human user to first study the task, write the SOQL/SOSL queries, and analyze the results. Naturally, the writer needs to have working knowledge of the database query language. Nonetheless, compared to providing the full task-specific functions, this is still much easier, as the human only needs to perform a demonstration for a concrete example, rather than laboriously coming up with a fully general function that covers all possible cases.

By comparison, the second workflow type is non-technical. For the same task, the text below gives the complete workflow description in this non-technical form.

Suppose that we want to answer the following query: Today's date: 2021-05-09. Determine the agent with the quickest average time to close opportunities in the last 6 weeks.

We use the following workflow to answer this query:

Today's date is 2021-05-09, so six weeks ago is 2021-03-28. When we talk about the time it takes to close or sign an opportunity, we are interested in all opportunities whose corresponding contract has a company signed date falling within the interval of interest. Therefore, we first get all contracts with a company signed date within this time interval. We want to retrieve the company signed date and the contract ID (which will be linked to the opportunity).

Then, for each contract ID, we need to find the corresponding opportunity with this ContractId__c. We need to retrieve the OwnerId (which corresponds to the agent), and the created date of the opportunity.

By combining the two results, we can calculate the average closing time for each agent as the difference between the contract's company signed date and the opportunity's created date. In the end, we return the agent with the shortest average closing time.

As we can see, there is no SOQL/SOSL query and no presentation of the specific query result. Instead, only the high-level procedure is given. This description should be very easy to write for anyone with a working knowledge of the system, even if they are not familiar with the actual database query language.
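In either form, the workflow boils down to joining contracts to their opportunities and averaging, per agent, the gap between the opportunity's created date and the contract's company signed date. The sketch below illustrates that final aggregation step, assuming the two query result sets have already been fetched as lists of dicts with the fields named in the workflow.

from datetime import date

def average_closing_time_per_agent(contracts, opportunities):
    """contracts: [{'Id', 'CompanySignedDate'}]; opportunities: [{'OwnerId', 'ContractId__c', 'CreatedDate'}].
    Returns the OwnerId with the shortest average (signed date - created date) in days."""
    signed_by_contract = {c["Id"]: date.fromisoformat(c["CompanySignedDate"]) for c in contracts}
    durations = {}  # OwnerId -> list of closing times in days
    for opp in opportunities:
        signed = signed_by_contract.get(opp["ContractId__c"])
        if signed is None:
            continue  # the opportunity's contract was not signed in the window of interest
        created = date.fromisoformat(opp["CreatedDate"][:10])  # keep only the date part
        durations.setdefault(opp["OwnerId"], []).append((signed - created).days)
    averages = {owner: sum(d) / len(d) for owner, d in durations.items()}
    return min(averages, key=averages.get)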

With these two workflow formats, the agent achieves 72% accuracy when given technical workflows and 54% when given non-technical workflows, suggesting a strong ability to learn and generalize from a single demonstrated instance.

Conclusion

Large language models (LLMs) struggle with specialized CRM tasks due to limited domain training data and insufficient business context, leading to errors in query syntax, schema confusion, ambiguity handling, and unfamiliarity with workflows. At Salesforce AI Research, we strive to make LLM agents better at CRM tasks, and to do so, we explore various ways to supplement LLM agents with domain-specific tools, technical workflow descriptions, or function implementations. We find that telling agents how to perform tasks, not just what to do, makes a significant difference, even without sophisticated function-calling abilities. While raw SOQL/SOSL access yields low task accuracy (~31%), providing full function implementations or detailed technical workflows can raise accuracy to as much as 72% with technical workflow descriptions. Even without technical workflows, their non-technical counterparts remain effective, as do, to a lesser extent, the implementation details of human-written functions. For future work, we will explore additional ways for agents to learn from humans, either passively or with a reasonable amount of human effort, as well as ways to make them better at improving themselves by learning from their past mistakes.
