Salesforce AI Research Blog · Oct 25, 07:04
Deep Research: A New Paradigm for Enterprise Information Synthesis and Decision-Making

What Is Deep Research?

Deep Research ≠ Deep Search.

You may have come across “Deep Search” features in tools like ChatGPT or Claude — designed to enhance retrieval and concise answers. While Deep Search focuses on retrieval performance and short-form answers, Deep Research is about understanding, reasoning, and synthesis — combining adaptive planning, retrieval, analysis, and context engineering to produce long-form, well-cited research outputs.

Think of it as moving from “find me something” → “explain and reason through this topic for me.”

Unlike simple search tasks, Deep Research requires patience, iteration, and strategic depth — similar to how a human analyst or consultant would work.

Traditional AI search systems are built to answer quick factual questions:

“What’s Salesforce’s revenue in 2024?”

But Deep Research asks:

“How is Salesforce’s revenue growth correlated with generative AI adoption in the enterprise sector, and what can we learn from competitors’ go-to-market shifts?”

This shift changes everything: the time expectation, breadth of exploration, and depth of reasoning are all far greater.

Deep Research blends planning, reasoning, and writing — not just retrieval.

Key building blocks include:

What Is Enterprise Deep Research — and Why Does It Matter?

In an enterprise setting, research doesn’t live in isolation. Information is scattered across:

Enterprise Deep Research bridges both worlds — combining internal knowledge and external insights to serve strategic business goals.

Example business applications include:

The outcome isn’t just answers — it’s insights that drive action. 

Enterprise Deep Research reports are powerful tools that serve multiple purposes. They act like intelligent consultants—guiding every employee from individual contributors to senior executives in making better decisions. 

These reports distill complex knowledge into accessible insights that can fuel real-time services, accelerate answer discovery, and uncover hidden patterns or root causes behind business challenges.

Unique Challenges in Enterprise Deep Research

    Planner Intelligence
      Does the system know where to search for what?
      Can it balance internal vs. external data sources?
      How does it manage time-sensitive, contradictory, or incomplete information?
      How does it coordinate across tools like Salesforce, Slack, Google Workspace, and LinkedIn?
    Tool and Data Access
      Are the right APIs and connectors in place?
      Can the system parse and retrieve data accurately from structured and unstructured sources?
    Privacy and Access Control
      Internal data isn’t open to everyone. Who is allowed to see what?
      How can the system respect permission hierarchies and data residency rules?
    Citation and Evaluation
      How do we ensure every insight is traceable to its source?
      How do we evaluate the quality of research when human expertise is uneven or fragmented?
      How do we detect duplicated or conflicting information across systems?
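To make the privacy point concrete, here is a minimal sketch (all class, field, and role names are hypothetical) of a permission-aware retrieval filter that drops restricted documents before the research agent ever sees their text:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    source: str                         # e.g. "salesforce", "slack", "gdocs"
    allowed_roles: set = field(default_factory=set)

@dataclass
class User:
    user_id: str
    roles: set

def permission_filter(user: User, candidates: list) -> list:
    """Drop any document the user's roles do not cover.

    Enforcing this *before* the agent reads the content prevents
    restricted text from leaking into prompts or citations.
    """
    return [d for d in candidates if d.allowed_roles & user.roles]

# Usage: an analyst with only the "sales" role sees only sales-visible docs.
docs = [
    Document("d1", "salesforce", {"sales", "exec"}),
    Document("d2", "slack", {"exec"}),
]
analyst = User("u1", {"sales"})
visible = permission_filter(analyst, docs)
assert [d.doc_id for d in visible] == ["d1"]
```

The key design choice is that filtering happens at retrieval time, per request, so the same research pipeline serves users with different permission levels without separate indexes.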

An Example of Enterprise Deep Research

In one of our internal use cases for sales, we designed a modular, multi-graph architecture that mirrors how human researchers operate, dividing and conquering through specialized sub-systems that collaborate intelligently.

1. Planner Sub-Graph

The Planner is the brain of the system — decomposing high-level research goals into actionable subtasks.
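As a rough illustration of the structure a planner might emit (the decomposition below is hard-coded and every name is hypothetical; a real planner would generate subtasks with an LLM):

```python
from dataclasses import dataclass

@dataclass
class SubTask:
    task_id: str
    question: str
    source: str         # which tool or data source should answer this
    depends_on: list    # task_ids that must complete first

def plan(goal: str) -> list:
    """Toy planner: decompose a research goal into ordered subtasks.

    The point is the output shape the downstream Orchestrator consumes,
    not the decomposition logic itself.
    """
    return [
        SubTask("t1", f"Internal account data relevant to: {goal}", "salesforce", []),
        SubTask("t2", f"External market signals for: {goal}", "web_search", []),
        SubTask("t3", "Synthesize t1 + t2 into findings", "writer", ["t1", "t2"]),
    ]

tasks = plan("Q3 pipeline risk for EMEA accounts")
ready = [t.task_id for t in tasks if not t.depends_on]
assert ready == ["t1", "t2"]   # independent subtasks can run in parallel
```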


2. Orchestrator Sub-Graph

The Orchestrator is the project manager of Deep Research — overseeing each subtask and synthesizing partial findings into a unified report.


3. Task Researcher / Executor Sub-Graph

Each executor agent focuses on a single task — whether that’s querying Salesforce data, summarizing a GitHub repo, or running a code experiment.
They act as the hands of the system, executing with precision, feeding results back to the orchestrator for synthesis.
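The coordination pattern between these sub-graphs can be sketched in a few lines. This toy version (stub executors stand in for real agents; all names are hypothetical) shows only the dependency-driven loop, not our actual implementation:

```python
from dataclasses import dataclass

@dataclass
class SubTask:
    task_id: str
    source: str
    depends_on: list

def run_research(tasks, executors):
    """Toy orchestrator: run each subtask once its dependencies are
    done, then stitch the partial findings into one report body."""
    findings = {}
    pending = list(tasks)
    while pending:
        progressed = False
        for task in list(pending):
            if all(dep in findings for dep in task.depends_on):
                findings[task.task_id] = executors[task.source](task, findings)
                pending.remove(task)
                progressed = True
        if not progressed:                  # nothing runnable => bad plan
            raise RuntimeError("dependency cycle in plan")
    return "\n".join(findings[t.task_id] for t in tasks)

# Usage with stub executors standing in for real agents.
tasks = [
    SubTask("t1", "crm", []),
    SubTask("t2", "writer", ["t1"]),
]
executors = {
    "crm": lambda t, f: "finding: 12 at-risk accounts",
    "writer": lambda t, f: f"summary of [{f['t1']}]",
}
report = run_research(tasks, executors)
assert "12 at-risk accounts" in report
```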

4. Tools

Our framework employs specialized tools designed to navigate distinct data landscapes. These tools act as the hands of the system, executing targeted queries to build a comprehensive view. The primary tools include:

The Planner Sub-Graph intelligently coordinates these tools, determining whether to deploy them in parallel or sequentially based on the research task.
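A minimal sketch of that parallel-vs-sequential decision, assuming each tool call has been wrapped as a zero-argument callable (the wrapping and names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch(tool_calls, parallel: bool):
    """Run independent tool calls concurrently; dependent ones in order.

    ThreadPoolExecutor.map preserves input order, so results line up
    with the calls either way.
    """
    if parallel:
        with ThreadPoolExecutor() as pool:
            return list(pool.map(lambda call: call(), tool_calls))
    return [call() for call in tool_calls]

# Independent lookups can fan out in parallel...
results = dispatch([lambda: "crm rows", lambda: "news hits"], parallel=True)
assert results == ["crm rows", "news hits"]
```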

Enterprise Deep Research Evaluation

Having established the primary capabilities of these systems, we must assess Deep Research models with evaluation methods distinct from those applied to search or Q&A models.

In Deep Research, accuracy is necessary, but not sufficient. A system can be 100% correct on individual data points yet fail to produce a useful strategic analysis. Therefore, our evaluation must shift. 

The ultimate goal is to measure how effectively an agent can understand a complex goal, reason across disparate sources, and synthesize information into a coherent and actionable report.

Why Does Benchmarking Matter?

Without structured evaluation, Deep Research remains anecdotal: users can only say “it feels smarter” or “this report looks good.” 

That isn’t enough for the enterprise. Enterprise contexts demand repeatable, explainable, and quantitative benchmarks. This rigor is especially vital when multiple agents (OpenAI, Gemini, Slackbot, etc.) are producing business-critical analyses.

Benchmarking Deep Research systems is not just about leaderboard scores. It’s about building trust. For any enterprise user to act on an AI-generated insight, they need clear answers to three fundamental questions:

Benchmark Practices in Pipeline

Our recent research (SFR-DeepResearch [1], DeepTrace [2], HERB [3], and LiveResearchBench [4]) provides complementary perspectives on how to measure these capabilities:

Together, they motivate an enterprise evaluation that spans coverage/recall, citation accuracy & auditability, reasoning coherence, and readability, while keeping a clear boundary between Deep Search (finding the right evidence) and Deep Research (planning, reasoning, and long-form synthesis).

In our internal benchmarks, inspired by these frameworks, we propose five core dimensions for enterprise evaluation:

| Dimension | What It Measures | Why It Matters |
|---|---|---|
| Coverage | How broadly and deeply the agent explores relevant information across sources | Ensures completeness; critical for strategic and competitive analyses |
| Citation Accuracy & Thoroughness | Whether insights are verifiably grounded in credible internal or external evidence | Builds trust and accountability in enterprise decisions |
| Reasoning Coherence | Logical consistency and clarity of multi-step reasoning | Reflects analytical depth; shows the model “thinks” like an analyst |
| Readability & Structure | Clarity, organization, and fluency of the final report | Makes outputs usable by business stakeholders and leadership |
| Internal Info Richness | How effectively the agent leverages proprietary internal enterprise data | Measures the system’s ability to combine internal truth with external insight |
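As one illustration of how a dimension like Citation Accuracy can be operationalized, here is a simple sketch (a hypothetical helper, not our production metric) that scores the fraction of citations whose quoted snippet actually appears in the cited source:

```python
def citation_accuracy(citations, sources):
    """Fraction of citations whose quoted snippet appears verbatim in
    the cited source. `citations` is a list of (source_id, snippet)
    pairs; `sources` maps source_id -> full document text."""
    if not citations:
        return 0.0
    verified = sum(
        1 for source_id, snippet in citations
        if snippet in sources.get(source_id, "")
    )
    return verified / len(citations)

# One citation checks out; the other points at a missing source.
sources = {"doc1": "Revenue grew 11% year over year in FY24."}
citations = [("doc1", "grew 11%"), ("doc2", "margin fell")]
assert citation_accuracy(citations, sources) == 0.5
```

A production metric would need fuzzy matching and paraphrase detection rather than exact substring checks, but the accounting is the same: every claim must resolve to a retrievable source.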


Implementation and Results in Enterprise

To validate our evaluation framework, we benchmarked multiple systems on a diverse set of Enterprise Deep Research tasks: long-form reports that combine Salesforce’s internal knowledge with external data. 

Based on our enterprise use case of sales report generation, and motivated by the benchmark dimensions above, we selected four dimensions to evaluate in our pipeline: Report Readability & Structure, Internal Richness / Alignment, Citation Accuracy, and Coverage.

Our results below show that while readability is largely a solved challenge for most modern LLM agents, enterprise grounding and traceability clearly distinguish our Deep Research system. It not only generates fluent, well-structured reports but also anchors every insight to verifiable sources and Salesforce’s internal knowledge graph: a foundation for decision-grade trust in enterprise AI. 

| Evaluation Dimension | Gemini | OpenAI | SlackBot (Salesforce) | Salesforce AIR |
|---|---|---|---|---|
| Citation Accuracy | 45.8% | 39.5% | * | 79.2% |
| Coverage | 3.28 / 5 | 2.98 / 5 | 3.28 / 5 | 3.02 / 5 |
| Internal Info Richness | 42.3% | 30.2% | 61.4% | 73.7% |
| Readability & Structure | 3.6 / 5 | 3.6 / 5 | 3.6 / 5 | 3.6 / 5 |

\* Not scored due to inaccessible org citations.
Note: the Salesforce AIR numbers in this table use OpenAI GPT-4.1-mini.

With trust at the core of Salesforce’s values, we took an extra step to ensure our benchmark results were not only quantitative but also human-verified. To validate consistency, we conducted a parallel evaluation with expert human annotators and achieved a Fleiss’ κ (kappa) score of 0.6, indicating substantial agreement between model judgments and human evaluations. 
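Fleiss’ κ itself is straightforward to compute from an items-by-categories ratings matrix; the sketch below implements the standard formula (generic code, not our internal evaluation harness):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a ratings matrix.

    counts[i][j] = number of raters who put item i in category j;
    every row must sum to the same number of raters n.
    """
    N = len(counts)                      # number of items
    n = sum(counts[0])                   # raters per item
    # Mean per-item agreement P_bar
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ) / N
    # Chance agreement from marginal category proportions
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    P_e = sum((t / (N * n)) ** 2 for t in totals)
    return (P_bar - P_e) / (1 - P_e)

# Perfect agreement on every item gives kappa = 1.0.
assert fleiss_kappa([[3, 0], [0, 3], [3, 0]]) == 1.0
```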

Cases like these highlight the next frontier of Enterprise Deep Research, where secure organizational data can be accessed, reasoned over, and synthesized to power diverse business use cases. 

They also showcase the need for enterprise-oriented benchmarks and frameworks that measure not just accuracy or fluency, but the real-world impact and reliability of AI-driven research.

Conclusion

Enterprise Deep Research represents a significant leap beyond traditional search, transforming how businesses harness information. By integrating adaptive planning, diverse retrieval, sophisticated analysis, and long-form synthesis, it moves from simply answering “what” to deeply explaining “why” and “how.” 

The unique challenges of enterprise settings—spanning internal and external data, access control, and robust citation—necessitate a specialized approach. Our multi-graph architecture, designed to mimic human research, addresses these complexities by orchestrating specialized agents and tools. Ultimately, the evaluation of such systems must transcend mere accuracy to focus on trust, traceability, transparency, and consistency. 

Our internal benchmarks, validated by human experts, demonstrate that our Deep Research system excels in providing accurate, well-cited, and contextually rich insights from both proprietary and public data, setting a new standard for decision-grade AI in the enterprise.

Citation

Please cite this work as:

```
Chien-Sheng Wu, Prafulla Kumar Choubey, Kung-Hsiang Huang, Jiaxin Zhang, Pranav Narayanan Venkit, “Towards Trustworthy Enterprise Deep Research”, Salesforce AI Research, Oct 2025.
```

Or use the BibTeX citation:

```
@article{wu2025sfrdeepresearch,
  author  = {Chien-Sheng Wu and Prafulla Kumar Choubey and Kung-Hsiang Huang and Jiaxin Zhang and Pranav Narayanan Venkit},
  title   = {Towards Trustworthy Enterprise Deep Research},
  journal = {Salesforce AI Research: Blog},
  year    = {2025},
}
```
