Salesforce AI Research Blog · Oct 25, 07:04
Deep Research: A New Paradigm for Enterprise Information Synthesis and Decision-Making

What Is Deep Research?

Deep Research ≠ Deep Search.

You may have come across “Deep Search” features in tools like ChatGPT or Claude — designed to enhance retrieval and concise answers. While Deep Search focuses on retrieval performance and short-form answers, Deep Research is about understanding, reasoning, and synthesis — combining adaptive planning, retrieval, analysis, and context engineering to produce long-form, well-cited research outputs.

Think of it as moving from “find me something” → “explain and reason through this topic for me.”

Unlike simple search tasks, Deep Research requires patience, iteration, and strategic depth — similar to how a human analyst or consultant would work.

Traditional AI search systems are built to answer quick factual questions:

“What’s Salesforce’s revenue in 2024?”

But Deep Research asks:

“How is Salesforce’s revenue growth correlated with generative AI adoption in the enterprise sector, and what can we learn from competitors’ go-to-market shifts?”

This shift changes everything: the time expectation, breadth of exploration, and depth of reasoning are all far greater.

Deep Research blends planning, reasoning, and writing — not just retrieval.

Key building blocks include:

What Is Enterprise Deep Research — and Why Does It Matter?

In an enterprise setting, research doesn’t live in isolation. Information is scattered across:

Enterprise Deep Research bridges both worlds — combining internal knowledge and external insights to serve strategic business goals.

Example business applications include:

The outcome isn’t just answers — it’s insights that drive action. 

Enterprise Deep Research reports are powerful tools that serve multiple purposes. They act like intelligent consultants—guiding every employee from individual contributors to senior executives in making better decisions. 

These reports distill complex knowledge into accessible insights that can fuel real-time services, accelerate answer discovery, and uncover hidden patterns or root causes behind business challenges.

Unique Challenges in Enterprise Deep Research

    Planner Intelligence
      Does the system know where to search for what?
      Can it balance internal vs. external data sources?
      How does it manage time-sensitive, contradictory, or incomplete information?
      How does it coordinate across tools like Salesforce, Slack, Google Workspace, and LinkedIn?
    Tool and Data Access
      Are the right APIs and connectors in place?
      Can the system parse and retrieve data accurately from structured and unstructured sources?
    Privacy and Access Control
      Internal data isn’t open to everyone. Who is allowed to see what?
      How can the system respect permission hierarchies and data residency rules?
    Citation and Evaluation
      How do we ensure every insight is traceable to its source?
      How do we evaluate the quality of research when human expertise is uneven or fragmented?
      How do we detect duplicated or conflicting information across systems?
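To make the privacy point concrete, here is a minimal sketch (all class, field, and role names are hypothetical) of a permission-aware retrieval filter that drops restricted documents before the research agent ever sees their text:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    source: str                         # e.g. "salesforce", "slack", "gdocs"
    allowed_roles: set = field(default_factory=set)

@dataclass
class User:
    user_id: str
    roles: set

def permission_filter(user: User, candidates: list) -> list:
    """Drop any document the user's roles do not cover.

    Enforcing this *before* the agent reads the content prevents
    restricted text from leaking into prompts or citations.
    """
    return [d for d in candidates if d.allowed_roles & user.roles]

# Usage: an analyst with only the "sales" role sees only sales-visible docs.
docs = [
    Document("d1", "salesforce", {"sales", "exec"}),
    Document("d2", "slack", {"exec"}),
]
analyst = User("u1", {"sales"})
visible = permission_filter(analyst, docs)
assert [d.doc_id for d in visible] == ["d1"]
```

The key design choice is that filtering happens at retrieval time, per request, so the same research pipeline serves users with different permission levels without separate indexes.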

An Example of Enterprise Deep Research

In one of our internal use cases for sales, we designed a modular, multi-graph architecture that mirrors how human researchers operate, dividing and conquering through specialized sub-systems that collaborate intelligently.

1. Planner Sub-Graph

The Planner is the brain of the system — decomposing high-level research goals into actionable subtasks.
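As a rough illustration of the structure a planner might emit (the decomposition below is hard-coded and every name is hypothetical; a real planner would generate subtasks with an LLM):

```python
from dataclasses import dataclass

@dataclass
class SubTask:
    task_id: str
    question: str
    source: str         # which tool or data source should answer this
    depends_on: list    # task_ids that must complete first

def plan(goal: str) -> list:
    """Toy planner: decompose a research goal into ordered subtasks.

    The point is the output shape the downstream Orchestrator consumes,
    not the decomposition logic itself.
    """
    return [
        SubTask("t1", f"Internal account data relevant to: {goal}", "salesforce", []),
        SubTask("t2", f"External market signals for: {goal}", "web_search", []),
        SubTask("t3", "Synthesize t1 + t2 into findings", "writer", ["t1", "t2"]),
    ]

tasks = plan("Q3 pipeline risk for EMEA accounts")
ready = [t.task_id for t in tasks if not t.depends_on]
assert ready == ["t1", "t2"]   # independent subtasks can run in parallel
```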


2. Orchestrator Sub-Graph

The Orchestrator is the project manager of Deep Research — overseeing each subtask and synthesizing partial findings into a unified report.


3. Task Researcher / Executor Sub-Graph

Each executor agent focuses on a single task — whether that’s querying Salesforce data, summarizing a GitHub repo, or running a code experiment.
They act as the hands of the system, executing with precision, feeding results back to the orchestrator for synthesis.
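The coordination pattern between these sub-graphs can be sketched in a few lines. This toy version (stub executors stand in for real agents; all names are hypothetical) shows only the dependency-driven loop, not our actual implementation:

```python
from dataclasses import dataclass

@dataclass
class SubTask:
    task_id: str
    source: str
    depends_on: list

def run_research(tasks, executors):
    """Toy orchestrator: run each subtask once its dependencies are
    done, then stitch the partial findings into one report body."""
    findings = {}
    pending = list(tasks)
    while pending:
        progressed = False
        for task in list(pending):
            if all(dep in findings for dep in task.depends_on):
                findings[task.task_id] = executors[task.source](task, findings)
                pending.remove(task)
                progressed = True
        if not progressed:                  # nothing runnable => bad plan
            raise RuntimeError("dependency cycle in plan")
    return "\n".join(findings[t.task_id] for t in tasks)

# Usage with stub executors standing in for real agents.
tasks = [
    SubTask("t1", "crm", []),
    SubTask("t2", "writer", ["t1"]),
]
executors = {
    "crm": lambda t, f: "finding: 12 at-risk accounts",
    "writer": lambda t, f: f"summary of [{f['t1']}]",
}
report = run_research(tasks, executors)
assert "12 at-risk accounts" in report
```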

4. Tools

Our framework employs specialized tools designed to navigate distinct data landscapes. These tools act as the hands of the system, executing targeted queries to build a comprehensive view. The primary tools include:

The Planner Sub-Graph intelligently coordinates these tools, determining whether to deploy them in parallel or sequentially based on the research task.
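A minimal sketch of that parallel-vs-sequential decision, assuming each tool call has been wrapped as a zero-argument callable (the wrapping and names are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch(tool_calls, parallel: bool):
    """Run independent tool calls concurrently; dependent ones in order.

    ThreadPoolExecutor.map preserves input order, so results line up
    with the calls either way.
    """
    if parallel:
        with ThreadPoolExecutor() as pool:
            return list(pool.map(lambda call: call(), tool_calls))
    return [call() for call in tool_calls]

# Independent lookups can fan out in parallel...
results = dispatch([lambda: "crm rows", lambda: "news hits"], parallel=True)
assert results == ["crm rows", "news hits"]
```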

Enterprise Deep Research Evaluation

Having established the primary capabilities of these systems, we must assess Deep Research models with evaluation methods distinct from those applied to search or Q&A models.

In Deep Research, accuracy is necessary, but not sufficient. A system can be 100% correct on individual data points yet fail to produce a useful strategic analysis. Therefore, our evaluation must shift. 

The ultimate goal is to measure how effectively an agent can understand a complex goal, reason across disparate sources, and synthesize information into a coherent and actionable report.

Why Does Benchmarking Matter?

Without structured evaluation, Deep Research remains anecdotal: users can only say “it feels smarter” or “this report looks good.” 

That isn’t enough for the enterprise. Enterprise contexts demand repeatable, explainable, and quantitative benchmarks. This rigor is especially vital when multiple agents (OpenAI, Gemini, Slackbot, etc.) are producing business-critical analyses.

Benchmarking Deep Research systems is not just about leaderboard scores. It’s about building trust. For any enterprise user to act on an AI-generated insight, they need clear answers to three fundamental questions:

Benchmark Practices in Pipeline

Our recent research (SFR-DeepResearch [1], DeepTrace [2], HERB [3], and LiveResearchBench [4]) provides complementary perspectives on how to measure these capabilities:

Together, they motivate an enterprise evaluation that spans coverage/recall, citation accuracy & auditability, reasoning coherence, and readability, while keeping a clear boundary between Deep Search (finding the right evidence) and Deep Research (planning, reasoning, and long-form synthesis).

In our internal benchmarks, inspired by these frameworks, we propose five core dimensions for enterprise evaluation:

| Dimension | What It Measures | Why It Matters |
|---|---|---|
| Coverage | How broadly and deeply the agent explores relevant information across sources | Ensures completeness; critical for strategic and competitive analyses |
| Citation Accuracy & Thoroughness | Whether insights are verifiably grounded in credible internal or external evidence | Builds trust and accountability in enterprise decisions |
| Reasoning Coherence | Logical consistency and clarity of multi-step reasoning | Reflects analytical depth; shows the model “thinks” like an analyst |
| Readability & Structure | Clarity, organization, and fluency of the final report | Makes outputs usable by business stakeholders and leadership |
| Internal Info Richness | How effectively the agent leverages proprietary internal enterprise data | Measures the system’s ability to combine internal truth with external insight |
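As one illustration of how a dimension like Citation Accuracy can be operationalized, here is a simple sketch (a hypothetical helper, not our production metric) that scores the fraction of citations whose quoted snippet actually appears in the cited source:

```python
def citation_accuracy(citations, sources):
    """Fraction of citations whose quoted snippet appears verbatim in
    the cited source. `citations` is a list of (source_id, snippet)
    pairs; `sources` maps source_id -> full document text."""
    if not citations:
        return 0.0
    verified = sum(
        1 for source_id, snippet in citations
        if snippet in sources.get(source_id, "")
    )
    return verified / len(citations)

# One citation checks out; the other points at a missing source.
sources = {"doc1": "Revenue grew 11% year over year in FY24."}
citations = [("doc1", "grew 11%"), ("doc2", "margin fell")]
assert citation_accuracy(citations, sources) == 0.5
```

A production metric would need fuzzy matching and paraphrase detection rather than exact substring checks, but the accounting is the same: every claim must resolve to a retrievable source.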


Implementation and Results in Enterprise

To validate our evaluation framework, we benchmarked multiple systems on a diverse set of Enterprise Deep Research tasks: long-form reports that combine Salesforce’s internal knowledge with external data. 

Based on our enterprise use case of sales report generation, and motivated by the benchmark dimensions above, we selected four dimensions to evaluate in our pipeline: Report Readability & Structure, Internal Richness / Alignment, Citation Accuracy, and Coverage.

Our results below show that while readability is largely a solved challenge for most modern LLM agents, enterprise grounding and traceability clearly distinguish our Deep Research system. It not only generates fluent, well-structured reports but also anchors every insight to verifiable sources and Salesforce’s internal knowledge graph: a foundation for decision-grade trust in enterprise AI. 

| Evaluation Dimension | Gemini | OpenAI | SlackBot (Salesforce) | Salesforce AIR |
|---|---|---|---|---|
| Citation Accuracy | 45.8% | 39.5% | * | 79.2% |
| Coverage | 3.28 / 5 | 2.98 / 5 | 3.28 / 5 | 3.02 / 5 |
| Internal Info Richness | 42.3% | 30.2% | 61.4% | 73.7% |
| Readability & Structure | 3.6 / 5 | 3.6 / 5 | 3.6 / 5 | 3.6 / 5 |

\* Not scored due to inaccessible org citations.
Note: the Salesforce AIR numbers in this table use OpenAI GPT-4.1-mini.

With trust at the core of Salesforce’s values, we took an extra step to ensure our benchmark results were not only quantitative but also human-verified. To validate consistency, we conducted a parallel evaluation with expert human annotators and achieved a Fleiss’ κ (kappa) score of 0.6, indicating substantial agreement between model judgments and human evaluations. 
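Fleiss’ κ itself is straightforward to compute from an items-by-categories ratings matrix; the sketch below implements the standard formula (generic code, not our internal evaluation harness):

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a ratings matrix.

    counts[i][j] = number of raters who put item i in category j;
    every row must sum to the same number of raters n.
    """
    N = len(counts)                      # number of items
    n = sum(counts[0])                   # raters per item
    # Mean per-item agreement P_bar
    P_bar = sum(
        (sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts
    ) / N
    # Chance agreement from marginal category proportions
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    P_e = sum((t / (N * n)) ** 2 for t in totals)
    return (P_bar - P_e) / (1 - P_e)

# Perfect agreement on every item gives kappa = 1.0.
assert fleiss_kappa([[3, 0], [0, 3], [3, 0]]) == 1.0
```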

Cases like these highlight the next frontier of Enterprise Deep Research, where secure organizational data can be accessed, reasoned over, and synthesized to power diverse business use cases. 

They also showcase the need for enterprise-oriented benchmarks and frameworks that measure not just accuracy or fluency, but the real-world impact and reliability of AI-driven research.

Conclusion

Enterprise Deep Research represents a significant leap beyond traditional search, transforming how businesses harness information. By integrating adaptive planning, diverse retrieval, sophisticated analysis, and long-form synthesis, it moves from simply answering “what” to deeply explaining “why” and “how.” 

The unique challenges of enterprise settings—spanning internal and external data, access control, and robust citation—necessitate a specialized approach. Our multi-graph architecture, designed to mimic human research, addresses these complexities by orchestrating specialized agents and tools. Ultimately, the evaluation of such systems must transcend mere accuracy to focus on trust, traceability, transparency, and consistency. 

Our internal benchmarks, validated by human experts, demonstrate that our Deep Research system excels in providing accurate, well-cited, and contextually rich insights from both proprietary and public data, setting a new standard for decision-grade AI in the enterprise.

Citation

Please cite this work as:

```
Chien-Sheng Wu, Prafulla Kumar Choubey, Kung-Hsiang Huang, Jiaxin Zhang, Pranav Narayanan Venkit, “Towards Trustworthy Enterprise Deep Research”, Salesforce AI Research, Oct 2025.
```

Or use the BibTeX citation:

```
@article{wu2025sfrdeepresearch,
  author  = {Chien-Sheng Wu and Prafulla Kumar Choubey and Kung-Hsiang Huang and Jiaxin Zhang and Pranav Narayanan Venkit},
  title   = {Towards Trustworthy Enterprise Deep Research},
  journal = {Salesforce AI Research: Blog},
  year    = {2025},
}
```
