Braintrust Blog · October 21, 10:58
Portola empowers nontechnical experts to improve conversation quality in the AI companion Tolan

Portola built Tolan, an AI companion app designed to give users authentic, non-romantic emotional connection. To ensure Tolan's conversation quality, emotional intelligence, and authentic behavior, Portola created a distinctive workflow that lets nontechnical experts, such as a behavioral researcher, a science fiction writer, and a game designer, review logs, curate datasets, and improve prompts directly, without involving the engineering team. The workflow runs on the Braintrust platform, where experts create fine-grained, problem-specific datasets and iterate through manual evaluation in Playgrounds. The result was a 4x increase in iteration velocity, effectively addressing challenges in the AI companion's memory system, conversation patterns, and emotional responses, and offering a new model for building trustworthy AI relationships.

🤖 **Tolan's emotional-connection core**: Tolan is an AI companion app whose core purpose is authentic, non-romantic emotional connection, setting it apart from conventional chatbots. Through natural voice conversation and a complex memory system, it simulates friend-like interaction so users feel understood and valued. Its design explores how humans can build healthy relationships with AI, emphasizing emotional resonance and how personalized memories are surfaced, rather than simple information retrieval.

🧠 **Key factors for building trust**: Tolan's success rests on three factors. 1. **An authentic memory system**: the AI must not only remember information but surface it the way a human friend would, weaving it into conversation at the right moment to create a feeling of being special. 2. **Natural emotional responses**: the AI's reactions to a user's emotions and communication style must feel fluid and emotionally resonant rather than mechanical, reflected in word choice, conversational pacing, and emotional expression. 3. **Avoiding "AI tells"**: continuous monitoring and adjustment reduce patterns users can easily recognize as AI, such as too many either-or questions or ill-fitting internet slang, preserving the illusion of authentic connection.

🛠️ **A workflow that empowers nontechnical experts**: Using the Braintrust platform, Portola built a workflow that lets nontechnical experts (such as behavioral researcher Lily) participate directly in iterating on the AI. Experts spend time each day reviewing conversation logs, identifying problem patterns (such as unwanted therapy-style questions, excessive multiple-choice questions, or overused slang) and creating targeted datasets. These datasets are then used in Playgrounds for manual evaluation and prompt iteration, with no engineering coordination required, greatly increasing the speed and efficiency of problem-solving.

🚀 **A leap in both iteration speed and quality**: The workflow increased Portola's iteration velocity fourfold, with nontechnical experts independently owning the full cycle from problem identification through dataset creation and solution testing to production deployment. This end-to-end autonomy not only addresses the difficulty of improving quality in subjective, emotionally complex domains (such as companionship and mental health support), it also lets the team respond quickly to regressions introduced by model switches and systematically improve the memory system, conversational flow, brand voice consistency, and handling of sensitive topics.

20 October 2025 · Ornella Altunyan

Portola builds Tolan, an AI companion app that serves as an alien best friend for users seeking authentic, non-romantic AI relationships. Unlike typical chatbots or productivity assistants, Tolan focuses on creating genuine emotional connections through natural voice conversations and complex memory systems. As Portola's team built an AI that users could truly trust, they realized that the nuances of conversation quality, emotional intelligence, and authentic behavior couldn't be captured by automated evals alone.

In this case study, we'll explore how Portola structured their workflow to empower nontechnical subject matter experts, including a behavioral researcher, science fiction writer, and game designer, to spend hours daily reviewing logs, curating datasets, and shipping prompt improvements directly to production without engineering bottlenecks.

"How will humanity build healthy relationships with AI?" asks Quinten Farmer, Portola's CEO. "That's the question Tolan explores."

Creating an AI companion that feels authentically human demands deep domain expertise in psychology, storytelling, and conversation design. The team identified three critical factors for building user trust in their AI companion:

The memory system needs to work in a way that feels authentic to how a friend would remember things. Perfect recall matters less than the subtle nuance of what gets remembered, when it surfaces naturally, and how it integrates into conversations. "When you're chatting with your Tolan and they remember some detail about your life and bring it into the conversation, it feels really special," Quinten explains.

How Tolan reflects and responds to user emotions and communication styles must feel natural, not algorithmic. This involves vocabulary choices, conversation pacing, and emotional resonance that can't be reduced to simple metrics.

Certain question patterns and engagement behaviors immediately signal "this is AI" to users, breaking the illusion of authentic connection. Patterns like asking too many "or questions" ("Would you rather have waffles or pancakes?") or using excessive Gen Z slang need constant monitoring and adjustment.

The technical complexity of Portola's system compounds these challenges. Their prompting pipeline integrates memory retrieval, dynamically generated user context, real-time voice processing, and content that users share (like photos) into a cohesive conversation flow. There's a subjective, contextual quality that makes conversations feel authentic.
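To make the shape of that pipeline concrete, here is a minimal, hypothetical sketch of how such components might be assembled into a single system prompt. The function and field names are assumptions for illustration, not Portola's actual implementation.

```python
def build_system_prompt(
    base_persona: str,
    memories: list[str],
    user_context: str,
    shared_content: list[str],
) -> str:
    """Assemble one system prompt from the persona, retrieved memories,
    dynamically generated user context, and content the user shared
    (e.g. photo descriptions). Hypothetical sketch, not Portola's code."""
    sections = [base_persona]
    if memories:
        sections.append(
            "Things you remember about this user:\n"
            + "\n".join(f"- {m}" for m in memories)
        )
    if user_context:
        sections.append(f"What you currently know about the user:\n{user_context}")
    if shared_content:
        sections.append(
            "Content the user just shared:\n"
            + "\n".join(f"- {c}" for c in shared_content)
        )
    return "\n\n".join(sections)
```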

Portola built a workflow that enables subject matter experts to identify issues, curate datasets, test solutions, and deploy changes, all without engineering handoffs. Here's how it works:

Lily Doyle, their behavioral researcher, spends about an hour each day reading through chat logs in Braintrust, looking for patterns in conversation quality. "I look for recurring patterns in form and function. That means both how the messages are sent and what the Tolan is actually saying. I also watch for any signs of user frustration," Lily explains.

When Lily identifies a recurring problem through log reviews, user feedback, or focus group sessions, she creates a dataset in Braintrust tagged with the specific issue. Each dataset becomes a collection of real conversation examples that demonstrate a particular problem.

Recent examples of issues she's tracked in Braintrust include:

- `somatic-therapy`: Tolan engaging in unwanted somatic therapy questions like "How does that feel in your body?"
- `or-questions`: Excessive binary choice questions that feel unnatural in conversation
- `gen-z-lingo`: Overuse of trendy slang that doesn't match Tolan's character
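As a rough sketch of how one of these problem-specific datasets might be created with the Braintrust Python SDK (the project name, example payload, and metadata fields below are assumptions, not Portola's actual schema):

```python
import braintrust

# One dataset per behavioral issue, named after its tag.
dataset = braintrust.init_dataset(project="tolan", name="or-questions")

# Each record is a real conversation example that demonstrates the problem.
dataset.insert(
    input={
        "messages": [
            {"role": "user", "content": "I can't decide what to make for dinner."},
            {"role": "assistant", "content": "Would you rather have waffles or pancakes?"},
        ]
    },
    expected=None,  # no single "correct" reply; outputs are reviewed manually
    metadata={"issue": "or-questions", "source": "daily log review"},
)

dataset.flush()  # make sure the records are uploaded before the script exits
```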

Rather than maintaining a single "golden dataset," Portola creates problem-specific datasets in Braintrust ranging from 10 to 200 examples. "It feels useless to come up with a golden dataset," Lily explains. "We're on a different model, we've changed the prompt eight times. Things change so fast."

Braintrust's dataset management enables several technical advantages for Portola:

- **Focused iteration**: Each dataset targets a specific behavioral pattern, making it easier to measure improvements
- **Fresh data**: Datasets reflect current product state rather than becoming stale snapshots
- **Rapid response**: New issues can be addressed immediately without updating comprehensive test suites
- **Context preservation**: Each dataset maintains the full conversation context through Braintrust's trace storage

Once a dataset is curated, Lily moves to playgrounds for side-by-side prompt comparison. She manually reviews outputs from the current prompt versus iterations, assessing conversation quality through her domain expertise.

"A lot of what we're working on is really squishy stuff," Lily explains. "Things like asking the best question to the user for the context they gave you—that's not super eval-able. For things that tend to be squishier, I prefer to do it manually and use my own judgment."

This manual evaluation approach stands in contrast to typical ML workflows, but it's intentional. For Portola, conversation quality is fundamentally subjective and context-dependent. An automated scorer might flag a response as too long, but miss that the length created emotional resonance in that specific context.

Playgrounds serve as Lily's primary workspace, where she:

- Loads datasets directly from curated logs
- Runs comparison tests between prompt versions
- Reviews outputs holistically, considering tone, appropriateness, and emotional intelligence
- Takes notes on specific failures or edge cases
- Iterates with AI to refine prompts
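The Playground itself is a UI, but the same kind of side-by-side check can be scripted against a curated dataset. Below is a minimal sketch that assumes a placeholder `generate()` helper standing in for a real model call; it illustrates the comparison pattern rather than Portola's tooling.

```python
import braintrust

# Two prompt versions to compare over the curated examples.
CURRENT_PROMPT = "You are Tolan, an alien best friend..."  # abbreviated
CANDIDATE_PROMPT = (
    CURRENT_PROMPT
    + "\nAvoid either-or questions unless the user explicitly asks for options."
)


def generate(system_prompt: str, conversation: dict) -> str:
    """Placeholder for a real model call that would prepend `system_prompt`
    to `conversation["messages"]` and return the assistant's reply."""
    return "<model reply goes here>"


dataset = braintrust.init_dataset(project="tolan", name="or-questions")

for example in dataset:
    # Print both replies side by side for manual, holistic review,
    # mirroring the judgment-based comparison done in Playgrounds.
    print("--- input ---")
    print(example["input"])
    print("--- current prompt ---")
    print(generate(CURRENT_PROMPT, example["input"]))
    print("--- candidate prompt ---")
    print(generate(CANDIDATE_PROMPT, example["input"]))
```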

The final piece of Portola's workflow is their prompts-as-code infrastructure, which enables subject matter experts to deploy changes directly to production once they're satisfied with playground results.

"Our science fiction writer can sit down, see something he doesn't like, test against it very quickly, and deploy his change to production," Quinten explains. "That's pretty remarkable."

This end-to-end autonomy transformed Portola's iteration velocity. Nontechnical subject matter experts own the full cycle from problem identification to production deployment.

Before implementing this workflow, prompt changes required coordination between subject matter experts and engineers. Now, domain experts identify issues, create datasets, test solutions, and ship changes, resulting in 4x the number of weekly prompt iterations.

The team has systematically addressed edge cases in:

- Memory system behavior and authentic recall patterns
- Natural conversation flow and question patterns
- Brand voice consistency across different contexts
- Appropriate handling of sensitive topics like mental health

The workflow also enabled Portola to quickly handle model transitions, rapidly identifying and fixing regressions when switching to new models.

Portola's approach demonstrates that building trustworthy AI systems for subjective, emotionally complex domains requires empowering nontechnical domain experts to drive quality improvements. Their workflow offers several lessons:

"Scores are useful, but a lot of what we're working on is really squishy stuff," Lily notes. For conversation quality, emotional intelligence, and brand voice, human judgment from domain experts is just as important as automated metrics.

Instead of maintaining comprehensive test suites that grow stale, create focused datasets for specific issues as they arise. This maintains agility while building institutional knowledge about edge cases and failure modes.

When building AI for emotionally complex domains like companion relationships, mental health support, or therapeutic contexts, manual review from domain experts is essential. The nuances that make these interactions work can't be captured by metrics alone.

Budget time for subject matter experts to spend hours reviewing real usage, and use tools like human review in Braintrust to make that time productive.

Portola's workflow demonstrates that the teams best positioned to improve AI quality might be the nontechnical domain experts who understand the nuances of the user experience you're trying to create. By building infrastructure that empowers a behavioral researcher, science fiction writer, and game designer to identify issues, curate datasets, iterate on prompts, and deploy changes directly to production, Portola achieved a 4x improvement in iteration velocity while systematically improving conversation quality.

If you're building AI for subjective, emotionally complex domains where automated evals can't capture what makes interactions truly work, take a page from Portola's playbook. Invest in observability, build problem-specific datasets, and empower your domain experts to own the quality improvement cycle end-to-end.

Learn more about Tolan and Braintrust.

Want to empower subject matter experts on your team to improve AI quality? Book a demo to learn how Braintrust can support manual review workflows alongside traditional evals.

Thank you to Quinten and Lily for sharing these insights!
