Val Town Blog, October 2
Building and Tuning Your Code Assistant: Lessons from Running LLMs in Production

 

This post shares lessons from running large language models (LLMs) in production, focusing on Val Town’s AI assistant, Townie. It looks at using different LLM providers to optimize cost and performance, for example Sonnet for code generation and GPT-4o for naming. It also covers strategies for dealing with the lack of built-in persistence in LLM responses, including user prompting, message caching, and database storage; reliability challenges such as downtime and inconsistent responses, addressed through better error handling and prompt tuning with Braintrust; and, finally, a UI built on tRPC’s Batch Stream Link to balance request/response with real-time streaming.

💡 **Cost and performance optimization**: Control cost and improve performance by choosing the most suitable LLM provider for each task. For example, use the more expensive Sonnet for precise code generation and the more economical GPT-4o for tasks like naming, meeting requirements while staying cost-effective.

💬 **Handling LLM response persistence**: Since LLMs offer no built-in persistence, the post describes several approaches: guiding the model through user prompting, using message caching to reduce the cost of repeatedly re-parsing history, and storing conversation history in a database for quick retrieval, which improves both user experience and system efficiency.

⚠️ **Coping with LLM unreliability**: The post highlights challenges around LLM availability and response quality. Robust error handling, for example with the neverthrow library, and systematic tuning of the custom prompt with evaluation tools like Braintrust improve system stability and the quality of generated content.

🚀 **UI implementation and technology choices**: For the user interface, Val Town adopted tRPC’s Batch Stream Link, combining traditional POST requests with streaming responses. This preserves tRPC’s type-safety benefits while letting the server structure its data streams with async generators, delivering an efficient and responsive user experience.

Building a code-writing robot and keeping it happy


I recently gave this talk at a lovely event put on by our friends at Jamsocket, where we discussed different experiences running LLMs in production. With Townie, we’ve been dealing with the magic and eccentricities of this new kind of engineering.

Title slide with illustration of robot

For those who couldn’t make it or are interested in following along, here are some slides and speaker notes.

Val Town is mostly a platform for running little bits of JavaScript that we call “vals”, and it makes programming accessible to a lot of new people because vals can be super small and simple and don’t require any configuration. But we’ve been using LLMs to make it accessible to even more people who want to create things with natural language.

Here’s the feature I’m talking about today: Townie. It’s the Val Town bot. If you want to be old-fashioned, you can write the code yourself with your hands and fingers, but Townie will let you write it indirectly with English. It’s similar in broad strokes to Anthropic’s Artifacts or Vercel’s v0, but one of the biggest differences is that Townie’s artifacts have full-stack backends, can be shared and forked, and so on.

We’re running a pretty plain-vanilla LLM setup! The meat and potatoes of Val Town are the challenges of running a lot of user code, at scale, with good security guarantees, and building community and collaboration tools. We aren’t training our own models or running our own GPU clusters.

So for LLMs, we’re riding on the coattails of Anthropic and OpenAI and running none of our own LLM infrastructure.

But despite the simplicity, LLMs are pretty chaotic, and this has taken longer to build than we expected.

LLMs are unlike everything else on the backend: they can easily blow up your server costs, they have plenty of downtime, they’re really hard to tune, and they can be super, super slow. Here’s a quick run-through of how we’ve been dealing with them.

The good models are expensive. LLMs eat money. We’re practicing financial discipline. How do we do it?

The easiest cost and performance alpha is just using different providers for different tasks. For example, when someone creates a val, we want to generate the TypeScript code for that val, which is pretty nitpicky work: it needs to work with Deno’s TypeScript environment and it should reliably know whether it’s running on the frontend or the backend, and so on. For that we use Sonnet, which is pretty expensive.

But then once we have the generated code, we want to generate a name for it, given both the prompt and the response to that prompt: for that we use OpenAI’s GPT-4o, which is much cheaper and good enough.

And to use different models, it’s really nice to use an abstraction library like @vercel/ai, which is what we’ve been using. I know that there are several hundred options, so this isn’t a fully-researched endorsement, but it’s worked fine for us.
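To make that split concrete, here’s a minimal sketch of task-based routing with the Vercel AI SDK; the helper name, system prompts, and model ids are illustrative rather than Townie’s actual code.

```ts
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
import { openai } from "@ai-sdk/openai";

// Hypothetical helper: roughly the shape of the split described above.
export async function generateValCode(prompt: string) {
  // The picky work (Deno-flavored TypeScript) goes to the expensive model.
  const { text: code } = await generateText({
    model: anthropic("claude-3-5-sonnet-20240620"),
    system: "You write TypeScript vals that run on Val Town's Deno runtime.",
    prompt,
  });

  // Naming only needs to be cheap and good enough, so it goes to GPT-4o.
  const { text: name } = await generateText({
    model: openai("gpt-4o"),
    system: "Suggest a short, descriptive name for this val. Reply with the name only.",
    prompt: `Prompt: ${prompt}\n\nGenerated code:\n${code}`,
  });

  return { code, name: name.trim() };
}
```

Because both calls go through the same generateText interface, swapping either model for a cheaper or newer one is a one-line change.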

So the other problem is that this is a chat interface, and none of the LLM providers provide persistence: you have to put the whole chat history into context for every single new message. And the responses that we get from Claude can be pretty long: these are complete programs that we’re getting back, lots of tokens. So we’ve tried a few approaches to this problem.

Our first line of defense against this is just reminding users that Townie gets dumber when you give it more input. This is pretty similar to the UIs in other tools that say the same thing: after the tenth reply or so, Townie will start losing the thread of what you wanted in the first place.

Caching old messages somehow lets Anthropic amortize the cost of parsing messages and turning them into embeddings or whatever internal representation they have. It costs more money to create the cached content initially, but less to use it later.
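For a sense of the mechanics, here’s a sketch of Anthropic’s prompt caching using the raw @anthropic-ai/sdk client (Townie actually goes through @vercel/ai, and the prompt and message plumbing below are hypothetical): the long, stable prefix of the request gets marked as cacheable so later turns pay the cheaper cache-read rate for it.

```ts
import Anthropic from "@anthropic-ai/sdk";

// Hypothetical inputs; the real values come from Townie's database.
const TOWNIE_SYSTEM_PROMPT = "You are Townie, an assistant that writes vals...";
const olderTurns: { role: "user" | "assistant"; text: string }[] = [];
const latestUserMessage = "Now add error handling to that val.";

const client = new Anthropic();

const response = await client.messages.create({
  model: "claude-3-5-sonnet-20240620",
  max_tokens: 4096,
  // The long, rarely-changing system prompt is the best caching candidate.
  system: [
    {
      type: "text",
      text: TOWNIE_SYSTEM_PROMPT,
      cache_control: { type: "ephemeral" },
    },
  ],
  messages: [
    // Mark the last of the older turns as a cache breakpoint so the whole
    // prefix up to it can be reused; the newest message stays uncached.
    ...olderTurns.map((turn, i) => ({
      role: turn.role,
      content: [
        {
          type: "text" as const,
          text: turn.text,
          ...(i === olderTurns.length - 1
            ? { cache_control: { type: "ephemeral" as const } }
            : {}),
        },
      ],
    })),
    { role: "user" as const, content: latestUserMessage },
  ],
});

// usage reports cache_creation_input_tokens (the pricier first write) and
// cache_read_input_tokens (the much cheaper later reads).
console.log(response.usage);
```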

Storing all of this in the database was also a pretty big win: I was able to query it straight from our messages table into an Observable notebook.

This works! You can see some of the effects here, where there are people having super long conversations with Townie, but instead of hitting the input-token statistic, their messages are being read from the cache.

Here’s the same data, zoom-and-enhance style, colored by which thread each message is part of: you can see some spots where it’s just the same exact thread that’s benefiting from the caching.

A funny thing that doesn’t work is trying to truncate old messages. Like, you’re sending code back and forth, right? So if the LLM writes code and then fixes some bug, you have two copies of the code in the context window. So we tried replacing old code replies with a note saying “truncated for brevity,” but the LLM took this very literally. We found that if we truncated code in the conversation history, the LLM decided to truncate code in its responses, too.
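For the record, the truncation experiment amounted to something like this (a simplified, hypothetical version of what we tried, not our exact code):

```ts
// Hypothetical older assistant replies pulled from the conversation history.
const olderAssistantMessages: string[] = [];

// Collapse fenced code blocks in old replies down to a stub. This is the
// transformation that backfired: the model mirrored the note and started
// "truncating for brevity" in its own answers.
const TICK = "\u0060"; // the backtick character
const fencedBlock = new RegExp(`${TICK}{3}[\\s\\S]*?${TICK}{3}`, "g");

function truncateOldCode(message: string): string {
  return message.replace(fencedBlock, "[code truncated for brevity]");
}

const trimmedHistory = olderAssistantMessages.map(truncateOldCode);
```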

Another potential solution that we’ve struggled to implement successfully is diffs. Claude can convincingly provide a good unified diff some percentage of the time, but it is very sloppy: diffs come back with the wrong line numbers or missing context, and they trip up off-the-shelf merge implementations. We’ve implemented our own fuzzy merge solution, but even then, diffs haven’t been reliable enough for regular usage.
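To give a flavor of what “fuzzy” means here, this toy hunk applier, far simpler than our real merge code, ignores the line numbers Claude reports and instead searches for the hunk’s context anywhere in the file, tolerating indentation drift:

```ts
// A hunk as we'd parse it out of a model-generated unified diff:
// the lines we expect to find, and the lines to put in their place.
type Hunk = { before: string[]; after: string[] };

function applyHunk(source: string, hunk: Hunk): string | null {
  const lines = source.split("\n");
  const needle = hunk.before;
  // Slide a window over the file looking for the "before" block, comparing
  // trimmed lines so indentation drift doesn't cause a spurious mismatch.
  for (let i = 0; i + needle.length <= lines.length; i++) {
    const window = lines.slice(i, i + needle.length);
    const matches = window.every((line, j) => line.trim() === needle[j].trim());
    if (matches) {
      return [
        ...lines.slice(0, i),
        ...hunk.after,
        ...lines.slice(i + needle.length),
      ].join("\n");
    }
  }
  return null; // context not found: the diff was too sloppy to apply safely
}
```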

Unfortunately, we’ve also hit reliability issues with both Anthropic and OpenAI: both have hovered around two nines of uptime recently. They’re doing magical things, and it’s understandable that there are some speed bumps along the way! But there’s not too much we can do about the outages ourselves, so we have just decided to really double down on implementing good error handling in our application and have recently adopted neverthrow to improve how we use errors in TypeScript.
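As a minimal sketch of what that looks like, assuming a Vercel AI SDK call and an error shape invented for illustration rather than Townie’s real types, neverthrow turns the LLM call into a Result that callers have to branch on explicitly:

```ts
import { ResultAsync } from "neverthrow";
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

// Hypothetical error type; a real app would distinguish more cases.
type LlmError = { kind: "provider_down" | "unknown"; cause: unknown };

function generateVal(prompt: string): ResultAsync<string, LlmError> {
  return ResultAsync.fromPromise(
    generateText({
      model: anthropic("claude-3-5-sonnet-20240620"),
      prompt,
    }),
    (cause): LlmError => ({ kind: "provider_down", cause }),
  ).map((response) => response.text);
}

// Callers must handle the failure branch explicitly instead of letting a
// provider outage bubble up as an unhandled rejection.
const result = await generateVal("Make an HTTP val that returns 'hello'");
result.match(
  (code) => console.log(code),
  (error) => console.error("LLM call failed:", error.kind),
);
```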

We’ve also spent a lot of time tuning our custom prompt, which felt like shooting in the dark until we started using a system for evals. Braintrust has been great for this.

Here’s our progress with Braintrust over time. We’ve built up a list of things that we want to forbid from our outputs: things like using require() instead of import, using the alert() method, or omitting code from results. Some evals rely on simple string matching, while others grade the results from the “smart” LLM.
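A pared-down eval along those lines might look like this with Braintrust’s TypeScript SDK; the dataset, system prompt, and scorer below are illustrative, not our real ones:

```ts
import { Eval } from "braintrust";
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

// The system prompt under test; in CI this would be the real Townie prompt.
const SYSTEM_PROMPT = "You are Townie. Always use ESM imports, never alert().";

// A simple string-matching scorer: fail any output that uses require(),
// calls alert(), or leaves a "truncated" placeholder instead of real code.
function forbiddenPatterns({ output }: { output: string }) {
  const banned = [/\brequire\(/, /\balert\(/, /truncated for brevity/i];
  const violations = banned.filter((pattern) => pattern.test(output));
  return { name: "forbidden_patterns", score: violations.length === 0 ? 1 : 0 };
}

Eval("townie-system-prompt", {
  data: () => [
    { input: "Make an HTTP val that returns the current time" },
    { input: "Build a val that stores notes in sqlite" },
  ],
  task: async (input: string) => {
    const { text } = await generateText({
      model: anthropic("claude-3-5-sonnet-20240620"),
      system: SYSTEM_PROMPT,
      prompt: input,
    });
    return text;
  },
  scores: [forbiddenPatterns],
});
```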

Braintrust gives us notifications on each Pull Request, letting us know whether a given system prompt change is a regression or improvement in quality.

The final challenge of Townie is simply implementing the UI, which is anything but simple. We’ve found a nice middle ground between an overly traditional request/response flow and a fully real-time, streaming, WebSocket-based infrastructure. We’re using tRPC’s Batch Stream Link, which pairs traditional POST requests with streaming responses. It gives us the nice type-safety of tRPC and lets us structure our responses using Async Generators on the server-side.
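Here’s a rough sketch of that shape, assuming tRPC v11’s async-generator streaming and httpBatchStreamLink; the router, procedure names, and model call are illustrative, not Townie’s actual API:

```ts
// server.ts: a mutation that streams the model's output chunk by chunk.
import { initTRPC } from "@trpc/server";
import { z } from "zod";
import { streamText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

const t = initTRPC.create();

export const appRouter = t.router({
  generateVal: t.procedure
    .input(z.object({ prompt: z.string() }))
    .mutation(async function* ({ input }) {
      const { textStream } = await streamText({
        model: anthropic("claude-3-5-sonnet-20240620"),
        prompt: input.prompt,
      });
      for await (const chunk of textStream) {
        yield chunk; // flushed to the client over the same POST response
      }
    }),
});

export type AppRouter = typeof appRouter;

// client.ts: an ordinary typed mutation whose result is an async iterable.
import { createTRPCClient, httpBatchStreamLink } from "@trpc/client";

const client = createTRPCClient<AppRouter>({
  links: [httpBatchStreamLink({ url: "/api/trpc" })],
});

for await (const chunk of await client.generateVal.mutate({ prompt: "hi" })) {
  console.log(chunk);
}
```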


That’s it! Thanks for following along.

