Agent Design Is Still Hard

智能体设计依然很难

I felt like it might be a good time to write about some new things I've learned. Most of this is going to be about building agents, with a little bit about using agentic coding tools.
我觉得是时候写下我学到的一些新东西了。本文大部分内容将围绕构建智能体(Agents)展开,同时也稍微涉及一些关于使用智能编码工具的体会。
TL;DR: Building agents is still messy. SDK abstractions break once you hit real tool use. Caching works better when you manage it yourself, but differs between models. Reinforcement ends up doing more heavy lifting than expected, and failures need strict isolation to avoid derailing the loop. Shared state via a file-system-like layer is an important building block. Output tooling is surprisingly tricky, and model choice still depends on the task.
太长不看版(TL;DR): 构建智能体依然是一团乱麻。一旦涉及到真实的工具使用,SDK 的抽象层往往会崩塌。自己管理缓存的效果更好,但不同模型之间存在差异。强化(Reinforcement)在循环中所起的作用比预期的更重要,而且必须严格隔离故障,以避免整个循环脱轨。通过类似文件系统的层来共享状态是一个重要的构建模块。输出工具(Output tooling)的处理出人意料地棘手,而模型的选择仍然取决于具体的任务。

Which Agent SDK To Target?

该选择哪个智能体 SDK?

When you build your own agent, you have the choice of targeting an underlying SDK like the OpenAI SDK or the Anthropic SDK, or you can go with a higher level abstraction such as the Vercel AI SDK or Pydantic. The choice we made a while back was to adopt the Vercel AI SDK but only the provider abstractions, and to basically drive the agent loop ourselves. At this point we would not make that choice again. There is absolutely nothing wrong with the Vercel AI SDK, but when you are trying to build an agent, two things happen that we originally didn't anticipate:
构建自己的智能体时,你可以选择针对底层 SDK(如 OpenAI SDK 或 Anthropic SDK)进行开发,也可以选择更高层级的抽象(如 Vercel AI SDK 或 Pydantic)。我们在不久前做出的选择是采用 Vercel AI SDK,但仅使用其提供商抽象层,而智能体循环(Agent loop)基本上由我们自己驱动。如果是现在,我们不会再做这个选择了。Vercel AI SDK 本身完全没有问题,但在构建智能体时,会发生两件我们最初没有预料到的事情:
The first is that the differences between models are significant enough that you will need to build your own agent abstraction. We have not found any of the solutions from these SDKs that build the right abstraction for an agent. I think this is partly because, despite the basic agent design being just a loop, there are subtle differences based on the tools you provide. These differences affect how easy or hard it is to find the right abstraction (cache control, different requirements for reinforcement, tool prompts, provider-side tools, etc.). Because the right abstraction is not yet clear, using the original SDKs from the dedicated platforms keeps you fully in control. With some of these higher-level SDKs you have to build on top of their existing abstractions, which might not be the ones you actually want in the end.
首先,不同模型之间的差异非常显著,以至于你需要构建自己的智能体抽象层。我们发现这些 SDK 中没有任何一个解决方案能为智能体构建出“正确”的抽象。我认为部分原因在于,尽管基本的智能体设计只是一个循环,但基于你所提供的工具,存在着细微的差别。这些差别影响了找到合适抽象层的难易程度(例如缓存控制、对强化的不同需求、工具提示词、提供商侧工具等)。由于目前尚不清楚什么是正确的抽象,直接使用专用平台的原生 SDK 能让你保持完全的控制权。而使用那些高级 SDK 时,你不得不在它们现有的抽象之上进行构建,这最终可能并非你真正想要的。
We also found it incredibly challenging to work with the Vercel SDK when it comes to dealing with provider-side tools. The attempted unification of messaging formats doesn't quite work. For instance, the web search tool from Anthropic routinely destroys the message history with the Vercel SDK, and we haven't yet fully figured out the cause. Also, in Anthropic's case, cache management is much easier when targeting their SDK directly instead of the Vercel one. The error messages when you get things wrong are much clearer.
我们还发现,在使用 Vercel SDK 处理提供商侧工具(Provider-side tools)时,挑战极大。试图统一消息格式的做法效果并不理想。例如,Anthropic 的网络搜索工具在使用 Vercel SDK 时经常会破坏消息历史记录,我们至今还没完全搞清楚原因。此外,就 Anthropic 而言,直接针对其 SDK 进行开发时,缓存管理要比使用 Vercel SDK 容易得多。当你出错时,错误信息也清晰得多。
This might change, but right now we would probably not use an abstraction when building an agent, at least until things have settled down a bit. The benefits do not yet outweigh the costs for us.
这种情况可能会改变,但就目前而言,我们在构建智能体时大概率不会使用抽象层,至少在局势稍微稳定下来之前是这样。对我们来说,收益目前还无法抵消成本。
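To make the trade-off concrete, here is roughly the shape that driving the loop yourself takes. This is a hedged sketch rather than any SDK's actual surface: `callModel` stands in for a thin wrapper over whichever provider SDK you target, and `runTool` for your own tool dispatch; both names are placeholders.

```ts
// Minimal hand-rolled agent loop (illustrative sketch, not a real SDK API).
// The loop, the message history, and the stop condition stay under your control.
type Role = "system" | "user" | "assistant" | "tool";
interface Message { role: Role; content: string }
interface ToolCall { id: string; name: string; args: unknown }
interface ModelTurn { text?: string; toolCalls: ToolCall[] }

async function agentLoop(
  task: string,
  callModel: (history: Message[]) => Promise<ModelTurn>, // wraps the provider SDK
  runTool: (call: ToolCall) => Promise<string>,           // your own tool dispatch
  maxSteps = 50,
): Promise<Message[]> {
  const history: Message[] = [
    { role: "system", content: "You are an agent. Use the tools to complete the task." },
    { role: "user", content: task },
  ];
  for (let step = 0; step < maxSteps; step++) {
    const turn = await callModel(history);
    if (turn.text) history.push({ role: "assistant", content: turn.text });
    if (turn.toolCalls.length === 0) break; // no tool calls requested: the agent is done
    for (const call of turn.toolCalls) {
      const result = await runTool(call);
      history.push({ role: "tool", content: result }); // tool output feeds the next turn
    }
  }
  return history;
}
```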
Someone else might have figured it out. If you're reading this and think I'm wrong, please drop me a mail. I want to learn.
也许其他人已经搞明白了。如果你读到这里并觉得我错了,请给我发邮件。我很想学习一下。

Caching Lessons

关于缓存的教训

The different platforms have very different approaches to caching. A lot has been said about this already, but Anthropic makes you pay for caching. It makes you manage cache points explicitly, and this really changes the way you interact with it from an agent engineering level. I initially found the manual management pretty dumb. Why doesn't the platform do this for me? But I've fully come around and now vastly prefer explicit cache management. It makes costs and cache utilization much more predictable.
不同的平台对缓存有着截然不同的处理方式。关于这一点已经有很多讨论了,但 Anthropic 是让你为缓存付费的。它要求你显式地管理缓存点(Cache points),这确实改变了你从智能体工程层面与其交互的方式。起初我觉得手动管理非常愚蠢。为什么平台不帮我做这个?但我现在完全改观了,并且非常喜欢显式缓存管理。它让成本和缓存利用率变得更加可预测。
Explicit caching allows you to do certain things that are much harder otherwise. For instance, you can split off a conversation and have it run in two different directions simultaneously. You also have the opportunity to do context editing. The optimal strategy here is unclear, but you clearly have a lot more control, and I really like having that control. It also makes it much easier to understand the cost of the underlying agent. You can assume much more about how well your cache will be utilized, whereas with other platforms we found it to be hit and miss.
显式缓存让你能做一些否则会困难得多的事情。例如,你可以分叉一个对话,让它同时朝两个不同的方向运行。你也有机会进行上下文编辑(Context editing)。这里的最佳策略尚不明确,但显而易见的是你拥有了更多的控制权,而我真的很喜欢这种控制权。它也让你更容易理解底层智能体的成本。你可以更有把握地预估缓存的利用情况,而在其他平台上,我们发现这往往要碰运气。
The way we do caching in the agent with Anthropic is pretty straightforward. One cache point is after the system prompt. Two cache points are placed at the beginning of the conversation, where the last one moves up with the tail of the conversation. And then there is some optimization along the way that you can do.
我们在智能体中对 Anthropic 进行缓存的方式非常直接。一个缓存点设在系统提示词(System prompt)之后。两个缓存点放在对话的开头,其中后一个随着对话的尾部向上移动。在此过程中,你还可以做一些优化。
Because the system prompt and the tool selection now have to be mostly static, we feed a dynamic message later to provide information such as the current time. Otherwise, this would trash the cache. We also leverage reinforcement during the loop much more.
由于系统提示词和工具选择现在必须大部分保持静态,我们会在稍后输入一条动态消息来提供诸如当前时间等信息。否则,这会破坏缓存。我们在循环中也更多地利用了强化(Reinforcement)。
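To make that layout concrete, here is a sketch shaped after Anthropic's Messages API (types simplified). The `cache_control: { type: "ephemeral" }` marker is the part that comes from the API; the placement policy, the `<context>` message, and the helper names are our own choices.

```ts
// Explicit cache breakpoints: one after the static system prompt, one near the
// start of the conversation, and one that moves up with the tail on every request.
interface Block { type: "text"; text: string; cache_control?: { type: "ephemeral" } }
interface ApiMessage { role: "user" | "assistant"; content: Block[] }

const STATIC_SYSTEM_PROMPT = "You are an agent. Use the tools to complete the task.";

// Volatile details (current time, environment info) go into a regular message
// near the top of the conversation, not into the system prompt, so the cached
// prefix stays stable.
function initialMessages(task: string, now: Date): ApiMessage[] {
  return [{
    role: "user",
    content: [
      { type: "text", text: `<context>current time: ${now.toISOString()}</context>` },
      { type: "text", text: task },
    ],
  }];
}

function withCacheBreakpoints(history: ApiMessage[]): { system: Block[]; messages: ApiMessage[] } {
  const system: Block[] = [
    { type: "text", text: STATIC_SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
  ];
  // Copy so we do not mutate the stored history.
  const messages = history.map((m) => ({ ...m, content: m.content.map((b) => ({ ...b })) }));
  const mark = (msg: ApiMessage | undefined) => {
    if (!msg) return;
    const last = msg.content[msg.content.length - 1];
    if (last) last.cache_control = { type: "ephemeral" };
  };
  mark(messages[0]);                    // stable breakpoint at the head of the conversation
  mark(messages[messages.length - 1]);  // moving breakpoint at the tail
  return { system, messages };
}
```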

Reinforcement In The Agent Loop

智能体循环中的强化

Every time the agent runs a tool you have the opportunity to not just return data that the tool produces, but also to feed more information back into the loop. For instance, you can remind the agent about the overall objective and the status of individual tasks. You can also provide hints about how the tool call might succeed when a tool fails. Another use of reinforcement is to inform the system about state changes that happened in the background. If you have an agent that uses parallel processing, you can inject information after every tool call when that state changed and when it is relevant for completing the task.
每次智能体运行工具时,你不仅有机会返回工具产生的数据,还可以将更多信息反馈回循环中。例如,你可以提醒智能体当前的总体目标以及各个任务的状态。当工具调用失败时,你还可以提供关于如何成功调用的提示。强化的另一个用途是通知系统后台发生的状态变化。如果你有一个使用并行处理的智能体,当状态发生变化且与完成任务相关时,你可以在每次工具调用后注入这些信息。
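As a sketch of what that can look like in code: after a tool call, the result goes back into the history together with one extra message carrying the objective, the task list, failure hints, and background state changes. All names here are illustrative.

```ts
// Assemble a reinforcement message that is appended right after a tool result.
interface Task { id: string; title: string; done: boolean }

function reinforcement(
  objective: string,
  tasks: Task[],
  backgroundEvents: string[], // e.g. "a parallel export finished and wrote /files/report.pdf"
  toolFailed: boolean,
): string {
  const lines = [
    `Objective: ${objective}`,
    `Tasks: ${tasks.map((t) => `${t.done ? "[x]" : "[ ]"} ${t.title}`).join(", ")}`,
  ];
  if (toolFailed) {
    lines.push("The last tool call failed. Check the arguments and retry, or try a different approach.");
  }
  for (const event of backgroundEvents) lines.push(`Note: ${event}`);
  return lines.join("\n");
}

// In the loop, after running a tool:
//   history.push({ role: "tool", content: result });
//   history.push({ role: "user", content: reinforcement(objective, tasks, events, failed) });
```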
Sometimes it's enough for the agent to self-reinforce. In Claude Code, for instance, the todo write tool is a self-reinforcement tool. All it does is take from the agent a list of tasks that it thinks it should do and echo out what came in. It's basically just an echo tool; it really doesn't do anything else. But that is enough to drive the agent forward better than if the only task and subtask were given at the beginning of the context and too much has happened in the meantime.
有时候,让智能体进行自我强化(Self-reinforce)就足够了。例如在 Claude Code 中,待办事项写入工具就是一个自我强化工具。它所做的仅仅是接收智能体认为应该执行的任务列表,然后把接收到的内容回显出来。它本质上只是一个回显工具;真的不做其他任何事情。但这足以比仅在上下文开头给出任务和子任务更能推动智能体前进,特别是当中间已经发生了很多事情的时候。
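A todo tool in that echo-only spirit is tiny; this is a sketch of the idea rather than Claude Code's actual implementation.

```ts
// Echo-style todo tool: it returns the agent's own plan verbatim, which is
// enough to keep the plan fresh in the recent part of the context.
interface TodoItem { content: string; status: "pending" | "in_progress" | "completed" }

function todoWrite(items: TodoItem[]): string {
  // No storage or side effects are needed for the reinforcement effect.
  return items
    .map((item) => `- [${item.status === "completed" ? "x" : " "}] ${item.content} (${item.status})`)
    .join("\n");
}
```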
We also use reinforcements to inform the system if the environment changed during execution in a way that's problematic for the agent. For instance, if our agent fails and retries from a certain step forward but the recovery operates off broken data, we inject a message informing it that it might want to back off a couple of steps and redo an earlier step.
当执行过程中环境发生了对智能体不利的变化时,我们也会使用强化来通知系统。例如,如果我们的智能体失败后试图从某一步骤开始重试,但恢复操作基于损坏的数据,我们会注入一条消息,告知它可能需要回退几步并重做之前的步骤。

Isolate Failures

隔离故障

If you expect a lot of failures during code execution, there is an opportunity to hide those failures from the context. This can happen in two ways. One is to run tasks that might require iteration individually. You would run them in a subagent until they succeed and only report back the success, plus maybe a brief summary of approaches that did not work. It is helpful for an agent to learn about what did not work in a subtask because it can then feed that information into the next task to hopefully steer away from those failures.
如果你预计在代码执行过程中会出现大量失败,那么这是一个向上下文隐藏这些失败的机会。这可以通过两种方式实现。一种是单独运行可能需要迭代的任务。你可以在一个子智能体(Subagent)中运行它们直到成功,然后只报告成功的结果,或许再加上一个关于哪些方法无效的简短总结。让智能体了解子任务中哪些方法行不通是有帮助的,因为它随后可以将该信息输入到下一个任务中,希望能避开这些失败。
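A sketch of that first option, reusing the `agentLoop` sketch from earlier: the flaky task iterates inside its own history, and only a condensed result plus a short list of failed approaches crosses back into the parent loop. The FAILED marker is an assumed convention for the example, not a real protocol.

```ts
// Run an iteration-heavy task in an isolated subagent and report back only
// the outcome plus a brief note on what did not work.
interface SubTaskResult { ok: boolean; summary: string; failedApproaches: string[] }

async function runIsolated(
  task: string,
  callModel: (history: Message[]) => Promise<ModelTurn>,
  runTool: (call: ToolCall) => Promise<string>,
  maxAttempts = 3,
): Promise<SubTaskResult> {
  const failedApproaches: string[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const hint = failedApproaches.length > 0
      ? `\nApproaches that already failed: ${failedApproaches.join("; ")}`
      : "";
    // The subagent's prompt (not shown) is assumed to make it end its final
    // message with the word FAILED when it gives up on the task.
    const history = await agentLoop(task + hint, callModel, runTool);
    const last = history[history.length - 1]?.content ?? "";
    if (!last.includes("FAILED")) {
      return { ok: true, summary: last, failedApproaches }; // only this crosses back
    }
    failedApproaches.push(last.slice(0, 200)); // keep a brief note, not the full trace
  }
  return { ok: false, summary: "gave up after retries", failedApproaches };
}
```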
The second option doesn't exist in all agents or foundation models, but with Anthropic you can do context editing. So far we haven't had a lot of success with context editing, but we believe it's an interesting thing we would love to explore more. We would also love to learn if people have success with it. What is interesting about context editing is that you should be able to preserve tokens for further down the iteration loop. You can take out of the context certain failures that didn't drive towards successful completion of the loop, but only negatively affected certain attempts during execution. But as with the point I made earlier: it is also useful for the agent to understand what didn't work, but maybe it doesn't require the full state and full output of all the failures.
第二种选择并非在所有智能体或基础模型中都存在,但在 Anthropic 中,你可以进行上下文编辑。到目前为止,我们在上下文编辑方面还没有取得太大的成功,但我们认为这是一件有趣的事情,值得进一步探索。我们也希望能了解到是否有人在这方面取得了成功。上下文编辑的有趣之处在于,你应该能够为迭代循环的后续部分保留 Token。你可以从上下文中剔除某些未能推动循环成功完成的失败尝试,这些尝试只是在执行过程中产生了负面影响。但正如我之前指出的:让智能体理解什么行不通也是有用的,但也许并不需要所有失败的完整状态和完整输出。
Unfortunately, context editing will automatically invalidate caches. There is really no way around it. So it can be unclear when the trade-off of doing that compensates for the extra cost of trashing the cache.
不幸的是,上下文编辑会自动使缓存失效。这确实无法避免。因此,这样做所带来的权衡是否能弥补破坏缓存的额外成本,目前尚不清楚。
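If you do the surgery yourself rather than relying on provider-side context editing, one simple variant is to collapse the verbose output of failed attempts into a one-line note so the lesson stays but the tokens go. A sketch using the `Message` type from earlier; as noted above, any cached prefix past the first edited message is lost.

```ts
// Replace the detailed output of failed tool calls with a short marker.
function pruneFailedAttempts(
  history: Message[],
  isFailure: (m: Message) => boolean, // your own heuristic for "this was a dead-end attempt"
): Message[] {
  return history.map((m) =>
    m.role === "tool" && isFailure(m)
      ? { ...m, content: "[tool call failed; detailed output removed]" }
      : m,
  );
}
```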

Sub Agents / Sub Inference

子智能体 / 子推理

As I mentioned a couple of times on this blog already, most of our agents are based on code execution and code generation. That really requires a common place for the agent to store data. Our choice is a file system—in our case a virtual file system—but that requires different tools to access it. This is particularly important if you have something like a subagent or subinference.
正如我在这个博客上多次提到的那样,我们的大多数智能体都是基于代码执行和代码生成的。这确实需要一个让智能体存储数据的公共场所。我们的选择是文件系统——在我们的案例中是一个虚拟文件系统——但这需要不同的工具来访问它。如果你有像子智能体或子推理(Subinference)之类的东西,这一点尤为重要。
You should try to build an agent that doesn't have dead ends. A dead end is where a task can only continue executing within the sub-tool that you built. For instance, you might build a tool that generates an image, but is only able to feed that image back into one more tool. That's a problem because you might then want to put those images into a zip archive using the code execution tool. So there needs to be a system that allows the image generation tool to write the image to the same place where the code execution tool can read it. In essence, that's a file system.
你应该尝试构建一个没有死胡同的智能体。死胡同是指任务只能在你构建的子工具内部继续执行。例如,你可能构建了一个生成图像的工具,但它只能将该图像反馈给另外一个工具。这很有问题,因为你可能随后想使用代码执行工具将这些图像放入 zip 压缩包中。因此,需要一个系统允许图像生成工具将图像写入代码执行工具可以读取的同一个地方。本质上,这就是一个文件系统。
Obviously it has to go the other way around too. You might want to use the code execution tool to unpack a zip archive and then go back to inference to describe all the images so that the next step can go back to code execution and so forth. The file system is the mechanism that we use for that. But it does require tools to be built in a way that they can take file paths to the virtual file system to work with.
显然,反过来也必须行得通。你可能想使用代码执行工具解压 zip 归档文件,然后回到推理阶段描述所有图像,以便下一步可以再次回到代码执行,依此类推。文件系统是我们用于此目的的机制。但这确实要求工具在构建时能够接受并处理虚拟文件系统的文件路径。
So basically an ExecuteCode tool would have access to the same file system as the RunInference tool which could take a path to a file on that same virtual file system.
所以基本上,ExecuteCode(执行代码)工具将有权访问与 RunInference(运行推理)工具相同的文件系统,后者可以接受该虚拟文件系统上的文件路径。
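A minimal sketch of that shared layer: every tool speaks in terms of paths on the same virtual file system, so output from one step is never trapped inside the tool that produced it. The interface, the in-memory implementation, and the tool signatures are illustrative, not a specific product's API.

```ts
// Shared virtual file system so no tool is a dead end.
interface VirtualFs {
  read(path: string): Promise<Uint8Array>;
  write(path: string, data: Uint8Array): Promise<void>;
  list(prefix: string): Promise<string[]>;
}

// Trivial in-memory implementation, enough for tests.
class MemoryFs implements VirtualFs {
  private files = new Map<string, Uint8Array>();
  async read(path: string): Promise<Uint8Array> {
    const data = this.files.get(path);
    if (!data) throw new Error(`not found: ${path}`);
    return data;
  }
  async write(path: string, data: Uint8Array): Promise<void> {
    this.files.set(path, data);
  }
  async list(prefix: string): Promise<string[]> {
    return Array.from(this.files.keys()).filter((p) => p.startsWith(prefix));
  }
}

// Both tools take and return VFS paths, so a flow can bounce between code
// execution and inference: unzip -> describe images -> zip the result.
type ExecuteCode = (fs: VirtualFs, source: string) => Promise<{ stdout: string; wrotePaths: string[] }>;
type RunInference = (fs: VirtualFs, prompt: string, inputPaths: string[]) => Promise<{ outputPath: string }>;

async function describeArchive(fs: VirtualFs, executeCode: ExecuteCode, runInference: RunInference, archive: string) {
  await executeCode(fs, `unzip(${JSON.stringify(archive)}, "/work/images")`);
  const images = await fs.list("/work/images/");
  const notes = await runInference(fs, "Describe each image in one paragraph.", images);
  return executeCode(fs, `zip("/work/result.zip", [${JSON.stringify(notes.outputPath)}])`);
}
```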

The Use Of An Output Tool

输出工具的使用

One interesting thing about how we structured our agent is that it does not represent a chat session. It will eventually communicate something to the user or the outside world, but all the messages that it sends in between are usually not revealed. The question is: how does it create that message? We have one tool which is the output tool. The agent uses it explicitly to communicate to the human. We then use a prompt to instruct it when to use that tool. In our case the output tool sends an email.
关于我们如何构建智能体,一件有趣的事情是它并不代表一个聊天会话。它最终会与用户或外部世界交流某些内容,但在其间发送的所有消息通常不会被展示出来。问题是:它是如何创建那条消息的?我们有一个工具,即输出工具(Output tool)。智能体显式地使用它来与人类沟通。然后我们使用提示词来指示它何时使用该工具。在我们的案例中,输出工具用于发送电子邮件。
But that turns out to pose a few other challenges. One is that it's surprisingly hard to steer the wording and tone of that output tool compared to just using the main agent loop's text output as the mechanism to talk to the user. I cannot say why this is, but I think it's probably related to how these models are trained.
但这实际上带来了一些其他挑战。其中之一是,与仅仅使用主智能体循环的文本输出作为与用户交谈的机制相比,引导输出工具的措辞和语气出奇地困难。我说不出原因,但我认为这可能与这些模型的训练方式有关。
One attempt that didn't work well was to have the output tool run another quick LLM like Gemini 2.5 Flash to adjust the tone to our preference. But this increases latency and actually reduces the quality of the output. In part, I think the model just doesn't word things correctly and the subtool doesn't have sufficient context. Providing more slices of the main agentic context into the subtool makes it expensive and also didn't fully solve the problem. It also sometimes reveals information in the final output that we didn't want to be there, like the steps that led to the end result.
一个效果不佳的尝试是让输出工具运行另一个快速的 LLM(如 Gemini 2.5 Flash)来将语气调整为我们的偏好。但这增加了延迟,实际上还降低了输出质量。部分原因在于,我认为该模型只是没能正确地措辞,而且子工具缺乏足够的上下文。将主智能体上下文的更多切片提供给子工具会让成本变得昂贵,而且也没有完全解决问题。它有时还会在最终输出中泄露我们不希望出现的信息,例如导致最终结果的那些步骤。
Another problem with an output tool is that sometimes it just doesn't call the tool. One of the ways in which we're forcing this is we remember if the output tool was called. If the loop ends without the output tool, we inject a reinforcement message to encourage it to use the output tool.
输出工具的另一个问题是,有时它根本不调用该工具。我们强制解决这个问题的方法之一是记录输出工具是否已被调用。如果循环结束时没有调用输出工具,我们会注入一条强化消息,鼓励它使用输出工具。
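A sketch of that guard, reusing the types from the loop sketch above: remember whether the output tool ran, and if the loop wants to stop without it, inject one reinforcement message and let it continue. The `send_email` tool name and the wording are placeholders.

```ts
// Agent loop that refuses to finish silently: if it ends without the output
// tool having been called, it gets nudged back into the loop once or twice.
async function loopWithOutputGuard(
  task: string,
  callModel: (history: Message[]) => Promise<ModelTurn>,
  runTool: (call: ToolCall) => Promise<string>,
  maxSteps = 50,
): Promise<Message[]> {
  const history: Message[] = [{ role: "user", content: task }];
  let outputToolCalled = false;
  let nudges = 0;

  for (let step = 0; step < maxSteps; step++) {
    const turn = await callModel(history);
    if (turn.text) history.push({ role: "assistant", content: turn.text });

    if (turn.toolCalls.length === 0) {
      if (!outputToolCalled && nudges < 2) {
        nudges++;
        history.push({
          role: "user",
          content: "You have not reported the result yet. Use the send_email output tool to inform the user.",
        });
        continue; // give the agent another turn instead of ending the loop
      }
      break;
    }

    for (const call of turn.toolCalls) {
      if (call.name === "send_email") outputToolCalled = true; // our output tool
      history.push({ role: "tool", content: await runTool(call) });
    }
  }
  return history;
}
```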

Model Choice

模型选择

Overall our choices for models haven't dramatically changed so far. I think Haiku and Sonnet are still the best tool callers available, so they make for excellent choices in the agent loop. They are also somewhat transparent with regards to what the RL looks like. The other obvious choices are the Gemini models. We so far haven't found a ton of success with the GPT family of models for the main loop.
总的来说,我们对模型的选择至今没有发生巨大的变化。我认为 Haiku 和 Sonnet 仍然是目前最好的工具调用者(Tool callers),因此它们是智能体循环的绝佳选择。在 RL(强化学习)大致是什么样子这一点上,它们也相对透明。其他明显的选择是 Gemini 模型。到目前为止,我们还没发现在主循环中使用 GPT 系列模型有多大的成功。
For the individual sub-tools, which in part might also require inference, our current choice is Gemini 2.5 if you need to summarize large documents or work with PDFs and things like that. That is also a pretty good model for extracting information from images, in particular because the Sonnet family of models likes to run into a safety filter which can be annoying.
对于个别子工具(部分可能也需要推理),如果你需要总结大型文档或处理 PDF 之类的内容,我们目前的选择是 Gemini 2.5。它也是从图像中提取信息的一个相当不错的模型,特别是考虑到 Sonnet 系列模型总是喜欢触发安全过滤器,这可能会很烦人。
There's also probably the very obvious realization that token cost alone doesn't really define how expensive an agent is. A better tool caller will do the job in fewer tokens. There are models available today that are cheaper than Sonnet, but they are not necessarily cheaper in a loop.
还有一个非常明显的认知:单凭 Token 成本并不能真正定义一个智能体的昂贵程度。一个更好的工具调用者会用更少的 Token 完成工作。如今有一些比 Sonnet 更便宜的模型,但在循环中使用时,它们未必更省钱。
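A toy calculation with made-up numbers illustrates the point: a lower per-token price does not help if a weaker tool caller needs far more loop iterations (and tokens) to finish the same task.

```ts
// Hypothetical numbers only: per-token price vs. what a whole task costs.
interface LoopProfile { pricePerMTokIn: number; tokensPerStep: number; steps: number }

const costPerTask = (p: LoopProfile) =>
  (p.tokensPerStep * p.steps * p.pricePerMTokIn) / 1_000_000; // input tokens only, caching ignored

const strongToolCaller: LoopProfile = { pricePerMTokIn: 3.0, tokensPerStep: 8_000, steps: 12 };
const cheaperButSloppy: LoopProfile = { pricePerMTokIn: 1.0, tokensPerStep: 10_000, steps: 45 };

console.log(costPerTask(strongToolCaller)); // ≈ $0.29 for the task
console.log(costPerTask(cheaperButSloppy)); // ≈ $0.45 — cheaper tokens, pricier loop
```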
But all things considered, not that much has changed in the last couple of weeks.
但综合考虑,过去几周并没有发生太大的变化。

Testing and Evals

测试与评估

We find testing and evals to be the hardest problem here. This is not entirely surprising, but the agentic nature makes it even harder. Unlike prompts, you cannot just do the evals in some external system because there's too much you need to feed into it. This means you want to do evals based on observability data or instrumenting your actual test runs. So far none of the solutions we have tried have convinced us that they found the right approach here. Unfortunately, I have to report that at the moment we haven't found something that really makes us happy. I hope we're going to find a solution for this because it is becoming an increasingly frustrating aspect of building an agent.
我们发现测试和评估(Evals)是这里最难的问题。这并不完全令人惊讶,但智能体的特性使其变得更加困难。与提示词(Prompts)不同,你不能仅仅在某个外部系统中进行评估,因为你需要输入的东西太多了。这意味着你需要基于可观测性数据或对实际测试运行进行插桩来做评估。到目前为止,我们尝试过的解决方案中,没有一个能让我们确信它们找到了正确的方法。不幸的是,我不得不报告,目前我们还没找到真正让我们满意的东西。我希望我们能找到解决方案,因为这正成为构建智能体过程中一个越来越令人沮丧的方面。

Coding Agent Updates

编码智能体更新

As for my experience with coding agents, not really all that much has changed. The main new development is that I'm trialing Amp more. In case you're curious why: it's not that it's objectively a better agent than what I'm using, but I really quite like the way they're thinking about agents from what they're posting. The interactions of the different sub-agents like the Oracle with the main loop are beautifully done, and not many other harnesses do this today. It's also a good way for me to validate how different agent designs work. Amp, similar to Claude Code, really feels like a product built by people who also use their own tool. I do not feel every other agent in the industry does this.
至于我在编码智能体方面的经验,其实并没有太大变化。主要的新进展是我在更多地试用 Amp。如果你好奇原因:并非它客观上比我正在使用的智能体更好,而是从他们发布的内容来看,我真的很喜欢他们对智能体的思考方式。不同的子智能体(如 Oracle)与主循环的交互做得非常漂亮,如今没有多少其他的框架(Harnesses)能做到这一点。这也是我验证不同智能体设计如何工作的好方法。Amp 和 Claude Code 类似,感觉真的是由那些自己也在使用该工具的人构建的产品。我觉得业内并非每个智能体都能给我这种感觉。

Quick Stuff I Read And Found

我读到和发现的一些简讯

That's just a random assortment of things that I feel might also be worth sharing:
这只是一些随意的汇集,我觉得可能也值得分享:
  • What if you don't need MCP at all?: Mario argues that many MCP servers are overengineered and include large toolsets that consume lots of context. He proposes a minimalist approach for browser-agent use-cases by relying on simple CLI tools (e.g., start, navigate, evaluate JS, screenshot) executed via Bash, which keeps token usage small and workflows flexible. I built a Claude/Amp Skill out of it.
    • 如果根本不需要 MCP 会怎样?: Mario 认为许多 MCP 服务器过度设计,包含消耗大量上下文的大型工具集。他针对浏览器智能体用例提出了一种极简方法,即依赖通过 Bash 执行的简单 CLI 工具(如 start, navigate, evaluate JS, screenshot),这使得 Token 使用量很小且工作流灵活。我基于此构建了一个 Claude/Amp 技能。
  • The fate of “small” open source: The author argues that the age of tiny, single-purpose open-source libraries is coming to an end, largely because built-in platform APIs and AI tools can now generate simple utilities on demand. Thank fucking god.
    • “小型”开源项目的命运: 作者认为,微型、单一用途的开源库时代即将结束,这主要是因为内置的平台 API 和 AI 工具现在可以按需生成简单的实用程序。谢天谢地。
  • Tmux is love. There is no article that goes with it, but the TLDR is that Tmux is great. If you have anything that remotely looks like an interactive system that an agent should work with, you should give it some Tmux skills.
    • Tmux 就是爱。 这里没有附带的文章,但简而言之(TLDR)就是 Tmux 很棒。如果你有任何看起来哪怕有点像需要智能体与之交互的系统,就应该给它一些 Tmux 技能。
  • LLM APIs are a Synchronization Problem. This was a separate realization that was too long for this post, so I wrote a separate one.
    • LLM API 是一个同步问题。 这是一个单独的感悟,对于这篇文章来说太长了,所以我单独写了一篇。