Published Sep 29, 2025
Gemini Translate from
Context is a critical but finite resource for AI agents. In this post, we explore strategies for effectively curating and managing the context that powers them.
For the past few years, prompt engineering has been the focus of applied AI, and a new term is now emerging: context engineering. Building with language models is shifting from finding the right words and phrases for prompts to answering the broader question: what configuration of context is most likely to generate our model's desired behavior?
"Context" refers to the full set of tokens included when sampling from a large language model (LLM). The "engineering" problem is optimizing the utility of those tokens against the inherent constraints of LLMs in order to consistently achieve a desired outcome. Effectively wrangling LLMs often requires thinking in context—in other words, considering the holistic state available to the LLM at any given time and what potential behaviors that state might yield.
In this post, we explore the emerging art of context engineering and offer a refined mental model for building steerable, effective agents.
Context engineering vs. prompt engineering
At Anthropic, we view context engineering as the natural progression of prompt engineering. Prompt engineering refers to methods for writing and organizing LLM instructions for optimal outcomes (see our docs for an overview and useful prompt engineering strategies). Context engineering refers to the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference, including all the other information that may land there outside of the prompts.
In the early days of engineering with LLMs, prompting was the biggest component of AI engineering work, as the majority of use cases outside of everyday chat interactions required prompts optimized for one-shot classification or text generation tasks. As the term implies, the primary focus of prompt engineering is how to write effective prompts, particularly system prompts. However, as we move towards engineering more capable agents that operate over multiple turns of inference and longer time horizons, we need strategies for managing the entire context state (system instructions, tools, Model Context Protocol (MCP), external data, message history, etc).
An agent running in a loop generates more and more data that could be relevant for the next turn of inference, and this information must be cyclically refined. Context engineering is the art and science of curating what will go into the limited context window from that constantly evolving universe of possible information.
In contrast to the discrete task of writing a prompt, context engineering is iterative and the curation phase happens each time we decide what to pass to the model.
Why context engineering matters for building capable agents
Despite their speed and ability to manage larger and larger volumes of data, we’ve observed that LLMs, like humans, lose focus or experience confusion at a certain point. Studies on needle-in-a-haystack style benchmarking have uncovered the concept of context rot: as the number of tokens in the context window increases, the model’s ability to accurately recall information from that context decreases.
While some models exhibit more gentle degradation than others, this characteristic emerges across all models. Context, therefore, must be treated as a finite resource with diminishing marginal returns. Like humans, who have limited working memory capacity, LLMs have an “attention budget” that they draw on when parsing large volumes of context. Every new token introduced depletes this budget by some amount, increasing the need to carefully curate the tokens available to the LLM.
This attention scarcity stems from architectural constraints of LLMs. LLMs are based on the transformer architecture, which enables every token to attend to every other token across the entire context. This results in n² pairwise relationships for n tokens.
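The quadratic scaling above is easy to check concretely; a trivial sketch:

```python
def attention_pairs(n_tokens: int) -> int:
    # Every token can attend to every other token across the context,
    # so the number of pairwise relationships grows as n^2.
    return n_tokens * n_tokens

# Scaling the context 100x (1K -> 100K tokens) multiplies the number of
# pairwise relationships by 10,000x, not 100x.
```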
As its context length increases, a model's ability to capture these pairwise relationships gets stretched thin, creating a natural tension between context size and attention focus. Additionally, models develop their attention patterns from training data distributions where shorter sequences are typically more common than longer ones. This means models have less experience with, and fewer specialized parameters for, context-wide dependencies.
Techniques like position encoding interpolation allow models to handle longer sequences by adapting them to the originally trained smaller context, though with some degradation in token position understanding. These factors create a performance gradient rather than a hard cliff: models remain highly capable at longer contexts but may show reduced precision for information retrieval and long-range reasoning compared to their performance on shorter contexts.
These realities mean that thoughtful context engineering is essential for building capable agents.
The anatomy of effective context
Given that LLMs are constrained by a finite attention budget, good context engineering means finding the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome. Implementing this practice is much easier said than done, but in the following section, we outline what this guiding principle means in practice across the different components of context.
System prompts should be extremely clear and use simple, direct language that presents ideas at the right altitude for the agent. The right altitude is the Goldilocks zone between two common failure modes. At one extreme, we see engineers hardcoding complex, brittle logic in their prompts to elicit exact agentic behavior. This approach creates fragility and increases maintenance complexity over time. At the other extreme, engineers sometimes provide vague, high-level guidance that fails to give the LLM concrete signals for desired outputs or falsely assumes shared context. The optimal altitude strikes a balance: specific enough to guide behavior effectively, yet flexible enough to provide the model with strong heuristics to guide behavior.
At one end of the spectrum, we see brittle if-else hardcoded prompts, and at the other end we see prompts that are overly general or falsely assume shared context.
We recommend organizing prompts into distinct sections (like <background_information>, <instructions>, ## Tool guidance, ## Output description, etc) and using techniques like XML tagging or Markdown headers to delineate these sections, although the exact formatting of prompts is likely becoming less important as models become more capable.
Regardless of how you decide to structure your system prompt, you should be striving for the minimal set of information that fully outlines your expected behavior. (Note that minimal does not necessarily mean short; you still need to give the agent sufficient information up front to ensure it adheres to the desired behavior.) It’s best to start by testing a minimal prompt with the best model available to see how it performs on your task, and then add clear instructions and examples to improve performance based on failure modes found during initial testing.
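As a sketch, a sectioned system prompt for a hypothetical support agent might look like the following; the agent, tools, and policies named here are purely illustrative, and only the sectioning technique is the point:

```python
# Sketch of a sectioned system prompt, mixing the XML-tag and Markdown-header
# styles mentioned above. All names and policies are hypothetical.
SYSTEM_PROMPT = """\
<background_information>
You are a support agent for Acme's internal ticketing system.
</background_information>

<instructions>
Resolve the user's ticket. Prefer existing knowledge-base articles,
and escalate only when no article applies.
</instructions>

## Tool guidance
Call search_kb before answering. Call escalate at most once per ticket.

## Output description
Reply with a short resolution summary followed by the article ID used.
"""
```

Each section carries one kind of signal, which makes the prompt easy to audit and to trim back toward the minimal set of information.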
Tools allow agents to operate with their environment and pull in new, additional context as they work. Because tools define the contract between agents and their information/action space, it’s extremely important that tools promote efficiency, both by returning information that is token efficient and by encouraging efficient agent behaviors.
In Writing tools for AI agents – with AI agents, we discussed building tools that are well understood by LLMs and have minimal overlap in functionality. Similar to the functions of a well-designed codebase, tools should be self-contained, robust to error, and extremely clear with respect to their intended use. Input parameters should similarly be descriptive, unambiguous, and play to the inherent strengths of the model.
One of the most common failure modes we see is bloated tool sets that cover too much functionality or lead to ambiguous decision points about which tool to use. If a human engineer can’t definitively say which tool should be used in a given situation, an AI agent can’t be expected to do better. As we’ll discuss later, curating a minimal viable set of tools for the agent can also lead to more reliable maintenance and pruning of context over long interactions.
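To make this concrete, here is a sketch of a single self-contained tool definition. The schema shape follows common function-calling conventions; the tool name, fields, and return contract are hypothetical:

```python
# Sketch of a token-efficient, unambiguous tool definition. The description
# states exactly when to use the tool and what comes back, so there is no
# overlap or ambiguity with sibling tools.
search_tickets = {
    "name": "search_tickets",
    "description": (
        "Search support tickets by keyword. Returns at most `limit` results, "
        "each as a one-line summary with its ticket ID (token-efficient: "
        "full ticket bodies are never returned)."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Keywords to match against ticket titles.",
            },
            "limit": {
                "type": "integer",
                "description": "Maximum number of results to return.",
                "default": 5,
            },
        },
        "required": ["query"],
    },
}
```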
Providing examples, otherwise known as few-shot prompting, is a well known best practice that we continue to strongly advise. However, teams will often stuff a laundry list of edge cases into a prompt in an attempt to articulate every possible rule the LLM should follow for a particular task. We do not recommend this. Instead, we recommend working to curate a set of diverse, canonical examples that effectively portray the expected behavior of the agent. For an LLM, examples are the “pictures” worth a thousand words.
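A minimal sketch of this practice, using a hypothetical refund-triage task: a few diverse, canonical examples assembled into the prompt, rather than an exhaustive list of edge-case rules:

```python
# A small, curated set of canonical examples (contents hypothetical).
# Each one portrays a distinct expected behavior: approve, deny, clarify.
FEW_SHOT_EXAMPLES = [
    {"input": "Refund for order #1001, unopened item",
     "output": "APPROVE: within the 30-day return window"},
    {"input": "Refund for order #0847, activated software license",
     "output": "DENY: license already activated"},
    {"input": "Refund request with no order number",
     "output": "ASK: request the order number first"},
]

def build_prompt(task: str) -> str:
    """Prepend the canonical examples to the new task, few-shot style."""
    shots = "\n\n".join(
        f"Input: {e['input']}\nOutput: {e['output']}" for e in FEW_SHOT_EXAMPLES
    )
    return f"{shots}\n\nInput: {task}\nOutput:"
```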
Our overall guidance across the different components of context (system prompts, tools, examples, message history, etc) is to be thoughtful and keep your context informative, yet tight. Now let's dive into dynamically retrieving context at runtime.
Context retrieval and agentic search
In Building effective AI agents, we highlighted the differences between LLM-based workflows and agents. Since we wrote that post, we’ve gravitated towards a simple definition for agents: LLMs autonomously using tools in a loop.
Working alongside our customers, we’ve seen the field converging on this simple paradigm. As the underlying models become more capable, the level of autonomy of agents can scale: smarter models allow agents to independently navigate nuanced problem spaces and recover from errors.
We’re now seeing a shift in how engineers think about designing context for agents. Today, many AI-native applications employ some form of embedding-based pre-inference time retrieval to surface important context for the agent to reason over. As the field transitions to more agentic approaches, we increasingly see teams augmenting these retrieval systems with “just in time” context strategies.
Rather than pre-processing all relevant data up front, agents built with the “just in time” approach maintain lightweight identifiers (file paths, stored queries, web links, etc.) and use these references to dynamically load data into context at runtime using tools. Anthropic’s agentic coding solution Claude Code uses this approach to perform complex data analysis over large databases. The model can write targeted queries, store results, and leverage Bash commands like head and tail to analyze large volumes of data without ever loading the full data objects into context.
This approach mirrors human cognition: we generally don’t memorize entire corpuses of information, but rather introduce external organization and indexing systems like file systems, inboxes, and bookmarks to retrieve relevant information on demand.
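A minimal Python sketch of the pattern (the helper and its defaults are our own, not Claude Code internals): keep the file path as the lightweight identifier and pull only a head/tail window into context on demand:

```python
import pathlib

def peek(path, head=5, tail=5):
    """Inspect a large file via its first and last lines, mirroring the
    Bash `head`/`tail` pattern, instead of loading the full data object.
    The path itself remains the lightweight identifier held in context."""
    lines = pathlib.Path(path).read_text().splitlines()
    if len(lines) <= head + tail:
        return lines
    omitted = len(lines) - head - tail
    return lines[:head] + [f"... {omitted} lines omitted ..."] + lines[-tail:]
```

The agent keeps only the path and this small window in its context; the full file stays on disk until a targeted read is actually needed.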
Beyond storage efficiency, the metadata of these references provides a mechanism to efficiently refine behavior, whether explicitly provided or intuitive. To an agent operating in a file system, the presence of a file named test_utils.py in a tests folder implies a different purpose than a file with the same name located in src/core_logic/. Folder hierarchies, naming conventions, and timestamps all provide important signals that help both humans and agents understand how and when to utilize information.
Letting agents navigate and retrieve data autonomously also enables progressive disclosure—in other words, it allows agents to incrementally discover relevant context through exploration. Each interaction yields context that informs the next decision: file sizes suggest complexity; naming conventions hint at purpose; timestamps can be a proxy for relevance. Agents can assemble understanding layer by layer, maintaining only what's necessary in working memory and leveraging note-taking strategies for additional persistence. This self-managed context window keeps the agent focused on relevant subsets rather than drowning in exhaustive but potentially irrelevant information.
Of course, there's a trade-off: runtime exploration is slower than retrieving pre-computed data. Not only that, but opinionated and thoughtful engineering is required to ensure that an LLM has the right tools and heuristics for effectively navigating its information landscape. Without proper guidance, an agent can waste context by misusing tools, chasing dead-ends, or failing to identify key information.
In certain settings, the most effective agents might employ a hybrid strategy, retrieving some data up front for speed, and pursuing further autonomous exploration at their discretion. The decision boundary for the ‘right’ level of autonomy depends on the task. Claude Code is an agent that employs this hybrid model: CLAUDE.md files are naively dropped into context up front, while primitives like glob and grep allow it to navigate its environment and retrieve files just-in-time, effectively bypassing the issues of stale indexing and complex syntax trees.
The hybrid strategy might be better suited for contexts with less dynamic content, such as legal or finance work. As model capabilities improve, agentic design will trend towards letting intelligent models act intelligently, with progressively less human curation. Given the rapid pace of progress in the field, "do the simplest thing that works" will likely remain our best advice for teams building agents on top of Claude.
Context engineering for long-horizon tasks
Long-horizon tasks require agents to maintain coherence, context, and goal-directed behavior over sequences of actions where the token count exceeds the LLM’s context window. For tasks that span tens of minutes to multiple hours of continuous work, like large codebase migrations or comprehensive research projects, agents require specialized techniques to work around the context window size limitation.
Waiting for larger context windows might seem like an obvious tactic. But it's likely that for the foreseeable future, context windows of all sizes will be subject to context pollution and information relevance concerns—at least for situations where the strongest agent performance is desired. To enable agents to work effectively across extended time horizons, we've developed a few techniques that address these context pollution constraints directly: compaction, structured note-taking, and multi-agent architectures.
Compaction
Compaction is the practice of taking a conversation nearing the context window limit, summarizing its contents, and reinitiating a new context window with the summary. Compaction typically serves as the first lever in context engineering to drive better long-term coherence. At its core, compaction distills the contents of a context window in a high-fidelity manner, enabling the agent to continue with minimal performance degradation.
In Claude Code, for example, we implement this by passing the message history to the model to summarize and compress the most critical details. The model preserves architectural decisions, unresolved bugs, and implementation details while discarding redundant tool outputs or messages. The agent can then continue with this compressed context plus the five most recently accessed files. Users get continuity without worrying about context window limitations.
The art of compaction lies in the selection of what to keep versus what to discard, as overly aggressive compaction can result in the loss of subtle but critical context whose importance only becomes apparent later. For engineers implementing compaction systems, we recommend carefully tuning your prompt on complex agent traces. Start by maximizing recall to ensure your compaction prompt captures every relevant piece of information from the trace, then iterate to improve precision by eliminating superfluous content.
An example of low-hanging superfluous content is clearing tool calls and results – once a tool has been called deep in the message history, why would the agent need to see the raw result again? One of the safest lightest touch forms of compaction is tool result clearing, most recently launched as a feature on the Claude Developer Platform.
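A minimal sketch of tool result clearing (the message shape here is hypothetical, not the Claude API's wire format): blank out all but the most recent tool results, leaving the rest of the history intact:

```python
def clear_tool_results(messages, keep_last=3):
    """Lightweight compaction sketch: replace older tool results with a
    placeholder while preserving the surrounding conversation. The
    {"role": ..., "content": ...} shape is illustrative only."""
    tool_idxs = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    for i in tool_idxs[:-keep_last]:
        messages[i] = {"role": "tool", "content": "[tool result cleared]"}
    return messages
```

Because only stale raw outputs are dropped, this is among the lowest-risk forms of compaction: the decisions and reasoning built on those outputs remain in the history.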
Structured note-taking
Structured note-taking, or agentic memory, is a technique where the agent regularly writes notes persisted to memory outside of the context window. These notes get pulled back into the context window at later times.
This strategy provides persistent memory with minimal overhead. Like Claude Code creating a to-do list, or your custom agent maintaining a NOTES.md file, this simple pattern allows the agent to track progress across complex tasks, maintaining critical context and dependencies that would otherwise be lost across dozens of tool calls.
Claude playing Pokémon demonstrates how memory transforms agent capabilities in non-coding domains. The agent maintains precise tallies across thousands of game steps—tracking objectives like "for the last 1,234 steps I've been training my Pokémon in Route 1, Pikachu has gained 8 levels toward the target of 10." Without any prompting about memory structure, it develops maps of explored regions, remembers which key achievements it has unlocked, and maintains strategic notes of combat strategies that help it learn which attacks work best against different opponents.
After context resets, the agent reads its own notes and continues multi-hour training sequences or dungeon explorations. This coherence across summarization steps enables long-horizon strategies that would be impossible when keeping all the information in the LLM’s context window alone.
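The pattern can be sketched in a few lines (the NOTES.md path and the bullet format are illustrative, not a prescribed memory layout):

```python
import pathlib

NOTES = pathlib.Path("NOTES.md")  # hypothetical persistent notes file

def write_note(text: str) -> None:
    """Append a note to durable storage outside the context window."""
    with NOTES.open("a") as f:
        f.write(f"- {text}\n")

def read_notes() -> str:
    """Pull persisted notes back into context, e.g. after a context reset."""
    return NOTES.read_text() if NOTES.exists() else ""
```

Exposed as a pair of tools, this is enough for an agent to checkpoint progress before a reset and rehydrate it afterwards.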
As part of our Sonnet 4.5 launch, we released a memory tool in public beta on the Claude Developer Platform that makes it easier to store and consult information outside the context window through a file-based system. This allows agents to build up knowledge bases over time, maintain project state across sessions, and reference previous work without keeping everything in context.
Sub-agent architectures
Sub-agent architectures provide another way around context limitations. Rather than one agent attempting to maintain state across an entire project, specialized sub-agents can handle focused tasks with clean context windows. The main agent coordinates with a high-level plan while subagents perform deep technical work or use tools to find relevant information. Each subagent might explore extensively, using tens of thousands of tokens or more, but returns only a condensed, distilled summary of its work (often 1,000-2,000 tokens).
This approach achieves a clear separation of concerns—the detailed search context remains isolated within sub-agents, while the lead agent focuses on synthesizing and analyzing the results. This pattern, discussed in How we built our multi-agent research system, showed a substantial improvement over single-agent systems on complex research tasks.
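A sketch of the coordination pattern, with `llm` standing in for any completion callable (the function names and interface are hypothetical, not a library API):

```python
def run_subagent(task: str, llm, max_summary_chars: int = 8000) -> str:
    """Run one focused task in a clean context. The sub-agent may burn tens
    of thousands of tokens exploring, but only a condensed summary returns
    to the lead agent."""
    transcript = llm(f"Work on this task, exploring as needed: {task}")
    summary = llm(f"Condense the findings below into a short summary:\n{transcript}")
    return summary[:max_summary_chars]

def lead_agent(plan, llm) -> str:
    """Coordinate: dispatch each sub-task to a fresh sub-agent, then
    synthesize only the condensed summaries in the lead context."""
    summaries = [run_subagent(task, llm) for task in plan]
    return llm("Synthesize these findings:\n" + "\n\n".join(summaries))
```

The detailed exploration transcripts never enter the lead agent's context, which is the separation of concerns described above.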
The choice between these approaches depends on task characteristics. For example:
- Compaction maintains conversational flow for tasks requiring extensive back-and-forth;
- Note-taking excels for iterative development with clear milestones;
- Multi-agent architectures handle complex research and analysis where parallel exploration pays dividends.
Even as models continue to improve, the challenge of maintaining coherence across extended interactions will remain central to building more effective agents.
Conclusion
Context engineering represents a fundamental shift in how we build with LLMs. As models become more capable, the challenge isn't just crafting the perfect prompt—it's thoughtfully curating what information enters the model's limited attention budget at each step. Whether you're implementing compaction for long-horizon tasks, designing token-efficient tools, or enabling agents to explore their environment just-in-time, the guiding principle remains the same: find the smallest set of high-signal tokens that maximize the likelihood of your desired outcome.
The techniques we've outlined will continue evolving as models improve. We're already seeing that smarter models require less prescriptive engineering, allowing agents to operate with more autonomy. But even as capabilities scale, treating context as a precious, finite resource will remain central to building reliable, effective agents.
Get started with context engineering in the Claude Developer Platform today, and access helpful tips and best practices via our memory and context management cookbook.
Acknowledgements
Written by Anthropic's Applied AI team: Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, and Jeremy Hadfield, with contributions from team members Rafi Ayub, Hannah Moran, Cal Rueb, and Connor Jennings. Special thanks to Molly Vorwerck, Stuart Ritchie, and Maggie Vo for their support.

