Chapter 05

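System Prompt and Cache Economics

A single bad change to the system prompt can multiply your API bill by 10. Hermes avoids this with a three-layer structure plus a few iron rules.

About 5,500 characters · ~20 min read · Keywords: prompt cache · stable/context/volatile · cache_control

Most of an agent's cost (roughly 90%) comes from input tokens. In a 10-turn conversation that resends an 8K-token system prompt on every turn, the system prompt alone accounts for 80K input tokens. In money terms, at Claude Opus pricing (input $15/M tokens), that is $1.20 for the system prompt alone. If the prompt cache hits, cached reads bill at 10%, which brings it down to about $0.12 (ignoring the one-time cache write): a 10× gap. Few things in agent engineering are worth more than this line item.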
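
The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function name and parameters are made up for illustration, and it deliberately ignores the one-time cache write surcharge and any non-system-prompt tokens.

```python
def system_prompt_cost(prompt_tokens, turns, price_per_mtok, cached_read_rate=None):
    """Dollar cost of resending a system prompt on every turn.

    cached_read_rate is the fraction of the normal input price billed for a
    cache read (e.g. 0.10); None means no caching at all.
    """
    total_tokens = prompt_tokens * turns
    full_price = total_tokens * price_per_mtok / 1_000_000
    if cached_read_rate is None:
        return full_price
    return full_price * cached_read_rate


uncached = system_prompt_cost(8_000, 10, 15.0)
cached = system_prompt_cost(8_000, 10, 15.0, cached_read_rate=0.10)
print(f"uncached: ${uncached:.2f}, cached: ${cached:.2f}")
```

Plugging in your own prompt size and turn count is a quick way to see how much of the bill is the system prompt alone.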

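5.1 What Is a Prompt Cache

The core idea of a prompt cache: when the same prompt prefix is used repeatedly, the server caches its KV state, and the next request with the same prefix only pays compute for the "incremental" part.

All major providers offer it:

Provider	Caching mode	Rate (cached input vs. normal input)	TTL
Anthropic	Explicit (`cache_control` markers)	cached: 10% (read) / 125% (5-min write)	5 min or 1 hour
OpenAI	Automatic (prefix ≥ 1024 tokens)	cached: 50%	5–10 min
DeepSeek	Automatic	cached: 10%	Dynamic
Google Gemini	Explicit (context caching API)	cached: 25% + hourly storage fee	User-configured

Whichever provider you use, the condition for a cache hit is the same: the prefix of the messages you send must match the previous request token for token. One extra space or one missing punctuation mark in the middle, and the cache is broken.

Iron rule: The first principle of all prompt management logic in an agent is: never modify a prompt prefix that has already been sent. All "dynamic content" must be appended to the end, never inserted in the middle.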
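
To make the rule concrete, here is a small sketch that measures how much of a previous request survives as a shared prefix. Real caches compare tokens on the server; this example uses list items and characters as stand-ins, and `shared_prefix_len` is a hypothetical helper, not part of Hermes.

```python
def shared_prefix_len(previous, current):
    """Number of leading items that are identical in both sequences."""
    n = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        n += 1
    return n


turn_1 = ["<system>", "user: hi", "assistant: hello"]
appended = turn_1 + ["user: what's next?"]
inserted = turn_1[:1] + ["user: (note added later)"] + turn_1[1:]

print(shared_prefix_len(turn_1, appended))  # the whole previous request is reusable
print(shared_prefix_len(turn_1, inserted))  # only the system prompt survives

# A single extra space is enough to cut the shared prefix short.
print(shared_prefix_len("You are Hermes.", "You  are Hermes."))
```

Appending keeps every earlier item in the prefix; inserting discards everything after the insertion point.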

5.2 The Three-Layer Structure of the System Prompt

Hermes splits the system prompt into three layers, each with a different level of stability. See agent/system_prompt.py:60-77:

agent/system_prompt.py:60-77
def build_system_prompt_parts(agent, system_message=None) -> Dict[str, str]:
    """Assemble the system prompt as three ordered parts.

    Returns a dict with three keys:
      * stable   — identity, tool guidance, skills prompt,
                   environment hints, platform hints, model-family
                   operational guidance.
      * context  — context files (AGENTS.md, .cursorrules, etc.)
                   and caller-supplied system_message.
      * volatile — memory snapshot, user profile, external
                   memory provider block, timestamp line.
    """
    stable_parts = []
    context_parts = []
    volatile_parts = []
    ...
    return {
        "stable": "\n\n".join(stable_parts),
        "context": "\n\n".join(context_parts),
        "volatile": "\n\n".join(volatile_parts),
    }

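Listing 5.1 The core entry point for the three-layer prompt

How the three layers compare in stability:

Layer	Contents	Stability	What triggers a change
stable	SOUL.md identity, tool usage, model-family guidance	Unchanged for the whole session	Switching sessions or restarting
context	AGENTS.md, .cursorrules, caller-supplied system_message	Usually unchanged within a session	Changing the working directory, the user editing the files
volatile	Memory snapshot, user profile, timestamp	May change every turn	Every turn / memory update

The concatenation order is stable + context + volatile, from most stable to most volatile. This order matters a great deal: it directly determines the cache hit rate.

Why this order

The boundary of a prompt cache hit is the longest common prefix. If you put volatile content first, all the stable content after it is invalidated along with it. So the most stable content goes first.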

flowchart TB
  subgraph Bad["❌ Bad order: volatile + stable"]
    direction TB
    B1["Turn 1: [timestamp-T1] [identity] [tools] [AGENTS.md]"]
    B2["Turn 2: [timestamp-T2] [identity] [tools] [AGENTS.md]"]
    B1 -.->|"T1 != T2 ⇒ entire prefix changes"| B2
    BAll["⚠️ Everything is a cache miss"]:::warn
    B2 --> BAll
  end
  subgraph Good["✅ Good order: stable + context + volatile"]
    direction TB
    G1["Turn 1: [identity] [tools] [AGENTS.md] [timestamp-T1]"]
    G2["Turn 2: [identity] [tools] [AGENTS.md] [timestamp-T2]"]
    G1 -.->|"prefix is identical"| G2
    GHit["✓ Prefix cache hit<br/>only the trailing timestamp is a miss"]:::good
    G2 --> GHit
  end
  classDef warn fill:#faf0e3,stroke:#b85c00,color:#b85c00
  classDef good fill:#ecf3eb,stroke:#2f5d3a,color:#2f5d3a
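
The effect of layer order can also be measured directly. The sketch below assembles the same three layers in two orders and reports what fraction of the second prompt is covered by the prefix it shares with the first. The layer contents are placeholders, character counts stand in for tokens, and none of the function names come from Hermes.

```python
def build_prompt(order, layers):
    """Join the named layers in the given order."""
    return "\n\n".join(layers[name] for name in order)


def reusable_fraction(prev, cur):
    """Share of `cur` covered by its common prefix with `prev`."""
    n = 0
    while n < min(len(prev), len(cur)) and prev[n] == cur[n]:
        n += 1
    return n / len(cur)


def layers_at(timestamp):
    """Placeholder layers; only the volatile one depends on the time."""
    return {
        "stable": "IDENTITY " * 200,
        "context": "AGENTS.md " * 200,
        "volatile": f"Current time: {timestamp}",
    }


good = ("stable", "context", "volatile")
bad = ("volatile", "stable", "context")
t1 = layers_at("2026-05-01T09:00:00")
t2 = layers_at("2026-05-01T09:05:00")
for order in (good, bad):
    frac = reusable_fraction(build_prompt(order, t1), build_prompt(order, t2))
    print(f"{order[0]} first: {frac:.1%} reusable")
```

With the stable layer first, almost the entire prompt is shared between the two builds; with the volatile layer first, the shared prefix ends inside the timestamp.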

5.3 What Goes in Each Layer

Stable layer: the unchanging soul

See lines 84–254 of system_prompt.py:

agent/system_prompt.py:84-150 (excerpt)
# Agent identity (from SOUL.md or DEFAULT_AGENT_IDENTITY)
_soul_content = _r.load_soul_md()
stable_parts.append(_soul_content)

# Per-tool usage guidance (added conditionally)
if "memory" in agent.valid_tool_names:
    tool_guidance.append(MEMORY_GUIDANCE)
if "session_search" in agent.valid_tool_names:
    tool_guidance.append(SESSION_SEARCH_GUIDANCE)
if "todo" in agent.valid_tool_names:
    tool_guidance.append(TODO_GUIDANCE)

# Tool-calling discipline (some models tend to "talk instead of act")
if agent.valid_tool_names:
    _enforce = agent._tool_use_enforcement
    if _enforce is True or agent.model in TOOL_USE_ENFORCEMENT_MODELS:
        stable_parts.append(TOOL_USE_ENFORCEMENT_GUIDANCE)

# Model-family-specific guidance (Gemini / Qwen / GPT each have their own quirks)
model_family_block = _get_model_family_guidance(agent.model)
if model_family_block:
    stable_parts.append(model_family_block)

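Design principles:

Conditional guidance: if a tool is not enabled, its guidance is not added. This avoids wasted tokens.
Per-model-family guidance: Gemini tends to "over-complete", so it gets an "answer concisely" block; Qwen occasionally forgets the tool_call_id after reasoning, so it gets a reminder. Every block was learned the hard way.
Everything is assembled at startup and never recomputed within a session.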

Context layer: workspace-specific

agent/system_prompt.py:260-270
if system_message is not None:
    context_parts.append(system_message)

if not agent.skip_context_files:
    context_files_prompt = _r.build_context_files_prompt(
        cwd=_context_cwd,
        skip_soul=_soul_loaded,
    )
    if context_files_prompt:
        context_parts.append(context_files_prompt)

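build_context_files_prompt() scans the current working directory and automatically discovers and includes:

AGENTS.md: the project's development guide for AI (Hermes' own is 53KB)
.cursorrules: the rules file used by the Cursor IDE
CLAUDE.md: the guide used by Claude Code
Other user-defined .md files

Design note: Hermes treats AGENTS.md, CLAUDE.md, and .cursorrules as the same kind of thing: the project's "instructions for use" for AI. This has been the industry consensus since 2024. Chapter 14 covers how to write such a file for your own project.

Volatile layer: may change every turn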

agent/system_prompt.py:272-280
# Memory snapshot: the memory state as of the last sync
memory_block = agent.memory_manager.build_system_prompt()
if memory_block:
    volatile_parts.append(memory_block)

# User profile — USER.md
user_profile = _r.load_user_profile()
if user_profile:
    volatile_parts.append(user_profile)

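# Timestamp + session + model: always the last line
volatile_parts.append(f"Current time: {now_iso()}\nSession: {sid}\nModel: {agent.model}")

Note the timestamp: it differs every time the prompt is built, so the volatile layer never matches across rebuilds. Because it sits at the very end of the system prompt, it cannot invalidate the stable + context layers in front of it, and those still hit. The message history does follow the system prompt in the request, however, so within a session the volatile block has to stay byte-identical for the history to be reusable. That is one reason Hermes does not rebuild the system prompt mid-conversation (see 5.5).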

5.4 cache_control Markers (Anthropic)

Anthropic's cache is explicit: you have to add a cache_control field yourself to tell it "set a cache breakpoint here". At most 4 breakpoints are allowed. Hermes' marking strategy:

# Example messages for an Anthropic-compatible endpoint
[
    {
        "role": "system",
        "content": [
            {"type": "text", "text": stable_part},
            {"type": "text", "text": context_part,
             "cache_control": {"type": "ephemeral"}},   # ← cache up to here
            {"type": "text", "text": volatile_part}        # volatile is not cached
        ]
    },
    {"role": "user", "content": "..."},
    ...
]

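The semantics of this cache_control marker: "everything from the start of the request up to this point is the cache target". Hermes usually places the marker at the end of context_part, so the volatile part stays out of the cache.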
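
As a rough sketch of how the three-part dict from Listing 5.1 could be turned into content blocks like the ones above: `to_system_blocks` is a hypothetical helper, not the Hermes implementation. It assumes empty layers should be skipped rather than sent as empty text blocks.

```python
def to_system_blocks(parts):
    """Turn {"stable", "context", "volatile"} strings into content blocks.

    The breakpoint goes on the last non-empty block before volatile, so the
    cached prefix ends exactly where the stable content ends.
    """
    blocks = [
        {"type": "text", "text": parts[name]}
        for name in ("stable", "context")
        if parts.get(name)
    ]
    if blocks:
        blocks[-1]["cache_control"] = {"type": "ephemeral"}
    if parts.get("volatile"):
        blocks.append({"type": "text", "text": parts["volatile"]})
    return blocks


example = to_system_blocks({
    "stable": "You are Hermes.",
    "context": "# AGENTS.md\nRun the tests before committing.",
    "volatile": "Current time: 2026-05-01T09:00:00",
})
for block in example:
    print(block.get("cache_control"), "|", block["text"][:30])
```

If the context layer is empty (no AGENTS.md, no caller-supplied message), the marker falls back to the stable block, so the breakpoint never lands on volatile content.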

Cache markers on tools

The tool schemas array also counts as input tokens, potentially thousands of them. Anthropic lets you put cache_control on tools:

tools = [
    {
        "type": "function",
        "function": {...},
    },
    ...
    {
        "type": "function",
        "function": {...},
        "cache_control": {"type": "ephemeral"}   # mark the last tool
    }
]

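Hermes automatically marks the end of the tools list. As a result the cache hit rate for tool schemas is very high: as long as you do not change the toolset, they hit for the entire session.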
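
A minimal sketch of what "mark the end of the tools list" might look like, using Anthropic's native tool shape (`name`, `description`, `input_schema`). Both helpers are hypothetical. The sorting step is an added suggestion, since any reordering of tools changes the prefix.

```python
import copy


def stable_tool_order(tools):
    """Sort by name so the serialized list is identical on every request."""
    return sorted(tools, key=lambda tool: tool["name"])


def mark_last_tool(tools, ttl=None):
    """Return a copy of `tools` with a cache breakpoint on the final entry."""
    marked = copy.deepcopy(tools)
    if marked:
        marker = {"type": "ephemeral"}
        if ttl == "1h":
            marker["ttl"] = "1h"
        marked[-1]["cache_control"] = marker
    return marked


tools = [
    {"name": "todo", "description": "Manage a todo list",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "memory", "description": "Read or write long-term memory",
     "input_schema": {"type": "object", "properties": {}}},
]
request_tools = mark_last_tool(stable_tool_order(tools))
print([t["name"] for t in request_tools], request_tools[-1]["cache_control"])
```

Working on a deep copy keeps the registry's own tool definitions free of request-specific markers.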

5.5 What Breaks the Cache

Hermes' AGENTS.md contains this iron rule:

Hermes-Agent ensures caching remains valid throughout a conversation. Do NOT implement changes that would:

Alter past context mid-conversation

Change toolsets mid-conversation

Reload memories or rebuild system prompts mid-conversation

Hermes AGENTS.md

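Concretely, here is how common operations affect the cache:

Action	Breaks the cache?	Reason
Appending a new message (user / assistant)	No	Pure append; the prefix is unchanged
Modifying the system prompt	Yes	The very first segment of the prefix changes
Modifying the tools list (even just the order)	Yes	Tools are within the prompt cache scope
Inserting a message in the middle of the history	Yes	The prefix changes
Modifying a past message (editing history)	Yes	Same as above
Putting volatile content (a timestamp) into stable	Yes	Per-turn content lands inside the prefix
Recomputing the system prompt every turn (even with identical content)	Avoid	The result must match byte for byte; every rebuild risks a difference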
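
One way to follow the last row is to build the prompt once and reuse the exact string. This is a hedged sketch of the idea, not how Hermes does it: `FrozenSystemPrompt` and its fingerprint are illustrative names.

```python
import hashlib


class FrozenSystemPrompt:
    """Build the system prompt once per session, then return the same string."""

    def __init__(self, builder):
        self._builder = builder   # zero-argument callable returning str
        self._text = None

    def get(self):
        if self._text is None:    # first call in this session
            self._text = self._builder()
        return self._text

    def fingerprint(self):
        """Short hash to log with each request; a change means the prefix changed."""
        return hashlib.sha256(self.get().encode("utf-8")).hexdigest()[:12]


build_count = 0


def build():
    global build_count
    build_count += 1
    return f"You are Hermes. (build #{build_count})"


prompt = FrozenSystemPrompt(build)
print(prompt.get() == prompt.get(), build_count, prompt.fingerprint())
```

Logging the fingerprint next to the provider's cache statistics makes it easy to correlate a sudden drop in hits with a changed prompt.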

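The right approach: append-only

All dynamic information is only appended to the end. In Hermes this rule is enforced by several mechanisms:

Skill loading: the system prompt is left alone; a user message containing the SKILL.md content is appended instead. The system prompt cache stays intact.
Memory prefetch: each turn's retrieval results are appended to the volatile layer. stable and context do not change.
Tool results: appended as new tool messages.
Mid-session user commands (such as /model): they do not take effect immediately, only in the next session. Alternatively, the --now flag explicitly accepts the cache break.

Cache-aware slash commands: Hermes' /skills install defaults to deferred invalidation: after installing, the prompt is not rebuilt immediately, and the change takes effect automatically in the next session. If you are in a hurry, run /skills install --now and pay for one cache miss. This principle applies to every command that "modifies the prompt".

5.6 Compaction: When the Cache Cannot Save You

The cache optimizes "repeated prefixes". But as a conversation grows, the prefix itself eventually stops fitting in the context window. That is when compaction comes in:

Further reading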

Effective Context Engineering for AI Agents

Anthropic Engineering · 2025

Anthropic's 2025 article on context engineering describes compaction as "the first lever in context engineering to drive better long-term coherence", one that distills the contents of a context window "in a high-fidelity manner".

Hermes' compaction mechanism:

agent/context_compression.py (summary)
def compress_context(messages, system_prompt, max_tokens, ...):
    """Compress a long message list into a short one.

    1. Keep the last N messages (the most recent, still high-fidelity).
    2. Feed everything before them to an auxiliary LLM (a small/cheap model)
       to produce a structured summary.
    3. Replace the original messages with [system, <summary message>, ...last N].
    """
    keep_recent = messages[-RECENT_KEEP:]
    history_to_compress = messages[:-RECENT_KEEP]

    summary = aux_llm.summarize(history_to_compress,
                                  goal="preserve facts, decisions, errors")

    new_messages = [
        {"role": "user", "content": f"[Conversation summary: {summary}]"},
        *keep_recent,
    ]
    return new_messages

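Key design choices:

Use a cheap model for the summary: Hermes configures an auxiliary client that defaults to a small, cheap model (such as Claude Haiku or GPT-4o-mini). Compression itself should not be expensive.
Structured prompt: the summary keeps facts, decisions, and errors, and drops conversational pleasantries.
Keep the N most recent messages: recent content stays verbatim so the model does not lose its "contextual momentum".
New session or same session after compression? By default Hermes replaces messages within the same session, but some scenarios (cron jobs) link to a new session via parent_session_id.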
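
The listing above depends on names that are not shown (`aux_llm`, `RECENT_KEEP`). For experimenting with the idea in isolation, here is a self-contained sketch under stated assumptions: the summarizer is passed in as a plain callable, and `naive_summarize` is a stand-in for the auxiliary model, not a real summarizer.

```python
def compact(messages, summarize, keep_recent=4):
    """Replace everything but the last `keep_recent` messages with a summary.

    `summarize` is any callable taking a list of messages and returning text;
    in production it would call a small, cheap model.
    """
    split = len(messages) - keep_recent
    if split <= 0:
        return list(messages)          # nothing old enough to compress
    old, recent = messages[:split], messages[split:]
    summary = {
        "role": "user",
        "content": f"[Conversation summary: {summarize(old)}]",
    }
    return [summary, *recent]


def naive_summarize(old_messages):
    """Stand-in for the auxiliary LLM: keeps the first 40 chars of each turn."""
    return " | ".join(str(m["content"])[:40] for m in old_messages)


history = [{"role": "user" if i % 2 == 0 else "assistant",
            "content": f"message number {i}"} for i in range(10)]
compacted = compact(history, naive_summarize, keep_recent=4)
print(len(history), "->", len(compacted))
print(compacted[0]["content"])
```

A real implementation would also need to make sure the kept tail does not begin with a tool result whose originating tool call was summarized away; this sketch does not handle that.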

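5.7 Just-In-Time Context (Lazy Loading)

The second technique from Anthropic's 2025 article: do not preload all data into the prompt; let the agent pull it on demand with tools.

Anti-pattern

# BAD: stuff entire files into the prompt up front
system_prompt = """You are a code reviewer. Here are all the files:
File 1: src/auth.py
[full contents, 2000 lines]
File 2: src/db.py
[full contents, 3000 lines]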
...
"""

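This makes the system prompt huge and every cache miss expensive, and the model may only use one or two of the files.

JIT pattern

# GOOD: include only an index and let the model decide what to read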
system_prompt = """You are a code reviewer. The repo has these files:
- src/auth.py
- src/db.py
- src/api/users.py
...
Use the read_file tool to view file contents when needed.
"""

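The model reads files on demand with the read_file tool. This is exactly what Claude Code, Cursor, and Hermes do: they never dump all the code up front.

Comparison:

	Preloading	JIT
System prompt size	Large (contains all data)	Small (index only)
Per-turn cost	You pay for all the data	You pay only for what is read
Cache hits	Any data change breaks everything	The system prompt does not change, so it keeps hitting
Model attention	Spread across all the data	Focused on what is needed now
Development burden	Deciding what to include	Writing good tool descriptions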
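
The last row of the table is where the work moves to. As an illustration, here is what a read_file tool might look like: a schema in Anthropic's native tool format plus a handler. The parameter names (`start_line`, `max_lines`) and the 200-line default are assumptions made for this sketch, not Hermes' actual tool definition.

```python
from pathlib import Path

READ_FILE_TOOL = {
    "name": "read_file",
    "description": "Read a slice of a text file inside the repo. "
                   "Use it instead of guessing at file contents.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path relative to the repo root"},
            "start_line": {"type": "integer", "minimum": 1},
            "max_lines": {"type": "integer", "minimum": 1},
        },
        "required": ["path"],
    },
}


def read_file(root, path, start_line=1, max_lines=200):
    """Return at most `max_lines` lines, refusing paths outside `root`."""
    root = Path(root).resolve()
    target = (root / path).resolve()
    if not target.is_relative_to(root):      # Python 3.9+
        raise ValueError(f"path escapes repo root: {path}")
    lines = target.read_text(encoding="utf-8").splitlines()
    first = start_line - 1
    return "\n".join(lines[first:first + max_lines])
```

Capping the number of lines per call keeps any single tool result small, which matters because tool results become part of the prefix for every later turn.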

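Tool Result Clearing (Anthropic 2025)

JIT has an advanced variant: once the model has read a file and used it, clear the old tool result and keep only the assistant's summary. In 2025 Anthropic exposed this in the Claude API as a beta feature called context editing:

# Context editing beta: the API clears older tool results as the context grows.
# Names are as of the 2025 beta; check the current Anthropic docs before relying on them.
client.beta.messages.create(
    ...
    betas=["context-management-2025-06-27"],
    context_management={"edits": [{"type": "clear_tool_uses_20250919"}]},
)

Hermes does not fully use this API yet (as of writing, 2026.05), but the code has a similar mechanism: during compression it drops old tool results first and keeps the assistant's text summaries.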
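
A client-side equivalent can be sketched in a few lines. This is an illustration of the idea only, with a made-up function name and placeholder text. It assumes OpenAI-style messages where tool results have `role == "tool"`.

```python
def clear_old_tool_results(messages, keep_last=3, placeholder="[tool result cleared]"):
    """Return a copy in which all but the newest `keep_last` tool results are stubbed."""
    tool_idx = [i for i, m in enumerate(messages) if m.get("role") == "tool"]
    to_clear = set(tool_idx[:-keep_last]) if keep_last > 0 else set(tool_idx)
    return [
        {**m, "content": placeholder} if i in to_clear else m
        for i, m in enumerate(messages)
    ]


history = [
    {"role": "user", "content": "review auth.py"},
    {"role": "tool", "tool_call_id": "t1", "content": "<2000 lines of auth.py>"},
    {"role": "assistant", "content": "auth.py looks fine; checking db.py."},
    {"role": "tool", "tool_call_id": "t2", "content": "<3000 lines of db.py>"},
    {"role": "assistant", "content": "db.py has an unclosed cursor."},
]
trimmed = clear_old_tool_results(history, keep_last=1)
print([m["content"][:24] for m in trimmed if m["role"] == "tool"])
```

Note the trade-off: rewriting an old message changes the prefix from that point on, so this belongs at compaction time, when the cache is being invalidated anyway, rather than on every turn.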

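5.8 In Practice: The Skill Injection Strategy

This section returns to a concrete scenario: the user types /skills install fortune. How do you handle it without breaking the cache?

Option A (bad): add the new skill to the system prompt. The entire cache breaks immediately.

Option B (good, and what Hermes does):

Download the skill to ~/.hermes/skills/.
Do not modify the system prompt.
The new skill is loaded naturally the next time the user types /fortune or a new session starts.

When the user invokes /fortune:

# Instead of stuffing SKILL.md into the system prompt,
# append it as a user message.
# skill_md_content holds the text of SKILL.md.
messages.append({
    "role": "user",
    "content": f"""[Skill activated: fortune]

{skill_md_content}

Follow this skill's guidance when answering the user's next questions.
"""
})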

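This way the system prompt and tools caches still hit. The skill content, being a user message, participates in caching too: on later turns, the prefix up to and including this user message can also hit (as long as the message content is byte-identical).

Design maxim: The system prompt is like an operating system image; user messages are like applications. Installing an application should not require rebooting the operating system. Putting "context-dependent capabilities" into user messages rather than the system prompt is the core insight of Hermes' engineering.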
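
The "byte-identical" condition is easy to violate by accident. The following sketch shows one way to keep the activation message deterministic. The helper name and the exact wording of the message are assumptions, not Hermes' format.

```python
def skill_activation_message(name, skill_md):
    """Build the user message that activates a skill.

    No timestamps, counters, or random IDs: the same skill text must always
    produce the same bytes, otherwise the prefix through this message cannot
    be reused.
    """
    body = skill_md.replace("\r\n", "\n").strip()
    return {
        "role": "user",
        "content": f"[Skill activated: {name}]\n\n{body}\n\n"
                   "Follow this skill's guidance for the user's next requests.",
    }


unix_copy = "# Fortune\nTell a fortune in one sentence.\n"
windows_copy = "# Fortune\r\nTell a fortune in one sentence.\r\n"
print(skill_activation_message("fortune", unix_copy)
      == skill_activation_message("fortune", windows_copy))
```

Normalizing line endings and surrounding whitespace means the same SKILL.md yields the same message regardless of how the file was checked out.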

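5.9 How to Measure Your Cache Hit Rate

Providers return cache statistics in the response, each in its own format:

# The usage field in an Anthropic response
{
    "usage": {
        "input_tokens": 200,                # input that was neither read from nor written to the cache
        "cache_creation_input_tokens": 0,    # input tokens written to the cache
        "cache_read_input_tokens": 8000,      # input tokens read from the cache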
        "output_tokens": 150
    }
}

Hit rate = cache_read / (cache_read + input_tokens + cache_creation). The ideal is > 90%. If you see < 50%, something is almost certainly breaking the cache repeatedly.
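
The formula translates directly into code. A minimal sketch using the field names from the Anthropic usage block above; the session-level helper is an addition that sums tokens across responses rather than averaging per-request rates.

```python
def cache_hit_rate(usage):
    """Fraction of input tokens served from the cache for one response."""
    read = usage.get("cache_read_input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    uncached = usage.get("input_tokens", 0)
    total = read + written + uncached
    return read / total if total else 0.0


def session_hit_rate(usages):
    """Token-weighted hit rate over many responses."""
    totals = {"input_tokens": 0,
              "cache_creation_input_tokens": 0,
              "cache_read_input_tokens": 0}
    for usage in usages:
        for key in totals:
            totals[key] += usage.get(key, 0)
    return cache_hit_rate(totals)


example_usage = {"input_tokens": 200, "cache_creation_input_tokens": 0,
                 "cache_read_input_tokens": 8000, "output_tokens": 150}
print(f"{cache_hit_rate(example_usage):.1%}")
```

The first request of a session always shows a rate of zero because it writes the cache, so the session-level number is the one worth alerting on.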

Hermes provides the /insights command to view cache statistics for the current session:

$ hermes
> /insights
─── Cache Performance ──────────────────────────
  Read:     12,580 tokens   (cached, $0.013 saved)
  Write:    1,420 tokens    (this session built cache)
  Miss:     320 tokens      ← healthy
  Hit rate: 96%
─── Cost Breakdown ─────────────────────────────
  Input:    $0.0042
  Cached:   $0.0095
  Output:   $0.0231
  Total:    $0.0368

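5.10 New Developments in Cache Protocols, 2025–2026

In 2025–2026 the providers shipped several more cache features that affect cache design:

① Anthropic 1-hour Cache TTL (2025 GA)

Anthropic's cache originally defaulted to a 5-minute TTL. In 2025 a 1-hour TTL option reached GA. Cache writes are billed at 2× the base input price (vs. 1.25× for 5 minutes), but it suits large prefixes that stay unchanged for a long time (static instructions, a full set of tool schemas, AGENTS.md, and so on).

A simple economic rule for choosing the TTL:

Requests to the same prefix arrive more than 5 minutes apart, and the prefix is hit at least twice within the next hour → the 1-hour TTL pays off.
Short bursty tasks → the 5-minute TTL is enough, since every hit refreshes it.

Hermes defaults to 5 minutes. In a production deployment where the agent is long-lived with a fixed prefix, and traffic is steady but with gaps that often exceed 5 minutes, switching to 1 hour can save roughly 30%, depending on the traffic shape.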
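
The break-even rule can be checked with a small model. The sketch assumes evenly spaced requests, the write multipliers quoted above (1.25× and 2×), a 10% read rate, and that a gap equal to the TTL counts as expired. The function is illustrative, and costs are expressed relative to one uncached send of the prefix.

```python
def prefix_cost(n_requests, ttl, gap_minutes):
    """Relative cost of n evenly spaced requests to one cached prefix."""
    if n_requests <= 0:
        return 0.0
    write_rate = {"5m": 1.25, "1h": 2.0}[ttl]
    ttl_minutes = {"5m": 5, "1h": 60}[ttl]
    read_rate = 0.10
    if gap_minutes >= ttl_minutes:
        # Every request finds the entry expired and pays for a fresh write.
        return n_requests * write_rate
    # Each hit refreshes the TTL, so only the first request writes.
    return write_rate + (n_requests - 1) * read_rate


for gap in (2, 15):
    five = prefix_cost(4, "5m", gap)
    hour = prefix_cost(4, "1h", gap)
    print(f"gap {gap:>2} min: 5m TTL = {five:.2f}, 1h TTL = {hour:.2f}")
```

With 2-minute gaps the 5-minute TTL never expires and stays cheaper; with 15-minute gaps it expires before every request, and the 1-hour TTL wins clearly.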

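② OpenAI Automatic Prompt Caching (2024.10 GA)

OpenAI needs no explicit markers. It automatically caches when it detects a repeated prefix of ≥ 1024 tokens, and cache hits get a 50% discount. It is transparent to developers, but you still have to avoid "breaking the prefix". Same philosophy as Anthropic, different mechanism.

③ DeepSeek Auto Caching

DeepSeek shipped auto caching early, in August 2024: cache hits are billed at 10% of the normal price, cheaper than OpenAI. Paired with V4/R2, the price-performance is very strong.

④ Tool Result Clearing (Anthropic 2025)

Section 5.7 covered how this relates to JIT context. At the protocol level: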

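client.beta.messages.create(
    ...
    cache_control={"type": "ephemeral", "ttl": "1h"},
    # Context editing (2025 beta): clear older tool results as the context grows.
    # Names are as of the 2025 beta; check the current Anthropic docs.
    betas=["context-management-2025-06-27"],
    context_management={"edits": [{"type": "clear_tool_uses_20250919"}]},
)

Hermes does not use this parameter yet (as of writing, 2026.05), but the code already has an equivalent mechanism: during compaction it drops old tool results first and keeps the assistant's summaries.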

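⑤ Extended Context Cache (early 2026)

In early 2026 OpenAI introduced a "context cache" for GPT-5: similar to the prompt cache, but with a TTL of up to 24 hours and billing by storage volume and time. It suits scenarios like "upload a 10MB document and have the model answer questions about it repeatedly". In a sense, Hermes' built-in memory system is a client-side implementation of this capability.

⑥ Multimodal cache

Since 2025, image/video input can be cached as well (for example, the same image queried several times with Claude vision). This is critical for GUI agents and vision applications: a screenshot that is analyzed repeatedly is no longer recomputed each time.

2026 cache economics best practices ① Long prefix + long TTL: the 1-hour cache beats the 5-minute one once the prefix is hit at least twice in the next hour with gaps longer than 5 minutes.
② Use all 4 of Anthropic's cache_control breakpoints to full advantage; do not stop at one.
③ Reuse the same system prompt + AGENTS.md across sessions so that multiple users share the cache.
④ Monitor the cache hit rate and aim for > 90%; below 70%, find out why.

5.11 Code Deep Dive: cache_control Injection and Token Tracking

5.11.1 apply_anthropic_cache_control: the "system_and_3" strategy

Section 5.4 mentioned Anthropic's 4 breakpoints. Here is how they are actually placed: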

agent/prompt_caching.py:15-79
import copy


def _apply_cache_marker(msg, cache_marker, native_anthropic=False):
    """Add cache_control to a single message, handling all format variations."""
    role = msg.get("role", "")
    content = msg.get("content")

    if role == "tool":
        if native_anthropic:
            msg["cache_control"] = cache_marker
        return

    if content is None or content == "":
        msg["cache_control"] = cache_marker
        return

    if isinstance(content, str):
        # String content → convert to a list of content blocks, marker on the last block
        msg["content"] = [
            {"type": "text", "text": content, "cache_control": cache_marker}
        ]
        return

    if isinstance(content, list) and content:
        last = content[-1]
        if isinstance(last, dict):
            last["cache_control"] = cache_marker


def _build_marker(ttl):
    marker = {"type": "ephemeral"}
    if ttl == "1h":
        marker["ttl"] = "1h"
    return marker


def apply_anthropic_cache_control(api_messages, cache_ttl="5m", native_anthropic=False):
    """system_and_3 strategy: system + last 3 non-system messages, same TTL.

    Returns deep copy with breakpoints injected."""
    messages = copy.deepcopy(api_messages)
    if not messages:
        return messages

    marker = _build_marker(cache_ttl)
    breakpoints_used = 0

    if messages[0].get("role") == "system":
        _apply_cache_marker(messages[0], marker, native_anthropic=native_anthropic)
        breakpoints_used += 1

    remaining = 4 - breakpoints_used
    non_sys = [i for i in range(len(messages)) if messages[i].get("role") != "system"]
    for idx in non_sys[-remaining:]:
        _apply_cache_marker(messages[idx], marker, native_anthropic=native_anthropic)

    return messages

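Key design points:

All three content shapes are handled: None/empty → add cache_control at the message level; string → convert to a list of blocks, then mark; list → mark the last block. This kind of "format normalization" is typical protocol adapter work.
system_and_3 rather than last_4: the system prompt always takes one breakpoint first, and the remaining 3 go to the 3 most recent non-system messages. The reasoning: the system prompt does not change across turns, so its cache entry is the most reliably reused. The 3 most recent user/assistant messages are the "just sent" ones, and on the next turn the window rolls forward so they are no longer the last 3.
copy.deepcopy: the original messages are not polluted. The same messages may be passed in several times during retry / fallback, so in-place modification is not allowed.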
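
To see the rolling window, it helps to look only at which indices get picked. The helper below is a simplified mirror of the selection logic in the listing, written for illustration. It works on a list of roles and ignores the detail that tool messages are only marked when native_anthropic is set.

```python
def breakpoint_indices(roles, max_breakpoints=4):
    """Indices that a system_and_3-style strategy would mark, given message roles."""
    if not roles:
        return []
    picked = []
    if roles[0] == "system":
        picked.append(0)
    remaining = max_breakpoints - len(picked)
    non_system = [i for i, role in enumerate(roles) if role != "system"]
    picked.extend(non_system[-remaining:])
    return picked


turn_n = ["system", "user", "assistant", "tool", "user", "assistant"]
turn_n_plus_1 = turn_n + ["user"]
print(breakpoint_indices(turn_n))
print(breakpoint_indices(turn_n_plus_1))
```

Between the two turns the system breakpoint stays at index 0 while the other three shift forward by one message, which is the "rolling" behaviour described above. Without a system message, all four breakpoints go to the most recent messages.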

5.11.2 Anti-thrash protection in ContextCompressor

Section 3.8 said that compression is triggered by real token counts. Here is the full state machine:

agent/context_compressor.py:578-634
class ContextCompressor:
    def __init__(self, ...):
        self.last_prompt_tokens = 0
        self.last_completion_tokens = 0
        self.last_total_tokens = 0

        # Anti-thrashing state
        self._last_compression_savings_pct = 100.0
        self._ineffective_compression_count = 0
        self._summary_failure_cooldown_until = 0.0

    def update_from_response(self, usage):
        """Update tracked token usage from API response."""
        self.last_prompt_tokens = usage.get("prompt_tokens", 0)
        self.last_completion_tokens = usage.get("completion_tokens", 0)
        self.last_total_tokens = usage.get("total_tokens",
            self.last_prompt_tokens + self.last_completion_tokens)

    def should_compress(self, prompt_tokens=None):
        """Check if context exceeds compression threshold.

        Anti-thrashing: if last 2 compressions each saved < 10%, skip —
        avoids infinite loops where each pass removes 1-2 messages."""
        tokens = prompt_tokens if prompt_tokens is not None else self.last_prompt_tokens
        if tokens < self.threshold_tokens:
            return False
        # 2 consecutive "compression barely saved anything" passes → skip and point the user to /new
        if self._ineffective_compression_count >= 2:
            if not self.quiet_mode:
                logger.warning(
                    "Compression skipped — last %d compressions saved <10%% each. "
                    "Consider /new to start a fresh session, or /compress <topic> "
                    "for focused compression.",
                    self._ineffective_compression_count,
                )
            return False
        return True

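Why 2? One is not enough evidence (that particular content may simply have been hard to compress). Three is too late: two LLM calls have already been wasted. Two is the sweet spot for "fail fast". Also, the warning is not a silent skip: it tells the user exactly what to do (/new or /compress <topic>). An error message is an "instruction manual" for humans, and that is good engineering.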
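
The excerpt shows where the counter is read but not where it is updated. One plausible way to maintain it, sketched here as an assumption rather than the actual Hermes code, is to record the savings after every compression pass and reset the streak whenever a pass is effective. `ThrashGuard` and its method names are hypothetical.

```python
class ThrashGuard:
    """Tracks whether recent compressions were worth their cost."""

    def __init__(self, min_savings_pct=10.0, max_ineffective=2):
        self.min_savings_pct = min_savings_pct
        self.max_ineffective = max_ineffective
        self.ineffective_count = 0

    def record(self, tokens_before, tokens_after):
        """Call after each compression pass; returns the savings in percent."""
        saved = tokens_before - tokens_after
        pct = 100.0 * saved / tokens_before if tokens_before > 0 else 0.0
        if pct < self.min_savings_pct:
            self.ineffective_count += 1
        else:
            self.ineffective_count = 0   # one good pass resets the streak
        return pct

    def allow_compression(self):
        return self.ineffective_count < self.max_ineffective


guard = ThrashGuard()
guard.record(100_000, 96_000)    # saved 4%
guard.record(96_000, 93_000)     # saved about 3%
print(guard.allow_compression())
```

Resetting on a good pass means the guard only trips on consecutive ineffective compressions, matching the "last 2 compressions" wording in the docstring.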

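5.12 Takeaways

Prompt cache is the lifeline of agent economics. A hit vs. a miss is up to a 10× difference in cost, depending on the provider.
Every cache hit depends on the same condition: the prefix you send is byte-for-byte identical to the previous one.
Hermes organizes the system prompt in three layers, stable / context / volatile, ordered from most to least stable, so that the cache hit boundary sits as far back as possible.
Dynamic content is only appended to the end, never inserted or edited in the middle.
Anthropic uses explicit cache_control markers (Gemini has an explicit context caching API); OpenAI and DeepSeek cache automatically. Hermes' marking strategy: end of context + end of tools.
Skills are injected through user messages and do not pollute the system prompt. This is the most central move in Hermes' engineering.
When the context approaches the window limit, use compaction (have a small model summarize).
JIT context: include an index, not the full data, and let the agent pull what it needs with tools.
Monitor the cache hit rate with /insights or the provider's usage field; only > 90% counts as healthy.

End-of-Chapter Exercises

Easy Which of the following actions break the cache?
- (a) Deleting a message from the middle of the history
- (b) Appending a user message at the end
- (c) Putting the current time on the first line of the system prompt
- (d) Changing one punctuation mark in message 5
- (e) Adding a new tool to the tools list
Easy Why must Hermes' three prompt layers be ordered stable → context → volatile, and not the reverse?
Medium Read agent/system_prompt.py and list the conditional branches in the stable layer (the if "X" in agent.valid_tool_names: kind). Think it through: how does each if affect cache behavior across different sessions?
Medium Design for this scenario: every morning at 9, a user has the agent summarize their RSS feeds. The RSS content differs every day, but the prompt template does not. How do you organize the prompt to maximize cache hits? Where does the part that changes daily go?
Hard Apply "JIT context" to a concrete scenario: the agent has to work on a 50,000-line codebase. List the schemas of 5 tools that let the agent complete its task without preloading any code.
Hard Anthropic's 5-minute vs. 1-hour cache TTL: in which scenarios is 5 minutes the better deal, and in which is 1 hour? Write a decision tree. Hint: consider the price of cache_creation_tokens under each TTL.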
