Chapter 03

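run_conversation: A Deep Dissection

What actually lives inside a single while loop of roughly 3,800 lines? We start from the cleanest 13 lines and add the engineering reality back one layer at a time.

Length: about 8,200 characters · Reading time: ~35 minutes · Keywords: agent loop · budget · interrupt · retry · streaming

This is the central chapter of the book. Open Hermes's agent/conversation_loop.py: lines 232 through 3821 are a single function, run_conversation(). That is 3,589 lines built around one while loop, and the whole soul of the Agent lives in it.

In this chapter we do one thing: strip those 3,589 lines down to a 13-line essence, then add the engineering reality back layer by layer. By the end you should understand why each layer has to exist and what bug appears if you get it wrong.

3.1 The 13-line essence

Written as the cleanest possible Python, the ReAct loop looks like this: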

# The cleanest version of the agent loop
def run_conversation(user_message, messages, tools, max_iter=30):
    messages.append({"role": "user", "content": user_message})

    for _ in range(max_iter):
        response = llm.chat(messages, tools=tools)
        messages.append(response.message)

        if response.message.tool_calls:
            for call in response.message.tool_calls:
                result = execute_tool(call.name, call.args)
                messages.append({"role": "tool",
                                 "tool_call_id": call.id,
                                 "content": result})
        else:
            return response.message.content

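These 13 lines run. They are the minimal executable implementation of a ReAct loop. Put them into production, though, and you immediately hit the problems below. We add code back one problem group at a time.

Reading tip: open hermes-agent/agent/conversation_loop.py. I mark the code range for each problem so you can switch to your IDE and compare at any time. The goal of this chapter is to read the code out loud, not to write it again.

3.2 Problem 1: the loop must not run forever

The 13 lines above have a max_iter cap, which looks fine. But there are a few pitfalls:

If the last iteration produces tool_calls, the tool results are appended and the loop exits. The LLM never gets a chance to see the results and answer the user, so the user gets silence.
max_iter counts iterations, but one turn may need 5 tool calls and another 50. A fixed number is not flexible enough.
What happens when several concurrent turns share one budget pool?

Hermes's solution: IterationBudget + Grace Call

Here is the actual Hermes code (line numbers are approximate):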

agent/conversation_loop.py:644-669
# The real loop condition
while (api_call_count < agent.max_iterations
        and agent.iteration_budget.remaining > 0) \
        or agent._budget_grace_call:

    agent._checkpoint_mgr.new_turn()

    if agent._interrupt_requested:
        interrupted = True
        _turn_exit_reason = "interrupted_by_user"
        if not agent.quiet_mode:
            agent._safe_print("\n⚡ Breaking out of tool loop due to interrupt...")
        break

    api_call_count += 1
    agent._api_call_count = api_call_count
    agent._touch_activity(f"starting API call #{api_call_count}")

    # Grace call: the budget is exhausted but one extra call is allowed. Consume the flag.
    if agent._budget_grace_call:
        agent._budget_grace_call = False
    elif not agent.iteration_budget.consume():
        _turn_exit_reason = "budget_exhausted"
        if not agent.quiet_mode:
            agent._safe_print(
                f"\n⚠️ Iteration budget exhausted "
                f"({agent.iteration_budget.used}/"
                f"{agent.iteration_budget.max_total} iterations used)")
        break

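Listing 3.1 Budget and interrupt checks at the loop entry

The IterationBudget abstraction

IterationBudget is a shared object: iterations spent by subagents count against the same budget. It has three attributes, max_total, used, and remaining, plus a consume() method that takes 1 from the remaining budget and returns True, or returns False if the budget is already 0.

Why not just use api_call_count < max_iterations? Because Hermes has a delegate_task tool: a parent Agent can dispatch child Agents, and each child spends iterations in its own loop. If every Agent counted independently, a parent that dispatched 10 children could use 10 times the budget, which is out of control. A shared budget pool keeps the total for the whole tree bounded.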
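To make the shared-pool idea concrete, here is a minimal sketch of an object with the three attributes and the consume() method described above. It is not Hermes's actual class: the lock is an assumption (it only matters if parent and children run on different threads), and the way parent and child are wired to the same object is illustrative.

```python
import threading


class IterationBudget:
    """Shared iteration counter (illustrative sketch, not Hermes's class)."""

    def __init__(self, max_total: int) -> None:
        self.max_total = max_total
        self.used = 0
        # Assumption: parent and subagents may call consume() from different threads.
        self._lock = threading.Lock()

    @property
    def remaining(self) -> int:
        return max(self.max_total - self.used, 0)

    def consume(self) -> bool:
        """Take one iteration; return False if the pool is already empty."""
        with self._lock:
            if self.used >= self.max_total:
                return False
            self.used += 1
            return True


# Parent and child hold the same object, so the whole tree shares one pool.
budget = IterationBudget(max_total=3)
parent_budget = child_budget = budget
results = [
    parent_budget.consume(),
    child_budget.consume(),
    child_budget.consume(),
    parent_budget.consume(),  # the pool is empty by now
]
```

Because both names point at one object, the child's two calls leave the parent with nothing for its second call, which is exactly the tree-wide bound the text describes.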

Grace Call in detail

The key lines:

if agent._budget_grace_call:
    agent._budget_grace_call = False      # spend it on this call
elif not agent.iteration_budget.consume():
    # Budget exhausted for the first time: exit
    break

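But wait: where is _budget_grace_call = True set? It is not visible here. The answer is outside the loop, in the post-exit handling that runs when the budget is exhausted. A simplified version:

# Outside the loop (not in the conversation_loop body)
if _turn_exit_reason == "budget_exhausted":
    # Check whether the history ends with assistant(with tool_calls) → tool.
    # If so, give the model one chance to wrap up.
    if messages[-1]["role"] == "tool":
        agent._budget_grace_call = True
        # re-enter the loop ↑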

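The effect: the first time the budget cap is hit, the loop breaks immediately. If the history ends with a tool message and the model has not yet spoken, the grace flag is set and the while loop is re-entered. Because the grace flag is True, the loop condition holds and one more API call is allowed. The flag is consumed by that call, so the next time the loop hits the budget it really exits.

Why not simply max_iterations += 1? The difference between a grace call and "one more iteration" is semantic: a grace call fires only when the model was cut off before it could speak. If the model has already returned text (the last message is assistant text), no grace call is needed. This avoids wasting tokens for nothing.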
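The interplay between the budget check and the grace flag is easier to see in a toy loop. The sketch below is a deliberately reduced model, not Hermes code: `llm` is a hypothetical callable that returns None when it wants a tool and a string when it answers, and the budget is a plain counter.

```python
def run_turn(llm, budget_max):
    """Toy agent loop with a one-time grace call (illustrative only)."""
    messages = [{"role": "user", "content": "hi"}]
    used, grace_call, grace_granted = 0, False, False

    while True:
        while used < budget_max or grace_call:
            if grace_call:
                grace_call = False      # the grace call does not consume budget
            else:
                used += 1
            reply = llm(messages)
            if reply is None:           # the model asked for a tool
                messages.append({"role": "assistant", "tool_calls": ["lookup"]})
                messages.append({"role": "tool", "content": "42"})
            else:
                messages.append({"role": "assistant", "content": reply})
                return messages, "answered"

        # Post-loop handling: the budget is exhausted.
        if messages[-1]["role"] == "tool" and not grace_granted:
            grace_call = grace_granted = True   # re-enter the loop exactly once
            continue
        return messages, "budget_exhausted"


def answers_after_two_tools(messages):
    """Hypothetical model: needs two tool results before it can answer."""
    tool_results = sum(1 for m in messages if m["role"] == "tool")
    return None if tool_results < 2 else "The answer is 42."


messages, reason = run_turn(answers_after_two_tools, budget_max=2)
```

With a budget of 2, both iterations are spent on tool calls; the grace call is what lets the model turn the last tool result into an answer. A model that never stops calling tools still terminates, because the grace flag is granted only once.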

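3.3 Problem 2: what about interrupting midway

The Agent is running a long task when the user changes their mind and sends a new message. You cannot make the user wait for the current turn to finish before being heard. Hermes's approach is a boolean flag: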

agent/conversation_loop.py:649-654
if agent._interrupt_requested:
    interrupted = True
    _turn_exit_reason = "interrupted_by_user"
    break

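Three details are worth noting:

The flag is checked once at the top of every loop iteration, so the cost is negligible. It does not use signals to interrupt and does not block on a threading.Event. It is the cheapest possible polling.
It is not an abort signal. If a tool is executing (say terminal is running a long command), Hermes does not kill the process. It waits for the current tool to finish, records the result in messages, and then breaks out. That way the message history is never corrupted (no tool_call without a matching tool result).
Who sets the flag? The UI layer. Pressing Ctrl+C in the CLI, receiving a new message in the Gateway, detecting a key event in the TUI: all of them set the flag to True. Thread safety relies on the Python GIL: in CPython, assigning a boolean to an attribute is atomic.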
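The second detail, that an interrupt never leaves a dangling tool_call, can be demonstrated with a small simulation. This is a sketch under simplifying assumptions: `ToyAgent` and its `run()` method are hypothetical, the "tools" are zero-argument callables, and the interrupt is triggered from inside a tool to keep the example deterministic.

```python
class ToyAgent:
    """Minimal stand-in for an agent with a polled interrupt flag."""

    def __init__(self):
        self._interrupt_requested = False   # set by the UI layer in the text above

    def run(self, steps):
        messages = []
        for i, tool in enumerate(steps):
            if self._interrupt_requested:   # checked once per iteration
                return messages, "interrupted_by_user"
            call_id = f"call_{i}"
            messages.append({"role": "assistant", "tool_calls": [call_id]})
            # The tool always runs to completion, so its result is recorded
            # even if the flag flips while it is executing.
            messages.append(
                {"role": "tool", "tool_call_id": call_id, "content": tool()}
            )
        return messages, "completed"


agent = ToyAgent()


def slow_tool():
    agent._interrupt_requested = True   # simulate Ctrl+C arriving mid-tool
    return "done"


messages, reason = agent.run([lambda: "ok", slow_tool, lambda: "never runs"])
```

The third step never starts, yet the history ends with the result of the second tool, so every assistant tool_call still has its matching tool message.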

A deeper interrupt: during streaming

The flag above is only checked at the top of the loop. What if a single API call takes 30 seconds? Hermes uses interruptible streaming:

agent/conversation_loop.py:1141-1145
if stream_callback:
    response = agent._interruptible_streaming_api_call(...)
else:
    response = agent._interruptible_api_call(api_kwargs)

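Internally both methods drive a yield-based iterator and check the interrupt flag every time a chunk arrives. On interrupt they stop in place, and the text received so far is kept as a partial response (Section 3.7 uses it).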
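A minimal sketch of that per-chunk check, assuming the stream is any iterable of text chunks and the flag is exposed through a callable. The names `consume_stream` and `fake_stream` are made up for this illustration and do not correspond to Hermes internals.

```python
from typing import Callable, Iterable, Tuple


def consume_stream(
    chunks: Iterable[str], is_interrupted: Callable[[], bool]
) -> Tuple[str, bool]:
    """Accumulate chunks, stopping at the first chunk boundary after an interrupt."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)             # keep the chunk that already arrived
        if is_interrupted():
            return "".join(parts), True
    return "".join(parts), False


state = {"seen": 0}


def fake_stream():
    for piece in ["Hel", "lo, ", "wor", "ld"]:
        state["seen"] += 1
        yield piece


# Pretend the user hits Ctrl+C while the second chunk is arriving.
text, interrupted = consume_stream(fake_stream(), lambda: state["seen"] >= 2)
```

The partial text up to and including the second chunk survives the interrupt, which is the property the empty-response recovery later relies on.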

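3.4 Problem 3: preprocessing the message history

In the pseudocode, messages is a clean list of dicts. In production it is nothing of the sort. Before every API call, messages goes through heavy preprocessing: assembling, sanitizing, normalizing, and adding cache markers.

Here are the core steps from lines 755–922 of conversation_loop:

agent/conversation_loop.py:755-922 (summary)
# 1. Build api_messages (do not pass agent._session_messages directly)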
api_messages = agent._session_messages.copy()

# 2. Prepend the system prompt (three layers; see Chapter 5)
effective_system = agent._build_system_prompt(
    system_message=system_message, refresh_volatile=True)
api_messages = [{"role": "system",
                  "content": effective_system}] + api_messages

# 3. Prompt cache annotations (Anthropic-specific)
if agent._provider_supports_cache_control():
    api_messages = agent._annotate_cache_breakpoints(api_messages)

# 4. Sanitize surrogate characters (Unicode surrogate pair handling)
api_messages = _sanitize_surrogates(api_messages)

# 5. Size limit on tool results (keep one tool result from blowing up the context)
api_messages = agent._truncate_tool_results(api_messages)

# 6. Normalize whitespace (KV cache consistency; stray spaces break the cache)
api_messages = _normalize_whitespace(api_messages)

Listing 3.2 The preprocessing pipeline that messages pass through before reaching the API
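Step 4 is easy to overlook. A Python str can hold a lone surrogate code point (for example from badly decoded tool output), and such a string cannot be encoded as UTF-8 when the request is serialized. The sketch below shows one plausible way to clean it; the function name and the choice of U+FFFD as the replacement are assumptions, not a copy of Hermes's _sanitize_surrogates.

```python
import re

# Any code point in the surrogate range that appears in a str is a lone surrogate.
_LONE_SURROGATE = re.compile(r"[\ud800-\udfff]")


def sanitize_surrogates(messages):
    """Return a copy of `messages` with lone surrogates replaced by U+FFFD."""
    cleaned = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, str):
            msg = {**msg, "content": _LONE_SURROGATE.sub("\ufffd", content)}
        cleaned.append(msg)
    return cleaned


raw = [{"role": "tool", "content": "ok \ud83d broken"}]
safe = sanitize_surrogates(raw)
encoded = safe[0]["content"].encode("utf-8")   # the raw text would raise here
```

Well-formed characters outside the Basic Multilingual Plane are single code points in Python 3, so they are left untouched.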

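Every step deserves a closer look, but we focus on two counterintuitive ones:

Cache markers (cache_control)

Anthropic's prompt cache is explicit: you add a cache_control marker to a message to tell the API "cache up to here". Hermes places the markers automatically at suitable positions:

End of the system prompt
End of the first non-empty user message
After the tools schema

There are too many details for this chapter; Chapter 5 covers them in full. For now it is enough to know that messages are not sent to the API as-is.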
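As a rough picture of what a marker looks like: Anthropic's Messages API accepts a cache_control field of the form {"type": "ephemeral"} on a content block. The helper below is a hypothetical sketch of attaching that field to the last block of one message; where Hermes places breakpoints and how it assembles the final provider request are outside this example.

```python
def mark_cache_breakpoint(message):
    """Return a copy of `message` whose last content block carries a cache marker."""
    content = message["content"]
    if isinstance(content, str):
        # Plain string content becomes a single text block.
        blocks = [{"type": "text", "text": content}]
    else:
        blocks = [dict(block) for block in content]   # copy, do not mutate the input
    if blocks:
        blocks[-1]["cache_control"] = {"type": "ephemeral"}
    return {**message, "content": blocks}


original = {"role": "system", "content": "You are Hermes."}
marked = mark_cache_breakpoint(original)
```

Returning a copy keeps the session history free of provider-specific annotations, in line with the point that the API-facing messages differ from the stored ones.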

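Tool result truncation

If a tool returns a 200 KB HTML page, putting it into the context as-is blows it up at once. Hermes lets each tool set max_result_size_chars at registration. Anything beyond that is cut off, and a line "[Output truncated; original was N chars]" is appended as a notice.

Pitfall: truncation happens just before messages are passed to the LLM, but agent._session_messages keeps the original results. If you run a session search or save a trajectory, you see the full results, while the LLM sees the truncated version. This "dual view" is a common pattern in Agent engineering.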
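The dual view can be sketched in a few lines. The function below is an illustration, not Hermes's _truncate_tool_results: it takes a single max_chars limit instead of a per-tool max_result_size_chars, and it builds a new list so the stored history is never modified.

```python
def truncate_for_llm(messages, max_chars):
    """Build the LLM-facing view; the input list is left untouched."""
    view = []
    for msg in messages:
        content = msg.get("content")
        too_long = isinstance(content, str) and len(content) > max_chars
        if msg.get("role") == "tool" and too_long:
            notice = f"\n[Output truncated; original was {len(content)} chars]"
            msg = {**msg, "content": content[:max_chars] + notice}
        view.append(msg)
    return view


session_messages = [{"role": "tool", "tool_call_id": "c1", "content": "x" * 500}]
api_messages = truncate_for_llm(session_messages, max_chars=100)
```

After the call, session_messages still holds all 500 characters, while api_messages holds the first 100 plus the notice.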

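3.5 Problem 4: validating tool calls

LLMs are not perfect. Function calling fails in three ways:

Hallucinated tool name: calling a tool that was never registered.
Truncated JSON: the arguments are cut off halfway by max_tokens.
JSON syntax error: the JSON the model generated is invalid.

Each needs its own handling. Hermes uses layered retries in lines 3254–3389 of conversation_loop:

agent/conversation_loop.py:3254-3305 (simplified)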
if response.tool_calls:
    invalid = [tc.function.name for tc in response.tool_calls
               if tc.function.name not in agent.valid_tool_names]

    if invalid:
        agent._invalid_tool_retries += 1
        if agent._invalid_tool_retries >= 3:
            return {"error": "Model keeps generating invalid tool calls"}

        # Key point: send the error back to the model and let it correct itself
        messages.append(response.message)
        for tc in response.tool_calls:
            if tc.function.name not in agent.valid_tool_names:
                messages.append({
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": f"Tool '{tc.function.name}' does not exist. "
                               f"Available: {sorted(agent.valid_tool_names)}",
                })
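        continue   # let the model regenerate based on the error message

Listing 3.3 Handling hallucinated tool names

The spirit of this code deserves highlighting: the error is not thrown at the framework, it is sent back to the model to fix.

Design principle: whatever the LLM can fix itself, let it fix. When it reads "Tool 'web_seach' does not exist. Available: ['web_search', ...]" it will switch to web_search on its own. That is far more graceful than raising an exception to the user and asking them to resend the request. This is the core move for putting the LLM's intelligence to work.

Two kinds of JSON parse failure

agent/conversation_loop.py:3306-3389 (summary)
# Validate the arguments JSON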
invalid_json = []
for tc in response.tool_calls:
    try:
        json.loads(tc.function.arguments)
    except json.JSONDecodeError as e:
        invalid_json.append((tc.function.name, str(e)))

if invalid_json:
    # Key decision: was it truncated, or is it a real syntax error?
    truncated = any(
        not (tc.function.arguments or "").rstrip().endswith(("}", "]"))
        for tc in response.tool_calls
    )

    if truncated:
        # Truncation cannot be retried: a retry would be truncated again
        return {"error": "Response truncated by max_tokens"}

    if agent._invalid_json_retries < 3:
        agent._invalid_json_retries += 1
        # Retry the API call directly without polluting the history
        continue
    else:
        # Still failing after 3 retries: inject the error so the model sees it
        ...

The truncation check (the JSON does not end with } or ]) is elegant: no token counting is needed, looking at the string is enough.
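The same heuristic can be isolated into a small classifier. This is a sketch for illustration; `classify_arguments` is a made-up name, and it checks one arguments string at a time rather than a whole response.

```python
import json


def classify_arguments(raw):
    """Return 'ok', 'truncated', or 'syntax_error' for a tool-call arguments string."""
    text = raw or ""
    try:
        json.loads(text)
        return "ok"
    except json.JSONDecodeError:
        pass
    # Heuristic from the text: complete JSON ends with a closing brace or bracket.
    if not text.rstrip().endswith(("}", "]")):
        return "truncated"
    return "syntax_error"


verdicts = [
    classify_arguments('{"path": "README.md"}'),
    classify_arguments('{"path": "READ'),
    classify_arguments('{"path": README.md}'),
]
```

One limitation follows directly from the rule: a string cut off right after an inner closing brace, such as '{"a": {"b": 1}', ends with } and is therefore reported as a syntax error rather than a truncation.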

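Error	Cause	Handling
Truncated JSON	max_tokens too small	Fail immediately and tell the user to raise max_tokens
JSON syntax error (not truncated)	Model hiccup	Retry the API call without polluting messages
Hallucinated tool name	Schema not sent / model lost focus	Return the error as a tool result so the model corrects itself
Wrong argument type	Schema says `type: integer` but the string `"42"` was passed	`coerce_tool_args()` converts silently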
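The last row mentions silent coercion without showing it. A minimal sketch of what such a step might do, assuming a JSON-Schema-like dict with a "properties" map; this is a hypothetical stand-in and makes no claim about how Hermes's coerce_tool_args() is written.

```python
def coerce_args_sketch(args, schema):
    """Coerce string values to the primitive types named in the schema."""
    converters = {"integer": int, "number": float}
    coerced = dict(args)
    for name, spec in schema.get("properties", {}).items():
        value = coerced.get(name)
        if not isinstance(value, str):
            continue                    # only strings are candidates for coercion
        wanted = spec.get("type")
        try:
            if wanted in converters:
                coerced[name] = converters[wanted](value)
            elif wanted == "boolean" and value.lower() in ("true", "false"):
                coerced[name] = value.lower() == "true"
        except ValueError:
            pass                        # leave it as-is; validation can report it later
    return coerced


schema = {
    "properties": {
        "limit": {"type": "integer"},
        "verbose": {"type": "boolean"},
        "query": {"type": "string"},
    }
}
fixed = coerce_args_sketch({"limit": "42", "verbose": "True", "query": "7"}, schema)
```

Values that cannot be converted are left unchanged, so a genuinely wrong argument still reaches whatever validation comes next instead of being masked.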

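3.6 Problem 5: concurrent tool execution

Modern LLMs often return several tool_calls at once. For example, Claude looking at an issue may call read_file("README.md") and read_file("CONTRIBUTING.md") together. Running them serially wastes time.

The actual tool execution call in Hermes: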

agent/conversation_loop.py:3477
agent._execute_tool_calls(response.message, messages, effective_task_id, api_call_count)

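The method decides internally whether to run concurrently or serially:

Concurrent: reading files, searching, calling APIs. These tools do not modify shared state and can run concurrently.
Serial: writing files, running shell commands, modifying a database. These must run in order, otherwise you get data races.

Each tool declares which group it belongs to at registration (concurrent_safe=True/False). Hermes's default concurrent-safe set includes all read-only and search tools.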
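One way to combine the two modes is to batch adjacent concurrent-safe calls into a thread pool and run everything else alone, in order. The sketch below does that; the registry shape (name mapped to a handler and a flag) and the batching policy are assumptions made for the example, since the text does not say how Hermes interleaves the two groups.

```python
from concurrent.futures import ThreadPoolExecutor


def execute_tool_calls(calls, registry):
    """Run calls in order, batching adjacent concurrent-safe calls into a pool.

    `registry` maps tool name -> (handler, concurrent_safe); hypothetical shape.
    """
    results = [None] * len(calls)
    i = 0
    with ThreadPoolExecutor(max_workers=4) as pool:
        while i < len(calls):
            handler, safe = registry[calls[i]["name"]]
            if not safe:
                results[i] = handler(**calls[i]["args"])    # serial: runs alone
                i += 1
                continue
            # Collect the run of adjacent concurrent-safe calls.
            j, futures = i, []
            while j < len(calls) and registry[calls[j]["name"]][1]:
                safe_handler = registry[calls[j]["name"]][0]
                futures.append(pool.submit(safe_handler, **calls[j]["args"]))
                j += 1
            for k, future in enumerate(futures):
                results[i + k] = future.result()            # keep call order
            i = j
    return results


log = []
registry = {
    "read_file": (lambda path: f"<{path}>", True),
    "write_file": (lambda path, text: log.append((path, text)) or "written", False),
}
calls = [
    {"name": "read_file", "args": {"path": "README.md"}},
    {"name": "read_file", "args": {"path": "CONTRIBUTING.md"}},
    {"name": "write_file", "args": {"path": "out.txt", "text": "hi"}},
]
results = execute_tool_calls(calls, registry)
```

Results are stored by index, so the tool messages can be appended in the same order as the model's tool_calls regardless of which read finished first.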

Async bridging

A tool handler can be sync or async. Hermes bridges the two with _run_async() (model_tools.py:84-173):

def _run_async(coro):
    """Run an async coroutine from a sync context."""
    try:
        loop = asyncio.get_running_loop()
        # Already inside an event loop: run on a temporary thread
        return _run_in_thread(coro)
    except RuntimeError:
        # No running loop: run on the persistent loop
        return _persistent_loop().run_until_complete(coro)

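The bridge handles three environments: plain CLI (no loop), Gateway (its own asyncio loop), and calls from parallel workers (one loop per thread). The detail looks peripheral, but "an async client works in a sync environment" is what lets Hermes support both the asyncio Gateway and the sync CLI.

3.7 Problem 6: recovering from empty responses

Occasionally an LLM returns a completely empty response, with neither tool_calls nor content. Possible causes: max_tokens too small, a provider bug, or the stream being cut off by the server.

Treating an empty response as a normal exit condition is wrong: the user sees silence. Hermes uses a three-layer fallback:

agent/conversation_loop.py:3571-3699 (summary)
# The model returned nothing: try to recover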
final_response = response.message.content or ""

# Fallback 1: use the text already streamed to the user
_partial = getattr(agent, "_current_streamed_assistant_text", "")
if agent._has_content_after_think_block(_partial):
    final_response = agent._strip_think_blocks(_partial).strip()
    agent._emit_status("↻ Stream interrupted — using delivered content")
    break

# Fallback 2: the previous LLM turn produced text (alongside tool_calls)
fallback = getattr(agent, '_last_content_with_tools', None)
if fallback and getattr(agent, '_last_content_tools_all_housekeeping', False):
    final_response = agent._strip_think_blocks(fallback).strip()
    agent._emit_status("↻ Empty response after tools — using earlier content")
    break

# Fallback 3: inject synthetic messages so the model continues
if _prior_was_tool and not agent._post_tool_empty_retried:
    messages.append({"role": "assistant", "content": "(empty)"})
    messages.append({
        "role": "user",
        "content": "You just executed tool calls but returned an empty "
                   "response. Please process the tool results above and "
                   "continue with the task.",
    })
    agent._post_tool_empty_retried = True
    continue

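Listing 3.4 Three-layer recovery from empty responses

The design spirit: never fail silently. Even if it takes an injected synthetic user message, keep the conversation going so the user sees meaningful output.

3.8 Problem 7: Context Compression

After 50 tool calls, messages may hold around 200K tokens, which runs into the context window limit. Hermes had already implemented the Compaction pattern that Anthropic described in 2025:

agent/conversation_loop.py:3536-3551 (summary)
# After each round of tool calls, check whether compression is needed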
_compressor = agent.context_compressor

# Prefer the real prompt token count; fall back to a rough estimate only when none is available
if _compressor.last_prompt_tokens > 0:
    _real_tokens = _compressor.last_prompt_tokens
else:
    _real_tokens = estimate_request_tokens_rough(messages, tools=agent.tools)

if agent.compression_enabled and _compressor.should_compress(_real_tokens):
    agent._safe_print("  ⟳ compacting context…")
    messages, active_system_prompt = agent._compress_context(
        messages, system_message,
        approx_tokens=_compressor.last_prompt_tokens,
        task_id=effective_task_id,
    )

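Listing 3.5 Compression trigger

Core mechanics:

Use the real token count, not an estimate and not the number of messages. Why? Message sizes vary enormously, so counting messages misjudges. And with reasoning models, reasoning content does not enter the context window yet significantly inflates completion_tokens, so the prompt token count reported by the API is the reliable signal. The rough estimate in Listing 3.5 is only a fallback for when no real count is available yet.
Only when needed: the default threshold is 80% of the context window. Compressing too early wastes tokens; compressing too late hits the limit and errors out.
Compression = LLM summary + replacement: an auxiliary LLM summarizes the first N messages, the summary replaces the front part of the messages list, and the most recent few messages are kept.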
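The three mechanics fit in a short sketch. Everything here is illustrative: `summarize` stands in for the auxiliary LLM call, keep_recent=4 is an arbitrary choice, and the guard against starting the kept tail with an orphaned tool result is an assumption about what a real implementation has to handle rather than a description of Hermes's _compress_context.

```python
def should_compress(prompt_tokens, context_window, threshold=0.8):
    """Trigger once the prompt fills `threshold` of the context window."""
    return prompt_tokens >= context_window * threshold


def compress(messages, summarize, keep_recent=4):
    """Replace everything but the recent tail with a single summary message."""
    split = len(messages) - keep_recent
    # Do not let the kept tail start with a tool result whose call was summarized.
    while split > 0 and messages[split]["role"] == "tool":
        split -= 1
    if split <= 0:
        return messages                 # nothing old enough to summarize
    head, tail = messages[:split], messages[split:]
    summary = {
        "role": "user",
        "content": "[Summary of earlier conversation]\n" + summarize(head),
    }
    return [summary] + tail


history = [{"role": "user", "content": f"msg {i}"} for i in range(10)]
compacted = compress(history, summarize=lambda msgs: f"{len(msgs)} messages elided")
```

With a 200,000-token window and the default threshold, should_compress fires at 160,000 prompt tokens; the ten-message history above shrinks to one summary plus the last four messages.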

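Frontier comparison

Anthropic Context Engineering · 2025

Anthropic Engineering Blog

In 2025 Anthropic formally proposed "Compaction" as one of its three context engineering techniques. Hermes has had this in its code for over a year, yet the idea matches exactly: "taking a conversation nearing the context window limit, summarizing its contents, and reinitiating a new context window with the summary."

3.9 Problem 8: streaming callbacks

Users do not want to wait 30 seconds and then see a long answer pasted in all at once. Streaming output is a basic requirement: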

agent/conversation_loop.py:3099-3111
def _fire_stream_delta(self, text: str) -> None:
    """Fire all registered stream delta callbacks (display + TTS)."""
    if getattr(self, "_stream_needs_break", False) and text and text.strip():
        self._stream_needs_break = False
        # Add a line break between tool output and text so they do not run together
        text = "\n\n" + text

    if isinstance(text, str):
        callbacks = [cb for cb in (self.stream_delta_callback,
                                self._stream_callback) if cb is not None]
        for cb in callbacks:
            try:
                cb(text)
            except Exception:
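                pass   # never let a callback error kill the loop

A few details:

Multiple callbacks: one for the UI, one for TTS, one for logging, each unaware of the others.
try/except is always there: one callback raising must not affect the main loop.
_stream_needs_break: the first text token after a tool finishes automatically gets a line break, so that tool_result foo and the model saying "bar" do not stick together as tool_result foobar.

3.10 The whole map

Put all of the problems above together and you get the real flow of run_conversation: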

flowchart TD
  U(["user_message"]) --> Pre["pre-loop setup
prefetch memory, build messages"]
  Pre --> Top(["loop top"])

  Top --> CInt{"interrupt?"}
  CInt -->|yes| BrkInt["break"]:::warn
  CInt -->|no| CBud{"budget ok?"}

  CBud -->|no| CGr{"grace?"}
  CGr -->|no| BrkBud["break"]:::warn
  CGr -->|yes| Prep
  CBud -->|yes| Prep["pre-call:
sanitize · annotate cache"]

  Prep --> Llm["LLM call
streaming, interrupt-safe"]
  Llm --> CTc{"tool_calls?"}

  CTc -->|no| CEmpty{"empty
response?"}
  CEmpty -->|no| Fin["final answer"]:::good
  CEmpty -->|yes| Fb["3-layer fallback
nudge & continue"]
  Fb --> Top

  CTc -->|yes| Val["validate"]
  Val --> CVal{"invalid?"}
  CVal -->|yes| Retry["retry up to 3x"] --> Top
  CVal -->|no| Exec["execute
concurrent / serial"]
  Exec --> Append["append tool results"]
  Append --> CComp{"compression?"}
  CComp -->|yes| Compress["compress"] --> Top
  CComp -->|no| Top

  Fin --> Post["post-loop:
sync memory · save trajectory"]
  BrkInt --> Post
  BrkBud --> Post
  Post --> Done(["return"])

  classDef warn fill:#faf0e3,stroke:#b85c00,color:#b85c00
  classDef good fill:#ecf3eb,stroke:#2f5d3a,color:#2f5d3a

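This is the real shape of those 3,589 lines.

3.11 Comparison: other mainstream agent loops

Set Hermes's 3,589 lines aside and look at how a few others write the "loop". Each makes different choices, driven by different priorities.

Anthropic "Agentic Loop" (official framing, 2024–2025)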

Skeleton pseudocode for the loop that Anthropic describes in Building Effective Agents:

env_state = get_initial_environment()
while True:
    response = llm_call(env_state)
    if response.is_terminal():
        return
    env_state = apply_action(env_state, response.action)

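Compare with Hermes's 13-line essence: the shape is almost identical, except that Anthropic abstracts "tool calls" into the more general "action on the environment". This is the minimal mental model shared inside Anthropic by Claude Code and Claude Desktop computer use.

The OpenAI Agents SDK loop

The OpenAI Agents SDK wraps the loop in Runner.run(). The core structure:

import asyncio

# The package is installed as openai-agents and imported as agents.
from agents import Agent, Runner, function_tool

@function_tool
def web_search(query: str) -> str:
    ...

agent = Agent(
    name="researcher",
    instructions="...",
    tools=[web_search],
)

async def main() -> None:
    # Runner.run is a coroutine, so it has to be awaited inside an async function.
    result = await Runner.run(agent, "Find Python 3.13 release date")
    print(result.final_output)

asyncio.run(main())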

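Internally it resembles Hermes: a while loop + tool calls + tool_choice control. But the OpenAI Agents SDK hides budget / interrupt / retry inside the Runner, so developers never see that layer of complexity. Upside: quick to pick up. Downside: hard to debug, because when something goes wrong you cannot tell which layer inside the Runner is responsible.

The Claude Code loop

The loop design of Claude Code (Anthropic's own CLI tool) is less public, but a few key traits stand out:

No max_iterations cap, only a token budget. The Agent decides for itself when to stop.
Sub-agents default to a stateful Sonnet (the cheaper model), while the parent uses Opus.
A failed tool call is retried automatically 3 times, and each time the model sees the full error.
"Plan mode" isolation: during planning the model may not use write tools, only read tools.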

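Hermes follows the first two (although, as Listing 3.1 shows, its loop condition still keeps a max_iterations cap next to the shared budget), does something similar for the third, and does not implement the fourth.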

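The LangGraph loop

LangGraph's "loop" is really graph traversal: you define the whole agent behavior as a StateGraph, and the runtime walks it by nodes and conditional edges. A node may be an LLM call, a tool, or a decision. The mental model is completely different from Hermes's while loop.

Good reasons to choose LangGraph:

You need to persist state in the middle of a workflow and resume from checkpoints.
The workflow is complex enough that only a diagram explains it.
You need LangSmith-style trace observability.

Good reasons to choose a direct while loop like Hermes's:

The workflow is "loop indefinitely and let the LLM decide", which a state machine cannot capture.
You do not want to take on the abstraction debt of a third-party framework.
You need to step through every step in a Python debugger.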

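There is no standard answer. A common view as of 2026, though, is that the most stable combination is direct code for the core agentic loop (as Hermes does) plus LangGraph on the outside to orchestrate multiple agents. The OpenAI Agents SDK suits users who are committed to the OpenAI ecosystem.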

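3.12 Code deep dive: per-chunk interrupt and diagnostic accounting

In Section 3.3 we said the interrupt is a flag checked once at the top of every loop iteration. In fact the streaming LLM call also checks it per chunk, so Ctrl+C takes effect at the next chunk boundary instead of waiting for the whole stream to finish. Here is the implementation: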

agent/chat_completion_helpers.py:1472-1495
for chunk in stream:
    last_chunk_time["t"] = time.time()
    agent._touch_activity("receiving stream response")

    # Update the per-attempt diagnostic counters. Best-effort: failures are
    # swallowed so diagnostic accounting never interrupts the streaming hot path.
    try:
        _diag["chunks"] = int(_diag.get("chunks", 0)) + 1
        if _diag.get("first_chunk_at") is None:
            _diag["first_chunk_at"] = last_chunk_time["t"]
        # Approximate wire bytes with len(repr(chunk)): the SDK does not expose
        # exact bytes, but len(repr) is a stable proxy that also works across stub providers.
        try:
            _diag["bytes"] = int(_diag.get("bytes", 0)) + len(repr(chunk))
        except Exception:
            pass
    except Exception:
        pass

    if agent._interrupt_requested:
        break

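Three things worth learning:

Two layers of try/except around diagnostics: the outer try guards against errors in the counter dict operations, the inner try guards against repr(chunk) raising (the __repr__ of some chunk types may raise). This is code-level defense for "diagnostics never block the main path".
len(repr(chunk)) as a byte-count proxy: the OpenAI SDK does not expose the raw byte size. repr is a stable proxy that behaves consistently across providers, which is enough to show operators a streaming rate (KB/s).
The interrupt check comes after the chunk is accumulated: the current chunk has already been taken in, so setting the flag does not lose it. The loop then exits before any further chunk is processed. The message accumulator stays consistent, and a user who presses Ctrl+C gets "the part of the reply rendered so far".

3.12.1 When the grace_call flag is actually set

agent/conversation_loop.py (outside the loop, post-turn handling)
# First budget exhaustion → do not give up yet; check for a "should have spoken but has not" situation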
if _turn_exit_reason == "budget_exhausted":
    last_msg = messages[-1] if messages else {}
    # A trailing tool message = the LLM called tools but never got to respond to the results
    if last_msg.get("role") == "tool":
        agent._budget_grace_call = True
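        # The while condition includes ` or agent._budget_grace_call`,
        # so this flag lets the loop run once more. On re-entry the grace call is consumed.

In Section 3.2 we said the grace call gives the LLM a chance to wrap up. The precise trigger is "the last message is a tool result", which pins down the case where the LLM was cut off before it could see the results. If the last message is assistant text (it has already spoken), no grace call fires and no tokens are wasted.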

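3.13 Takeaways from this chapter

The essence of an agent loop is 13 lines, but the production implementation is 3,589. The whole gap is edge-case handling.
Budget + Grace Call: a hard cap plus one grace call, so the model is never cut off right after its tool calls without a chance to answer the user.
Interrupt is a flag, not an abort: this keeps messages intact.
Message preprocessing pipeline: cache markers, surrogate sanitizing, tool result truncation, whitespace normalization.
Tool call validation: hallucinated names, truncated JSON, JSON syntax errors, each handled differently.
Sending errors back to the model is the core move: if the LLM can correct it, let it.
Empty responses have a three-layer fallback: partial stream, prior content, nudge. Never fail silently.
The compression trigger uses real prompt tokens, not an estimate.
Streaming callbacks are wrapped in try/except, so a failing callback does not kill the main loop.

Rule of thumb: burn the 8 problems of this chapter into your memory. Every time you write agent loop code, ask yourself how many of the 8 you have handled. Each one you handle brings your agent a step closer to usable.

End-of-chapter exercises

Easy: Type out the 13-line essence from Section 3.1. Then use your favorite LLM API to run a call to the simplest possible tool (for example get_current_time).
Easy: Why does Hermes introduce a separate _budget_grace_call flag instead of simply doing max_iterations += 1? Explain in about 50 words.
Medium: Open agent/conversation_loop.py and find every occurrence of _interrupt_requested in the file (about 10). Draw its lifecycle: where it is set to True, where it is checked, where it is cleared.
Medium: Implement a simplified "hallucinated tool name handler". Given a list of tool_calls (including one nonexistent tool name), produce the correct messages.append calls so the LLM can correct itself on the next turn.
Hard: Hermes uses _compressor.last_prompt_tokens as the compression trigger, but that value comes from the previous API call. If the previous call was short and the current one balloons, the compression window can be missed. Design a more robust trigger that does not depend on feedback from a completed round trip.
Hard: Study how conversation_loop handles "reasoning content" (around lines 2283–2334). Write a 200-word summary: how do the reasoning formats of different providers (OpenAI, Anthropic, DeepSeek, Qwen) differ, and how does Hermes unify them?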
