Prologue

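Lately, much of the conversation in our club's group chat has been about Coding Agent technology.

This surprised me. As I remembered it, "Agent" is a Reinforcement Learning term. The subject of RL is an agent: the agent interacts with an environment, receives rewards and penalties, optimizes its own parameters, and eventually reaches the best performance it can.

But during the discussion I noticed that the term has shifted in a subtle way.

This post records why I think the terminology shifted, along with my views on AI Agents.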

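Terminology shift: what is an AI Agent?

An AI Agent is an agent built on a large model that can plan autonomously, call tools, and complete complex tasks.

The term AI Agent is best read as two separate words: AI and Agent.

In today's usage, AI is basically synonymous with the large language models (LLMs) of natural language processing (NLP). Around the time ChatGPT first appeared, many people started calling this kind of technology AI. In reality, a large model is this and only this: something that takes in text of arbitrary length and outputs text of arbitrary length. Nothing more.

Agent, as mentioned in the prologue, is a term from reinforcement learning. For some of the theory, see my post 【AI】强化学习常用算法小记（上） (Notes on Common Reinforcement Learning Algorithms, Part 1). Unlike other learning methods, reinforcement learning focuses on the interaction between the model and its environment, and uses the rewards and penalties obtained from that interaction as the learning signal. In the term AI Agent, the reward part has been stripped away. The emphasis is on the model itself being able to interact with the environment: it can use external tools, and it can take a vague request and produce a detailed solution.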
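
To make the original RL meaning of "agent" concrete, here is a minimal sketch of the interact, get rewarded, update loop. Everything in it is an assumption for illustration: the two-action environment with fixed rewards, the epsilon-greedy rule, and the names `environment_step` and `train` are mine and are not taken from any particular library.

```python
import random

# Hypothetical environment: two actions, each with a fixed reward.
REWARDS = [0.2, 1.0]


def environment_step(action: int) -> float:
    """The environment answers an action with a reward."""
    return REWARDS[action]


def train(steps: int = 100, epsilon: float = 0.1, seed: int = 0) -> list:
    rng = random.Random(seed)
    estimates = [2.0, 2.0]  # optimistic start, so every action gets tried
    counts = [0, 0]
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick a random action.
            action = rng.randrange(len(estimates))
        else:
            # Exploit: pick the action currently believed to be best.
            action = max(range(len(estimates)), key=estimates.__getitem__)
        reward = environment_step(action)
        counts[action] += 1
        # Incremental average: the "parameters" the agent optimizes.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates


if __name__ == "__main__":
    print(train())
```

The reward is what drives the update here. An AI Agent in the newer sense keeps the "act on the environment" half of this loop and drops the reward-driven update.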

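How it works: what is inside an AI Agent?

So how does this thing actually work? An AI Agent needs two parts, one local and one in the cloud. Locally you deploy a framework such as OpenClaw; on the cloud side you pay for access to a model.

For example, suppose I have an AI Agent running locally, connected to the DeepSeek V3 model hosted on SiliconFlow (硅基流动), and I type the following into the session: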

Create a file named a.txt for me

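It then creates a file named a.txt on my machine.

Here we see that an AI Agent differs from a bare AI: its output is not just "text of arbitrary length". It actually performs operations on the local machine. How does it do that? Taking OpenClaw as the example again, the session history shows: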

{"type":"thinking","thinking":"The user wants me to create a file named \"a.txt\". This is a straightforward request. I should use the write tool to create this file.\n\nSince they didn't specify any content, I'll create an empty file or add a simple placeholder. Let me just create it with empty content or a minimal placeholder.\n\nI'll create the file in the workspace directory: XXX\n\nAccording to instructions, default behavior is to just do routine tasks without narrating. This is a simple file creation, so I should just do it.","thinkingSignature":"reasoning_content"},
...
{"type":"thinking","thinking":"PowerShell doesn't use echo. > for creating files. Let me try the PowerShell way.","thinkingSignature":"reasoning_content"},{"type":"toolCall","id":"","name":"exec","arguments":{"command":"Out-File -FilePath a.txt"}}
...
{"type":"thinking","thinking":"Good, that should have created an empty file. Let me verify it exists.","thinkingSignature":"reasoning_content"}

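This gives us the full picture. The workflow is: request input → cloud model processing → content and instructions returned → instructions executed → result checked for correctness. The AI is still the same AI, the one that takes in text of arbitrary length and outputs text of arbitrary length. The only difference is that OpenClaw post-processes the output, extracts the tool call, and runs it in the terminal.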
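
The workflow above can be sketched in a few lines. This is not OpenClaw's implementation: the message format, the `TOOLS` registry, and the stub model that stands in for the cloud LLM are all assumptions made up for this illustration.

```python
import json
import os
import tempfile


def write_file(path: str, content: str = "") -> str:
    """Tool: create (or overwrite) a file and describe the result as text."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return f"created {path}"


# Hypothetical tool registry: tool name -> Python callable.
TOOLS = {"write": write_file}


def make_stub_model(target_path: str):
    """Build a fake 'cloud model' (text in, text out) for this demo."""
    def stub_model(messages: list) -> str:
        # Once a tool result is in the history, finish with plain text.
        if any(m["role"] == "tool" for m in messages):
            return json.dumps({"type": "text", "text": "Done."})
        # Otherwise ask the framework to call the write tool.
        return json.dumps({
            "type": "toolCall",
            "name": "write",
            "arguments": {"path": target_path, "content": ""},
        })
    return stub_model


def run_agent(request: str, model, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": request}]
    for _ in range(max_turns):
        reply = json.loads(model(messages))  # the model only ever returns text
        if reply["type"] == "toolCall":
            # The framework, not the model, executes the tool.
            result = TOOLS[reply["name"]](**reply["arguments"])
            messages.append({"role": "tool", "content": result})
            continue
        return reply["text"]
    return "stopped: too many turns"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workspace:
        target = os.path.join(workspace, "a.txt")
        print(run_agent("Create a file named a.txt for me", make_stub_model(target)))
        print(os.path.isfile(target))
```

The point of the sketch is the division of labor: the model emits text that happens to describe a tool call, and the surrounding loop turns that text into an action and feeds the result back.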

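A moment of association: an idea that has been around for a while

Writing this, I suddenly remembered a paper I read when I had just joined the lab to do research: Graph-ToolFormer: To Empower LLMs with Graph Reasoning Ability via Prompt Augmented by ChatGPT. Back then I was working on combining knowledge graphs with large models. There was already an approach that used a knowledge graph to assist reasoning: given a knowledge graph, have the large model output the corresponding SPARQL query, run it against the graph, put the retrieved content into the model's context, and reason from there.

There is not much more to say about this part, so I will leave it here.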
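
As a toy illustration of that retrieve-then-reason idea (not the paper's method, and not real SPARQL), the sketch below uses an in-memory triple list and a simple pattern match. The facts, the `fake_llm_to_pattern` stand-in for the model, and all function names are hypothetical.

```python
# Toy knowledge graph: (subject, predicate, object) triples, made up for the demo.
TRIPLES = [
    ("Alice", "works_at", "LabX"),
    ("LabX", "located_in", "Shanghai"),
    ("Bob", "works_at", "LabY"),
]


def fake_llm_to_pattern(question: str) -> tuple:
    """Stand-in for the LLM step that would emit a graph query."""
    # None plays the role of a query variable.
    return ("Alice", "works_at", None)


def match(pattern: tuple, triples: list) -> list:
    """Return every triple that agrees with the non-None slots of the pattern."""
    return [
        t for t in triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


def build_prompt(question: str, facts: list) -> str:
    """Put the retrieved facts into the context ahead of the question."""
    lines = [f"{s} {p} {o}" for s, p, o in facts]
    return "Known facts:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"


if __name__ == "__main__":
    question = "Where does Alice work?"
    facts = match(fake_llm_to_pattern(question), TRIPLES)
    print(build_prompt(question, facts))
```

Structurally this is the same move as a tool call: the model's text output is interpreted by something outside the model, and the result is fed back in as more text.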

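A little hands-on: build your own AI Agent

This is not technically hard. We will use the Qwen3-0.6B model as the base model and build a small terminal agent.

First, clone the model:

git lfs install
git clone https://huggingface.co/Qwen/Qwen3-0.6B

Then quickly throw together a demo that loads the model: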

from transformers import AutoTokenizer, AutoModelForCausalLM
import argparse
import torch
import os

def get_answer(input_sent: str, tokenizer, model):
    input_dict = [
        {"role": "user", "content": input_sent}
    ]
    text = tokenizer.apply_chat_template(
        input_dict,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True
    )
    inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=4096,
    )
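    # Keep only the newly generated tokens (drop the prompt).
    output_ids = generated_ids[0][len(inputs.input_ids[0]):].tolist()

    # 151668 is the id of the </think> token in the Qwen3 tokenizer;
    # everything after its last occurrence is the final answer.
    try:
        index = len(output_ids) - output_ids[::-1].index(151668)
    except ValueError:
        index = 0  # no thinking block found, keep the whole output

    content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")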
    return content

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_name_or_path", "-m", type=str, default=os.path.join("..", "Qwen3-0.6B"))
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
    model = AutoModelForCausalLM.from_pretrained(
        args.model_name_or_path,
        dtype=torch.bfloat16,
        device_map="auto",
    )

    while True:
        print("Please input your question:")
        input_sent = input()
        print(get_answer(input_sent, tokenizer, model))

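At this point we have, very roughly, simulated the "request input → cloud model processing → content and instructions returned" part of the flow (with a local model standing in for the cloud one):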

Please input your question:
What should I do to build a building?
Building a building involves several key steps and considerations. Here’s a structured approach:

1. **Define the Purpose**: Determine the type of building (e.g., home, office, commercial) and its intended use.
2. **Research and Planning**: Study the location, climate, and zoning regulations to ensure compliance.
3. **Design and Planning**: Create a detailed blueprint, including architectural design, materials, and layout.
4. **Budget and Materials**: Plan your budget and select suitable materials based on cost and durability.
5. **Purchase and Secure Funding**: Secure permits, contracts, and funding (e.g., loans or grants).
6. **Construction**: Begin construction, ensuring quality and adherence to design specifications.
7. **Testing and Completion**: Test the building’s functionality and complete the project.

**Key Tips**:
- Avoid common mistakes like ignoring permits or choosing inferior materials.
- Prioritize safety and sustainability in design.

Let me know if you need guidance for a specific project!

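Next we write the instruction-execution step, which has two parts: extracting the command to run from the output, and running it. To extract commands efficiently, we need an agreement with the LLM, which usually means constraining its output format through the prompt: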

def get_answer(input_sent: str, tokenizer, model):
    input_dict = [
        # The inner quotes must be escaped, otherwise the example JSON in the prompt is cut short.
        {"role": "system", "content": "You are a helpful assistant. You need to generate the result as json format, do not generate any other text. An acceptable example is {\"command\": \"\"}."},
        {"role": "user", "content": input_sent}
    ]
    ...

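Then write a fairly rough extraction function:

import json

def filter_out_json_part_of_answer(raw_answer: str):
    if "{" not in raw_answer or "}" not in raw_answer:
        return "", False

    # Rough heuristic: take the span from the first "{" to the first "}".
    start_index = raw_answer.find("{")
    end_index = raw_answer.find("}") + 1

    try:
        command_dict = json.loads(raw_answer[start_index:end_index])
    except json.JSONDecodeError:
        return "", False

    # Guard against valid JSON that is not the agreed {"command": ...} shape.
    if not isinstance(command_dict, dict) or not command_dict:
        return "", False
    return list(command_dict.values())[0], True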
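
Looking for the first "{" and the first "}" breaks as soon as the command itself contains a brace or the JSON is nested. As one possible alternative, here is a sketch that lets the standard library's `json.JSONDecoder.raw_decode` find where the object ends. The function name `extract_command` and the assumption that the command lives under a `"command"` key are mine.

```python
import json


def extract_command(raw_answer: str, key: str = "command"):
    """Return (command, True) for the first JSON object holding a string under `key`."""
    decoder = json.JSONDecoder()
    start = raw_answer.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever text follows it.
            obj, _end = decoder.raw_decode(raw_answer, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict) and isinstance(obj.get(key), str):
            return obj[key], True
        # Not usable: try the next opening brace.
        start = raw_answer.find("{", start + 1)
    return "", False


if __name__ == "__main__":
    print(extract_command('Sure! {"command": "echo {}"} Hope that helps.'))
```

This still trusts the model to put something sensible under the key; it only makes the parsing step less fragile.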

In the main function:

...
    while True:
        print("Please input your question:")
        input_sent = input()
        while True:
            raw_answer = get_answer(input_sent, tokenizer, model)
            answer, is_json = filter_out_json_part_of_answer(raw_answer)
            if is_json:
                print(answer)
                break
            else:
                print("Error occurs!")
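
The inner loop above retries forever if the model never produces valid JSON. A bounded variant might look like the sketch below. The `generate` callable stands in for `get_answer` with the model already bound, and the corrective hint appended to the prompt is an assumption of mine, not something the original code does.

```python
import json


def ask_with_retries(question: str, generate, max_attempts: int = 3):
    """Call generate(prompt) until it returns {"command": ...} or attempts run out."""
    prompt = question
    for attempt in range(1, max_attempts + 1):
        raw = generate(prompt)
        try:
            command = json.loads(raw)["command"]
        except (json.JSONDecodeError, KeyError, TypeError):
            # Tell the model what went wrong before asking again.
            prompt = (
                question
                + '\nYour previous reply was not valid. Reply only with {"command": "..."}.'
            )
            continue
        return command, attempt
    return None, max_attempts


if __name__ == "__main__":
    # Fake model: fails once, then complies.
    replies = iter(["You could use touch.", '{"command": "touch filename.txt"}'])
    print(ask_with_retries("Create a text file in Bash", lambda prompt: next(replies)))
```

Returning `None` after the last attempt lets the caller decide what to do instead of hanging the session.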

A quick test:

Please input your question:
Create a text file in Bash 
touch filename.txt

Then we simply execute the command. In the main function:

...
    while True:
        print("Please input your question:")
        input_sent = input()
        while True:
            raw_answer = get_answer(input_sent, tokenizer, model)
            answer, is_json = filter_out_json_part_of_answer(raw_answer)
            if is_json:
                print(answer)
                break
            else:
                print("Error occurs!")
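        # os.system does not raise when the command fails; it returns a status.
        status = os.system(answer)
        if status != 0:
            print(f"Command failed with status {status}")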

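And that's it.

The idea behind an AI Agent is actually very simple: a bridge is placed between the text layer that the large model outputs and the actual tool layer, so that the large model can make use of tools.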
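
Passing model output straight to a shell is fine for a toy, but it will run anything the model asks for. As a hedged sketch of a more careful executor, the version below only runs programs on an allowlist and never invokes a shell. The allowlist contents and the `run_command` name are assumptions, and the demo assumes a POSIX system where `echo` exists as a program.

```python
import shlex
import subprocess

# Hypothetical allowlist of programs the agent may run.
ALLOWED_PROGRAMS = {"echo", "touch", "ls"}


def run_command(command: str, timeout: float = 10.0):
    """Run an allowlisted command without a shell; return (ok, output_or_reason)."""
    try:
        argv = shlex.split(command)
    except ValueError as exc:  # e.g. unbalanced quotes
        return False, f"could not parse command: {exc}"
    if not argv or argv[0] not in ALLOWED_PROGRAMS:
        return False, f"refused to run: {command!r}"
    try:
        completed = subprocess.run(
            argv, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, f"failed to run: {exc}"
    if completed.returncode != 0:
        return False, completed.stderr.strip()
    return True, completed.stdout.strip()


if __name__ == "__main__":
    print(run_command("echo hello"))
    print(run_command("rm -rf important_dir"))
```

Because no shell is involved, pipes and redirections in the command are passed as literal arguments instead of being interpreted. That limits what the agent can do, which is the trade-off being made here.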

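Closing thoughts

The pain point of AI Agent technology is that it depends on cloud model providers. Without a provider, these agents are basically unusable. That leads to another problem: today's models still cannot be deployed well on consumer GPUs. Consumer cards typically have only about 4 GB to 8 GB of VRAM, and after everyday use the remaining memory is nowhere near enough to keep a large model deployed 24/7. Using the cloud, on the other hand, brings heavy token costs.

AI Agents shorten "the distance between fools and smart people", and the price is that "the fools have to spend more money". Once AI stops being so hungry for compute, AI Agents will become a truly disruptive technology and change the world for good.

On the technical side, AI Agents also face a problem they share with large models in general: context rot. As the context keeps growing, the model's performance degrades, so giving a model long-term memory is also an important question. Current approaches include vector retrieval and Skills, each with its own trade-offs. Maybe I will write a new RethinkAI installment to discuss them.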
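
To give a rough feel for the vector retrieval idea, here is a toy sketch: instead of keeping every past note in the context, store the notes outside the model and pull back only the most similar one for each query. The bag-of-words "embedding" is a deliberate simplification standing in for a real embedding model, and the notes and function names are invented for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a real system would use a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recall(query: str, memories: list, top_k: int = 1) -> list:
    """Return the top_k stored notes most similar to the query."""
    q = embed(query)
    ranked = sorted(memories, key=lambda m: cosine(q, embed(m)), reverse=True)
    return ranked[:top_k]


# Hypothetical long-term memory notes.
MEMORIES = [
    "The user prefers PowerShell over Bash",
    "The project uses Python 3.11",
    "Deploy target is a Raspberry Pi",
]

if __name__ == "__main__":
    print(recall("which shell does the user prefer", MEMORIES))
```

Only the recalled note would be added to the prompt, which keeps the context short at the cost of sometimes retrieving the wrong thing.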