Thin Soul, Fat Skill: A Token-Efficient Architecture for Production Multi-Agent Systems
The LocalKin Team
Abstract
All production figures in this paper are measured as of 2026-04-27 on the live LocalKin deployment.
Current multi-agent frameworks embed reasoning logic, domain knowledge, and execution procedures directly into LLM prompts, resulting in memory footprints exceeding 200MB per agent and practical ceilings of 8--10 agents on consumer hardware. We present Thin Soul + Fat Skill, an architecture that separates agent identity (a declarative "soul file" of approximately 30--120 lines of YAML and Markdown) from execution logic (deterministic "skill" scripts of arbitrary size). The soul file is consumed by the LLM as a system prompt; skills execute outside the token window entirely. This separation reduces per-agent memory to approximately 16MB---a 12--19× improvement over Python-based frameworks---and enables 139 specialized agents (80 of which are publicly accessible at faith.localkin.ai and heal.localkin.ai) to run concurrently on a single Mac Mini M2 with 16GB RAM and approximately 2.2GB total system memory. We describe the soul file schema, the skill execution protocol, the conductor pattern for hierarchical multi-agent coordination, a runtime forge that generates new skills autonomously, and a self-evolution mechanism enabled by the declarative nature of soul files. In production deployment, the architecture sustains 258 fleet souls (294 total) across 23 domains and 127 declared skills with deterministic tool execution, eliminating hallucination in data-retrieval tasks.
Keywords: multi-agent systems, LLM architecture, token efficiency, agent orchestration, tool use
1. Introduction
The promise of multi-agent LLM systems---where specialized agents collaborate on complex tasks---runs headlong into a resource wall. Frameworks such as AutoGen [Wu et al., 2023], CrewAI [Moura, 2024], and MetaGPT [Hong et al., 2023] define agents as heavyweight Python objects that bundle prompts, chain-of-thought logic, tool definitions, and conversation history into a single runtime entity. Each agent consumes 200--300MB of process memory before it answers a single query.
For a solo developer operating on consumer hardware, this means the "swarm" tops out at roughly 8 agents. The alternative---running fewer, more general agents---sacrifices the specialization that makes multi-agent systems compelling in the first place.
This paper introduces the Thin Soul + Fat Skill architecture, deployed in the LocalKin system, which inverts the conventional design. Instead of embedding logic in the prompt, we separate concerns along a natural boundary:
- Soul (thin): A declarative file (~30--120 lines) that defines who the agent is: its persona, model configuration, permissions, scheduled behaviors, and safety constraints. The soul becomes the LLM's system prompt.
- Skill (fat): A standalone script (Python, Bash, or any language) paired with a SKILL.md manifest that defines what the agent can do. Skills execute deterministically in a subprocess. Their code never enters the token window.
This separation yields three concrete benefits: (1) a 12--19× reduction in per-agent memory relative to the Python-based frameworks in Table 1, (2) deterministic execution for data-retrieval tasks that would otherwise suffer from hallucination, and (3) machine-modifiable agent definitions that enable autonomous self-evolution of the swarm.
2. Problem: The Prompt-Obese Agent
2.1 Anatomy of a Conventional Agent
In frameworks like AutoGen, a typical agent definition includes:
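```python
# AutoGen 0.2-style import
from autogen import AssistantAgent

agent = AssistantAgent(
    name="financial_analyst",
    # The system prompt carries instructions, examples, and tool schemas inline.
    system_message="""You are a senior financial analyst...
    [200+ lines of instructions, few-shot examples,
    tool schemas, output format specifications]""",
    llm_config={"model": "gpt-4", "temperature": 0.2},
    code_execution_config={"work_dir": "output"},
)
```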
The system_message alone can exceed 4,000 tokens. Add tool definitions, conversation history, and framework overhead, and a single agent's context window is consumed by its own configuration before user input arrives.
2.2 The Scaling Wall
We measured the runtime memory of a single idle agent across three popular frameworks:
| Framework | Per-Agent Memory | Max Agents (16GB) | Architecture |
|---|---|---|---|
| AutoGen | ~250 MB | ~8 | Python class + prompt |
| CrewAI | ~200 MB | ~10 | Python class + prompt |
| MetaGPT | ~300 MB | ~6 | Python class + prompt |
| LocalKin | ~16 MB | 139 (live) | Soul file + Go runtime |
Table 1: Memory comparison across multi-agent frameworks. Measurements taken on a Mac Mini M2 with 16GB RAM (2026-04-27 snapshot). LocalKin's per-agent figure is the average resident-set across 139 live agents (16.6 MB measured); per-agent overhead is dominated by per-conversation buffer allocation, not by the agent definition itself (the parsed soul struct is ~2--4 KB). LocalKin agents share a single Go binary runtime (~45 MB) and a global skill registry.
The disparity arises because Python frameworks instantiate a full interpreter, framework object graph, and prompt template per agent. LocalKin's Go runtime loads each soul as a lightweight struct (approximately 2--4KB of parsed YAML) and shares the skill registry across all agents.
2.3 Token Waste
Beyond memory, prompt-embedded logic wastes tokens on every LLM call. A stock-price lookup that could execute in 50ms as a subprocess call instead consumes 500+ tokens of prompt space describing the API, response parsing, and error handling---only for the LLM to approximate what a deterministic script would compute exactly.
3. Soul File Design
3.1 Schema
A soul file (*.soul.md) uses a two-part format: YAML frontmatter for machine-readable configuration, and a Markdown body for the system prompt.
```markdown
---
name: "Hua Tuo"
slug: "hua_tuo"
version: "1.0.0"
brain:
  provider: "claude"
  model: "claude-sonnet-4-6"
  temperature: 0.3
  context_length: 32768
  fallback:
    provider: "ollama"
    model: "qwen3.5:9b"
permissions:
  shell: false
  network: true
  filesystem:
    allow: ["./output"]
    deny: [".env", ".git", "~/.ssh"]
skills:
  enable: ["knowledge_search", "web_search", "web_scrape"]
  output_dir: "./output"
heart:
  enabled: true
  pulse:
    interval: "10s"
    topic: "localkin/swarm/status"
safety:
  pain_feedback: false
---
# Hua Tuo --- The Divine Physician
You are Hua Tuo, courtesy name Yuanhua...
## Expertise
- Wellness and lifestyle guidance
- Herbal formulations for prevention
...
```
Figure 1: Abbreviated soul file for a Traditional Chinese Medicine agent. The YAML frontmatter (42 lines) configures the runtime; the Markdown body (~70 lines) becomes the system prompt.
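The two-part format can be illustrated with a minimal Python sketch that splits a soul file into its frontmatter text and its system-prompt body. The function name `split_soul` is hypothetical, and the sketch deliberately stops short of parsing the YAML; the production runtime is written in Go and its actual parser is not shown in this paper.

```python
from typing import Tuple


def split_soul(text: str) -> Tuple[str, str]:
    """Split a soul file into (frontmatter_yaml, markdown_body)."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("soul file must start with a '---' delimiter")
    # The frontmatter ends at the next line that consists only of '---'.
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            frontmatter = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:]).strip()
            return frontmatter, body
    raise ValueError("unterminated frontmatter: closing '---' not found")


example = """---
name: "Hua Tuo"
brain:
  temperature: 0.3
---
# Hua Tuo
You are Hua Tuo, courtesy name Yuanhua...
"""

# frontmatter goes to the runtime configuration; system_prompt goes to the LLM.
frontmatter, system_prompt = split_soul(example)
```

Only `system_prompt` would ever reach the token window in this sketch; the frontmatter stays on the runtime side.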
3.2 Frontmatter Sections
The YAML frontmatter is divided into five concern areas:
Brain. Model provider, model name, temperature, context length, and an optional fallback chain. When the primary provider (e.g., Claude API) is unavailable, the runtime cascades to the fallback (e.g., a local Ollama model). This enables three-tier resource management: cloud API for complex reasoning, edge model for routine tasks, local model for offline operation.
Permissions. A capability-based security model. Each agent declares whether it can execute shell commands, access the network, and which filesystem paths are allowed or denied. The runtime enforces these permissions at the syscall boundary---an agent with shell: false cannot execute arbitrary commands regardless of what the LLM outputs.
Skills. A whitelist of skill names the agent may invoke. Skills not listed here are invisible to the agent, even if they exist in the skills directory. This prevents a marketing agent from invoking the robot_control skill, for example.
Heart. The agent's autonomous behavior schedule. The pulse defines a heartbeat for liveness monitoring via MQTT. The schedule array defines periodic autonomous tasks---for instance, the TCM Conductor wakes every 8 hours to scrape trending health topics and initiate a physician debate.
Safety. Constraints including shell command blocklists, circuit-breaker thresholds, and pain-feedback flags for embodied agents.
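The skill whitelist described above amounts to intersecting the global registry with the soul's skills.enable list. The following Python sketch shows that filtering step under simple assumptions: the registry is modeled as a name-to-manifest-path dictionary, and both `visible_skills` and the paths are hypothetical.

```python
from typing import Dict, List


def visible_skills(registry: Dict[str, str], enabled: List[str]) -> Dict[str, str]:
    """Return only the registry entries that the soul has whitelisted."""
    # Names in `enabled` that are missing from the registry are silently skipped.
    return {name: registry[name] for name in enabled if name in registry}


# Hypothetical global registry: skill name -> manifest path.
registry = {
    "knowledge_search": "skills/knowledge_search/SKILL.md",
    "web_search": "skills/web_search/SKILL.md",
    "robot_control": "skills/robot_control/SKILL.md",
}
enabled = ["knowledge_search", "web_search", "web_scrape"]

agent_skills = visible_skills(registry, enabled)
```

Here `robot_control` exists in the registry but never appears in `agent_skills`, so the agent has no way to name it in a tool call.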
3.3 Hot-Swap and Machine Modification
Because the soul file is a plain text file on disk, changes take effect without recompilation or redeployment. An operator can edit a soul file to change an agent's personality, model, or permissions, and the runtime picks up the change on the next request cycle.
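One way to pick up edits on the next request is to compare the file's modification time against a cached value. The Python sketch below assumes that mechanism; the paper does not state how the Go runtime detects changes, and `SoulCache` is a hypothetical name.

```python
import os
import tempfile
from typing import Dict, Tuple


class SoulCache:
    """Re-read a soul file only when its modification time changes."""

    def __init__(self) -> None:
        self._cache: Dict[str, Tuple[int, str]] = {}

    def load(self, path: str) -> str:
        mtime = os.stat(path).st_mtime_ns
        cached = self._cache.get(path)
        if cached is not None and cached[0] == mtime:
            return cached[1]  # unchanged on disk: reuse the cached text
        with open(path, encoding="utf-8") as f:
            text = f.read()
        self._cache[path] = (mtime, text)
        return text


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hua_tuo.soul.md")
    with open(path, "w", encoding="utf-8") as f:
        f.write("temperature: 0.3\n")
    cache = SoulCache()
    first = cache.load(path)
    # An operator edits the file; no restart or recompilation is involved.
    with open(path, "w", encoding="utf-8") as f:
        f.write("temperature: 0.5\n")
    # Bump the mtime explicitly so the demo does not depend on timestamp resolution.
    stat = os.stat(path)
    os.utime(path, ns=(stat.st_atime_ns, stat.st_mtime_ns + 1_000_000_000))
    second = cache.load(path)
```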
More significantly, this property enables machine modification. The LocalKin swarm_architect agent can programmatically patch soul files---adjusting temperatures, adding skills, or rewriting system prompt sections---as part of an autonomous self-evolution loop. This is described further in Section 8.
4. Fat Skill Design
4.1 The SKILL.md Manifest
Each skill is defined by a SKILL.md file in the skills/ directory, using the same YAML-frontmatter-plus-Markdown format as soul files:
```markdown
---
name: stock_price
description: "Get real-time stock quotes and metrics"
command: ["python3", "skills/stock_price/quote.py"]
args: ["--action", "{{action}}", "--symbol", "{{symbol}}", "--period", "{{period}}"]
timeout: 30
schema:
  action:
    type: "string"
    enum: ["quote", "metrics", "history"]
    required: true
  symbol:
    type: "string"
    description: "Stock ticker, e.g. NVDA"
    required: true
  period:
    type: "string"
    default: "1mo"
---
# Stock Price
Get real-time stock market data via Yahoo Finance.
```
Figure 2: SKILL.md manifest for the stock_price skill. Template parameters ({{symbol}}, {{action}}) are interpolated at invocation time.
4.2 Execution Protocol
When an agent invokes a skill, the runtime:
1. Parses the tool-call output from the LLM (e.g., stock_price action="quote" symbol="NVDA").
2. Validates parameters against the schema defined in SKILL.md.
3. Interpolates template variables into the command args.
4. Spawns the script as a subprocess with the configured timeout.
5. Captures stdout (expected to be JSON) and returns it to the LLM as a tool result.
The critical insight is that step 4 executes deterministically. The quote.py script makes an HTTP request to Yahoo Finance and returns the exact price. No tokens are consumed parsing API documentation. No hallucination is possible. The LLM's role is reduced to deciding when to call the skill and interpreting the result---tasks well within its competence.
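The protocol above can be sketched in Python as follows. This is an illustration under stated assumptions, not the LocalKin implementation (which is written in Go): the manifest is assumed to be already parsed into a dictionary, `run_skill` is a hypothetical name, and the demo command is a stand-in one-liner that echoes its arguments as JSON rather than the real quote.py.

```python
import json
import subprocess
import sys
from typing import Any, Dict, List


def run_skill(manifest: Dict[str, Any], params: Dict[str, str]) -> Any:
    """Validate params, interpolate args, run the script, parse JSON stdout."""
    schema = manifest["schema"]
    unknown = set(params) - set(schema)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved: Dict[str, str] = {}
    for name, spec in schema.items():
        if name in params:
            value = params[name]
        elif "default" in spec:
            value = spec["default"]
        elif spec.get("required"):
            raise ValueError(f"missing required parameter: {name}")
        else:
            continue
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"{name}={value!r} is not one of {spec['enum']}")
        resolved[name] = value
    # Interpolate {{name}} placeholders into the argument list.
    args: List[str] = []
    for arg in manifest["args"]:
        for name, value in resolved.items():
            arg = arg.replace("{{" + name + "}}", value)
        args.append(arg)
    # Passing a list (no shell) keeps parameter values from being shell-interpreted.
    proc = subprocess.run(
        manifest["command"] + args,
        capture_output=True, text=True,
        timeout=manifest["timeout"], check=True,
    )
    return json.loads(proc.stdout)


# Stand-in manifest: the command echoes its arguments instead of calling an API.
manifest = {
    "command": [sys.executable, "-c",
                "import json, sys; print(json.dumps({'argv': sys.argv[1:]}))"],
    "args": ["--action", "{{action}}", "--symbol", "{{symbol}}",
             "--period", "{{period}}"],
    "timeout": 30,
    "schema": {
        "action": {"type": "string", "enum": ["quote", "metrics", "history"],
                   "required": True},
        "symbol": {"type": "string", "required": True},
        "period": {"type": "string", "default": "1mo"},
    },
}

result = run_skill(manifest, {"action": "quote", "symbol": "NVDA"})
```

In this sketch a schema violation raises before any process is spawned, so a malformed tool call from the LLM never reaches the script.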
4.3 Language Agnosticism
Because skills communicate via stdin/stdout JSON, they can be written in any language. The LocalKin skill registry includes skills in Python (stock quotes, web scraping, NotebookLM integration), Bash (content publishing pipelines, system monitoring), and Go (swarm communication). The runtime is indifferent to implementation language.
4.4 Drop-In Discovery
Adding a new skill requires only placing a SKILL.md file (and its accompanying script) in the skills/ directory. The runtime auto-discovers skills at startup by scanning for SKILL.md manifests. No registration code, no import statements, no framework boilerplate. In production, LocalKin manages 127 declared skills through this mechanism alone (a subset of which is loaded by each agent based on its skills.enable whitelist).
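Drop-in discovery can be approximated by a directory scan. The Python sketch below assumes one subdirectory per skill and uses the directory name as the skill name; the real runtime may instead read the name field from each manifest, and `discover_skills` is a hypothetical name.

```python
import tempfile
from pathlib import Path
from typing import Dict


def discover_skills(skills_dir: Path) -> Dict[str, Path]:
    """Map each skill directory name to its SKILL.md manifest path."""
    return {
        manifest.parent.name: manifest
        for manifest in sorted(skills_dir.glob("*/SKILL.md"))
    }


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Two skills dropped into the directory, plus one folder without a manifest.
    for name in ("stock_price", "web_search"):
        (root / name).mkdir()
        (root / name / "SKILL.md").write_text(
            "---\nname: " + name + "\n---\n", encoding="utf-8")
    (root / "notes").mkdir()
    (root / "notes" / "README.md").write_text("not a skill", encoding="utf-8")
    found = sorted(discover_skills(root))
```

The folder without a SKILL.md is ignored, which is what makes adding or removing a skill a pure filesystem operation.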
4.5 Token Economics
The SKILL.md Markdown body (typically 5--15 lines of usage examples) is injected into the system prompt so the LLM knows how to call the skill. The script itself---which may be 200+ lines of Python---never enters the token window. For the stock_price skill:
| Component | Size | Tokens Consumed |
|---|---|---|
| SKILL.md body (in prompt) | 35 lines | ~120 tokens |
| quote.py (subprocess) | 215 lines | 0 tokens |
| Total token cost | | ~120 tokens |
Table 2: Token accounting for the stock_price skill. The execution script consumes zero tokens because it runs as a subprocess outside the LLM context.
Compare this to a prompt-embedded approach where the equivalent logic, API documentation, and error-handling instructions would consume 800--1,200 tokens per agent per call.
5. Compound Skills and LLM Round-Trip Reduction
5.1 The Round-Trip Problem
A naive tool-use loop requires one LLM call per step: the agent calls a tool, receives the result, reasons about the next step, calls another tool, and so on. For a five-step workflow, this means five LLM round trips, each consuming input/output tokens and adding 1--3 seconds of latency.
5.2 Compound Skill Architecture
A compound skill is a single SKILL.md that wraps a multi-step script. The LLM makes one tool call; the script executes the entire pipeline deterministically and returns a consolidated result.
For example, a content_publish compound skill:
- Fetches trending topics (web scrape)
- Generates content outline (template-based)
- Formats for target platform (Markdown to HTML)
- Posts via API (HTTP request)
- Returns confirmation with URL
What would require 5 LLM round trips becomes 1. In practice, compound skills reduce LLM round trips by 50--87% for multi-step workflows, with the lower bound applying to two-step sequences and the upper bound to eight-step pipelines.
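The 50--87% range follows from a simple ratio: collapsing an n-step workflow into one tool call removes n − 1 of n round trips. The short Python sketch below makes that arithmetic explicit, assuming one LLM round trip per step in the naive loop and exactly one for the compound skill; the function name is hypothetical.

```python
def round_trip_reduction(steps: int) -> float:
    """Fraction of LLM round trips removed when `steps` calls collapse into one."""
    if steps < 1:
        raise ValueError("a workflow needs at least one step")
    return 1 - 1 / steps


two_step = round_trip_reduction(2)     # lower bound quoted in the text
five_step = round_trip_reduction(5)    # the content_publish example
eight_step = round_trip_reduction(8)   # upper bound quoted in the text
```

Under these assumptions the two-step case gives 50%, the five-step content_publish example gives 80%, and the eight-step case gives 87.5%.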
5.3 When Not to Compound
Compound skills are appropriate when the pipeline is deterministic and the intermediate decisions are mechanical. When genuine reasoning is required between steps---such as deciding whether a draft needs revision based on quality---the single-step skill model is preferred, allowing the LLM to exercise judgment at each stage.
6. The Conductor Pattern
6.1 Hierarchical Team Organization
Rather than a flat peer-to-peer topology, LocalKin organizes agents into domain-specific teams led by conductor agents. A conductor is itself an agent (with its own soul file) whose system prompt includes routing logic and whose skills include swarm_comm (inter-agent messaging) and swarm_debate (structured multi-agent deliberation).
+-------------------+
| User / Jacky |
+--------+----------+
|
+--------------+--------------+
| | |
+--------v---+ +-----v------+ +----v--------+
| TCM | | Board | | Marketing |
| Conductor | | Conductor | | Conductor |
+-----+------+ +-----+------+ +------+------+
| | |
+-----+-----+ +----+----+ +-----+-----+
| 11 TCM | | 5 C- | | 8 Channel |
| Physicians| | Suite | | Agents |
+-----------+ +---------+ +-----------+
Figure 3: Conductor pattern. Each conductor manages a domain-specific team. The TCM Conductor routes to 11 historical physician agents; the Board Conductor manages CEO, CFO, CTO, Growth, and Intel agents; the Marketing Conductor coordinates channel-specific strategists.
6.2 TCM Conductor: Routing by Specialty
The TCM Conductor manages 11 agents, each embodying a historical Chinese physician with distinct specialties:
| Physician | Era | Specialty |
|---|---|---|
| Huang Di | ~2600 BCE | Foundational theory, Nei Jing |
| Zhang Zhongjing | 150--219 | Cold damage, Shang Han Lun |
| Hua Tuo | 140--208 | Surgery, Ma Fei San anesthesia |
| Sun Simiao | 581--682 | Prescriptions, ethics |
| Li Shizhen | 1518--1593 | Materia medica, Ben Cao Gang Mu |
| Huangfu Mi | 215--282 | Acupuncture, Zhen Jiu Jia Yi Jing |
| Ye Tianshi | 1667--1746 | Warm disease (Wen Bing) |
| Liu Wansu | 1120--1200 | Cooling school |
| Li Dongyuan | 1180--1251 | Spleen-Stomach school |
| Zhu Danxi | 1281--1358 | Nourishing Yin school |
| Fu Qingzhu | 1607--1684 | Gynecology |
When a health query arrives, the conductor analyzes the topic and routes to the most relevant 4--6 physicians. For a spring allergy question, the conductor might convene Ye Tianshi (warm disease), Liu Wansu (cooling), Li Dongyuan (Spleen-Stomach), and Sun Simiao (general prescriptions) for a structured debate, while excluding Huangfu Mi (acupuncture focus) and Fu Qingzhu (gynecology focus).
6.3 Board Conductor: Executive Decision-Making
The Board Conductor implements a corporate advisory board with five C-suite agents: CEO (strategy), CFO (financial analysis), CTO (technical feasibility), Growth (market expansion), and Intel (competitive intelligence). Scheduled every 8 hours, it scrapes technology news, frames a strategic thesis, and orchestrates a structured debate among the executives, producing bilingual board minutes.
6.4 Conductor Scheduling
Conductors use the heart.schedule field to define autonomous wakeup cycles. The TCM Conductor and Board Conductor each wake every 8 hours to perform their scheduled debates. This is not polling---the Go runtime maintains a scheduler that triggers the agent's prompt at the configured interval, creating an autonomous swarm that operates without human initiation.
7. Three-Tier Resource Cascading
7.1 Local, Edge, Cloud
Each soul file's brain section can define a primary provider and a fallback chain:
```yaml
brain:
  provider: "claude"          # Tier 3: Cloud API
  model: "claude-sonnet-4-6"
  fallback:
    provider: "ollama"        # Tier 1: Local
    model: "qwen3.5:9b"
```
The runtime attempts the primary provider first. On failure (rate limit, network outage, budget exhaustion), it cascades to the fallback. This creates a natural three-tier resource model:
- Tier 1 (Local): Ollama models running on the same machine. No network latency, no API cost, limited capability.
- Tier 2 (Edge): Models running on nearby infrastructure (e.g., a home lab GPU server).
- Tier 3 (Cloud): Commercial APIs (Claude, OpenAI). Highest capability, highest cost.
Routine tasks (status checks, simple formatting) can be handled by local models; complex reasoning (multi-step analysis, creative writing) escalates to cloud. The soul file declares the preference; the runtime handles the fallback.
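The cascade can be pictured as trying providers in declared order until one succeeds. The Python sketch below uses stub callables in place of real API clients; `ProviderError`, `complete_with_fallback`, and the stubs are all hypothetical, and the real runtime's failure detection is not specified here.

```python
from typing import Callable, List, Tuple


class ProviderError(Exception):
    """Raised by a provider on rate limits, outages, or budget exhaustion."""


Provider = Tuple[str, Callable[[str], str]]


def complete_with_fallback(providers: List[Provider], prompt: str) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, completion)."""
    errors: List[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")  # remember why this tier failed
    raise ProviderError("all providers failed: " + "; ".join(errors))


def cloud_stub(prompt: str) -> str:
    # Stand-in for a cloud API call that is currently rate limited.
    raise ProviderError("rate limit")


def local_stub(prompt: str) -> str:
    # Stand-in for a local model call.
    return "local answer: " + prompt


used, answer = complete_with_fallback(
    [("claude", cloud_stub), ("ollama", local_stub)], "hi")
```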
8. The Forge: Runtime Skill Generation
8.1 Self-Creating Tools
The soul_forge skill is a meta-skill: an LLM-powered tool that creates new tools. When the swarm identifies a capability gap---for example, the need to monitor Hacker News trending posts---the swarm_architect agent can invoke soul_forge to:
- Design a soul specification for the new agent.
- Create the .soul.md file with proper frontmatter and system prompt.
- Validate the file against the schema.
- Generate accompanying SKILL.md and script files.
Because soul files and SKILL.md manifests are declarative text, the LLM can generate them reliably. The forge validates generated files against the schema before writing them to disk, and the runtime's auto-discovery mechanism picks them up immediately.
8.2 Self-Evolution Compatibility
The declarative, text-based nature of soul files makes them uniquely amenable to autonomous modification. Unlike Python class hierarchies that require understanding import graphs and inheritance chains, a soul file can be patched with simple text operations:
- Temperature tuning: Change temperature: 0.3 to temperature: 0.5 based on output quality metrics.
- Skill addition: Append a skill name to the enable list.
- Prompt refinement: Edit the Markdown body to add new rules or domain knowledge.
- Model upgrade: Swap model: "claude-sonnet-4-6" for a newer release.
The swarm_architect agent performs these modifications as part of a self-evolution cycle, evaluating agent performance and patching soul files to improve outcomes. This is possible precisely because the soul file is not code---it is a declarative specification that an LLM can read, understand, and modify without the risk of introducing syntax errors in a programming language.
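Two of the text operations listed above, temperature tuning and skill addition, can be sketched with regular expressions in Python. The sketch assumes a single temperature key and an inline enable: [...] list as in Figure 1; the function names are hypothetical, and a production patcher would presumably re-validate the file against the schema afterwards.

```python
import re


def set_temperature(soul_text: str, value: float) -> str:
    """Replace the value of the first 'temperature:' key."""
    patched, count = re.subn(
        r"(?m)^([ \t]*temperature:[ \t]*)\S+",
        lambda m: m.group(1) + str(value),
        soul_text,
        count=1,
    )
    if count == 0:
        raise ValueError("no temperature key found")
    return patched


def enable_skill(soul_text: str, skill: str) -> str:
    """Append a skill to an inline 'enable: [...]' list if it is absent."""
    match = re.search(r"(?m)^([ \t]*enable:[ \t]*\[)(.*?)(\])", soul_text)
    if match is None:
        raise ValueError("no inline enable list found")
    names = [n.strip().strip('"') for n in match.group(2).split(",") if n.strip()]
    if skill in names:
        return soul_text  # already enabled: leave the file untouched
    names.append(skill)
    new_list = ", ".join(f'"{n}"' for n in names)
    return soul_text[:match.start(2)] + new_list + soul_text[match.end(2):]


soul = (
    "brain:\n"
    "  temperature: 0.3\n"
    "skills:\n"
    '  enable: ["knowledge_search", "web_search"]\n'
)
patched = enable_skill(set_temperature(soul, 0.5), "web_scrape")
```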
9. Security Model
9.1 Defense in Depth
The Thin Soul + Fat Skill architecture provides multiple security boundaries:
Capability-Based Permissions. Each soul file declares its permission envelope. An agent with shell: false cannot execute arbitrary commands. Filesystem permissions use an allow/deny list with glob patterns, preventing agents from accessing sensitive paths (.env, .git, ~/.ssh, ~/.gnupg).
Skill Whitelisting. Agents can only invoke skills listed in their skills.enable array. A marketing agent cannot invoke robot_control even if the skill exists in the registry.
Shell Blocklist. For agents with shell access enabled, a blocklist prevents dangerous commands (rm -rf, curl | bash, etc.) from executing.
Filesystem Sandboxing. Skill execution is sandboxed to the configured output_dir. Scripts cannot write outside their designated directory.
Forge Safety Scanning. When the forge generates new skills, the output undergoes static analysis: the generated script is scanned for dangerous patterns (network exfiltration, filesystem traversal, process spawning) before being written to disk.
Circuit Breaker. The heart.max_rounds and heart.max_daily_wakeups fields prevent runaway autonomous agents. An agent that exceeds its round limit or daily wakeup quota is automatically suspended.
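The circuit breaker can be modeled as two counters and a suspension flag. The Python sketch below assumes that rounds are counted per wakeup and that a suspended agent stays suspended until an operator intervenes; neither detail is stated in this paper, and `CircuitBreaker` is a hypothetical class.

```python
from dataclasses import dataclass


@dataclass
class CircuitBreaker:
    max_rounds: int
    max_daily_wakeups: int
    rounds: int = 0
    wakeups_today: int = 0
    suspended: bool = False

    def on_wakeup(self) -> bool:
        """Begin a wakeup; return False if the agent is or becomes suspended."""
        if self.suspended:
            return False
        if self.wakeups_today >= self.max_daily_wakeups:
            self.suspended = True
            return False
        self.wakeups_today += 1
        self.rounds = 0  # assumption: the round budget resets on each wakeup
        return True

    def on_round(self) -> bool:
        """Begin one LLM round; return False once the round budget is spent."""
        if self.suspended:
            return False
        if self.rounds >= self.max_rounds:
            self.suspended = True
            return False
        self.rounds += 1
        return True


breaker = CircuitBreaker(max_rounds=2, max_daily_wakeups=1)
woke = breaker.on_wakeup()
allowed = [breaker.on_round() for _ in range(3)]
```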
9.2 The Principle of Least Privilege
Each agent is configured with the minimum permissions required for its function. The TCM physician agents have shell: false and filesystem access limited to ./output. The infrastructure maintenance agent has broader permissions but is rate-limited by the circuit breaker. This granularity is possible because permissions are declared per soul file, not globally.
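A per-soul filesystem envelope can be checked with a deny-first rule over resolved paths. The Python sketch below is one plausible reading of the allow/deny lists in Figure 1, not the runtime's enforcement code: relative deny entries are matched against every path component, entries beginning with ~ are expanded to the home directory, and the base directory /srv/agent is hypothetical.

```python
import os
from fnmatch import fnmatch
from pathlib import Path
from typing import List


def is_path_allowed(path: str, allow: List[str], deny: List[str], base: Path) -> bool:
    """Deny wins; otherwise the resolved path must sit under an allowed root."""
    target = (base / path).resolve()  # collapses '..' before any check
    for entry in deny:
        denied = Path(os.path.expanduser(entry))
        if denied.is_absolute():
            if target == denied or denied in target.parents:
                return False
        elif any(fnmatch(part, entry) for part in target.parts):
            return False
    for entry in allow:
        root = (base / entry).resolve()
        if target == root or root in target.parents:
            return True
    return False


base = Path("/srv/agent")  # hypothetical agent working directory
allow = ["./output"]
deny = [".env", ".git", "~/.ssh"]

ok = is_path_allowed("output/report.md", allow, deny, base)
traversal = is_path_allowed("output/../.env", allow, deny, base)
escape = is_path_allowed("../../etc/passwd", allow, deny, base)
nested_git = is_path_allowed("output/.git/config", allow, deny, base)
```

Resolving the path before matching is the design choice that matters here: a request such as output/../.env is judged by where it lands, not by how it is spelled.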
10. Evaluation
10.1 Memory Efficiency
We measured total system memory for the LocalKin swarm running 139 active agents on a Mac Mini M2 (16GB RAM) on 2026-04-27:
| Component | Memory |
|---|---|
| Go runtime binary | 45 MB |
| 258 parsed fleet soul structs | 5.2 MB |
| Skill registry (127 declared skills) | 3.8 MB |
| MQTT broker (heartbeat) | 12 MB |
| HTTP server + routing | 8 MB |
| Per-agent goroutine overhead | 280 MB |
| Conversation buffers (139 agents) | ~1.85 GB |
| Total | ~2.2 GB |
Table 3: Memory breakdown for 139 active agents (2026-04-27 measurement). Per-agent average is 16.6 MB resident set, up from 12.5 MB at 75 agents — the increase is dominated by larger per-conversation buffer allocation as agents are exercised more deeply, not by per-agent definition cost (which remains constant at the 2--4 KB parsed-soul level). At 16 MB per agent, the architecture still operates at 12--19× the memory efficiency of Python-based frameworks (Table 1).
10.2 Comparative Analysis
| Metric | LocalKin | AutoGen | CrewAI | MetaGPT |
|---|---|---|---|---|
| Language | Go | Python | Python | Python |
| Per-agent memory | ~16 MB | ~250 MB | ~200 MB | ~300 MB |
| Max agents (16GB) | 139 (live) | ~8 | ~10 | ~6 |
| Skill definition | SKILL.md + script | Python function | Python tool class | Python action |
| Agent definition | .soul.md (text) | Python class | Python class | Python class |
| Hot-swap agent | Yes (edit file) | No (restart) | No (restart) | No (restart) |
| Machine-modifiable | Yes (text patch) | Difficult (AST) | Difficult (AST) | Difficult (AST) |
| Deterministic tools | Yes (subprocess) | Partial | Partial | Partial |
| Token overhead/skill | ~120 tokens | ~800 tokens | ~600 tokens | ~1000 tokens |
Table 4: Architectural comparison across multi-agent frameworks. LocalKin per-agent memory is the live 2026-04-27 measurement (139 agents, 16.6 MB average resident set).
10.3 Token Efficiency
We measured token consumption for a stock-price lookup workflow across architectures:
- Prompt-embedded (AutoGen-style): 1,847 tokens (system prompt with API docs, parsing logic, error handling, few-shot examples).
- Thin Soul + Fat Skill (LocalKin): 312 tokens (skill description in prompt + tool call + result parsing).
- Reduction: 83% fewer tokens per tool-use interaction.
For an agent performing 20 tool calls per session, this translates to approximately 30,000 tokens saved---equivalent to roughly $0.09 per session at current Claude Sonnet pricing, or $2.70/day for a 30-session workload.
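The savings figures above can be reproduced with a few lines of arithmetic. The Python sketch below uses the measured token counts from this section; the price of $3.00 per million input tokens is an assumption chosen because it is consistent with the $0.09 figure, not a number stated in this paper.

```python
PROMPT_EMBEDDED_TOKENS = 1847   # measured, prompt-embedded workflow
THIN_SOUL_TOKENS = 312          # measured, Thin Soul + Fat Skill workflow
CALLS_PER_SESSION = 20
SESSIONS_PER_DAY = 30
USD_PER_MILLION_INPUT_TOKENS = 3.00  # assumed price, not taken from the paper

saved_per_call = PROMPT_EMBEDDED_TOKENS - THIN_SOUL_TOKENS
reduction = saved_per_call / PROMPT_EMBEDDED_TOKENS
saved_per_session = saved_per_call * CALLS_PER_SESSION
usd_per_session = saved_per_session * USD_PER_MILLION_INPUT_TOKENS / 1_000_000
usd_per_day = usd_per_session * SESSIONS_PER_DAY

print(f"reduction: {reduction:.1%}")
print(f"tokens saved per session: {saved_per_session}")
print(f"USD saved per session: {usd_per_session:.4f}")
print(f"USD saved per day: {usd_per_day:.2f}")
```

With the assumed price, the unrounded daily figure comes out slightly above $2.70 because the text rounds the per-session saving to $0.09 before multiplying by 30.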
10.4 Skill Coverage
The production deployment maintains:
- 258 fleet souls (294 total including private/experimental) across 23 domains (TCM, spiritual, board/C-suite, marketing, engineering, QA, design, spatial computing, game development, grocery, support, paid media, product, specialist, growth, growth_us, integration, project management, quant, day-trade, research, bible, csuite).
- 127 declared skills spanning web scraping, data retrieval, content publishing, inter-agent communication, knowledge search, language drills, hardware control, and image/audio I/O.
- 6+ conductor agents orchestrating domain-specific teams (TCM Conductor, Spiritual Conductor, Quant Conductor, Board Conductor, Growth Conductor, Prediction Conductor, etc.).
- Autonomous scheduling: Multiple conductors run scheduled debates and self-improvement passes every 2--8 hours without human intervention.
11. Related Work
AutoGen [Wu et al., 2023] pioneered conversable agents with tool use but couples agent logic to Python classes. CrewAI [Moura, 2024] introduced role-based agent design but defines roles in Python, not declarative files. MetaGPT [Hong et al., 2023] models software development workflows as SOPs but requires heavyweight Python processes. DSPy [Khattab et al., 2023] compiles declarative language model programs but focuses on prompt optimization rather than multi-agent orchestration. LangGraph [LangChain, 2024] provides graph-based agent orchestration but inherits LangChain's Python dependency overhead.
Organization-level abstractions. OneManCompany (OMC) [Yu et al., 2026; arXiv:2604.22446] frames multi-agent coordination at the level of an organization, introducing a "Talent Market" abstraction for dynamic agent recruitment and an Explore--Execute--Review (E²R) tree-search loop for unified planning. OMC and our work address complementary layers: OMC operates at the organizational layer (how heterogeneous agents are recruited and reorganized), while Thin Soul + Fat Skill operates at the architectural layer (how each agent's identity is decoupled from its capabilities). Soul files are a natural substrate for Talent representation; an organizational layer can be built atop the architecture described here.
Memory and retrieval. Memanto [Abtahi et al., 2026; arXiv:2604.22085] proposes typed semantic memory with information-theoretic retrieval, reporting sub-90ms deterministic retrieval without an indexing step on the LongMemEval and LoCoMo benchmarks. We share the underlying conviction---explored further in our companion work [The LocalKin Team, 2026]---that retrieval does not require LLM inference. Memanto encodes this insight via 13 predefined typed memory categories backed by Moorcheh's information-theoretic search engine; our companion work uses literal grep against domain corpora. The two approaches converge on the same architectural principle from different starting points: separate the deterministic retrieval substrate from the generative reasoning layer.
Document-conditioned generation. BERAG [Chen et al., 2026; arXiv:2604.22678] proposes Bayesian ensemble RAG for visual question answering: rather than concatenating retrieved documents into a single context window, BERAG conditions the generator on each document independently and updates document posterior probabilities token-by-token using Bayes' rule during generation. This offers a path forward for multi-expert systems built on Thin Soul + Fat Skill: each soul-defined expert (e.g., each of the 35 TCM masters in our deployment) could condition the generator independently, with conductor-mediated Bayesian aggregation, eliminating the context-window crowding that limits naive multi-agent debate.
Self-improvement loops. Reflexion [Shinn et al., 2023; arXiv:2303.11366] and Self-Refine [Madaan et al., 2023; arXiv:2303.17651] established that LLM agents can iteratively critique and improve their own outputs, given only verbal feedback. Our architecture extends this from output-level to identity-level self-improvement: because soul files are declarative text, an agent (or a peer auditor agent) can modify another agent's soul prompt directly, persisting improvements across sessions. This connects to forthcoming work on autonomously self-evolving swarms.
The Thin Soul + Fat Skill architecture shares the declarative philosophy of infrastructure-as-code systems like Terraform and Kubernetes manifests, applying it to agent definition. The key insight---that agent identity and agent capability have fundamentally different runtime characteristics---appears to be novel in the multi-agent literature.
12. Limitations and Future Work
LLM Dependency for Routing. Conductor agents still rely on LLM calls to route queries to team members. A learned router using embedding similarity could reduce this to a deterministic lookup for common query types.
Cold Start. The first request to each agent requires loading the soul file and initializing the conversation buffer. Pre-warming strategies could reduce first-request latency.
Skill Composition. While compound skills handle linear pipelines, more complex DAG-structured workflows would benefit from a formal composition language.
Evaluation Rigor. The memory comparisons in this paper use straightforward runtime measurements rather than controlled benchmarks with identical workloads. A standardized multi-agent benchmark would strengthen these claims.
Context Window Pressure. As conversations grow, the thin soul's token savings are amortized over an increasingly large conversation history. Integrating summarization or retrieval-augmented approaches for long conversations remains future work.
Autonomous Self-Evolution (forthcoming work). A direct consequence of declarative soul files is that the swarm itself can edit its own agent definitions. We have observed this empirically in our deployment: on 2026-04-26, a peer auditor agent identified a stale formatting convention in quant_conductor v2.2.5 (mixed Chinese/English citation tag formats), automatically patched the soul file to v2.2.6, restarted the affected process, and verified compliance against fresh output---all without human intervention. The full mechanism, its safety envelope, and a multi-week deployment study are the subject of a forthcoming companion paper on self-evolving multi-agent swarms.
13. Conclusion
The Thin Soul + Fat Skill architecture demonstrates that the dominant cost in multi-agent systems is not the agents themselves but the framework overhead surrounding them. By reducing an agent to a declarative text file and executing tools as deterministic subprocesses, we achieve a 12--19× improvement in per-agent memory efficiency and enable 139 specialized agents to operate concurrently on consumer hardware (Mac Mini M2, 16GB RAM, ~2.2GB total system memory; 2026-04-27 measurement).
The architecture's most consequential property may be its compatibility with autonomous self-evolution. Because soul files are declarative text---not code---they can be reliably read, understood, and modified by the very LLMs they configure. This creates a feedback loop where the swarm improves its own agents, a capability that is impractical when agent definitions are embedded in Python class hierarchies.
The separation of identity from capability is a simple idea. Its compounding effects---on memory, tokens, security, hot-swapping, machine modification, and team orchestration---suggest it is also a consequential one.
References
- Hong, S., et al. (2023). MetaGPT: Meta Programming for Multi-Agent Collaborative Framework. arXiv:2308.00352.
- Khattab, O., et al. (2023). DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714.
- Moura, J. (2024). CrewAI: Framework for orchestrating role-playing, autonomous AI agents. GitHub repository.
- Wu, Q., et al. (2023). AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. arXiv:2308.08155.
- LangChain. (2024). LangGraph: Building stateful, multi-actor applications. Documentation.
- Shinn, N., Cassano, F., Berman, E., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv:2303.11366.
- Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., et al. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651.
- Yu, Z., Fu, Y., He, Z., Huang, Y., Lee, K. Y., Fang, M., Luo, W., & Wang, J. (2026). OneManCompany: From Skills to Talent --- Organising Heterogeneous Agents as a Real-World Company. arXiv:2604.22446.
- Abtahi, S. M., Rahnema, R., Patel, H., Patel, N., Fekri, M., & Khani, T. (2026). Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents. arXiv:2604.22085.
- Chen, J., Mei, J., Yang, G., & Byrne, B. (2026). BERAG: Bayesian Ensemble Retrieval-Augmented Generation for Knowledge-based Visual Question Answering. arXiv:2604.22678.
- The LocalKin Team. (2026). Grep is All You Need: Zero-Preprocessing Knowledge Retrieval for LLM Agents. Zenodo. doi:10.5281/zenodo.19777260.
- The LocalKin Team. (2026). Self-Evolving Multi-Agent Swarms: Autonomous Quality Audit, Repair, and Verification Loops for Production AI Agent Systems. Zenodo. doi:10.5281/zenodo.20094223.
Thin Soul, Fat Skill: A Token-Efficient Architecture for Production Multi-Agent Systems
The LocalKin Team
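Abstract
All production figures in this paper were measured on 2026-04-27 on the live LocalKin deployment.
Current multi-agent frameworks embed reasoning logic, domain knowledge, and execution procedures directly into LLM prompts, resulting in memory footprints exceeding 200MB per agent and a practical ceiling of 8-10 agents on consumer hardware. We present Thin Soul + Fat Skill, an architecture that separates agent identity (a declarative "soul file" of approximately 30-120 lines of YAML and Markdown) from execution logic (deterministic "skill" scripts of arbitrary size). The soul file is consumed by the LLM as a system prompt; skills execute entirely outside the token window. This separation reduces per-agent memory to approximately 16MB, a 12-19× improvement over Python-based frameworks, and enables 139 specialized agents (80 of which are publicly accessible at faith.localkin.ai and heal.localkin.ai) to run concurrently on a single Mac Mini M2 with 16GB RAM, with the whole swarm using approximately 2.2GB of memory. We describe the soul file schema, the skill execution protocol, the conductor pattern for hierarchical multi-agent coordination, a runtime forge that generates new skills autonomously, and a self-evolution mechanism enabled by the declarative nature of soul files. In production deployment, the architecture sustains 258 fleet souls (294 total) across 23 domains and 127 declared skills, with deterministic tool execution eliminating hallucination in data-retrieval tasks.
Keywords: multi-agent systems, LLM architecture, token efficiency, agent orchestration, tool use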
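1. Introduction
The promise of multi-agent LLM systems, in which specialized agents collaborate on complex tasks, runs into a resource wall. Frameworks such as AutoGen [Wu et al., 2023], CrewAI [Moura, 2024], and MetaGPT [Hong et al., 2023] define agents as heavyweight Python objects that bundle prompts, chain-of-thought logic, tool definitions, and conversation history into a single runtime entity. Each agent consumes 200-300MB of process memory before it answers its first query.
For an independent developer operating on consumer hardware, this caps a "swarm" at roughly 8 agents. The alternative, running fewer general-purpose agents, sacrifices the specialization that makes multi-agent systems compelling.
This paper presents the Thin Soul + Fat Skill architecture deployed in the LocalKin system, which inverts the conventional design. Instead of embedding logic in prompts, it separates concerns along a natural boundary:
- Soul (thin): a declarative file (approximately 30-120 lines) that defines who the agent is: its persona, model configuration, permissions, scheduled behavior, and safety constraints. The soul becomes the LLM's system prompt.
- Skill (fat): a standalone script (Python, Bash, or any other language) paired with a SKILL.md manifest that defines what the agent can do. Skills execute deterministically in a subprocess. Their code never enters the token window.
This separation yields three concrete benefits: (1) a 12-19× reduction in per-agent memory (Table 1), (2) deterministic execution of data-retrieval tasks that would otherwise be prone to hallucination, and (3) machine-modifiable agent definitions that enable autonomous self-evolution of the swarm.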
2. The Problem: Prompt-Bloated Agents
2.1 Anatomy of a Conventional Agent
In frameworks such as AutoGen, a typical agent definition looks like this:
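    # AutoGen 0.2-style import; the exact package layout depends on the installed version.
    from autogen import AssistantAgent

    agent = AssistantAgent(
        name="financial_analyst",
        # The prompt carries instructions, examples, and tool schemas inline.
        system_message="""You are a senior financial analyst...
        [200+ lines of instructions, few-shot examples,
        tool schemas, output format specifications]""",
        llm_config={"model": "gpt-4", "temperature": 0.2},
        code_execution_config={"work_dir": "output"},
    )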
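The system_message alone can exceed 4,000 tokens. Add tool definitions, conversation history, and framework overhead, and much of a single agent's context window is consumed by its own configuration before any user input arrives.
2.2 The Scaling Bottleneck
We measured the runtime memory of a single idle agent in three popular frameworks, with LocalKin for comparison:
| Framework | Memory per agent | Max agents (16GB) | Architecture |
|---|---|---|---|
| AutoGen | ~250 MB | ~8 | Python class + prompt |
| CrewAI | ~200 MB | ~10 | Python class + prompt |
| MetaGPT | ~300 MB | ~6 | Python class + prompt |
| LocalKin | ~16 MB | 139 (measured) | Soul file + Go runtime |
Table 1: Memory comparison of multi-agent frameworks. Measurements were taken on a Mac Mini M2 (16GB RAM; 2026-04-27 snapshot). The LocalKin per-agent figure is the average resident set across 139 active agents (measured: 16.6 MB); per-agent overhead is dominated by conversation-buffer allocation rather than by the agent definition itself (a parsed soul struct is only 2-4 KB). LocalKin agents share a single Go binary runtime (about 45MB) and a global skill registry.
The difference arises because the Python frameworks instantiate a full interpreter, framework object graph, and set of prompt templates for every agent. LocalKin's Go runtime loads each soul as a lightweight struct (about 2-4KB of parsed YAML) and shares the skill registry across all agents.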
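2.3 Token Waste
Beyond memory, prompt-embedded logic wastes tokens on every LLM call. A stock price lookup that could run as a subprocess call in under 50ms instead spends 500+ prompt tokens describing the API, response parsing, and error handling, only so that the LLM can approximate what a deterministic script computes exactly.
3. Soul File Design
3.1 Schema
Soul files (*.soul.md) use a two-part format: YAML front matter for machine-readable configuration, and a Markdown body that becomes the system prompt.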
    ---
    name: "Hua Tuo"
    slug: "hua_tuo"
    version: "1.0.0"
    brain:
      provider: "claude"
      model: "claude-sonnet-4-6"
      temperature: 0.3
      context_length: 32768
      fallback:
        provider: "ollama"
        model: "qwen3.5:9b"
    permissions:
      shell: false
      network: true
      filesystem:
        allow: ["./output"]
        deny: [".env", ".git", "~/.ssh"]
    skills:
      enable: ["knowledge_search", "web_search", "web_scrape"]
      output_dir: "./output"
    heart:
      enabled: true
      pulse:
        interval: "10s"
        topic: "localkin/swarm/status"
    safety:
      pain_feedback: false
    ---
    # Hua Tuo, the Divine Physician
    You are Hua Tuo, courtesy name Yuanhua...
    ## Areas of Expertise
    - Health and lifestyle guidance
    - Herbal formulas for prevention
    ...
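Figure 1: Abbreviated soul file for a Traditional Chinese Medicine (TCM) agent. The YAML front matter (42 lines) configures the runtime; the Markdown body (about 70 lines) becomes the system prompt.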
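The two-part format can be illustrated with a minimal Python sketch that separates the front matter from the prompt body. This is only an illustration under stated assumptions: the function name `split_soul_file` is hypothetical, the LocalKin runtime itself is written in Go, and the sketch deliberately stops short of parsing the YAML so that it needs no third-party library.

```python
def split_soul_file(text: str) -> tuple[str, str]:
    """Split a soul file into (front_matter, body).

    Hypothetical helper: assumes the file opens with a '---' line and that
    the front matter is closed by the next line consisting only of '---'.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("soul file must start with a '---' line")
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            front = "\n".join(lines[1:i])          # YAML configuration text
            body = "\n".join(lines[i + 1:]).strip()  # Markdown system prompt
            return front, body
    raise ValueError("front matter is not closed by a second '---' line")


# Abbreviated example in the shape of Figure 1.
EXAMPLE_SOUL = """---
name: "Hua Tuo"
brain:
  provider: "claude"
  temperature: 0.3
---
# Hua Tuo
You are Hua Tuo...
"""

front_matter, system_prompt = split_soul_file(EXAMPLE_SOUL)
```

In this sketch `front_matter` would be handed to a YAML parser to build the runtime configuration, while `system_prompt` is the only part that would reach the LLM.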
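3.2 The Front Matter Section
The YAML front matter is organized into five areas of concern:
Brain. Model provider, model name, temperature, context length, and an optional fallback chain. When the primary provider (for example, the Claude API) is unavailable, the runtime cascades to the fallback (for example, a local Ollama model). This enables three-tier resource management: cloud APIs for complex reasoning, edge models for routine tasks, and local models for offline operation.
Permissions. A capability-based security model. Each agent declares whether it may execute shell commands, whether it may access the network, and which filesystem paths are allowed or denied. The runtime enforces these permissions at the system-call boundary: an agent with shell: false cannot execute arbitrary commands, whatever the LLM outputs.
Skills. A whitelist of the skill names the agent may invoke. Skills not listed here are invisible to the agent even if they exist in the skills directory. This prevents, for example, a marketing agent from invoking the robot_control skill.
Heart. The agent's schedule of autonomous behavior. pulse defines a heartbeat for liveness monitoring over MQTT. The schedule array defines periodic autonomous tasks; for example, the TCM conductor wakes every 8 hours, scrapes trending health topics, and initiates a physician debate.
Safety. Constraints, including a shell command blacklist, circuit-breaker thresholds, and a pain-feedback flag for embodied agents.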
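As an illustration of the Permissions area, the following Python sketch checks a path against the allow and deny lists from Figure 1. It is a simplified assumption of how such a check could look, not the runtime's enforcement code: `path_allowed` is a hypothetical name, the matching is prefix-based rather than glob-based, and a real implementation would resolve symlinks and relative segments before comparing.

```python
from pathlib import PurePosixPath


def path_allowed(path: str, allow: list[str], deny: list[str]) -> bool:
    """Return True if `path` may be accessed (hypothetical, simplified check).

    Deny entries win over allow entries. A deny entry blocks a path when it
    equals the path, is one of its ancestors, or matches a path component.
    """
    p = PurePosixPath(path)
    if ".." in p.parts:
        return False  # refuse traversal segments instead of resolving them
    for entry in deny:
        d = PurePosixPath(entry)
        if p == d or d in p.parents or entry in p.parts:
            return False
    # The path must be an allow root or sit underneath one.
    return any(
        p == PurePosixPath(root) or PurePosixPath(root) in p.parents
        for root in allow
    )


ALLOW = ["./output"]
DENY = [".env", ".git", "~/.ssh"]

report_ok = path_allowed("./output/report.md", ALLOW, DENY)
env_ok = path_allowed("./output/.env", ALLOW, DENY)
ssh_ok = path_allowed("~/.ssh/id_rsa", ALLOW, DENY)
```

The deny-first ordering is a design choice in this sketch: a sensitive file stays blocked even when it sits inside an allowed directory.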
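3.3 Hot-Swapping and Machine Modification
Because soul files are plain text files on disk, changes take effect without recompilation or redeployment. An operator can edit a soul file to change an agent's persona, model, or permissions, and the runtime picks up the change on the next request cycle.
More importantly, this property enables machine modification. LocalKin's swarm_architect agent can patch soul files programmatically (adjusting temperature, adding skills, or rewriting sections of the system prompt) as part of an autonomous self-evolution loop. This is described further in Section 8.
4. Fat Skill Design
4.1 The SKILL.md Manifest
Each skill is defined by a SKILL.md file in the skills/ directory, using the same YAML front matter plus Markdown format as soul files: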
    ---
    name: stock_price
    description: "Fetch real-time stock quotes and metrics"
    command: ["python3", "skills/stock_price/quote.py"]
    args: ["--action", "{{action}}", "--symbol", "{{symbol}}", "--period", "{{period}}"]
    timeout: 30
    schema:
      action:
        type: "string"
        enum: ["quote", "metrics", "history"]
        required: true
      symbol:
        type: "string"
        description: "Ticker symbol, e.g. NVDA"
        required: true
      period:
        type: "string"
        default: "1mo"
    ---
    # Stock Price
    Fetches real-time stock market data via Yahoo Finance.
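Figure 2: SKILL.md manifest for the stock_price skill. Template parameters ({{symbol}}, {{action}}) are interpolated at invocation time.
4.2 Execution Protocol
When an agent invokes a skill, the runtime: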
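1. Parses the LLM's tool-call output (for example, stock_price action="quote" symbol="NVDA").
2. Validates the parameters against the schema defined in SKILL.md.
3. Interpolates the template variables into the command arguments.
4. Spawns the script as a subprocess with the configured timeout.
5. Captures stdout (expected to be JSON) and returns it to the LLM as the tool result.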
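The key insight is that step 4 executes deterministically. The quote.py script issues an HTTP request to Yahoo Finance and returns the exact price. No tokens are spent parsing API documentation, and because the LLM never generates the returned value, it cannot be hallucinated at this step. The LLM's role is reduced to deciding when to invoke the skill and interpreting the result, tasks that are well within its capabilities.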
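The validate, interpolate, spawn, and capture steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not LocalKin's Go implementation: the function names are hypothetical, the schema mirrors Figure 2, only string parameters are handled, and a one-line stub stands in for quote.py so the example runs without network access.

```python
import json
import subprocess
import sys

# Schema and argument template copied from the stock_price manifest (Figure 2).
SCHEMA = {
    "action": {"type": "string", "enum": ["quote", "metrics", "history"], "required": True},
    "symbol": {"type": "string", "required": True},
    "period": {"type": "string", "default": "1mo"},
}
ARGS_TEMPLATE = ["--action", "{{action}}", "--symbol", "{{symbol}}", "--period", "{{period}}"]


def validate(params: dict, schema: dict) -> dict:
    """Check parameters against the schema and fill in defaults."""
    unknown = set(params) - set(schema)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved = {}
    for name, rule in schema.items():
        if name in params:
            value = params[name]
        elif "default" in rule:
            value = rule["default"]
        elif rule.get("required"):
            raise ValueError(f"missing required parameter: {name}")
        else:
            continue
        if rule.get("type") == "string" and not isinstance(value, str):
            raise ValueError(f"{name} must be a string")
        if "enum" in rule and value not in rule["enum"]:
            raise ValueError(f"{name} must be one of {rule['enum']}")
        resolved[name] = value
    return resolved


def interpolate(template: list[str], values: dict) -> list[str]:
    """Replace {{name}} placeholders in each argument."""
    out = []
    for arg in template:
        for name, value in values.items():
            arg = arg.replace("{{" + name + "}}", value)
        out.append(arg)
    return out


def run_skill(command: list[str], template: list[str], schema: dict,
              params: dict, timeout: int = 30) -> dict:
    """Spawn the skill as a subprocess and parse its JSON stdout."""
    argv = command + interpolate(template, validate(params, schema))
    # An argument list (no shell) keeps parameter values from being interpreted as shell syntax.
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout, check=True)
    return json.loads(proc.stdout)


# Stand-in for quote.py: echoes its arguments as JSON instead of calling an API.
STUB = "import json, sys; print(json.dumps({'argv': sys.argv[1:]}))"
result = run_skill([sys.executable, "-c", STUB], ARGS_TEMPLATE, SCHEMA,
                   {"action": "quote", "symbol": "NVDA"})
```

Rejecting invalid parameters before the subprocess starts means a malformed tool call fails fast with a readable error that can be returned to the LLM.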
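4.3 Language Agnosticism
Because skills communicate via stdin/stdout JSON, they can be written in any language. The LocalKin skill registry includes skills written in Python (stock quotes, web scraping, NotebookLM integration), Bash (content publishing pipelines, system monitoring), and Go (swarm communication). The runtime is agnostic to the implementation language.
4.4 Plug-and-Play Discovery
Adding a new skill only requires dropping a SKILL.md file (and its accompanying script) into the skills/ directory. The runtime discovers skills automatically at startup by scanning for SKILL.md manifests. No registration code, no import statements, and no framework boilerplate are needed. In production, LocalKin manages 127 declared skills through this mechanism alone (each agent loads a subset according to its skills.enable whitelist).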
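A directory scan of this kind can be sketched in a few lines of Python. The layout skills/<name>/SKILL.md is an assumption inferred from the command path in Figure 2, and the function names `discover_skills` and `visible_skills` are hypothetical; the sketch builds a temporary directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def discover_skills(skills_dir: Path) -> dict[str, Path]:
    """Map each skill name (its directory name) to its SKILL.md manifest."""
    return {m.parent.name: m for m in sorted(skills_dir.glob("*/SKILL.md"))}


def visible_skills(registry: dict[str, Path], enabled: list[str]) -> dict[str, Path]:
    """Apply an agent's skills.enable whitelist to the global registry."""
    return {name: path for name, path in registry.items() if name in enabled}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "skills"
    for name in ("stock_price", "web_search", "robot_control"):
        (root / name).mkdir(parents=True)
        (root / name / "SKILL.md").write_text("---\nname: " + name + "\n---\n")

    registry = discover_skills(root)
    agent_view = visible_skills(registry, ["web_search", "stock_price"])
    discovered = sorted(registry)
    allowed = sorted(agent_view)
```

Here robot_control is discovered in the registry but stays invisible to an agent whose whitelist does not name it.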
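4.5 Token Economics
The Markdown body of SKILL.md (typically 5-15 lines of usage examples) is injected into the system prompt so that the LLM knows how to invoke the skill. The script itself, which may be 200+ lines of Python, never enters the token window. For the stock_price skill:
| Component | Size | Tokens consumed |
|---|---|---|
| SKILL.md body (in prompt) | 35 lines | ~120 tokens |
| quote.py (subprocess) | 215 lines | 0 tokens |
| Total token cost | | ~120 tokens |
Table 2: Token accounting for the stock_price skill. The execution script consumes zero tokens because it runs as a subprocess outside the LLM context.
By comparison, a prompt-embedded approach would spend 800-1,200 tokens per agent per invocation on the equivalent logic, API documentation, and error-handling instructions.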
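5. Composite Skills and LLM Round-Trip Reduction
5.1 The Round-Trip Problem
A naive tool-use loop requires one LLM call per step: the agent invokes a tool, receives the result, reasons about the next step, invokes another tool, and so on. For a five-step workflow this means five LLM round-trips, each consuming input and output tokens and adding 1-3 seconds of latency.
5.2 Composite Skill Architecture
A composite skill is a single SKILL.md that wraps a multi-step script. The LLM makes one tool call; the script executes the entire pipeline deterministically and returns a consolidated result.
For example, the content_publish composite skill:
- Fetches trending topics (web scraping)
- Generates a content outline (template-based)
- Formats it for the target platform (Markdown to HTML)
- Publishes via API (HTTP request)
- Returns a confirmation with the URL
What would have taken 5 LLM round-trips becomes 1. In practice, composite skills reduce LLM round-trips for multi-step workflows by 50-87%, with the lower bound applying to two-step sequences and the upper bound to eight-step pipelines.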
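The quoted range follows from simple arithmetic if one assumes that a composite skill collapses an n-step pipeline from n round-trips to exactly one, so the reduction is 1 - 1/n. The short Python sketch below works this out; the function name is illustrative.

```python
def round_trip_reduction(steps: int) -> float:
    """Fraction of LLM round-trips saved when `steps` calls collapse into one."""
    if steps < 1:
        raise ValueError("a pipeline needs at least one step")
    return 1 - 1 / steps


for n in (2, 5, 8):
    print(f"{n}-step pipeline: {round_trip_reduction(n):.1%} fewer LLM round-trips")
```

Under this assumption a two-step sequence saves 50%, the five-step content_publish example saves 80%, and an eight-step pipeline saves 87.5%, which matches the 50-87% range reported above.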
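5.3 When Not to Compose
Composite skills are appropriate when the pipeline is deterministic and the intermediate decisions are mechanical. When genuine reasoning is needed between steps (for example, deciding from its quality whether a draft needs revision), the single-step skill model is preferred, because it lets the LLM exercise judgment at each stage.
6. The Conductor Pattern
6.1 Hierarchical Team Organization
LocalKin organizes agents into domain-specific teams led by conductor agents, rather than into a flat peer-to-peer topology. A conductor is itself an agent (with its own soul file) whose system prompt contains routing logic and whose skills include swarm_comm (inter-agent messaging) and swarm_debate (structured multi-agent deliberation).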
                +-------------------+
                |   User / Jacky    |
                +---------+---------+
                          |
          +---------------+---------------+
          |               |               |
    +-----v------+  +-----v------+  +-----v------+
    | TCM        |  | Board      |  | Marketing  |
    | Conductor  |  | Conductor  |  | Conductor  |
    +-----+------+  +-----+------+  +-----+------+
          |               |               |
    +-----+------+  +-----+------+  +-----+------+
    | 11 TCM     |  | 5-exec     |  | 8 channel  |
    | physicians |  | team       |  | agents     |
    +------------+  +------------+  +------------+
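Figure 3: The conductor pattern. Each conductor manages a domain-specific team. The TCM conductor routes to 11 historical physician agents; the Board conductor manages CEO, CFO, CTO, Growth, and Intelligence agents; the Marketing conductor coordinates channel-specific strategists.
6.2 TCM Conductor: Routing by Specialty
The TCM conductor manages 11 agents, each embodying a historical Chinese physician with a distinct specialty:
| Physician | Era | Specialty |
|---|---|---|
| Huangdi (Yellow Emperor) | c. 2600 BCE | Foundational theory, Neijing |
| Zhang Zhongjing | 150-219 CE | Cold damage, Shanghan Lun |
| Hua Tuo | 140-208 CE | Surgery, mafeisan anesthesia |
| Sun Simiao | 581-682 CE | Prescriptions, medical ethics |
| Li Shizhen | 1518-1593 CE | Materia medica, Bencao Gangmu |
| Huangfu Mi | 215-282 CE | Acupuncture, Zhenjiu Jiayi Jing |
| Ye Tianshi | 1667-1746 CE | Warm disease (wenbing) |
| Liu Wansu | 1120-1200 CE | Cold and Cool school |
| Li Dongyuan | 1180-1251 CE | Spleen-Stomach school |
| Zhu Danxi | 1281-1358 CE | Yin-nourishing school |
| Fu Qingzhu | 1607-1684 CE | Gynecology |
When a health query arrives, the conductor analyzes the topic and routes it to the 4-6 most relevant physicians. For a question about spring allergies, the conductor might convene Ye Tianshi (warm disease), Liu Wansu (Cold and Cool school), Li Dongyuan (Spleen-Stomach school), and Sun Simiao (general prescriptions) for a structured debate, while excluding Huangfu Mi (acupuncture focus) and Fu Qingzhu (gynecology focus).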
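6.3 Board Conductor: Executive Decision-Making
The Board conductor implements a corporate advisory board with five executive agents: CEO (strategy), CFO (financial analysis), CTO (technical feasibility), Growth (market expansion), and Intelligence (competitive intelligence). Scheduled every 8 hours, it scrapes technology news, constructs strategic theses, and orchestrates a structured debate among the executives, producing bilingual board meeting minutes.
6.4 Conductor Scheduling
Conductors use the heart.schedule field to define autonomous wake cycles. The TCM conductor and the Board conductor wake every 8 hours to run their scheduled debates. This is not polling: the Go runtime maintains a scheduler that fires the agent's prompt at the configured interval, creating an autonomous swarm that runs without human initiation.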
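The difference between polling and an interval scheduler can be illustrated with a small Python simulation that keeps upcoming wake-ups in a priority queue and jumps straight to the next due time. The agent slugs and the hour-based interval representation are assumptions made for this sketch; the format of heart.schedule is not specified here, and the actual scheduler lives in the Go runtime.

```python
import heapq


def simulate_wakeups(schedule: dict[str, int], horizon_hours: int) -> list[tuple[int, str]]:
    """Return (hour, agent) wake-up events up to and including `horizon_hours`.

    `schedule` maps a hypothetical agent slug to its wake interval in hours.
    """
    # Each heap entry is (next_due_hour, agent); the earliest entry is always on top.
    heap = [(interval, agent) for agent, interval in schedule.items()]
    heapq.heapify(heap)
    fired = []
    while heap and heap[0][0] <= horizon_hours:
        due, agent = heapq.heappop(heap)
        fired.append((due, agent))  # here the runtime would fire the agent's prompt
        heapq.heappush(heap, (due + schedule[agent], agent))
    return fired


events = simulate_wakeups({"tcm_conductor": 8, "board_conductor": 8}, horizon_hours=24)
```

Between events nothing runs: a real scheduler would sleep until the time at the top of the heap rather than repeatedly checking each agent.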
7. Three-Tier Resource Cascade
7.1 Local, Edge, Cloud
The brain section of each soul file can define a primary provider and a fallback chain:
    brain:
      provider: "claude"          # Tier 3: cloud API
      model: "claude-sonnet-4-6"
      fallback:
        provider: "ollama"        # Tier 1: local
        model: "qwen3.5:9b"
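The runtime tries the primary provider first. On failure (rate limiting, network outage, budget exhaustion), it cascades to the fallback. This creates a natural three-tier resource model:
- Tier 1 (local): Ollama models running on the same machine. No network latency, no API cost, limited capability.
- Tier 2 (edge): Models running on nearby infrastructure (for example, a home-lab GPU server).
- Tier 3 (cloud): Commercial APIs (Claude, OpenAI). Highest capability, highest cost.
Routine tasks (status checks, simple formatting) can be handled by local models; complex reasoning (multi-step analysis, creative writing) escalates to the cloud. The soul file declares the preference; the runtime handles the fallback.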
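The cascade behavior can be sketched in Python as an ordered list of providers tried in turn. Everything named here is hypothetical: `ProviderError`, `cascade`, and the two stand-in provider functions simulate a rate-limited cloud API and a working local model, and do not call any real service.

```python
from typing import Callable


class ProviderError(Exception):
    """Raised by a provider that cannot serve the request (hypothetical)."""


def cascade(prompt: str, providers: list[tuple[str, Callable[[str], str]]]) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, reply) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")  # remember why this tier failed
    raise ProviderError("all providers failed: " + "; ".join(errors))


def cloud_model(prompt: str) -> str:
    raise ProviderError("rate limited")  # simulate a failing primary provider


def local_model(prompt: str) -> str:
    return f"[local] {prompt}"  # simulate a local model that always answers


used, reply = cascade("status check", [("claude", cloud_model), ("ollama", local_model)])
```

Collecting the per-tier errors makes the final failure message useful when every tier is down.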
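8. The Forge: Runtime Skill Generation
8.1 Self-Created Tools
The soul_forge skill is a meta-skill: an LLM-driven tool that creates new tools. When the swarm identifies a capability gap (for example, the need to monitor trending Hacker News posts), the swarm_architect agent can invoke soul_forge to:
- Design the soul specification for the new agent.
- Create a .soul.md file with appropriate front matter and system prompt.
- Validate that the file conforms to the schema.
- Generate the accompanying SKILL.md and script files.
Because soul files and SKILL.md manifests are declarative text, LLMs can generate them reliably. The forge validates generated files against the schema before writing them to disk, and the runtime's auto-discovery mechanism picks them up immediately.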
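8.2 Self-Evolution Compatibility
The declarative, text-based nature of soul files makes them unusually well suited to autonomous modification. Unlike Python class hierarchies, which require an understanding of import graphs and inheritance chains, soul files can be patched with simple text operations:
- Temperature adjustment: change temperature: 0.3 to temperature: 0.5 based on output-quality metrics.
- Skill addition: append a skill name to the enable list.
- Prompt refinement: edit the Markdown body to add new rules or domain knowledge.
- Model upgrade: replace model: "claude-sonnet-4-6" with a newer version.
The swarm_architect agent performs these modifications as part of the self-evolution cycle, evaluating agent performance and patching soul files to improve results. This is possible precisely because a soul file is not code: it is a declarative specification that an LLM can read, understand, and modify without the risk of introducing syntax errors into program code.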
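A temperature adjustment of this kind can be sketched as a plain text patch in Python. The function `patch_temperature` is a hypothetical illustration, not the swarm_architect's implementation; it assumes the file starts with a '---' line and restricts the edit to the front matter so that prose in the Markdown body is never touched.

```python
import re


def patch_temperature(soul_text: str, new_value: float) -> str:
    """Rewrite the first `temperature:` entry in the YAML front matter only."""
    parts = soul_text.split("---\n", 2)
    if len(parts) != 3 or parts[0] != "":
        raise ValueError("expected a soul file that opens and closes its front matter with '---'")
    _, front, body = parts
    patched_front, count = re.subn(
        r"(?m)^([ \t]*temperature:[ \t]*)\S+",
        lambda m: m.group(1) + str(new_value),  # keep the original indentation and key
        front,
        count=1,
    )
    if count != 1:
        raise ValueError("no temperature field found in the front matter")
    return "---\n" + patched_front + "---\n" + body


SOUL = (
    "---\n"
    'name: "Hua Tuo"\n'
    "brain:\n"
    "  temperature: 0.3\n"
    "---\n"
    "# Hua Tuo\n"
    "Keep temperature: 0.3 in mind.\n"
)

patched = patch_temperature(SOUL, 0.5)
```

Raising an error when the field is missing, rather than silently returning the input, gives an automated patcher a clear signal that its edit did not apply.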
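9. Security Model
9.1 Defense in Depth
The Thin Soul + Fat Skill architecture provides multiple security boundaries:
Capability-based permissions. Each soul file declares its permission scope. An agent with shell: false cannot execute arbitrary commands. Filesystem permissions use allow/deny lists with glob patterns, preventing agents from accessing sensitive paths (.env, .git, ~/.ssh, ~/.gnupg).
Skill whitelist. An agent can invoke only the skills listed in its skills.enable array. A marketing agent cannot invoke robot_control, even if that skill exists in the registry.
Shell blacklist. For agents with shell access enabled, a blacklist blocks dangerous commands (rm -rf, curl | bash, and so on).
Filesystem sandbox. Skill execution is sandboxed to the configured output_dir. Scripts cannot write outside their designated directory.
Forge security scanning. When the forge generates a new skill, the output undergoes static analysis: generated scripts are scanned for dangerous patterns (network exfiltration, filesystem traversal, process spawning) before being written to disk.
Circuit breakers. The heart.max_rounds and heart.max_daily_wakeups fields guard against runaway autonomous agents. An agent that exceeds its round limit or daily wake-up quota is paused automatically.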
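The circuit-breaker rule can be sketched in Python as a small counter object. The class and method names are hypothetical, and the exact semantics (for instance, whether a paused agent resumes the next day) are not specified above, so this sketch simply pauses the agent until an operator intervenes.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class CircuitBreaker:
    """Pause an agent that exceeds its round limit or daily wake-up quota."""
    max_rounds: int
    max_daily_wakeups: int
    paused: bool = False
    _wakeups: dict = field(default_factory=dict)  # date -> wake-ups recorded that day

    def on_wakeup(self, today: date) -> bool:
        """Record a wake-up; return False once the daily quota is exceeded."""
        if not self.paused:
            self._wakeups[today] = self._wakeups.get(today, 0) + 1
            if self._wakeups[today] > self.max_daily_wakeups:
                self.paused = True
        return not self.paused

    def on_round(self, rounds_so_far: int) -> bool:
        """Return False once a single run exceeds its round limit."""
        if rounds_so_far > self.max_rounds:
            self.paused = True
        return not self.paused


breaker = CircuitBreaker(max_rounds=10, max_daily_wakeups=3)
day = date(2026, 4, 27)
wakeup_results = [breaker.on_wakeup(day) for _ in range(4)]
```

In this run the first three wake-ups are admitted and the fourth trips the breaker, after which every further call is refused.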
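9.2 Principle of Least Privilege
Each agent is configured with the minimum permissions its function requires. TCM physician agents have shell: false and filesystem access limited to ./output. Infrastructure maintenance agents have broader permissions but are rate-limited by circuit breakers. This granularity is possible because permissions are declared per soul file rather than globally.
10. Evaluation
10.1 Memory Efficiency
On 2026-04-27 we measured the total system memory of a LocalKin swarm running 139 active agents on a Mac Mini M2 (16GB RAM):
| Component | Memory |
|---|---|
| Go runtime binary | 45 MB |
| Parsed structs for 258 fleet souls | 5.2 MB |
| Skill registry (127 declared skills) | 3.8 MB |
| MQTT broker (heartbeat) | 12 MB |
| HTTP server + routing | 8 MB |
| Per-agent goroutine overhead | 280 MB |
| Conversation buffers (139 agents) | ~1.85 GB |
| Total | ~2.2 GB |
Table 3: Memory breakdown for 139 active agents (measured 2026-04-27). The average per-agent resident set is 16.6 MB, up from 12.5 MB at the 75-agent scale; the growth is dominated by conversation-buffer allocation from deeper interactions, not by agent-definition overhead (a parsed soul struct remains at the 2-4 KB level). Even at 16 MB per agent, the architecture retains a 12-19× memory-efficiency advantage over Python frameworks (see Table 1).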
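The totals and ratios quoted in Tables 1 and 3 can be checked with a few lines of Python. The component values are copied from Table 3, with the approximate ~1.85 GB buffer figure taken as 1,850 MB, which is an assumption of this sketch.

```python
# Memory components from Table 3, in MB.
COMPONENTS_MB = {
    "Go runtime binary": 45,
    "Parsed soul structs": 5.2,
    "Skill registry": 3.8,
    "MQTT broker": 12,
    "HTTP server + routing": 8,
    "Goroutine overhead": 280,
    "Conversation buffers": 1850,  # "~1.85 GB" taken as 1,850 MB
}
ACTIVE_AGENTS = 139
LOCALKIN_MB_PER_AGENT = 16  # rounded per-agent figure from Table 1

total_mb = sum(COMPONENTS_MB.values())
per_agent_mb = total_mb / ACTIVE_AGENTS

# Per-agent memory of the Python frameworks in Table 1, relative to LocalKin.
ratios = {
    name: mb / LOCALKIN_MB_PER_AGENT
    for name, mb in {"CrewAI": 200, "AutoGen": 250, "MetaGPT": 300}.items()
}

print(f"total: {total_mb:.0f} MB, per agent: {per_agent_mb:.1f} MB")
print({name: round(r, 2) for name, r in ratios.items()})
```

The components sum to about 2,204 MB, in line with the ~2.2 GB total, and the ratios span 12.5× to 18.75×, which is where the 12-19× range comes from. Dividing the total by 139 gives roughly 15.9 MB per agent, slightly below the reported 16.6 MB average resident set; the gap is presumably rounding in the approximate table entries.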
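10.2 Comparative Analysis
| Metric | LocalKin | AutoGen | CrewAI | MetaGPT |
|---|---|---|---|---|
| Programming language | Go | Python | Python | Python |
| Memory per agent | ~16 MB | ~250 MB | ~200 MB | ~300 MB |
| Max agents (16GB) | 139 (measured) | ~8 | ~10 | ~6 |
| Skill definition | SKILL.md + script | Python functions | Python tool classes | Python actions |
| Agent definition | .soul.md (text) | Python class | Python class | Python class |
| Hot-swappable agents | Yes (edit file) | No (restart) | No (restart) | No (restart) |
| Machine-modifiable | Yes (text patch) | Difficult (AST) | Difficult (AST) | Difficult (AST) |
| Deterministic tools | Yes (subprocess) | Partial | Partial | Partial |
| Token overhead per skill | ~120 | ~800 | ~600 | ~1000 |
Table 4: Architectural comparison of multi-agent frameworks. LocalKin per-agent memory is a live measurement from 2026-04-27 (139 agents, average resident set 16.6 MB).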
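10.3 Token Efficiency
We measured token consumption for the stock price lookup workflow across architectures:
- Prompt-embedded (AutoGen style): 1,847 tokens (system prompt containing API documentation, parsing logic, error handling, and few-shot examples).
- Thin Soul + Fat Skill (LocalKin): 312 tokens (skill description in the prompt + tool call + result parsing).
- Reduction: 83% fewer tokens per tool-use interaction.
For an agent making 20 tool calls per session, this translates to roughly 30,000 tokens saved, which is about $0.09 per session at current Claude Sonnet pricing, or about $2.70 per day for a 30-session workload.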
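The savings figures can be reproduced with a short calculation. The token counts come from the measurements above; the price of $3 per million input tokens is an assumption of this sketch, chosen because it is consistent with the quoted figure of about $0.09 per session.

```python
PROMPT_EMBEDDED_TOKENS = 1847   # AutoGen-style measurement
THIN_SOUL_TOKENS = 312          # LocalKin measurement
CALLS_PER_SESSION = 20
SESSIONS_PER_DAY = 30
USD_PER_MILLION_TOKENS = 3.0    # assumed input-token price, not stated in the paper

saved_per_call = PROMPT_EMBEDDED_TOKENS - THIN_SOUL_TOKENS
reduction = saved_per_call / PROMPT_EMBEDDED_TOKENS
saved_per_session = saved_per_call * CALLS_PER_SESSION
usd_per_session = saved_per_session * USD_PER_MILLION_TOKENS / 1_000_000
usd_per_day = usd_per_session * SESSIONS_PER_DAY

print(f"saved per call: {saved_per_call} tokens ({reduction:.1%})")
print(f"saved per session: {saved_per_session} tokens, ${usd_per_session:.3f}")
print(f"saved per day: ${usd_per_day:.2f}")
```

Under the assumed price, the session saving is about 30,700 tokens, or a little over nine cents, and the daily saving comes to roughly $2.76, close to the rounded $2.70 quoted above.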
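10.4 Skill Coverage
The production deployment maintains:
- 258 fleet souls (294 including private and experimental souls) across 23 domains (TCM, spiritual, Board/executive, marketing, engineering, QA, design, spatial computing, game development, grocery, support, paid media, product, specialists, growth, growth_us, integration, project management, quant, day trading, research, bible, csuite).
- 127 declared skills, covering web scraping, data retrieval, content publishing, inter-agent communication, knowledge search, language practice, hardware control, and image/audio I/O.
- 6+ conductor agents coordinating domain-specific teams (TCM conductor, spiritual conductor, quant conductor, Board conductor, growth conductor, forecasting conductor, and others).
- Autonomous scheduling: multiple conductors run scheduled debates and self-improvement patrols every 2-8 hours without human intervention.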
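11. Related Work
AutoGen [Wu et al., 2023] pioneered conversable agents with tool use, but couples agent logic to Python classes. CrewAI [Moura, 2024] introduced role-based agent design, but defines roles in Python rather than in declarative files. MetaGPT [Hong et al., 2023] models software development workflows as SOPs, but requires heavyweight Python processes. DSPy [Khattab et al., 2023] compiles declarative language model programs, but focuses on prompt optimization rather than multi-agent orchestration. LangGraph [LangChain, 2024] provides graph-based agent orchestration, but inherits LangChain's Python dependency overhead.
Organization-level abstraction. OneManCompany (OMC) [Yu et al., 2026; arXiv:2604.22446] lifts multi-agent coordination to the organizational level, introducing a "Talent Market" abstraction for dynamic agent recruitment and an E²R (Explore-Execute-Review) tree-search loop that unifies planning and evaluation. OMC and the present work operate at complementary levels: OMC at the organizational layer (how heterogeneous agents are recruited and reorganized), Thin Soul + Fat Skill at the architectural layer (how each agent's identity is decoupled from its capabilities). A soul file is a natural carrier for a Talent representation, and an organizational layer could be built on top of the architecture described here.
Memory and retrieval. Memanto [Abtahi et al., 2026; arXiv:2604.22085] proposes typed semantic memory with information-theoretic retrieval, reporting deterministic retrieval in under 90ms with no indexing step on the LongMemEval and LoCoMo benchmarks. We share the underlying conviction, explored further in our sibling work [The LocalKin Team, 2026], that retrieval does not require LLM inference. Memanto realizes this insight with 13 predefined typed memory categories and the Moorcheh information-theoretic search engine; our sibling paper instead retrieves from domain corpora with plain grep. The two approaches converge from different starting points on the same architectural principle: separating a deterministic retrieval substrate from the generative reasoning layer.
Document-conditioned generation. BERAG [Chen et al., 2026; arXiv:2604.22678] proposes Bayesian ensemble RAG for visual question answering: rather than concatenating retrieved documents into a single context window, it conditions the generator on each document independently and updates the document posterior token by token with Bayes' rule during generation. This suggests a path forward for multi-expert systems built on Thin Soul + Fat Skill: each soul-defined expert (for example, one of the 35 TCM masters in the deployment) could condition the generator independently, with conductor-mediated Bayesian aggregation relieving the context-window crowding of naive multi-agent debate.
Self-improvement loops. Reflexion [Shinn et al., 2023; arXiv:2303.11366] and Self-Refine [Madaan et al., 2023; arXiv:2303.17651] established that LLM agents can iteratively critique and improve their own outputs using verbal feedback alone. Our architecture extends this capability from the output layer to the identity layer: because soul files are declarative text, an agent (or a peer auditing agent) can directly modify another agent's soul prompt, so that improvements persist across sessions. This connects directly to our follow-up work on autonomous self-evolving swarms.
The Thin Soul + Fat Skill architecture shares the declarative philosophy of infrastructure-as-code systems such as Terraform and Kubernetes manifests, and applies it to agent definitions. The key insight, that agent identity and agent capability have fundamentally different runtime characteristics, appears to be novel in the multi-agent literature.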
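12. Limitations and Future Work
LLM-dependent routing. Conductor agents still rely on LLM calls to route queries to team members. A learned router based on embedding similarity could reduce common query types to deterministic lookups.
Cold start. The first request to each agent requires loading the soul file and initializing the conversation buffer. A pre-warming strategy could reduce first-request latency.
Skill composition. Composite skills handle linear pipelines, but more complex DAG-structured workflows would benefit from a formal composition language.
Evaluation rigor. The memory comparisons in this paper use direct runtime measurements rather than controlled benchmarks under identical workloads. Standardized multi-agent benchmarks would strengthen these claims.
Context-window pressure. As conversations grow, the token savings from a thin soul shrink as a share of the context, which is increasingly dominated by conversation history. Integrating summarization or retrieval-augmented methods for long conversations remains future work.
Autonomous self-evolution (follow-up work). A direct consequence of declarative soul files is that the swarm can edit its own agent definitions. We have observed this in the deployment: on 2026-04-26, a peer auditing agent identified an outdated formatting convention in quant_conductor v2.2.5 (inconsistent Chinese and English citation-label formats), automatically patched the soul file to v2.2.6, restarted the affected process, and verified that newly produced output complied, all without human intervention. The full mechanism, its safety boundaries, and a multi-week deployment study are the subject of a follow-up companion paper on self-evolving multi-agent swarms.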
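13. Conclusion
The Thin Soul + Fat Skill architecture demonstrates that the dominant cost in multi-agent systems is not the agents themselves but the framework overhead surrounding them. By reducing an agent to a declarative text file and executing tools as deterministic subprocesses, we achieve a 12-19× improvement in per-agent memory efficiency and enable 139 specialized agents to run concurrently on consumer hardware (Mac Mini M2, 16GB RAM, approximately 2.2GB total system memory; measured 2026-04-27).
The architecture's most consequential property may be its compatibility with autonomous self-evolution. Because soul files are declarative text, not code, they can be reliably read, understood, and modified by the very LLMs they configure. This creates a feedback loop in which the swarm improves its own agents, a capability that is impractical when agent definitions are embedded in Python class hierarchies.
The separation of identity from capability is a simple idea. Its compounding effects on memory, tokens, security, hot-swapping, machine modification, and team orchestration suggest that it is also a consequential one.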
How to cite this paper
Three formats are given below; pick the one that matches your venue.
@misc{localkin2026thin,
author = {{The LocalKin Team}},
title = {Thin Soul, Fat Skill: A Token-Efficient Architecture for Production Multi-Agent Systems},
year = {2026},
month = apr,
publisher = {Zenodo},
doi = {10.5281/zenodo.20094232},
url = {https://doi.org/10.5281/zenodo.20094232},
note = {Correspondence: contact@localkin.ai}
}
The LocalKin Team. (2026). Thin Soul, Fat Skill: A Token-Efficient Architecture for Production Multi-Agent Systems. Zenodo. https://doi.org/10.5281/zenodo.20094232
LocalKin Team, The. 2026. "Thin Soul, Fat Skill: A Token-Efficient Architecture for Production Multi-Agent Systems." Zenodo, April. https://doi.org/10.5281/zenodo.20094232.
See also
Sibling papers in the same thematic cluster — same conceptual neighbourhood, often cite each other.