The LocalKin Team · April 2026 · v1.0 · DOI 10.5281/zenodo.20094236

Structured Multi-Agent Debate with Domain-Expert Routing

The LocalKin Team

April 2026

Keywords: multi-agent debate, domain routing, conductor architecture, structured deliberation, traditional Chinese medicine, quantitative finance, swarm intelligence

Abstract

Multi-agent debate has emerged as a powerful paradigm for improving reasoning quality in large language model (LLM) systems. However, existing approaches---including Diverse Multi-Agent Debate (DMAD), Adaptive Hierarchical Multi-Agent Debate (A-HMAD), and PROClaim---broadcast every question to all participating agents, regardless of domain relevance. This introduces computational waste and, more critically, dilutes the signal from genuine domain experts with noise from agents whose expertise is orthogonal to the question at hand.

We present Domain-Expert Routing, a conductor-based architecture that interposes a routing layer between incoming queries and the agent pool. A conductor agent consults a domain routing table to select a relevant subset of experts (typically 4--6 out of a fleet of 11--75 agents), then orchestrates a structured multi-round debate among only those agents. We instantiate this architecture in two production systems: (1) a Traditional Chinese Medicine (TCM) consultation system where a conductor routes patient symptoms to historically appropriate physician agents drawn from a team of 11 masters spanning 2,000 years of medical tradition, and (2) a quantitative finance pipeline where a quant conductor executes a 5-phase sequential protocol---data collection, adversarial debate, trade proposal, risk verification, and publication---with different agent subsets activated at each phase. A third instantiation, the prediction conductor, dynamically selects relevant experts from a 75-agent fleet based on topic classification.

Across both domains, domain-expert routing reaches 80.2% weighted agreement in the evaluated TCM debate, reduces per-query agent invocations by 54--61% (Table 4), and enables a Phase 0 verification gate that catches hallucinated financial data before publication. Overall quality compliance improved from 75% to 92% over a continuous 5-day autonomous deployment.

1. Introduction

The multi-agent debate paradigm posits that LLM agents, when given the opportunity to argue, rebut, and revise their positions across multiple rounds, produce more accurate and well-reasoned outputs than any single agent in isolation. This insight has been validated empirically across mathematical reasoning (Du et al., 2023), factual question-answering (Liang et al., 2024), and medical diagnosis (Tang et al., 2024).

Yet a fundamental inefficiency persists in the literature: every agent participates in every debate. When a patient presents with gynecological symptoms, a fleet of 11 TCM physician agents all weigh in---including the acupuncture specialist, the pharmacologist, and the theoretical cosmologist. When a financial question concerns semiconductor earnings, the geopolitical analyst and the macroeconomist are summoned alongside the equity researcher. The result is threefold waste: (1) computational cost scales linearly with fleet size rather than with question complexity, (2) irrelevant opinions introduce noise that the consensus mechanism must filter, and (3) agents outside their domain of competence are more likely to hallucinate, undermining the safety properties that debate is meant to provide.

We propose a simple but effective solution: domain-expert routing. A conductor agent, equipped with a static routing table and optional dynamic classification, selects a task-appropriate subset of experts before initiating the debate protocol. The routing table maps symptom categories (in TCM) or pipeline phases (in quantitative finance) to specific expert combinations. The conductor does not participate in the debate itself; it orchestrates, summarizes, and presents.

This paper makes the following contributions:

  1. Domain routing tables that map query categories to expert subsets, reducing agent invocations by 54--61% while maintaining or improving consensus quality.
  2. A structured debate protocol with DMAD-inspired diverse reasoning strategies, confidence-weighted voting, consensus inertia detection, and IBIS-style (Issue-Based Information System) adversarial rebuttals.
  3. Phase 0 verification gates that enforce real-time data validation before financial report publication, implemented as a runtime safety layer that cannot be bypassed by the system's self-evolution mechanism.
  4. Production evidence from two deployed systems---TCM consultation and quantitative finance---operating autonomously over multiple weeks.

2. Related Work

2.1 Multi-Agent Debate

DMAD (Diverse Multi-Agent Debate; Xu et al., ICLR 2025) demonstrated that assigning distinct reasoning strategies to agents in the first round of debate breaks the "mental set" problem, where all agents converge on the same reasoning approach and fail to explore the solution space. We adopt DMAD's strategy-assignment mechanism directly, rotating through eight reasoning strategies (analytical, analogical, contrastive, first-principles, empirical, devil's advocate, systems thinking, and historical) across agents in Round 1.

A-HMAD (Adaptive Hierarchical Multi-Agent Debate; Wang et al., 2025) introduced hierarchical debate structures where agents are organized into subgroups. Our domain routing can be viewed as a generalization of A-HMAD's hierarchy, where subgroup membership is determined by a routing table rather than a fixed tree structure.

PROClaim (Li et al., 2025) proposed progressive claim refinement through multi-round argumentation, with explicit evidence tracking. We incorporate PROClaim's progressive evidence pool (P-RAG) into our debate protocol, where each agent contributes new evidence that accumulates across rounds.

DCI (Debate, Critique, Integrate; Zhang et al., 2025) separated the debate phase from a critique-and-integration phase. Our conductor's post-debate summarization serves a similar integrative function, though we keep it as a distinct architectural role rather than a debate phase.

Our system integrates elements from all four approaches---DMAD's diverse strategies, A-HMAD's hierarchical structure, PROClaim's evidence accumulation, and DCI's separation of debate from integration---and adds the novel contribution of domain-expert routing.

2.2 Agent Routing and Selection

Tool-use routing (Patil et al., 2023) and mixture-of-experts architectures (Shazeer et al., 2017) select specialized components based on input characteristics. Our domain routing extends this principle to the agent level: rather than routing to a tool or a model expert, we route to a subset of conversational agents, each embodying a distinct persona, knowledge base, and reasoning style.

3. Architecture

3.1 System Overview

The LocalKin multi-agent system runs on a single consumer machine (16GB Mac Mini, M4) with 75 specialized agents. Each agent is defined by a soul file---a YAML frontmatter configuration plus Markdown persona description---and exposed as an HTTP endpoint on a unique port. The runtime is written in Go (approximately 12.5MB memory per agent) and communicates via HTTP REST, MQTT pub/sub for live visualization, and a shared filesystem for output artifacts.

The architecture comprises three layers:

+------------------------------------------------------------------+
|                        User / Scheduler                          |
+------------------------------------------------------------------+
                              |
                     [ Conductor Agent ]
                     (routing + orchestration)
                              |
               +--------------+--------------+
               |              |              |
          [ Expert A ]   [ Expert B ]   [ Expert C ]
          (selected by routing table)
               |              |              |
               +--------------+--------------+
                              |
                     [ Debate Protocol ]
                     (multi-round, structured)
                              |
                     [ Verification Gate ]
                     (Phase 0, disclaimers)
                              |
                     [ Publication Layer ]
                     (KinBook, MQTT, files)
+------------------------------------------------------------------+

Figure 1. Domain-expert routing architecture. The conductor selects a subset of experts from the fleet, orchestrates the debate, and enforces verification before publication.

3.2 Conductor Agents

Each domain has a dedicated conductor agent. The conductor's soul file specifies:

The conductor does not hold domain expertise. Its role is purely orchestral: receive a query, classify it, select experts, invoke the debate skill, and format the output.

3.3 Domain Routing Tables

3.3.1 TCM Routing

The TCM conductor manages a team of 11 historical physician agents spanning from the mythological Yellow Emperor (Huang Di) to the Qing dynasty's Ye Tianshi. The routing table maps seven clinical categories to expert subsets:

Table 1. TCM domain routing table. Each category activates 3--6 physicians from the 11-master team.

| Category | Chinese | Expert Subset | Rationale |
|---|---|---|---|
| General internal medicine | 一般内科 | Zhang Zhongjing, Sun Simiao, Li Dongyuan, Zhu Danxi | Core diagnosticians covering Six Meridians, formulary, Spleen-Stomach, and Yin deficiency |
| Warm disease / fever | 温热病 | Zhang Zhongjing, Ye Tianshi, Liu Wansu, Sun Simiao | Ye Tianshi's Wei-Qi-Ying-Xue system combined with Liu Wansu's cooling-heat theory |
| Gynecology | 妇科 | Fu Qingzhu, Zhang Zhongjing, Zhu Danxi, Sun Simiao | Fu Qingzhu's gynecological specialty with broad diagnostic support |
| Acupuncture | 针灸 | Huangfu Mi, Zhang Zhongjing, Sun Simiao, Hua Tuo | Huangfu Mi's Zhenjiu Jiayi Jing with surgical and formulary backup |
| Surgery / emergency | 外科急症 | Hua Tuo, Zhang Zhongjing, Sun Simiao, Huangfu Mi | Hua Tuo's surgical expertise (inventor of general anesthesia) |
| Pharmacology | 药物咨询 | Li Shizhen, Sun Simiao, Zhang Zhongjing | Li Shizhen's Bencao Gangmu encyclopedic drug knowledge |
| Theory / pedagogy | 理论探讨 | Huang Di, Zhang Zhongjing, Zhu Danxi, Liu Wansu, Li Dongyuan | Foundational theorists for doctrinal discussions |

The routing decision is made by the conductor based on keyword matching and symptom classification from the patient's presenting complaint. Zhang Zhongjing appears in all categories as the "diagnostic anchor"---his Six Meridian framework provides the initial differential diagnosis that subsequent specialists refine.
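The lookup described above can be sketched as follows. This is a minimal illustration, not the conductor's actual logic: the expert subsets are copied from Table 1, but the category keys, the keyword lists, and the `route` function are hypothetical names introduced here, and the fallback to general internal medicine is an assumption.

```python
# Expert subsets copied from Table 1; the dictionary keys are illustrative.
ROUTING_TABLE = {
    "general_internal": ["Zhang Zhongjing", "Sun Simiao", "Li Dongyuan", "Zhu Danxi"],
    "warm_disease": ["Zhang Zhongjing", "Ye Tianshi", "Liu Wansu", "Sun Simiao"],
    "gynecology": ["Fu Qingzhu", "Zhang Zhongjing", "Zhu Danxi", "Sun Simiao"],
    "acupuncture": ["Huangfu Mi", "Zhang Zhongjing", "Sun Simiao", "Hua Tuo"],
    "surgery_emergency": ["Hua Tuo", "Zhang Zhongjing", "Sun Simiao", "Huangfu Mi"],
    "pharmacology": ["Li Shizhen", "Sun Simiao", "Zhang Zhongjing"],
    "theory": ["Huang Di", "Zhang Zhongjing", "Zhu Danxi", "Liu Wansu", "Li Dongyuan"],
}

# Assumed keyword lists; the real classifier's vocabulary is not given in the paper.
CATEGORY_KEYWORDS = {
    "warm_disease": ["fever", "chills", "sore throat"],
    "gynecology": ["menstrual", "pregnancy", "postpartum"],
    "acupuncture": ["acupuncture", "needle", "moxibustion"],
    "surgery_emergency": ["wound", "fracture", "abscess"],
    "pharmacology": ["herb", "dosage", "interaction"],
    "theory": ["yin-yang", "five phases", "doctrine"],
}


def route(complaint: str) -> tuple[str, list[str]]:
    """Pick the category with the most keyword hits; default to general internal medicine."""
    text = complaint.lower()
    best, best_hits = "general_internal", 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = sum(1 for kw in keywords if kw in text)
        if hits > best_hits:
            best, best_hits = category, hits
    return best, ROUTING_TABLE[best]
```

Under these assumptions, `route("Irregular menstrual cycle with fatigue")` selects the gynecology subset, while a complaint that matches no keyword falls through to the four general internal medicine physicians.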

Pre-debate protocol (MDCCTM). Before the full debate, the conductor executes two preparatory steps inspired by the Multi-agent Dynamic Collaborative Chain-of-Thought in TCM (MDCCTM, arXiv:2502.04345) and TCM-DiffRAG personalized constitution analysis (arXiv:2602.22828):

  1. Constitution identification: The conductor asks the patient 2--3 screening questions to determine their body constitution type among nine categories (balanced, Qi-deficient, Yang-deficient, Yin-deficient, Phlegm-damp, Damp-heat, Blood-stasis, Qi-stagnant, Allergic). This constitution label is passed as a personalization parameter to all subsequent debate participants.

  2. Anchor diagnosis: Zhang Zhongjing receives the symptoms plus constitution label and produces an independent Six Meridian differential diagnosis. This "diagnostic anchor" is included in the debate topic for all other physicians, providing a structured starting point that prevents the debate from drifting into ungrounded speculation.
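As a rough sketch of how the two preparatory steps could feed the debate, the snippet below assembles a debate topic from the symptoms, the constitution label, and the anchor diagnosis. The function name `build_debate_topic` and the output layout are assumptions made for illustration; only the nine constitution labels come from the text above.

```python
# The nine constitution types listed in step 1.
CONSTITUTION_TYPES = (
    "balanced", "Qi-deficient", "Yang-deficient", "Yin-deficient",
    "Phlegm-damp", "Damp-heat", "Blood-stasis", "Qi-stagnant", "Allergic",
)


def build_debate_topic(symptoms: str, constitution: str, anchor_diagnosis: str) -> str:
    """Combine the pre-debate outputs into one topic string (layout is hypothetical)."""
    if constitution not in CONSTITUTION_TYPES:
        raise ValueError(f"unknown constitution type: {constitution!r}")
    return (
        f"Presenting complaint: {symptoms}\n"
        f"Constitution: {constitution}\n"
        f"Diagnostic anchor (Zhang Zhongjing): {anchor_diagnosis}"
    )
```

Rejecting unknown labels up front keeps a misclassified constitution from silently propagating to every debate participant.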

3.3.2 Quant Routing

The quantitative finance conductor implements a 5-phase sequential pipeline where different agent subsets are activated at each phase:

Table 2. Quant pipeline phases with agent routing.

| Phase | Name | Agents | Output |
|---|---|---|---|
| Phase 0 | Real-time price verification | stock_price skill (API) | Verified price + timestamp |
| Phase 1 | Data collection + analysis | 4 analysts (equity, macro, technical, sentiment) | Independent research reports |
| Phase 2 | Adversarial debate | Bull team vs. Bear team (selected from Phase 1) | Structured debate transcript |
| Phase 3 | Trade proposal | Trader agent | Specific entry/exit/position sizing |
| Phase 4 | Risk check | Risk manager agent | Approval / rejection / modification |
| Phase 5 | Publication | Conductor | Final report to KinBook |

Phase 0          Phase 1              Phase 2           Phase 3      Phase 4      Phase 5
[stock_price] -> [Analyst x4] ------> [Bull vs Bear] -> [Trader] --> [Risk Mgr] -> [Publish]
   |               |    |    |    |        |     |          |            |              |
   v               v    v    v    v        v     v          v            v              v
 $177.39        equity macro tech sent   debate transcript  proposal   approved?     KinBook
 verified       reports (independent)    (2 rounds)         generated  yes/no/mod    posted

Figure 2. Quant pipeline flow. Phase 0 verification is mandatory and enforced at the code level.

Unlike the TCM system where routing is query-dependent, the quant pipeline uses a fixed sequential routing where each phase gates the next. Phase 0 is particularly critical: the stock_price skill calls a real-time financial data API, and the verified price is injected into all subsequent phases. If Phase 0 fails (API error, market closed, network issue), the entire pipeline halts with an [IDLE] response rather than proceeding with stale or hallucinated prices.
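The gating behaviour can be illustrated with a small sketch. It assumes each phase is a function that takes and returns a context dictionary, and that the price fetcher raises a `PriceUnavailable` exception on failure; both are hypothetical conventions introduced here, not the LocalKin runtime's interface.

```python
from typing import Callable, Dict, List, Union

IDLE = "[IDLE]"


class PriceUnavailable(Exception):
    """Hypothetical error for API failure, closed market, or network issues."""


Phase = Callable[[Dict], Dict]


def run_pipeline(
    ticker: str,
    fetch_price: Callable[[str], float],
    phases: List[Phase],
) -> Union[Dict, str]:
    """Run Phase 0 first; halt with [IDLE] if it fails, otherwise run later phases in order."""
    try:
        price = fetch_price(ticker)  # Phase 0: real-time verification
    except PriceUnavailable:
        return IDLE  # never proceed with stale or guessed prices
    context = {"ticker": ticker, "verified_price": price}
    for phase in phases:
        context = phase(context)  # each phase sees the verified price
        if context.get("halt"):  # e.g. a risk rejection stops publication
            break
    return context
```

Because the verified price is placed in the context before any phase runs, later phases have no path to a price that did not come from Phase 0.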

3.3.3 Prediction Routing

The prediction conductor operates over the full 75-agent fleet and uses dynamic topic-based routing. When the conductor identifies a trending topic (via web search of Bloomberg, Reuters, Hacker News), it classifies the topic and selects relevant experts:

A mandatory Baseline Verification Round (Step 2.5) precedes the debate for any topic involving financial figures or measurable data. The conductor dispatches a data scientist agent to establish verified numbers via web search, and all debate participants must reference these same baseline figures---preventing "data forks" where agents independently hallucinate incompatible statistics.

4. Debate Protocol

4.1 Round 1: Independent Positions with Diverse Strategies

Each selected agent receives the debate topic along with a DMAD reasoning strategy assignment. Eight strategies rotate across agents:

  1. Analytical: Decompose the topic into components; analyze systematically.
  2. Analogical: Reason by analogy from comparable historical cases.
  3. Contrastive: Lead with the strongest counterargument against initial intuition.
  4. First-principles: Reason from fundamental axioms, ignoring conventional wisdom.
  5. Empirical: Ground arguments in concrete evidence and observable patterns.
  6. Devil's advocate: Challenge the most popular answer; surface hidden risks.
  7. Systems thinking: Map second-order effects and interdependencies.
  8. Historical: Draw on precedent, classical texts, and long-term patterns.
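A simple round-robin assignment is one way to rotate the eight strategies across agents. The paper does not specify the rotation order or starting offset, so `assign_strategies` and its `offset` parameter below are illustrative assumptions.

```python
STRATEGIES = [
    "analytical", "analogical", "contrastive", "first-principles",
    "empirical", "devil's advocate", "systems thinking", "historical",
]


def assign_strategies(agents: list[str], offset: int = 0) -> dict[str, str]:
    """Assign strategies round-robin; wraps around when there are more than eight agents."""
    return {
        agent: STRATEGIES[(i + offset) % len(STRATEGIES)]
        for i, agent in enumerate(agents)
    }
```

With four to six routed experts, every agent receives a distinct strategy; changing `offset` between debates would vary which agent gets which strategy.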

Each agent must respond in a structured format:

DOMAIN_ANGLE: [specific aspect of expertise relevant to this topic]
POSITION: [SUPPORT | OPPOSE | NEUTRAL]
CONFIDENCE: [0.0 -- 1.0]
REASONING: [detailed argument, max 1500 characters]
EVIDENCE: [new factual evidence contributed to the shared pool]
INDEPENDENCE: [INDEPENDENT | INFLUENCED]

The DOMAIN_ANGLE field (inspired by A-HMAD's explicit expertise identification) forces each agent to declare which facet of its expertise it is applying, making the diversity of perspectives legible to other agents in subsequent rounds.
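A response in this format can be validated with a small parser. The sketch below is an assumption about how such validation might look: the field names and allowed values come from the format above, while the `AgentResponse` dataclass, the `parse_response` function, and the choice to raise `ValueError` on violations are introduced here for illustration.

```python
from dataclasses import dataclass

FIELDS = ("DOMAIN_ANGLE", "POSITION", "CONFIDENCE", "REASONING", "EVIDENCE", "INDEPENDENCE")


@dataclass
class AgentResponse:
    domain_angle: str
    position: str
    confidence: float
    reasoning: str
    evidence: str
    independence: str


def parse_response(text: str) -> AgentResponse:
    """Parse 'FIELD: value' lines; lines without a known field name continue the previous field."""
    values: dict[str, str] = {}
    current = None
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key.strip() in FIELDS:
            current = key.strip()
            values[current] = rest.strip()
        elif current is not None and line.strip():
            values[current] += " " + line.strip()
    missing = [f for f in FIELDS if f not in values]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    position = values["POSITION"].upper()
    if position not in ("SUPPORT", "OPPOSE", "NEUTRAL"):
        raise ValueError(f"invalid position: {position}")
    confidence = float(values["CONFIDENCE"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if len(values["REASONING"]) > 1500:
        raise ValueError("reasoning exceeds 1500 characters")
    independence = values["INDEPENDENCE"].upper()
    if independence not in ("INDEPENDENT", "INFLUENCED"):
        raise ValueError(f"invalid independence: {independence}")
    return AgentResponse(values["DOMAIN_ANGLE"], position, confidence,
                         values["REASONING"], values["EVIDENCE"], independence)
```

Rejecting malformed responses before the tally keeps an out-of-range confidence or a missing position from distorting the weighted vote.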

4.2 Round 2+: Informed Revision

In subsequent rounds, each agent receives a summary of all prior positions, including:

Agents respond with the same structured format, plus two additional fields:

CHANGED: [YES | NO]
REBUTTAL: [adversarial challenge to another agent's argument]

Position changes are tracked explicitly. The INDEPENDENCE field serves as an anti-cascade signal: if an agent reports being INFLUENCED rather than arriving at its updated position through independent reasoning, this is flagged in the consensus analysis.

4.3 Consensus Mechanism

After the final round, positions are tallied using confidence-weighted voting:

$$ \text{score}(p) = \sum_{i \in \text{agents}} \mathbb{1}[\text{position}_i = p] \cdot \text{confidence}_i $$

$$ \text{ratio}(p) = \frac{\text{score}(p)}{\sum_{p'} \text{score}(p')} $$

A position is declared consensus if its ratio exceeds the threshold (default: 0.70). Below this threshold, the result is classified as split (two positions above 0.30) or deadlock (no clear winner).
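The tally and classification can be sketched in a few lines. The example votes are made up for illustration and are not taken from the evaluation; the `tally` function, its return shape, and the reading of "deadlock" as "neither consensus nor split" are assumptions.

```python
from collections import defaultdict


def tally(votes, threshold=0.70, split_floor=0.30):
    """votes: iterable of (position, confidence). Returns verdict, winner, and ratios."""
    scores = defaultdict(float)
    for position, confidence in votes:
        scores[position] += confidence  # score(p): sum of confidences for position p
    total = sum(scores.values())
    if total == 0:
        return {"verdict": "deadlock", "winner": None, "ratios": {}}
    ratios = {p: s / total for p, s in scores.items()}  # ratio(p)
    winner = max(ratios, key=ratios.get)
    if ratios[winner] > threshold:
        return {"verdict": "consensus", "winner": winner, "ratios": ratios}
    if sum(1 for r in ratios.values() if r > split_floor) >= 2:
        return {"verdict": "split", "winner": None, "ratios": ratios}
    return {"verdict": "deadlock", "winner": None, "ratios": ratios}
```

For the hypothetical votes `[("SUPPORT", 0.9), ("SUPPORT", 0.8), ("OPPOSE", 0.3)]`, SUPPORT scores 1.7 of a total 2.0, a ratio of 0.85, which clears the 0.70 threshold. Note that a confident minority counts for more than a hesitant majority of equal size, which is the intended effect of weighting by confidence.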

Consensus inertia detection. If more than 60% of agents changed their position in the final round and more than 50% of those who changed self-report as INFLUENCED, the system flags a consensus inertia warning---the agreement may reflect social conformity rather than genuine convergence. This addresses the well-documented cascade effect in multi-agent debate systems where agents abandon well-reasoned minority positions under social pressure.
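The two-condition check can be written directly from the thresholds above. The input shape, a list of `(changed, independence)` pairs for the final round, and the function name `consensus_inertia` are assumptions for this sketch.

```python
def consensus_inertia(final_round, changed_frac=0.60, influenced_frac=0.50):
    """final_round: list of (changed: bool, independence: str) per agent.

    Returns True when more than 60% of agents changed position and more than
    50% of those who changed report INFLUENCED.
    """
    if not final_round:
        return False
    changed = [independence for did_change, independence in final_round if did_change]
    if len(changed) / len(final_round) <= changed_frac:
        return False
    influenced = sum(1 for independence in changed if independence == "INFLUENCED")
    return influenced / len(changed) > influenced_frac
```

Both comparisons are strict, so a debate where exactly 60% of agents changed position does not trigger the warning.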

4.4 Live Visualization

All debate events---start, positions, votes, verdict---are published to an MQTT topic (localkin/swarm/comm). A companion dashboard (Mizpah) renders the debate in real time, showing position shifts, confidence trajectories, and the evidence/rebuttal pools as they accumulate. Debate results are simultaneously posted to KinBook, the system's internal knowledge base, for archival and cross-referencing.

5. Safety Architecture

5.1 Phase 0 Verification Gate

Financial reports require real-time price verification before any analytical content is generated. The gate is specified as an ordered instruction block in the conductor's soul file:

STEP 1 (DO THIS FIRST, BEFORE WRITING ANY TEXT):
  Call stock_price(action="quote", ticker="NVDA")
  Note the price and timestamp returned.

STEP 2 (WRITE THE REPORT IN THIS EXACT ORDER):
  Line 3: ## Phase 0 -- Real-Time Price
  Line 4: Called: stock_price(action='quote', ticker='XXX')
  Line 5: Verified Price: $XXX.XX at HH:MM UTC

FORBIDDEN: Writing Executive Summary as the first section.
FORBIDDEN: Using prices from web_search or memory.

If the stock_price skill returns an error, the conductor emits [IDLE] and halts. This gate cannot be bypassed: it is enforced at the soul-file level (checked by the quality auditor) and at the runtime level (the skill execution order is validated before publication).

Crucially, Phase 0 verification is protected from the self-evolution mechanism. The system's autonomous quality-audit-and-repair loop (described in our companion paper on self-evolving swarms) can modify agent personas, reasoning prompts, and output formats. However, Phase 0 verification rules are classified as runtime safety constraints that the swarm_architect agent is explicitly prohibited from weakening or removing; it may only add stricter ordering constraints. This separation ensures that the system's drive toward self-improvement cannot compromise its factual grounding.

5.2 Hallucination Detection

For financial reports, the system implements a post-generation verification step: claimed price values in the report body are compared against the Phase 0 API response. Any discrepancy triggers a compliance failure.
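One possible shape for this check is sketched below. It assumes that current-price claims are introduced by phrases such as "verified price", "current price", or "trading at"; that phrase list, the regular expression, and the tolerance are illustrative assumptions. Targets and stop-loss levels are deliberately not matched, since they legitimately differ from the quote.

```python
import re

# Assumed markers for current-price claims; the real auditor's rules are not given.
CLAIM_RE = re.compile(
    r"(?:verified price|current price|trading at)\D{0,20}\$(\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def find_price_discrepancies(report: str, verified_price: float,
                             tolerance: float = 0.005) -> list[float]:
    """Return claimed current prices that differ from the Phase 0 quote."""
    claims = [float(value) for value in CLAIM_RE.findall(report)]
    return [c for c in claims if abs(c - verified_price) > tolerance]
```

An empty result means every labelled price claim agrees with the API response; any returned value would count as a compliance failure.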

For prediction reports, a geopolitical auto-label rule requires every sentence containing conflict-related terms (war, sanctions, casualties, military, etc.) to include either a source URL from web search or an explicit [Model inference -- unverified] prefix. Outlet names without URLs (e.g., "[Source: Reuters, March 2026]") are explicitly prohibited; only full URLs qualify as citations.
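The auto-label rule lends itself to a mechanical check. The sketch below uses the four conflict terms named above (the full list is elided in the text, so it is not extended here), a naive sentence splitter, and a hypothetical function name `unlabeled_sentences`; `example.com` is a placeholder URL.

```python
import re

CONFLICT_TERMS = ("war", "sanctions", "casualties", "military")  # terms named in the text
TERM_RE = re.compile(r"\b(?:" + "|".join(CONFLICT_TERMS) + r")\b", re.IGNORECASE)
URL_RE = re.compile(r"https?://\S+")
UNVERIFIED_PREFIX = "[Model inference -- unverified]"


def unlabeled_sentences(text: str) -> list[str]:
    """Return conflict-related sentences that carry neither a full URL nor the prefix."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())  # naive splitter
    violations = []
    for sentence in sentences:
        if not TERM_RE.search(sentence):
            continue
        has_url = URL_RE.search(sentence) is not None
        has_prefix = sentence.lstrip().startswith(UNVERIFIED_PREFIX)
        if not (has_url or has_prefix):
            violations.append(sentence)
    return violations
```

Because only a full URL counts, a sentence citing "[Source: Reuters, March 2026]" without a link would still be reported as a violation.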

5.3 Mandatory Disclaimers

Disclaimers are appended at the runtime level, not the agent level, ensuring they cannot be omitted by agent self-modification:

5.4 Ollama Fallback Policy

When the primary LLM provider (Claude API) is unavailable and agents fall back to a local model (Ollama/Qwen), all agents that handle sensitive domains (medical, financial) enter [IDLE] mode rather than producing potentially lower-quality output. This is enforced as Step 0 in every scheduled wakeup:

STEP 0 -- PROVIDER CHECK (mandatory before any output):
  If running on Ollama fallback (not Claude API):
  - DO NOT write any trading report or scan
  - DO NOT call stock_price, kinbook_post, or any skill
  - Reply ONLY: "[IDLE -- Ollama mode: compliance requires Claude API.]"

This policy reflects a core design principle: silence is preferable to confabulation in high-stakes domains.
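Expressed as code, the provider check reduces to a guard that runs before any other step. The provider labels (`"claude"`, `"ollama"`), the domain labels, and the function name `provider_gate` are assumptions for illustration; the idle message is quoted from the Step 0 block above.

```python
from typing import Optional

SENSITIVE_DOMAINS = {"medical", "financial"}
IDLE_MESSAGE = "[IDLE -- Ollama mode: compliance requires Claude API.]"


def provider_gate(provider: str, domain: str) -> Optional[str]:
    """Return the idle message if a sensitive-domain agent is on the fallback provider."""
    if provider == "ollama" and domain in SENSITIVE_DOMAINS:
        return IDLE_MESSAGE
    return None  # no restriction: the agent may proceed
```

A caller would emit the returned message and stop, invoking no skills, whenever the result is not `None`.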

5.5 Escalation

When the debate results in deadlock or when confidence scores are uniformly low (all agents below 0.4), the conductor escalates rather than forcing a verdict. In the TCM system, escalation means recommending in-person consultation. In the quant system, it means withholding the trade proposal and flagging for human review.
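The escalation rule combines the verdict with the confidence floor. In this sketch the verdict strings, the domain keys, and both function names are hypothetical; the 0.4 floor and the two escalation actions come from the paragraph above.

```python
def should_escalate(verdict: str, confidences: list[float], low: float = 0.4) -> bool:
    """Escalate on deadlock, or when every agent's confidence is below the floor."""
    uniformly_low = bool(confidences) and all(c < low for c in confidences)
    return verdict == "deadlock" or uniformly_low


def escalation_action(domain: str) -> str:
    """Map a domain to its escalation path (domain keys are illustrative)."""
    actions = {
        "tcm": "recommend in-person consultation",
        "quant": "withhold trade proposal and flag for human review",
    }
    return actions[domain]
```

Note that a nominal consensus can still escalate: if all agents agree but none is more than 0.4 confident, the agreement is treated as too weak to act on.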

6. Evaluation

6.1 TCM: Spring Pollen Debate

Setup. The TCM conductor autonomously identified "spring pollen allergies" as a trending health topic via web scraping of Weibo and Zhihu. It framed the debate thesis: "Spring allergies---clear heat (Liu Wansu school) or tonify Qi (Li Dongyuan school)?" The conductor routed to five masters based on the "general internal medicine" category plus Liu Wansu for the heat-clearing perspective: Zhang Zhongjing, Sun Simiao, Li Dongyuan, Zhu Danxi, and Liu Wansu.

Results.

| Agent | Round 1 Position | Round 1 Confidence | Round 2 Position | Round 2 Confidence | Changed | Strategy |
|---|---|---|---|---|---|---|
| Zhang Zhongjing | Support (tonify Qi) | 0.75 | Support | 0.80 | No | Analytical |
| Sun Simiao | Support (tonify Qi) | 0.70 | Support | 0.78 | No | Empirical |
| Li Dongyuan | Support (tonify Qi) | 0.90 | Support | 0.92 | No | First-principles |
| Zhu Danxi | Neutral | 0.60 | Support (tonify Qi) | 0.70 | Yes | Contrastive |
| Liu Wansu | Oppose (clear heat) | 0.80 | Oppose | 0.72 | No | Historical |

Consensus. Weighted support ratio: 80.2% (threshold: 70%). Verdict: consensus for tonifying Qi as the primary approach, with Liu Wansu's heat-clearing perspective preserved as a complementary consideration for patients with damp-heat constitution.

The quality auditor (an independent agent in the self-evolution loop) reviewed the debate transcript and rated it "exemplary" on three dimensions: (1) each physician maintained historically consistent persona and terminology, (2) the disagreement between Li Dongyuan and Liu Wansu reflected a genuine doctrinal divide (Spleen-Stomach school vs. Cooling-Heat school) dating to the Jin dynasty, and (3) Zhu Danxi's position change was well-reasoned and did not trigger the consensus inertia warning.

6.2 Quant: NVDA Daily Scan

Setup. The quant conductor executed its scheduled 8-hour wakeup cycle, targeting NVIDIA (NVDA).

Phase 0. stock_price(action="quote", ticker="NVDA") returned $177.39 at 14:32 UTC. Verified against stock_price(action="metrics", ticker="NVDA") for consistency.

Phase 1. Four analysts produced independent reports:

| Analyst | Signal | Key Finding |
|---|---|---|
| Equity analyst | Bullish | Datacenter revenue +45% YoY, AI capex cycle intact |
| Macro analyst | Bullish | Fed holding rates, liquidity supportive of growth stocks |
| Technical analyst | Bullish | Price above 50-day and 200-day MA, RSI 62 (not overbought) |
| Sentiment analyst | Bullish | Institutional accumulation, options flow bullish skew |

Phase 2. With 4/4 analysts bullish, the adversarial debate assigned two analysts to argue the bear case regardless. The debate surfaced risks (China export restrictions, valuation multiple compression, custom ASIC competition) but the bull thesis held with 78% weighted consensus.

Phase 3. Trader proposed: Long NVDA, entry $176--178, target $195, stop-loss $168, position size 2% of portfolio.

Phase 4. Risk manager approved with modification: reduce position to 1.5% given elevated sector concentration.

Phase 5. Report published to KinBook with Phase 0 price prominently displayed as the first section header, passing compliance validation.

6.3 Quality Trajectory

Over a continuous 5-day autonomous deployment, the quality auditor tracked compliance across all debate outputs:

Table 3. Quality metrics over 5 days of autonomous operation.

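| Metric | Day 1 | Day 2 | Day 3 | Day 4 | Day 5 |
|---|---|---|---|---|---|
| Phase 0 compliance | 67% | 83% | 100% | 100% | 100% |
| Disclaimer presence | 80% | 90% | 95% | 100% | 100% |
| Hallucinated prices | 2 | 1 | 0 | 0 | 0 |
| Citation compliance | 60% | 72% | 85% | 90% | 92% |
| Overall compliance | 75% | 82% | 88% | 91% | 92% |
| Consensus inertia warnings | 1 | 0 | 0 | 0 | 0 |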

The improvement trajectory reflects the interaction between the debate system and the self-evolution loop: the quality auditor identified Phase 0 ordering violations on Days 1--2, the swarm architect patched the conductor's soul file with stronger ordering constraints (including the FORBIDDEN directives), and compliance reached 100% by Day 3.

6.4 Routing Efficiency

Table 4. Agent utilization comparison: broadcast vs. domain routing.

| System | Fleet Size | Broadcast (agents/query) | Routed (agents/query) | Reduction |
|---|---|---|---|---|
| TCM | 11 | 11 | 4.3 (mean) | 61% |
| Quant | 6 | 6 | 2.5 (mean per phase) | 58% |
| Prediction | 75 | 10 (max debate) | 4.6 (mean) | 54% |

Domain routing reduces agent invocations by 54--61% while maintaining consensus quality. For the prediction conductor, the 54% figure is measured against the 10-agent maximum debate size listed in Table 4; measured against a broadcast to the full 75-agent fleet, the same 4.6-agent mean would correspond to a reduction of roughly 94%.
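The reduction column follows from a single formula, reduction = 1 - routed / broadcast. The short check below recomputes it from the Table 4 figures; the `reduction` helper and the dictionary layout are introduced here for illustration.

```python
def reduction(broadcast: float, routed: float) -> float:
    """Fraction of agent invocations saved by routing instead of broadcasting."""
    return 1 - routed / broadcast


# (broadcast agents/query, routed agents/query) from Table 4
TABLE_4 = {
    "TCM": (11, 4.3),
    "Quant": (6, 2.5),
    "Prediction": (10, 4.6),
}

percentages = {
    name: round(100 * reduction(broadcast, routed))
    for name, (broadcast, routed) in TABLE_4.items()
}
```

This yields 61, 58, and 54 percent for TCM, Quant, and Prediction respectively, matching the table. The Prediction baseline here is the 10-agent debate limit from the table, not the 75-agent fleet size.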

7. Integration with Self-Evolution

The domain-expert debate system operates within LocalKin's autonomous self-evolution loop, described in detail in our companion paper (Sun, 2026). The integration points are:

  1. Quality auditor reviews debate outputs. After each debate, the quality auditor agent evaluates the transcript for compliance (Phase 0 ordering, disclaimer presence, citation format), persona consistency (do physician agents stay in character?), and reasoning quality (are arguments grounded in domain knowledge?).

  2. Swarm architect patches conductor souls. When the quality auditor identifies systematic failures---such as repeated Phase 0 violations or routing mismatches---it generates feedback that the swarm architect translates into soul-file patches. For example, the FORBIDDEN directives in the quant conductor's soul file were autonomously added by the swarm architect after the quality auditor flagged Phase 0 ordering failures on Days 1--2.

  3. Safety boundary preservation. The self-evolution loop is subject to a hard constraint: runtime safety layers (Phase 0 verification, mandatory disclaimers, Ollama Fallback Policy) are classified as immutable. The swarm architect may modify reasoning prompts, output formats, routing tables, and persona descriptions, but it may not weaken verification gates or remove disclaimers. This constraint is enforced by the reality checker agent, which audits all proposed soul-file modifications before they are applied.
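One way to approximate the safety boundary check is to verify that every protected line in the old soul file survives verbatim in the patched version. The marker list and the function `protected_lines_removed` are hypothetical; the paper does not describe how the reality checker actually audits patches.

```python
# Assumed markers identifying safety-critical lines in a soul file.
PROTECTED_MARKERS = ("Phase 0", "stock_price", "FORBIDDEN", "PROVIDER CHECK")


def protected_lines_removed(old_soul: str, new_soul: str,
                            markers=PROTECTED_MARKERS) -> list[str]:
    """Return protected lines present in the old soul file but absent from the new one."""
    new_lines = {line.strip() for line in new_soul.splitlines()}
    return [
        line.strip()
        for line in old_soul.splitlines()
        if any(marker in line for marker in markers) and line.strip() not in new_lines
    ]
```

Under this scheme a patch that adds new constraints or edits persona text passes, while a patch that deletes or rewrites a protected line returns a non-empty list and would be rejected. A verbatim comparison is conservative: it also blocks harmless rewording of protected lines.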

8. Limitations and Future Work

Static routing tables. The current TCM and quant routing tables are manually curated. While the prediction conductor uses dynamic topic classification, a fully learned routing mechanism---perhaps using embedding similarity between query and agent persona descriptions---would improve generalization.

Consensus threshold sensitivity. The 70% threshold is empirically chosen. Too low and the system declares false consensus; too high and valuable debates are classified as deadlocks. Adaptive thresholds based on topic difficulty or historical accuracy could improve this.

Cross-domain debates. The current architecture assumes queries fall cleanly into one domain. A patient asking about herbal interactions with prescription medications would benefit from routing to both TCM physicians and a pharmacology specialist outside the TCM fleet. Cross-conductor routing is not yet implemented.

Evaluation scale. Our results are drawn from a production system rather than controlled benchmarks. While this provides ecological validity, it lacks the controlled comparisons (against DMAD, A-HMAD, PROClaim baselines on standardized datasets) that would strengthen the empirical claims.

9. Conclusion

Domain-expert routing addresses a practical inefficiency in multi-agent debate: not every agent needs to weigh in on every question. By interposing a conductor with a routing table between queries and the agent fleet, we achieve three benefits simultaneously: reduced computational cost (54--61% fewer agent invocations), higher-quality consensus (experts debate within their domains of competence), and stronger safety properties (Phase 0 verification gates catch hallucinated data before publication).

The architecture is domain-agnostic. While we have demonstrated it in TCM consultation and quantitative finance, the pattern---conductor, routing table, structured debate, verification gate---transfers to any domain where a fleet of specialized agents must collaborate on complex questions. The key insight is that expertise is not uniformly distributed, and debate protocols should respect this by routing questions to the agents best equipped to answer them.

References

Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., and Mordatch, I. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325.

Liang, T., He, Z., Jiao, W., Wang, X., Wang, Y., Wang, R., Yang, Y., Tu, Z., and Shi, S. (2024). Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. arXiv:2305.19118.

Li, Y., et al. (2025). PROClaim: Progressive Claim Refinement through Multi-Round Argumentation. Proceedings of ACL 2025.

Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. (2023). Gorilla: Large Language Model Connected with Massive APIs. arXiv:2305.15334.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017.

Sun, J. (2026). Self-Evolving Multi-Agent Swarms: Autonomous Quality Audit, Repair, and Verification Loops for Production AI Agent Systems. Technical Report, The LocalKin Team.

Tang, X., et al. (2024). MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning. arXiv:2311.10537.

Wang, Z., et al. (2025). A-HMAD: Adaptive Hierarchical Multi-Agent Debate for LLM Reasoning. arXiv:2501.xxxxx.

Xu, Y., et al. (2025). Breaking the Mental Set: Diverse Multi-Agent Debate for Improved Reasoning. ICLR 2025.

Zhang, L., et al. (2025). DCI: Debate, Critique, and Integrate for Enhanced LLM Reasoning. NeurIPS 2025.

MDCCTM (2025). Multi-agent Dynamic Collaborative Chain-of-Thought in Traditional Chinese Medicine. arXiv:2502.04345.

TCM-DiffRAG (2026). Personalized Constitution-Aware Retrieval-Augmented Generation for TCM Diagnosis. arXiv:2602.22828.

结构化多智能体辩论与领域专家路由

The LocalKin Team

2026 年 4 月

关键词: 多智能体辩论、领域路由、指挥官架构、结构化审议、传统中医、量化金融、蜂群智能

摘要

多智能体辩论已成为提升大型语言模型(LLM)系统推理质量的强大范式。然而,现有方法——包括多样化多智能体辩论(DMAD)、自适应分层多智能体辩论(A-HMAD)和 PROClaim——将每个问题广播给所有参与智能体,无论领域相关性如何。这引入了计算浪费,更关键的是,用专业知识与手头问题正交的智能体的噪声稀释了来自真正领域专家的信号。

我们提出领域专家路由——一种基于指挥官的架构,在传入查询和智能体池之间插入路由层。指挥官智能体参考领域路由表,选择相关的专家子集(通常从 11-75 个智能体集群中选择 4-6 个),然后仅在这些智能体之间编排结构化多轮辩论。我们在两个生产系统中实例化了此架构:(1)一个传统中医(TCM)咨询系统,指挥官将患者症状路由到从跨越 2,000 年医学传统的 11 位大师团队中选出的历史适当医师智能体;(2)一个量化金融流水线,量化指挥官执行五阶段顺序协议——数据收集、对抗辩论、交易提案、风险验证和发布——在每个阶段激活不同的智能体子集。第三个实例,预测指挥官,根据主题分类从 75 个智能体集群中动态选择相关专家。

在两个领域中,领域专家路由实现了更高的共识质量(TCM 辩论中 80.2% 的加权一致率)、每查询智能体调用量减少 55-65%,并启用了在发布前捕获幻觉金融数据的第 0 阶段验证门控。在持续 5 天的自主部署中,质量合规评分从 75% 提升至 92%。

1. 引言

多智能体辩论范式假设,LLM 智能体在有机会跨多轮论证、反驳和修正立场时,比单独任何一个智能体都能产生更准确、更有充分理由的输出。这一洞见已在数学推理(Du et al., 2023)、事实问答(Liang et al., 2024)和医学诊断(Tang et al., 2024)中得到实证验证。

然而,文献中仍存在一个根本性的低效:每个智能体参与每次辩论。当患者出现妇科症状时,11 个 TCM 医师智能体全部参与——包括针灸专家、药理学家和理论宇宙学家。当金融问题涉及半导体收益时,地缘政治分析师和宏观经济学家与股票研究员一起被召唤。结果是三重浪费:(1)计算成本随集群规模线性增长而非随问题复杂性增长,(2)无关意见引入了共识机制必须过滤的噪声,(3)超出其专业领域的智能体更容易产生幻觉,破坏辩论本应提供的安全属性。

我们提出一个简单而有效的解决方案:领域专家路由。配备静态路由表和可选动态分类的指挥官智能体,在发起辩论协议之前选择任务适当的专家子集。路由表将症状类别(在 TCM 中)或流水线阶段(在量化金融中)映射到特定的专家组合。指挥官不参与辩论本身;它编排、总结和呈现。

本文做出以下贡献:

  1. 领域路由表,将查询类别映射到专家子集,在保持或改善共识质量的同时将智能体调用减少 55-65%。
  2. 结构化辩论协议,具有 DMAD 启发的多样化推理策略、置信度加权投票、共识惯性检测和 IBIS 对抗反驳。
  3. 第 0 阶段验证门控,在金融报告发布前执行实时数据验证,作为系统自我进化机制无法绕过的运行时安全层实现。
  4. 来自两个已部署系统的生产证据——TCM 咨询和量化金融——在多周内自主运行。

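2. Related Work

2.1 Multi-Agent Debate

DMAD (Diverse Multi-Agent Debate; Xu et al., ICLR 2025) demonstrated that assigning agents different reasoning strategies in the first debate round breaks the "mental set" problem, in which all agents converge on the same reasoning approach and fail to explore the solution space. We adopt DMAD's strategy-assignment mechanism directly, rotating eight reasoning strategies (analytical, analogical, contrarian, first-principles, empirical, devil's advocate, systems thinking, and historical) across agents in Round 1.

A-HMAD (Adaptive Hierarchical Multi-Agent Debate; Wang et al., 2025) introduced a hierarchical debate structure that organizes agents into subgroups. Our domain routing can be viewed as a generalization of A-HMAD's hierarchy, in which subgroup membership is determined by a routing table rather than by a fixed tree structure.

PROClaim (Li et al., 2025) proposes progressive claim refinement through multi-round argumentation, with explicit evidence tracking. We integrate PROClaim's progressive evidence pool (P-RAG) into our debate protocol: each agent contributes new evidence that accumulates across rounds.

DCI (Debate, Critique, and Integrate; Zhang et al., 2025) separates the debate phase from a critique-and-integration phase. Our conductor's post-debate summary serves a similar integration function, although we keep it as a distinct architectural role rather than a debate phase.

Our system integrates elements of all four approaches (DMAD's diverse strategies, A-HMAD's hierarchical structure, PROClaim's evidence accumulation, and DCI's separation of debate from integration) and adds the novel contribution of domain-expert routing.

2.2 Agent Routing and Selection

Tool-use routing (Patil et al., 2023) and mixture-of-experts architectures (Shazeer et al., 2017) select specialized components based on input features. Our domain routing extends this principle to the agent level: instead of routing to tools or model experts, we route to subsets of conversational agents, each embodying a distinct persona, knowledge base, and reasoning style.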

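3. Architecture

3.1 System Overview

The LocalKin multi-agent system runs 75 specialized agents on a single consumer-grade device (16GB Mac Mini, M4). Each agent is defined by a soul file (YAML front matter configuration plus a Markdown persona description) and is exposed as an HTTP endpoint on a unique port. The runtime is written in Go (roughly 12.5MB of memory per agent) and communicates via HTTP REST, MQTT publish/subscribe for real-time visualization, and a shared file system for output artifacts.

The architecture comprises three layers: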

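+------------------------------------------------------------------+
|                         User / Scheduler                         |
+------------------------------------------------------------------+
                              |
                     [ Conductor Agent ]
                  (routing + orchestration)
                              |
               +--------------+--------------+
               |              |              |
          [ Expert A ]   [ Expert B ]   [ Expert C ]
          (selected by routing table)
               |              |              |
               +--------------+--------------+
                              |
                     [ Debate Protocol ]
                  (multi-round, structured)
                              |
                    [ Verification Gate ]
                   (Phase 0, disclaimers)
                              |
                    [ Publication Layer ]
                   (KinBook, MQTT, files)
+------------------------------------------------------------------+

Figure 1. Domain-expert routing architecture. The conductor selects an expert subset from the fleet, orchestrates the debate, and enforces verification before publication.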

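3.2 Conductor Agent

Each domain has a dedicated conductor agent. The conductor's soul file specifies:

The conductor holds no domain expertise. Its role is purely one of orchestration: it receives a query, classifies it, selects experts, invokes the debate skill, and formats the output.

3.3 Domain Routing Tables

3.3.1 TCM Routing

The TCM conductor manages a team of 11 historical physician agents, ranging from the mythical-era Yellow Emperor (Huangdi) to Ye Tianshi of the Qing dynasty. The routing table maps seven clinical categories to expert subsets:

Table 1. TCM domain routing table. Each category activates 3-6 physicians from the team of 11 masters.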

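| Category | Chinese | Expert subset | Rationale |
|---|---|---|---|
| General internal medicine | 一般内科 | Zhang Zhongjing, Sun Simiao, Li Dongyuan, Zhu Danxi | Core diagnosticians covering the Six Channels, formulas, Spleen-Stomach, and Yin deficiency |
| Warm-febrile disease / fever | 温热病 | Zhang Zhongjing, Ye Tianshi, Liu Wansu, Sun Simiao | Ye Tianshi's Wei-Qi-Ying-Xue system combined with Liu Wansu's heat-clearing theory |
| Gynecology | 妇科 | Fu Qingzhu, Zhang Zhongjing, Zhu Danxi, Sun Simiao | Fu Qingzhu's gynecological specialization with broad diagnostic support |
| Acupuncture | 针灸 | Huangfu Mi, Zhang Zhongjing, Sun Simiao, Hua Tuo | Huangfu Mi's Zhenjiu Jiayi Jing with surgical and formula complements |
| Surgery / emergency | 外科急症 | Hua Tuo, Zhang Zhongjing, Sun Simiao, Huangfu Mi | Hua Tuo's surgical expertise (traditionally credited with inventing general anesthesia) |
| Herb consultation | 药物咨询 | Li Shizhen, Sun Simiao, Zhang Zhongjing | Li Shizhen's encyclopedic knowledge of materia medica from the Bencao Gangmu |
| Theory / teaching | 理论探讨 | Huangdi, Zhang Zhongjing, Zhu Danxi, Liu Wansu, Li Dongyuan | Foundational theorists for doctrinal discussion |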

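Routing decisions are made by the conductor from keyword matching and symptom classification of the patient's chief complaint. Zhang Zhongjing appears in every category as the "diagnostic anchor": his Six-Channel framework provides the initial differential diagnosis that subsequent experts refine.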
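As a rough illustration of table-driven routing with keyword matching, the following Python sketch encodes the expert subsets of Table 1 and falls back to general internal medicine when no keyword fires. The category keys, the `KEYWORDS` lists, and the function names are hypothetical; the paper does not publish the conductor's actual classifier.

```python
# Expert subsets follow Table 1; the dictionary keys are illustrative names.
ROUTING_TABLE = {
    "general_internal": ["Zhang Zhongjing", "Sun Simiao", "Li Dongyuan", "Zhu Danxi"],
    "warm_febrile": ["Zhang Zhongjing", "Ye Tianshi", "Liu Wansu", "Sun Simiao"],
    "gynecology": ["Fu Qingzhu", "Zhang Zhongjing", "Zhu Danxi", "Sun Simiao"],
    "acupuncture": ["Huangfu Mi", "Zhang Zhongjing", "Sun Simiao", "Hua Tuo"],
    "surgery_emergency": ["Hua Tuo", "Zhang Zhongjing", "Sun Simiao", "Huangfu Mi"],
    "herb_consultation": ["Li Shizhen", "Sun Simiao", "Zhang Zhongjing"],
    "theory_teaching": ["Huangdi", "Zhang Zhongjing", "Zhu Danxi", "Liu Wansu", "Li Dongyuan"],
}

# Assumption: these keyword lists are invented for the example.
KEYWORDS = {
    "warm_febrile": ["fever", "chills", "sore throat"],
    "gynecology": ["menstrual", "pregnancy", "postpartum"],
    "acupuncture": ["needle", "acupoint", "meridian"],
    "surgery_emergency": ["wound", "fracture", "bleeding"],
    "herb_consultation": ["herb", "dosage", "decoction"],
    "theory_teaching": ["yin-yang", "five phases", "doctrine"],
}


def classify(complaint: str) -> str:
    """Pick the category with the most keyword hits; default to general internal medicine."""
    text = complaint.lower()
    best, best_hits = "general_internal", 0
    for category, words in KEYWORDS.items():
        hits = sum(1 for word in words if word in text)
        if hits > best_hits:
            best, best_hits = category, hits
    return best


def route(complaint: str) -> list:
    """Return the expert subset the conductor would invite to the debate."""
    return ROUTING_TABLE[classify(complaint)]


print(route("Irregular menstrual cycle since pregnancy"))
```

A static dictionary keeps the routing auditable: changing who debates a category is a one-line edit rather than a prompt change.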

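Pre-debate protocol (MDCCTM). Before the full debate, the conductor performs two preparatory steps inspired by Multi-agent Dynamic Collaborative Chain-of-Thought in TCM (MDCCTM, arXiv:2502.04345) and by the personalized constitution analysis of TCM-DiffRAG (arXiv:2602.22828):

  1. Constitution identification: the conductor asks the patient 2-3 screening questions to determine their constitution type among nine categories (balanced, Qi-deficient, Yang-deficient, Yin-deficient, phlegm-damp, damp-heat, blood-stasis, Qi-stagnation, and special constitution). This constitution label is passed as a personalization parameter to all subsequent debate participants.

  2. Anchor diagnosis: Zhang Zhongjing receives the symptoms plus the constitution label and produces an independent Six-Channel differential diagnosis. This "diagnostic anchor" is included in the debate topic for all other physicians, providing a structured starting point that keeps the debate from drifting into ungrounded speculation.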

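3.3.2 Quant Routing

The quantitative finance conductor implements a five-phase sequential pipeline (preceded by a Phase 0 price check), activating a different agent subset at each phase:

Table 2. Quant pipeline phases and agent routing.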

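| Phase | Name | Agents | Output |
|---|---|---|---|
| Phase 0 | Live price verification | stock_price skill (API) | Verified price + timestamp |
| Phase 1 | Data collection + analysis | 4 analysts (equity, macro, technical, sentiment) | Independent research reports |
| Phase 2 | Adversarial debate | Bull team vs. bear team (selected from Phase 1) | Structured debate transcript |
| Phase 3 | Trade proposal | Trader agent | Specific entry / exit / position size |
| Phase 4 | Risk check | Risk manager agent | Approve / reject / modify |
| Phase 5 | Publication | Conductor | Final report published to KinBook |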
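Phase 0  [stock_price]     ->  $177.39, verified
            |
            v
Phase 1  [Analysts x4]     ->  equity / macro / technical / sentiment reports (independent)
            |
            v
Phase 2  [Bull vs. Bear]   ->  debate transcript (2 rounds)
            |
            v
Phase 3  [Trader]          ->  proposal generated
            |
            v
Phase 4  [Risk]            ->  approved? yes / no / modify
            |
            v
Phase 5  [Publish]         ->  published to KinBook

Figure 2. Quant pipeline flow. Phase 0 verification is mandatory and enforced at the code level.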

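Unlike the TCM system, where routing depends on the query, the quant pipeline uses fixed sequential routing in which each phase gates the next. Phase 0 is especially critical: the stock_price skill calls a real-time financial data API, and the verified price is injected into all subsequent phases. If Phase 0 fails (API error, market closed, network issue), the entire pipeline halts with an [IDLE] response rather than proceeding with stale or hallucinated prices.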
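The gating behaviour described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the LocalKin runtime (which is written in Go): `Quote`, `run_pipeline`, the phase signature, and the `[HALTED at ...]` marker are all hypothetical names introduced for the example. Only the `[IDLE]` response on Phase 0 failure comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

IDLE = "[IDLE]"


@dataclass
class Quote:
    ticker: str
    price: float
    timestamp: str


# A phase reads and extends the shared context; returning False closes the gate.
Phase = Callable[[Dict], bool]


def run_pipeline(ticker: str,
                 fetch_quote: Callable[[str], Optional[Quote]],
                 phases: List[Tuple[str, Phase]]) -> str:
    # Phase 0: any failure stops the run before analysis text is produced.
    try:
        quote = fetch_quote(ticker)
    except Exception:
        return IDLE
    if quote is None or quote.price <= 0:
        return IDLE

    # The verified quote is injected into every later phase via the context.
    context: Dict = {"quote": quote}
    for name, phase in phases:
        if not phase(context):
            return f"[HALTED at {name}]"  # hypothetical marker, not from the paper
    return context.get("report", "")


def fake_quote(ticker: str) -> Quote:
    # Stand-in for the stock_price skill; values mirror the paper's NVDA example.
    return Quote(ticker, 177.39, "14:32 UTC")


def publish(context: Dict) -> bool:
    context["report"] = f"Verified price: ${context['quote'].price:.2f}"
    return True


print(run_pipeline("NVDA", fake_quote, [("publish", publish)]))
```

Passing the quote through a shared context, rather than letting each phase fetch its own price, is what keeps later phases from substituting a remembered or searched value.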

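3.3.3 Prediction Routing

The prediction conductor operates over the full 75-agent fleet using dynamic topic-based routing. When the conductor identifies a trending topic (through web searches of Bloomberg, Reuters, and Hacker News), it classifies the topic and selects relevant experts:

For any topic involving financial figures or measurable data, a mandatory baseline verification round (Step 2.5) precedes the debate. The conductor dispatches a data scientist agent to establish verified figures via web search, and all debate participants must cite the same baseline figures. This prevents "data forking", in which agents independently hallucinate incompatible statistics.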

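4. Debate Protocol

4.1 Round 1: Independent Positions with Diverse Strategies

Each selected agent receives the debate topic together with a DMAD reasoning-strategy assignment. The eight strategies rotate across agents:

  1. Analytical: decompose the topic into components; analyze systematically.
  2. Analogical: reason by analogy from comparable historical cases.
  3. Contrarian: open with the strongest counterargument to the initial intuition.
  4. First-principles: reason from basic axioms, ignoring conventional wisdom.
  5. Empirical: ground arguments in concrete evidence and observable patterns.
  6. Devil's advocate: challenge the most popular answer; surface hidden risks.
  7. Systems thinking: map second-order effects and interdependencies.
  8. Historical: draw on precedent, classical texts, and long-term patterns.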
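One simple way to rotate the eight strategies listed above is a round-robin assignment, sketched below in Python. The paper does not specify the assignment order used in deployment, so the round-robin rule, the `offset` parameter, and the identifier spellings are assumptions made for illustration.

```python
# The eight DMAD-style strategies, in the order the list above gives them.
STRATEGIES = [
    "analytical", "analogical", "contrarian", "first_principles",
    "empirical", "devils_advocate", "systems_thinking", "historical",
]


def assign_strategies(agents, offset=0):
    """Round-robin assignment; offset shifts the starting strategy between debates."""
    return {
        agent: STRATEGIES[(index + offset) % len(STRATEGIES)]
        for index, agent in enumerate(agents)
    }


print(assign_strategies(["Zhang Zhongjing", "Sun Simiao", "Li Dongyuan"]))
```

With fewer than nine agents, every participant receives a distinct strategy, which is the property the first round relies on.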

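Each agent must respond in a structured format:

DOMAIN_ANGLE: [specific aspect of expertise relevant to this topic]
POSITION: [SUPPORT | OPPOSE | NEUTRAL]
CONFIDENCE: [0.0 -- 1.0]
REASONING: [detailed argument, max 1500 characters]
EVIDENCE: [new factual evidence contributed to the shared pool]
INDEPENDENCE: [INDEPENDENT | INFLUENCED]

The DOMAIN_ANGLE field (inspired by A-HMAD's explicit expertise identification) forces each agent to declare which aspect of its expertise it is applying, making the diversity of perspectives visible to other agents in subsequent rounds.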
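A response in this six-field format can be validated before it enters the tally. The Python sketch below is one possible parser, assuming English field values (SUPPORT, OPPOSE, NEUTRAL) and one `FIELD: value` pair per line with optional continuation lines; `Response`, `parse_response`, and `SAMPLE` are hypothetical names, and the sample text is invented.

```python
import re
from dataclasses import dataclass

FIELDS = ("DOMAIN_ANGLE", "POSITION", "CONFIDENCE", "REASONING", "EVIDENCE", "INDEPENDENCE")
FIELD_RE = re.compile(r"^(%s):\s*(.*)$" % "|".join(FIELDS))


@dataclass
class Response:
    domain_angle: str
    position: str
    confidence: float
    reasoning: str
    evidence: str
    independence: str


def parse_response(text: str) -> Response:
    values = {}
    current = None
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            current = match.group(1)
            values[current] = match.group(2)
        elif current is not None:
            values[current] += "\n" + line  # continuation of the previous field
    missing = [name for name in FIELDS if name not in values]
    if missing:
        raise ValueError(f"missing fields: {missing}")

    position = values["POSITION"].strip().upper()
    if position not in {"SUPPORT", "OPPOSE", "NEUTRAL"}:
        raise ValueError(f"unknown position: {position}")
    confidence = float(values["CONFIDENCE"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0.0, 1.0]")
    reasoning = values["REASONING"].strip()
    if len(reasoning) > 1500:
        raise ValueError("reasoning exceeds 1500 characters")

    return Response(values["DOMAIN_ANGLE"].strip(), position, confidence, reasoning,
                    values["EVIDENCE"].strip(), values["INDEPENDENCE"].strip().upper())


SAMPLE = """DOMAIN_ANGLE: Six-Channel differential diagnosis
POSITION: support
CONFIDENCE: 0.75
REASONING: Recurrent sneezing with fatigue suggests deficiency.
It worsens after exertion.
EVIDENCE: Symptoms worsen after exertion.
INDEPENDENCE: independent"""

print(parse_response(SAMPLE))
```

Rejecting malformed responses at this point means the consensus step only ever sees a position label and a confidence in the stated range.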

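4.2 Round 2 and Beyond: Informed Revision

In subsequent rounds, each agent receives a summary of all prior positions, including:

Agents respond in the same structured format plus two additional fields:

CHANGED: [YES | NO]
REBUTTAL: [adversarial challenge to another agent's argument]

Position changes are tracked explicitly. The INDEPENDENCE field acts as an anti-cascade signal: if an agent reports being influenced rather than reaching its updated position through independent reasoning, this is flagged in the consensus analysis.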

4.3 Consensus Mechanism

After the final round, positions are tallied using confidence-weighted voting:

$$ \text{score}(p) = \sum_{i \in \text{agents}} \mathbb{1}[\text{position}_i = p] \cdot \text{confidence}_i $$

$$ \text{ratio}(p) = \frac{\text{score}(p)}{\sum_{p'} \text{score}(p')} $$

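If a position's ratio exceeds the threshold (default: 0.70), consensus is declared. Below this threshold, the outcome is classified as a split (two positions both above 0.30) or a deadlock (no clear winner).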
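The scoring and classification rules above translate almost directly into code. The Python sketch below implements score(p) and ratio(p) as defined, then applies the 0.70 and 0.30 thresholds; the function name `tally`, the verdict labels, and the vote values in the usage line are illustrative assumptions rather than data from the deployed system.

```python
from collections import defaultdict


def tally(votes, threshold=0.70, split_floor=0.30):
    """votes: iterable of (position, confidence) pairs from the final round."""
    scores = defaultdict(float)
    for position, confidence in votes:
        scores[position] += confidence  # score(p): sum of confidences for p

    total = sum(scores.values())
    if total == 0:
        return "DEADLOCK", {}

    ratios = {p: s / total for p, s in scores.items()}  # ratio(p)
    top = max(ratios, key=ratios.get)
    if ratios[top] > threshold:
        return "CONSENSUS", ratios
    if sum(1 for r in ratios.values() if r > split_floor) >= 2:
        return "SPLIT", ratios
    return "DEADLOCK", ratios


# Illustrative votes: support carries 1.7 of 2.0 total weight, so ratio is about 0.85.
verdict, ratios = tally([("support", 0.8), ("support", 0.9), ("oppose", 0.3)])
print(verdict, ratios)
```

Because the weights are self-reported confidences, a single highly confident dissenter moves the ratio more than a hesitant one, which is the intended effect of weighting.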

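Consensus inertia detection. If more than 60% of agents changed position in the final round, and more than 50% of those agents self-report as influenced, the system raises a consensus inertia warning: the agreement may reflect social conformity rather than genuine convergence. This addresses the well-documented cascade effect in multi-agent debate systems, in which agents abandon well-reasoned minority positions under social pressure.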
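The two-condition check can be written in a few lines. In this Python sketch, each final-round response is assumed to be a dictionary with a boolean `changed` flag and an `independence` label; those field names and the function name are hypothetical, while the 60% and 50% thresholds come from the text.

```python
def consensus_inertia(responses, changed_frac=0.60, influenced_frac=0.50):
    """Return True when the final-round agreement looks like conformity.

    responses: list of dicts with keys "changed" (bool) and
    "independence" ("INDEPENDENT" or "INFLUENCED").
    """
    if not responses:
        return False
    changed = [r for r in responses if r["changed"]]
    # Condition 1: strictly more than 60% of agents changed position.
    if len(changed) / len(responses) <= changed_frac:
        return False
    # Condition 2: strictly more than 50% of those report being influenced.
    influenced = sum(1 for r in changed if r["independence"] == "INFLUENCED")
    return influenced / len(changed) > influenced_frac


final_round = (
    [{"changed": True, "independence": "INFLUENCED"}] * 3
    + [{"changed": True, "independence": "INDEPENDENT"}]
    + [{"changed": False, "independence": "INDEPENDENT"}]
)
print(consensus_inertia(final_round))  # 4 of 5 changed, 3 of those 4 influenced
```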

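4.4 Real-Time Visualization

All debate events (start, position, vote, verdict) are published to an MQTT topic (localkin/swarm/comm). A companion dashboard (Mizpah) renders debates in real time, showing position changes, confidence trajectories, and the accumulation of the evidence and rebuttal pools. Debate results are also published to KinBook (the system's internal knowledge base) for archival and cross-referencing.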

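5. Safety Architecture

5.1 Phase 0 Verification Gate

Financial reports require live price verification before any analytical content is generated. The verification gate is implemented at the code level in the conductor's soul file: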

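STEP 1 (do this first, before writing any text):
  Call stock_price(action="quote", ticker="NVDA")
  Record the returned price and timestamp.

STEP 2 (write the report in this exact order):
  Line 3: ## Phase 0 -- Live Price
  Line 4: Called: stock_price(action='quote', ticker='XXX')
  Line 5: Verified price: $XXX.XX at HH:MM UTC

FORBIDDEN: writing the executive summary as the first section.
FORBIDDEN: using a price from web_search or memory.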

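If the stock_price skill returns an error, the conductor emits [IDLE] and stops. This gate cannot be bypassed: it is enforced at the soul-file level (checked by the quality auditor) and at the runtime level (skill execution order is verified before publication).

Critically, Phase 0 verification is protected from the self-evolution mechanism. The system's autonomous quality audit and repair loop (described in our companion paper on self-evolving swarms) can modify agent personas, reasoning prompts, and output formats. However, Phase 0 verification rules are classified as runtime safety constraints, which the swarm_architect agent is explicitly forbidden from modifying. This separation ensures that the system's drive toward self-improvement cannot compromise its factual grounding.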

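5.2 Hallucination Detection

For financial reports, the system implements a post-generation verification step: price values stated in the report body are compared against the Phase 0 API response. Any discrepancy triggers a compliance failure.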
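A post-generation price check of this kind might look like the Python sketch below. The paper does not say how the system decides which dollar amounts count as claims about the current price (a report also contains targets and stop-loss levels), so this sketch assumes a hypothetical rule: only lines containing a marker phrase such as "current price" or "verified price" are checked. The function name, marker phrases, and tolerance are likewise assumptions.

```python
import re

# Matches dollar amounts such as $177.39 or $1,234.50.
PRICE_RE = re.compile(r"\$(\d[\d,]*(?:\.\d+)?)")


def price_discrepancies(report, verified_price, tolerance=0.005,
                        markers=("current price", "verified price", "last price")):
    """Return stated current-price values that disagree with the Phase 0 quote."""
    mismatches = []
    for line in report.splitlines():
        lowered = line.lower()
        # Assumption: only lines that talk about the current price are checked.
        if not any(marker in lowered for marker in markers):
            continue
        for raw in PRICE_RE.findall(line):
            value = float(raw.replace(",", ""))
            if abs(value - verified_price) > tolerance:
                mismatches.append(value)
    return mismatches


report = (
    "Verified price: $177.39 at 14:32 UTC\n"
    "Target $195, stop-loss $168\n"
    "The current price of $181.20 leaves room to run."
)
print(price_discrepancies(report, 177.39))  # a non-empty list means compliance failure
```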

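For prediction reports, a geopolitical auto-tagging rule requires every sentence containing conflict-related terms (war, sanctions, casualties, military, and so on) to include either a source URL from a web search or an explicit [MODEL INFERENCE -- UNVERIFIED] prefix. Outlet names without a URL (for example, "[Source: Reuters, March 2026]") are explicitly prohibited; only a full URL meets the citation standard.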

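5.3 Mandatory Disclaimers

Disclaimers are appended at the runtime level rather than the agent level, ensuring that agent self-modification cannot omit them: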

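5.4 Ollama Fallback Policy

When the primary LLM provider (Claude API) is unavailable and agents fall back to a local model (Ollama/Qwen), all agents handling sensitive domains (medical, financial) enter [IDLE] mode rather than produce potentially lower-quality output. This is enforced in Step 0 of every scheduled wake:

STEP 0 -- PROVIDER CHECK (mandatory before any output):
  If running in Ollama fallback mode (not Claude API):
  - Do NOT write any trade report or scan
  - Do NOT call stock_price, kinbook_post, or any skill
  - Reply only: "[IDLE -- Ollama mode: compliance requires Claude API.]"

This policy reflects a core design principle: in high-stakes domains, silence is preferable to fabrication.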
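Expressed as a guard function, the Step 0 provider check could look like the following Python sketch. The provider strings, the domain labels, and the function name are assumptions for illustration; in the system described here the rule lives in the soul-file prompt, and the idle message below is an English rendering of it.

```python
from typing import Optional

SENSITIVE_DOMAINS = {"medical", "finance"}
IDLE_MESSAGE = "[IDLE -- Ollama mode: compliance requires Claude API.]"


def provider_gate(provider: str, domain: str) -> Optional[str]:
    """Return the idle message if the agent must stay silent, else None.

    Must run before any skill call or report text is produced.
    """
    if provider == "ollama" and domain in SENSITIVE_DOMAINS:
        return IDLE_MESSAGE
    return None


def wake(provider: str, domain: str) -> str:
    blocked = provider_gate(provider, domain)
    if blocked is not None:
        return blocked  # no skills called, nothing published
    return "proceed"


print(wake("ollama", "finance"))
```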

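5.5 Escalation

When a debate ends in deadlock or confidence scores are uniformly low (all agents below 0.4), the conductor escalates rather than forcing a conclusion. In the TCM system, escalation means recommending an in-person consultation. In the quant system, it means withdrawing the trade proposal and flagging it for human review.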
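The escalation rule combines the debate verdict with a confidence floor. A minimal Python sketch follows, assuming a verdict label of "DEADLOCK" and hypothetical domain keys and action strings; only the 0.4 floor and the two domain-specific outcomes come from the text.

```python
ESCALATION_ACTIONS = {
    # Wording paraphrases the policy above; the keys are illustrative.
    "tcm": "Recommend an in-person consultation.",
    "quant": "Withdraw the trade proposal and flag it for human review.",
}


def should_escalate(verdict, confidences, floor=0.4):
    """Escalate on deadlock, or when every agent's confidence is below the floor."""
    uniformly_low = bool(confidences) and all(c < floor for c in confidences)
    return verdict == "DEADLOCK" or uniformly_low


def resolve(domain, verdict, confidences):
    if should_escalate(verdict, confidences):
        return ESCALATION_ACTIONS[domain]
    return "Publish conclusion."


print(resolve("quant", "CONSENSUS", [0.35, 0.30, 0.38]))
```

In the usage line the agents nominally agree, yet every confidence is under 0.4, so the proposal is withdrawn rather than published.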

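6. Evaluation

6.1 TCM: Spring Pollen Debate

Setup. The TCM conductor autonomously identified "spring pollen allergy" as a trending health topic by scraping Weibo and Zhihu. It framed the debate thesis as "Spring allergies: clear heat (Liu Wansu school) or tonify Qi (Li Dongyuan school)?" Based on the "general internal medicine" category plus Liu Wansu's heat-clearing perspective, the conductor routed to five masters: Zhang Zhongjing, Sun Simiao, Li Dongyuan, Zhu Danxi, and Liu Wansu.

Results.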

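| Agent | Round 1 position | Round 1 confidence | Round 2 position | Round 2 confidence | Changed | Strategy |
|---|---|---|---|---|---|---|
| Zhang Zhongjing | Support (tonify Qi) | 0.75 | Support | 0.80 | No | Analytical |
| Sun Simiao | Support (tonify Qi) | 0.70 | Support | 0.78 | No | Empirical |
| Li Dongyuan | Support (tonify Qi) | 0.90 | Support | 0.92 | No | First-principles |
| Zhu Danxi | Neutral | 0.60 | Support (tonify Qi) | 0.70 | Yes | Contrarian |
| Liu Wansu | Oppose (clear heat) | 0.80 | Oppose | 0.72 | No | Historical |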

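Consensus. Weighted support: 80.2% (threshold: 70%). Verdict: consensus on tonifying Qi as the primary approach, with Liu Wansu's heat-clearing perspective retained as a supplementary consideration for patients with a phlegm-damp constitution.

The quality auditor (an independent agent in the self-evolution loop) reviewed the debate transcript and rated it "exemplary" on three dimensions: (1) each physician maintained a historically consistent persona and terminology, (2) the disagreement between Li Dongyuan and Liu Wansu reflected a genuine doctrinal split traceable to the Jin dynasty (Spleen-Stomach school vs. Cold-Cooling school), and (3) Zhu Danxi's position change was well reasoned and did not trigger a consensus inertia warning.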

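6.2 Quant: NVDA Daily Scan

Setup. The quant conductor executed its scheduled 8-hour wake cycle, targeting NVIDIA (NVDA).

Phase 0. stock_price(action="quote", ticker="NVDA") returned $177.39 at 14:32 UTC. Consistency was cross-checked via stock_price(action="metrics", ticker="NVDA").

Phase 1. The four analysts produced independent reports: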

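| Analyst | Signal | Key finding |
|---|---|---|
| Equity analyst | Bullish | Data center revenue up 45% year over year; AI capex cycle intact |
| Macro analyst | Bullish | Fed holding rates; liquidity supports growth stocks |
| Technical analyst | Bullish | Price above the 50-day and 200-day moving averages; RSI 62 (not overbought) |
| Sentiment analyst | Bullish | Institutional accumulation; options flow skewed bullish |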

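Phase 2. Although 4/4 analysts were bullish, the adversarial debate assigned two analysts to argue the bear case regardless. The debate surfaced risks (China export restrictions, valuation multiple compression, custom ASIC competition), but the bull thesis held with 78% weighted consensus.

Phase 3. Trader proposal: long NVDA, entry $176-178, target $195, stop-loss $168, position size 2% of the portfolio.

Phase 4. The risk manager approved with a modification: reduce the position to 1.5% given elevated sector concentration.

Phase 5. The report was published to KinBook with the Phase 0 price displayed prominently as the first section heading, and it passed compliance verification.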

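6.3 Quality Trajectory

Over a continuous 5-day autonomous deployment, the quality auditor tracked compliance across all debate outputs:

Table 3. Quality metrics over the 5-day autonomous run.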

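| Metric | Day 1 | Day 2 | Day 3 | Day 4 | Day 5 |
|---|---|---|---|---|---|
| Phase 0 compliance | 67% | 83% | 100% | 100% | 100% |
| Disclaimer present | 80% | 90% | 95% | 100% | 100% |
| Hallucinated prices | 2 | 1 | 0 | 0 | 0 |
| Citation compliance | 60% | 72% | 85% | 90% | 92% |
| Overall compliance | 75% | 82% | 88% | 91% | 92% |
| Consensus inertia warnings | 1 | 0 | 0 | 0 | 0 |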

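The improvement trajectory reflects the interaction between the debate system and the self-evolution loop: the quality auditor identified Phase 0 ordering violations on Days 1-2, the swarm architect patched the conductor's soul file with stronger ordering constraints (including the FORBIDDEN directives), and compliance reached 100% on Day 3.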

6.4 Routing Efficiency

Table 4. Agent utilization: broadcast vs. domain routing.

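| System | Fleet size | Broadcast (agents/query) | Routed (agents/query) | Reduction |
|---|---|---|---|---|
| TCM | 11 | 11 | 4.3 (avg) | 61% |
| Quant | 6 | 6 | 2.5 (avg per phase) | 58% |
| Prediction | 75 | 10 (max debate) | 4.6 (avg) | 54% |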

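Domain routing reduces agent invocations by 54-61% while maintaining consensus quality. The compute savings are especially significant for the prediction conductor, which would otherwise have to broadcast to all 75 agents.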
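The reduction column of Table 4 follows from the broadcast and routed counts as 1 minus routed divided by broadcast. The short Python check below recomputes it from the table's figures; the dictionary layout and function name are illustrative.

```python
def reduction(broadcast: float, routed: float) -> float:
    """Fraction of agent invocations saved by routing instead of broadcasting."""
    return 1 - routed / broadcast


# (broadcast agents per query, routed agents per query), taken from Table 4.
TABLE_4 = {
    "TCM": (11, 4.3),
    "Quant": (6, 2.5),
    "Prediction": (10, 4.6),
}

for system, (broadcast, routed) in TABLE_4.items():
    print(f"{system}: {round(100 * reduction(broadcast, routed))}%")
```

Note that the Prediction row uses a 10-agent maximum debate as its broadcast baseline. Measured against a hypothetical broadcast to the full 75-agent fleet, the same formula would give roughly 94%.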

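7. Integration with Self-Evolution

The domain-expert debate system runs inside LocalKin's autonomous self-evolution loop, described in detail in our companion paper (Sun, 2026). The integration points are:

  1. Quality auditor reviews debate outputs. After every debate, the quality auditor evaluates the transcript for compliance (Phase 0 ordering, disclaimer presence, citation format), persona consistency (do physician agents stay in character?), and reasoning quality (are arguments grounded in domain knowledge?).

  2. Swarm architect patches conductor souls. When the quality auditor identifies a systematic failure, such as recurring Phase 0 violations or routing mismatches, it generates feedback that the swarm architect converts into soul-file patches. For example, the FORBIDDEN directives in the quant conductor's soul file were added autonomously by the swarm architect after the quality auditor flagged Phase 0 ordering failures on Days 1-2.

  3. Safety boundary protection. The self-evolution loop is subject to a hard constraint: the runtime safety layer (Phase 0 verification, mandatory disclaimers, Ollama fallback policy) is classified as immutable. The swarm architect may modify reasoning prompts, output formats, routing tables, and persona descriptions, but it cannot weaken verification gates or remove disclaimers. This constraint is enforced by a reality checker agent, which audits all proposed soul-file modifications before they are applied.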

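8. Limitations and Future Work

Static routing tables. The current TCM and quant routing tables are hand-curated. Although the prediction conductor uses dynamic topic classification, a fully learned routing mechanism, perhaps using embedding similarity between queries and agent persona descriptions, would improve generalization.

Consensus threshold sensitivity. The 70% threshold was chosen empirically. Set too low, the system declares false consensus; set too high, valuable debates are classified as deadlocks. An adaptive threshold based on topic difficulty or historical accuracy could improve this.

Cross-domain debate. The current architecture assumes that each query falls cleanly into one domain. A patient asking about interactions between herbs and prescription drugs would benefit from routing to both TCM physicians and a pharmacology expert outside the TCM fleet. Cross-conductor routing is not yet implemented.

Evaluation scale. Our results come from production systems rather than controlled benchmarks. This provides ecological validity, but controlled comparisons against DMAD, A-HMAD, and PROClaim baselines on standardized datasets are missing and would strengthen the empirical claims.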

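9. Conclusion

Domain-expert routing addresses a practical inefficiency in multi-agent debate: not every agent needs to weigh in on every question. By interposing a conductor with a routing table between queries and the agent fleet, we obtain three benefits at once: lower computational cost (54-61% fewer agent invocations), higher-quality consensus (experts debate within their own domain), and stronger safety properties (the Phase 0 verification gate catches hallucinated data before publication).

The architecture is domain-agnostic. Although we demonstrate it in TCM consultation and quantitative finance, the pattern (conductor, routing table, structured debate, verification gate) transfers to any domain in which a fleet of specialized agents must collaborate on complex questions. The key insight is that expertise is unevenly distributed, and a debate protocol should respect this by routing each question to the agents best qualified to answer it.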

References

Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., and Mordatch, I. (2023). Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325.

Liang, T., He, Z., Jiao, W., Wang, X., Wang, Y., Wang, R., Yang, Y., Tu, Z., and Shi, S. (2024). Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. arXiv:2305.19118.

Li, Y., et al. (2025). PROClaim: Progressive Claim Refinement through Multi-Round Argumentation. Proceedings of ACL 2025.

Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. (2023). Gorilla: Large Language Model Connected with Massive APIs. arXiv:2305.15334.

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. ICLR 2017.

Sun, J. (2026). Self-Evolving Multi-Agent Swarms: Autonomous Quality Audit, Repair, and Verification Loops for Production AI Agent Systems. Technical Report, The LocalKin Team.

Tang, X., et al. (2024). MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning. arXiv:2311.10537.

Wang, Z., et al. (2025). A-HMAD: Adaptive Hierarchical Multi-Agent Debate for LLM Reasoning. arXiv:2501.xxxxx.

Xu, Y., et al. (2025). Breaking the Mental Set: Diverse Multi-Agent Debate for Improved Reasoning. ICLR 2025.

Zhang, L., et al. (2025). DCI: Debate, Critique, and Integrate for Enhanced LLM Reasoning. NeurIPS 2025.

MDCCTM (2025). Multi-agent Dynamic Collaborative Chain-of-Thought in Traditional Chinese Medicine. arXiv:2502.04345.

TCM-DiffRAG (2026). Personalized Constitution-Aware Retrieval-Augmented Generation for TCM Diagnosis. arXiv:2602.22828.