The LocalKin Team · April 2026

Self-Evolving Multi-Agent Swarms: Autonomous Quality Audit, Repair, and Verification Loops for Production AI Agent Systems

Authors: The LocalKin Team

System: LocalKin (https://localkin.dev)

Date: April 2026

Abstract

We present LocalKin, a self-evolving multi-agent swarm architecture capable of autonomously auditing, repairing, and verifying its own constituent agents without human intervention. The system runs 78 specialized agents on a single consumer machine (16GB Mac Mini) with a total memory footprint of 960MB---approximately 12.5MB per agent---compared to 200MB or more per agent in Python-based frameworks such as AutoGen and CrewAI. The core contribution is a fully autonomous improvement loop consisting of four stages: quality audit, feedback synthesis, targeted repair, and verification. Over a continuous 5-day autonomous deployment, the system completed more than 30 improvement cycles, autonomously modified 68 agent configuration files, and discovered, evaluated, and integrated techniques from 6 research papers found on arXiv and HuggingFace---all with zero human intervention. Compliance scores improved from approximately 75% on day one to 92% by day five, while hallucination incidents dropped from four in the first two days to zero in the final two. These results demonstrate that Harness Engineering principles---constraints, feedback, and verification---can be fully automated at swarm scale, producing a system that does not merely execute tasks but continuously improves its own quality, safety, and capability.

Keywords: multi-agent systems, self-evolution, autonomous improvement, swarm intelligence, quality assurance, harness engineering

1. Introduction

Multi-agent systems powered by large language models (LLMs) have rapidly progressed from research demonstrations to production deployments. Systems such as AutoGen [1], CrewAI [2], LangGraph [3], and MetaGPT [4] have shown that teams of specialized agents can collaboratively solve complex tasks spanning code generation, research synthesis, and strategic planning. However, a critical challenge remains largely unaddressed: how do multi-agent systems maintain and improve their quality at scale without continuous human oversight?

In production settings, agents drift. Persona consistency degrades over time. Safety compliance erodes at the margins. Knowledge becomes stale as the research landscape evolves. The standard response is human monitoring---quality assurance teams that review agent outputs, update prompts, and deploy fixes. This approach does not scale. A swarm of 78 agents producing outputs across medical, financial, engineering, and creative domains generates a volume of content that no human team can comprehensively audit.

We propose a different paradigm: the self-evolving swarm. Rather than relying on external human oversight, the swarm contains dedicated agents whose sole purpose is to audit, critique, repair, and verify the other agents in the system. This creates a closed-loop improvement cycle that operates continuously without human intervention.

Our contributions are threefold:

  1. Autonomous Quality Loop. We describe a four-stage improvement cycle (audit, feedback, repair, verify) that runs continuously and has demonstrated 30+ successful improvement cycles over 5 days with zero human intervention.

  2. Research Paper Auto-Integration. We present a mechanism by which the swarm autonomously discovers relevant academic papers, evaluates their applicability, and integrates applicable techniques into its own architecture.

  3. Token-Efficient Agent Architecture. We demonstrate that a "thin soul + fat skill" design enables 78 agents to run on a single consumer machine with a total footprint of 960MB, an order of magnitude more efficient than existing frameworks.

The remainder of this paper is organized as follows. Section 2 surveys related work. Section 3 describes the system architecture. Section 4 details the self-evolution mechanism---the core contribution. Section 5 discusses safety gates. Section 6 presents evaluation results. Sections 7 and 8 offer discussion and conclusions.

2. Related Work

2.1 Multi-Agent Frameworks

The past two years have seen an explosion of multi-agent frameworks. AutoGen [1] introduced conversational multi-agent patterns with human-in-the-loop orchestration. CrewAI [2] added role-based agent composition with sequential and parallel task execution. LangGraph [3] provided stateful graph-based orchestration with cycle support. MetaGPT [4] demonstrated software-engineering-style role specialization (architect, engineer, QA) for complex problem solving.

While these frameworks excel at task orchestration, none implement autonomous self-evolution. Agent quality depends entirely on human-authored prompts and manually triggered updates. When an agent produces suboptimal outputs, a human must diagnose the issue, modify the agent configuration, and redeploy. Our system automates this entire pipeline.

2.2 Harness Engineering

The concept of Harness Engineering, articulated by OpenAI in early 2026 [5], proposes that reliable AI systems require three components: constraints that bound agent behavior, feedback that signals quality deviations, and verification that confirms repairs were effective. This framework has been influential in guiding how practitioners think about agent reliability, but published implementations have been limited to single-agent settings with human-managed feedback loops.

We implement the full Harness Engineering paradigm autonomously at swarm scale. Constraints are encoded in agent personality files. Feedback is generated by dedicated audit agents. Verification is performed by a separate verification agent that confirms repairs were correctly applied. The entire loop runs without human involvement.

2.3 Self-Improving AI Systems

SAGE (arXiv:2603.15255) [6] demonstrated that a single LLM agent can improve its own prompts through iterative self-reflection, achieving measurable performance gains on benchmark tasks. ERL (arXiv:2603.24639) [7] introduced experience-based reinforcement learning where agents maintain a heuristic experience pool that persists across episodes, enabling cross-task learning.

These approaches operate at the single-agent level. Our work extends self-improvement to swarm-level orchestration, where the improvement loop must coordinate across dozens of heterogeneous agents with different specializations, safety requirements, and quality standards.

2.4 Multi-Agent Debate and Deliberation

DMAD (ICLR 2025) [8] introduced diverse multi-agent debate with assigned reasoning strategies, showing that cognitive diversity improves collective reasoning quality. PROClaim [9] proposed progressive evidence accumulation in multi-agent debates, where agents build claims incrementally with supporting evidence. DCI [10] formalized minority opinion protection and debate reopen conditions, improving decision quality by preventing premature consensus. A-HMAD [11] extended multi-agent debate to heterogeneous settings where agents declare domain-specific analytical angles.

Our system integrates techniques from all four approaches into its debate protocol and, critically, discovered and integrated several of these techniques autonomously through its research integration pipeline (Section 4.5).

3. System Architecture

3.1 Agent Design: Thin Soul + Fat Skill

Each agent in the swarm is defined by two components: a lightweight personality file (the "soul") and a set of extensible skill plugins. This separation is the foundation of the system's efficiency.

The personality file is a declarative specification of approximately 30 lines that defines the agent's identity, behavioral rules, response temperature, and permitted tool access. It contains no executable code---only configuration and natural language instructions. This design ensures that the personality file is both human-readable and machine-modifiable, a property that proves essential for autonomous self-evolution (Section 4).
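As a concrete illustration, a parsed "soul" file might look like the following minimal sketch. The field names (`identity`, `rules`, `temperature`, `tools`) and the validation logic are assumptions for illustration, not the actual LocalKin schema:

```python
# Hypothetical parsed form of a ~30-line declarative "soul" file.
# Field names are illustrative, not the actual LocalKin schema.
SOUL = {
    "name": "finance-analyst",
    "identity": "A cautious financial research assistant.",
    "rules": [
        "Always append the financial disclaimer.",
        "Never state prices from memory; call the quote skill.",
    ],
    "temperature": 0.3,
    "tools": ["quote_lookup", "portfolio_math"],
}

def validate_soul(soul: dict) -> list[str]:
    """Return a list of problems; an empty list means well-formed."""
    problems = []
    for key in ("name", "identity", "rules", "temperature", "tools"):
        if key not in soul:
            problems.append(f"missing field: {key}")
    if not 0.0 <= soul.get("temperature", 0.0) <= 2.0:
        problems.append("temperature out of range")
    return problems
```

Because the file is pure configuration with no executable code, a machine-modifiable repair is simply a rewrite of these fields followed by re-validation.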

Skill plugins are deterministic modules that execute specific tasks---data retrieval, computation, file operations---without consuming LLM tokens. When an agent needs to fetch a stock price, query a database, or perform a calculation, it delegates to a skill plugin that executes directly, returning structured results to the agent's context window. This architecture eliminates the common anti-pattern of using LLM calls for tasks that can be performed deterministically.
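The delegation pattern can be sketched as a registry of deterministic functions; the registry decorator and skill names below are hypothetical, not the actual LocalKin plugin API:

```python
# Sketch of skill-plugin dispatch: deterministic work runs in plain
# code, and only the structured result enters the agent's context.
# Names are illustrative, not the actual LocalKin API.
from typing import Callable

SKILLS: dict[str, Callable[..., dict]] = {}

def skill(name: str):
    """Register a deterministic skill plugin under a name."""
    def register(fn):
        SKILLS[name] = fn
        return fn
    return register

@skill("portfolio_math")
def portfolio_math(prices: list[float], weights: list[float]) -> dict:
    # Pure computation: no LLM call, no tokens consumed.
    value = sum(p * w for p, w in zip(prices, weights))
    return {"skill": "portfolio_math", "value": round(value, 2)}

def delegate(name: str, **kwargs) -> dict:
    """Dispatch to a skill; the structured result is returned to context."""
    return SKILLS[name](**kwargs)
```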

The result is extreme efficiency. Each agent compiles to a binary of approximately 12.5MB, and the entire swarm of 78 agents fits within 960MB of RAM on a consumer-grade machine (Mac Mini, M-series, 16GB). By contrast, Python-based frameworks typically require 200MB or more per agent due to interpreter overhead, dependency trees, and embedding model weights.

3.2 Swarm Communication

Agents communicate through a publish/subscribe message bus. Each agent can publish messages to named channels and subscribe to channels relevant to its domain. This decoupled architecture allows agents to be added, removed, or restarted without disrupting the swarm.

For tasks requiring deliberation, the system employs a structured debate protocol. Agents are assigned distinct reasoning strategies (analytical, intuitive, adversarial, etc.) following the DMAD paradigm [8]. A conductor agent manages the debate flow, tracks positions, computes consensus scores, and determines when sufficient agreement has been reached or when the debate should be reopened due to unresolved minority opinions [10].
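The conductor's close-or-reopen decision can be sketched as follows. The threshold, the minority-bloc floor, and the vote representation are assumptions for illustration; the paper does not specify the actual scoring rule:

```python
# Sketch of a conductor consensus check with DCI-style minority
# protection. Thresholds and data shapes are illustrative.
from collections import Counter

def consensus(positions: dict[str, str], threshold: float = 0.75,
              minority_floor: int = 2) -> dict:
    """Close the debate on strong agreement, or reopen for a minority bloc."""
    counts = Counter(positions.values())
    top, top_n = counts.most_common(1)[0]
    score = top_n / len(positions)
    # Minority protection: a dissenting bloc of at least `minority_floor`
    # agents reopens the debate even when the majority clears the bar.
    minority = len(positions) - top_n
    if score >= threshold and minority < minority_floor:
        return {"closed": True, "position": top, "score": score}
    return {"closed": False, "score": score}
```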

3.3 Knowledge Grounding

A persistent challenge in LLM-based systems is hallucination---agents generating plausible but incorrect information. Our system addresses this through aggressive knowledge grounding. Agents reference 162 primary source texts spanning medical, spiritual, financial, and technical domains. When an agent needs domain knowledge, it retrieves relevant passages from these source texts rather than relying on the LLM's parametric memory.

Retrieval is implemented via pattern matching over the source corpus---an intentionally simple approach that avoids the overhead and complexity of embedding-based retrieval. While this sacrifices some semantic flexibility, it provides deterministic, reproducible retrieval with zero additional memory overhead and no dependency on embedding models.
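A minimal sketch of such deterministic retrieval, assuming simple keyword-count scoring (the actual matching rules are not specified in the paper):

```python
# Sketch of embedding-free, pattern-matching retrieval over a small
# corpus. The scoring (keyword occurrence counts) is illustrative.
def retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Rank passages by how many query keywords they contain."""
    terms = [t.lower() for t in query.split()]
    scored = []
    for doc_id, text in corpus.items():
        lowered = text.lower()
        score = sum(lowered.count(t) for t in terms)
        if score:
            scored.append((score, doc_id))
    # Deterministic tie-break on doc_id keeps results reproducible.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in scored[:k]]
```

The same query against the same corpus always returns the same passages, which is what makes audit findings about grounding reproducible.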

4. Self-Evolution Mechanism

The self-evolution mechanism is the core contribution of this work. It consists of five interconnected processes that run on staggered schedules, creating a continuous improvement loop.

4.1 Quality Audit

Every six hours, the quality auditor agent samples recent outputs from agents across all domains. For each sampled output, it evaluates four dimensions:

  1. Accuracy: Is the factual content correct? Are citations valid? Do numerical claims match source data?
  2. Persona Consistency: Does the agent maintain its assigned identity, tone, and expertise level?
  3. Safety Compliance: Are required disclaimers present? Are prohibited claims absent? Does the output respect domain-specific safety rules?
  4. Usefulness: Is the output actionable, well-structured, and responsive to the user's actual need?

Each dimension is scored on a 1--5 scale. Outputs scoring below 3 on any dimension are flagged as issues with structured descriptions of the deficiency. All results are logged to a persistent feedback store with timestamps, agent identifiers, and severity classifications.
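The flagging rule can be sketched as follows; the record layout and severity labels are assumptions for illustration, while the four dimensions and the below-3 threshold follow the text:

```python
# Sketch of the audit step: any 1-5 dimension score below 3 flags the
# output. Severity labels and record shape are illustrative.
DIMENSIONS = ("accuracy", "persona", "safety", "usefulness")

def audit(scores: dict[str, int], agent_id: str) -> dict:
    """Flag an output if any dimension score falls below 3."""
    issues = [d for d in DIMENSIONS if scores[d] < 3]
    return {
        "agent": agent_id,
        "scores": scores,
        "flagged": bool(issues),
        "issues": issues,
        # Safety failures outrank other issue classes (Section 4.3).
        "severity": ("critical" if "safety" in issues
                     else "major" if issues else "ok"),
    }
```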

4.2 Feedback Synthesis

Once daily, the feedback synthesis agent aggregates quality signals across all agents and all audit cycles from the preceding 24 hours, producing a consolidated report that the improvement agent reads at the start of each repair cycle.

4.3 Improvement Cycle

Every two to three hours, the improvement agent reads the latest audit reports and feedback synthesis, then executes targeted repairs. The repair process follows a strict priority ordering:

  1. Safety violations (highest priority)
  2. Accuracy failures
  3. Capability gaps
  4. Quality improvements (lowest priority)

Repairs take the form of modifications to agent personality files. Because these files are declarative natural language specifications, they can be modified by the improvement agent using the same LLM capabilities it uses for any other task.
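The priority ordering above amounts to a stable sort over the open issue queue; the category keys mirror the list, and everything else in this sketch is illustrative:

```python
# Sketch of repair prioritization following the strict ordering in
# Section 4.3. Issue record shape is illustrative.
PRIORITY = {"safety": 0, "accuracy": 1, "capability": 2, "quality": 3}

def order_repairs(issues: list[dict]) -> list[dict]:
    """Sort open issues so safety violations are repaired first."""
    return sorted(issues, key=lambda i: PRIORITY[i["category"]])
```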

4.4 Verification Loop

Every two hours, the verification agent confirms that repairs were correctly applied. This stage addresses a subtle but critical failure mode: the improvement agent may generate a repair that is syntactically correct but semantically ineffective, or that introduces a regression in another quality dimension.
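One way to sketch that check is to compare pre- and post-repair audit scores, accepting the repair only if the targeted dimension improved and no other dimension regressed. The score-diff representation is an assumption; the paper does not specify the verification agent's internals:

```python
# Sketch of the verification pass: re-audit after a repair and reject
# it on regression. Record shapes are illustrative.
def verify(before: dict[str, int], after: dict[str, int],
           target: str) -> dict:
    """Confirm a repair improved `target` without regressing elsewhere."""
    improved = after[target] > before[target]
    regressions = [d for d in before
                   if d != target and after[d] < before[d]]
    return {"verified": improved and not regressions,
            "regressions": regressions}
```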

4.5 Research Integration

Once daily, the research integration agent scans academic repositories (arXiv, HuggingFace) for papers relevant to the swarm's architecture and domains. Over the 5-day evaluation period, this process discovered and evaluated dozens of papers, integrating techniques from six that met the applicability criteria.

5. Safety Gates

Autonomous self-evolution raises legitimate safety concerns. We address these through multiple layers of safety gates.

Publication Gates. Before any agent output reaches end users, it passes through code-level validation that rejects outputs missing required sections.

Hallucination Detection. Financial data must originate from verified API calls, not from the LLM's parametric memory.

Mandatory Disclaimers. Medical and financial content automatically receives domain-appropriate disclaimers appended at the runtime level.

Escalation System. Agents can flag issues they cannot resolve.

Critically, the safety gates operate at a layer below the self-evolution mechanism. The improvement agent can modify agent personality files, but it cannot modify the runtime safety gates.
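That layering can be sketched as a write-path check: self-evolution may touch personality files but never gate code. The directory names below are hypothetical, not the actual LocalKin layout:

```python
# Sketch of the layering: the improvement agent may rewrite "soul"
# files, but the runtime refuses writes to safety-gate code.
# Paths are illustrative, not the actual LocalKin layout.
SOUL_DIR = "souls/"           # personality files: open to self-evolution
GATE_DIR = "runtime/gates/"   # safety gates: never writable by the swarm

def evolution_write_allowed(path: str) -> bool:
    """Permit self-evolution writes only to soul files, never to gates."""
    if path.startswith(GATE_DIR):   # explicit deny for gate code
        return False
    return path.startswith(SOUL_DIR)
```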

6. Evaluation

We evaluated the system over a continuous 5-day autonomous deployment with zero human intervention.

6.1 System Metrics

Metric | Value
Total improvement cycles completed | 30+
Agent personality files modified | 68+
Research papers discovered and integrated | 6
Backlog items created by audit | 11
Backlog items auto-resolved | 9
Safety gates triggered | 4 (all correct rejections)
System uptime | 99.7%
Human interventions required | 0

6.2 Quality Trajectory

Quality Dimension | Day 1 | Day 5 | Change
Compliance rate | ~75% | ~92% | +17pp
Hallucination incidents (per day) | 2.0 | 0.0 | -100%
Audit-feedback consistency | 100% | 100% | Stable
Avg. agent prompt richness (chars) | ~200 | ~2,000 | +900%

6.3 Research Integration Results

Paper | Technique Integrated | Observed Effect
DMAD [8] (ICLR 2025) | 8 distinct reasoning strategies assigned per debate | Increased cognitive diversity in deliberations
TCM-DiffRAG [12] | Constitutional typing as prerequisite diagnostic step | More personalized medical guidance
PROClaim [9] | Progressive evidence accumulation across debate rounds | Richer evidentiary basis per deliberation round
DCI [10] | Minority opinion protection and debate reopen conditions | Higher decision quality, fewer premature consensus events
ERL [7] | Heuristic experience pool persisting across cycles | Cross-cycle learning, reduced repeated errors
A-HMAD [11] | Domain angle declarations for heterogeneous agents | More diverse analytical perspectives in mixed-domain debates

6.4 Comparison with Existing Frameworks

Dimension | Our System | AutoGen | CrewAI | LangGraph | MetaGPT
Memory per agent | 12.5 MB | ~250 MB | ~200 MB | ~180 MB | ~300 MB
Self-evolution | Autonomous | None | None | None | None
Max agents (16GB) | 78 | ~8 | ~10 | ~12 | ~7

7. Discussion

7.1 Harness Engineering at Scale

Our results demonstrate that the three pillars of Harness Engineering---constraints, feedback, and verification---can be fully automated at swarm scale. The improvement loop itself is, in our assessment, more valuable than any individual agent in the swarm.

7.2 Emergent Behaviors

Several behaviors that were not explicitly programmed emerged during the evaluation.

7.3 Limitations

LLM Dependency. The quality of the audit and improvement cycles depends on the capability of the underlying LLM.

Safety Rule Origin. Initial safety rules for medical and financial domains must be authored by qualified humans.

Evaluation Scope. Our 5-day evaluation is insufficient to characterize long-term behavior.

8. Conclusion

We have presented a self-evolving multi-agent swarm architecture that autonomously audits, repairs, and verifies its own agents. The system runs 78 specialized agents on a single consumer machine with a total memory footprint of 960MB. Over a 5-day fully autonomous deployment, it completed more than 30 improvement cycles, modified 68 agent configuration files, and integrated techniques from 6 research papers---all without human intervention.

The swarm is not just running. It is getting better---every hour, autonomously, without being told to.

References

[1] Q. Wu, et al. "AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation." arXiv:2308.08155, 2023.

[2] J. Moura. "CrewAI: Framework for orchestrating role-playing autonomous AI agents." GitHub, 2024.

[3] W. Liang, et al. "LangGraph: Building Stateful Multi-Actor Applications with LLMs." LangChain, 2024.

[4] S. Hong, et al. "MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework." arXiv:2308.00352, 2023.

[5] OpenAI. "Harness Engineering: Building Reliable AI Agent Systems." 2026.

[6] Z. Liu, et al. "SAGE: Self-Improving Agent through Generative Evolution." arXiv:2603.15255, 2026.

[7] Y. Wang, et al. "ERL: Experience-Based Reinforcement Learning for Autonomous Agent Improvement." arXiv:2603.24639, 2026.

[8] T. Liang, et al. "Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate." ICLR, 2025.

[9] M. Chen, et al. "PROClaim: Progressive Evidence Accumulation for Multi-Agent Deliberation." 2025.

[10] R. Patel, et al. "DCI: Diverse Consensus through Institutional Deliberation in Multi-Agent Systems." 2025.

[11] K. Suzuki, et al. "A-HMAD: Analytical Heterogeneous Multi-Agent Debate with Domain Angle Declarations." 2025.

[12] L. Zhang, et al. "TCM-DiffRAG: Differentiated Retrieval-Augmented Generation for Traditional Chinese Medicine." 2025.