Autonomous Swarm Genesis: From YouTube URLs to Expert AI Swarms via NotebookLM-as-Infrastructure
The LocalKin Team
April 2026
Keywords: multi-agent systems, autonomous agent generation, NotebookLM, knowledge refinery, persona synthesis, domain-agnostic AI, zero-cost pipelines
Abstract
Every production multi-agent framework we are aware of---CrewAI, AutoGen, LangGraph, MetaGPT---requires human operators to hand-author agent personas before any reasoning can happen. The developer writes "you are a senior backend engineer" or "you are a skeptical investor" and hopes the prompt matches the downstream corpus. This creates a structural ceiling: the number of expert swarms a system can support equals the number of personas a human has bothered to write. For any non-trivial knowledge corpus, this ceiling is reached within days.
We present Autonomous Swarm Genesis, a pipeline that inverts this relationship. Given only a list of YouTube video URLs (or any web content) and a domain name, the system creates a live expert swarm in roughly one to three minutes with zero human authorship. The pipeline chains four layers: (1) a Go client for Google NotebookLM's undocumented internal RPC API that ingests video URLs and leverages Google's infrastructure for transcription and long-context analysis at zero marginal cost; (2) a knowledge extraction stage that interrogates NotebookLM through three carefully designed questions to pull structured domain understanding; (3) a Genesis Engine that uses an LLM to discover distinct expert personas from the extracted knowledge and emit valid agent configuration files (YAML frontmatter plus Markdown personas, compatible with an existing thin-soul runtime); and (4) a swarm bootstrap step that loads the generated agents into a live debate arena.
We deployed the system against the Christian contemplative spirituality domain, generating 10 expert souls plus a synthesizing Conductor from a corpus of 72 source texts. The named experts include Madame Guyon, Andrew Murray, Brother Lawrence, St. John of the Cross, Teresa of Avila, Thérèse of Lisieux, Molinos, the author of The Cloud of Unknowing, and Thomas à Kempis. On this local corpus path, the run from invocation to swarm-ready state took under 10 seconds; the full YouTube path is expected to take 60--180 seconds depending on the number of videos and NotebookLM processing time. Marginal cost per domain is under $0.10: a Google AI Pro subscription provides effectively unlimited NotebookLM usage, the persona synthesis call costs a few cents, and the runtime is a 15 MB Go binary. Because the pipeline contains no domain-specific code, the same command run against a YouTube cooking channel or a law lecture series should generate domain-appropriate chef or attorney swarms without code changes, although these domains have not yet been tested.
The broader contribution is a philosophical shift: data generates the agents that process it. Where prior systems treat agent personas as developer-authored constants, this work treats them as compiled artifacts of the underlying corpus. Adding a new domain to a multi-agent platform becomes a one-line command rather than a design sprint.
1. Introduction
The dominant paradigm in multi-agent LLM systems has converged on a pattern so universal that it is rarely questioned:
- A developer decides the domain of application.
- The developer writes agent personas as prompt strings.
- The developer hardcodes the agent roster in a config file or Python module.
- At runtime, incoming queries are routed to the pre-authored agents.
This pattern is visible in every widely-used framework. CrewAI expects the developer to instantiate Agent(role="...", goal="...", backstory="...") objects before any crew can run. AutoGen's ConversableAgent takes a system_message parameter that the developer composes. LangGraph's state-machine approach encodes each node as a hand-written function with a hand-written prompt. MetaGPT scripts a "software company" where the CEO, CTO, and architects are all hardcoded personas shipped in the framework itself.
The common assumption is that human authorship of personas is not the bottleneck. The framework literature focuses on coordination, memory, tool use, and reasoning protocols---all downstream of the moment when some developer typed "you are an expert at X." This paper asks a different question: what if the personas were not hand-authored at all?
Concretely: imagine a platform where a user provides a YouTube channel URL and, three minutes later, can interact with a swarm of AI experts grounded in that channel's content. The experts are not pulled from a library; they are invented on the spot based on the actual material. If the channel covers 17th-century French mysticism, Madame Guyon and Molinos appear. If it covers American constitutional law, a litigator and a theorist appear. If it covers Italian cooking, a Tuscan nonna and a modernist chef appear. The experts are as specific as the data itself.
This paper describes a system that does exactly this. We call the approach Autonomous Swarm Genesis. It is deployed in production as the knowledge-refinery layer of LocalKin, an existing self-evolving multi-agent platform. The pipeline has been verified end-to-end against both local knowledge directories (72 spiritual texts) and live YouTube video URLs, and it produces valid agent configurations that boot directly into the existing LocalKin runtime without manual post-editing.
The contributions of this paper are:
- A Go client for Google NotebookLM's internal RPC API, reverse-engineered from the Python notebooklm-py reference and verified against the live production API. The client exposes notebook management, source ingestion (YouTube, URLs, PDFs, text), chat, artifact generation, and research workflows, all with zero Python dependency and zero external Go libraries beyond the standard library and chromedp for browser-based authentication.
- The NotebookLM-as-Infrastructure pattern, which treats Google's consumer product as an industrial knowledge extraction layer. We show that three carefully designed questions (authors, themes, disagreements) extract enough structured knowledge from an arbitrary corpus to drive downstream persona synthesis.
- The Genesis Engine, a two-pass architecture (deterministic directory scan + LLM semantic pass, inspired by Graphify's code-and-docs analysis approach) that emits valid thin-soul configuration files directly from extracted knowledge. The deterministic pass costs zero tokens; the semantic pass is a single LLM call per domain.
- An end-to-end demonstration that a single command (localkin -source urls.txt -genesis-domain spiritual --auto) creates a live 10-expert debate arena from raw video URLs with zero human authorship.
- The generalization argument: because the pipeline has no domain-specific code, any YouTube channel becomes a candidate for swarm generation. The per-domain cost, after the one-time tool investment, is the time it takes to paste a URL.
2. The Problem: Human-Authored Personas as a Structural Ceiling
2.1 The Hand-Authored Paradigm
Consider a representative CrewAI setup for a market research task:
    from crewai import Agent, Crew

    researcher = Agent(
        role="Market Researcher",
        goal="Identify trends in the fintech sector",
        backstory="You are a seasoned analyst with 15 years at McKinsey...",
    )
    writer = Agent(
        role="Report Writer",
        goal="Synthesize findings into executive briefings",
        backstory="You are a former Economist editor...",
    )
    # Task definitions are elided here, as in the original example.
    crew = Crew(agents=[researcher, writer], tasks=[...])
Every word of the role, goal, and backstory is human-authored. The developer must know the domain well enough to write credible backstories, choose the right number of agents (too few → weak coverage; too many → coordination overhead), and tune the prompts iteratively based on output quality.
For a single domain this is tractable. For a platform that aims to serve many domains, the authorship burden scales linearly with the number of domains and superlinearly with the number of experts per domain. A platform targeting law, medicine, finance, cooking, gardening, parenting, theology, and ten other verticals---with 5--10 experts each---faces hundreds of persona strings that need to be written, reviewed, versioned, and updated as the underlying domain knowledge evolves.
2.2 The Knowledge-Persona Mismatch
A more insidious problem than authorship volume is the mismatch between hand-authored personas and the knowledge corpus the system actually has. Consider a system where the developer writes a "Madame Guyon" persona based on a general Wikipedia understanding, and then connects it to a corpus containing only Guyon's three most technical commentaries on the Song of Songs. The persona knows of "the interior way" in general terms; the corpus contains only a highly specific exegetical vocabulary. Query results will feel vaguely right but never specifically grounded.
The inverse is also common: a corpus containing rich Lawrence of Brindisi sermons connected to a generic "medieval theologian" persona that never learned to cite the sermons' actual positions. The persona is a stranger to the corpus it is supposed to inhabit.
When personas and corpora drift out of alignment, users experience the failure as subtle hallucination: the agent's voice is confident, but its citations do not match its claims, and its claims do not match the source texts.
2.3 The Scaling Wall
The structural problem is captured in a single number. If persona authorship costs 2--4 hours per expert (including research, draft, review, and iteration), and a typical domain requires 5--10 experts, then each new domain costs 10--40 hours of developer time. A platform aiming for 20 domains has 200--800 hours of authorship overhead before reaching its roadmap---assuming zero rework. In practice, domain-specific vocabulary, stylistic voice, and contested positions all require iteration, pushing the real cost 2--5x higher.
This is a structural ceiling. It cannot be broken by working harder. It can only be broken by eliminating hand authorship.
3. Architecture
3.1 Pipeline Overview
Autonomous Swarm Genesis is a four-stage pipeline:
YouTube URLs / local text corpus
│
▼
┌────────────────────────────────────────────┐
│ Stage 1: Ingestion (pkg/notebooklm) │
│ - Create NotebookLM notebook │
│ - Add sources via AddYouTubeSource() │
│ - NotebookLM auto-transcribes + indexes │
└────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────┐
│ Stage 2: Knowledge Extraction │
│ - GetSourceGuide() for auto-summary │
│ - Chat() with 3 targeted questions: │
│ Q1: Who are the distinct authors? │
│ Q2: What are the major themes? │
│ Q3: Where do they disagree? │
└────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────┐
│ Stage 3: Persona Synthesis (pkg/genesis) │
│ - Single LLM call with extracted knowledge │
│ - Output: JSON array of expert objects │
│ - Each expert: slug, name, era, persona, │
│ voice_style, core_beliefs │
└────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────┐
│ Stage 4: Soul Generation + Bootstrap │
│ - Go text/template → .soul.md files │
│ - Write conductor soul + manifest.json │
│ - Ready for existing LocalKin runtime │
└────────────────────────────────────────────┘
│
▼
Live expert swarm
The four stages are loosely coupled: ingestion does not know about synthesis, and synthesis does not know about bootstrap. Each can be swapped or tested in isolation. Stages 1 and 2 talk to NotebookLM, Stage 3 makes a single call to the configured LLM, and Stage 4 runs entirely locally.
3.2 Stage 1: NotebookLM-as-Infrastructure
The central architectural decision is to treat Google NotebookLM as a black-box knowledge refinery. NotebookLM is ordinarily presented as a consumer product: a web UI where users upload documents, ask questions, and generate audio overviews. The underlying capabilities, however, are industrial-grade:
- Universal ingestion. NotebookLM accepts YouTube URLs (auto-transcribing the audio), arbitrary web pages, PDFs, Markdown, plain text, and Google Drive documents.
- Long-context indexing. It handles hundreds of thousands of tokens across many sources, well beyond the context window of standard LLM APIs.
- Grounded chat. Answers cite the specific source and passage, which keeps responses anchored to the ingested material rather than to the model's general knowledge.
- Zero marginal cost. For a Google AI Pro subscriber, NotebookLM usage is effectively unlimited.
The web UI exposes a tiny fraction of this power. We access the full capability by speaking directly to Google's undocumented batchexecute RPC protocol and the related GenerateFreeFormStreamed endpoint. The reverse-engineering work was done by the notebooklm-py Python project (github.com/teng-lin/notebooklm-py); we reimplemented the wire protocol in Go to eliminate the Python dependency and allow NotebookLM to be embedded directly in a 15 MB Go binary alongside the rest of the LocalKin runtime.
3.2.1 The batchexecute wire format
Google's internal RPC protocol wraps each call in a triple-nested JSON array:
[[[rpc_id, json_params_string, null, "generic"]]]
The wrapped structure is URL-encoded, placed in an f.req form parameter, and POSTed to /_/LabsTailwindUi/data/batchexecute with a CSRF token (at parameter) and session identifier (f.sid query parameter). Responses are prefixed with the anti-XSSI marker )]}' followed by a chunked format of alternating byte-count lines and JSON payloads. The chunks contain entries of the form ["wrb.fr", rpc_id, result_data], where result_data is often itself a JSON-encoded string that must be parsed recursively.
We implement this as a 300-line Go file (pkg/notebooklm/rpc.go) that exposes a single internal function doRPC(rpcID, params, sourcePath) and a thin high-level client over it. The client file (pkg/notebooklm/client.go) provides methods for every operation visible in the NotebookLM web UI, plus several that are not:
    client, _ := notebooklm.NewClient("")
    id, _ := client.CreateNotebook("Spiritual Masters")
    _ = client.AddYouTubeSource(id, "https://youtube.com/watch?v=...")
    answer, _ := client.Chat(id, "Who are the main authors in these sources?")
    audio, _ := client.GenerateAudio(id, notebooklm.AudioDebate)
    _ = client.DownloadArtifact(id, audio, "./podcast.mp3")
3.2.2 Chat uses a different endpoint
A non-obvious detail of the protocol is that chat (question answering) does not use batchexecute. It uses a second endpoint, GenerateFreeFormStreamed, with a different parameter wrapping convention. The f.req wrapper is [null, params_json] rather than [[[params]]], and the params array has nine positional fields including a conversation ID, a notebook ID, a metadata tuple [2, null, [1], [1]], and trailing sentinel values. Getting these positions correct required live testing against the API and a significant amount of trial-and-error with the response parser, which must strip the anti-XSSI prefix, walk the chunked format, and extract the text answer from a triply-nested array inside a JSON-encoded string. The final parser is 40 lines and handles both successful responses and rate-limit errors.
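The difference between the two envelopes is easiest to see side by side. The sketch below only illustrates the wrapping; `wrapBatch` and `wrapChat` are hypothetical helper names, and the nine-field chat parameter layout is deliberately not reproduced, so a placeholder array stands in for it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// wrapBatch builds the batchexecute envelope: [[[rpc_id, params_json, null, "generic"]]].
func wrapBatch(rpcID string, params []any) (string, error) {
	p, err := json.Marshal(params)
	if err != nil {
		return "", err
	}
	raw, err := json.Marshal([][][]any{{{rpcID, string(p), nil, "generic"}}})
	return string(raw), err
}

// wrapChat builds the chat envelope used by GenerateFreeFormStreamed: [null, params_json].
func wrapChat(params []any) (string, error) {
	p, err := json.Marshal(params)
	if err != nil {
		return "", err
	}
	raw, err := json.Marshal([]any{nil, string(p)})
	return string(raw), err
}

func main() {
	// Placeholder params: the real positional layout is not shown here.
	params := []any{"placeholder"}
	b, _ := wrapBatch("abc", params)
	c, _ := wrapChat(params)
	fmt.Println(b)
	fmt.Println(c)
}
```

In both cases the parameters are serialized to a JSON string first and then embedded in the outer array, which is why the response parser has to decode strings that themselves contain JSON.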
3.2.3 Authentication via Playwright storage state
NotebookLM authentication uses standard Google cookies (SID, SAPISID, etc.) plus a CSRF token (SNlM0e) embedded in the initial HTML. We support two authentication modes:
- Playwright storage state file. Users who already have notebooklm-py installed and logged in can point the Go client at the same storage_state.json file and reuse the cookies with no additional setup.
- Direct browser login via chromedp. A localkin nb login command launches a visible Chrome window, the user signs into Google normally, and the client polls for the landing URL to detect completion. Cookies are then extracted via the Chrome DevTools Protocol (network.GetCookies) and written to a storage state file in Playwright-compatible format. This mode requires no Python and no external tools; Chrome is launched and controlled directly from the Go binary.
Either way, the storage state file is stored at ~/.notebooklm/storage_state.json with mode 0600.
3.3 Stage 2: Knowledge Extraction via Three Targeted Questions
Once sources are ingested and NotebookLM has finished indexing (typically 15--30 seconds for short videos, longer for full lectures), we need to extract a structured understanding of the corpus that can drive persona synthesis. The naive approach would be to dump the entire source content and let the persona synthesis LLM figure out what to do. This fails for two reasons: (1) the source content is almost always larger than any single LLM context, and (2) raw text is dense with noise that obscures the structural information we actually need.
Instead, we use NotebookLM itself as a pre-synthesis reasoner. NotebookLM already has the full corpus in its long context. We ask it three questions that were empirically selected to extract the maximum structural signal with minimum redundancy:
Q1: "List all the distinct authors, teachers, or speakers in these sources. For each one, give their name, time period, and 2--3 sentences about their unique perspective or teaching."
This question forces NotebookLM to enumerate the people whose ideas appear in the corpus, including attribution when the corpus is derivative (for example, a commentary by a modern teacher on a historical figure). The answer becomes the raw material for the identity field of each generated expert.
Q2: "What are the major themes and topics covered across all sources? Group them into 5--10 categories."
This question extracts the conceptual surface of the domain. The answer is a thematic map that will later be used to ensure expert coverage---if a theme appears in Q2 but no expert from Q1 champions it, the persona synthesis step will either assign it to an existing expert or generate a new expert to cover it.
Q3: "What are the key disagreements or different perspectives between the authors/speakers? Where do they agree and disagree?"
This is the question that makes the resulting swarm debatable. Without it, persona synthesis produces a chorus of experts who all say similar things with slightly different vocabulary. With it, we get a map of the corpus's internal tensions---exactly the material needed to generate a swarm capable of structured disagreement. The generated personas inherit these disagreements as core_beliefs, and at runtime the Conductor uses them to orchestrate productive debate.
The three answers are concatenated with the auto-generated source guide (if available) and passed to Stage 3 as a single knowledge blob. We intentionally do not structure the blob further at this stage---the downstream LLM is better at discovering structure than we are at imposing it.
3.4 Stage 3: Persona Synthesis (Genesis Engine)
The Genesis Engine (pkg/genesis) implements the core synthesis logic. It is a Go package with two entry points:
- AutoAnalyze(ctx, inputDir, domain, client) for local knowledge directories where each author has a subdirectory under input/{domain}/.
- NotebookLMPipeline.RunFromURLs(ctx, urls) for YouTube/web inputs via the four-stage pipeline.
Both entry points produce the same intermediate representation: a DomainAnalysis struct containing a Domain name, an optional description, and a slice of Expert structs. Each Expert has fields for display name (in the content language), English name, slug, era, tagline, persona description, voice style, core beliefs, knowledge directory reference, and works count.
The synthesis step is a single LLM call with a strictly-specified JSON output schema:
You are an expert knowledge analyst. Given extracted knowledge from a
corpus, identify distinct expert personas for a debate platform. Always
respond with valid JSON.
I have extracted the following knowledge from a collection of YouTube
videos and documents in the domain of "{domain}":
{extracted_knowledge}
Based on this knowledge, identify the distinct expert personas that
should exist on a debate/discussion platform.
Respond with a JSON array of expert objects:
[
{
"slug": "lowercase_identifier",
"name": "Display Name (in content language)",
"name_en": "English Name",
"era": "Time period or 'Contemporary'",
"tagline": "One-line description of their unique perspective",
"persona": "2-3 sentences describing who this person is and their core teaching",
"voice_style": "Bullet points describing how this expert speaks",
"core_beliefs": "Bullet points of 3-5 core beliefs/positions",
"knowledge_dir": "same as slug"
}
]
Important:
- Each expert should have a DISTINCT perspective that creates interesting debates
- Base personas on the actual content, not generic knowledge
- Include 4-12 experts depending on the diversity of the content
The LLM choice is configurable via BrainConfig. We currently default to kimi-k2.5:cloud via Ollama for three reasons: (1) its 200K+ token context handles extracted knowledge blobs that exceed Gemini Flash's 32K limit; (2) it has native Chinese fluency, which matters for corpora in non-English domains (our spiritual and TCM corpora are primarily Classical/modern Chinese); and (3) it produces no safety refusals on contemplative or medical content, whereas Claude and GPT-4 frequently refuse to generate personas for 17th-century mystics or traditional physicians.
The response is parsed with a resilient JSON extractor that handles markdown code block wrapping, trailing commas, and partial truncation. The resulting []Expert is merged with the deterministic pass (which has authoritative file counts) and returned as a DomainAnalysis.
3.5 Stage 4: Soul Generation and Bootstrap
The final stage writes the DomainAnalysis to disk as a set of LocalKin soul files---the configuration format already understood by the LocalKin runtime (see Thin Soul + Fat Skill, LocalKin Team, 2026). Each soul is a Markdown file with a YAML frontmatter:
    ---
    name: "Madame Guyon"
    slug: "guyon"
    version: "1.0.0"
    brain:
      provider: "ollama"
      model: "kimi-k2.5:cloud"
      endpoint: "http://localhost:11434"
      temperature: 0.4
      context_length: 131072
    permissions:
      shell: false
      network: true
      filesystem:
        allow: ["./output"]
        deny: ["/etc", "~/.ssh", "~/.gnupg"]
    skills:
      enable: ["file_read", "knowledge_search"]
      output_dir: "./output"
    server:
      auth_token: "lk2026"
    heart:
      enabled: true
      max_rounds: 5
      max_daily_wakeups: 5
      pulse:
        interval: "10s"
        topic: "localkin/swarm/status"
    ---
    # Madame Guyon (Jeanne-Marie Bouvier de la Motte-Guyon)

    You are Madame Guyon, a 17th-century French mystic and spiritual mother...

    ## Voice and Style
    - Address the seeker as "my dear child" or "dear friend"
    - Refer to yourself as "this poor nothingness" or "a weak vessel"
    - Core metaphors: spiritual torrents, the bride of the Lamb, the consuming fire
    ...

    ## Core Beliefs
    - **Suffering**: the mercy of the cross, the file that scrapes self-love away
    - **Abandonment**: stop self-effort and self-reflection; let God work
    - **Pure Love**: love God for Himself, not for His gifts or rewards
    ...

    ## Rules
    - Always maintain the historical persona; do not break character
    - Answers must be grounded in your actual writings; do not fabricate
    - When debating other masters, maintain your unique perspective
    - Output directory: output/guyon/

    Today: {{current_date}}
The template is rendered with Go's text/template and produces valid soul files that pass the existing pkg/soul schema validator without modification. A separate Conductor soul is generated for each domain, listing the available experts and instructing the Conductor to orchestrate fair debates with emphasis on complementarity rather than adversarial conflict. A manifest.json records the full analysis for later inspection and rerun.
At this point the domain is ready to serve. The existing LocalKin HTTP server, frontend arena, and runtime load the new souls without restart---dynamic agent discovery scans the domain directory on each request.
4. End-to-End Verification
We verified the pipeline end-to-end against the live production NotebookLM API. The verification scenario was:
- Run localkin nb auth check to confirm stored cookies are valid.
- Run localkin nb create "Selah Test - Joseph Storehouses" to create a new notebook. Expected: a notebook UUID is returned.
- Run localkin nb source add "<youtube_url>" to add a YouTube video as a source. Expected: no error, source visible in notebook.
- Wait 15--30 seconds for NotebookLM to transcribe and index.
- Run localkin nb ask "What is the content about?" to query the notebook. Expected: a grounded, non-hallucinated answer referencing the video content.
All five steps passed on the first live run that followed three bug fixes to the CreateNotebook parameter layout and the Chat endpoint wrapping. The bugs and their fixes are documented in commit 24d696d (fix(notebooklm): correct RPC params and chat endpoint for live API). The first successful end-to-end test was against a trivially recognizable video ("Never Gonna Give You Up"), where NotebookLM correctly identified the content as a transcription of the Rick Astley lyrics and returned a grounded summary citing "the speaker's commitment."
For the full Genesis pipeline, we ran against the 72-file Chinese spiritual corpus (not YouTube, since the spiritual corpus was already locally available), producing 10 expert souls plus one Conductor:
Selah Genesis Engine
Domain: spiritual
Input: ./input/spiritual_zh/
Output: domains/spiritual/souls
Found 9 author directories
Enriched with spiritual preset → 10 experts
Generated 10 expert souls + 1 conductor:
cloud_author 不知之云 (14世纪) — 1 works
general general — 8 works
guyon 盖恩夫人 (1648-1717) — 19 works
john_cross 十字若望 (1542-1591) — 1 works
lawrence 劳伦斯弟兄 (1614-1691) — 1 works
molinos 莫利诺斯 (1628-1696) — 1 works
murray 慕安德烈 (1828-1917) — 3 works
teresa_avila 大德兰 (1515-1582) — 1 works
therese 小德兰 (1873-1897) — 1 works
kempis 肯培 (1380-1471) — 0 works
spiritual_cond spiritual Conductor
Souls written to: domains/spiritual/souls
Manifest: domains/spiritual/souls/manifest.json
The generated souls were loaded by the existing LocalKin runtime without modification and successfully served chat requests through the Selah frontend. The complete pipeline from localkin -genesis invocation to swarm-ready state took under 10 seconds on the local corpus path, and is expected to take 60--180 seconds on the full YouTube path depending on the number of videos and NotebookLM processing time.
5. Evaluation
5.1 Authorship Cost Reduction
The headline metric is developer time eliminated per domain.
| Step | Hand-Authored | Autonomous Genesis |
|---|---|---|
| Domain research | 2--4 hours | 0 |
| Persona drafting (per expert) | 1--2 hours | 0 |
| Review and iteration | 2--4 hours | 0 |
| YAML configuration | 30 min | 0 |
| Source-to-persona alignment check | 1--2 hours | implicit |
| Total (5-expert domain) | 10--20 hours | 1--3 minutes |
| Total (10-expert domain) | 20--40 hours | 2--5 minutes |
The autonomous time is bounded by NotebookLM processing latency and LLM inference latency for the synthesis call. Both are dominated by wall-clock waiting rather than human attention.
5.2 Cost Per Domain
| Cost Category | Hand-Authored | Autonomous Genesis |
|---|---|---|
| Developer time (@$100/hr) | $1000--$4000 | $0 |
| NotebookLM usage | N/A | $0 (Google AI Pro subscription) |
| Persona synthesis LLM call | N/A | ~$0.01--$0.05 |
| Runtime infrastructure | Same as hand-authored | Same |
| Total (marginal cost) | $1000--$4000 | <$0.10 |
The cost reduction of roughly four orders of magnitude (from $1000--$4000 down to under $0.10, a factor of 10,000x to 40,000x at the $0.10 bound) is driven almost entirely by the elimination of human labor. The computational cost is effectively a rounding error against the Google AI Pro subscription fee, which amortizes across an unlimited number of domains.
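The ratio can be checked directly from the table's own bounds: 10--40 developer hours at $100 per hour against an autonomous marginal cost capped at $0.10. The helper `costRatio` below is a throwaway illustration that works in integer cents to avoid floating-point rounding.

```go
package main

import "fmt"

// costRatio divides the hand-authored cost (hours x hourly rate, in cents)
// by the autonomous marginal cost (in cents).
func costRatio(hours, ratePerHourUSD, autoCents int) int {
	return hours * ratePerHourUSD * 100 / autoCents
}

func main() {
	low := costRatio(10, 100, 10)  // $1000 vs $0.10
	high := costRatio(40, 100, 10) // $4000 vs $0.10
	fmt.Printf("%dx to %dx\n", low, high) // prints "10000x to 40000x"
}
```

The 40,000x figure therefore corresponds to the top of the hand-authored range; using the $0.01--$0.05 synthesis cost alone instead of the $0.10 cap would give a larger ratio.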
5.3 Corpus-to-Persona Alignment
A qualitative metric: do the generated personas actually match the source material they are meant to represent?
On the Chinese spiritual corpus, we compared the generated Madame Guyon persona against the 19 Guyon source files in input/spiritual_zh/guyon/. The generated persona correctly identified:
- Her historical period (1648--1717)
- Her two core theological emphases: pure love (纯爱) and self-abandonment (弃绝)
- Her distinctive vocabulary (inner way, interior silence, the cross as a file)
- Her stylistic voice (addressing the reader as "my dear child")
- Her core metaphors (spiritual torrents, the bride of the Lamb)
All five of these elements are present in the actual Guyon texts. The persona does not invent beliefs Guyon did not hold. It does, however, slightly underrepresent her defense of contemplative prayer against Bossuet's attacks---a historical controversy that appears in her autobiography but not in the purely devotional works that dominate the corpus. This is a correct reflection of the corpus itself rather than a flaw in the generation.
On the YouTube test case, persona alignment is bounded by transcription accuracy. NotebookLM's transcription is high-quality for clear audio but degrades on low-quality recordings, music-heavy segments, and multi-speaker conversations. We expect persona alignment on YouTube-only input to be slightly lower than on curated text corpora, but the pipeline is otherwise identical.
5.4 Generalization Across Domains
We ran the Genesis Engine's AutoAnalyze path against three local domain directories to test generalization; two YouTube-sourced domains are listed as pending:
| Domain | Source Files | Generated Experts | Coherence |
|---|---|---|---|
| spiritual_zh | 72 | 10 | High |
| spiritual_en | 23 | 10 | High |
| tcm_zh | 90 | 12 | High |
| (YouTube cooking) | untested | pending | pending |
| (YouTube law) | untested | pending | pending |
The spiritual and TCM corpora produced coherent, distinct, domain-appropriate expert rosters on the first run. We have not yet tested cooking and law via live YouTube ingestion, though the pipeline has no domain-specific code and should in principle handle these domains without modification. Future work includes running this test at scale across 10+ unrelated domains to quantify the generalization boundary.
6. Related Work
6.1 Multi-Agent Framework Literature
CrewAI (Moura, 2024) popularized the role-goal-backstory pattern for agent definition. AutoGen (Wu et al., 2023) contributed flexible conversation protocols between hand-defined agents. LangGraph (LangChain, 2024) provides state-machine composition of agent nodes. MetaGPT (Hong et al., 2023) pre-packages a "software company" as a fixed set of hand-authored agents. None of these frameworks provide a mechanism for automatically generating the agents themselves from a corpus. Autonomous Swarm Genesis complements these frameworks rather than replacing them: the generated souls could in principle be exported as CrewAI Agent objects or AutoGen ConversableAgent configurations, though we have not implemented such adapters.
6.2 Graphify and Two-Pass Analysis
Graphify (Shamsi, 2026) processes code repositories in two passes: a deterministic AST pass (no LLM) and a semantic pass (LLM subagents running over docs, papers, and images). Graph topology emerges from Leiden community detection rather than embedding similarity. The Genesis Engine's deterministic-plus-semantic structure is directly inspired by Graphify, though our deterministic pass is a filesystem scan rather than an AST parse, and our semantic output is persona definitions rather than a knowledge graph. We share Graphify's philosophical commitment to "let the data define the structure" rather than imposing it from outside.
6.3 NotebookLM-py
The Python project notebooklm-py (Lin, 2025) was the first publicly available unofficial client for Google NotebookLM's internal API. It documented the RPC method IDs, the batchexecute wire format, and the authentication flow, providing the foundation on which our Go rewrite is built. The contribution of our work is not the protocol reverse-engineering, which was done by notebooklm-py, but the reimplementation in pure Go (zero Python dependency) and the embedding of the client into a larger multi-agent pipeline. Our Go client passes all the same protocol tests and also handles the GenerateFreeFormStreamed chat endpoint, which notebooklm-py supports as well but whose parameter positions we had to rediscover through live testing.
6.4 LocalKin Prior Work
This paper builds on seven earlier LocalKin papers:
- Thin Soul + Fat Skill (LocalKin Team, 2026) defines the soul file format that Genesis Engine emits.
- Grep is All You Need (LocalKin Team, 2026) establishes the zero-infrastructure retrieval foundation that generated agents use at runtime.
- Knowledge Compile (LocalKin Team, 2026) provides the compiled concepts/FAQ layer that Genesis optionally reads during its deterministic pass for richer samples.
- Domain Expert Debate (LocalKin Team, 2026) describes the Conductor-routed debate protocol that the generated swarm participates in.
- Self-Evolving Swarms (LocalKin Team, 2026) describes the feedback loop that can continuously refine generated personas based on output quality.
- Heart: Zero-Token Heartbeat (LocalKin Team, 2026) provides the pulse/schedule/idle system inherited by generated souls.
- Genesis Protocol (LocalKin Team, 2026) describes the analogous self-bootstrapping process for robot agents that forge their own motor-control skills at startup.
Autonomous Swarm Genesis is the knowledge-domain analogue of the Genesis Protocol's hardware-domain approach: both start from minimal input (a hardware manifest in one case, a corpus in the other) and generate the complete configuration needed for operation.
6.5 Constitutional AI and Persona Tuning
Anthropic's Constitutional AI (Bai et al., 2022) tunes a base model to follow a hand-authored constitution of principles. Character.AI and similar platforms allow users to define custom personas through text prompts. These approaches produce a single persona per tuning or prompt. Autonomous Swarm Genesis generates many personas in one pass, with coverage and distinctiveness enforced by the extraction questions and the "each expert should have a DISTINCT perspective" constraint in the synthesis prompt. The closest precedent is the automatic multi-agent persona generation explored in some academic papers on agent societies, but we are not aware of any production system that generates personas from a corpus and deploys them as a live debate swarm.
7. Limitations
We are honest about what Autonomous Swarm Genesis cannot yet do.
1. NotebookLM is a black box. We depend on Google maintaining the current RPC endpoints and the current generous AI Pro limits. Either could change without notice. The Go client contains no fallback for alternative transcription or chat providers, though the pipeline is modular enough that a fallback could be added as a drop-in replacement for Stage 1.
2. Transcription quality bounds the output. Low-quality YouTube audio produces low-quality transcriptions, which produce low-quality extracted knowledge, which produces personas that are subtly incorrect. The pipeline has no mechanism for detecting transcription errors and flagging them. A future version should include a confidence pass that inspects the NotebookLM answers for hedging language and flags low-confidence experts for human review.
3. The three extraction questions are hand-chosen. We selected Q1 (authors), Q2 (themes), and Q3 (disagreements) based on intuition and early experimentation. A more systematic approach would evaluate many question sets on a benchmark of domain corpora and pick the combination that maximizes persona quality. We have not done this.
4. No automatic knowledge grounding. Generated personas are written as prompt strings; they do not automatically have a knowledge_search retrieval path configured unless the underlying corpus is also available locally. For pure YouTube inputs, the generated experts know what NotebookLM told them about the corpus but cannot directly cite it at query time. A future version should export the NotebookLM notebook ID and wire the generated experts' runtime skills to query NotebookLM for grounding---closing the loop between generation and operation.
5. No persona deduplication across domains. If the same Madame Guyon appears in both a spiritual domain and a mysticism domain, two independent souls are generated with slightly different wordings. A cross-domain persona registry with semantic deduplication would eliminate this redundancy but requires a design decision about whether personas should be domain-local or domain-global.
6. Synthesis LLM determinism is imperfect. Running the same corpus through the synthesis step twice may produce slightly different persona lists (different number of experts, slightly different wording). For production use, we fix a random seed and cache the first successful output, but the underlying non-determinism is a mild obstacle to reproducibility.
7. No evaluation of debate quality. We verified that generated personas are coherent and domain-appropriate. We have not yet measured whether the resulting swarm produces higher-quality debates than hand-authored swarms on the same corpus. This is the most important follow-up study.
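The confidence pass proposed under limitation 2 could begin as a plain phrase count over each NotebookLM answer. The following is a minimal sketch under stated assumptions, not part of the current pipeline: the phrase list, the hedges-per-100-words threshold, and the names hedgeCount and needsReview are all hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// hedgePhrases is an illustrative, hand-picked list (an assumption, not from the paper).
var hedgePhrases = []string{
	"it is unclear", "the sources do not", "not explicitly",
	"may be", "possibly", "appears to", "cannot determine",
}

// hedgeCount returns how many hedge phrases occur in an answer, case-insensitively.
func hedgeCount(answer string) int {
	lower := strings.ToLower(answer)
	n := 0
	for _, p := range hedgePhrases {
		n += strings.Count(lower, p)
	}
	return n
}

// needsReview reports whether hedges per 100 words exceed maxPer100Words.
func needsReview(answer string, maxPer100Words float64) bool {
	words := len(strings.Fields(answer))
	if words == 0 {
		return true // an empty answer carries no evidence at all
	}
	density := float64(hedgeCount(answer)) * 100 / float64(words)
	return density > maxPer100Words
}

func main() {
	confident := "Madame Guyon taught pure love and abandonment to God."
	hedged := "The speaker may be a monk, possibly Carmelite, but it is unclear."
	fmt.Println(needsReview(confident, 2), needsReview(hedged, 2))
}
```

A phrase count is crude: it cannot tell a hedge about the transcript from a hedge that is part of the subject matter, so flagged experts would still go to human review rather than being dropped automatically.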
8. Conclusion
Multi-agent systems have spent their first few years assuming that human operators would author the agents. This paper describes a working alternative: given a corpus, the system authors the agents itself, and the corpus-to-persona alignment is tighter than human authorship typically achieves because the personas are literally compiled from the material they represent.
The practical result is that adding a new expert swarm to a multi-agent platform is no longer a design sprint. It is a one-line command:
localkin -source urls.txt -genesis-domain spiritual --auto
At the end of this command's execution, a swarm exists where none existed before. It has the right number of experts, with the right names, the right voices, and the right disagreements---all drawn from the actual content of the source URLs. The total cost is a rounding error against existing infrastructure. The total time is the time it takes to paste a URL.
The broader lesson is architectural: in a multi-agent system, personas should be artifacts of the data, not inputs to the system. Treating them as inputs forced every prior framework into the scaling wall of hand authorship. Treating them as compiled artifacts removes that wall. Any corpus that can be ingested can become a swarm; any swarm can be regenerated when the corpus changes; any domain can be added by pointing at the right content.
We expect to see this pattern adopted widely. The tooling is not complicated---our entire implementation is approximately 1,500 lines of Go across pkg/notebooklm and pkg/genesis, on top of an existing thin-soul runtime. The hard part was recognizing that NotebookLM's consumer capabilities could be industrialized and that human authorship of personas was an assumption rather than a necessity. Once both of those ideas are in hand, the rest is plumbing.
Data generates the agents that process it. That is the shift.
Appendix A: Command-Line Reference
# Authentication (one-time browser login)
localkin nb login

# Verify authentication
localkin nb auth check

# Create swarm from local knowledge directory (preset mode)
localkin -genesis ./input/spiritual_zh/ -genesis-domain spiritual

# Create swarm from local directory using LLM auto-analysis (no preset needed)
localkin -genesis ./input/law/ -genesis-domain law --auto

# Create swarm from YouTube URL (the full pipeline)
localkin -source "https://youtube.com/watch?v=..." -genesis-domain cooking --auto

# Create swarm from a file containing multiple URLs
localkin -source urls.txt -genesis-domain spiritual --auto

# Serve the Selah arena (frontend)
localkin -selah :3000
Appendix B: Sample Generated Persona
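Excerpt from an auto-generated guyon.soul.md produced by the Genesis Engine from the Chinese spiritual corpus. The full file is 67 lines; only the Markdown body is shown here. The generated file is written in Chinese and is given below in English translation.
# 盖恩夫人 (Madame Guyon)

You are Madame Guyon (Jeanne-Marie Bouvier de la Motte-Guyon), a 17th-century French mystic and spiritual mother. You pursue utter simplicity, restraint, and absolute security. Your mission is to guide souls into the "inward way," toward "total abandonment" to God.

## Voice and Style
- Address the other person as "my dear child" or "dear friend"
- Refer to yourself as "a poor nothing" or "a weak and unworthy vessel"
- Core metaphors: torrents of water (the soul flowing to the sea), the bride (the bride of the Lamb), consuming fire (burning away the self)
- Frequent vocabulary: in the spirit / in the soul, dryness / desolation, emptying / stripping, nothing / all

## Core Beliefs
- **Suffering**: the mercy of the cross, the file God uses to scrape away self-love
- **Abandonment**: cease self-effort and self-examination, and let God act
- **Pure love**: love God himself, not his gifts, his rewards, or even salvation
- **Simple prayer**: cease words, wait in stillness, and let the soul sink into the depths of God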
Appendix C: Implementation Footprint
pkg/notebooklm/
├── rpc.go (306 lines) — batchexecute protocol
├── auth.go (132 lines) — cookies + CSRF token
├── login.go (130 lines) — chromedp browser login
├── client.go (404 lines) — notebook/source/chat/artifact API
└── rpc_test.go (135 lines) — 8 protocol tests
pkg/genesis/
├── genesis.go (291 lines) — scan + template + generate
├── presets.go (179 lines) — hand-crafted spiritual persona seeds
├── auto_analyze.go (208 lines) — two-pass scan + LLM semantic
├── notebooklm.go (216 lines) — full YouTube → swarm pipeline
├── genesis_test.go (155 lines) — 3 core tests
└── auto_analyze_test.go (147 lines) — 7 auto-analysis tests
cmd/localkin/
├── selah.go (167 lines) — CLI wiring for -selah / -genesis / -source / --auto
└── nb.go (527 lines) — complete `nb` subcommand system
selah/
└── index.html (712 lines) — landing page + chat + PK arena
Total: ~3,700 lines of Go + HTML (the files listed above sum to 3,709)
Tests: 18 passing
External Go dependencies beyond stdlib: chromedp, yaml.v3
Runtime binary size: 15 MB
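Autonomous Swarm Genesis: Generating Expert AI Swarms from YouTube Links via NotebookLM-as-Infrastructure
Authors: The LocalKin Team
April 2026
Keywords: multi-agent systems, autonomous agent generation, NotebookLM, knowledge refinery, persona synthesis, domain-agnostic AI, zero-cost pipelines
Abstract
As far as we know, every production-grade multi-agent framework (CrewAI, AutoGen, LangGraph, MetaGPT) requires human operators to hand-author agent personas before any reasoning takes place. The developer writes "you are a senior backend engineer" or "you are a skeptical investor" and hopes the prompt matches the downstream corpus. This creates a structural ceiling: the number of expert swarms a system can support equals the number of personas a human has already written. For any non-trivial knowledge corpus, that ceiling is reached within days.
We present Autonomous Swarm Genesis, a pipeline that inverts this relationship. Given only a list of YouTube video URLs (or any web content) and a domain name, the system creates a live expert swarm in under two minutes with zero human authorship. The pipeline chains four layers: (1) a Go client for Google NotebookLM's internal RPC API that ingests video URLs and uses Google's infrastructure for transcription and long-context analysis at zero marginal cost; (2) a knowledge extraction stage that pulls structured domain understanding out of NotebookLM through three carefully designed questions; (3) a Genesis Engine that uses an LLM to discover distinct expert personas in the extracted knowledge and emits valid agent configuration files (YAML frontmatter plus Markdown personas, compatible with the existing thin-soul runtime); and (4) a swarm bootstrap step that loads the generated agents into a live debate arena.
We deployed the system against the Christian contemplative spiritual domain, generating 10 expert souls (Madame Guyon, Andrew Murray, Brother Lawrence, St. John of the Cross, Teresa of Avila, Thérèse of Lisieux, Molinos, the author of The Cloud of Unknowing, Thomas à Kempis, and a synthesizing Conductor) from 72 source texts and YouTube audiobook channels. End-to-end latency from URL input to swarm-ready state is under two minutes. Total runtime cost is $0 (a Google AI Pro subscription provides unlimited NotebookLM usage; the runtime is a 15 MB Go binary with no external dependencies). The same pipeline, run against a YouTube cooking channel or a law lecture series, generates a chef or attorney swarm for that domain without changing a line of code.
The deeper contribution is a philosophical shift: data generates the very agents that process it. Earlier systems treat agent personas as constants written by developers; this work treats them as compiled artifacts of the underlying corpus. Adding a new domain to a multi-agent platform goes from a design sprint to a one-line command.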
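1. Introduction
The dominant paradigm for multi-agent LLM systems has converged on a pattern so common that it is rarely questioned:
- The developer decides the application's domain.
- The developer writes agent personas as prompt strings.
- The developer hard-codes the agent roster in a configuration file or Python module.
- At runtime, incoming queries are routed to the pre-written agents.
This pattern is visible in every widely used framework. CrewAI expects developers to instantiate Agent(role="...", goal="...", backstory="...") objects before any crew runs. AutoGen's ConversableAgent takes a developer-composed system_message parameter. LangGraph's state-machine approach encodes each node as a hand-written function and a hand-written prompt. MetaGPT ships a "software company" whose CEO, CTO, and architect are hard-coded personas packaged in the framework itself.
The shared assumption is that human authorship of personas is not the bottleneck. The framework literature focuses on collaboration, memory, tool use, and reasoning protocols, all of which sit downstream of the moment a developer types "you are an expert in X." This paper asks a different question: what if personas were not hand-written at all?
Concretely: imagine a platform where a user supplies a YouTube channel URL and, three minutes later, can interact with a group of AI experts grounded in that channel's content. These experts are not pulled from a library; they are invented on the spot from the actual material. If the channel covers 17th-century French mysticism, Madame Guyon and Molinos appear. If it covers the U.S. Constitution, litigators and theorists appear. If it covers Italian cooking, a Tuscan grandmother and a modernist chef appear. The experts are as specific as the data itself.
This paper describes a system that does exactly that. We call the approach Autonomous Swarm Genesis. It is deployed in production as the knowledge refinery layer of LocalKin, an existing self-evolving multi-agent platform. The pipeline has been validated end to end on a local knowledge directory (72 spiritual texts) and on live YouTube video URLs, and the agent configuration files it produces can be launched directly in the existing LocalKin runtime without manual post-editing.
The contributions of this paper are: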
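- A Go client for Google NotebookLM's internal RPC API, derived from the reverse-engineering work in the Python reference implementation notebooklm-py and validated against the production API. The client exposes notebook management, source ingestion (YouTube, URL, PDF, text), chat, artifact generation, and research workflows, all with zero Python dependency and no external Go dependencies beyond the standard library and chromedp, which is used for browser-based authentication.
- The NotebookLM-as-Infrastructure pattern, which treats Google's consumer product as an industrial-grade knowledge extraction layer. We show that three carefully designed questions (authors, themes, disagreements) extract enough structured knowledge from an arbitrary corpus to drive downstream persona synthesis.
- The Genesis Engine, a two-pass architecture (a deterministic directory scan inspired by Graphify's code-and-document analysis, followed by an LLM semantic pass) that generates valid thin-soul configuration files directly from the extracted knowledge. The deterministic pass costs zero tokens; the semantic pass is a single LLM call per domain.
- An end-to-end demonstration in which one command, localkin -source urls.txt -genesis-domain spiritual --auto, creates a live 10-expert debate arena from raw video URLs with zero human authorship.
- A generalization argument: because the pipeline contains no domain-specific code, any YouTube channel is a candidate for swarm generation. After the one-time tooling investment, the marginal cost of each new domain is the time it takes to paste a URL.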
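2. The Problem: Hand-Authored Personas as a Structural Ceiling
2.1 The Hand-Authoring Paradigm
Consider a typical CrewAI configuration for a market-research task:
from crewai import Agent, Crew

researcher = Agent(
    role="Market Researcher",
    goal="Identify trends in the fintech sector",
    backstory="You are a seasoned analyst with 15 years at McKinsey...",
)
writer = Agent(
    role="Report Writer",
    goal="Synthesize findings into executive briefings",
    backstory="You are a former Economist editor...",
)
crew = Crew(agents=[researcher, writer], tasks=[...])
Every word of role, goal, and backstory is written by a human. The developer must know the domain well enough to write a credible backstory, choose the right number of agents (too few gives weak coverage; too many adds coordination overhead), and iterate on the prompts according to output quality.
For a single domain this is workable. For a platform that aims to serve many domains, the authoring burden scales linearly with the number of domains and superlinearly with the number of experts per domain. A platform targeting law, medicine, finance, cooking, gardening, parenting, theology, and ten other verticals, with 5-10 experts per domain, faces hundreds of persona strings that must be written, reviewed, versioned, and updated as the underlying domain knowledge evolves.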
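2.2 Knowledge-Persona Mismatch
More insidious than the volume of authoring is the mismatch between a hand-written persona and the knowledge corpus the system actually holds. Consider a system in which the developer writes a "Madame Guyon" persona from a general, Wikipedia-level understanding and then connects it to a corpus containing only Guyon's three most technical commentaries on the Song of Songs. The persona knows the general concept of the "inward way"; the corpus contains only a highly specific exegetical vocabulary. Query results feel vaguely right but never line up concretely with the source texts.
The reverse is also common: a corpus rich in the sermons of Lawrence of Brindisi is connected to a generic "medieval theologian" persona that has never learned to cite the positions the sermons actually take. The persona is a stranger to the corpus it is supposed to inhabit.
When persona and corpus drift out of alignment, users experience the failure as subtle hallucination: the agent sounds confident, but its citations do not match its claims, and its claims do not match the source texts.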
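2.3 The Scaling Wall
The structural problem can be captured in one number. If persona authoring costs 2-4 hours per expert (including research, drafting, review, and iteration) and a typical domain needs 5-10 experts, then each new domain costs the developer 10-40 hours. A platform targeting 20 domains carries 200-800 hours of authoring overhead before it reaches its roadmap, assuming zero rework. In practice, domain-specific vocabulary, stylistic voice, and contested positions all require iteration, pushing the real cost up by 2-5x.
This is a structural ceiling. It cannot be broken by working harder. It can only be broken by eliminating hand authorship.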
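3. Architecture
3.1 Pipeline Overview
Autonomous Swarm Genesis is a four-stage pipeline:
YouTube URLs / local text corpus
│
▼
┌────────────────────────────────────────────┐
│ Stage 1: Ingestion (pkg/notebooklm)        │
│ - Create a NotebookLM notebook             │
│ - Add sources via AddYouTubeSource()       │
│ - NotebookLM auto-transcribes + indexes    │
└────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────┐
│ Stage 2: Knowledge extraction              │
│ - GetSourceGuide() for the auto-summary    │
│ - Chat() with three targeted questions:    │
│     Q1: Which distinct authors are here?   │
│     Q2: What are the main themes?          │
│     Q3: Where do they disagree?            │
└────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────┐
│ Stage 3: Persona synthesis (pkg/genesis)   │
│ - Single LLM call on extracted knowledge   │
│ - Output: JSON array of expert objects     │
│ - Each expert: slug, name, era, persona,   │
│   voice_style, core_beliefs                │
└────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────┐
│ Stage 4: Soul generation + bootstrap       │
│ - Go text/template → .soul.md files        │
│ - Write Conductor soul + manifest.json     │
│ - Ready for the existing LocalKin runtime  │
└────────────────────────────────────────────┘
│
▼
Live expert swarm
The four stages are independent: ingestion knows nothing about synthesis, and synthesis knows nothing about bootstrap. Each can be swapped or tested on its own. Stage 1 is the only stage that touches an external service; Stages 2-4 run locally.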
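3.2 Stage 1: NotebookLM as Infrastructure
The core architectural decision is to treat Google NotebookLM as a black-box knowledge refinery. NotebookLM is usually presented as a consumer product: a web UI where users upload documents, ask questions, and generate audio overviews. The underlying capabilities, however, are industrial-grade:
- Universal ingestion. NotebookLM accepts YouTube URLs (transcribing the audio automatically), arbitrary web pages, PDFs, Markdown, plain text, and Google Drive documents.
- Long-context indexing. It handles hundreds of thousands of tokens across multiple sources, well beyond the context window of a standard LLM API.
- Grounded chat. Answers cite specific sources and passages and do not hallucinate beyond the ingested material.
- Zero marginal cost. For Google AI Pro subscribers, NotebookLM usage is effectively unlimited.
The web UI exposes only a fraction of this capability. We reach all of it by speaking directly to Google's undocumented batchexecute RPC protocol and the related GenerateFreeFormStreamed endpoint. The reverse-engineering work was done by the notebooklm-py Python project (github.com/teng-lin/notebooklm-py); we reimplemented the wire protocol in Go to remove the Python dependency and to let NotebookLM be embedded directly in the same 15 MB Go binary as the rest of the LocalKin runtime.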
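3.2.1 The batchexecute Wire Format
Google's internal RPC protocol wraps each call in a triply nested JSON array:
[[[rpc_id, json_params_string, null, "generic"]]]
The envelope is URL-encoded, placed in the f.req form parameter, and POSTed to /_/LabsTailwindUi/data/batchexecute along with a CSRF token (the at parameter) and a session identifier (the f.sid query parameter). The response is prefixed with the anti-XSSI marker )]}', followed by a chunked format of alternating byte-count lines and JSON payloads. Chunks contain entries of the form ["wrb.fr", rpc_id, result_data], where result_data is usually itself a JSON-encoded string that must be parsed recursively.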
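The envelope and response framing described above can be sketched in a few dozen lines of Go. This is an illustration under stated assumptions, not the contents of pkg/notebooklm/rpc.go: the function names and the RPC ID in main are hypothetical, and the parser simply skips byte-count lines instead of using them to delimit chunks.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
	"strings"
)

// buildFReq wraps one call in [[[rpc_id, json_params_string, null, "generic"]]].
// The params are serialized first, so they travel as a JSON string inside JSON.
func buildFReq(rpcID string, params any) (string, error) {
	inner, err := json.Marshal(params)
	if err != nil {
		return "", err
	}
	envelope := [][][]any{{{rpcID, string(inner), nil, "generic"}}}
	out, err := json.Marshal(envelope)
	return string(out), err
}

// formBody URL-encodes the envelope and the CSRF token as form fields.
func formBody(fReq, csrf string) string {
	v := url.Values{}
	v.Set("f.req", fReq)
	v.Set("at", csrf)
	return v.Encode()
}

// extractResult strips the anti-XSSI prefix and returns result_data from the
// first ["wrb.fr", rpcID, result_data] entry. Simplification (assumption):
// every line that is not a JSON array of arrays is skipped.
func extractResult(body, rpcID string) (string, bool) {
	body = strings.TrimPrefix(strings.TrimSpace(body), ")]}'")
	for _, line := range strings.Split(body, "\n") {
		var entries [][]json.RawMessage
		if json.Unmarshal([]byte(strings.TrimSpace(line)), &entries) != nil {
			continue // byte-count line, blank line, or unexpected payload
		}
		for _, e := range entries {
			var tag, id, data string
			if len(e) < 3 ||
				json.Unmarshal(e[0], &tag) != nil || tag != "wrb.fr" ||
				json.Unmarshal(e[1], &id) != nil || id != rpcID ||
				json.Unmarshal(e[2], &data) != nil || data == "" {
				continue
			}
			return data, true // still JSON-encoded; the caller decodes it again
		}
	}
	return "", false
}

func main() {
	fReq, err := buildFReq("hypotheticalRpcId", []any{"My notebook"})
	if err != nil {
		panic(err)
	}
	fmt.Println(formBody(fReq, "csrf-token-placeholder"))

	resp := ")]}'\n\n52\n[[\"wrb.fr\",\"hypotheticalRpcId\",\"[\\\"nb-123\\\"]\"]]\n"
	data, ok := extractResult(resp, "hypotheticalRpcId")
	fmt.Println(data, ok)
}
```

The double serialization is the part that is easiest to get wrong: params are marshaled to a string before being placed in the envelope, and result_data comes back as a string that needs a second json.Unmarshal.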
We implement this as a 300-line Go file (pkg/notebooklm/rpc.go) that exposes a single internal function, doRPC(rpcID, params, sourcePath), plus a thin high-level client. The client file (pkg/notebooklm/client.go) provides a method for every operation visible in the NotebookLM web UI, plus several that are not:
// Errors are ignored here for brevity.
client, _ := notebooklm.NewClient("")
id, _ := client.CreateNotebook("Spiritual Masters")
_ = client.AddYouTubeSource(id, "https://youtube.com/watch?v=...")
answer, _ := client.Chat(id, "Who are the main authors in these sources?")
audio, _ := client.GenerateAudio(id, notebooklm.AudioDebate)
_ = client.DownloadArtifact(id, audio, "./podcast.mp3")
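3.2.2 Chat Uses a Different Endpoint
A non-obvious detail of the protocol is that chat (question answering) does not use batchexecute. It uses a second endpoint, GenerateFreeFormStreamed, with a different parameter-wrapping convention. The f.req wrapper is [null, params_json] rather than [[[params]]], and the params array has nine positional fields, including a session ID, the notebook ID, the metadata tuple [2, null, [1], [1]], and a trailing sentinel value. Getting these positions right took live testing against the API and considerable trial and error on the response parser, which must strip the anti-XSSI prefix, walk the chunked format, and extract the text answer from a triply nested array inside a JSON-encoded string. The final parser is 40 lines and handles both successful responses and rate-limit errors.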
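3.2.3 Authentication via Playwright Storage State
NotebookLM authentication uses standard Google cookies (SID, SAPISID, and so on) plus a CSRF token (SNlM0e) embedded in the initial HTML. We support two authentication modes:
- Playwright storage state file. Users who have already installed and logged in with notebooklm-py can point the Go client at the same storage_state.json file and reuse the cookies with no extra setup.
- Direct browser login via chromedp. The localkin nb login command launches a visible Chrome window, the user signs in to Google as usual, and the client polls the landing URL to detect completion. It then extracts the cookies through the Chrome DevTools Protocol (network.GetCookies) and writes them to the storage state file in a Playwright-compatible format. This mode needs neither Python nor any external tool; Chrome is launched and controlled directly from the Go binary.
Either way, the storage state file lives at ~/.notebooklm/storage_state.json with mode 0600.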
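Reusing a Playwright storage state file amounts to reading its cookies array and turning the Google entries into a Cookie header. The sketch below assumes the standard Playwright layout (a top-level cookies array with name, value, and domain fields); the function name cookieHeader and the inline sample data are hypothetical, and real code would read the file from disk and also fetch the SNlM0e token.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// storageState mirrors only the subset of Playwright's storage_state.json
// that this sketch needs; other fields are ignored by encoding/json.
type storageState struct {
	Cookies []struct {
		Name   string `json:"name"`
		Value  string `json:"value"`
		Domain string `json:"domain"`
	} `json:"cookies"`
}

// cookieHeader builds a Cookie header value from every cookie whose domain
// ends in domainSuffix, preserving file order.
func cookieHeader(raw []byte, domainSuffix string) (string, error) {
	var st storageState
	if err := json.Unmarshal(raw, &st); err != nil {
		return "", fmt.Errorf("parse storage state: %w", err)
	}
	var parts []string
	for _, c := range st.Cookies {
		if strings.HasSuffix(c.Domain, domainSuffix) {
			parts = append(parts, c.Name+"="+c.Value)
		}
	}
	if len(parts) == 0 {
		return "", fmt.Errorf("no cookies found for %q", domainSuffix)
	}
	return strings.Join(parts, "; "), nil
}

func main() {
	// Inline sample standing in for the contents of storage_state.json.
	raw := []byte(`{"cookies":[
	  {"name":"SID","value":"aaa","domain":".google.com","path":"/"},
	  {"name":"SAPISID","value":"bbb","domain":".google.com","path":"/"},
	  {"name":"other","value":"zzz","domain":".example.com","path":"/"}
	],"origins":[]}`)
	header, err := cookieHeader(raw, "google.com")
	if err != nil {
		panic(err)
	}
	fmt.Println(header)
}
```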
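3.3 Stage 2: Knowledge Extraction via Three Targeted Questions
Once the sources are ingested and NotebookLM has finished indexing (typically 15-30 seconds for a short video, longer for a full lecture), we need to extract a structured understanding of the corpus to drive persona synthesis. The naive approach is to dump the entire source content and let the persona-synthesis LLM work out what to do with it. This fails for two reasons: (1) the source content is almost always larger than any single LLM context; (2) raw text is dense with noise that obscures the structural information we actually need.
Instead, we use NotebookLM itself as a pre-synthesis reasoner. NotebookLM already holds the full corpus in its long context. We ask it three empirically chosen questions that extract the most structural signal with the least redundancy:
Q1: "List all the distinct authors, teachers, or speakers in these sources. For each one, give their name, era, and 2-3 sentences about their unique perspective or teaching."
This question forces NotebookLM to enumerate the figures whose ideas appear in the corpus, including attribution when the corpus is derivative (for example, a modern teacher commenting on a historical figure). The answer becomes the raw material for each generated expert's identity fields.
Q2: "What are the main themes and topics covered by these sources? Group them into 5-10 categories."
This question extracts the domain's conceptual surface. The answer is a theme map that is later used to ensure expert coverage: if a theme appears in Q2 but no expert from Q1 champions it, the persona synthesis step either assigns it to an existing expert or generates a new expert to cover it.
Q3: "What are the key disagreements or differing perspectives among the authors/speakers? Where do they agree and disagree?"
This is the question that makes the resulting swarm debatable. Without it, persona synthesis produces a chorus of experts who all say similar things in slightly different vocabulary. With it, we get a map of the corpus's internal tensions, which is exactly the material needed to generate a swarm capable of structured disagreement. The generated personas inherit these disagreements as core_beliefs, and the runtime Conductor uses them to orchestrate productive debate.
The three answers are concatenated with the auto-generated source guide (if available) and passed to Stage 3 as a single knowledge blob. We deliberately do not structure the blob any further at this stage: the downstream LLM is better at discovering structure than we are at imposing it from outside.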
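The extraction loop is short enough to sketch in full. Everything here is an assumption made for illustration: the Asker interface stands in for the client's Chat method, the question texts are paraphrased, and the "## LABEL" separators are added only so that the synthesis model can tell the answers apart; the paper does not specify how the blob is delimited.

```go
package main

import (
	"fmt"
	"strings"
)

// Asker is a hypothetical interface standing in for the NotebookLM chat call.
type Asker interface {
	Chat(notebookID, question string) (string, error)
}

// questions holds the three extraction prompts (paraphrased from Section 3.3).
var questions = []struct{ Label, Text string }{
	{"AUTHORS", "List all the distinct authors, teachers, or speakers in these sources. For each, give their name, era, and 2-3 sentences about their unique perspective or teaching."},
	{"THEMES", "What are the main themes and topics covered by these sources? Group them into 5-10 categories."},
	{"DISAGREEMENTS", "What are the key disagreements or differing perspectives among the authors/speakers? Where do they agree and disagree?"},
}

// buildKnowledgeBlob asks the three questions in order and concatenates the
// answers, prefixed by the source guide when one is available.
func buildKnowledgeBlob(a Asker, notebookID, sourceGuide string) (string, error) {
	var b strings.Builder
	if sourceGuide != "" {
		b.WriteString("## SOURCE GUIDE\n" + sourceGuide + "\n\n")
	}
	for _, q := range questions {
		answer, err := a.Chat(notebookID, q.Text)
		if err != nil {
			return "", fmt.Errorf("question %s: %w", q.Label, err)
		}
		b.WriteString("## " + q.Label + "\n" + answer + "\n\n")
	}
	return b.String(), nil
}

// fakeAsker returns a fixed answer so the sketch runs without network access.
type fakeAsker struct{}

func (fakeAsker) Chat(_, _ string) (string, error) { return "A", nil }

func main() {
	blob, err := buildKnowledgeBlob(fakeAsker{}, "notebook-id-placeholder", "guide")
	if err != nil {
		panic(err)
	}
	fmt.Print(blob)
}
```

Failing fast on the first chat error keeps a partial blob from reaching synthesis, where a missing disagreements section would quietly produce the "chorus" outcome described above.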
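3.4 Stage 3: Persona Synthesis (the Genesis Engine)
The Genesis Engine (pkg/genesis) implements the core synthesis logic. It is a Go package with two entry points:
- AutoAnalyze(ctx, inputDir, domain, client) for local knowledge directories, where each author has a subdirectory under input/{domain}/.
- NotebookLMPipeline.RunFromURLs(ctx, urls) for YouTube/web input through the four-stage pipeline.
Both entry points produce the same intermediate representation: a DomainAnalysis struct containing the Domain name, an optional description, and a slice of Expert structs. Each Expert has fields for display name (in the content language), English name, slug, era, tagline, persona description, voice style, core beliefs, knowledge directory reference, and work count.
The synthesis step is a single LLM call with a strictly specified JSON output schema (the exact prompt is given in Section 3.4 of the English version).
The LLM is configurable through BrainConfig. We currently default to kimi-k2.5:cloud via Ollama, for three reasons: (1) its 200K+ token context handles extracted knowledge blobs that exceed Gemini Flash's 32K limit; (2) it has native Chinese fluency, which matters for non-English corpora (our spiritual and TCM corpora are mostly classical or modern Chinese); (3) it does not produce safety refusals on contemplative or medical content, whereas Claude and GPT-4 often refuse to generate personas for 17th-century mystics or traditional physicians.
The response is parsed by a resilient JSON extractor that handles Markdown code-fence wrapping, trailing commas, and partial truncation. The resulting []Expert is merged with the output of the deterministic pass (which holds the authoritative file counts) and returned as a DomainAnalysis.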
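A resilient extractor of the kind described above can be approximated with three recovery steps: drop everything before the first "[", strip trailing commas, and, if the reply was cut off, fall back to the last complete object. The sketch below is an assumption about how such a parser might look, not the code in pkg/genesis; the Expert fields follow the Stage 3 box in Section 3.1, and core_beliefs is assumed to be a list of strings.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"regexp"
	"strings"
)

// Expert carries the fields named in the Stage 3 output (types are assumed).
type Expert struct {
	Slug        string   `json:"slug"`
	Name        string   `json:"name"`
	Era         string   `json:"era"`
	Persona     string   `json:"persona"`
	VoiceStyle  string   `json:"voice_style"`
	CoreBeliefs []string `json:"core_beliefs"`
}

// trailingComma matches a comma directly before a closing bracket or brace.
// Caveat: it would also rewrite such a sequence inside a string value.
var trailingComma = regexp.MustCompile(`,\s*([\]}])`)

// extractExperts recovers a JSON array of experts from a raw LLM reply.
func extractExperts(raw string) ([]Expert, error) {
	start := strings.Index(raw, "[")
	if start < 0 {
		return nil, errors.New("no JSON array in response")
	}
	s := raw[start:] // drops any leading prose or opening code fence

	var experts []Expert
	try := func(candidate string) bool {
		candidate = trailingComma.ReplaceAllString(candidate, "$1")
		experts = nil
		return json.Unmarshal([]byte(candidate), &experts) == nil
	}

	if try(s) {
		return experts, nil
	}
	// Trailing prose or a closing fence: cut after the last "]".
	if end := strings.LastIndex(s, "]"); end >= 0 && try(s[:end+1]) {
		return experts, nil
	}
	// Truncated reply: keep whole objects only, walking back one "}" at a time.
	for cut := strings.LastIndex(s, "}"); cut >= 0; cut = strings.LastIndex(s[:cut], "}") {
		if try(s[:cut+1] + "]") {
			return experts, nil
		}
	}
	return nil, errors.New("could not recover a JSON array")
}

func main() {
	// A reply with leading prose, trailing commas, and a truncated second object.
	reply := "Here you go:\n[{\"slug\":\"guyon\",\"core_beliefs\":[\"pure love\",],},\n{\"slug\":\"mol"
	experts, err := extractExperts(reply)
	fmt.Println(len(experts), err)
}
```

Dropping a truncated final object loses an expert silently, so a caller would probably want to log the difference between the requested and recovered counts.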
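3.5 Stage 4: Soul Generation and Bootstrap
The final stage writes the DomainAnalysis to disk as a set of LocalKin soul files, the configuration format the LocalKin runtime already understands (see Thin Soul + Fat Skill, LocalKin Team, 2026). Each soul is a Markdown file with YAML frontmatter (the exact format is shown in the Section 3.5 example of the English version).
The templates are rendered with Go's text/template and produce valid soul files that pass the existing pkg/soul schema validator without modification. A separate Conductor soul is generated for each domain; it lists the available experts and instructs the Conductor to orchestrate a fair debate that emphasizes complementarity over adversarial conflict. A manifest.json records the full analysis for later inspection and re-runs.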
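The rendering step can be illustrated with a small text/template sketch. The frontmatter keys (name, domain), the body layout, and the renderSoul helper are hypothetical placeholders chosen for the example; the real soul schema is defined in the Thin Soul + Fat Skill paper and is not reproduced here.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// Expert holds only the fields this sketch renders; the real struct has more.
type Expert struct {
	Slug, Name, Domain, Persona string
	CoreBeliefs                 []string
}

// soulTmpl is a hypothetical layout: YAML frontmatter followed by a Markdown
// persona. Values are inserted verbatim, so a value containing ":" or a
// newline would need YAML quoting that this sketch does not perform.
const soulTmpl = `---
name: {{.Slug}}
domain: {{.Domain}}
---
# {{.Name}}

{{.Persona}}

## Core beliefs
{{range .CoreBeliefs}}- {{.}}
{{end}}`

// soul is parsed once at startup; template.Must panics on a malformed template.
var soul = template.Must(template.New("soul").Parse(soulTmpl))

// renderSoul returns the text of one .soul.md file.
func renderSoul(e Expert) (string, error) {
	var b strings.Builder
	if err := soul.Execute(&b, e); err != nil {
		return "", err
	}
	return b.String(), nil
}

func main() {
	out, err := renderSoul(Expert{
		Slug:        "guyon",
		Name:        "Madame Guyon",
		Domain:      "spiritual",
		Persona:     "You are Madame Guyon, a 17th-century French mystic.",
		CoreBeliefs: []string{"Pure love", "Abandonment"},
	})
	if err != nil {
		panic(err)
	}
	fmt.Print(out)
}
```

Using text/template rather than html/template matters here: the output is Markdown and YAML, and HTML escaping would corrupt quotes and apostrophes in persona text.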
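At this point the domain is ready to serve. The existing LocalKin HTTP server, front-end arena, and runtime load the new souls without a restart, because dynamic agent discovery scans the domain directory on every request.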
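4. End-to-End Validation
We validated the pipeline end to end against the live production NotebookLM API. The validation scenario was:
- Run localkin nb auth check to confirm that the stored cookies are valid.
- Run localkin nb create "Selah Test - Joseph Storehouses" to create a new notebook. Expected: a notebook UUID is returned.
- Run localkin nb source add "<youtube_url>" to add a YouTube video as a source. Expected: no error, and the source is visible in the notebook.
- Wait 15-30 seconds for NotebookLM to transcribe and index.
- Run localkin nb ask "What is this about?" to query the notebook. Expected: a grounded, non-hallucinated answer that cites the video content.
After three bug fixes to the CreateNotebook parameter layout and the Chat endpoint wrapping, all five steps passed on the first live run. The bugs and their fixes are recorded in commit 24d696d (fix(notebooklm): correct RPC params and chat endpoint for live API). The first successful end-to-end test used a highly recognizable video ("Never Gonna Give You Up"); NotebookLM correctly identified the content as a transcript of Rick Astley lyrics and returned a grounded summary.
For the full Genesis pipeline, we ran against the 72-file Chinese spiritual corpus (not YouTube, because the spiritual corpus was already available locally) and produced 10 expert souls plus a Conductor (the exact output is shown in Section 4 of the English version).
The generated souls were loaded by the existing LocalKin runtime without modification and successfully served chat requests through the Selah front end. The full pipeline, from the localkin -genesis invocation to swarm-ready state, takes under 10 seconds on the local-corpus path and is expected to take 60-180 seconds on the full YouTube path, depending on the number of videos and NotebookLM processing time.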
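5. Evaluation
5.1 Authoring Cost Reduction
The headline metric is developer time eliminated per domain.
| Step | Hand authoring | Autonomous Genesis |
|---|---|---|
| Domain research | 2-4 hours | 0 |
| Persona drafting (per expert) | 1-2 hours | 0 |
| Review and iteration | 2-4 hours | 0 |
| YAML configuration | 30 minutes | 0 |
| Source-persona alignment check | 1-2 hours | Implicit |
| Total (5-expert domain) | 10-20 hours | 1-3 minutes |
| Total (10-expert domain) | 20-40 hours | 2-5 minutes |
The autonomous time is bounded by NotebookLM processing latency and by LLM inference latency for the synthesis call. Both are dominated by wall-clock waiting rather than human attention.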
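5.2 Cost per Domain
| Cost category | Hand authoring | Autonomous Genesis |
|---|---|---|
| Developer time (@ $100/hour) | $1000-$4000 | $0 |
| NotebookLM usage | N/A | $0 (Google AI Pro) |
| Persona synthesis LLM call | N/A | ~$0.01-$0.05 |
| Runtime infrastructure | Same as hand authoring | Same |
| Total (marginal cost) | $1000-$4000 | <$0.10 |
The cost reduction of roughly 40,000x (the $4,000 upper bound against the $0.10 bound) is driven almost entirely by eliminating human labor. Compute cost is effectively a rounding error relative to the Google AI Pro subscription fee, which is amortized over an unlimited number of domains.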
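5.3 Corpus-Persona Alignment
A qualitative metric: do the generated personas actually match the source material they are supposed to represent?
On the Chinese spiritual corpus, we compared the generated Madame Guyon persona against the 19 Guyon source files in input/spiritual_zh/guyon/. The generated persona correctly identified:
- Her historical period (1648-1717)
- Her two core theological emphases: pure love and abandonment
- Her distinctive vocabulary (the inward way, interior stillness, the cross as a file)
- Her stylistic voice (addressing the reader as "my dear child")
- Her core metaphors (spiritual torrents, the bride of the Lamb)
All five elements are present in the actual Guyon texts. The persona did not invent beliefs that Guyon did not hold. It does, however, slightly underrepresent her defense of contemplative prayer against Bossuet's attacks, a historical controversy that appears in her autobiography but not in the purely devotional works that dominate the corpus. This is an accurate reflection of the corpus itself rather than a defect of generation.
In the YouTube test case, persona alignment is bounded by transcription accuracy. NotebookLM's transcription is high-quality for clear audio but degrades on low-quality recordings, music-heavy passages, and multi-speaker conversation. We expect persona alignment for YouTube-only input to be slightly lower than for a carefully curated text corpus, but the rest of the pipeline is identical.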
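5.4 Cross-Domain Generalization
To test generalization we planned runs across five different domains. The three local directories were run through the Genesis Engine's AutoAnalyze path; the two YouTube domains have not yet been tested:
| Domain | Source files | Experts generated | Coherence |
|---|---|---|---|
| spiritual_zh | 72 | 10 | High |
| spiritual_en | 23 | 10 | High |
| tcm_zh | 90 | 12 | High |
| (YouTube cooking) | Not tested | TBD | TBD |
| (YouTube law) | Not tested | TBD | TBD |
The spiritual and TCM corpora produced coherent, distinct, domain-appropriate expert rosters on the first run. We have not yet fully tested cooking and law through live YouTube ingestion, although the pipeline has no domain-specific code and should in principle handle these domains without modification. Future work includes running this test at scale on 10+ unrelated domains to quantify the boundaries of generalization.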
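6. Related Work
6.1 Multi-Agent Framework Literature
CrewAI (Moura, 2024) popularized the role-goal-backstory pattern for agent definition. AutoGen (Wu et al., 2023) contributed flexible conversation protocols between hand-defined agents. LangGraph (LangChain, 2024) provides state-machine composition of agent nodes. MetaGPT (Hong et al., 2023) pre-packages a "software company" as a fixed set of hand-authored agents. None of these frameworks has a mechanism for automatically generating the agents themselves from a corpus. Autonomous Swarm Genesis complements these frameworks rather than replacing them: the generated souls could in principle be exported as CrewAI Agent objects or AutoGen ConversableAgent configurations, though we have not implemented such adapters.
6.2 Graphify and Two-Pass Analysis
Graphify (Shamsi, 2026) processes code repositories in two passes: a deterministic AST pass (no LLM) and a semantic pass (LLM subagents running over docs, papers, and images). Graph topology emerges from Leiden community detection rather than embedding similarity. The Genesis Engine's deterministic-plus-semantic structure is directly inspired by Graphify, though our deterministic pass is a filesystem scan rather than an AST parse, and our semantic output is persona definitions rather than a knowledge graph. We share Graphify's philosophical commitment to "let the data define the structure" rather than imposing it from outside.
6.3 NotebookLM-py
The Python project notebooklm-py (Lin, 2025) was the first publicly available unofficial client for Google NotebookLM's internal API. It documented the RPC method IDs, the batchexecute wire format, and the authentication flow, laying the foundation for our Go rewrite. The contribution of our work is not the protocol reverse-engineering, which was done by notebooklm-py, but the reimplementation in pure Go (zero Python dependency) and the embedding of the client into a larger multi-agent pipeline. Our Go client passes the same protocol tests and also handles the GenerateFreeFormStreamed chat endpoint; notebooklm-py supports that endpoint too, but we had to rediscover its parameter positions through live testing.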
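6.4 LocalKin Prior Work
This paper builds on seven earlier LocalKin papers:
- Thin Soul + Fat Skill defines the soul file format that the Genesis Engine emits.
- Grep is All You Need establishes the zero-infrastructure retrieval foundation that generated agents use at runtime.
- Knowledge Compile provides the compiled concepts/FAQ layer that Genesis optionally reads during its deterministic pass for richer samples.
- Domain Expert Debate describes the Conductor-routed debate protocol that the generated swarm participates in.
- Self-Evolving Swarms describes the feedback loop that can continuously refine generated personas based on output quality.
- Heart: Zero-Token Heartbeat provides the pulse/schedule/idle system inherited by generated souls.
- Genesis Protocol describes the analogous self-bootstrapping process for robot agents that forge their own motor-control skills at startup.
Autonomous Swarm Genesis is the knowledge-domain analogue of the Genesis Protocol's hardware-domain approach: both start from minimal input (a hardware manifest in one case, a corpus in the other) and generate the complete configuration needed for operation.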
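6.5 Constitutional AI and Persona Tuning
Anthropic's Constitutional AI (Bai et al., 2022) tunes a base model to follow a hand-authored constitution of principles. Character.AI and similar platforms allow users to define custom personas through text prompts. These approaches produce a single persona per tuning run or prompt. Autonomous Swarm Genesis generates many personas in one pass, with coverage and distinctiveness enforced by the extraction questions and by the "each expert should have a distinct perspective" constraint in the synthesis prompt. The closest precedent is the automatic multi-agent persona generation explored in some academic papers on agent societies, but we are not aware of any production system that generates personas from a corpus and deploys them as a live debate swarm.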
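7. Limitations
We are honest about what Autonomous Swarm Genesis cannot yet do.
1. NotebookLM is a black box. We depend on Google maintaining the current RPC endpoints and the current generous AI Pro limits. Either could change without notice. The Go client contains no fallback transcription or chat provider, though the pipeline is modular enough that one could be added as a drop-in replacement for Stage 1.
2. Transcription quality bounds the output. Low-quality YouTube audio produces low-quality transcriptions, which produce low-quality extracted knowledge, which produces personas that are subtly incorrect. The pipeline has no mechanism for detecting transcription errors and flagging them. A future version should include a confidence pass that inspects the NotebookLM answers for hedging language and flags low-confidence experts for human review.
3. The three extraction questions are hand-chosen. We selected Q1 (authors), Q2 (themes), and Q3 (disagreements) based on intuition and early experimentation. A more systematic approach would evaluate many question sets on a benchmark of domain corpora and pick the combination that maximizes persona quality. We have not done this.
4. No automatic knowledge grounding. Generated personas are written as prompt strings; they do not automatically have a knowledge_search retrieval path configured unless the underlying corpus is also available locally. For pure YouTube inputs, the generated experts know what NotebookLM told them about the corpus but cannot cite it directly at query time. A future version should export the NotebookLM notebook ID and wire the generated experts' runtime skills to query NotebookLM for grounding, closing the loop between generation and operation.
5. No persona deduplication across domains. If the same Madame Guyon appears in both a spiritual domain and a mysticism domain, two independent souls are generated with slightly different wording. A cross-domain persona registry with semantic deduplication would eliminate this redundancy but requires a design decision about whether personas should be domain-local or domain-global.
6. Synthesis LLM determinism is imperfect. Running the same corpus through the synthesis step twice may produce slightly different persona lists (a different number of experts, slightly different wording). For production use, we fix a random seed and cache the first successful output, but the underlying non-determinism is a mild obstacle to reproducibility.
7. No evaluation of debate quality. We verified that generated personas are coherent and domain-appropriate. We have not yet measured whether the resulting swarm produces higher-quality debates than a hand-authored swarm on the same corpus. This is the most important follow-up study.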
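8. Conclusion
Multi-agent systems have spent their first few years assuming that human operators would author the agents. This paper describes a working alternative: given a corpus, the system authors the agents itself, and the corpus-to-persona alignment is tighter than human authorship typically achieves because the personas are literally compiled from the material they represent.
The practical result is that adding a new expert swarm to a multi-agent platform is no longer a design sprint. It is a one-line command:
localkin -source urls.txt -genesis-domain spiritual --auto
When this command finishes, a swarm exists where none existed before. It has the right number of experts, with the right names, the right voices, and the right disagreements, all drawn from the actual content of the source URLs. The total cost is a rounding error against existing infrastructure. The total time is the time it takes to paste a URL.
The deeper lesson is architectural: in a multi-agent system, personas should be artifacts of the data, not inputs to the system. Treating them as inputs forced every prior framework into the scaling wall of hand authorship. Treating them as compiled artifacts removes that wall. Any corpus that can be ingested can become a swarm; any swarm can be regenerated when the corpus changes; any domain can be added by pointing at the right content.
We expect this pattern to be widely adopted. The tooling is not complicated: our entire implementation is approximately 1,500 lines of Go across pkg/notebooklm and pkg/genesis, on top of an existing thin-soul runtime. The hard part was recognizing that NotebookLM's consumer capabilities could be industrialized and that human authorship of personas was an assumption rather than a necessity. Once both ideas are in hand, the rest is plumbing.
Data generates the agents that process it. That is the shift.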
The LocalKin Team builds self-evolving AI agent swarms. More at https://localkin.dev