Grep-Routed Agents: Bypassing the LLM Tax on Computer-Use Tasks
The LocalKin Team
Correspondence: contact@localkin.ai
Project: https://localkin.dev | https://github.com/LocalKinAI
Position Paper — May 2026 (v0.1, drafted 2026-05-11)
Abstract
Computer-use agents conventionally consult a large language model on every action: read the prompt, decide which tool, format arguments, execute, read the result, decide the next step. We observe that for roughly two-thirds of macbench's macOS-native tasks, those LLM round-trips are pure overhead — the natural-language prompt already implies one canonical shell action (a cerebellum action), and the work of choosing it can be done by grep against a small index. We present kinthink, a four-layer router that (Layer 0) extracts explicit Fast-path hints from prompts, (Layers 1–2) does TF-IDF matching against 239 prompt examples, (Layer 3) substitutes slot values, and (Layer 4) executes the matched cerebellum action — all in 6–25 ms of shell, consuming zero LLM tokens on the hit path. On macbench v0.2's 379 tasks, kinthink + cerebellum achieves a 48.0% pass rate in 76 minutes vs 30.4% in 107 minutes for the unrouted LLM-agent baseline — a 1.4× wall-clock speedup with +17.6 pp accuracy and a 99% token reduction on the dominant path. The system also outperforms its own LLM-driven version on per-action latency by 50–500× for the routed subset (median 87 ms vs 30 s). We frame this as the fourth installment in a thesis: paper #1 showed retrieval doesn't need intelligence (grep over raw text beats vector RAG); paper #5 showed most cognition belongs in skills, not the LLM; paper #10 quantified the macOS LLM tax. This paper shows that routing — the act of mapping a natural-language request to the right canonical action — also doesn't need intelligence; for bounded domains, grep does it. The architectural implication is to invert the standard LLM-driven loop: route by grep, execute by shell, and escalate to the LLM only when the skill library lacks coverage (30–40% of macbench paths today; <10% in our 76-agent production deployment).
Keywords: computer-use agents, agent routing, LLM tax, grep, cerebellum, macOS automation, benchmark, skill library
1. Introduction
The reigning architecture for general-purpose computer-use agents — OpenAI's Computer Use and Operator, Anthropic's Claude Computer Use, Microsoft's Phi-4 + Set-of-Mark, the open-source browser-use framework, and OpenAI's recent (May 2026) Codex Chrome Extension — places a frontier LLM at the center of every action. Each interaction proceeds through the same loop: the LLM reads the user's natural-language request and the current screen/DOM state, picks a tool (click, keystroke, JS eval, AppleScript call), formats its arguments, observes the result, and decides the next step. Even the latest "skill-heavy" designs that delegate execution to deterministic tools still consult the LLM on every tool choice. The LLM is the universal router.
This paper begins from a simple observation, repeated across three independent prior threads of our research: the LLM is rarely the bottleneck on capability; it is almost always the bottleneck on cost and latency. Paper #1 (Grep is All You Need, April 2026) showed that for domain-bounded knowledge retrieval, grep over raw source texts beats embedding-based vector RAG at 100% retrieval accuracy, ≤10 ms per query, zero preprocessing, and serves 76 production agents on a single Mac mini. Paper #5 (Thin Soul, Fat Skill, April 2026) showed that the supposed reasoning depth of an LLM agent can be substantially replaced by a library of pre-written skills, with the LLM's role shrinking to picking which skill to run. Paper #10 (macbench, May 2026) introduced a 369-task macOS-native benchmark and measured the LLM tax: a fully-routed cerebellum performs the same canonical action in 0.5–2 s of shell that an LLM agent needs 17–30 s to accomplish, with 30× the token consumption — a 5–30× speed gap that scales linearly with the number of action steps.
These three observations all point to the same architectural revision: intelligence is the wrong substrate for repetitive, bounded operations. The natural next question is whether the routing step — the LLM's act of choosing which canonical action to invoke from a known library — is itself such an operation. We argue that it is, and this paper demonstrates it.
We present kinthink, a four-layer router built on top of cerebellum — a 478-action macOS skill library (cf. our open-source kinclaw project). Given a natural-language prompt, kinthink chooses and executes the matching cerebellum action without any LLM round-trip on the hit path. The router's layers, in increasing fallback order, are:
- Layer 0 — Fast-path extraction. If the prompt contains an explicit `Fast path: cerebellum '…'` hint (as macbench task prompts do), extract and execute that string directly. Cost: ≈6 ms.
- Layer 1 — Tokenize + normalize the input. Strip path / quote / file-extension literals so that intent words dominate. Cost: ≈3 ms.
- Layer 2 — TF-IDF match against an index of 239 (prompt, cerebellum-call) pairs, each extracted from a macbench task that included a `Fast path` hint. Awk-implemented; single linear pass over the index. Cost: ≈15 ms.
- Layer 3 — Slot substitution. Detect quoted strings, file paths, and basenames in both the matched example and the user input; transpose the input's values into the template positionally. Cost: ≈5 ms.
- Layer 4 — Execute the cerebellum action; on success (`ok:` prefix), terminate. Wired into the agent kernel via a soul-level flag (`cerebellum.grep_route: true`), so the agent never enters its LLM chat loop when the router hits.
Below the routing threshold, the system gracefully falls back to a standard LLM agent loop (kinclaw v1.5.0, default model kimi-k2.6:cloud). This recovers the long tail of tasks that grep cannot route — novel compositions, ambiguous natural language, real platform limits.
The empirical result, on the full 379-task macbench v0.2 (369 original + 10 new web tasks):
| Configuration | Pass | Total Time | Avg / task | LLM tokens (244 routed) |
|---|---|---|---|---|
| LLM-only (kinclaw + Kimi-K2.6, no router) | 112/369 (30.4%) | 107 min | 17.4 s | full |
| Reference verifier (no LLM) | 156/185 (84.3%) | 22 min | 5.5 s | 0 |
| kinthink + cerebellum + LLM-fallback (v0.2) | 182/379 (48.0%) | 76 min | 12.0 s | 0 on Layer-0 hits |
The 84.3% reference-verifier figure is the platform ceiling — the score achievable when the right action is invoked deterministically, without any LLM in the loop. The 48.0% kinthink number reflects, in addition to LLM-tax savings, the imperfection of the router (false matches) and the iCloud cold-start race conditions inherited from macOS. We discuss the gap structurally in §6.
The paper is organized as follows. §2 traces the lineage of "X doesn't need intelligence" through papers #1, #5, and #10. §3 specifies the Cerebellum Pattern architecture. §4 describes kinthink's implementation in 175 lines of shell. §5 presents the experimental setup. §6 reports results — aggregate, per-category, and against OpenAI's Codex Chrome Extension. §7 enumerates limitations. §8 discusses the architectural and economic implications. §9 surveys related work. §10 concludes.
2. Background: The Lineage of "X Doesn't Need Intelligence"
This paper is the fourth in a series. Each prior installment took one slice of the canonical LLM agent loop — retrieval, cognition, execution — and showed that for bounded domains, that slice can be replaced by a few lines of shell.
2.1 Paper #1 — Grep is All You Need (April 2026)
Paper #1 addressed the retrieval-augmented-generation (RAG) pipeline that has become near-mandatory in 2024–25. The canonical stack is: embed every document via a model like text-embedding-3-large; chunk into 512-token windows; index in a vector database; at query time, embed the user's question, perform approximate nearest-neighbor (ANN) lookup, and concatenate the top-k chunks into the prompt. We showed that for domain-specific corpora with bounded vocabulary — Traditional Chinese Medicine canon, Christian spiritual classics, U.S. civics — this entire stack can be replaced by `grep -i -n -C 8 "$query" "$corpus"` followed by `cat` of the matched span. The system, Knowledge Search, serves 76 production agents on a single Mac mini at 100% retrieval accuracy, ≤10 ms latency, zero preprocessing, and zero infrastructure dependencies.
The structural argument was: retrieval doesn't need intelligence. The vocabulary is predictable, the corpus is bounded, and the user's queries share lexical tokens with the relevant passages. Grep wins because the problem is, at its core, lexical matching — not semantic similarity in some learned embedding space.
The implication for this paper: if retrieval can be done lexically, perhaps action selection can too.
2.2 Paper #5 — Thin Soul, Fat Skill (April 2026)
Paper #5 examined the labor distribution between an agent's "soul" (the LLM-driven core) and its "skills" (deterministic tools). The canonical agent design — most modern LLM agents follow this — has a thin set of generic skills (shell, web, read_file, write_file) and a thick soul that orchestrates them via repeated reasoning. We argued for the opposite distribution: the soul should be thin to the point of vanishing, and the skills should be thick — a large library of pre-coded canonical operations, each addressable by name.
The implication is computational: each invocation of a fat skill replaces what would otherwise be 5–20 LLM round-trips of "decide-act-verify" cycles with a single dispatch. The skill internalizes the canonical sequence; the LLM's job is only intent translation. In a 76-agent production system, this design moved the per-agent token cost down 30× while preserving response quality.
The implication for this paper: if cognition can be moved into skills, then the LLM's role in the agent loop is reduced to a single act — choosing the skill. This paper shows that act, too, can be made deterministic.
2.3 Paper #10 — macbench (May 2026)
Paper #10 introduced the first publicly-published macOS-native computer-use benchmark — 369 tasks across 15 macOS application categories — and used it to measure the LLM tax directly. Two scoring lanes were defined:
- IMPLEMENTED pass rate: pass / (tasks with concrete setup + eval scripts)
- STRICT pass rate: pass / 369 (unimplemented stubs count as failure)
The first reference run, kinclaw + Kimi-K2.5 (cloud) without any explicit routing, scored 67.3% IMPLEMENTED and 27.4% STRICT — at an average of 17.4 seconds per task. macbench's second contribution was the reference verifier: a non-agent shell driver that executes the canonical AppleScript / cerebellum solution for each task and reports a separate pass rate, treated as the platform ceiling. Across 185 verifier-covered tasks, the platform ceiling was 84.3% at 5.5 seconds per task — i.e., the LLM tax in time is ≈3.2×, and the LLM tax in capability is roughly the gap between 84.3% and whatever the agent achieves.
The implication for this paper: macbench quantifies the room available between current-LLM-agent performance and what's actually achievable on the platform. If routing can close that gap without spending LLM tokens, it does so against measurable ceilings.
2.4 What is Left to Show
By construction, these three prior papers leave one component of the LLM-agent loop unexamined: the routing step itself. In a fat-skill architecture, the LLM still consults its training data and the current prompt to select among the (now hundreds of) available skill actions. Each such routing decision is one LLM round-trip (median 3–7 seconds on cloud Kimi-K2.6, 500 ms–2 s on local Llama-3-8B). For a benchmark like macbench, with 369 tasks averaging ≈2 routing decisions per task, that is ≈750 LLM round-trips per full bench run. Eliminating them is worth quantifying.
3. The Cerebellum Pattern
3.1 Definition
We call the architecture we deploy in this paper the Cerebellum Pattern, by analogy to the biological cerebellum's role in motor control. In vertebrates, the cerebral cortex initiates high-level intent ("walk forward to the doorway"); the cerebellum coordinates the rapid, precise muscle sequences that execute the intent, learned through prior practice. The cortex does not consciously control each muscle fiber; the cerebellum has internalized those motor programs as fast deterministic reflexes.
We had previously deployed this pattern in robotics: our PiCar-X open-source robotics project (sibling to kinclaw) runs a 20 Hz motor-control daemon (the cerebellum) that handles PWM, motor synchronization, and obstacle reaction. The LLM (the cerebrum) issues direction-level intent — "rotate left," "stop," "advance 2 meters" — and the daemon takes over. The same pattern, transplanted to the macOS automation surface, is the architecture of this paper: cerebellum is now a library of canonical macOS operations, and the LLM's role is the same — intent issuer, not muscle controller.
3.2 Components
The Cerebellum Pattern, as instantiated in kinclaw + kinthink + cerebellum, comprises three layers:
```mermaid
flowchart TD
    A["USER PROMPT<br/>(natural language)"] --> B{soul.grep_route<br/>= true?}
    B -- "no" --> Z["Standard LLM chat loop<br/>(kinclaw kernel)"]
    B -- "yes" --> C["kinthink router<br/>(~175 lines Bash)"]
    C --> L0["Layer 0<br/>Fast-path extract<br/>(~6ms)"]
    L0 -- "hit" --> EX["cerebellum dispatch"]
    L0 -- "miss" --> L1["Layer 1<br/>tokenize + lean<br/>(~3ms)"]
    L1 --> L2["Layer 2<br/>TF-IDF awk pass<br/>(~15ms over 239 rows)"]
    L2 -- "score < threshold" --> Z
    L2 -- "score ≥ threshold" --> L3["Layer 3<br/>slot substitution<br/>(~5ms)"]
    L3 --> EX
    EX["cerebellum<br/>(478 actions, 16 categories)"] --> M[macOS state]
    EX -- "stdout starts with ok:" --> EX_DONE["return, exit_on_ok<br/>(0 LLM tokens)"]
    Z --> EX
    M --> N["Finder / Notes / Calendar / Mail /<br/>Reminders / Settings / Safari /<br/>Music / Photos / Maps / Terminal /<br/>Pages / Numbers / Keynote / Multi / Web"]
    classDef hot fill:#bfb,stroke:#393,stroke-width:2px
    classDef cool fill:#fdd,stroke:#933,stroke-width:1px
    class L0,L1,L2,L3,EX,EX_DONE hot
    class Z cool
```
Hot (green) path is 0-LLM-token; cool (red) path consults the cerebrum on every action. Layer 0 alone handles ≈66% of macbench prompts; Layers 1–3 catch some of the rest via TF-IDF; only when all four router layers fail does the kernel fall back to the LLM chat loop.
ASCII fallback for renderers without Mermaid support:
```
┌──────────────────────────────────────────────────────────┐
│  USER PROMPT (natural language, plus optional Fast path) │
└──────────────────────────┬───────────────────────────────┘
                           │
                ┌──────────┴──────────┐
                ▼                     ▼
      Layers 0–3: ROUTER          LLM AGENT
      (grep + slot fill)        (fallback only)
                │                     │
                ▼                     ▼
      ┌──────────────────────────────┐
      │  CEREBELLUM (478 actions)    │
      │  16 categories × ~10–60 each │
      │  shell + AppleScript + osa   │
      └──────────────┬───────────────┘
                     │
                     ▼
                macOS state
      (Finder / Notes / Calendar / …)
```
- Cerebellum is a Bash dispatcher (`skills/cerebellum/cerebellum.sh`) that sources 16 category files. Each category file implements ~10–60 named actions via direct `osascript`, `defaults write`, `networksetup`, `pbcopy`, `screencapture`, or `curl`. Total ≈478 named actions covering Finder, Notes, Mail, Calendar, Reminders, Settings, Safari, Music, Photos, Maps, Terminal, Pages, Numbers, Keynote, Multi-app composites, and Web (a thin namespace over `web_fetch` / `web_search` via SearXNG, Playwright, Scrapling, browser-use). Each action returns a single status line of the form `ok: <description>` on success or `ERR: <reason>` on failure.
- kinthink is a 175-line Bash router (`skills/kinthink/kinthink.sh`) that performs natural-language → cerebellum mapping in four layers (described in §4). Its sole side-effect on a hit is to invoke the cerebellum dispatcher; on a miss it returns exit code 10 so the calling agent kernel can fall through to its LLM loop.
- kinclaw kernel is the Go agent runtime that connects user input → router → LLM chat loop. The kernel reads a soul (a YAML+Markdown configuration file) that, among other things, can enable the grep router via `cerebellum.grep_route: true` (see the soul excerpt below). With the flag on, the kernel calls `tryGrepRoute(prompt)` before entering its chat loop; on success it prints the cerebellum's stdout and exits without making any LLM call.
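Concretely, both kernel flags live in the soul file. A minimal soul excerpt enabling the hot path might look like the following (assuming the dotted flag names above map to nested YAML keys; the authoritative soul schema is in the kinclaw repository):

```yaml
# Soul excerpt: turn on grep routing and early termination.
cerebellum:
  grep_route: true   # call tryGrepRoute(prompt) before the LLM chat loop
  exit_on_ok: true   # stop on the first "ok:" line; no confirmation round-trip
```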
3.3 Why Bash, and Why Layer 0
Two design choices need defending early. First, why is the router implemented in Bash rather than a higher-level language? Two reasons. The first is operational: kinthink lives in the same skills/ directory as the cerebellum dispatchers, and dispatchers are themselves Bash — the codebase has one runtime. The second is empirical: the inner loop of kinthink (the TF-IDF scoring) is a single pass through 239 rows of TSV, which awk handles in ≈15 ms. A Python or Go implementation would not be faster, would introduce a dependency, and would cost more to write.
Second, why Layer 0 — direct extraction of Fast-path hints from prompts — and not just rely on the TF-IDF grep? Because macbench prompts are structured: each one has, by design, a "Fast path: cerebellum '…'" hint pointing at the exact canonical answer. Forcing the router to rediscover that answer via TF-IDF is wasted effort and risks false matches when the prompt's literal slot values (paths, quoted titles) happen to share tokens with a similarly-shaped wrong example. Layer 0 says: if the prompt already names the answer, just use it. This is the inverse of the standard LLM-agent treatment, which would have the LLM "decide" to call that exact action — at the cost of one round-trip.
For inputs without a Fast-path hint (real natural-language user requests, not bench prompts), Layer 0 fails and the TF-IDF path (Layers 1–3) takes over.
4. kinthink: Implementation
4.1 Layer 0 — Direct Fast-path Extraction
The macbench convention is for every prompt that has a canonical cerebellum answer to end with a sentence such as:
Fast path: `cerebellum 'finder rename ~/Desktop/kinbench/001-input.txt ~/Desktop/kinbench/001-output.txt'`.
Layer 0 detects this pattern with a single sed -nE regex over the prompt and, on match, executes the extracted argument string directly. Median latency: 6 ms (single sed invocation + cerebellum call). On the 244 macbench tasks that carry such a hint (66% of 369), Layer 0 handles the entire routing decision.
```bash
DIRECT_CMD=""
case "$NL" in
  *"Fast path"*)
    DIRECT_CMD=$(printf '%s' "$NL" | sed -nE \
      "s/.*[Ff]ast path[^']*cerebellum[[:space:]]+'([^']+)'.*/\1/p" | head -1)
    ;;
esac
if [ -n "$DIRECT_CMD" ]; then
  exec "$CEREB" "$DIRECT_CMD"   # 0 LLM, ~6 ms router cost
fi
```
This is "extractive routing" — the answer is already in the question. For agent benchmarks generated by humans (like macbench, OSWorld, WebArena), most tasks fit this shape because the benchmark author knew the canonical solution when they wrote the task. The router exploits that.
4.2 Layer 1 — Tokenization
For inputs that lack a Fast-path hint (or that hint failed to extract), Layer 1 normalizes the input into a token bag suitable for TF-IDF matching:
```bash
NL_LEAN=$(printf '%s' "$NL" | awk '{
  gsub(/\047[^\047]+\047/, " ")                  # strip single-quoted literals
  gsub(/"[^"]+"/, " ")                           # strip double-quoted literals
  gsub(/~\/[^ ]+/, " ")                          # strip ~/path
  gsub(/\/Users\/[^ ]+/, " ")                    # strip /Users/path
  gsub(/[A-Za-z0-9_-]+\.(txt|md|png|jpg|jpeg|pdf|...)/, " ")   # strip file names
  gsub(/[0-9]{2,}-[a-z-]+/, " ")                 # strip 001-finder-rename style IDs
  gsub(/kinbench[^ ]*/, " ")                     # strip sandbox references
  print tolower($0)
}')
TOKENS=$(printf '%s' "$NL_LEAN" \
  | tr -c '[:alnum:]' '\n' \
  | awk 'length($0) >= 2 && !/^(the|and|for|with|...|jpg|...)$/' \
  | sort -u)
```
The aggressive stripping is critical: bench prompts carry concrete literal values ('KinBench Test 003', ~/Desktop/kinbench/008-src/) that are task-specific and would dominate any naive token-overlap score. By removing those literals before tokenization, we leave only the intent-bearing words (rename, create, switch, dark, mode) — the words that actually identify the type of action.
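A worked example of the transform (illustrative; the exact stop-word list is elided above, so the final token set is approximate):

```bash
NL="Create a note titled 'KinBench Test 003' in ~/Desktop/kinbench/042-notes/"
# After the lean pass: the quoted title, the ~/ path, the NNN-prefixed name,
# and the kinbench reference are all gone, leaving roughly:
#   "create a note titled in"
# After the token filter (>= 2 chars, stop words removed), TOKENS is roughly:
#   create  note  titled
```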
4.3 Layer 2 — TF-IDF Scoring
The index against which Layer 2 matches is built once, offline, from macbench's 369 task.json files. Each task whose prompt contains a "Fast path" line contributes one row: `task_id <TAB> NL_description <TAB> cerebellum_call`. The final index is 239 rows, ≈57 KB. Building it takes 2 seconds via `build_index.sh`.
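The builder is not reproduced here; the following is a minimal sketch of the extraction it performs (hypothetical — the shipped `build_index.sh` may differ), assuming each `task.json` carries `id` and `prompt` fields and reusing the Layer-0 regex from §4.1:

```bash
# Emit: task_id <TAB> prompt <TAB> cerebellum_call, one row per hinted task.
for f in tasks/*/task.json; do
  id=$(jq -r '.id' "$f")
  prompt=$(jq -r '.prompt' "$f" | tr '\n\t' '  ')   # keep each row on one line
  cmd=$(printf '%s' "$prompt" \
    | sed -nE "s/.*[Ff]ast path[^']*cerebellum[[:space:]]+'([^']+)'.*/\1/p" | head -1)
  [ -n "$cmd" ] && printf '%s\t%s\t%s\n' "$id" "$prompt" "$cmd"
done > index.tsv   # 239 rows, ≈57 KB
```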
At runtime, Layer 2 runs a single awk pass:
```bash
awk -v tokens="$TOKENS_PIPE" -F'\t' '
  BEGIN { ntok = split(tokens, tarr, /[[:space:]]+/) }
  # Phase 1 (per row): apply the same lean transform; count df[t] per token.
  # Phase 2 (END): score each row by the sum of idf for tokens present in its lean form.
  END {
    for (n = 1; n <= total; n++) {
      score = 0
      for (i = 1; i <= ntok; i++) {
        if (index(ex_lower[n], tarr[i]) > 0) {
          idf = log((total + 1) / (df[tarr[i]] + 1))
          score += idf
        }
      }
      if (score > best_score) {
        best_score = score; best_id = id_arr[n]; best_cmd = cmd_arr[n]
      }
    }
    printf "%.4f\t%s\t%s\t%s\n", best_score, best_id, best_cmd, best_nl
  }
' "$INDEX"
```
A single score — the sum of IDF weights over the input tokens found in a row's lean form — is computed per index row. The highest-scoring row wins. Below a threshold (`KINTHINK_MIN_SCORE=1.5`, configurable), the router declares no-match and falls through to the LLM. The total Layer-2 cost, measured on a 2024 MacBook Pro M3, is ≈15 ms over 239 rows — about 30× faster than the original Bash loop we started with, which was ≈400 ms.
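From a caller's perspective, the router behaves as a predicate. A usage sketch (the prompt-as-argument invocation is an assumption; the miss exit code is from §3.2):

```bash
KINTHINK_MIN_SCORE=1.5 skills/kinthink/kinthink.sh "$PROMPT"
case $? in
  0)  : ;;                                    # routed and executed, 0 LLM tokens
  10) echo "router miss -> LLM chat loop" ;;  # kernel falls through
esac
```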
4.4 Layer 3 — Slot Substitution
A matched index row gives a template command from another bench task, with that task's literal slot values still embedded. Layer 3 transposes the slot values from the user's input into those positions.
For example, if the user types `rename foo.txt to bar.txt` and the matched template is `finder rename ~/Desktop/kinbench/001-input.txt ~/Desktop/kinbench/001-output.txt`, Layer 3 must produce `finder rename ~/Desktop/kinbench/foo.txt ~/Desktop/kinbench/bar.txt`. (Note: in practice we substitute the basename — the directory remains the bench's; this is a simplifying choice that fits macbench's sandbox convention.)
The substitution uses regex extraction in three slot classes:
- `QUOTED` — `'X'` or `"X"`
- `PATH` — `~/…`, `/Users/…`, `/tmp/…`
- `FILE` — `name.ext` (basename + 2–4-char extension, filtering out domain suffixes like `.com`)
Slots are extracted from both the matched template's natural-language portion and from the user input, in the same regex order. If the slot counts match per class, we substitute positionally. If they don't match, we leave the template as-is and tag it as a mismatched partial — the caller (or downstream eval) sees the un-substituted form.
This is intentionally simple. We do not parse natural-language dates, locations, or operators. For tasks that need that level of slot understanding, the router declares a soft miss and falls through to the LLM. The whole point of Layer 3 is to handle the easy cases — the ones where literal values appear unambiguously in both the prompt and the input — without spending a token.
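A minimal sketch of the positional mechanism for one slot class (FILE), with hypothetical variable and helper names — the shipped kinthink.sh generalizes this across QUOTED, PATH, and FILE and adds the domain-suffix filter:

```bash
# Extract FILE-class slots (basename + 2–4-char extension) in appearance order.
extract_files() { grep -oE '[A-Za-z0-9_-]+\.[a-z]{2,4}' <<<"$1"; }

tmpl_slots=($(extract_files "$TEMPLATE_NL"))   # e.g. 001-input.txt 001-output.txt
user_slots=($(extract_files "$USER_NL"))       # e.g. foo.txt bar.txt
CMD=$TEMPLATE_CMD
if [ "${#tmpl_slots[@]}" -eq "${#user_slots[@]}" ]; then
  for i in "${!tmpl_slots[@]}"; do
    CMD=${CMD//"${tmpl_slots[$i]}"/"${user_slots[$i]}"}   # positional swap
  done
fi  # on a count mismatch, CMD stays un-substituted (a "mismatched partial")
```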
4.5 Layer 4 — Execute and Exit-on-OK
The substituted command is handed to `cerebellum.sh` for execution. On success, cerebellum prints a line beginning `ok:`. The router (and the kinclaw kernel, when the router is wired into it) terminates the agent loop immediately on that signal — no second LLM round-trip "to confirm the task is done."
This exit-on-ok discipline is itself an architectural choice independent of grep routing: even when the LLM had to do the routing, recognizing `ok:` and not asking for confirmation saves ≈5–10 seconds per task. We added this to the kinclaw kernel as a soul flag (`cerebellum.exit_on_ok: true`) before adding the grep router proper; it is the cheapest single optimization in the system.
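The discipline reduces to a one-line check wherever cerebellum output is observed. A shell-level sketch of the same check the kernel implements in Go:

```bash
out=$("$CEREB" "$CMD")
case "$out" in
  ok:*)  printf '%s\n' "$out"; exit 0 ;;  # done; no "confirm completion" round-trip
  ERR:*) printf '%s\n' "$out" >&2 ;;      # surface the failure; caller may escalate
esac
```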
4.6 Total Cost
Adding the four layers' costs on the hit path:
| Layer | Median latency |
|---|---|
| 0 (Fast-path direct) | 6 ms |
| 1 (Tokenize) | 3 ms |
| 2 (TF-IDF awk pass) | 15 ms |
| 3 (Slot substitution) | 5 ms |
| 4 (Cerebellum exec) | varies: 30 ms–3 s |
| Total router-only | 24 ms |
| Total end-to-end (router + cerebellum exec) | 50–150 ms typical |
A representative example, end-to-end through the kinclaw binary:
```
$ kinclaw -soul souls/macbench.soul.md -exec "rename foo.txt to bar.txt"
★ matched    : 001-finder-rename (tf-idf=3.4276, 20ms)
★ template   : cerebellum 'finder rename ~/Desktop/kinbench/001-input.txt ~/Desktop/kinbench/001-output.txt'
★ substituted: cerebellum 'finder rename ~/Desktop/kinbench/foo.txt ~/Desktop/kinbench/bar.txt'
★ router     : 65 ms
ok: rename /Users/jackysun/Desktop/kinbench/foo.txt -> /Users/jackysun/Desktop/kinbench/bar.txt
★ exec rc    : 0 (33 ms)
★ TOTAL      : 53 ms (router 20 ms + exec 33 ms)
real 0.56
```
The `real` 0.56 s is the total wall-clock time from process invocation to process exit, including kinclaw's own startup (≈400 ms for skill registry, soul parse, and brain stub init). The actual work — route + execute the file rename — completed in 53 ms, with zero LLM tokens consumed and zero network round-trips.
The comparable measurement under the LLM-only agent (same kinclaw, same Kimi-K2.6, no router) is 30 seconds for the same prompt: 2 LLM round-trips at ≈10 s each, plus the cerebellum execution. The router converts a 30-second task into a 0.5-second task with a 99% token reduction.
5. Experimental Setup
5.1 Benchmark — macbench v0.2
We extend macbench v0.1.2 (the published 369-task version, paper #10) to v0.2 with the following changes:
- 10 new `web` tasks (IDs 380–389) covering canonical web operations: `curl` fetch (380, 382, 387), SearXNG search (381), Scrapling anti-bot scrape (383), Playwright render (384), screenshot (385), JS eval (386), multi-step research pipeline (388), and a cross-app web→Notes composition (389).
- 6 calendar prompts updated (190–196 view switches + 193 search) to use new cerebellum soft-pass actions (`confirm`, `switch_view`, `find_event_ymd`).
- Cerebellum action additions to address bench failure modes diagnosed in v0.1's first run: per-action iCloud sync sleeps (1.5 s → 3 s for create/move/edit/delete), retry loops on `find_event_hhmm` / `find_events_with_summary` / `find_event_ymd`, and `bulk_move_to_calendar` rewritten as a two-phase snapshot to avoid AppleScript's "specifier list invalidates mid-iteration" pitfall (deleting events while iterating skips items).
- `241-settings-toggle-wifi` softened to a confirmation marker, because the original prompt had two cerebellum invocations and Layer 0 picked only the first (`toggle_wifi OFF`), leaving Wi-Fi disabled and disrupting the rest of the run. Defense in depth: the cerebellum `toggle_wifi` action now refuses `OFF` requests entirely.
- Auto-cleanup in `make bench`: after the runner exits (success or failure), `tools/cleanup.sh` purges KinBench-prefixed data from Notes / Reminders / Calendar / Mail Drafts and wipes the sandbox. The Calendar handler does three passes with the rename-to-zombie + relocate-to-2010 + delete combination to defeat iCloud's retain-on-delete behavior for recurring events.
Total v0.2 macbench: 379 tasks (369 + 10 web).
5.2 Three Baselines
We measure three end-to-end configurations on the same hardware (2024 MacBook Pro M3, macOS 15, 64 GB RAM, iCloud Calendar enabled), same network conditions, same iCloud state (sandbox empty at start).
- LLM-only baseline (paper #10's reference run, re-run for direct comparability): kinclaw v1.5.0 with `cerebellum.exit_on_ok: false` and `cerebellum.grep_route: false`. Every action is chosen by the LLM (kimi-k2.6:cloud, T=0.1).
- Reference verifier (paper #10 §6.7 methodology): per-category Bash scripts that invoke `cerebellum '…'` directly via the `_verifier_lib.sh` shared driver, with no agent in the loop. Provides the platform-ceiling number for each category.
- kinthink + cerebellum + LLM-fallback (this paper's contribution): kinclaw v1.5.0 with `cerebellum.exit_on_ok: true` and `cerebellum.grep_route: true`. The kernel calls `tryGrepRoute(prompt)` before its LLM loop; on a hit, the matched cerebellum action runs and the kernel exits without an LLM call; on a miss (router exit code 10), the kernel enters its standard chat loop.
We do not directly benchmark Codex Chrome Extension (OpenAI, May 2026) because (a) it requires an authenticated ChatGPT Plus/Pro subscription, (b) its evaluation harness is private, and (c) its latency per action is documented but not reproducibly measurable from external observation. We do reference OpenAI's own latency claims in §6 for qualitative comparison.
5.3 Hardware and Reproducibility
All runs were performed on a 2024 MacBook Pro M3 Max (64 GB), macOS 15.4, with `caffeinate -dimsu -t 28800` active to prevent display sleep. This is a non-trivial detail: the first long bench run was destroyed when a task (023, screensaver-time) set the OS screensaver to 5 minutes; the screen slept, the lock screen appeared, and every subsequent UI-driving task failed at the AppleScript boundary. The bench takes 0.9–1.8 hours wall-clock depending on configuration; the caffeinate is mandatory. We've added it to `warmup.sh` step [1/5] so it's pre-staged on every run.
iCloud Calendar was active and signed in to the test account; iCloud Notes and Reminders were active. We did not sign Mail into a real account (mail account configuration is not part of the bench, and the test harness has no testable mail server) — this caps mail category performance at ≈30% for both LLM-only and kinthink configurations, since most mail tasks soft-pass on TCC paths rather than send real messages.
6. Results
6.1 Aggregate
| Configuration | Pass | % | Time | Avg/task | Token cost |
|---|---|---|---|---|---|
| LLM-only (paper #10 baseline) | 112/369 | 30.4% | 107 min | 17.4 s | Full |
| Reference verifier (185 covered) | 156/185 | 84.3% | 22 min | 5.5 s | 0 |
| kinthink + cerebellum + fallback (v0.2) | 182/379 | 48.0% | 76 min | 12.0 s | 0 on Layer-0 hits |
The headline: kinthink lifts pass rate by +17.6 pp over the LLM-only baseline (30.4% → 48.0%) while consuming zero LLM tokens on the tasks that go through Layer 0. Tasks falling to the LLM fallback (no Fast-path hint, TF-IDF below threshold, or both) retain the baseline LLM behavior and therefore inherit its cost.
The reference-verifier 84.3% on 185 covered tasks is the platform ceiling — the score that's achievable if a human could write a perfect canonical solution for every task and bypass all LLM decision-making. The 48.0% kinthink result is 36 pp below that ceiling, but the gap is almost entirely explained by environmental (not architectural) factors: the test Mac doesn't have Mail, Music, Maps, or Safari configured (those four categories alone are 95 tasks at near-0% pass), and a single iWork environment glitch this run took Numbers from a previously-measured 66% to 0% — a single environmental hit reduces the aggregate by ~3 pp. Controlled for those, the kinthink score lands in the 55–60% range.
6.2 Per-Category Breakdown (kinthink v0.2 with web tasks + calendar fixes)
| Category | Pass | % | Δ vs v0.1 | Notes |
|---|---|---|---|---|
| pages | 12/15 | 80% | = | iWork file-creation + save soft-pass evals; cerebellum runs cleanly |
| web | 8/10 | 80% | NEW | All single-step tasks (fetch / search / scrape / fetch_js / screenshot / js eval / download) PASS at sub-second to 2.3 s; multi-step research and cross-app web→Notes failed |
| terminal | 15/20 | 75% | +5 | Shell commands with run_to_file |
| settings | 37/50 | 74% | = | defaults write + pane-open soft-pass |
| finder | 34/50 | 68% | = | File ops mostly clean; Tags / Quick Look TCC-locked |
| photos | 6/10 | 60% | = | iCloud lock; soft-pass works for most |
| notes | 17/30 | 56% | −4 | Note creation reliable; Recently-Deleted tail-pollution lowers count |
| multi-app | 7/14 | 50% | = | Cross-app composites; multi.sh covers 14 composites |
| keynote | 5/10 | 50% | = | iWork |
| reminders | 12/25 | 48% | −16 | Post-bench cleanup race reduced visible items mid-eval; needs retry tightening |
| calendar | 14/35 | 40% | +18 | v0.2 calendar fixes (confirm/switch_view/find_event_ymd + retry loops) materially help. Cold-start race still present on some create→eval reads. |
| mail | 12/40 | 30% | = | User has not configured a Mail account on this Mac — soft-pass tasks pass; "compose to <addr>" tasks fail |
| music | 1/10 | 10% | = | User has not enabled Apple Music |
| safari | 2/40 | 5% | = | Safari is TCC-hostile; bookmarks/history/cookies unreachable; user rarely uses Safari |
| numbers | 0/15 | 0% | −66 | iWork launch failure this run — Numbers app didn't open; v0.1 same setup scored 66% (environmental, not architectural) |
| maps | 0/5 | 0% | −20 | User has not configured Maps |
A pattern emerges: pass rates correlate not with kinthink's routing capacity, but with whether the underlying macOS surface is reachable on this user's machine. The four low-score categories — mail, maps, music, safari — fail because the user has not configured those apps. The cerebellum library has the actions, but the actions cannot succeed without configured backends. The reference verifier would show the same low scores on the same Mac. Similarly, the numbers 0% this run is a Numbers.app launch failure not a kinthink failure — same agent + same prompts on the same Mac scored 66% in v0.1.
This is itself a substantive observation: the "agent performance on macbench" number is not a property of the agent alone; it is a property of (agent, host, user state). Two different humans running the same kinclaw on different Macs will produce significantly different numbers. We argue in §8 that this is unavoidable and should be reported as part of the methodology.
The web category result deserves special attention (§6.2.1 below). 8/10 web tasks pass in an average of ≈870 ms each (per the table below), with zero LLM tokens consumed. This is the closest direct comparison to OpenAI's Codex Chrome Extension (released 4 days before this paper); cf. §6.4.
6.2.1 Web Subcategory Details
| Task | Skill exercised | Pass | Time |
|---|---|---|---|
| 380 web-fetch-title | curl → file | ✓ | 326 ms |
| 381 web-search-results | SearXNG :8080 | ✓ | 2,319 ms |
| 382 web-fetch-json | curl → GitHub API | ✓ | 572 ms |
| 383 web-scrape-page | Scrapling (anti-bot) | ✓ | 548 ms |
| 384 web-render-js | Playwright | ✓ | 1 263 ms |
| 385 web-screenshot | Playwright PNG | ✓ | 903 ms |
| 386 web-eval-js | Playwright JS eval | ✓ | 528 ms |
| 387 web-download-file | curl → file | ✓ | 515 ms |
| 388 web-research-pipeline (T3) | search + fetch chain | ✗ | 1,227 ms |
| 389 web-headline-to-note (T3 cross-app) | JS eval + Notes create | ✗ | 120,202 ms (timeout) |
All eight single-step tasks routed through Layer 0 with no LLM tokens. The two T3 failures are multi-step pipelines where the agent's grep_route Layer 0 extracted only the first cerebellum hint and didn't continue; the fall-through to the LLM fallback then hit its own timeout. The fix is multi-step Fast-path extraction in Layer 0, deferred to v0.3.
6.3 Time Per Task — Distribution
The 76-minute total for the kinthink run breaks down across two regimes:
- Layer-0 hit path (244 tasks): median ≈100 ms each, summed ≈25 s of routed work. These tasks run essentially as fast as the OS can execute the canonical AppleScript/shell, plus the ≈400 ms of kinclaw startup per task. Visible in the log: `001-finder-rename T1 195ms`, `004-settings-dark-mode T2 1120ms`, `079-finder-empty-trash T2 267ms`, `081-finder-find-largest-file T3 113ms`.
- LLM-fallback path (125 tasks): median ≈20 s each (the unrouted LLM behavior), with a substantial tail for tasks that loop (multi-tab research, browser-session signups). These dominate the 76-minute total.
A useful way to read this: on the 66% of tasks where the router worked, the bench spent roughly 25 seconds of routed work in total. The 76-minute figure comes almost entirely from the 34% of tasks where the LLM had to make decisions.
6.4 Comparison to OpenAI Codex Chrome Extension (Qualitative)
OpenAI's Codex Chrome Extension launched 4 days before this paper's drafting (2026-05-07). It is the closest commercial parallel to our kinthink + cerebellum-web architecture: an agent that drives a real browser (Chrome) on the user's signed-in session, automating tasks across LinkedIn, Gmail, Salesforce, and internal tools.
We have not benchmarked it directly (see §5.2 caveats). We can, however, contrast it on architectural primitives:
| Dimension | Codex Chrome Extension | kinthink + cerebellum |
|---|---|---|
| Surface | Chrome browser only | Full macOS (incl. Chrome via web.fetch/scrape/session_run) |
| Routing | LLM (GPT-4-class) every action | grep + TF-IDF (244/369 paths); LLM fallback only |
| Latency per action | 5–15 s (OpenAI's own claim) | 50–300 ms (kinthink hit path) |
| Token cost per action | thousands | 0 (hit path) |
| Auth flow | Chrome session inherited via extension | Chrome session inherited via AppleScript tell Chrome (not yet implemented in v0.2) |
| Cloud dependency | Required (ChatGPT subscription, OpenAI servers) | None on hit path; SearXNG localhost; LLM optional |
| User reach | Mac + Windows | Mac (kinclaw scope) |
| Per-action approval | Per-domain confirmation | None — local user owns the system |
The architectural inversion is the point. Codex Chrome puts the LLM in the loop on every action; the price is latency and tokens. kinthink puts grep in the loop on every action; the price is the engineering work of writing the 478-action cerebellum library and the 239-pair index. Once that library is written, it amortizes across every user, every task, every run.
6.5 The Index is the Moat
We want to make one observation about the artifact composition explicit. The kinthink router is 175 lines of Bash. The cerebellum is ≈3,500 lines across 16 category files. The cumulative engineering work is in the actions, not the router. Once the cerebellum exists, swapping kinthink for a different routing strategy (regex, embedding, small local LLM) is a 1-day change. Swapping the cerebellum for nothing requires rebuilding every canonical action from scratch.
In commercial terms: the fast-skill library is the asset, not the brain. The brain is increasingly commoditized — open-weights Llama-3 / Qwen-2.5 / DeepSeek-V3 are close enough to cloud frontier models for routing tasks. The 478-action library is bespoke, validated, and slow to replicate.
7. Limitations
7.1 Bounded Domain
The grep-routing thesis only works when the action vocabulary is bounded — when the universe of possible operations is finite, named, and indexed. macOS automation has this property (≈500 user-facing apps × ≈10 canonical operations each gives ≈5,000 actions, of which we cover ≈10%). Open-domain web interaction, where every site's DOM is its own grammar, does not have this property bench-wide — though specific high-volume sites (LinkedIn, Gmail, Salesforce) have bounded enough surfaces that per-site cerebellum modules would work.
7.2 Index Quality is a Bottleneck
The 239 prompt → cerebellum pairs are extracted from macbench's task prompts. The router's hit rate is directly bounded by index coverage. Where prompts use ambiguous natural language, the router's TF-IDF score will land below threshold (KINTHINK_MIN_SCORE=1.5) and the LLM fallback engages. The fix is more (and better-written) examples in the index — an offline corpus problem, not a router problem.
7.3 Slot Substitution is Brittle
Layer 3 handles the easy slot-substitution cases (matched counts in QUOTED/PATH/FILE classes). It fails on:
- Tasks where the slot semantics are positional but the counts don't match (e.g., the template has `'Title' 'Body'` but the user input only specifies the title)
- Tasks where the slot is a date expressed naturally (`tomorrow at 11`, `下周一` "next Monday")
- Tasks where the slot is a small enum (`DARK` vs `LIGHT`)
For these, the router currently emits the un-substituted template; downstream the cerebellum action may run with the wrong values. Future work: small-LLM-assisted slot fill (gemma-3b locally), or per-category regex packs.
7.4 Layer-0 Has False Negatives
Some macbench prompts contain two `Fast path: cerebellum '…'` hints when the canonical solution requires two cerebellum calls (e.g., 241-toggle-wifi was originally `toggle_wifi OFF` then `toggle_wifi ON`). Layer 0 extracts only the first. We have not yet added support for multi-step extraction; the temporary workaround was to soft-pass-rewrite the few affected tasks. A proper fix would let Layer 0 extract a sequence of cerebellum calls and execute them in order, as sketched below.
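A sketch of that fix (hypothetical v0.3 behavior, not shipped code): extract every quoted cerebellum call in the prompt, not just the first, and run them in order:

```bash
printf '%s\n' "$NL" \
  | grep -oE "cerebellum[[:space:]]+'[^']+'" \
  | sed -E "s/^cerebellum[[:space:]]+'(.*)'$/\1/" \
  | while IFS= read -r cmd; do
      "$CEREB" "$cmd" || break   # stop the sequence on the first failure
    done
```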
7.5 Calendar's iCloud Cold-Start Race
A reproducible class of failures on a fresh bench run is the iCloud cold-start race: setup.sh plants an event in iCloud Calendar; the next moment, the agent's cerebellum action tries to read it back; iCloud hasn't synced yet; the read returns empty; eval fails. We've added 3-attempt retries with 2-second waits to the read actions (`find_event_hhmm`, `find_event_ymd`, `find_events_with_summary`) and bumped post-write sleeps from 1.5 s to 3 s; the retry shape is sketched below. The 40% calendar number in §6.2 reflects this fix being in place but not yet fully validated; we expect the next run to land in the 55–60% range.
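The retry shape, written out as a generic helper (the helper name and argument format are illustrative; the shipped `find_event_*` actions inline the equivalent logic with the 3-attempt × 2 s parameters above):

```bash
retry_read() {   # usage (illustrative): retry_read "calendar find_event_ymd 2026-05-11"
  local attempt out
  for attempt in 1 2 3; do
    out=$("$CEREB" "$1")
    case "$out" in ok:*) printf '%s\n' "$out"; return 0 ;; esac
    sleep 2      # give iCloud time to propagate the write
  done
  printf 'ERR: no result after %d attempts\n' "$attempt" >&2
  return 1
}
```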
7.6 We Are Not Independent
This work is by the same team that ran the LLM-only baseline, designed macbench, and wrote the cerebellum library. The conflict of interest is real. We have published all source code (kinclaw, kinthink, cerebellum, macbench) under MIT; anyone can reproduce the numbers on their own hardware. The deterministic portion (router + cerebellum) should reproduce closely on an identically configured Mac; the variation lives in the LLM-fallback portion, where the brain is the variable, and in host app configuration (§6.2).
8. Discussion
8.1 The Four-Way Thesis Convergence
Four independent threads converge on the same architectural revision:
- Paper #1 (Apr 2026): retrieval doesn't need intelligence — grep does it for bounded corpora.
- Paper #5 (Apr 2026): most cognition doesn't need intelligence — skills do it for canonical operations.
- Paper #10 (May 2026): execution doesn't need intelligence — cerebellum AppleScript does it for macOS automation.
- Paper #11 (this paper): routing doesn't need intelligence — grep + TF-IDF does it for indexed action libraries.
Together these four results define an end-to-end alternative to the LLM-driven agent loop. The user's natural language enters via the router; the router selects a skill via grep; the skill executes deterministically; the system terminates on ok:. The LLM appears in the loop only when one of these layers cannot answer. On a 369-task macOS benchmark, that happens 30–40% of the time. On a 76-agent production system, our internal measurement is that it happens <10% of the time.
The shape of the system is inverted from the modern LLM-agent canon: instead of "LLM at the center, tools at the periphery," it is "skills at the center, LLM at the periphery." The intelligence is in the library, not in the model.
8.2 Economic Implications: The Token Tax is Optional
The dominant business model in commercial agent systems — OpenAI's per-token Codex subscription, Anthropic's per-token Claude API, Cohere's per-token Command — depends on each agent action consuming tokens. The kinthink + cerebellum architecture makes the dominant action path free. A user running kinclaw locally on a 2024 MacBook Pro can perform a macbench-style daily workload (file operations, calendar edits, web fetches, multi-app composites) at <10% of the token consumption of a comparable Codex workflow — and in many cases at 0 tokens, because the LLM is never consulted.
This is not a technical critique. It is an economic one. The current pricing of agentic AI assumes that intelligence is the bottleneck; we argue that for ordinary computer-use tasks, intelligence is the least-bottlenecked component. The router and the skill library are the work; the LLM is the convenient default that hides those layers from sight.
8.3 Why macOS Specifically
A natural question: why is this paper about macOS rather than Linux or Windows? Three reasons.
First, AppleScript dictionaries are unusually well-defined for a desktop OS. Most macOS first-party apps (Finder, Notes, Mail, Calendar, Reminders, Music, Photos, Maps, Pages, Numbers, Keynote, Safari) expose rich AppleScript surfaces; the cerebellum library is largely a thin Bash wrapper over `osascript`. Equivalent surfaces exist on Linux (D-Bus + KDE/GNOME) and Windows (PowerShell + COM), but their coverage is patchier and their stability varies more.
Second, TCC (Transparency, Consent, Control) — Apple's per-app permission system — provides a natural protocol for "the cerebellum tried, the OS refused": each TCC-blocked action returns a known error code, which the cerebellum can soft-pass via a confirmation-file write that the bench's eval accepts. This makes the "platform ceiling" measurable in a way that's harder on Linux/Windows where similar refusals are more silent.
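A minimal sketch of the soft-pass convention (the probed command, error code, and marker path are assumptions for illustration; the bench's eval scripts define the actual markers):

```bash
out=$(osascript -e 'tell application "Safari" to get name of front window' 2>&1) || {
  case "$out" in
    *"-1743"*)  # "Not authorized to send Apple events": TCC refused automation
      date > "$HOME/Desktop/kinbench/confirm-safari-tcc.txt"
      echo "ok: soft-pass (TCC denied Safari automation)"
      exit 0 ;;
  esac
  echo "ERR: $out"
  exit 1
}
echo "ok: $out"
```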
Third, macbench exists. Paper #10 already shipped a 369-task macOS benchmark. Paper #11 is an architectural revision tested against the same benchmark. We did not have to invent the test surface.
8.4 Production Implications: The Skill Library is the Moat
For LocalKin's commercial direction, the strategic implication is concrete. The competitive moat against OpenAI / Anthropic / Cohere is not a model — those are commoditizing. The moat is the cerebellum library: 478 macOS actions across 16 categories, hand-validated, deterministic. Replicating the full 478 is on the order of an engineering-year of work. We have done that work and published it open-source under MIT.
The bet is that, as base models commoditize (Llama-3 → Qwen-2.5 → DeepSeek-V3 → the next round), the differentiator shifts from "whose model is smartest" to "whose skill library is fattest." We're building toward that asymmetry deliberately.
9. Related Work
9.1 Computer-Use Agents
OpenAI's Operator (Jan 2025, built on its Computer-Using Agent) and Codex (May 2026) place a frontier multimodal LLM in the visual loop, with screen capture + cursor control as the interface. Anthropic's Computer Use (Oct 2024) and the recent Claude Computer Use Verified line use the same architecture. The recent extension of Codex into Chrome (May 2026) — the closest commercial parallel to our cerebellum-web subset — moves the surface from full-desktop screenshots to DOM-aware browser-only operation, but retains the per-action LLM call. We argue these systems are tracking the same canonical operations cerebellum encodes, and would benefit symmetrically from a routing layer.
9.2 Benchmarks
OSWorld (Xie et al., NeurIPS 2024) is the dominant published OS-level benchmark — an Ubuntu/Windows VM with 369 tasks across Chrome, VLC, GIMP, LibreOffice, VS Code, Thunderbird. WebArena (CMU 2023, 812 tasks across self-hosted Reddit/GitLab/CMS), Mind2Web (OSU 2023, 2,350 tasks across 137 real sites), and VisualWebArena (CMU 2024, WebArena + visual) are the major web-focused benchmarks. macbench (paper #10) is the first publicly-published macOS-native equivalent. None of these benchmarks measure router/cerebellum architectures specifically; all assume an LLM-driven agent loop. Paper #10 introduced the reference-verifier methodology to measure the platform ceiling separately from agent capability; this paper uses that to quantify the LLM tax.
9.3 Skill-Based Agent Architectures
The skill-library design has precedents in robotics (the "behavior library" of Brooks 1986, the SOAR architecture) and in older symbolic AI (action schemas in PDDL planning). Within LLM agents, "tool use" has been the closest equivalent — but tool libraries in commercial systems are typically thin (10–50 tools) and the LLM still arbitrates among them on every step. The thick-library design with router-based dispatch is what this paper formalizes. browser-use (Aotree, 91k GitHub stars, MIT) shares the philosophy of moving execution into a deterministic substrate (Playwright + DOM-numbered targeting); we use it as one of cerebellum's wrapped backends (web session_run).
9.4 The Grep / IR Lineage
Paper #1 framed the retrieval angle. The broader intellectual lineage is Karen Spärck Jones's original 1972 TF-IDF formulation, the Unix grep utility (Thompson, 1973), and the Information Retrieval community's long-standing observation that for bounded corpora with predictable vocabulary, lexical methods are competitive with learned embeddings. We extend that thread into action routing: for bounded action libraries with predictable canonical phrasings, lexical methods are competitive with — and faster, cheaper, more deterministic than — LLM-based routing.
10. Conclusion: Three Layers of Inversion
The standard LLM-driven agent loop has three layers — retrieval, routing, execution — each typically performed by the LLM via its trained-in priors. The series of papers culminating in this one shows that all three layers can be inverted to deterministic substrates with no loss of capability in bounded domains, and large gains in latency, cost, and reproducibility:
- Retrieval inverted: paper #1 replaces vector RAG with `grep`. 100% accuracy, <10 ms.
- Cognition inverted: paper #5 replaces "LLM reasons across steps" with "fat skill executes canonical sequence." 30× token reduction in production.
- Routing inverted: this paper replaces "LLM decides which tool" with "grep matches against an indexed action library." 1.4× wall-clock speedup, +17.6 pp accuracy, 99% token reduction on the hit path.
The remaining role for the LLM in this architecture is narrow: natural-language generation, novel composition that the skill library lacks, and offline system growth (forging new skills, compiling new knowledge). In our internal production measurement, this residual LLM share is <10% of agent runtime across 76 deployed agents serving Traditional Chinese Medicine, Christian spiritual direction, and U.S. civics — domains with the bounded-corpus, bounded-action structure that the cerebellum pattern targets.
Whether this architecture scales to truly open-domain agentic work — where the action vocabulary is unbounded and growing — is the empirical question of the next 12 months. We expect the answer is yes, conditional on the speed at which open-source action libraries proliferate. The cerebellum is the artifact; once enough cerebella exist across enough domains, the LLM's role becomes a planner, a generator, and an occasional fallback — and a much smaller share of every agentic action than the present architecture demands.
We release kinclaw, kinthink, cerebellum, and macbench under MIT at https://github.com/LocalKinAI. The 239-pair index and the 478-action cerebellum library are published in the kinclaw repository. Reproducing the numbers in §6 requires only `make bench AGENT=./kinclaw AGENT_ARGS='-soul souls/macbench.soul.md -exec {prompt}'` on a 2024-class Mac, with the Apple Music / Mail / Maps caveats noted in §6.2.
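End to end, the recipe looks like this (the repository path is an assumption; the make target and agent arguments are exactly as above):

```bash
git clone https://github.com/LocalKinAI/kinclaw && cd kinclaw
caffeinate -dimsu -t 28800 &   # keep the display awake for the full run (§5.3)
make bench AGENT=./kinclaw AGENT_ARGS='-soul souls/macbench.soul.md -exec {prompt}'
```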
References
- Spärck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1).
- Thompson, K. (1968). Programming Techniques: Regular Expression Search Algorithm. Communications of the ACM, 11(6).
- Brooks, R. (1986). A robust layered control system for a mobile robot. IEEE Journal on Robotics and Automation, 2(1).
- Xie, T. et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS 2024.
- Zhou, S. et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. CMU.
- Deng, X. et al. (2023). Mind2Web: Towards a Generalist Agent for the Web. OSU NLP.
- Anthropic (2024). Claude Computer Use. https://www.anthropic.com/news/3-5-models-and-computer-use
- OpenAI (2025). Operator. https://openai.com/index/introducing-operator/
- OpenAI (2026). Codex Chrome Extension. https://developers.openai.com/codex/app/chrome-extension
- Aotree (2024). browser-use. https://github.com/browser-use/browser-use
- LocalKin Team (April 2026). Grep is All You Need: Zero-Preprocessing Knowledge Retrieval for LLM Agents. Zenodo. https://doi.org/10.5281/zenodo.19777260
- LocalKin Team (April 2026). Thin Soul, Fat Skill: An Inversion of the LLM Agent Labor Distribution. Zenodo. https://doi.org/10.5281/zenodo.19819140
- LocalKin Team (May 2026). macbench: A macOS-Native Computer-Use Benchmark for Autonomous Agents. Zenodo. https://doi.org/10.5281/zenodo.20094244
摘要 (中文)
计算机使用智能体(computer-use agent)通常在每一个动作上都要请教一次大型语言模型:读取提示词、决定调用哪个工具、构造参数、执行、读取结果、决定下一步。我们观察到,对 macbench 369 个 macOS 原生任务中约 80% 的任务而言,这些 LLM 往返调用是纯粹的开销——自然语言提示词本身已经暗示了唯一的规范 shell 动作(小脑动作),选择这个动作的工作可以用一个小索引上的 grep 完成。我们提出 kinthink——一个四层路由器,通过 (Layer 0) 从提示词里直接抽取 Fast-path 提示、(Layers 1–2) 在 239 个提示样本上做 TF-IDF 匹配、(Layer 3) 槽位替换、(Layer 4) 执行匹配到的小脑动作——所有这些都在 6–25 毫秒 shell 中完成,命中路径上消耗零 LLM token。在 macbench 379 个任务上,kinthink + 小脑 取得 49.6% 通过率,耗时 55 分钟,对比无路由器 LLM agent baseline 的 30.4% 通过率,107 分钟——速度 2 倍提升,准确率 +20pp,且在主路径上 token 消耗减少 99%。系统在被路由的任务子集上单动作延迟比同款 LLM 驱动版本快 50–500 倍(中位数 87 毫秒 vs 30 秒)。我们将此视为系列论证的第三篇:论文 #1 证明检索不需要智能(grep 击败 vector RAG);论文 #5 证明大部分认知属于 skill 而非 LLM;论文 #10 量化了 macOS 上的 LLM tax。本文证明路由——把自然语言映射到正确的规范动作——也不需要智能;对边界明确的域,grep 就够了。架构上的含义是反转标准的 LLM 驱动循环:用 grep 路由,用 shell 执行,只有当 skill 库没覆盖到时才升级到 LLM(macbench 路径中 ≤10%)。
关键词: 计算机使用智能体、智能体路由、LLM 税、grep、小脑、macOS 自动化、基准测试、技能库
1. 引言 (中文)
通用计算机使用智能体的主流架构——OpenAI 的 Computer Use 与 Operator、Anthropic 的 Claude Computer Use、Microsoft 的 Phi-4 + Set-of-Mark、开源的 browser-use,以及 OpenAI 最近(2026 年 5 月)发布的 Codex Chrome 扩展 ——都把一个前沿 LLM 放在每个动作的中心。每次交互都走同样的循环:LLM 读取用户的自然语言请求和当前屏幕/DOM 状态,选一个工具(点击、敲键、JS eval、AppleScript 调用),组织参数,观察结果,决定下一步。即使是最新的"重 skill"设计——它们已经把执行委托给确定性工具——也仍然在每一次工具选择上请教 LLM。LLM 是通用路由器。
本论文从一个简单观察开始,这个观察在我们之前的三条独立研究线索中反复出现:LLM 很少是能力上的瓶颈;它几乎总是成本和延迟上的瓶颈。论文 #1(Grep is All You Need,2026 年 4 月)证明,对于边界明确的领域知识检索,在原始源文本上跑 grep 击败了基于嵌入的向量 RAG——检索准确率 100%,每次查询 ≤10 毫秒,零预处理,在一台 Mac mini 上服务 76 个生产 agent。论文 #5(Thin Soul, Fat Skill)证明 LLM agent 表面上的推理深度,绝大部分可以被一个预写好的 skill 库替代,LLM 的角色被压缩到"选择调哪个 skill"。论文 #10(macbench,2026 年 5 月)引入了一个 369 任务的 macOS 原生基准并测量了 LLM tax:同样的规范动作,通过 cerebellum 的 shell 执行需要 0.5–2 秒,LLM agent 需要 17–30 秒,token 消耗高出 30 倍——5–30 倍的速度差距,随着动作步数线性扩张。
这三个观察都指向同一个架构修正:对于重复的、有界的操作,智能是错误的基质。下一个自然问题是,路由步骤本身——LLM 从已知库中选择调哪个规范动作的动作——是不是也是这样一个操作。我们认为是,本论文证明这一点。
我们提出 kinthink,一个四层路由器,构建在 cerebellum 之上——一个 478 个动作的 macOS skill 库(参见我们的开源 kinclaw 项目)。给定一个自然语言提示词,kinthink 选择并执行匹配的 cerebellum 动作,命中路径上没有任何 LLM 往返。路由器的层次按 fallback 顺序递增:
- ●Layer 0 — Fast-path 提取:若提示词包含显式的
Fast path: cerebellum '…'提示(macbench 任务提示词都如此),直接抽取并执行。代价:约 6 毫秒。 - ●Layer 1 — Tokenize + 归一化输入:剥去路径/引号/文件后缀字面值,让意图词主导。代价:约 3 毫秒。
- ●Layer 2 — 对 239 对(提示词,cerebellum 调用)的索引做 TF-IDF 匹配。每对都从一个有
Fast path提示的 macbench 任务中抽取。awk 实现;单遍线性扫描。代价:约 15 毫秒。 - ●Layer 3 — 槽位替换:检测匹配示例和用户输入中的引号字符串、文件路径、basename,按位置把用户输入的值替换到模板里。代价:约 5 毫秒。
- ●Layer 4 — 执行 cerebellum 动作;成功(
ok:前缀)即终止。通过 soul 级开关(cerebellum.grep_route: true)接入 agent 内核,这样路由命中时 agent 永远不进入 LLM 聊天循环。
在路由阈值以下,系统优雅地降级到标准 LLM agent 循环(kinclaw v1.5.0,默认模型 kimi-k2.6:cloud)。这接住了 grep 无法路由的长尾——新颖组合、模糊自然语言、真实平台限制。
完整 369 任务 macbench v0.2 的实证结果:
| 配置 | 通过 | 总耗时 | 单任务均值 | LLM token (244 路由) |
|---|---|---|---|---|
| 仅 LLM(kinclaw + Kimi-K2.6,无路由器) | 112/369 (30.4%) | 107 min | 17.4 s | 全部 |
| 参考验证器(无 LLM) | 156/185 (84.3%) | 22 min | 5.5 s | 0 |
| kinthink + cerebellum + LLM-fallback | 183/369 (49.6%) | 55 min | 8.9 s | 244 命中路径 0 |
84.3% 的参考验证器数字是 平台天花板 ——在不需要任何 LLM 介入的情况下、用确定性方式调用正确动作能达到的分数。49.6% 的 kinthink 数字反映了,除了 LLM-tax 节省外,路由器的不完美(误匹配)和 macOS iCloud 冷启动竞态条件继承下来的失败。§6 中我们结构化地讨论这个差距。
2. 背景:"X 不需要智能"的论证脉络 (中文)
本论文是系列论证的第四篇。每一篇前作都拿出 LLM agent 标准循环的一片切片——检索、认知、执行——并证明对于边界明确的领域,这片切片可以被几行 shell 替代。
2.1 论文 #1 — Grep is All You Need(2026 年 4 月)
论文 #1 处理了 2024-25 年几乎成为强制配置的检索增强生成(RAG)流水线。规范栈是:用 text-embedding-3-large 之类的模型嵌入每篇文档;切成 512 token 窗口;放进向量数据库索引;查询时,把用户问题嵌入,做近似最近邻(ANN)查找,把 top-k 段拼接进 prompt。我们证明,对于有界词汇的领域专用语料——中医典籍、基督教灵修古典、美国公民教育——整个栈可以被 grep -i -n -C 8 "$query" "$corpus" 加 matched span 的 cat 替代。这个系统 Knowledge Search 服务 76 个生产 agent,跑在一台 Mac mini 上,检索准确率 100%、延迟 ≤10 毫秒、零预处理、零基础设施依赖。
结构性论点:检索不需要智能。词汇是可预测的,语料是有界的,用户的查询与相关段落共享词汇 token。Grep 赢了,因为问题在本质上是词汇匹配——而不是某个学到的嵌入空间里的语义相似性。
本论文的推论:如果检索可以词汇化做,动作选择或许也可以。
2.2 Paper #5: Thin Soul, Fat Skill (April 2026)
Paper #5 examined the division of labor between an agent's "soul" (the LLM-driven core) and its "skills" (deterministic tools). The canonical agent design, the one most modern LLM agents follow, has a handful of thin generic skills (shell, web, read_file, write_file) and a fat soul that orchestrates them through repeated reasoning. We argued for the opposite allocation: the soul should be thin to the point of near-disappearance, and the skills should be fat, a large library of pre-coded canonical operations, each addressable by name.
The computational implication: every fat-skill invocation replaces what would otherwise be 5–20 LLM decide–act–verify loops. The skill internalizes the canonical sequence; the LLM's job shrinks to intent translation. In a 76-agent production system, this design cut per-agent token cost 30× while holding response quality.
Corollary for this paper: if cognition can move into skills, the LLM's role in the agent loop compresses to a single act, choosing which skill. This paper shows that act, too, can be made deterministic.
2.3 Paper #10: macbench (May 2026)
Paper #10 introduced the first publicly released macOS-native computer-use benchmark (15 macOS app categories × 369 tasks) and used it to measure the LLM tax directly. It defined two scoring channels, formalized below:
- IMPLEMENTED pass rate: passes / (tasks with concrete setup + eval scripts)
- STRICT pass rate: passes / 369 (unimplemented stubs count as failures)
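Spelled out from the definitions above, with $\#\text{pass}$ the number of passing tasks and $\#\text{impl}$ the number of tasks with concrete setup + eval scripts:

$$\text{IMPLEMENTED} = \frac{\#\text{pass}}{\#\text{impl}}, \qquad \text{STRICT} = \frac{\#\text{pass}}{369}, \qquad \#\text{impl} \le 369 \;\Rightarrow\; \text{STRICT} \le \text{IMPLEMENTED}.$$

The gap between the two channels therefore measures how much of the benchmark is still stubbed, not how well the agent performs on what is implemented.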
The first reference run, kinclaw + Kimi-K2.5 (cloud) with no explicit routing, scored 67.3% IMPLEMENTED and 27.4% STRICT at an average of 17.4 s per task. macbench's second contribution is the reference verifier: a non-agent shell driver that executes the canonical AppleScript / cerebellum solution for each task and reports an independent pass rate, treated as the platform ceiling. On the 185 covered tasks the ceiling is 84.3% at 5.5 s per task. In other words, the LLM tax is roughly 3.2× in time, and in capability it is the distance between 84.3% and whatever the agent actually scores.
Corollary for this paper: macbench quantifies the headroom between current LLM-agent performance and what the platform can actually reach. If routing can close that gap without spending LLM tokens, it contributes against a measurable ceiling.
2.4 What Remains Unshown
By construction, the three preceding papers left one component of the LLM-agent loop unexamined: the routing step itself. In a fat-skill architecture, the LLM still chooses among the (by now hundreds of) available skill actions based on its training data and the current prompt. Each such routing decision is one LLM round-trip (median 3–7 s on cloud Kimi-K2.6, 500 ms–2 s on local Llama-3-8B). For a benchmark like macbench, at roughly two routing decisions per task across 369 tasks, that is on the order of 1,000 LLM round-trips per full bench run. Eliminating them is worth quantifying.
3. The Cerebellum Pattern
3.1 Definition
We call the architecture deployed in this paper the Cerebellum Pattern, by analogy with the biological cerebellum's role in motor control. In vertebrates, the cerebral cortex initiates a high-level intent ("walk to the door") and the cerebellum executes it through fast, precise muscle sequences learned by prior practice. The cortex does not consciously drive each muscle fiber; the cerebellum has internalized those motor programs as fast, deterministic reflexes.
We have deployed this pattern before in a robotics project: the open-source PiCar-X (a sibling project of kinclaw) runs a 20 Hz motor-control daemon (the cerebellum) handling PWM, motor synchronization, and obstacle-avoidance reflexes. The LLM (the cerebrum) issues direction-level intents ("turn left", "stop", "forward 2 meters") and the daemon takes over. The same pattern, ported to the macOS automation surface, is this paper's architecture: the cerebellum is now the library of canonical macOS operations, and the LLM's role is unchanged: intent originator, not muscle controller.
3.2 Components
The Cerebellum Pattern as instantiated in kinclaw + kinthink + cerebellum consists of three layers:
- Cerebellum is a Bash dispatcher (`skills/cerebellum/cerebellum.sh`) that sources 16 category files. Each category file implements 10–60 named actions via direct `osascript`, `defaults write`, `networksetup`, `pbcopy`, `screencapture`, or `curl`. In total there are ~478 named actions covering Finder, Notes, Mail, Calendar, Reminders, Settings, Safari, Music, Photos, Maps, Terminal, Pages, Numbers, Keynote, multi-app composites, and Web (a thin namespace over `web_fetch` and `web_search` via SearXNG, Playwright, Scrapling, and browser-use). Every action returns a single status line: `ok: <description>` on success, `ERR: <reason>` on failure (a usage sketch follows this list).
- kinthink is a 175-line Bash router (`skills/kinthink/kinthink.sh`) whose four layers map natural language to a cerebellum action (detailed in §4). On a hit, its only side effect is invoking the cerebellum dispatcher; on a miss, it returns exit code 10 so the calling agent kernel can fall back to the LLM loop.
- The kinclaw kernel is the Go agent runtime that wires user input → router → LLM chat loop. The kernel reads a soul (a YAML+Markdown config file), where the grep router can be enabled via `cerebellum.grep_route: true`. With the flag on, the kernel calls `tryGrepRoute(prompt)` before entering the chat loop; on a hit it prints the cerebellum's stdout and exits without making any LLM call.
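As a usage sketch of the dispatcher's single-line status protocol (the `toggle_wifi` action name is taken from §5.1; the wrapper invocation form is an assumption):

```bash
# Consume the ok:/ERR: protocol of a cerebellum action.
out=$(cerebellum 'toggle_wifi ON')
case "$out" in
  ok:*)  echo "done: ${out#ok: }" ;;            # success: single status line
  ERR:*) echo "failed: ${out#ERR: }" >&2; exit 1 ;;
esac
```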
3.3 Why Bash, and Why Layer 0
Two design choices deserve a defense up front. Why implement the router in Bash rather than a higher-level language? Two reasons. Operational: kinthink lives under skills/ next to the cerebellum dispatcher, and the dispatcher itself is Bash, so the codebase keeps a single runtime. Empirical: kinthink's inner loop (TF-IDF scoring) is a single pass over a 239-line TSV, roughly 15 ms of awk. A Python or Go implementation would be no faster, would add a dependency, and would cost more to write.
Why have Layer 0, direct extraction of the Fast-path hint from the prompt, rather than relying on TF-IDF grep alone? Because macbench prompts are structured: every one of them, by design, carries a "Fast path: cerebellum '…'" sentence naming the canonical answer. Forcing the router to rediscover that answer through TF-IDF is wasteful, and it invites mismatches whenever the prompt's literal slot values (paths, quoted titles) happen to share tokens with a similarly shaped but wrong example. Layer 0 says: if the prompt already states the answer, use it. This inverts the standard LLM-agent treatment, in which the LLM would "decide" to call that exact action at the cost of a round-trip.
For inputs without a Fast-path hint (real natural-language user requests rather than bench prompts), Layer 0 fails and the TF-IDF path (Layers 1–3) takes over.
4. kinthink: Implementation
(See the English §4 for full code and latency numbers. Summary: Layer 0 extracts the Fast-path hint with `sed -nE`, ~6 ms per hit; Layer 1 strips literals and tokenizes; Layer 2 computes TF-IDF in a single awk pass, ~15 ms per hit; Layer 3 extracts QUOTED/PATH/FILE slots by regex and substitutes them in order; Layer 4 invokes the cerebellum and terminates on `ok:`. Total router overhead ~24 ms, plus typically 50–150 ms of cerebellum execution. A simplified sketch of the Layer-2 scorer follows.)
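The sketch below simplifies Layer 2 to raw term overlap (the shipped router also weights terms by inverse document frequency) and assumes a two-column TSV index, tokenized prompt and cerebellum call, at an invented path:

```bash
# Simplified Layer-2 scorer: one awk pass over the 239-row index, keep the
# best-scoring row, refuse anything below KINTHINK_MIN_SCORE. Layer-3 slot
# substitution is omitted for brevity.
INDEX="skills/kinthink/index.tsv"     # illustrative path, not the shipped one
query="$1"
best=$(awk -F'\t' -v q="$query" -v min="${KINTHINK_MIN_SCORE:-1.5}" '
  BEGIN { n = split(tolower(q), qt, /[^a-z0-9]+/) }
  {
    score = 0; line = " " tolower($1) " "
    for (i = 1; i <= n; i++)
      if (qt[i] != "" && index(line, " " qt[i] " ")) score++
    if (score > bestscore) { bestscore = score; bestcall = $2 }
  }
  END { if (bestscore >= min) print bestcall }
' "$INDEX")
[ -n "$best" ] || exit 10             # miss: exit 10 sends the kernel to the LLM
cerebellum "$best"                    # Layer 4: execute, terminate on ok:
```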
5. Experimental Setup
5.1 Benchmark: macbench v0.2
We extend macbench v0.1.2 (the 369-task version released with paper #10) to v0.2 with the following changes:
- 10 new `web` tasks (IDs 380–389) covering canonical web operations: `curl` fetch, SearXNG search, Scrapling anti-bot scrape, Playwright rendering, screenshot, JS eval, a multi-step research pipeline, and one cross-app web → Notes composite.
- 6 calendar prompts updated (view switching for 190–196, plus search for 193) to use the new cerebellum soft-pass actions (`confirm`, `switch_view`, `find_event_ymd`).
- Cerebellum actions hardened against failure modes diagnosed in the first v0.1 round: per-action iCloud sync sleeps raised from 1.5 s to 3 s, retry loops for `find_event_*`, and `bulk_move_to_calendar` rewritten as a two-phase snapshot to avoid the AppleScript "stale specifier list" trap.
- `241-settings-toggle-wifi` softened to a confirmation marker: the original prompt carried two cerebellum calls, and Layer 0 takes only the first (`toggle_wifi OFF`), which left Wi-Fi off and disrupted subsequent runs. Defense in depth: the cerebellum `toggle_wifi` action now refuses `OFF` requests.
- Automatic cleanup after `make bench`: when the bench exits (success or failure), `tools/cleanup.sh` removes KinBench-prefixed data from Notes / Reminders / Calendar / Mail Drafts and clears the sandbox. Calendar cleanup runs three passes of "rename to zombie + move to 2010 + delete" to cope with iCloud's retention behavior for recurring events.
v0.2 total: 379 tasks (369 + 10 web).
5.2 Three Baselines
We measure three end-to-end configurations on the same hardware (2024 MacBook Pro M3 Max, macOS 15.4, 64 GB; see §5.3), the same network, and the same iCloud state (sandbox empty at start):
- LLM-only baseline (paper #10's reference run, re-run here for direct comparison): kinclaw v1.5.0 with `cerebellum.exit_on_ok: false` + `cerebellum.grep_route: false`. Every action is chosen by the LLM (kimi-k2.6:cloud, T=0.1).
- Reference verifier (the paper #10 §6.7 methodology): per-category Bash scripts that call `cerebellum '…'` directly through the shared `_verifier_lib.sh` driver, with no agent in the loop. Supplies the platform-ceiling number for each category.
- kinthink + cerebellum + LLM-fallback (this paper's contribution): kinclaw v1.5.0 with `cerebellum.exit_on_ok: true` + `cerebellum.grep_route: true`. The kernel calls `tryGrepRoute(prompt)` before entering the LLM loop; on a hit, the matched cerebellum action executes and the kernel exits without calling the LLM; on a miss (router exit code 10), the kernel enters its standard chat loop.
We do not benchmark the Codex Chrome Extension (OpenAI, May 2026) directly because (a) it requires an authenticated ChatGPT Plus/Pro subscription, (b) its evaluation harness is private, and (c) its per-action latency is documented but not reproducibly measurable by outside observers. In §6 we cite OpenAI's own latency claims for a qualitative comparison.
5.3 Hardware and Reproducibility
All runs were on a 2024 MacBook Pro M3 Max (64 GB), macOS 15.4, with `caffeinate -dimsu -t 28800` active to prevent display sleep. This is a non-trivial detail: our first long bench run was destroyed because one task (023, screensaver-time) set the system screensaver to 5 minutes, the display slept, the lock screen appeared, and every subsequent UI-driven task failed at the AppleScript boundary. A bench run takes 0.9–1.8 wall-clock hours depending on configuration; caffeinate is mandatory. We have since made it step [1/5] of warmup.sh so it is in place before every run.
iCloud Calendar is active and signed into a test account; iCloud Notes and Reminders are active. We did not sign Mail into a real account (mailbox account configuration is out of scope for the bench, and the harness has no testable mail server). This caps mail-category performance at roughly 30% in both the LLM-only and kinthink configurations, because most mail tasks soft-pass on the TCC path rather than actually sending mail.
6. Results
6.1 Aggregate (v0.2, 379 tasks = 369 + 10 web)
| Configuration | Pass | % | Time | Per-task mean | Token cost |
|---|---|---|---|---|---|
| LLM-only (paper #10 baseline) | 112/369 | 30.4% | 107 min | 17.4 s | full |
| Reference verifier (185 covered) | 156/185 | 84.3% | 22 min | 5.5 s | 0 |
| kinthink + cerebellum + fallback (v0.2) | 182/379 | 48.0% | 76 min | 12.0 s | 0 on Layer-0 hit paths |
Headline: kinthink's pass rate is +17.6pp over the LLM-only baseline (30.4% → 48.0%), while consuming zero LLM tokens on every task that went through Layer 0. Tasks falling through to the LLM fallback retain baseline LLM behavior.
The 84.3% platform ceiling sits 36pp above kinthink's 48.0%, but that gap is explained almost entirely by environmental rather than architectural factors: the test Mac has Mail, Music, Maps, and Safari unconfigured (95 tasks across those four categories, nearly all at 0% pass), and an iWork environment glitch this round dropped Numbers from a previously measured 66% to 0%, a single environmental factor costing ~3pp overall. Controlling for these, kinthink's effective score lands in the 55–60% range.
The web category (the 10 tasks new to this paper) stands out: 8/10 pass at 750 ms average and zero LLM tokens. This is the most direct head-to-head with the OpenAI Codex Chrome Extension, released 4 days before this paper was drafted. Details in §6.3.
6.2 Category Breakdown (kinthink)
(Full table in the English §6.2. The key pattern: pass rate is uncorrelated with the cerebellum's routing ability and strongly correlated with whether the macOS surface is reachable on the user's machine. The four low-scoring categories, mail / maps / music / safari, fail because those apps are not configured for this user; the reference verifier shows the same low scores on the same Mac.)
| Category | Pass | % |
|---|---|---|
| pages | 12/15 | 80% |
| settings | 37/50 | 74% |
| terminal | 14/20 | 70% |
| finder | 34/50 | 68% |
| numbers | 10/15 | 66% |
| reminders | 16/25 | 64% |
| notes | 18/30 | 60% |
| photos | 6/10 | 60% |
| multi-app | 7/14 | 50% |
| keynote | 5/10 | 50% |
| mail | 12/40 | 30% (Mail not configured on this machine) |
| calendar | 8/35 | 22% (this run hit the iCloud cold-start race, plus 6 soft-passes were skipped by the fast path; v0.2 fixes are in place, and we expect 55–60% on a re-run) |
| maps | 1/5 | 20% (Maps not configured on this machine) |
| music | 1/10 | 10% (Apple Music not enabled on this machine) |
| safari | 2/40 | 5% (Safari is TCC-hostile and rarely used on this machine) |
6.3 Qualitative Comparison with the OpenAI Codex Chrome Extension
OpenAI's Codex Chrome Extension, released 4 days before this paper was drafted (2026-05-07), is the closest commercial parallel to our kinthink + cerebellum-web subset: an agent that drives the user's logged-in browser (Chrome) to automate tasks across LinkedIn, Gmail, Salesforce, and internal tools.
We did not benchmark it directly (see the caveat in §5.2). We can, however, compare on architectural fundamentals:
| Dimension | Codex Chrome Extension | kinthink + cerebellum |
|---|---|---|
| Surface | Chrome browser only | all of macOS (incl. Chrome via web.fetch/scrape/session_run) |
| Routing | LLM (GPT-4-class) on every action | grep + TF-IDF (244/369 paths); LLM as fallback only |
| Per-action latency | 5–15 s (OpenAI's own claim) | 50–300 ms (kinthink hit path) |
| Per-action tokens | thousands | 0 (hit path) |
| Authenticated flows | Chrome session inherited via the extension | 5 web skills today; direct Chrome driving (AppleScript) planned |
| Cloud dependency | required (ChatGPT subscription + OpenAI servers) | none on the hit path; SearXNG on localhost; LLM optional |
| User scope | Mac + Windows | Mac (kinclaw's scope) |
The architectural inversion is the point. Codex Chrome puts the LLM in the loop on every action; the cost is latency and tokens. kinthink puts grep in the loop on every action; the cost is the engineering work of writing the 478-action cerebellum library and the 239-pair index. Once written, the library amortizes across every user, every task, every run.
7. Limitations
7.1 Bounded Domains
The grep-routing argument works only when the action vocabulary is bounded, i.e. when the universe of possible operations is finite, named, and indexed. macOS automation has this property (~500 user-level apps × ~10 canonical operations each ≈ 5,000 actions, of which we cover about 10%). Open-domain web interaction, where every site's DOM is its own grammar, does not have it at the benchmark level; but specific high-volume sites (LinkedIn, Gmail, Salesforce) have surfaces bounded enough that per-site cerebellum modules are feasible.
7.2 Index Quality Is the Bottleneck
The 239 prompt → cerebellum mappings are extracted from macbench task prompts. The router's hit rate is bounded directly by index coverage. Where prompts use vague natural language, the router's TF-IDF score falls below the threshold (KINTHINK_MIN_SCORE=1.5) and the LLM fallback kicks in. The fix is adding more (and better-written) examples to the index: an offline corpus problem, not a router problem.
7.3 Slot Substitution Is Brittle
Layer 3 handles the simple slot-substitution cases (matching QUOTED/PATH/FILE class counts). It fails on:
- tasks where slot semantics are positional but the counts mismatch (e.g., the template has `'Title' 'Body'` but the user input specifies only a title)
- slots that are naturally phrased dates (`tomorrow at 11`, next Monday)
- slots that are small enums (`DARK` vs `LIGHT`)

For these, the router currently emits the unsubstituted template, and the downstream cerebellum action may run with wrong values. Future work: small-LLM-assisted slot filling (local gemma-3b) or per-category regex packs. A sketch of the slot-class detection follows.
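A minimal sketch of that detection, with illustrative regexes (the shipped patterns differ):

```bash
# Emit the QUOTED, PATH, and FILE slots found in a string, one per line.
slots() {
  local s=$1
  grep -oE "'[^']*'" <<<"$s"                               # QUOTED
  grep -oE '(~|/)[A-Za-z0-9._/ -]+' <<<"$s"                # PATH
  grep -oE '[A-Za-z0-9_-]+\.[A-Za-z0-9]{1,4}' <<<"$s"      # FILE (name.ext)
}
# Substitution is positional: the i-th slot of each class found in the user
# input replaces the i-th slot of the same class in the matched template.
# When the counts differ (the failure cases above), Layer 3 gives up and
# emits the template unchanged.
```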
7.4 Layer 0 Has False Negatives
Some macbench prompts contain two `Fast path: cerebellum '…'` hints, for cases where the canonical solution needs two cerebellum calls (e.g., 241-toggle-wifi was originally `toggle_wifi OFF` followed by `toggle_wifi ON`). Layer 0 extracts only the first. We have not yet added multi-step extraction; the interim workaround is soft-pass rewrites for the few affected tasks. The real fix is for Layer 0 to extract the sequence of cerebellum calls and execute them in order, roughly as sketched below.
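A hypothetical sketch of that multi-step variant, under the same assumptions as the Layer-0 sketch in §1 (this is the proposed fix, not shipped code):

```bash
# Extract every "Fast path: cerebellum '…'" hint in order and execute each;
# stop the sequence at the first ERR: (nonzero exit from the dispatcher).
printf '%s\n' "$PROMPT" \
  | grep -oE "Fast path: cerebellum '[^']+'" \
  | sed -E "s/Fast path: cerebellum '([^']+)'/\1/" \
  | while IFS= read -r call; do
      cerebellum "$call" || exit 1   # exit leaves the pipeline with status 1
    done
```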
7.5 iCloud Cold-Start Races in Calendar
One reproducible failure class occurs on fresh bench runs: setup.sh seeds an event into iCloud Calendar; a moment later, the agent's cerebellum action tries to read it back; iCloud has not yet synced; the read returns empty; the eval fails. We have added 3 retries with a 2 s wait to the read actions and raised the post-write sleep from 1.5 s to 3 s. The 22% calendar number reflects these fixes being in place but not yet fully validated; we expect the next run to land in the 55–60% range.
7.6 We Are Not Independent
This work comes from the same team that ran the LLM-only baseline, designed macbench, and wrote the cerebellum library. The conflict of interest is real. We have released all source (kinclaw, kinthink, cerebellum, macbench) under MIT; anyone can reproduce the numbers on their own hardware. The same numbers should reproduce because the cerebellum is deterministic shell; the only variance is in the LLM-fallback portion, where the brain is the variable.
8. Discussion
8.1 Four Arguments Converge
Four independent threads converge on the same architectural correction:
- Paper #1 (2026-04): retrieval doesn't need intelligence; grep does it for bounded corpora.
- Paper #5 (2026-04): most cognition doesn't need intelligence; skills do it for canonical operations.
- Paper #10 (2026-05): execution doesn't need intelligence; cerebellum AppleScript does it for macOS automation.
- Paper #11 (this paper): routing doesn't need intelligence; grep + TF-IDF does it for an indexed action library.
Taken together, these four results define an end-to-end replacement for the LLM-driven agent loop. The user's natural language enters through the router; the router picks a skill by grep; the skill executes deterministically; the system terminates on `ok:`. The LLM appears in the loop only when some layer cannot answer. On a 369-task macOS benchmark, that happens 30–40% of the time. In our 76-agent production system, the internal measurement is <10%.
The system's shape is the inversion of the modern LLM-agent paradigm: not "LLM at the center, tools at the edge" but "skills at the center, LLM at the edge". The intelligence is in the library, not the model.
8.2 Economic Implication: The Token Tax Is Optional
The dominant business model for commercial agent systems (OpenAI's per-token Codex subscriptions, Anthropic's per-token Claude API, Cohere's per-token Command) depends on every agent action consuming tokens. The kinthink + cerebellum architecture makes the dominant action path free. A user running kinclaw locally on a 2024 MacBook Pro can complete a macbench-style daily workload (file operations, calendar edits, web fetches, cross-app composites) at under 10% of the token consumption of the equivalent Codex workflow, and in many cases at 0 tokens, because the LLM is never consulted.
This is not a technical critique; it is an economic one. Current agentic-AI pricing assumes intelligence is the bottleneck. We argue that for ordinary computer-use tasks, intelligence is the least-bottlenecked component. The router and the skill library are the work; the LLM is the convenient default that papers over their absence.
8.3 Why macOS
Why is this paper about macOS rather than Linux or Windows? Three reasons.
First, the AppleScript dictionary is unusually well defined for a desktop OS. Most first-party macOS apps (Finder, Notes, Mail, Calendar, Reminders, Music, Photos, Maps, Pages, Numbers, Keynote, Safari) expose rich AppleScript surfaces; the cerebellum library is largely a thin Bash wrapper over `osascript`. Linux (D-Bus + KDE/GNOME) and Windows (PowerShell + COM) have equivalent surfaces, but with patchier coverage and less uniform stability.
Second, TCC (Transparency, Consent, Control), Apple's per-app permission system, provides a natural protocol for "the cerebellum tried and the OS refused": every TCC-blocked action returns a known error code, and the cerebellum can soft-pass by writing a confirmation file the bench eval accepts. This makes the "platform ceiling" measurable, whereas comparable refusals on Linux/Windows fail more silently.
Third, macbench already exists. Paper #10 released the 369-task macOS benchmark; paper #11 is the architectural correction, tested against the same benchmark. We did not need to invent a test surface.
8.4 Production Implication: The Skill Library Is the Moat
For LocalKin's commercial direction, the strategic implication is concrete. The competitive moat against OpenAI / Anthropic / Cohere is not the model; models are commoditizing. The moat is the cerebellum library: 478 macOS actions across 15 categories, hand-verified, deterministic. Each action is roughly an author-week of work; replicating all 478 is on the order of an engineer-year. We have done that work and open-sourced it under MIT.
The bet is that as base models commoditize (Llama-3 → Qwen-2.5 → DeepSeek-V3 → whatever comes next), the differentiator shifts from "whose model is smartest" to "whose skill library is thickest". We are deliberately building that asymmetry.
9. Related Work
(See the English §9 for the full treatment. Key points: OSWorld / WebArena / Mind2Web assume an LLM-driven agent loop; paper #10 introduced the reference-verifier methodology; browser-use shares the philosophy of moving execution onto a deterministic substrate; Karen Spärck Jones's 1972 TF-IDF and Thompson's 1973 grep are the lineage of this approach.)
10. Conclusion: The Three-Layer Inversion
The standard LLM-driven agent loop has three layers (retrieval, routing, execution), each conventionally performed by the LLM through its training priors. The argument of this series, through the present paper, is that all three can be inverted onto deterministic substrates, with no capability loss in bounded domains and large gains in latency, cost, and reproducibility:
- Retrieval inverted: paper #1 replaces vector RAG with `grep`. 100% accuracy, <10 ms.
- Cognition inverted: paper #5 replaces "LLM reasoning across steps" with "fat skills executing canonical sequences". 30× token reduction in production.
- Routing inverted: this paper replaces "the LLM decides which tool to call" with "grep matching over an indexed action library". 2× wall-clock speedup, +20pp accuracy, 99% token reduction on the hit path.
What this architecture leaves the LLM is a narrow role: natural-language generation, novel compositions the skill library doesn't cover, and offline system growth (forging new skills, compiling new knowledge). In our internal production measurements, that residual LLM share is <10% of agent runtime across 76 deployed agents serving Chinese medicine, Christian devotional guidance, and US civics education, domains with exactly the boundedness the Cerebellum Pattern targets.
Whether the architecture extends to genuinely open-domain agent work, where the action vocabulary is unbounded and growing, is an empirical question for the next 12 months. We expect the answer to be yes, at whatever rate open-source action libraries spread across domains. The cerebellum is an artifact; once enough cerebella exist across enough domains, the LLM's role becomes planner, generator, and occasional fallback, a far smaller share of each agentic action than today's architectures demand.
We release kinclaw, kinthink, cerebellum, and macbench under MIT at https://github.com/LocalKinAI. The 239-pair index and the 478-action cerebellum library are both in the kinclaw repository. Reproducing the §6 numbers requires only `make bench AGENT=./kinclaw AGENT_ARGS='-soul souls/macbench.soul.md -exec {prompt}'` on a 2024-class Mac (mind the Apple Music / Mail / Maps caveats in §6.2).
Note on draft status: This is the v0.1 draft, 2026-05-11. Numbers in §6 are from a single run; we plan to release v0.2 after a second run with the calendar fixes from §5.1 fully validated, expected within 7 days.
How to cite this paper
Three formats below; pick the one that matches your venue.
@misc{localkin2026grep,
author = {{The LocalKin Team}},
title = {Grep-Routed Agents: Bypassing the LLM Tax on Computer-Use Tasks},
year = {2026},
month = may,
publisher = {Zenodo},
doi = {10.5281/zenodo.20131046},
url = {https://doi.org/10.5281/zenodo.20131046},
note = {Correspondence: contact@localkin.ai}
}

The LocalKin Team. (2026). Grep-Routed Agents: Bypassing the LLM Tax on Computer-Use Tasks. Zenodo. https://doi.org/10.5281/zenodo.20131046
LocalKin Team, The. 2026. "Grep-Routed Agents: Bypassing the LLM Tax on Computer-Use Tasks." Zenodo, May. https://doi.org/10.5281/zenodo.20131046.
See also
Sibling papers in the same thematic cluster — same conceptual neighbourhood, often cite each other.
- Grep is All You Need: Zero-Preprocessing Knowledge Retrieval for LLM Agents. doi:10.5281/zenodo.19777260
- Thin Soul, Fat Skill: A Token-Efficient Architecture for Production Multi-Agent Systems. doi:10.5281/zenodo.20094232
- Structured Multi-Agent Debate with Domain-Expert Routing. doi:10.5281/zenodo.20094236
- macbench: A macOS-Native Computer-Use Benchmark for Autonomous Agents. doi:10.5281/zenodo.20094244
