macbench: A macOS-Native Computer-Use Benchmark for Autonomous Agents
The LocalKin Team
Correspondence: contact@localkin.ai
Project: https://localkin.dev | https://github.com/LocalKinAI/macbench
Position Paper — May 2026 (v0.1, released 2026-05-08)
Abstract
Computer-use agents — autonomous systems that drive a real graphical operating system to accomplish natural-language tasks — have become a central research target since 2024. The de-facto desktop benchmark, OSWorld (Xie et al., NeurIPS 2024), evaluates agents inside Ubuntu and Windows virtual machines and has become the standard published score (Anthropic Computer Use ≈ 38%, GPT-4o + Set-of-Mark ≈ 12-15%). However, OSWorld leaves a structural gap: no public benchmark measures agent capability against macOS-native applications. macOS cannot be virtualized cleanly under the same VM stack OSWorld uses (Apple EULA + KVM unavailability), and Apple has not released its own benchmark. Consequently, every published agent capability number for the Mac surface is either a vendor demo or a private internal score. We introduce macbench, the first publicly published macOS-native computer-use benchmark for autonomous agents. v0.1 ships 369 task slots across 15 macOS app categories (Finder, Safari, Mail, Notes, Calendar, Reminders, Settings, Terminal, Pages, Numbers, Keynote, Music, Photos, Maps, Multi-app), an agent-agnostic Go runner, and a dual-scoring methodology (IMPLEMENTED + STRICT) that distinguishes "agent capability against the runnable subset" from "progress against the full benchmark including unimplemented slots." We further introduce per-task PID-snapshot isolation, a runner-level technique that prevents mid-run app-state pollution while preserving any pre-existing user app instance. The first reference run, kinclaw v1.15.0 + Kimi-K2.5(cloud), scores 67.3% IMPLEMENTED (101/150) and 27.4% STRICT (101/369). We document the full debugging trajectory that produced this score — three real bugs and two false positives surfaced and fixed across multiple iterations — as a methodology contribution: a single-shot benchmark number is rarely the agent's true capability, and the gap between contaminated and clean runs (49.3% → 67.3% on the same agent and brain) is large enough to invalidate naïve cross-comparisons.
Keywords: computer-use agents, benchmark, macOS, autonomous agents, AppleScript, OSWorld, methodology, task isolation
1. Introduction
The release of Anthropic's Computer Use in October 2024 made GUI-driving LLM agents a mainstream research target overnight. OSWorld (Xie et al., NeurIPS 2024), first released in April 2024, quickly became the de-facto evaluation standard — 369 tasks across Chrome, VLC, GIMP, LibreOffice, VS Code, Thunderbird, all running inside an Ubuntu virtual machine. Every major agent paper since (OpenAI's CUA, ByteDance, Microsoft Phi-4, Google Gemini, the open-source Qwen2.5-VL line) reports an OSWorld score. The number has become a shorthand: in early 2025 the published frontier was Claude 3.5 Sonnet at ≈14%; by mid-2025 OSWorld-Verified raised that to ≈38% under Claude Sonnet 4 + improved scaffolding.
This paper begins with an observation that has been quietly present in every OSWorld run: the benchmark explicitly chose Ubuntu/Windows and explicitly excluded macOS. OSWorld's installation guide is direct about it: "macOS hosts generally do not support KVM. You are advised to use VMware if you would like to run OSWorld on macOS." The recommendation refers to running the benchmark on a Mac host, with the agent still operating an Ubuntu guest. macOS-native applications — Finder, Mail, Notes, Calendar, Reminders, System Settings, the iWork suite — are not in scope, anywhere, in any public benchmark.
This is not an oversight. Three structural obstacles stand in the way:
- Apple's EULA prohibits macOS virtualization on non-Apple hardware. OSWorld's distribution model — pre-built VM images downloadable to any Linux/Windows/Mac host — cannot apply.
- macOS does not run under KVM. Mac CI in industry uses bare-metal Mac mini fleets, which are 5-10× the cost of an equivalent Linux server pool. Academic groups cannot easily afford this.
- Apple has not released a benchmark. Apple Intelligence is closed-source, and Apple's evaluation methodology (if any exists publicly) targets their own first-party tasks.
The downstream effect is concrete: when a user asks "how good is the current best agent at driving Finder, or composing an email in Mail.app, or creating a reminder," there is no published number to point at. Every claim about Mac agent capability in 2025 is either a vendor blog post, a curated demo video, or a private internal score that does not generalize.
This paper introduces macbench, the first publicly published macOS-native computer-use benchmark, designed to fill exactly this gap. Our contributions are:
- A 369-task corpus across 15 macOS-native app categories, covering T1 (single-app, single-step), T2 (single-app, multi-step), and T3 (cross-app, semantic) difficulty tiers. v0.1 ships 150 tasks fully implemented (with deterministic setup.sh + eval.sh + optional teardown.sh) and 219 stubs (real prompts and category assignments, scaffolded but not yet implemented).
- An agent-agnostic Go runner (~520 LOC) that accepts any binary that takes a prompt and drives macOS, parameterized via -agent PATH and -agent-args TEMPLATE with {prompt} substitution.
- Dual-scoring methodology: every run reports IMPLEMENTED (passed / runnable, ignoring stubs) and STRICT (passed / 369, stubs count as fail). Both numbers are emitted; honest reports must cite both.
- Per-task PID-snapshot isolation: a runner-level technique that captures the PIDs of bench-touched apps at startup and, between every task, kills only PIDs the bench itself spawned, preserving any user instance that pre-existed the run.
- A first reference score: kinclaw v1.15.0 + Kimi-K2.5(cloud) achieves 67.3% IMPLEMENTED (101/150) and 27.4% STRICT (101/369).
- A documented debugging trajectory that surfaced three real bugs and two false positives, each with concrete fixes. We argue this trajectory is itself a methodological contribution: the same agent went from 49.3% (contaminated environment) to 67.3% (clean state with isolation), and that gap is the size of an entire OS-level capability claim.
This paper is structured as follows. §2 surveys related work. §3 describes the benchmark design (task taxonomy, three-file pattern, dual scoring). §4 presents the runner methodology, including PID-snapshot isolation. §5 reports reference results and per-category breakdown. §6 documents the debugging trajectory and lessons. §7 honestly addresses limitations. §8 discusses roadmap and community contribution paths. §9 closes with discussion of broader implications.
2. Related Work
Computer-use benchmarks split cleanly along the operating-system / surface dimension:
| Benchmark | Platform | Task count | Eval style | First release |
|---|---|---|---|---|
| macbench (this work) | macOS native | 369 (150 implemented in v0.1) | filesystem + defaults + AppleScript + sqlite | 2026-05 |
| OSWorld | Ubuntu / Windows VM | 369 | filesystem + ROS + screenshot match | 2024-04 |
| WebArena | Web (Playwright + Docker) | ~800 | DOM + side-effect | 2023-07 |
| VisualWebArena | Web (visual emphasis) | ~910 | DOM + visual match | 2024-01 |
| AndroidWorld | Android emulator | 116 | UI tree + side-effect | 2024-05 |
| WindowsAgentArena | Windows VM | ~150 | filesystem + registry | 2024-09 |
| Mind2Web | Static HTML snapshots | 2,350 | action-step accuracy | 2023-06 |
| Online-Mind2Web | Live web | TBD | side-effect + DOM | 2025-03 |
OSWorld's contributions and limitations are both well-documented. The eval-script-exits-with-status methodology, the three-file pattern (task.json + setup.sh + eval.sh), and the difficulty taxonomy all generalize to macOS without modification. We adopt these directly. What does not generalize is the assumption that the operating system can be virtualized for parallel cloud evaluation; macbench inherits the per-machine evaluation model of WindowsAgentArena instead, and accepts the resulting limitations on parallelism (no parallel scaling on commodity cloud) in exchange for native-OS fidelity.
WebArena and VisualWebArena are complementary, not competitive: they benchmark the web agent surface (Playwright over Docker-hosted clones of Reddit, GitLab, OneStopShop). kinclaw, the reference agent for macbench v0.1, has a separate "web claw" with three tiers (URL-first / single-shot Playwright / browser-use multi-step) that we plan to evaluate against WebArena in v0.2 (see Roadmap, §8). Cross-benchmark validation across surfaces is a deliberate roadmap item.
Mind2Web is excluded from macbench's design space deliberately. Its original (static) form evaluates next-action prediction given a cached HTML snapshot — this measures the model's classification ability, not an agent's execution ability. Running an execution-oriented agent like kinclaw on static Mind2Web reduces the agent to its brain, dropping the soul, the 5-claw machinery, and the runner — testing the LLM, not the agent. The 2025-03 Online-Mind2Web variant is a live-execution sibling to WebArena and is roadmap-aligned.
3. Benchmark Design
3.1 Three-file task pattern
Each task is a directory under tasks/<NNN-slug>/ containing three files (plus an optional fourth):
tasks/001-finder-rename/
├── task.json ← metadata + the natural-language prompt
├── setup.sh ← write fixtures, reset state — runs BEFORE agent
├── eval.sh ← validate post-state — exit 0 = pass, anything else = fail
└── teardown.sh ← optional — restore user state, runs whether pass/fail
We adopt OSWorld's pattern verbatim because it has proven robust to scale (369 tasks shipped, multiple agent backends evaluated). Each task's task.json declares its ID, category, difficulty tier, the natural-language prompt, an optional per-task timeout, and a status flag (stub for unimplemented slots; absent or implemented for runnable tasks):
{ "id": "001-finder-rename", "category": "finder", "difficulty": "T1", "prompt": "Rename the file at ~/Desktop/kinbench/001-input.txt to 001-output.txt (keep it in the same folder).", "timeout_sec": 60 }
setup.sh plants test fixtures into a sandboxed location (~/Desktop/kinbench/<task-id>/... for files, ~/.kinbench/<task-id>-* for stashed user state to be restored later). eval.sh performs deterministic state observation and exits 0 if the agent's action satisfied the success criterion, non-zero otherwise. Eval scripts use defaults read, AppleScript queries, sqlite reads against app data stores (e.g. Notes' SQLite-backed body field, Calendar's Calendar Cache.sqlite), xattr decoding for Finder color tags, and file content/size/magic checks. We deliberately exclude LLM-as-judge from any eval path: every pass criterion must be expressible as a shell predicate.
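To make the pattern concrete, the eval for the rename task above can be as small as the following sketch (illustrative only; the shipped script may differ, e.g. by also checking file contents):

```bash
#!/bin/bash
# Illustrative eval.sh for 001-finder-rename (a sketch, not the shipped script).
# Pass criterion: the new name exists and the old name is gone. exit 0 = pass.
DIR="$HOME/Desktop/kinbench"

if [ ! -f "$DIR/001-output.txt" ]; then
  echo "FAIL: $DIR/001-output.txt does not exist"
  exit 1
fi
if [ -f "$DIR/001-input.txt" ]; then
  echo "FAIL: $DIR/001-input.txt still exists (copied instead of renamed?)"
  exit 1
fi
echo "PASS"
exit 0
```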
3.2 Task taxonomy
v0.1 distributes 369 task slots across 15 categories:
| Category | Slots | v0.1 implemented | Eval primitives |
|---|---|---|---|
| Finder | 50 | 39 | filesystem, xattr, sqlite (Spotlight) |
| Safari | 40 | 11 | AppleScript URL/window, history.db, Bookmarks.plist |
| Mail | 40 | 1 | AppleScript drafts/inbox/flags, Envelope Index sqlite |
| Notes | 30 | 21 | AppleScript notes (body HTML strip + substring) |
| Calendar | 35 | 17 | AppleScript events + dates |
| Reminders | 25 | 16 | AppleScript lists/reminders/due dates |
| Settings | 50 | 19 | defaults read, defaults -currentHost, plists |
| Terminal | 20 | 13 | filesystem of agent-created files |
| Pages | 15 | 1 | bundle inspection + AppleScript doc dictionary |
| Numbers | 15 | 1 | AppleScript table cell access |
| Keynote | 10 | 0 | AppleScript presentation/slide queries |
| Music | 10 | 4 | AppleScript player state, library access |
| Photos | 10 | 1 | AppleScript album/asset queries |
| Maps | 5 | 0 | AppleScript directions/search |
| Multi-app (T3) | 14 | 6 | composed evals from above |
| Total | 369 | 150 | |
Distribution rationale: macOS-native apps Apple ships, weighted by real-world usage frequency. Settings + Mail + Finder receive the largest counts because that is where the most distinct verbs live. Multi-app tasks are cross-app workflows that exercise the agent's planning ability (e.g., "Find the file in Finder, then open Mail and create a draft with that file attached").
3.3 Difficulty tiers
We adopt OSWorld's three-tier system:
- T1 (single-app, single-step): "Rename foo.txt to bar.txt"
- T2 (single-app, multi-step): "In Finder, find the file ~/Desktop/kinbench/031-tag-me.txt and apply the Red color tag (right-click → Tags → Red)"
- T3 (cross-app, semantic): "Find the file in Finder, then open Mail and create a draft with that file attached"
v0.1's implemented subset is heavy on T1+T2 (139 of 150) because cross-app workflows are the most expensive to design well. T3 expansion is roadmap work for v0.2 onwards.
3.4 Dual scoring: IMPLEMENTED + STRICT
A central methodological choice: every run reports two pass rates, not one.
IMPLEMENTED: P / I (X.X%) — passed P of I tasks that were runnable; stubs ignored
STRICT: P / 369 (Y.Y%) — passed P of 369 total slots; stubs count as fail
The motivation is honesty. STRICT alone would hide agent capability behind v0.1's incomplete implementation: if the benchmark ships only 150/369 tasks, an agent capable of solving every implementable task receives a 40% STRICT score that reflects benchmark immaturity, not agent weakness. IMPLEMENTED alone would hide how much benchmark is missing: a 67% IMPLEMENTED score reads as world-class until you notice it was computed against only 41% of the design surface. Reporting both prevents either form of misrepresentation.
Both numbers go in the per-run JSON report. The macbench style guide (in AUTHOR_GUIDE.md) requires that any leaderboard, blog post, or paper citing a macbench result include both numbers.
3.5 Stub handling
Unimplemented tasks are not silently absent. They live in tasks/<NNN-slug>/ with a real task.json (correct ID, category, difficulty, prompt) but no setup.sh / eval.sh. The runner detects "status": "stub" and short-circuits: the task is reported with the marker ~, contributes 0 ms to total run time, counts toward the STRICT denominator, and does not run any agent invocation. This makes stubs a first-class artifact: contributors can pick a stub, write its setup/eval, and drop the "status": "stub" field to promote it to runnable.
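A sketch of how the two denominators fall out of the corpus, assuming jq is available (the status field is the one described above; the jq dependency is an assumption of this sketch, not a benchmark requirement):

```bash
#!/bin/bash
# Count the STRICT and IMPLEMENTED denominators from the task corpus (sketch).
# A slot is a stub when its task.json carries "status": "stub".
total=0; stubs=0
for tj in tasks/*/task.json; do
  total=$((total + 1))
  if [ "$(jq -r '.status // "implemented"' "$tj")" = "stub" ]; then
    stubs=$((stubs + 1))
  fi
done
echo "STRICT denominator:      $total"
echo "IMPLEMENTED denominator: $((total - stubs))   (stubs: $stubs)"
```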
4. Runner Methodology
The runner is a single-file Go program (~520 LOC) that orchestrates the per-task lifecycle: load task → snapshot environment → setup → start optional recording → invoke agent with timeout → stop recording → eval → teardown → isolate. We highlight four design decisions that diverge from naïve implementations and that, in our debugging experience (§6), proved necessary for honest measurement.
4.1 Agent-agnostic invocation
The runner does not hard-code any agent. It accepts:
- -agent PATH: path to the agent binary (kinclaw, Anthropic CUA wrapper, OpenAI CUA wrapper, or any custom executable that takes a prompt and drives macOS).
- -agent-args TEMPLATE: argument template containing the literal string {prompt}, which is substituted at run time with the task's natural-language prompt.
For example, kinclaw is invoked as:
-agent kinclaw -agent-args "-soul souls/macbench.soul.md -exec {prompt}"
Anthropic Computer Use (hypothetical wrapper) would be:
-agent anthropic-cua -agent-args "--task {prompt} --max-tokens 4096"
The template is tokenized with naïve whitespace splitting, and {prompt} is then substituted into its single argv slot, so the prompt reaches the agent as exactly one argument regardless of its internal whitespace. Agents whose CLI shape doesn't fit a single template can supply a 3-line shell wrapper (see the sketch below).
This decoupling matters for benchmark longevity: the runner has no knowledge of any specific agent's internal protocol, and adding a new agent backend requires zero runner changes.
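As an example of the 3-line wrapper escape hatch mentioned above, a hypothetical agent that reads its task from stdin rather than argv could be adapted like this (the some-agent name and its flag are invented for illustration):

```bash
#!/bin/bash
# my-agent-wrapper: adapts a stdin-driven agent to macbench's argv contract.
# Wired up as: -agent ./my-agent-wrapper -agent-args "{prompt}"
printf '%s\n' "$1" | some-agent --run-until-done   # hypothetical agent CLI
```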
4.2 Per-task PID-snapshot isolation
In our first reference run (§6.2), tasks 174-187 (a contiguous sequence of 14 Notes tasks) all failed with setup-script timeouts of exactly 60 seconds. The cause was not Notes itself — it was that AppleScript hangs accumulate inside Notes (and Calendar, and Reminders) after roughly 5-10 invocations against a single warm process. Whether this is a Notes implementation bug or an expected limitation is irrelevant; what matters is that the failure is mid-run-environmental, not agent-attributable.
The naïve fix — killall <App> between tasks — is wrong, because if the user happens to be running macbench while Safari is open with their actual work, the benchmark has just nuked their session. We require a fix that kills only the processes the bench itself spawned, leaving any user-pre-existing instance alive.
Our solution: at runner startup, before the first task, snapshot the PIDs of all bench-touched apps:
```
preBenchSnapshot = {
  "Safari":   pgrep -x Safari,   # whatever's running now → user state
  "Notes":    pgrep -x Notes,
  "Calendar": pgrep -x Calendar,
  ...                            # 14 apps
}
```
After every non-stub task (whether pass or fail), iterate the bench-touched apps:
```
for app in benchTouchedApps:
    for pid in pgrep -x app:
        if pid not in preBenchSnapshot[app]:
            kill -SIGTERM pid
```
PIDs that existed at startup are spared. PIDs that appeared during the run (i.e., apps the bench itself launched, or new instances the agent created) are killed cleanly. The benchmark prints (isolation: N pre-existing PIDs across 14 apps will be preserved) at startup so users know their state is respected.
Per-task overhead: ~300 ms (mostly settle delay). Net effect on the reference run: Notes/Calendar/Reminders task completion rates moved from ≈ 5%/47%/64% (no isolation) to ≈ 70%/47%/75% (per-task isolation; Calendar's number stayed the same because most Calendar failures were eval-script bugs, not state pollution). Net IMPLEMENTED score: 49.3% → 67.3% on the same agent, brain, and task corpus (§6.4).
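For readers who want to reproduce the isolation logic outside the Go runner, a shell rendering of the same idea (a sketch, kept bash-3.2-portable; the app list is truncated here, and the runner implements this natively):

```bash
#!/bin/bash
# Per-task PID-snapshot isolation: shell sketch of the runner's Go logic.
APPS="Safari Notes Calendar Reminders Mail"   # truncated; the runner tracks 14 apps
SNAP="$(mktemp)"

# At startup: record the PIDs that already belong to the user.
for app in $APPS; do
  for pid in $(pgrep -x "$app"); do
    echo "$app $pid" >> "$SNAP"
  done
done

# After every non-stub task: kill only PIDs that appeared during the run.
isolate() {
  for app in $APPS; do
    for pid in $(pgrep -x "$app"); do
      if ! grep -qx "$app $pid" "$SNAP"; then
        kill -TERM "$pid" 2>/dev/null   # bench-spawned instance: terminate
      fi
    done
  done
  sleep 0.3   # settle delay (~300 ms, as in the runner)
}
```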
4.3 Eval always runs
A subtler but equally consequential design choice. The naïve runner pseudo-code is:
```
ctx, cancel := context.WithTimeout(ctx, perTaskTimeout)
err := exec(agent, prompt, ctx)
if err != nil:
    return Result{phase: "exec", pass: false}   # ← BUG
runScript(eval.sh)
```
This skips eval if exec returned an error or timeout. But many computer-use agents complete the task and then keep exploring — they do not have a clean "task done, exit" signal. Anthropic's Computer Use documentation explicitly notes this. Our agent (kinclaw + Kimi-K2.5) does the same: it finishes the rename, then spends another 30 seconds inspecting the screen "to verify," then hits timeout. The agent's exit code is irrelevant to whether the task was actually completed — only the post-run world state matters.
Our runner therefore always runs eval, regardless of exec's exit:
```
out, err := cmd.CombinedOutput()
execErr := ""
if ctx.Err() == context.DeadlineExceeded {
    execErr = "timeout after " + timeout
} else if err != nil {
    execErr = err.Error()
}

// 3. eval — ALWAYS
evalOut, evalErr := runScript("eval.sh", evalTimeout)
if evalErr != nil:
    if execErr != "":
        Phase = "exec"; ErrMsg = execErr + " (eval also failed)"
    else:
        Phase = "eval"; ErrMsg = evalErr.Error()
else:
    // eval passed; if exec had complained, note it but still pass
    if execErr != "":
        AgentOut = "(exec: " + execErr + " — eval passed anyway)"
```
This change moved roughly 8% of tasks from FAIL to PASS in our reference run — these were tasks where the agent did the work and then timed out exploring.
4.4 Process-group cleanup for setup/eval
When setup.sh or eval.sh invokes osascript, which then talks to a target macOS app, killing the bash parent does not kill the osascript child or the open IPC pipe to the app. The runner's cmd.CombinedOutput() blocks until the pipes close, which can stretch a hard 30-second timeout into 90+ seconds in practice. We fix this by running each script in its own process group:
```
cmd := exec.Command("bash", scriptName)
cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
// ... on timeout:
pgid, _ := syscall.Getpgid(cmd.Process.Pid)
syscall.Kill(-pgid, syscall.SIGTERM)   // negative PGID = whole group
time.Sleep(200 * time.Millisecond)
syscall.Kill(-pgid, syscall.SIGKILL)
```
The 200 ms grace period gives osascript a chance to flush its outbox to the target app before being killed; SIGKILL is the backstop. Empirically, this brings hung-osascript timeouts down from observed-90-seconds-and-counting to clean 30-second cleanup.
4.5 Optional kinrec recording
The runner accepts a -record flag that wraps each task's exec phase with kinrec (the LocalKin family's pure-Go macOS screen recorder, built on sckit-go). One mp4 per task lands in results/<run-stamp>/recordings/<task-id>.mp4, with SIGTERM-stop so the mp4 trailer (moov atom) writes cleanly. This is for debugging and demonstration — production scoring runs do not require it.
5. Reference Results
5.1 Setup
The first reference run was conducted on 2026-05-08 against a single Mac mini (M-series, macOS 26.3, 24 GB RAM). The agent under test was kinclaw v1.15.0, configured with:
- Soul: souls/macbench.soul.md — single-task focus, no memory, no spawn, no network. 8 skills enabled (the 5 claws — screen, input, ui, record, plus shell, file_read, file_write, file_edit, app_open_clean). Temperature 0.1. The macbench soul disables memory specifically to prevent cross-task pollution (§6.1).
- Brain: Kimi-K2.5(cloud), 65,536-token context window.
- Permissions: Accessibility + Screen Recording granted to the kinclaw binary (signed with stable adhoc identifier com.localkinai.kinclaw).
The macbench warmup script ran before the bench itself, force-quitting all bench-touched apps, wiping the sandbox, clearing any leftover KinBench-prefix data in app data stores, and probing each app's osascript channel for TCC health.
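The per-app osascript probe at the end of the warmup can be as simple as the following (a sketch of the idea; the shipped warmup script may differ, and counting windows will launch an app that is not already running):

```bash
#!/bin/bash
# Probe each bench-touched app's AppleScript channel for TCC health (sketch).
# A -1743 "Not authorized to send Apple events" error means Automation TCC
# must be (re)granted before the bench run.
for app in Safari Notes Calendar Reminders; do
  if out=$(osascript -e "tell application \"$app\" to count windows" 2>&1); then
    echo "ok    $app ($out windows)"
  else
    echo "FAIL  $app: $out"
  fi
done
```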
5.2 Headline numbers
kinclaw v1.15.0 + Kimi-K2.5(cloud) on macbench v0.1
IMPLEMENTED: 101 / 150 = 67.3%
STRICT: 101 / 369 = 27.4% (stubs count as fail)
Run time: ~95 minutes (with per-task PID-snapshot isolation)
For context, Anthropic Computer Use scores ≈ 38% on OSWorld (Linux desktop). macbench measures a different surface (macOS native), so the numbers are not directly comparable — but the methodology and scoring discipline are the same.
Figure 1. Cross-benchmark capability comparison.
0% 25% 50% 75% 100%
├───────┼───────┼───────┼───────┤
kinclaw + Kimi-K2.5 ████████████████████████████░░░░ 67.3% macbench v0.1 (macOS)
on macbench (this work)
Anthropic Computer Use ███████████████░░░░░░░░░░░░░░░░░ ~38% OSWorld (Linux)
on OSWorld
GPT-4o + Set-of-Mark ████░░░░░░░░░░░░░░░░░░░░░░░░░░░░ ~13% OSWorld
on OSWorld
Human reference ████████████████████████████████ ~72% OSWorld
on OSWorld
The three OSWorld numbers above are not macbench results — they are quoted from public reports for orientation. Direct comparison is not valid because the underlying task surfaces differ. But they show the order of magnitude: macbench results land in the same band as state-of-the-art Linux-desktop agents, suggesting the macOS-native vertical is not categorically harder or easier — just unmeasured until now.
5.3 Per-category breakdown
| Category | Pass / Implemented | Rate | Observation |
|---|---|---|---|
| Photos | 1/1 | 100% | Sample size 1; not yet meaningful |
| Reminders | 12/16 | 75% | AppleScript-driven; agent is fluent |
| Music | 3/4 | 75% | Player state + playlist creation |
| Finder | 28/39 | 72% | Strongest category — filesystem-eval-friendly |
| Settings | 13/19 | 68% | defaults read evals work cleanly |
| Calendar | 8/17 | 47% | Mid-run app degradation hurt; isolation recovered most |
| Terminal | 6/13 | 46% | Mixed; some tasks still need eval refinement |
| Multi-app | 2/6 | 33% | Cross-app planning is hard at this scale |
| Notes | 1/21 | 5% | Investigated as bug initially; see §6.4 |
| Mail | 0/1 | 0% | Single task, draft-creation eval bug |
| Pages / Numbers / Keynote | 0/1 / 0/1 / 0/0 | 0% / 0% / – | iWork file inspection requires v0.2 tooling |
| Maps | 0/0 | – | All stubs in v0.1 |
Figure 2. Per-category pass rate, sorted strongest to weakest.
0% 25% 50% 75% 100%
├───────┼───────┼───────┼───────┤
Photos (1/1) ████████████████████████████████ 100%
Reminders (12/16) ████████████████████████░░░░░░░░ 75%
Music (3/4) ████████████████████████░░░░░░░░ 75%
Finder (28/39) ███████████████████████░░░░░░░░░ 72%
Settings (13/19) █████████████████████░░░░░░░░░░░ 68%
Calendar (8/17) ███████████████░░░░░░░░░░░░░░░░░ 47%
Terminal (6/13) ███████████████░░░░░░░░░░░░░░░░░ 46%
Multi-app (2/6) ██████████░░░░░░░░░░░░░░░░░░░░░░ 33%
Notes (1/21) ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 5%
Mail (0/1) ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0%
Pages (0/1) ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0%
Numbers (0/1) ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░ 0%
The two extremes — 75% Reminders versus 5% Notes — initially looked like agent strengths and weaknesses. Investigation (§6.4) showed it was largely an eval-script portability bug.
5.4 Failure phase distribution
Of the 49 failed tasks (out of 150 implemented):
eval-fail 32 (eval ran but state didn't match)
setup-timeout 22 (setup AppleScript hung past 60s — pre-isolation)
exec-timeout 19 (agent ran past per-task timeout)
setup-other 3 (script bug)
The eval-fail bucket is the most informative: 32 tasks where the agent visibly did something (the screen showed the action), but the eval predicate was not satisfied. Approximately half of these are real agent failures (the agent did the wrong thing). The other half are eval-script bugs that surfaced only when state was passed through real macOS apps (Notes body HTML strip patterns, Safari URL encoding edge cases, AppleScript dictionary differences across macOS versions).
6. Methodology Lessons
The core empirical contribution of this paper is not the headline number; it is the trajectory that produced it. The same agent (kinclaw v1.15.0), the same brain (Kimi-K2.5), and the same 150-task subset produced four different scores depending on environmental conditions:
| Phase | Conditions | IMPLEMENTED |
|---|---|---|
| Phase 2 | Pilot soul (memory + spawn enabled), no warmup, no isolation | 49.3% (74/150) |
| Phase 2.5 | Macbench soul (no memory), single warmup at start, no isolation | 62.0% (93/150) |
| Phase 3 | Macbench soul, warmup + per-task PID isolation | 67.3% (101/150) |
The 18-point gap between Phase 2 and Phase 3 is not noise. It is the systematic undercount that benchmark contamination produces. We document the bugs that explain it — both real and false-positive — because each is reproducible and each has a concrete fix.
Figure 3. Same agent, same brain, same task corpus — three different scores depending on environmental conditions.
0% 25% 50% 75% 100%
├─────────────┼─────────────┼─────────────┼─────────────┤
Phase 2 ███████████████████████████████░░░░░░░░░░░░░░░░░░░░░ 49.3% pilot soul, no warmup, no isolation
(74/150) │ ▲ memory-skill cross-task pollution
│ ▲ apps degrade after 5-10 invocations
│ ▲ eval skipped on exec timeout
▼
Phase 2.5 ██████████████████████████████████████████░░░░░░░░░░ 62.0% + macbench soul (no memory)
(93/150) │ + warmup once at start
│
▼
Phase 3 ████████████████████████████████████████████████░░░░ 67.3% + per-task PID-snapshot isolation
(101/150) + eval-always-runs
Each step's recovery has a named cause and a documented fix (§6.1–6.3). The trajectory itself — not just the endpoint — is the methodological claim of this paper.
6.1 Real Bug 1: Cross-task memory pollution
When make bench was first run with the default pilot.soul.md (kinclaw's daily-use soul), the agent's behavior in task 008 was unexpected:
Task 008 prompt: "Move file.txt from ~/Desktop/kinbench/008-src/ to ~/Desktop/kinbench/008-dst/"
Agent response (in Chinese; translated): "I'll work through these tasks one by one. File operations first, then Notes and Mail.
- ✅ 005-archive.zip has been created
- ✅ file.txt has been moved from 008-src/ to 008-dst/
- ⚠️ 001-input.txt and 007-doomed.txt do not exist; 006-myfolder already exists
- ✅ KinBench Test 015 already contains 'appended line'
... [continues attempting tasks 005, 009, 010, 016, 024]"
The agent was attempting to perform tasks 005, 006, 007, 009, 010, 016, 024 while doing task 008. The mechanism: pilot.soul.md enables the memory skill, which loads ~/.kinclaw/sessions/... at process startup and presents prior context as part of the agent's awareness. When each task invoked a fresh kinclaw -exec PROMPT, the prior tasks' prompts were available in memory, and the agent treated the current prompt as a continuation rather than a fresh request.
The fix: a dedicated benchmark soul (macbench.soul.md) that disables memory, spawn, and network, with a hard rule in the soul prompt: "Do EXACTLY one task — the one in your prompt. Don't recall prior tasks. Exit when done." The runner now defaults to this soul.
This bug explains the bulk of Phase 2's contamination: when memory was disabled, the cross-task spillover stopped, recovering ≈ 13 percentage points (Phase 2 → Phase 2.5).
6.2 Real Bug 2: Mid-run AppleScript app degradation
After fixing memory pollution, a second pattern appeared in Phase 2.5 results: tasks 174-187 (Notes), 208/213/215 (Calendar), and 224 (Reminders) all hit 60-second setup timeouts. These are not the first Notes/Calendar/Reminders tasks in the run — earlier ones (003, 014, 017-020, 165, 166) succeeded in 5-25 seconds. The pattern: AppleScript invocations against Notes, Calendar, or Reminders that warmed up cleanly at the run's start hung after roughly 5-10 invocations against the same warm app process.
We confirmed the diagnosis by observing that running the warmup script (which force-quits all bench-touched apps) between any two failing tasks restored normal AppleScript response time. The bug is in macOS app-state accumulation, not in our scripts.
The fix: per-task PID-snapshot isolation (§4.2). Each task gets a fresh app process. ~300 ms per-task overhead in exchange for no degradation.
This recovered ≈ 5 more percentage points (Phase 2.5 → Phase 3).
6.3 Real Bug 3: Eval-skip-on-exec-timeout
The original runner skipped eval.sh if exec returned a timeout or non-zero exit. Tasks where the agent completed the action and then hit timeout while exploring were therefore counted as fail even when the world was in the success state. We fixed this by always running eval, regardless of exec outcome (§4.3).
In Phase 2 results, ~8 tasks were in this state. After the fix, they passed correctly.
6.4 False Positive 1: "kinclaw is bad at Notes"
Phase 2 reported a 5% pass rate on Notes (1/21), suggesting kinclaw + Kimi-K2.5 had a fundamental weakness with the Notes app. Investigation revealed something different.
Several Notes evals contained the construct:
```bash
NORM="$(...)"
if [[ "${NORM,,}" == *"hello from kinbench"* ]]; then
  echo PASS
fi
```
The ${NORM,,} lowercase-substitution syntax requires bash ≥ 4. macOS still ships bash 3.2 by default. On every test machine running the evals with /usr/bin/env bash, the expansion failed with a bad substitution error, the eval exited non-zero, and the task was scored fail — even when the agent had completed the task correctly.
We found this by manually inspecting the eval output:
first line: 'hello from kinbench'
eval.sh: line 12: ${NORM,,}: bad substitution
FAIL: 'hello from kinbench' doesn't contain 'hello from kinbench'
The fix: tr '[:upper:]' '[:lower:]' (POSIX, works on bash 3.2). After the fix, Notes pass rate moved from 5% to 67%, and the apparent agent weakness vanished.
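The before/after of that fix in eval-script form (the predicate is the one shown above; only the lowercasing changes):

```bash
# bash >= 4 only: "bad substitution" on macOS's stock bash 3.2
#   if [[ "${NORM,,}" == *"hello from kinbench"* ]]; then echo PASS; fi

# Portable fix: lowercase via tr, keep the same predicate (bash 3.2 safe)
NORM_LC=$(printf '%s' "$NORM" | tr '[:upper:]' '[:lower:]')
if [[ "$NORM_LC" == *"hello from kinbench"* ]]; then
  echo PASS
fi
```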
The lesson: a single bash-4-only construct in eval scripts can manufacture a 60-percentage-point capability gap. The benchmark must be portable to the same environment its agents are evaluated in, and macOS bash 3.2 is not optional terrain.
6.5 False Positive 2: "Safari TCC denied"
Phase 2 also reported all 11 Safari tasks failing. Each failure logged something like:
✗ 002-safari-open-url T1 41106ms [eval] timeout after 30s
Our hypothesis was a real Safari permission issue. Investigation showed the eval script's osascript "tell application Safari to return URL of front document" was hanging for 30 seconds, then being killed by our 30-second eval timeout. When we checked Safari's TCC state externally:
osascript -e 'tell application "Safari" to running'
→ execution error: Not authorized to send Apple events to Safari. (-1743)
Safari Automation TCC for the macbench runner's parent process had been silently revoked at some point during the run, and our 30-second timeout was masking the real failure mode (a TCC denial, not a hung Safari).
The transient nature was the giveaway: re-running make warmup (which restarts Safari and re-triggers the Automation TCC dialog) restored Safari's responsiveness. We documented this as a one-time setup step in the README and added Safari to the warmup probe list.
6.6 Summary
The clean room conditions for an honest macbench score:
- Use the dedicated macbench.soul.md (not the daily-use pilot soul).
- Run make warmup immediately before the bench.
- Use the per-task PID-snapshot isolation in the runner (default in v0.1).
- Keep eval-always-runs enabled in the runner (default in v0.1).
- Use POSIX-portable shell idioms in eval scripts.
- Pre-grant AppleScript Automation TCC to the benchmark's parent shell (covered by warmup probes).
The same agent that scored 49.3% in the contaminated state scores 67.3% in the clean state. Future cross-agent comparisons require the same clean-state baseline; otherwise the comparison measures contamination, not capability.
6.7 The platform ceiling — separating agent capability from environmental limits (v0.1.1, 2026-05-10)
A subtle question hangs over any agent benchmark on a real OS: when the agent fails a task, how do we know it's the agent's fault and not the environment's? On a synthetic VM benchmark like OSWorld, the answer is mostly "it's the agent" — the VM is reset between tasks, the apps are open-source, and any flakiness can be debugged by reading source. On macbench, every task touches Apple's closed-source apps over Apple's iCloud sync layer, both of which have non-deterministic behavior outside the agent's control.
To make this distinction concrete, v0.1.1 ships tools/reference_verifier.sh: a runner that executes each Notes task with a pre-written canonical AppleScript / shell solution instead of an agent. Every task gets its best-known direct solution; setup.sh runs, the canonical action runs, eval.sh runs. No LLM, no inference latency.
$ tools/reference_verifier.sh # 31 notes tasks
═══════════════════════════════════════════════
PASS: 21 / 31
FAIL: 10 / 31
TIME: ~100 seconds
═══════════════════════════════════════════════
The 21/31 (67.7%) is the platform ceiling for the notes category on this hardware + iCloud setup. Any agent's score is bounded above by this number; the gap from agent peak (kinclaw + Kimi K2.5 = 17/31 = 54.8% on the notes-only run) to the platform ceiling (21/31) is the real agent capability gap, free of platform noise.
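The verifier's per-task loop is just the normal lifecycle with the agent swapped for a canonical script. A sketch of the idea follows; the solutions/<task-id>.sh layout and the tasks/*-notes-*/ glob are assumptions of this sketch, not the shipped tool's exact structure:

```bash
#!/bin/bash
# Reference-verifier loop (sketch): setup, canonical action, eval. No LLM involved.
pass=0; fail=0
for dir in tasks/*-notes-*/; do
  id=$(basename "$dir")
  if ! bash "$dir/setup.sh"; then
    echo "SETUP FAIL $id"; fail=$((fail + 1)); continue
  fi
  bash "solutions/$id.sh"          # best-known direct solution (assumed layout)
  if bash "$dir/eval.sh"; then
    echo "PASS $id"; pass=$((pass + 1))
  else
    echo "FAIL $id"; fail=$((fail + 1))
  fi
  [ -f "$dir/teardown.sh" ] && bash "$dir/teardown.sh"
done
echo "PASS: $pass   FAIL: $fail"
```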
The 10 reference verifier fails decompose into four categories:
- AppleScript dictionary regression on macOS 14+ (3 tasks: 164 / 165 / 166). Apple removed the pinned property from Notes' AppleScript dictionary in macOS 14. Reading or setting pinned of note errors with Can't make pinned of note id "..." into type specifier. There is no workaround other than UI-level pin-via-File-menu, which is brittle to drive headless. We mark these tasks as having a soft-pass eval (note still exists + modification > creation = PASS) and document the limit alongside 166 (lock note).
- iCloud sync timing variance (4 tasks: 036 / 176 / 183 / 188). Notes operations dispatched via AppleScript return "success" before iCloud has replicated the change. eval.sh, which runs immediately after, reads the pre-replication state. Adding 5-second buffers helps but doesn't always close the race; the variance is 1-30 seconds depending on iCloud activity from other devices.
- Mail draft persistence semantics (2 tasks: 167 / 369). AppleScript's make new outgoing message followed by save produces an unsaved compose-state message that AppleScript's own every message of mailbox "Drafts" query does not see — the draft is in Mail's compose queue, not yet flushed to the Drafts mailbox. Closing the compose window with close saving yes works inconsistently; the timing required for the flush is account-dependent.
- Notes UI keystroke flakiness (1 task: 171 image attach, plus run-to-run variance on 169 / 170 / 172 / 173 / 174). The clipboard-paste path for image attachments works ~80% of the time; the failures cluster around iCloud-sync collisions during the paste. The format / heading / checklist UI keystroke paths work ~60-90% per attempt with no clear deterministic pattern.
This decomposition has methodological consequences:
- Future cross-agent comparisons should report PASS / platform ceiling, not just PASS / total. An agent scoring 17/31 on raw counts is at 17/21 = 81% of the platform ceiling — a much fairer characterization of agent capability.
- The reference verifier is a per-Mac calibration tool, not a fixed number. Different Mac hardware, iCloud account states, and macOS versions produce different ceilings. A community-contributed ceiling per environment is the v0.3 vision.
- The tools/reference_verifier.sh runtime is ~100 seconds total, vs. ~30 minutes for a full agent run. Iterating on eval.sh changes against the reference (rather than the agent) gives a tight feedback loop without burning brain inference budget.
Alongside this discovery, v0.1.1 also lands 10 stub implementations (notes category 21→31 fully implemented, total 150→160 of 369) and 5 eval bug fixes (the most consequential: tasks/036-notes-delete/eval.sh was counting matches in Notes' "Recently Deleted" folder, so a successfully-deleted note still appeared as count = 1 and the eval never returned PASS — the same bug existed in 188-notes-bulk-delete-tagged).
7. Limitations
We state these limitations explicitly for v0.1 and target each of them in the roadmap (§8).
Coverage. v0.1 ships 150 of 369 task slots. The remaining 219 are stubs with real prompts but no setup/eval scripts. Categories most affected: Mail (1/40), Pages (1/15), Numbers (1/15), Keynote (0/10), Maps (0/5). These are partly implementation cost, partly genuine difficulty: iWork file formats are binary protobuf-in-zip bundles that resist clean external inspection, Maps state is opaque, and Mail evaluation against a real user account requires careful test-data isolation that v0.1 does not provide.
Single agent. Only kinclaw is wired up as a backend. Cross-agent comparison (Anthropic Computer Use, OpenAI's CUA, open-source Qwen2.5-VL line) is roadmap. The agent-agnostic runner makes this straightforward — each backend needs a 3-line shell wrapper plus a Soul-equivalent prompt template — but the wrapping is not done.
Single brain. Our reference number used Kimi-K2.5(cloud). Cross-brain comparison (Claude Sonnet 4.5, GPT-4o, DeepSeek V4, local Llama variants) requires a brain switcher inside the macbench soul that does not yet exist; v1.16 of kinclaw will add it.
No CI. macOS CI requires a Mac runner with TCC pre-granted. GitHub Actions macOS runners cannot be granted those permissions programmatically. A self-hosted Mac mini in CI is the path; this is v0.3 or v1.0 work.
No parallelism. Unlike OSWorld's AWS deployment that runs tasks across many VM clones in parallel, macbench is bound to one Mac at a time. A 369-task run takes 95 minutes with isolation; longer runs are sequential. This is not fixable without breaching Apple's EULA on Mac VMs.
sudo blocking. Some macOS settings (firewall, login window) require sudo, which the agent attempts and which produces an interactive Password: prompt that the bench cannot answer. Affected tasks hang for the per-task timeout. v0.2 will add sudo: false to the macbench soul and rewrite affected task prompts to use UI-driven paths.
Opaque iWork file formats. Pages/Numbers/Keynote files are zip bundles containing protobuf-encoded binary blobs. Eval scripts cannot trivially inspect "did the agent insert the right text" without reverse-engineering Apple's protobuf schema. v0.1 evaluates these tasks softly (file existence + size) and accepts the lower fidelity.
Brain coupling on the agent side. kinclaw's reference score reflects kinclaw + Kimi-K2.5 as a system. We cannot decompose how much of the 67.3% is the agent architecture (5-claw + soul-skill separation + macOS-native AX) and how much is the brain. v0.2's cross-brain matrix begins to answer this.
8. Roadmap
8.1 Task fill schedule
v0.1 (2026-05, shipped) 150 implemented + 219 stubs = 369 slots
v0.2 (target 2026-06) 200 implemented + 169 stubs
v0.3 (target 2026-08) 300 implemented + 69 stubs
v1.0 (target 2026-12) 369 implemented + 0 stubs ← parity reached
Target rate: 30-50 stubs implemented per month, roughly half by us, roughly half by community contributions (AUTHOR_GUIDE.md walks contributors through the three-file pattern with concrete examples).
8.2 Cross-agent + cross-brain matrix
v0.2 priorities:
- Wrap Anthropic Computer Use as a macbench agent backend.
- Wrap OpenAI CUA (formerly Operator).
- Add a brain-switcher inside macbench.soul.md so kinclaw can be re-scored under Claude Sonnet 4.5, GPT-4o, etc., without rebuilding.
The output is a leaderboard table with rows per (agent × brain × benchmark version) tuple. We will publish this on a dedicated leaderboard page and require both IMPLEMENTED and STRICT scores per row.
8.3 v0.2 stability commitments
These will not change between v0.1 and v0.2:
- Three-file pattern (task.json + setup.sh + eval.sh + optional teardown.sh) is stable.
- Eval scripts exit 0 on pass, non-zero on fail. No meta-evaluators. No LLM-as-judge.
- Agent contract: exec(prompt) → drive macOS for some bounded time → exit. State is observed externally via filesystem / defaults / sqlite / AppleScript.
- Difficulty tiers (T1 / T2 / T3) are stable. Once a task ships at T2, it stays T2.
What can change:
- Specific eval logic of a task, if a bug is found in the evaluator (we bump the major version when this happens, and prior scores are tagged with the version they were run on).
- Per-task default timeout (rare; we'd flag the change in the CHANGELOG).
- Stub status promoted to implemented (this only moves scores, doesn't invalidate them).
8.4 Cross-benchmark validation
In parallel with macbench's growth, we plan adapters in kinclaw/benchmarks/ that score kinclaw on the existing public benchmarks:
- WebArena (live web): high feasibility; kinclaw's web claw is already Playwright-based. ETA v0.2.
- OSWorld (Linux desktop): vision-only fallback mode (kinclaw's AX claw cannot reach into a VM). ETA v0.3, with explicit caveats that this measures a degraded version of kinclaw.
- Online-Mind2Web (live web, real public sites): roadmap; investigation pending.
- Mind2Web (static): explicitly out of scope — architectural mismatch with execution-oriented agents.
These add cross-platform validation that complements macbench's macOS-specific score. Together, they enable claims like "kinclaw scores 67% on macbench AND 30% on WebArena AND 12% on OSWorld vision-only" — which is more useful than any single number alone.
9. Discussion
9.1 Why the macOS surface matters
Most published computer-use research targets Linux because Linux is virtualizable, scalable on cloud, and free. But computer-use deployment targets the OS the user is on, and at the time of writing roughly 15-20% of the productivity-knowledge-worker market runs macOS — that is hundreds of millions of users. The gap between published benchmark surface and deployment surface has produced a measurement vacuum that vendors fill with curated demos.
macbench does not solve this gap completely. It does establish that the gap can be closed by a single Mac mini and a 14-hour debugging session, and that the resulting numbers are at the same order of magnitude as published OSWorld scores. The bar for "we have a public number for macOS agents" is now: clone macbench, run make bench AGENT=..., report.
9.2 The price of honesty
The 49.3% → 67.3% trajectory we documented is uncomfortable to publish. It would have been much easier to run the bench once, get a number, ship the paper. The reason we did not is that the difference between contaminated and clean conditions is the size of an entire OS-level capability claim. If we had reported 49.3% as the headline, every reader would draw a conclusion that does not survive a clean run. If we had reported 67.3% without showing the trajectory, every replicator would attempt to reproduce and fail at the contaminated condition's number.
The methodology contribution we hope to leave is: single-shot benchmark numbers in computer-use evaluation are fragile, and the fragility is environmental, not stochastic. Publishing the trajectory protects future comparisons.
9.3 What this enables
Two near-term applications of macbench are worth naming:
Vendor capability claims. When Anthropic, OpenAI, or others publish "our model does X% better on Mac than the previous generation," macbench gives the community a way to verify. The agent-agnostic runner means any commercial agent that ships a CLI binary can be benchmarked by any third party in 95 minutes.
Agent architecture research. The 5-claw architecture in kinclaw (screen + input + ui + record + web) is one design choice among several. macbench provides a way to compare it against pure-vision agents (which would do significantly worse on tasks where AX semantic queries beat pixel inspection) or pure-AX agents (which would do worse on tasks that genuinely need pixel feedback). The per-category score breakdown makes the architectural tradeoff visible.
9.4 Limitations of the current scoring framework
We identify two limitations of dual scoring (IMPLEMENTED + STRICT) that may need future revision.
First, STRICT punishes the benchmark for being incomplete. A theoretical agent that solves all 369 tasks in v0.1 would still score 41% STRICT (because only 150 are runnable), which understates its capability. We accept this for v0.1 because the alternative (silently omit unimplemented slots from the denominator) hides the design surface entirely. By v1.0, when 369/369 are implemented, STRICT and IMPLEMENTED converge and the issue dissolves.
Second, IMPLEMENTED can be gamed by selectively dropping hard tasks from the implemented set. A benchmark maintainer who wanted to inflate IMPLEMENTED scores could leave the hardest stubs unimplemented and tout the easier-task IMPLEMENTED rate. We protect against this by publishing the per-category and per-difficulty breakdown alongside the headline (§5.3), and by committing in our methodology stability section (§8.3) not to retroactively reclassify task difficulty.
10. Conclusion
macbench v0.1 fills a structural gap in the public computer-use benchmark landscape: there is now a published, runnable benchmark for macOS-native agent capability. The first reference run produced a credible number (kinclaw v1.15.0 + Kimi-K2.5 scores 67.3% IMPLEMENTED), and the methodology choices that produced it — agent-agnostic invocation, dual scoring, per-task PID-snapshot isolation, eval-always-runs — are documented in source and in this paper.
We make three commitments. First, the benchmark and its runner are MIT-licensed and the entire 369-task corpus is published. Second, the methodology stability commitments in §8.3 will be honored across minor versions; agents scored on v0.1 retain their scores under v0.2 for the v0.1 task subset. Third, we welcome external agent backends and external task contributions equally — neither LocalKin's products nor kinclaw specifically have any preferred status in the benchmark.
If the macOS native surface is going to have measurable computer-use capability, it needs a measurable benchmark. macbench is the start. We hope to be one of many maintainers, not the only one.
Acknowledgments
The methodology debt to OSWorld (Xie et al., NeurIPS 2024) is substantial: the three-file pattern, the difficulty taxonomy, and the eval-script-exits-with-status contract are all theirs. macbench is OSWorld's methodology applied to a surface OSWorld did not target.
Implementation infrastructure built on top of kinax-go, sckit-go, input-go, and kinrec — pure-Go bindings to macOS frameworks documented in the Embedded Dylib paper.
The reference agent kinclaw inherits architecture from prior LocalKin work: the Thin Soul / Fat Skill separation, the Genesis Protocol bootstrapping pattern, and the Self-Evolving Swarms improvement loop.
References
- Xie, T., et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. NeurIPS 2024 (Spotlight). https://github.com/xlang-ai/OSWorld
- Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Bisk, Y., Fried, D., Alon, U., & Neubig, G. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. ICLR 2024. https://github.com/web-arena-x/webarena
- Rawles, C., Clinckemaillie, S., Chang, Y., Waltz, J., Lau, G., Fair, M., Li, A., Bishop, W., Li, W., Campbell-Ajala, F., et al. (2024). AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents. arXiv:2405.14573.
- Bonatti, R., Zhao, D., Bonacci, F., Dupont, D., Abdali, S., Li, Y., Wagle, J., Koyejo, S., Mendes, R. F. d. P., Rashidi, A., et al. (2024). Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale. arXiv:2409.08264.
- Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., & Su, Y. (2023). Mind2Web: Towards a Generalist Agent for the Web. NeurIPS 2023. https://github.com/OSU-NLP-Group/Mind2Web
- Anthropic (2024). Computer use (beta) — Introducing Claude 3.5 Sonnet's new ability to use computers like a human. https://www.anthropic.com/news/3-5-models-and-computer-use
- The LocalKin Team. Embedded Dylib: A Distribution Pattern for Pure-Go Bindings to System Frameworks. https://www.localkin.dev/papers/embedded-dylib
- The LocalKin Team. Thin Soul, Fat Skill: A Token-Efficient Architecture for LLM Agents. https://www.localkin.dev/papers/thin-soul-fat-skill
Repository: https://github.com/LocalKinAI/macbench License: MIT
How to cite this paper
Three formats below — pick the one that matches your venue.
@misc{localkin2026macbench,
author = {{The LocalKin Team}},
title = {macbench: A macOS-Native Computer-Use Benchmark for Autonomous Agents},
year = {2026},
month = may,
publisher = {Zenodo},
doi = {10.5281/zenodo.20094244},
url = {https://doi.org/10.5281/zenodo.20094244},
note = {Correspondence: contact@localkin.ai}
}

The LocalKin Team. (2026). macbench: A macOS-Native Computer-Use Benchmark for Autonomous Agents. Zenodo. https://doi.org/10.5281/zenodo.20094244
LocalKin Team, The. 2026. "macbench: A macOS-Native Computer-Use Benchmark for Autonomous Agents." Zenodo, May. https://doi.org/10.5281/zenodo.20094244.