metaswarm

A self-improving multi-agent orchestration framework for Claude Code, Gemini CLI, and Codex CLI. 18 specialized agents coordinate through the full development lifecycle, from GitHub issue to merged PR, with TDD, cross-model adversarial review, and spec-driven development.

Just tell Claude what you want:

$ claude
> Read through https://github.com/dsifry/metaswarm and install it for my project.

Claude reads the documentation, understands your project structure, installs the plugin, and configures everything for your stack. Supports TypeScript, Python, Go, Rust, Java, Ruby, and JavaScript.

Have Gemini CLI or Codex CLI installed? metaswarm can delegate implementation and review tasks to them automatically — the writer is always reviewed by a different model. Setup detects them and offers to enable cross-model orchestration.

Or install directly:

claude plugin marketplace add dsifry/metaswarm-marketplace
claude plugin install metaswarm
# then in Claude Code: /metaswarm:setup

One Prompt. Full App.

Install metaswarm and give Claude Code one prompt. No issue creation required. The system handles the rest.

Set up

mkdir my-app && cd my-app && git init && npm init -y
claude
# > Read through https://github.com/dsifry/metaswarm and install it for my project.

Start building

Run /metaswarm:start-task and describe what you want in plain English. Include a tech stack, Definition of Done items, and where you want human checkpoints. The more specific the spec, the better the agents perform.

Example — adapt for your project

/metaswarm:start-task I want you to build a real-time todo list with AI chat.

Tech stack: Node.js + Hono, React + Vite, SQLite, SSE, Claude SDK.

Definition of Done:
1. CRUD operations for todo items via REST API
2. Persistent storage in SQLite
3. Real-time sync across browser tabs via SSE
4. AI chat that can read and modify todos
5. 100% test coverage on backend
6. CI pipeline for tests and lint

Use the full metaswarm orchestration workflow:
research, plan, design review gate, decompose into work units,
and execute each through the 4-phase loop. Set human checkpoints
after the database schema and after the AI integration.
When all work units pass, create a PR.

The orchestrator takes over. It researches your project, plans the implementation, runs a pre-flight validation checklist, then six agents review the plan in parallel. It identifies external dependencies and prompts you for API keys. It breaks the plan into work units, and executes each through the 4-phase loop: implement with TDD, validate independently (with blocking coverage enforcement), adversarial review against the spec, and commit only after PASS. Quality gates are blocking state transitions — there is no path from FAIL to COMMIT. It pauses for your review at the checkpoints you specified. When everything passes, it creates and shepherds the PR.

You described what you wanted. The system figured out how to build it.

The Problem

Claude Code is good at writing code. It is not good at building and maintaining a production codebase.

Shipping a production codebase needs more than just code. It needs research into what already exists, a plan that fits the codebase, a security review, a design review, tests, a PR, CI monitoring, review comment handling, and someone to close the loop and capture what was learned. That is nine distinct jobs. A single agent session cannot hold all of that context, and it definitely cannot review its own work objectively.

So you end up doing the coordination yourself. You are the orchestrator. You prime the agent with context, tell it what to build, review the output, fix what it missed, create the PR, babysit CI, respond to review comments, and then do it all again for the next feature. The agent is a fast typist, but you are still the project manager.

metaswarm fixes that. It is a full orchestration layer for Claude Code that breaks the work into phases, assigns each phase to a specialist agent, runs blocking reviews by other agents until they approve, and coordinates every handoff, all the way through PR creation and shepherding, integrating with external reviewers like CodeRabbit and Greptile. You describe what you want built. The system figures out how to build it, reviews its own plan, implements it with TDD, shepherds the PR through CI and review, and writes down what it learned for next time.

The Pipeline

Every feature goes through eleven phases. Each phase is handled by a specialist agent (or a group of them). The Issue Orchestrator manages the handoffs.

1. Research: Researcher agent explores the codebase, finds patterns and dependencies
2. Plan: Architect agent creates an implementation plan with tasks
3. Plan Validation: pre-flight checklist covering architecture, deps, API contracts, security, UI/UX, external deps
4. Design Review Gate: PM, Architect, Designer, Security, UX Reviewer, and CTO review the plan in parallel (6 agents)
5. Decompose: break the plan into work units with DoD items, file scopes, and a dependency graph
6. External Dependency Check: identifies required API keys/credentials, prompts you to configure them
7. Orchestrated Execution: per work unit, Implement → Validate → Adversarial Review → Commit (the 4-phase loop); can delegate to Codex/Gemini CLIs
8. Final Review: cross-unit integration check, full test suite, coverage enforcement
9. PR Creation: creates the PR with a structured description and test plan
10. PR Shepherd: monitors CI, handles review comments, resolves threads
11. Close + Learn: extracts learnings back into the knowledge base

The Design Review Gate is the part that surprised me. Six agents review the plan simultaneously, each from a different perspective — including a UX Reviewer that verifies user flows and integration work units. All six have to approve before implementation starts. If they do not agree after three rounds, the system escalates to a human. This catches real problems. Not theoretical ones.
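
The gate's control flow is simple to state. Since metaswarm is implemented as prompts rather than code, the following Python is purely an illustrative sketch of that flow, not anything that ships with the framework; the `review` and `revise` callables are invented for the example.

```python
# Hypothetical sketch of the Design Review Gate: six reviewers evaluate
# the plan in parallel, all must approve, and after three failed rounds
# the gate escalates to a human.
from concurrent.futures import ThreadPoolExecutor

REVIEWERS = ["PM", "Architect", "Designer", "Security", "UX Reviewer", "CTO"]

def design_review_gate(plan, review, revise, max_rounds=3):
    for round_num in range(1, max_rounds + 1):
        with ThreadPoolExecutor(max_workers=len(REVIEWERS)) as pool:
            verdicts = dict(zip(REVIEWERS, pool.map(lambda r: review(r, plan), REVIEWERS)))
        objections = [r for r, ok in verdicts.items() if not ok]
        if not objections:
            return ("APPROVED", plan, round_num)
        plan = revise(plan, objections)  # feed objections back into the plan
    return ("ESCALATE_TO_HUMAN", plan, max_rounds)

# Example: Security objects until the plan mentions threat modeling.
review = lambda r, p: r != "Security" or "threat model" in p
revise = lambda p, objs: p + " + threat model"
print(design_review_gate("initial plan", review, revise))
# -> ('APPROVED', 'initial plan + threat model', 2)
```

The point the sketch makes is structural: approval is unanimous and bounded, and the only exits are APPROVED or a human.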

It Gets Smarter Over Time

metaswarm maintains a JSONL knowledge base in your repo. Patterns, gotchas, architectural decisions, anti-patterns. After every merged PR, the self-reflect workflow analyzes what happened and writes new entries.

But the interesting part is conversation introspection. The system watches your Claude Code session for signals worth capturing as knowledge.

The knowledge base can grow to hundreds or thousands of entries without filling your context window, because agents do not load all of it. bd prime uses selective retrieval, filtered by the files you are touching, the keywords that matter, and the type of work you are doing. You get the five gotchas relevant to the auth middleware you are about to change, not the entire institutional memory of the project.
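
The actual retrieval lives in the BEADS CLI; this Python sketch only illustrates the idea of filtering a JSONL knowledge base by file scope, keywords, and work type. The entry fields (`files`, `keywords`, `type`, `note`) are assumptions for the example, not the real schema.

```python
# Illustrative sketch of selective knowledge retrieval (not bd's real logic).
# Entries are JSONL records; we score each one against the current work
# and return only the most relevant notes.
import json

def prime(jsonl_text, touched_files, keywords, work_type, limit=5):
    relevant = []
    for line in jsonl_text.splitlines():
        entry = json.loads(line)
        file_hit = any(f in entry.get("files", []) for f in touched_files)
        word_hit = any(k in entry.get("keywords", []) for k in keywords)
        type_hit = entry.get("type") == work_type
        score = file_hit * 2 + word_hit + type_hit  # crude relevance score
        if score:
            relevant.append((score, entry["note"]))
    relevant.sort(key=lambda t: -t[0])
    return [note for _, note in relevant[:limit]]

kb = "\n".join([
    json.dumps({"type": "gotcha", "files": ["src/auth/middleware.ts"],
                "keywords": ["auth"], "note": "JWT clock skew breaks tests"}),
    json.dumps({"type": "pattern", "files": ["src/ui/App.tsx"],
                "keywords": ["react"], "note": "Use SSE hook for live sync"}),
])
print(prime(kb, ["src/auth/middleware.ts"], ["auth"], "gotcha"))
# -> ['JWT clock skew breaks tests']
```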

Trust Nothing. Verify Everything.

The hardest problem after getting agents to follow checklists is getting them to honestly report results. A coding agent that says "all tests pass" might have skipped the tests entirely, run the wrong suite, or misread the output. We learned this the hard way: agents self-certify success even when things are broken.

Orchestrated Execution fixes that. For complex tasks with a written spec, the orchestrator breaks the work into work units, each with enumerated Definition of Done items, and runs every unit through a 4-phase loop:

1. Implement

A coding agent builds against the spec using TDD. When it reports "done", the orchestrator does not believe it.

2. Validate

The orchestrator runs tsc, eslint, vitest, and coverage enforcement from .coverage-thresholds.json itself, independently. Quality gates are blocking state transitions, not advisory suggestions. It never asks the coding agent whether the tests passed.

3. Adversarial Review

A fresh review agent checks each DoD item with file:line evidence. Binary PASS/FAIL, not subjective quality vibes. If it fails, a new reviewer is spawned for re-review. No anchoring bias.

4. Commit

Only after adversarial PASS. The commit message includes the verified DoD items. If there is a human checkpoint, the system pauses and waits for you before continuing.

On failure: fix, re-validate, spawn a fresh reviewer (never the same one), and retry up to three times before escalating to a human with the full failure history. The fresh reviewer rule matters: without it, the reviewer checks "did they fix what I found?" instead of independently verifying the contract. Anchoring bias is real, even for AI agents.
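
The loop above amounts to a tiny state machine. Since metaswarm implements it in prompts rather than code, this Python is an illustrative sketch only, and the callable names are assumptions.

```python
# Hypothetical sketch of the 4-phase execution loop: FAIL can only lead
# to a retry (with a fresh reviewer) or to escalation; there is no path
# from FAIL to COMMIT.
def run_work_unit(implement, validate, fresh_reviewer, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        artifact = implement()              # 1. Implement (TDD)
        if not validate(artifact):          # 2. Validate independently
            continue                        #    FAIL -> retry, never commit
        reviewer = fresh_reviewer()         # 3. Adversarial review: a new
        if reviewer(artifact):              #    reviewer every attempt
            return ("COMMIT", attempt)      # 4. Commit only after PASS
    return ("ESCALATE_TO_HUMAN", max_attempts)

# Example: validation passes, but review fails twice before passing.
verdicts = iter([False, False, True])
result = run_work_unit(
    implement=lambda: "diff",
    validate=lambda a: True,
    fresh_reviewer=lambda: (lambda a: next(verdicts)),
)
print(result)  # -> ('COMMIT', 3)
```

Note that `fresh_reviewer` is a factory: each retry constructs a new reviewer, which is the sketch's version of the anchoring-bias rule.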

This is not needed for every task. A typo fix or a small bug does not need a 4-phase loop. But for multi-unit features with a spec, risky schema changes, or anything where "it works, trust me" is not good enough, orchestrated execution is the difference between shipping and hoping.

Cross-Model Adversarial Review

A coding agent reviewing its own output has an inherent bias. metaswarm can delegate implementation and review tasks to external AI tools — OpenAI Codex CLI and Google Gemini CLI — with one rule: the writer is always reviewed by a different model.

Cross-Model Review

If Claude writes the code, Codex or Gemini reviews it. If Codex writes it, Claude or Gemini reviews. The reviewer never shares the writer's biases.

Availability-Aware Escalation

Model A (2 tries) → Model B (2 tries) → Claude (1 try) → user alert. If Codex is down, Gemini takes over automatically.
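
The ladder above is easy to express directly. This Python sketch is hypothetical (the `attempt` function and outcome strings are invented for illustration, not metaswarm's API):

```python
# Illustrative availability-aware escalation: primary model (2 tries),
# secondary model (2 tries), Claude (1 try), then alert the user.
def escalate(attempt, ladder=(("codex", 2), ("gemini", 2), ("claude", 1))):
    for model, tries in ladder:
        for _ in range(tries):
            outcome = attempt(model)
            if outcome == "ok":
                return ("DONE", model)
            if outcome == "unavailable":
                break  # model is down: fall through to the next one
    return ("ALERT_USER", None)

# Example: Codex is down, Gemini succeeds on its second try.
log = iter(["unavailable", "fail", "ok"])
print(escalate(lambda m: next(log)))  # -> ('DONE', 'gemini')
```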

Shell Adapters

Each external tool has a shell adapter with health checks, implement, and review commands. The shared helper library includes timeout handling, worktree management, cost extraction, and error classification.

Opt-In, Per-Project

Configure via .metaswarm/external-tools.yaml. Each adapter is disabled by default. Enable the ones you have installed and authenticated.
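
As a rough illustration of what a per-project opt-in might look like (the field names here are guesses, not the documented format; copy the shipped template for the real schema):

```yaml
# .metaswarm/external-tools.yaml (hypothetical sketch).
# Each adapter is disabled by default; enable only tools you
# have installed and authenticated.
tools:
  codex:
    enabled: false
  gemini:
    enabled: true
```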

This is not about replacing Claude. It is about eliminating the blind spots that any single model has when reviewing its own work. Cross-model review catches different classes of bugs because the models have different training biases.

Visual Review

Agents are good at reading code. They are bad at knowing if the UI looks right. The visual review skill uses Playwright to capture screenshots of your web UI at multiple viewports, then brings those screenshots into the Claude Code conversation for visual inspection.

Use it after implementing UI changes, before creating a PR, or anytime you need to verify that rendered output matches the spec.

Components

18 Agent Personas

Researcher, Architect, PM, Designer, Security, CTO, Coder, Code Reviewer, Security Auditor, PR Shepherd, Test Automator, Knowledge Curator, and more. Each has a defined role, process, and output format.

13 Orchestration Skills

Orchestrated execution, design review gate, plan review gate, PR shepherd, PR comment handling, external AI tools delegation, visual review, brainstorming extension, issue creation, interactive setup, migration, status diagnostics, and the main start workflow.

15 Slash Commands

/metaswarm:setup, /metaswarm:update, /metaswarm:status, /metaswarm:start-task, /metaswarm:start, /metaswarm:prime, /metaswarm:review-design, /metaswarm:brainstorm, /metaswarm:self-reflect, /metaswarm:pr-shepherd, /metaswarm:create-issue, /metaswarm:handle-pr-comments, /metaswarm:external-tools-health.

8 Quality Rubrics

Standardized review criteria for code, architecture, security, test coverage, implementation plans, adversarial plan review, adversarial spec compliance, and external tool review.

Coverage Enforcement

Configurable test coverage thresholds via .coverage-thresholds.json that block PR creation and task completion. Agents cannot ship code that drops coverage. Works with any test runner.

Knowledge Base Templates

Schema and example entries for patterns, gotchas, decisions, anti-patterns, codebase facts, and API behaviors. Seed it with your project's context.

Recursive Orchestration

Swarm Coordinators spawn Issue Orchestrators, which can spawn sub-orchestrators. Complex epics decompose into sub-epics automatically. Swarm of swarms.

Team Mode

Persistent teammates with context retention across sessions. When multiple Claude Code sessions are active, agents automatically coordinate through direct inter-agent messaging. No configuration needed.

Plan Review Gate

3 adversarial reviewers — Feasibility, Completeness, and Scope & Alignment — validate every implementation plan before it reaches the Design Review Gate. All 3 must approve.

Claude-Guided Setup

Auto-detects your language, framework, test runner, linter, formatter, type checker, package manager, CI system, and git hooks. Supports 7 languages, 15+ frameworks, and all major toolchains. Customizes everything interactively.

Self-Update

Run /metaswarm:update to check for new versions, view the changelog, update all component files, and re-detect your project context. User customizations are preserved.

6 Development Guides

Comprehensive reference guides for coding standards, testing patterns, git workflow, worktree development, build validation, and agent coordination. Agents load them automatically when relevant.

Workflow Enforcement

Mandatory intercepts at every handoff point ensure quality gates are never bypassed. After brainstorming → design review gate. After planning → plan review gate. Before PR → self-reflect + knowledge capture. Users choose execution method (orchestrated vs lightweight).

Context Recovery

Approved plans and execution state persist to .beads/ on disk. After context compaction or session interruption, bd prime --work-type recovery reloads everything: the plan, completed work, current position. No re-running expensive review gates.

The Agents

Each agent is a markdown file that defines a persona, responsibilities, process, and output format. They are prompts, not code. You can read them, edit them, and add your own.

| Agent | Phase | What It Does |
| --- | --- | --- |
| Swarm Coordinator | Meta | Assigns work to worktrees, manages parallel execution |
| Issue Orchestrator | Meta | Decomposes issues into tasks, manages phase handoffs |
| Researcher | Research | Explores codebase, discovers patterns and dependencies |
| Architect | Planning | Designs implementation plan and service structure |
| Product Manager | Review | Validates use cases, scope, and user benefit |
| Designer | Review | Reviews API/UX design and consistency |
| Security Design | Review | Threat modeling, STRIDE analysis, auth review |
| CTO | Review | TDD readiness, codebase alignment, final approval |
| Coder | Implement | TDD implementation with 100% coverage |
| Code Reviewer | Review | Dual-mode: collaborative (suggestions) or adversarial (spec compliance) |
| Security Auditor | Review | Vulnerability scanning, OWASP checks |
| PR Shepherd | Delivery | CI monitoring, comment handling, thread resolution |
| Knowledge Curator | Learning | Extracts learnings, updates knowledge base |
| Test Automator | Implement | Test generation and coverage enforcement |
| Metrics | Support | Analytics and weekly reports |
| SRE | Support | Infrastructure and performance |
| Slack Coordinator | Support | Notifications and human communication |
| Customer Service | Support | User support and triage |

Agents Skip Checklists. Gates Don't.

The hardest problem in agent-driven development is not getting agents to write code. It is getting them to maintain standards. You can put "run coverage before pushing" in a checklist. Agents will skip it. They will misread thresholds, run the wrong command, or decide the step does not apply. We shipped multiple PRs with coverage regressions before we accepted that procedural enforcement is not enforcement. It is a suggestion.

The fix is twofold. First, quality gates in the orchestrated execution loop are defined as blocking state transitions, not advisory recommendations. There is no instruction path from FAIL to COMMIT. FAIL always means retry or escalate. Second, deterministic gates: automated checks that block bad code regardless of whether an agent follows instructions. metaswarm supports three enforcement points, all driven by a single config file:

Pre-Push Hook

A Husky git hook that runs lint, typecheck, format checks, and your coverage command before every git push. If coverage drops, the push is rejected. No agent can bypass it.

CI Coverage Job

A GitHub Actions workflow that reads the same config and blocks merge on failure. Even if an agent somehow pushes, it cannot merge.

Agent Completion Gate

The task-completion checklist reads the enforcement command from config. The weakest gate on its own, but combined with the other two, coverage regressions are caught at every level.

One Config File

.coverage-thresholds.json defines your thresholds and enforcement command. All three gates read from it. Change your test runner once, all gates update automatically.
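
As an illustration of the shape this file might take (the field names are hypothetical; copy the shipped template rather than this sketch):

```json
{
  "command": "pnpm test:coverage",
  "thresholds": {
    "lines": 90,
    "branches": 85,
    "functions": 90,
    "statements": 90
  }
}
```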

The guided setup (/metaswarm:setup) detects your test runner and configures coverage enforcement automatically. For manual setup, copy the template:

cp templates/coverage-thresholds.json .coverage-thresholds.json

The thresholds work with any test runner. The setup skill maps your detected test runner to the correct coverage command automatically — pnpm test:coverage, pytest --cov, cargo tarpaulin, go test -cover, or whatever your project uses. See coverage-enforcement.md for the full setup guide.
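
The mapping the setup skill performs can be pictured as a simple lookup. This Python sketch is illustrative only, using the commands named above; the real detection lives in the setup skill.

```python
# Illustrative mapping from detected test runner to coverage command.
COVERAGE_COMMANDS = {
    "vitest": "pnpm test:coverage",
    "pytest": "pytest --cov",
    "cargo": "cargo tarpaulin",
    "go": "go test -cover",
}

def coverage_command(runner):
    try:
        return COVERAGE_COMMANDS[runner]
    except KeyError:
        raise ValueError(f"no known coverage command for {runner!r}")

print(coverage_command("pytest"))  # -> pytest --cov
```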

Set It Up

Prerequisites

Claude Code

claude plugin marketplace add dsifry/metaswarm-marketplace
claude plugin install metaswarm

Then: /metaswarm:setup

Gemini CLI

gemini extensions install https://github.com/dsifry/metaswarm.git

Then: /metaswarm:setup

Codex CLI

curl -sSL https://raw.githubusercontent.com/dsifry/metaswarm/main/.codex/install.sh | bash

Then: $setup

Cross-Platform (auto-detect)

npx metaswarm init

Detects which CLIs you have installed and sets up metaswarm for each.

Setup detects your project's language, framework, test runner, linter, formatter, type checker, package manager, CI system, and git hooks. It creates the appropriate instruction file (CLAUDE.md, GEMINI.md, or AGENTS.md), configures coverage thresholds, and sets up .gitignore. Supports: TypeScript, Python, Go, Rust, Java, Ruby, JavaScript.

Updating

Claude Code: /metaswarm:update • Gemini CLI: /metaswarm:update • Codex CLI: cd ~/.codex/metaswarm && git pull

Upgrading from an Older Version

If you installed metaswarm via npx metaswarm init (v0.6–v0.8), upgrade to the plugin:

  1. Install the plugin: claude plugin marketplace add dsifry/metaswarm-marketplace && claude plugin install metaswarm
  2. Open Claude Code and run /metaswarm:migrate — this removes redundant npm-installed copies (your project files are never touched)
  3. Run /metaswarm:status to verify everything is clean
  4. Review and commit the cleanup

See INSTALL.md for the full upgrade guide.

How It Actually Works

Under the hood, this is all prompts and BEADS task tracking. No custom runtime. No server. No dependencies beyond Claude Code and the bd CLI.

Agent definitions are markdown files

Each agent in agents/ is a prompt that defines a role, responsibilities, and process. When the orchestrator needs a researcher, it spawns a subagent with that prompt. The agent does its work, returns results, and the orchestrator moves to the next phase. You can read every agent definition. You can edit them. You can add new ones.

BEADS tracks the work

Every feature starts as a BEADS epic with subtasks. Dependencies between tasks enforce ordering. The orchestrator checks bd ready to find unblocked work, updates task status as agents complete phases, and closes the epic when the PR merges. All of this is stored in a SQLite database inside your repo, synced through git.

Knowledge base is selective, not exhaustive

When an agent starts work, bd prime loads knowledge filtered by the files being touched and the type of work being done. A coder working on auth routes gets auth-related gotchas. A security reviewer gets the OWASP-related patterns. The knowledge base can grow to thousands of entries without any agent needing to read all of them.

Human checkpoints are proactive, not reactive

The design review gate gives agents three iterations to converge. If they cannot agree, or if requirements are ambiguous, the system stops and asks a human. But orchestrated execution goes further: you define planned checkpoints in the spec (after schema changes, after security-sensitive code, at natural boundaries). The orchestrator pauses at those points and presents a report. It waits for you. This is not a notification. It is a gate. The system does not guess when it should ask, and it does not continue without your explicit approval.

Works With Your Code Reviewers

metaswarm does not replace your automated code review tools. It works with them. The PR Shepherd agent monitors incoming review comments from whatever tools you have configured and handles them systematically.

Supported review tools

Out of the box, the PR comment handling skill knows how to parse and respond to comments from automated reviewers such as CodeRabbit and Greptile.

The handling workflow categorizes each comment by priority, determines if it is actionable or out-of-scope, addresses the actionable ones, and resolves the threads. Comments from automated reviewers also feed into the self-reflect loop. When CodeRabbit catches something three times, that becomes a knowledge base entry so agents stop making that mistake.

The GTG Merge Gate

The last piece of the PR lifecycle is knowing when a PR is actually ready to merge. That is what GTG (Good-To-Go) does. It is a single CLI and GitHub Action that consolidates everything into one deterministic check:

# PR Shepherd polls this until it returns READY
gtg 42 --format json --exclude-checks "Merge Ready (gtg)"

The PR Shepherd agent uses GTG as its primary readiness signal. When GTG reports CI_FAILING, the shepherd investigates and fixes. When it reports ACTION_REQUIRED, it addresses review comments. When it reports UNRESOLVED_THREADS, it resolves them. When it returns READY, it notifies a human for final merge approval.
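
The shepherd's reaction to each GTG status is essentially a dispatch table. A hypothetical Python sketch, using the status strings from the description above (the handler descriptions are invented):

```python
# Illustrative dispatch on GTG readiness statuses (handlers are hypothetical).
def shepherd_step(status):
    actions = {
        "READY": "notify human for final merge approval",
        "CI_FAILING": "investigate and fix the failing checks",
        "ACTION_REQUIRED": "address review comments",
        "UNRESOLVED_THREADS": "resolve open review threads",
    }
    return actions.get(status, "poll gtg again")

print(shepherd_step("CI_FAILING"))  # -> investigate and fix the failing checks
```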

You set this up as a GitHub Action in your repo. The templates/ directory includes the workflow file. Combined with your repo's branch protection rules, this gives you a fully automated quality gate that agents cannot bypass.

Built On

BEADS by Steve Yegge. Git-native, AI-first issue tracking. The coordination backbone for all task management, dependency tracking, and knowledge priming. BEADS made it possible to treat issue tracking as part of the codebase instead of an external service.

Superpowers by Jesse Vincent and contributors. The agentic skills framework that provides foundational workflows for brainstorming, test-driven development, systematic debugging, and plan writing. Superpowers proved that disciplined agent workflows are not overhead. They are what make autonomous development reliable.

Get Started

The repo has everything: agent definitions, skills, commands, rubrics, knowledge templates, and full documentation.

View on GitHub

Star it. Clone it. File issues. MIT licensed.