A self-improving multi-agent orchestration framework for Claude Code. 18 specialized agents coordinate through the full development lifecycle, from GitHub issue to merged PR, with TDD, spec-driven development, and 100% test coverage.
cd your-project
npx metaswarm init
Claude Code is good at writing code. It is not good at building and maintaining a production codebase.
Shipping a production codebase needs more than just code. It needs research into what already exists, a plan that fits the codebase, a security review, a design review, tests, a PR, CI monitoring, review comment handling, and someone to close the loop and capture what was learned. That is nine distinct jobs. A single agent session cannot hold all of that context, and it definitely cannot review its own work objectively.
So you end up doing the coordination yourself. You are the orchestrator. You prime the agent with context, tell it what to build, review the output, fix what it missed, create the PR, babysit CI, respond to review comments, and then do it all again for the next feature. The agent is a fast typist, but you are still the project manager.
metaswarm fixes that. It is a full orchestration layer for Claude Code. It breaks the work into phases, assigns each phase to a specialist agent, iterates until the blocking reviews from other agents approve, and coordinates the handoffs all the way through PR creation and shepherding, integrating with external review tools like CodeRabbit and Greptile. You describe what you want built. The system figures out how to build it, reviews its own plan, implements it with TDD, shepherds the PR through CI and review, and writes down what it learned for next time.
Every feature goes through eight phases. Each phase is handled by a specialist agent (or a group of them). The Issue Orchestrator manages the handoffs.
The Design Review Gate is the part that surprised me. Five agents review the plan simultaneously, each from a different perspective. All five have to approve before implementation starts. If they do not agree after three rounds, the system escalates to a human. This catches real problems. Not theoretical ones.
metaswarm maintains a JSONL knowledge base in your repo. Patterns, gotchas, architectural decisions, anti-patterns. After every merged PR, the self-reflect workflow analyzes what happened and writes new entries.
But the interesting part is conversation introspection. The system looks at your Claude Code session and watches for signals worth capturing.
The knowledge base can grow to hundreds or thousands of entries without filling your context window, because agents do not load all of it. bd prime uses selective retrieval, filtered by the files you are touching, the keywords that matter, and the type of work you are doing. You get the five gotchas relevant to the auth middleware you are about to change, not the entire institutional memory of the project.
Researcher, Architect, PM, Designer, Security, CTO, Coder, Code Reviewer, Security Auditor, PR Shepherd, Test Automator, Knowledge Curator, and more. Each has a defined role, process, and output format.
Design review gate, PR shepherd, PR comment handling, brainstorming extension, and issue creation. These are the coordination behaviors that tie agents together.
/project:prime, /project:start-task, /project:review-design, /project:self-reflect, /project:pr-shepherd, and more. These are your entry points.
Standardized review criteria for code, architecture, security, test coverage, and implementation plans. These are what the review agents score against.
Configurable test coverage thresholds via .coverage-thresholds.json that block PR creation and task completion. Agents cannot ship code that drops coverage. Works with any test runner.
Schema and example entries for patterns, gotchas, decisions, anti-patterns, codebase facts, and API behaviors. Seed it with your project's context.
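For a sense of shape, a gotcha entry might look like the following. Every field name here is hypothetical, since the real schema ships with the templates; the files and keywords fields illustrate the kind of metadata that lets selective retrieval (bd prime, described above) match entries to the work at hand.

```json
{"type": "gotcha", "title": "Auth middleware strips custom headers", "files": ["src/middleware/auth.ts"], "keywords": ["auth", "middleware"], "note": "Re-attach X-* headers after token verification; the proxy drops them."}
```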
Swarm Coordinators spawn Issue Orchestrators, which can spawn sub-orchestrators. Complex epics decompose into sub-epics automatically. Swarm of swarms.
Each agent is a markdown file that defines a persona, responsibilities, process, and output format. They are prompts, not code. You can read them, edit them, and add your own.
| Agent | Phase | What It Does |
|---|---|---|
| Swarm Coordinator | Meta | Assigns work to worktrees, manages parallel execution |
| Issue Orchestrator | Meta | Decomposes issues into tasks, manages phase handoffs |
| Researcher | Research | Explores codebase, discovers patterns and dependencies |
| Architect | Planning | Designs implementation plan and service structure |
| Product Manager | Review | Validates use cases, scope, and user benefit |
| Designer | Review | Reviews API/UX design and consistency |
| Security Design | Review | Threat modeling, STRIDE analysis, auth review |
| CTO | Review | TDD readiness, codebase alignment, final approval |
| Coder | Implement | TDD implementation with 100% coverage |
| Code Reviewer | Review | Pattern enforcement, test verification |
| Security Auditor | Review | Vulnerability scanning, OWASP checks |
| PR Shepherd | Delivery | CI monitoring, comment handling, thread resolution |
| Knowledge Curator | Learning | Extracts learnings, updates knowledge base |
| Test Automator | Implement | Test generation and coverage enforcement |
| Metrics | Support | Analytics and weekly reports |
| SRE | Support | Infrastructure and performance |
| Slack Coordinator | Support | Notifications and human communication |
| Customer Service | Support | User support and triage |
The hardest problem in agent-driven development is not getting agents to write code. It is getting them to maintain standards. You can put "run coverage before pushing" in a checklist. Agents will skip it. They will misread thresholds, run the wrong command, or decide the step does not apply. We shipped multiple PRs with coverage regressions before we accepted that procedural enforcement is not enforcement. It is a suggestion.
The fix is deterministic gates: automated checks that block bad code regardless of whether an agent follows instructions. metaswarm supports three enforcement points, all driven by a single config file:
1. A Husky git hook that runs lint, typecheck, format checks, and your coverage command before every git push. If coverage drops, the push is rejected. No agent can bypass it. (A sketch of such a hook appears at the end of this section.)
2. A GitHub Actions workflow that reads the same config and blocks merge on failure. Even if an agent somehow pushes, it cannot merge.
3. The task-completion checklist, which reads the enforcement command from config. It is the weakest gate on its own, but combined with the other two, coverage regressions are caught at every level.
.coverage-thresholds.json defines your thresholds and enforcement command. All three gates read from it. Change your test runner once, all gates update automatically.
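As a rough sketch of what that file might contain (enforcement.command is the only field named here, so treat the threshold keys as assumptions rather than the real schema):

```json
{
  "thresholds": {
    "lines": 100,
    "branches": 100,
    "functions": 100,
    "statements": 100
  },
  "enforcement": {
    "command": "pnpm test:coverage"
  }
}
```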
Setting it up is one command:
npx metaswarm init --with-husky --with-ci
This initializes Husky, installs the pre-push hook, creates the CI workflow, and copies the coverage thresholds config to your project root. Each flag is opt-in: use --with-coverage alone for just the config file, --with-husky for the git hook, or --with-ci for the GitHub Actions workflow. Use all three for the full enforcement stack.
The thresholds work with any test runner. Set enforcement.command to pnpm test:coverage, pytest --cov, cargo tarpaulin, or whatever your project uses. See coverage-enforcement.md for the full setup guide.
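To make the first gate concrete, here is a minimal sketch of a pre-push hook driven by that config. This is an illustration, not the generated hook: metaswarm init --with-husky produces the real one, and the lint/typecheck commands and the jq parsing here are placeholders.

```bash
#!/bin/sh
# .husky/pre-push (illustrative sketch; the real hook is generated
# by `metaswarm init --with-husky`). Any failure rejects the push.
set -e

# Placeholder checks; substitute your project's lint/typecheck scripts.
pnpm lint
pnpm typecheck

# Read the coverage command from the shared config so every gate agrees.
# Assumes jq is available and the config exposes enforcement.command.
COVERAGE_CMD=$(jq -r '.enforcement.command' .coverage-thresholds.json)
sh -c "$COVERAGE_CMD"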
Prerequisites: the BEADS CLI (bd) v0.40+ and the GitHub CLI (gh).

cd your-project
npx metaswarm init
# Or with coverage enforcement gates:
npx metaswarm init --with-husky --with-ci
That scaffolds all 18 agents, skills, commands, rubrics, knowledge templates, and scripts into your project. The flags optionally set up pre-push hooks and CI coverage enforcement. Existing files are never overwritten.
Give this prompt to Claude Code in your project. It will adapt metaswarm to your language, framework, and conventions:
Clone https://github.com/dsifry/metaswarm into /tmp/metaswarm-install
and set up the multi-agent orchestration framework in this project:
1. Copy agents, skills, commands, rubrics, and knowledge templates
into the right .claude/ directories (see metaswarm's INSTALL.md
for the exact paths).
2. Create the plugin.json registration file.
3. Initialize BEADS with bd init and set up the knowledge directory.
4. Read this project's config files (package.json, Cargo.toml,
pyproject.toml, go.mod, or whatever exists) to understand our
language, framework, test runner, and linter.
5. Customize the agent definitions and rubrics for our specific
stack. Replace generic test/lint/build commands with ours.
Add our framework's patterns to the architecture rubric.
6. Seed the knowledge base with 3-5 initial patterns, 2-3
architectural decisions, and 1-2 gotchas from this codebase.
7. Clean up the temp clone when done.
Do not change the orchestration workflow itself. Only adapt the
language-specific and project-specific details.
# Install BEADS
curl -sSL https://raw.githubusercontent.com/steveyegge/beads/main/scripts/install.sh | bash
# Clone and copy
git clone https://github.com/dsifry/metaswarm.git /tmp/metaswarm-install
mkdir -p .claude/plugins/metaswarm/skills/beads/agents
cp /tmp/metaswarm-install/agents/* .claude/plugins/metaswarm/skills/beads/agents/
cp /tmp/metaswarm-install/ORCHESTRATION.md .claude/plugins/metaswarm/skills/beads/SKILL.md
cp -r /tmp/metaswarm-install/skills/* .claude/plugins/metaswarm/skills/
cp -r /tmp/metaswarm-install/commands/* .claude/commands/
cp -r /tmp/metaswarm-install/rubrics/* .claude/rubrics/
# Initialize BEADS and knowledge base
bd init
mkdir -p .beads/knowledge
cp /tmp/metaswarm-install/knowledge/* .beads/knowledge/
# Clean up
rm -rf /tmp/metaswarm-install
See INSTALL.md for the full guide, including customization checklists for TypeScript, Python, Rust, and Go projects.
Under the hood, this is all prompts and BEADS task tracking. No custom runtime. No server. No dependencies beyond Claude Code and the bd CLI.
Each agent in agents/ is a prompt that defines a role, responsibilities, and process. When the orchestrator needs a researcher, it spawns a subagent with that prompt. The agent does its work, returns results, and the orchestrator moves to the next phase. You can read every agent definition. You can edit them. You can add new ones.
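For a sense of the shape, a definition might be organized like this; the section names are inferred from the description above, not copied from the repo.

```markdown
<!-- Illustrative structure only; the real definitions live in agents/ -->
# Researcher

## Persona
A codebase archaeologist. Reads before writing; cites file paths as evidence.

## Responsibilities
- Map the modules, patterns, and dependencies relevant to the current issue.
- Surface prior art and known gotchas before planning begins.

## Process
1. Search the codebase for code related to the issue.
2. Check the knowledge base for relevant patterns and gotchas.
3. Summarize findings with concrete file references.

## Output Format
A research brief: relevant files, existing patterns, risks, open questions.
```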
Every feature starts as a BEADS epic with subtasks. Dependencies between tasks enforce ordering. The orchestrator checks bd ready to find unblocked work, updates task status as agents complete phases, and closes the epic when the PR merges. All of this is stored in a SQLite database inside your repo, synced through git.
When an agent starts work, bd prime loads knowledge filtered by the files being touched and the type of work being done. A coder working on auth routes gets auth-related gotchas. A security reviewer gets the OWASP-related patterns. The knowledge base can grow to thousands of entries without any agent needing to read all of them.
The design review gate gives agents three iterations to converge. If they cannot agree, or if requirements are ambiguous, the system stops and asks a human. Tasks get marked as blocked in BEADS with a waiting:human label, and if you have Slack configured, you get a DM. The system does not guess when it should ask.
metaswarm does not replace your automated code review tools. It works with them. The PR Shepherd agent monitors incoming review comments from whatever tools you have configured and handles them systematically.
Out of the box, the PR comment handling skill knows how to parse and respond to automated reviewers like CodeRabbit and Greptile.
The handling workflow categorizes each comment by priority, determines if it is actionable or out-of-scope, addresses the actionable ones, and resolves the threads. Comments from automated reviewers also feed into the self-reflect loop. When CodeRabbit catches something three times, that becomes a knowledge base entry so agents stop making that mistake.
The last piece of the PR lifecycle is knowing when a PR is actually ready to merge. That is what GTG (Good-To-Go) does. It is a single CLI and GitHub Action that consolidates everything into one deterministic check:
# PR Shepherd polls this until it returns READY
gtg 42 --format json --exclude-checks "Merge Ready (gtg)"
The PR Shepherd agent uses GTG as its primary readiness signal. When GTG reports CI_FAILING, the shepherd investigates and fixes. When it reports ACTION_REQUIRED, it addresses review comments. When it reports UNRESOLVED_THREADS, it resolves them. When it returns READY, it notifies a human for final merge approval.
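A rough sketch of that polling loop in shell, assuming gtg's JSON output exposes a status field (the field name and the jq parsing are assumptions about the output shape; the gtg invocation itself is the one shown above):

```bash
# Poll GTG until the PR is ready to merge. The "status" field is an
# assumption about gtg's JSON schema; adjust to the real output.
while true; do
  STATUS=$(gtg 42 --format json --exclude-checks "Merge Ready (gtg)" | jq -r '.status')
  [ "$STATUS" = "READY" ] && break
  echo "Not ready yet: $STATUS"  # e.g. CI_FAILING, ACTION_REQUIRED, UNRESOLVED_THREADS
  sleep 60
done
```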
You set this up as a GitHub Action in your repo. The templates/ directory includes the workflow file. Combined with your repo's branch protection rules, this gives you a fully automated quality gate that agents cannot bypass.
BEADS by Steve Yegge. Git-native, AI-first issue tracking. The coordination backbone for all task management, dependency tracking, and knowledge priming. BEADS made it possible to treat issue tracking as part of the codebase instead of an external service.
Superpowers by Jesse Vincent and contributors. The agentic skills framework that provides foundational workflows for brainstorming, test-driven development, systematic debugging, and plan writing. Superpowers proved that disciplined agent workflows are not overhead. They are what make autonomous development reliable.
The repo has everything: agent definitions, skills, commands, rubrics, knowledge templates, and full documentation.
View on GitHub