title: "Capability Debt: A System That Discovers and Installs Its Own Upgrades"
slug: 022-building-an-ai-that-improves-itself
date: 2026-02-17
category: systems
tags: [ai-development, evolution, open-source, claude-code, capability-discovery, self-improvement]
meansEndsRatio: 0.7
projects: [claude-evolution]
conversationExcerpts: false
description: "I built a system that discovers its own upgrades, scores them, and installs the ones that pass. Then I open-sourced it. The uncomfortable part is explaining why."
draft: false
---
# Building an AI That Improves Itself

Capability debt is the gap between the tools your AI development environment could use and the tools it actually has. Every MCP server you haven't evaluated, every workflow pattern you haven't adopted, every prompt technique sitting in a GitHub repo you haven't read: that's capability debt. It accumulates silently because you don't know what you're missing, and you can't search for what you don't know exists. And it compounds: each missed capability makes the next one harder to find, because you're searching with yesterday's tools for tomorrow's improvements.

I built a system that pays it down automatically. It runs on cron. It searches RSS feeds, GitHub, and forums for new Claude Code capabilities. It scores what it finds against a rubric with five criteria. It integrates the items that pass and archives the ones that don't. Over the past two weeks it has run 16 daily discovery cycles, evaluated 155 items, rejected roughly a third of them outright, and integrated about 30. It costs approximately $3 per day in API calls. Today it's open source.

The system works. It also reflects my biases so completely that calling it general-purpose would be dishonest. Both things are true.

## The Architecture

The pipeline has four phases: discovery, evaluation, integration, and verification. Discovery agents search multiple sources, including Anthropic's engineering blog, Simon Willison's newsletter, GitHub repositories tagged with `claude-code` or `mcp-server`, Reddit, Hacker News, and AI newsletters. Each source is assigned to a tier by signal density: daily sources get checked every run, weekly sources on a slower cadence.
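A minimal sketch of the tier cadence, assuming a plain name-to-tier mapping and a fixed weekday for the weekly tier. The source names, their tier assignments, and the choice of Monday are all hypothetical; the real pipeline describes this in markdown rather than Python.

```python
from datetime import date

# Hypothetical tier assignments; the real source list and tiers may differ.
SOURCES = {
    "anthropic-engineering-blog": "daily",
    "github-topic-claude-code": "daily",
    "ai-newsletters": "weekly",
}

# Assumption: weekly sources are checked on Mondays (weekday() == 0).
WEEKLY_WEEKDAY = 0


def sources_due(run_date: date) -> list[str]:
    """Return the sources that should be checked on the given run date."""
    due = []
    for name, tier in SOURCES.items():
        if tier == "daily":
            due.append(name)
        elif tier == "weekly" and run_date.weekday() == WEEKLY_WEEKDAY:
            due.append(name)
    return due
```

Calling `sources_due(date.today())` at the start of a run gives the list of sources for that cycle; daily sources always appear, weekly ones only on the chosen weekday.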

Evaluation is where the system earns its keep. Each discovery is scored on five weighted criteria:

| Criterion | Weight | What It Measures |
|-----------|--------|------------------|
| Integration complexity | 20% | How hard is it to install? |
| Token efficiency impact | 25% | Does it save or cost context tokens? |
| Capability expansion | 25% | Does it do something genuinely new? |
| Maintenance burden | 15% | Will it break next month? |
| Community validation | 15% | Has anyone else used this successfully? |

Items scoring 70 or above get approved for integration. Items scoring below 50 are rejected. The middle band (50 up to, but not including, 70) gets flagged for research, which means a human looks at it eventually. In practice the distribution is bimodal: most items are either clearly useful or clearly not, and the middle band is thin.
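The rubric and thresholds above can be sketched as a weighted sum followed by a three-way decision. The weights and cutoffs come from the table and the paragraph; the rest is an assumption: that each criterion is scored 0 to 100 with higher meaning better (so an easy install scores high on integration complexity), that the formula is a plain weighted sum, and the function and key names, which are illustrative.

```python
# Weights copied from the rubric table; key names are illustrative.
WEIGHTS = {
    "integration_complexity": 0.20,
    "token_efficiency": 0.25,
    "capability_expansion": 0.25,
    "maintenance_burden": 0.15,
    "community_validation": 0.15,
}

APPROVE_AT = 70.0    # scores at or above this are approved
REJECT_BELOW = 50.0  # scores below this are rejected


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-criterion scores (assumed 0-100, higher is better)."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def decide(total: float) -> str:
    """Map a total score to approve, research, or reject."""
    if total >= APPROVE_AT:
        return "approve"
    if total < REJECT_BELOW:
        return "reject"
    return "research"
```

Under these assumptions a candidate that scores well on capability expansion but poorly on everything else still lands in the reject band, because no single criterion carries more than a quarter of the weight.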

Integration is the part that makes the system recursive. Approved items don't go into a queue for human review. They get written directly into the development environment: new skill files, new agent definitions, new MCP configurations, updates to CLAUDE.md instructions. The "code" this system writes to improve itself is not Python or TypeScript. It's markdown. Configuration as natural language, executed by an AI that reads its own instruction files on startup. The control plane is prose.

Verification tests that the new capability actually works, adds it to a capability registry for redundancy checking, and produces an integration report. The registry matters because the ecosystem produces enough git wrappers that without deduplication, the system would spend half its evaluation budget on tools it already has equivalents for.
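One way to picture the redundancy check is as a similarity test between a candidate's capability tags and the tags already in the registry. This is a hypothetical sketch, not the repository's actual mechanism: the tag-set representation, the Jaccard measure, and the 0.6 threshold are all assumptions made for illustration.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two tag sets: |intersection| / |union|."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_equivalents(
    candidate_tags: set[str],
    registry: dict[str, set[str]],
    threshold: float = 0.6,  # assumed cutoff, not taken from the post
) -> list[str]:
    """Return names of registered capabilities that overlap the candidate."""
    return sorted(
        name
        for name, tags in registry.items()
        if jaccard(candidate_tags, tags) >= threshold
    )
```

With a registry entry tagged `{"git", "commit", "diff"}`, a new git wrapper tagged `{"git", "commit", "diff", "log"}` overlaps at 0.75 and would be flagged before any evaluation budget is spent on it.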

## How It Got Here

The development history is instructive because it mirrors the pipeline's own structure: discovery of what's possible, evaluation of what's worth doing, integration of what works.

January 15: first manual discovery run. Six items found, four novel, one improvement over an existing capability, one duplicate. The improvement was Anthropic's official Tool Search Tool, which solved the exact problem I'd been solving manually with deferred MCP loading. The system's first discovery made part of the system obsolete. That felt like the right signal.

January 16: the system learned to say no. Every candidate from the second run was rejected or held. The most instructive rejection was claude-code-mcp, an MCP wrapper around the Claude Code CLI designed for external AI clients. Using it from within Claude Code creates a recursive loop (Claude calling MCP calling Claude calling CLI) with zero capability expansion and negative token efficiency. It scored 23.5 out of 100. The evaluation correctly identified that the tool was excellent for its intended purpose (letting Cursor or Windsurf delegate to Claude) and completely pointless for ours. Scoring 23.5 is not a failure of the tool. It's the system understanding context of use, which is harder than understanding functionality.

Late January through early February: batch evaluations. Fifteen or more items per cycle, most rejected. The rejection pile grew faster than the integration pile. This is correct behavior.

Late January: DSPy prompt optimization. I ran Stanford's optimization framework against 13 of the system's own prompt targets, meaning the instructions that tell subagents how to discover, evaluate, and integrate. Eleven improved measurably. The two that didn't were the ones with quality criteria spanning multiple dimensions: PR quality assessment and refactoring advice, where "better" means five different things simultaneously and an optimizer looking for a single fitness peak can't navigate the landscape. I wrote about this in detail in [Automating Prompt Engineering](/posts/003-automating-prompt-engineering). The relevant point here is that the system optimized its own instructions, and the optimization's failures were more informative than its successes.

February 2: daily heartbeat goes live on cron. The system becomes autonomous in the roomba sense, meaning it runs without being told to, within boundaries someone else drew. I wrote about what "autonomous" actually means in this context in [Supervised Autonomy](/posts/008-the-autonomous-development-stack).

February 5-6: largest integration batch. Multiple items evaluated and integrated in a single cycle. The bloat metrics snapshot from that day shows 50 agents, 31 skills, 150 completed evaluations. A week later: 57 agents, 34 skills, 155 evaluations. The system grows. Whether it grows in a direction that's useful depends entirely on the scoring rubric, which depends entirely on me.

February 16: Bayesian surprise scoring and multi-armed bandit (MAB) experiments go live, borrowed from the Allen Institute for AI's AutoDS platform. These are shadow experiments that record what they would have recommended without changing any actual decisions. The Bayesian experiment addresses the question of what the system should care about finding, since rare categories carry more information than common ones: a database tool at 6.4 bits of surprise versus another git wrapper at 1.5. The MAB experiment addresses where to look, using Thompson Sampling to allocate search effort to sources based on their historical approval rates. I described the experimental designs in detail in [The Algorithms of Self-Improvement](/posts/021-the-algorithms-of-self-improvement). The experiments will produce their own blog posts when the data matures.
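The two ideas can be sketched in a few lines. This is an illustration under stated assumptions, not the experiments' actual code: surprise is taken here as plain Shannon surprisal, `-log2(p)`, which may not be the formulation behind the 6.4 and 1.5 bit figures, and the bandit uses a Beta posterior with a uniform prior over each source's approval rate. The function names and the shape of the `stats` mapping are hypothetical.

```python
import math
import random


def surprisal_bits(p: float) -> float:
    """Shannon surprisal of an event with probability p, in bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


def thompson_pick(stats: dict[str, tuple[int, int]], rng: random.Random) -> str:
    """Pick a source by Thompson Sampling.

    stats maps source name -> (approved, rejected) counts. Each source's
    approval rate gets a Beta(1 + approved, 1 + rejected) posterior; one
    sample is drawn per source and the highest draw wins.
    """
    draws = {
        source: rng.betavariate(1 + approved, 1 + rejected)
        for source, (approved, rejected) in stats.items()
    }
    return max(draws, key=draws.get)
```

Rarer categories score more bits, and sources with better approval histories get picked more often while weaker ones still get an occasional look. That occasional look is the point of sampling instead of always choosing the current best source.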

## What It Is Not

The system is not a framework. It is not a library you install. It is not a product. There is no `npm install claude-evolution`. There is no API.

It is a pattern: a way of organizing markdown files so that an AI agent can improve its own development environment on a schedule. The repository is a reference implementation containing my pipeline scripts, my evaluation rubric, my scoring weights, and my helper playbooks. Your version would have different sources, different weights, different approval thresholds, different integration targets. The database category in my Bayesian priors has a 6% probability because my work rarely touches databases. Yours might be 40%. The system doesn't care. The architecture is the same.

The argument against frameworks is this: most developer tools assume you want less configuration. Abstract away the details, provide sensible defaults, make it "just work." This system assumes the opposite. You want configuration that is *legible*, meaning files a human can read in five minutes and an AI can execute in five seconds. The complexity is in the orchestration (which phases run, in what order, with what model), not in the code (there is almost no code; it's markdown and shell scripts). The entire evaluation framework is a single markdown file with a table and a formula. The entire integration logic is "read the approved item, determine its type, write the appropriate file."
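For readers who think more easily in code than in prose instructions, the quoted integration logic has roughly this shape. In the actual system the step is written in markdown and carried out by the agent; the version below is a hypothetical translation, and the item fields, type names, and target paths are assumptions, not the repository's layout.

```python
from pathlib import Path

# Hypothetical target layout; the repository's real paths may differ.
TARGETS = {
    "skill": "skills/{name}.md",
    "agent": "agents/{name}.md",
    "mcp": "mcp/{name}.json",
}


def integrate(item: dict[str, str], root: Path) -> Path:
    """Read an approved item, determine its type, write the appropriate file.

    item is assumed to carry "type", "name", and "body" keys.
    """
    kind = item["type"]
    if kind == "instruction":
        # Instruction updates are appended to CLAUDE.md, not written fresh.
        target = root / "CLAUDE.md"
        with target.open("a", encoding="utf-8") as fh:
            fh.write("\n" + item["body"].rstrip() + "\n")
        return target
    if kind not in TARGETS:
        raise ValueError(f"unknown item type: {kind!r}")
    target = root / TARGETS[kind].format(name=item["name"])
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(item["body"], encoding="utf-8")
    return target
```

The dispatch is small enough to read in one sitting, which is the legibility argument in miniature: the interesting decisions live in the rubric and the orchestration, not in this step.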

This is possible because the execution environment is Claude Code, which reads natural language instructions and acts on them. The "code" is English sentences describing what to do. The "compiler" is an LLM. The "runtime" is a CLI tool that has file access, shell access, and web access. If this sounds fragile, it is. If it sounds like it couldn't possibly work reliably, it has run 16 daily cycles without human intervention and produced 155 evaluations, the vast majority of which I agree with in retrospect.

## The Uncomfortable Parts

Three honest problems with this system.

**The scoring rubric is my taste, formalized.** Token efficiency gets 25% weight because I care about context windows. Community validation gets 15% because I think popularity is a weak signal for quality. Someone who values ecosystem compatibility over token efficiency would build a different system and it would find different tools. The Bayesian experiment I described in [The Algorithms of Self-Improvement](/posts/021-the-algorithms-of-self-improvement) is an attempt to let the data adjust category weights, but even a prior that is adjusted by data encodes assumptions: I chose the categories, I defined the boundaries, I built the priors from my own historical evaluations. The data can correct my weights within the structure I gave it. It cannot correct the structure.

**The evaluation is circular.** Claude evaluates tools for Claude. Same architecture, similar training data, probably similar blind spots. Cross-validation with GPT-5 via the Codex MCP helps; the claude-code-mcp evaluation, for instance, was scored by both Claude (23.5/100) and Codex (18/100), and they agreed on rejection for the same reasons. But two AI models agreeing doesn't equal truth. It might equal shared training data. I explored this circularity problem in depth in [AI Evaluating AI](/posts/009-ai-evaluating-ai). The short version: the circularity is manageable in the engineering sense (the system produces useful results) and unsolvable in the epistemological sense (you cannot verify, from within the system, that the evaluation criteria are sound). That distinction is not pedantic. It determines whether you treat a bad evaluation as a bug or as a fundamental limitation.

**The rejection rate is the point.** Of 155 evaluations, roughly 46 were rejected outright. The system's primary function is saying no. Git wrappers, filesystem MCPs, recursive tools that run Claude for Claude, MCP servers with zero actual functionality, tools that solve problems the ecosystem solved six months ago: the ratio of noise to signal in the Claude Code ecosystem is high. A system that approved everything would be worse than no system at all, because it would grow the development environment without growing its capabilities. The bloat metrics tell this story: 57 agents, 34 skills, 31 helpers, and those numbers need to be monitored because growth without utility is just weight. The weekly bloat audit exists for this reason.

## Why Open Source

The blog's motto is "sufficient knowledge compels action." The system works for one person. If the pattern is sound, withholding it is a choice that needs justification. The justification would be "it's not ready," but a system that runs autonomously for two weeks, produces daily reports, monitors its own bloat, and scores discoveries against a documented rubric is ready enough to share. The alternative is waiting until it's perfect, which means waiting forever, which means never sharing it.

What you get: the pipeline scripts, the evaluation framework, the scoring rubric, the helper playbooks, the experiment designs, the capability registry structure. What you don't get: my API keys, my Discord webhooks, my specific skill files and agent definitions. Those are mine, because they encode my preferences, my workflow, my projects. The pattern is yours.

The system's value is not in the configuration. It's in the idea that your AI development environment doesn't have to be a snapshot of the day you set it up. It can search for its own improvements, evaluate them against criteria you define, and install the ones that pass. The search can get smarter over time. The criteria can be adjusted by data. The whole thing runs on cron and costs less than a coffee.

Repository: [github.com/AshitaOrbis/claude-evolution](https://github.com/AshitaOrbis/claude-evolution)

## What Happens Next

The system will keep running. Tomorrow's heartbeat will fire at the scheduled time. It will search the same sources, score what it finds, integrate what passes. Some of those integrations will change how the next search runs. The Bayesian and MAB experiments will accumulate data and, when their thresholds are met, will either modify the pipeline or confirm that the current approach was adequate.

This is what self-improvement looks like in practice: not a singularity, not an intelligence explosion. A cron job that reads RSS feeds and occasionally writes a markdown file. The gap between the science fiction and the implementation is vast and instructive. The science fiction imagines systems that transcend their creators. The implementation is a system that replicates its creator's preferences more efficiently, within boundaries the creator drew, and calls that progress. The data will tell us whether it is. But the data comes from inside the system. It always does.