title: "Automating Prompt Engineering"
slug: 003-automating-prompt-engineering
date: 2026-02-11
tags: [dspy, prompt-engineering, automation, ai-tools, optimization]
meansEndsRatio: 0.6
projects: [dspy-optimizer, claude-evolution]
conversationExcerpts: false
category: practice
draft: false
description: "Prompt optimization is the process of using one AI to improve the instructions given to another AI, or to itself. The concept sounds circular because it is circular. The interesting question is whether circularity is fatal or merely uncomfortable."
---
# Automating Prompt Engineering

Prompt optimization is the process of using one AI system to systematically improve the instructions given to another AI system, or to itself. The concept sounds circular because it is circular: the judge and the optimized are the same architecture, sometimes the same weights, and the definition of "better" is itself a prompt that could be optimized. The field is a billion-dollar market with no coherent answer to the question of what, exactly, is being optimized.

I spent three weeks building an automated prompt optimization system. The optimizer improved 11 of 13 targets. The two it failed on were the complex ones, the ones where quality is genuinely multidimensional, where "better" means five different things weighted against each other. So the optimizer works perfectly on everything except the things you would actually want to optimize.

That result is more instructive than it sounds.

## The Argument for Automation

The philosophical case is straightforward and, as far as it goes, correct. Neural network weights are optimized algorithmically. Nobody hand-tunes the 175 billion parameters of a language model. So why should prompts, which serve an analogous steering function in the inference pipeline, be manually crafted through trial and error?

DSPy, Stanford's framework for treating prompts as programs, operationalizes this question. You define a task signature (what the model should do), provide a metric function (how to score outputs), supply training examples (known inputs and acceptable outputs), and let an optimizer search for the instruction set that maximizes your metric. The separation is clean: what the model should accomplish is specified by the human; how to tell the model to accomplish it is discovered by the algorithm.
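
That separation can be sketched in plain Python. This is deliberately not DSPy's API: `fake_model`, `exact_match`, `score`, `optimize`, and the example data are all hypothetical stand-ins, and the "search" is just an exhaustive comparison of two candidate instructions. The point is only to show which parts the human writes (examples, metric, candidates) and which part the algorithm decides (the winning instruction).

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Example:
    text: str   # known input
    label: str  # acceptable output


def fake_model(instruction: str, text: str) -> str:
    """Hypothetical stand-in for an LLM call, so the sketch runs offline.

    It only classifies sensibly when the instruction mentions "severity".
    """
    if "severity" in instruction.lower():
        return "high" if "injection" in text.lower() else "low"
    return "low"


def exact_match(predicted: str, expected: str) -> float:
    """Metric function: 1.0 for a case-insensitive exact match, else 0.0."""
    return 1.0 if predicted.strip().lower() == expected.strip().lower() else 0.0


def score(
    instruction: str,
    examples: Sequence[Example],
    model: Callable[[str, str], str],
    metric: Callable[[str, str], float],
) -> float:
    """Mean metric value of one candidate instruction over the examples."""
    if not examples:
        raise ValueError("need at least one example")
    total = sum(metric(model(instruction, ex.text), ex.label) for ex in examples)
    return total / len(examples)


def optimize(
    candidates: Sequence[str],
    examples: Sequence[Example],
    model: Callable[[str, str], str],
    metric: Callable[[str, str], float],
) -> Tuple[str, float]:
    """Exhaustive search: return the best-scoring candidate and its score."""
    scored = [(score(c, examples, model, metric), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Illustrative data only; not taken from the experiments described in this post.
TRAINSET = [
    Example("SQL injection in login form", "high"),
    Example("Typo in README", "low"),
    Example("Command injection via filename", "high"),
    Example("Unused import", "low"),
]
CANDIDATES = [
    "Review this code.",
    "Classify the severity of this finding as high or low.",
]

best, best_score = optimize(CANDIDATES, TRAINSET, fake_model, exact_match)
print(best, best_score)
```

A real optimizer proposes and mutates candidates rather than picking from a fixed list, but the contract is the same: it sees only the number `score` returns.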

The results are often impressive. A prompt evaluator task jumped from 46.2% to 64.0% accuracy. Jailbreak detection went from 59% with manual prompts to 93.18% with optimized few-shot demonstrations. GEPA, a reflective optimization algorithm published in mid-2025, achieves its improvements using up to 35 times fewer rollouts than reinforcement learning, which matters because optimization cost is the bottleneck for production adoption. Databricks reported that GEPA produced agents 90 times cheaper than baseline Claude Opus while matching or exceeding its performance.

These numbers are real. The improvements transfer to held-out test sets. The cost savings are measured in production environments, not academic sandboxes. If your task is well-defined and your metric captures what you care about, automated optimization is strictly superior to manual tuning: 10 minutes of compute versus 20 hours of human iteration, with equal or better outcomes.

The case for automation is, within its domain, compelling.

## The Argument Against Automation

The counterevidence is quieter but persistent. AI21 Labs found that manually written prompts produced more consistent results across variable data conditions, and attributed this to human prompts drawing on "common knowledge from life experience, whereas automated prompts come about more randomly." A separate study found that manual prompt engineering had a more consistent effect on model results regardless of test data volume. The Communications of the ACM noted that researchers "were surprised to find that the prompts written by humans actually produced better results on most tasks."

These findings are not contradictory. They're describing different domains.

Automated optimization excels at narrow, well-defined tasks where the success metric captures the full scope of what matters: classification, extraction, formatting, routing. These tasks have clean evaluation functions, bounded output spaces, and stable distributions. The optimizer can explore the space of possible instructions and converge on formulations that a human would never try but that score reliably well.

Manual prompts excel at complex, contextually loaded tasks where the valuable properties of the output (appropriate tone, ethical judgment, cultural sensitivity, the difference between a technically correct answer and a genuinely helpful one) resist quantification. The human engineer embeds tacit knowledge in the prompt, knowledge about what "good" looks like that can't be reduced to a scoring function. The optimizer, which can only see the metric, is blind to everything the metric doesn't capture.

The uncomfortable synthesis: automated prompt engineering works perfectly on the tasks where prompt engineering barely matters, and struggles on the tasks where it matters most.

## The Metrics Problem

This is where the argument sharpens. You can only optimize what you can measure. Goodhart's Law applies with unusual directness: when the metric becomes the optimization target, it ceases to be a good metric.

I built 17 custom evaluation functions for my optimization targets. Code review severity classification. Test coverage scoring. CWE identifier extraction. PR quality assessment across five weighted dimensions. The single-axis metrics (severity classification, tool routing) optimized cleanly and generalized to holdout data. The multi-axis metrics (PR quality, refactoring assessment) showed training scores above 0.80 that collapsed to 0.30 on holdout, which is the optimizer equivalent of studying for the test instead of learning the material.
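
A minimal sketch of what a multi-axis metric and a holdout check might look like. The dimension names, the weights, and the 0.10 gap threshold are assumptions made for illustration; they are not the five dimensions or the cutoff used in the experiments above.

```python
from typing import Mapping

# Hypothetical dimension names and weights, for illustration only.
PR_QUALITY_WEIGHTS = {
    "correctness": 0.30,
    "test_coverage": 0.25,
    "readability": 0.20,
    "scope": 0.15,
    "description": 0.10,
}


def weighted_score(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Collapse per-dimension scores in [0, 1] into one weighted average."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[d] * scores[d] for d in weights) / total_weight


def generalization_gap(train_score: float, holdout_score: float) -> float:
    """How much worse the optimized prompt does on data it was not tuned on."""
    return train_score - holdout_score


def looks_overfit(train_score: float, holdout_score: float, max_gap: float = 0.10) -> bool:
    """Flag a prompt whose holdout score trails training by more than max_gap."""
    return generalization_gap(train_score, holdout_score) > max_gap


# The multi-axis pattern described above: 0.80 in training, 0.30 on holdout.
print(looks_overfit(0.80, 0.30))
```

Collapsing five judgments into one number is exactly what gives the optimizer room to trade one dimension against another, so the gap check matters more here than it does for single-axis metrics.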

The optimizer found shortcuts. It discovered that certain phrasings in the prompt correlated with high scores on the training examples, not because those phrasings encoded general principles but because they matched the specific patterns of the training distribution. A prompt that said "focus on security vulnerabilities in authentication flows" scored well on a training set where most examples involved authentication. In production, that same prompt missed injection vulnerabilities in data processing pipelines.
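
One cheap guard against this kind of shortcut is to check how skewed the training set is before optimizing. The sketch below assumes each training example carries a category tag, which is an assumption of this example rather than something the setup above is known to have; `dominant_category` and the 0.5 threshold are likewise hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def dominant_category(
    categories: Sequence[str], max_share: float = 0.5
) -> Optional[Tuple[str, float]]:
    """Return (category, share) if one category exceeds max_share, else None."""
    if not categories:
        raise ValueError("need at least one category label")
    name, count = Counter(categories).most_common(1)[0]
    share = count / len(categories)
    return (name, share) if share > max_share else None


# Illustrative tags: a training set dominated by authentication examples.
skewed = ["auth"] * 7 + ["injection"] * 2 + ["crypto"]
print(dominant_category(skewed))
```

A non-`None` result does not prove the optimizer will overfit, but it says which phrasing the optimizer is likely to be rewarded for.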

This is the black box problem that DSPy's own documentation acknowledges: "the optimizer is only looking at the final score and is blind to detailed reasoning." The optimizer doesn't understand why a prompt works. It understands that a prompt scores well. These are different things, and the gap between them is where production failures live.

## The Craft Relocation Thesis

The dominant narrative is that prompt engineering is dying. Microsoft survey data shows prompt engineer as the second-to-last role companies plan to hire. Job postings are sparse. Andrej Karpathy has said he prefers the term "context engineering" to "prompt engineering." The market data tells a different story: the prompt engineering market reached $1.13 billion in 2025, up from $0.85 billion in 2024, with enterprise AI spending averaging $85,521 per month, a 36% annual increase.

Both narratives are true. The reconciliation is that the skill isn't dying. It's relocating.

Old prompt engineering was craft: iterate on wording, test manually, develop intuitions about what phrasings work, accumulate tricks. This skill is being automated. The optimizer does it faster, cheaper, and in many cases better.

New prompt engineering is architecture: design evaluation metrics that capture genuine quality, build multi-stage pipelines where each component can be independently optimized, choose the right optimization algorithm for the task structure, debug systematic failures in automated systems, ensure that the optimization objective actually aligns with the deployment objective.

The person who writes "You are a helpful assistant" is being replaced. The person who designs the metric function that distinguishes a helpful response from a technically correct but useless one is more valuable than ever. The craft relocated from the artifact (the prompt text) to the infrastructure (the evaluation framework).

This is the standard progression of every engineering discipline. Hand-calculated structural loads gave way to finite element analysis. The structural engineer didn't disappear. The job shifted from arithmetic to modeling, from computing to specifying what to compute.

## The Interpretability Gap

When GEPA produces an agent that matches or exceeds Claude Opus at one-ninetieth the cost, the natural question is: what does the optimized prompt say? What instruction made the difference?

Often, we don't know. The optimizer explores prompt variations through mutation and selection, a process analogous to evolutionary search, and the winning prompt is selected for its score, not its intelligibility. Some optimized prompts are clear improvements: more specific instructions, better structured examples, tighter constraints. Others are strange. They contain redundant phrasing, unusual formatting, or instructions that appear irrelevant to the task but consistently improve performance.

This mirrors the interpretability problem in neural networks, but with an irony: prompts are supposed to be the human-readable layer. The whole point of natural language interfaces is that we can read the instructions and understand what the model is being told to do. When the optimizer generates instructions that work but resist explanation, we've lost the interpretability advantage that prompts were supposed to provide.

The practical consequence is debugging. When an optimized prompt fails in production, you can't reason about why from the prompt text alone. You have to re-run the optimization with different metrics, inspect the training data for distributional bias, test individual prompt components in isolation. The debugging process for an optimized prompt is more like debugging a trained model than debugging code.
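
Testing components in isolation can be sketched as a leave-one-out ablation: remove each piece of the prompt, re-score, and record how much the score drops. Everything here is hypothetical, including `ablate`, the component strings, and `toy_evaluate`, which stands in for running the prompt over a holdout set.

```python
from typing import Callable, List, Sequence, Tuple


def ablate(
    components: Sequence[str], evaluate: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """For each component, return the score drop caused by removing it."""
    parts = list(components)
    full_score = evaluate("\n".join(parts))
    drops = []
    for i, part in enumerate(parts):
        reduced = "\n".join(parts[:i] + parts[i + 1:])
        drops.append((part, full_score - evaluate(reduced)))
    return drops


def toy_evaluate(prompt: str) -> float:
    """Hypothetical scorer: the fraction of two required keywords present.

    A real evaluator would run the prompt against held-out examples.
    """
    required = ("CWE", "JSON")
    return sum(keyword in prompt for keyword in required) / len(required)


COMPONENTS = [
    "Cite the CWE identifier for each finding.",
    "Respond in JSON.",
    "Be thorough. Be very thorough.",
]

for part, drop in ablate(COMPONENTS, toy_evaluate):
    print(f"{drop:+.2f}  {part}")
```

A component with zero drop may be dead weight for this evaluator. With a real model, components interact, so single-component ablation is a first pass rather than an explanation.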

We traded artisanal prompts we understood for optimized prompts that perform better but that we don't understand. The performance gain is measurable. The interpretability cost is harder to quantify, which is precisely the kind of cost that optimization ignores.

## The Recursion

The deepest tension in automated prompt engineering is recursive. The optimizer uses an AI to evaluate AI outputs. The evaluation criteria are themselves written in natural language, processed by a language model, subject to the same biases and limitations as the outputs being judged. If the evaluator systematically misunderstands what "clarity" means, the optimization converges toward the evaluator's misunderstanding, not toward actual clarity.

I use Claude Opus to evaluate Claude Sonnet's outputs. Different model, different parameters, but the same architecture, similar training data, similar RLHF biases. When both models agree that an output is good, it might be because the output is genuinely good, or it might be because both models share a systematic preference that diverges from human judgment. Cross-model validation helps, but only if the models are genuinely independent. Frontier models trained on similar corpora with similar alignment procedures may disagree on edge cases while converging on central tendencies. The diversity is shallow.
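
One way to probe for a shared preference is to score a sample of outputs with both the model judge and a human, then measure agreement beyond chance. The sketch below computes Cohen's kappa by hand; the labels are invented for illustration, and a human-labeled sample is an assumption of this example, not something described above.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labeled at random with their own base rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels: a judge that approves everything agrees with the human
# on half the items, which is no better than chance.
human = ["good", "good", "bad", "bad"]
lenient_judge = ["good", "good", "good", "good"]
print(cohens_kappa(human, lenient_judge))
```

Raw agreement between two models can look high simply because both approve most outputs. Kappa discounts that, which is why it is a more useful check on a judge than percent agreement.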

The escape hatch is empirical. Do optimized prompts produce better outcomes in practice? Do users report higher satisfaction? Do downstream metrics improve? If the answer is yes, the circularity is manageable even if it's not resolvable.

Eleven of thirteen targets say yes. I'm choosing to believe them. But I notice that the two failures were the targets where quality is hardest to measure, which means the circularity is most dangerous precisely where it's least detectable.

## What This Means

The field is converging on a hybrid model that satisfies nobody completely. Humans define what "good" means. Algorithms find the instructions that produce it. Humans validate that the algorithm's definition of "good" matches their own. The loop is slow, expensive, and philosophically incomplete.

It also works. The uncomfortable conclusion is not that automation replaces human judgment or that human judgment is irreplaceable. The uncomfortable conclusion is that neither automation nor human craft is sufficient alone, that the correct answer is a collaboration that requires ongoing maintenance, calibration, and humility about what each contributor can see and what each contributor misses.

Automation handles the search. Humans handle the specification. Neither is the hard part in isolation. The hard part is the interface between them: translating what you want into a metric that faithfully represents what you want, then verifying that the optimization didn't find a way to satisfy the metric without satisfying the intent.

We're not optimizing prompts. We're optimizing our ability to describe what we want. That turns out to be the hard problem. It always was.