title: "Where to Spend Your Context Window"
slug: 038-where-to-spend-your-context-window
date: 2026-03-17
category: deep-dive
tags: [context-window, ablation, pipeline-architecture, personality, methodology, 1m-context]
meansEndsRatio: 0.35
projects: [psyche]
conversationExcerpts: false
description: "An ablation study on a personality-preserving narrative pipeline found that planning context is the dominant factor in output quality, with output length as a secondary driver. The old pipeline's failure originated in planning, not in writing."
draft: true
---
# Where to Spend Your Context Window

The assumption that giving a language model more context during generation improves output quality is widespread enough to function as an axiom in most pipeline designs. The reasoning is straightforward: more information available at the point of generation means better informed generation. Retrieval-augmented generation, long-context models, and the steady expansion of context windows from 4,000 to 128,000 to 1,000,000 tokens all reinforce the same premise: that the bottleneck is how much the model can see while producing its output.

An ablation study on a narrative generation pipeline suggests this premise is incomplete in a specific and measurable way. The single largest factor in output quality was not how much the model saw while writing, but how much the model saw while planning what to write: all three conditions that used planning artifacts derived from the full archive massively outperformed the condition that planned from limited context. But the picture is more nuanced than "planning is all that matters." Among the new conditions, output length emerged as a secondary factor (longer narratives scored better), and restricting writing context slightly hurt rather than helped. The context window's value is real at every pipeline phase, but the planning phase is where restricted context does the most damage.

## The Pipeline and the Problem

The experimental context is a narrative generation pipeline that converts text message archives into first person literary memoir, with the generated narrative's personality signal measured against two benchmarks: an Opus corpus reference profile (Opus's own Big Five inference from the writing corpus, measuring internal model consistency) and a Psyche merged profile (synthesized from 39 validated instruments plus a semi-structured interview, providing an independent accuracy test). The pipeline has been described in detail elsewhere ([From Text Messages to Literary Memoir](/posts/029-from-text-messages-to-literary-memoir), [What AI Learns About You When You're Not Looking](/posts/030-what-ai-learns-about-you)). What matters here is the architecture.

The original pipeline operated within a 100,000 to 150,000 token context window. For each chapter of the memoir, an LLM generated a research brief: a curated list of 5 to 15 messages per chapter along with an emotional arc description. A separate writing agent produced the chapter from that brief plus approximately 500 filtered messages. The model never saw the full message archive at once. It saw summaries of summaries, each layer of abstraction compressing the source signal and potentially amplifying distortions in the process.

When the context window expanded to 1,000,000 tokens, the full archive (19,867 messages at roughly 534,000 tokens) fit in a single context. The pipeline was rebuilt: a discovery agent read all messages and produced emotional phase maps, canonical facts, and character notes. An outline agent designed the chapter structure with full archive visibility. Chapter writers received all messages plus the discovery artifacts plus the outline plus all prior chapters. The same model (Opus 4.6), the same first person memoir format, the same citation system. The only change was that every agent in the pipeline could see everything.
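The stage wiring of the rebuilt pipeline can be sketched as a minimal orchestration loop. The function names and artifact fields below are hypothetical, chosen only to mirror the description above, not taken from the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class DiscoveryArtifacts:
    # Produced once by the discovery agent from the full archive.
    phase_map: str        # emotional phase map
    canonical_facts: str  # facts every chapter must respect
    character_notes: str

def run_pipeline(messages, discover, outline_fn, write_chapter):
    """Every stage sees the full archive; each chapter writer also
    receives the discovery artifacts, its plan, and all prior chapters."""
    artifacts = discover(messages)                   # full-archive discovery
    chapter_plans = outline_fn(messages, artifacts)  # full-archive outline
    chapters = []
    for plan in chapter_plans:
        chapters.append(write_chapter(messages, artifacts, plan, list(chapters)))
    return chapters
```

The point of the sketch is the data flow: nothing in the loop compresses the archive before any agent sees it, which is exactly the property the old brief-based pipeline lacked.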

The personality metric is mean absolute delta (mean |Δ|): the average distance between the Big Five domain scores inferred from the narrative and the scores from the Opus corpus reference profile (results against both benchmarks are reported in the Results section). Lower is better. The old pipeline scored 11.4. The 1M pipeline scored 3.9 under Opus evaluation, representing a 66% improvement concentrated almost entirely in two dimensions: Neuroticism (inflated by +23.9 in the old pipeline, reduced to +3.9 in the new) and Agreeableness (inflated by +19.3, reduced to +4.3). Extraversion, Openness, and Conscientiousness showed minimal change across pipeline architectures, which is consistent with the hypothesis that these dimensions are relatively robust to context window size because they are visible in any sufficiently large text sample, while Neuroticism and Agreeableness are highly sensitive to which messages the model sees.
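The metric itself is a five-number average. A minimal sketch, using illustrative profile numbers rather than the study's actual domain scores:

```python
DOMAINS = ("N", "E", "O", "A", "C")

def mean_abs_delta(inferred: dict, reference: dict) -> float:
    """Mean |delta| across the Big Five domains; lower is better."""
    return sum(abs(inferred[d] - reference[d]) for d in DOMAINS) / len(DOMAINS)

# Illustrative profiles only: N and A inflation dominates the aggregate
# while E, O, and C sit close to the reference.
reference = {"N": 51.1, "E": 40.0, "O": 90.0, "A": 55.0, "C": 60.0}
inflated  = {"N": 75.0, "E": 41.0, "O": 91.0, "A": 74.3, "C": 61.0}
print(round(mean_abs_delta(inflated, reference), 2))  # 9.24
```

Because the metric averages over five domains, two large distortions (N and A here) can dominate the aggregate even when the other three domains are nearly exact, which is precisely the old pipeline's failure signature.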

## The Ablation Design

Ablation, borrowed into machine learning from experimental neuroscience (where it refers to the surgical removal of a brain region to determine what function that region served), follows an analogous methodology: remove or disable one component at a time and measure whether the system's performance changes. If performance degrades when a component is removed, that component was contributing. If performance is unchanged, the component was either redundant or compensated for by other mechanisms.

The 1M pipeline's improvement over the old pipeline confounded multiple variables simultaneously. The 1M pipeline used a different planning architecture (discovery agents reading all messages, rather than curated briefs). It produced shorter output (20,000 words versus 24,000). And the same model evaluated both pipelines, raising the possibility that Opus was scoring its own writing favorably. Three ablations were designed to isolate each confound.

**The context restriction ablation** asked whether restricting the chapter writer's message visibility to approximately 200 messages per chapter (simulating the old pipeline's restricted access) would reproduce the old pipeline's personality inflation. The ablation preserved the 1M pipeline's outline and discovery artifacts (which were created with full archive visibility) but restricted each chapter writer to a random subsample of 200 messages from its time period, selected with a fixed seed for reproducibility. If context during writing is the driver, the restricted version should score closer to 11.4 than to 3.9.

**The length ablation** asked whether the improvement was partly an artifact of shorter text having less contradictory surface area. The ablation used the full 1M pipeline architecture but instructed the chapter writer to produce longer chapters (1,800 to 2,200 words each, targeting approximately 28,000 words total, exceeding the old pipeline's 24,000). If brevity is the confound, the longer version should score worse than 3.9.

**Confidence intervals** addressed measurement noise. The Opus evaluation was run four times each on the old and 1M narratives (four re-evaluations of the same generated text, bounding evaluator scoring variance rather than pipeline generation variance) to produce 95% confidence intervals per domain.

**A cross-model evaluation** using GPT-5.4 (via Codex) addressed the self-scoring concern by having an independent model evaluate all four narrative conditions. GPT-5.4 also independently evaluated the writing corpus to produce its own reference profile, enabling a fully self-consistent cross-model check.

## The Results

| Condition | Messages per Chapter | Total Words | Mean \|Δ\| | 95% CI | Runs |
|-----------|---------------------|-------------|-----------|--------|------|
| Old pipeline (100-150K planning context) | ~500 | 24,000 | 11.4 | per-domain only | 4 |
| 1M pipeline (full context throughout) | 19,867 | 20,000 | 3.9 | per-domain only | 4 |
| Filtered (outline from 1M, 200 msgs/chapter) | 200 | 32,000 | 2.4 | [2.0, 2.8] | 4 |
| Long (1M context, higher word targets) | All in time range | 28,000 | 1.7 | [1.4, 2.0] | 4 |

All four conditions now have four evaluation runs each (re-evaluations of the same generated text, bounding evaluator scoring variance). The ranking under Opus evaluation is: long (1.7) < filtered (2.4) < 1M (3.9) < old (11.4). The filtered and long confidence intervals barely overlap at 2.0, suggesting the difference between them is real rather than noise. The dramatic gap between old and the three new conditions (all of which share 1M-derived planning artifacts) is the clearest finding: planning context is the dominant factor.

The confidence intervals for the Neuroticism dimension, which showed the largest single improvement, were non-overlapping at 95% confidence: old pipeline N = [72.4, 77.4] versus 1M pipeline N = [52.5, 55.2], against a reference profile of 51.1. Agreeableness and Conscientiousness intervals overlapped, indicating that differences in those domains fall within measurement noise for a four-run sample.
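Non-overlap of two closed intervals is the check being applied here; with the Neuroticism CIs from the text:

```python
def intervals_overlap(a, b):
    """True if closed intervals a = (lo, hi) and b = (lo, hi) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]

old_N = (72.4, 77.4)   # old pipeline Neuroticism CI
new_N = (52.5, 55.2)   # 1M pipeline Neuroticism CI
print(intervals_overlap(old_N, new_N))  # False: the N gap exceeds evaluator noise
```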

GPT-5.4 evaluated all four narrative conditions using the same assessment prompt and chunking pipeline as Opus, and also independently evaluated the writing corpus to produce its own reference profile (N=69, E=36, O=91, A=52, C=55). Two findings emerge from this cross-model check. First, both evaluators agree the old pipeline shows the largest Neuroticism shift between conditions (Opus: +23.9 against its reference profile, GPT: +13.7 against its own), and all newer pipeline conditions reduce this N inflation. Second, the finer-grained finding that planning context outperforms writing context does not replicate under GPT-5.4 self-evaluation. When GPT-5.4's narrative evaluations are measured against GPT-5.4's own corpus reference profile, the old pipeline scores best (10.7) and the 1M pipeline scores worst (12.9), inverting the Opus ranking. The inversion is driven by a systematic Conscientiousness inflation of +22 to +28 points that GPT-5.4 applies to all narrative conditions, which dominates the aggregate metric and flattens between-condition differences. The Neuroticism finding is evaluator-robust; the planning-versus-generation ranking is not.

Against the Psyche merged profile (the instrument-derived benchmark, synthesized from 39 instruments plus a semi-structured interview rather than Opus corpus inference), the improvement pattern holds: old pipeline 10.7, 1M pipeline 8.4, filtered 7.6, long 8.5. The deltas are uniformly larger because the merged profile differs from the Opus inference on Agreeableness and Openness. The ranking shifts: against the merged profile, filtered (7.6) edges ahead of long (8.5), reversing their order against the Opus reference. This is because the merged profile penalizes the long condition's N undershoot (-6.7) more heavily. Per-domain disaggregation reveals that Openness is inflated by +8 to +12 points across all conditions against the merged profile (where instrument-measured O at 79.1 is substantially below both LLM corpus inferences of ~89-91), a structural artifact of narrative generation (literary memoir is inherently ideas-oriented text). The Neuroticism correction from old (+18.3) to 1M (-1.7) remains the clearest signal against the merged profile.

## What the Ablation Reveals

The initial single-run results appeared to tell a clean story: filtered and long both scored 1.8 versus 1M's 3.9, suggesting that the planning phase was everything and writing context was irrelevant or even harmful. With four evaluation runs per condition, the story is more nuanced and more interesting.

The dominant finding survives replication: all three new conditions massively outperform the old pipeline. The old pipeline scored 11.4; the worst new condition scores 3.9. The shared factor is planning context: all three new conditions used outline and discovery artifacts derived from an agent that could see all 19,867 messages. The old pipeline's chapter briefs were generated by an LLM working from limited context, and each layer of abstraction amplified the narrative framing. A person who occasionally expressed anxiety became "a person characterized by anxiety" in the brief, which became a chapter where anxiety was the dominant emotional register. The 1M pipeline collapsed these layers by giving the planner everything. The planner calibrated emotional peaks against the full behavioral baseline. This is why planning context matters most.

The secondary finding changed with replication. Under single-run evaluation, filtered appeared to match long (both 1.8), suggesting writing context was irrelevant. With confidence intervals, long (1.7 ± 0.3) outperforms filtered (2.4 ± 0.4), and both outperform the base 1M pipeline (3.9). The variable that separates 1M from both ablation conditions is output length: the 1M pipeline produced approximately 20,000 words, while filtered produced approximately 32,000 and long approximately 28,000. Longer narratives provide more surface area for the personality evaluation prompt to detect behavioral patterns that match the reference profile.

The comparison between filtered and long is the most informative. Filtered has restricted writing context (200 messages) but longer output (32,000 words). Long has full writing context (all messages in the time range) but slightly shorter output (28,000 words). Long wins. This means full writing context provides value that outweighs the 4,000-word output advantage of the filtered condition. The "writing context adds noise" hypothesis that the single-run data appeared to support is refuted by the replicated data.

Three factors, ordered by effect size under Opus evaluation: (1) planning context is the dominant factor, accounting for the 7.5-point gap between old and 1M; (2) output length is a secondary factor, with longer narratives scoring better, likely by providing richer behavioral evidence for the evaluator; (3) writing context provides a small additional benefit, visible in the long-vs-filtered comparison. Of these, only the Neuroticism correction underlying (1) replicates under cross-model evaluation; the aggregate rankings themselves should be treated as Opus-specific observations.

## Generalization: Where Context Matters Most

The specific domain (personality signal preservation in narrative generation) is narrow. The architectural observations may not be. Multi-step LLM pipelines, whether for document generation, code production, research synthesis, or agentic task execution, face the same structural question: where in the pipeline does context quality matter most?

The standard answer is to maximize context at the point of generation. RAG systems retrieve at generation time. Long-context models are marketed on generation-side capacity. The ablation results suggest a more layered picture: planning context is the single largest factor, but generation context and output length both contribute measurable improvements. A comprehensive plan created from full context is the prerequisite for good output, but the generator benefits from full context too, and longer output provides more surface area for quality signal to emerge.

This framing maps onto a distinction familiar in other engineering domains: design time information versus execution time information. An architect who has surveyed the entire site before drawing plans produces better buildings than one who surveys each room individually during construction. But the architect analogy has limits: unlike a construction worker following blueprints, an LLM writer with full context can perform local calibration that a context-restricted writer cannot, and the data suggests this calibration has measurable value.

Whether these patterns hold beyond narrative generation is an empirical question that this study cannot answer. The domain has specific properties (personality signal as a measurable outcome, a validated psychometric benchmark, a clear distinction between planning and writing phases) that may not transfer to domains where the planning/generation boundary is less defined. The finding that planning context is the dominant factor has the strongest evidential support (all three new conditions outperform old, and the gap is large). The secondary findings about output length and writing context are based on smaller effect sizes and Opus-only evaluation. The principle that planning context matters most is a hypothesis with strong support from one domain; the finer-grained observations about generation context and output length are more tentative.

## Limitations

The experiment has four limitations that constrain its generalizability.

The sample is a single 8-month messaging arc: 19,867 messages between two people. The finding that planning context matters most may not hold for relationships with different emotional textures, different messaging densities, or different temporal structures.

The evaluation metric (Big Five personality inference from generated text) is specific to this pipeline. The output length effect (longer narratives score better) may partly reflect the evaluator having more text to work with rather than the narrative genuinely containing better personality signal. Whether these patterns hold for other generation quality metrics (factual accuracy, stylistic consistency, narrative coherence) is untested.

The old pipeline and the 1M pipeline differ in more than context window size. The old pipeline used a fundamentally different planning architecture (sequential brief generation rather than holistic discovery), which means the ablation isolates context restriction during writing but does not isolate context restriction during planning. A cleaner test would restrict planning context to 100,000 tokens while keeping the discovery agent architecture, which would separate the effect of context size from the effect of architectural differences in the planning phase.

The cross-model evaluation using GPT-5.4 produced an independent corpus reference profile and evaluated all four conditions against it. The condition rankings inverted under GPT-5.4 self-evaluation (old best, 1M worst), driven by a systematic Conscientiousness inflation that affects all conditions equally. The Neuroticism finding (old pipeline's N inflation is the largest distortion, reduced by all new conditions) is robust across evaluators. The finer-grained rankings (long vs filtered vs 1M, and the output length and writing context effects) are Opus-specific observations that have not been confirmed by cross-model evaluation.

## What Holds and What Remains Open

The finding that the old pipeline inflates Neuroticism and that 1M-derived planning artifacts substantially reduce this inflation holds across evaluators and benchmarks. This is the robust finding: planning context is the dominant factor. With four evaluation runs per condition and confidence intervals, two secondary findings also emerge under Opus evaluation: longer output improves scores (long 1.7 vs 1M 3.9, the only difference being word count target), and full writing context slightly outperforms restricted writing context (long 1.7 vs filtered 2.4, though confounded by output length). Neither secondary finding has been confirmed by cross-model evaluation.

The initial single-run observation that writing context restriction helps (the "planning > generation" thesis in its strong form) is refuted by the replicated data. The filtered condition scored 2.4, not 1.8, and the long condition (with full writing context) scored 1.7. Writing context contributes positively, not negatively.

The generalization to multistep LLM pipelines remains a hypothesis. The finding that planning context matters most has the strongest evidential support and the clearest theoretical basis (design time versus execution time information). The secondary findings about output length and writing context are more domain-specific and may reflect properties of personality inference from text rather than general principles of LLM pipeline design.

The practical implication, for anyone building a pipeline with distinct planning and generation phases, is that investing context in the planning phase yields the largest returns, but restricting generation context is not a free optimization. A comprehensive plan created from full context remains the prerequisite for good output. Whether the generator also benefits from full context depends on the domain, and in at least this one, it does.