{"category": "deep-dive", "conversationExcerpts": false, "date": "2026-02-25", "description": "Psychometric self-report with individual scales reaching .80-.90 reliability. LLM inference from interactive conversation hits r~.44 (Peters et al. 2024). Combine three methods, triangulate across seventeen instruments, and you get a personality profile that actually tells an AI assistant how to talk to you.", "draft": false, "meansEndsRatio": 0.58, "projects": ["psyche"], "slug": "028-building-your-own-personality-profile", "tags": ["psychometrics", "personality", "ai", "open-source", "methodology"], "title": "Building Your Own Personality Profile with AI"}
---
# Building Your Own Personality Profile with AI

Personality labels may not meaningfully change how LLMs respond to you. "High Openness" in a system prompt likely produces no consistent behavioral shift. Neither does "INTJ" or "Type 5" or any other shorthand we've inherited from pop psychology's long flirtation with taxonomy.

The reason is simple: trait labels describe the person. They don't tell the model what to do differently.

A useful personality context for an AI assistant needs to answer a different question: given what we know about this person's cognitive style, emotional patterns, and communication preferences, what should the model *change* about its default behavior? Not "who is this person" but "what should I do differently for them."

That's the problem Psyche tries to solve. Seventeen validated instruments, three independent inference methods, and a synthesis pipeline that converts psychometric scores into behavioral specifications, specifically if-then patterns designed to shift model output.

## Three Methods and a Demotion

Every measurement tool has blind spots. Self-report captures self-perception, which may diverge from behavior. Text analysis captures linguistic patterns, which may diverge from inner experience. The only honest approach is triangulation: measure the same constructs through independent methods and see where they agree.

### Self-Report (Weight: 0.35)

The backbone. Seventeen validated instruments drawn from public-domain item pools and published research scales. "Validated" matters, because it means someone has demonstrated both that the instrument measures what it claims to measure (validity) and that its scores are consistent (reliability). Reliability comes in more than one form: consistency across items within a scale is usually summarized with Cronbach's alpha, while consistency across administrations is test-retest reliability. A common threshold for alpha is .70. Individual core scales in this battery reach reliability in the .80-.90 range.
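For readers who haven't computed it, Cronbach's alpha is short enough to write out. The sketch below is illustrative and not taken from the Psyche codebase; the `cronbach_alpha` name and the toy response matrix are assumptions made up for the example.

```python
from statistics import pvariance


def cronbach_alpha(responses: list[list[float]]) -> float:
    """Cronbach's alpha for one scale.

    responses[i][j] is respondent i's score on item j, with any
    reverse-keyed items already flipped.
    """
    k = len(responses[0])  # number of items in the scale
    item_variances = [pvariance([row[j] for row in responses]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Toy data (made up): five respondents, three items on a 1-5 scale.
toy = [[1, 2, 1], [2, 2, 3], [3, 4, 3], [4, 4, 5], [5, 5, 5]]
print(f"alpha = {cronbach_alpha(toy):.2f}")
```

Alpha is computed over a sample of respondents, so a single-person profile cannot estimate it and has to rely on the reliability figures reported in each instrument's validation literature.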

Ad-hoc personality questionnaires, the kind that populate self-help websites and corporate team-building workshops, have no such guarantees. A questionnaire that hasn't been validated is a random number generator with a confidence problem.

The instruments use Likert scales (rate agreement 1-5 or 1-7), true/false binary items, and open-ended responses. Each targets a specific construct with known psychometric properties. Completion takes about 75 minutes across all seventeen instruments, a meaningful time investment but trivial compared to the inferential precision it buys.
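Because the battery mixes 1-5 and 1-7 response formats, scale scores need a common range before they can be compared. Here is a minimal sketch of one way to do that, assuming some items are reverse-keyed and that a 0-100 rescaling is acceptable; the function name and the example responses are hypothetical, not Psyche's actual scoring code.

```python
def scale_score(responses: list[int], reverse_keyed: set[int], points: int) -> float:
    """Score one Likert scale on a 0-100 range.

    responses: raw answers, each in 1..points
    reverse_keyed: indices of items whose wording runs opposite to the construct
    points: number of response options (5 or 7 in this battery)
    """
    keyed = [
        (points + 1 - answer) if index in reverse_keyed else answer
        for index, answer in enumerate(responses)
    ]
    mean_answer = sum(keyed) / len(keyed)
    # Map the 1..points range onto 0..100 so different formats line up.
    return 100 * (mean_answer - 1) / (points - 1)


# Four items on a 5-point scale; items 1 and 3 are reverse-keyed (made-up data).
print(scale_score([5, 1, 4, 2], reverse_keyed={1, 3}, points=5))
```

Rescaling this way only aligns ranges. It does not make scores from different instruments psychometrically equivalent.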

### LLM Inference (Weight: 0.20, r~.44)

Peters, Cerf, and Matz demonstrated in 2024 that LLMs can predict Big Five personality traits from live conversational interactions. What stands out for our purposes is *how much the interaction design matters*.

Their study tested three chatbot conditions: a default helpful assistant (r ~ .117), a naturalistic conversational interaction (r ~ .218), and a personality-elicitation chatbot that actively probes users for personality-relevant information (r ~ .443). By that measure, the ratio between the best and default conditions is roughly 3.8x, a difference driven by interaction design: the high condition was a fundamentally different conversational protocol, not merely a reworded prompt.
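The arithmetic behind that ratio is worth making explicit, along with what each correlation means in shared-variance terms. This small sketch uses only the three correlations quoted above; the variable and function names are my own.

```python
# Correlations between inferred and self-reported traits, as quoted above.
REPORTED_R = {
    "default assistant": 0.117,
    "naturalistic conversation": 0.218,
    "personality elicitation": 0.443,
}


def compare_to_baseline(r_values: dict[str, float], baseline: str) -> dict[str, dict[str, float]]:
    """Express each condition as a multiple of the baseline r, plus its r squared."""
    base = r_values[baseline]
    return {
        name: {"ratio_r": r / base, "r_squared": r * r}
        for name, r in r_values.items()
    }


summary = compare_to_baseline(REPORTED_R, "default assistant")
for name, stats in summary.items():
    print(f"{name}: {stats['ratio_r']:.1f}x baseline r, r^2 = {stats['r_squared']:.3f}")
```

Squaring r gives the share of variance the inference has in common with self-report. Even the best condition lands at roughly 0.20, which is why LLM inference works here as one input to triangulation and not as a standalone measure.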

Psyche uses the assessment-optimized approach. The model receives a sampled corpus (diversity-sampled to represent the full range of the person's writing contexts) and applies structured personality inference with explicit dimensional scoring. A blind assessment condition prevents the model from performing to expectations.

The corpus matters. In our system, below roughly 50,000 words the signal-to-noise ratio degrades noticeably. The ideal range for this pipeline is 100,000-500,000 words spanning multiple contexts: professional writing, personal messages, creative work, academic output. Corpora drawn from a single context produce profiles that reflect only that single context.
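One simple way to keep a sample from collapsing into a single context is to draw documents round-robin across contexts until a word budget is met. The sketch below assumes documents are already grouped by context; `sample_corpus`, the context labels, and the budget are illustrative, and the repo's actual sampler may work differently.

```python
import random


def sample_corpus(
    docs_by_context: dict[str, list[str]], word_budget: int, seed: int = 0
) -> list[tuple[str, str]]:
    """Draw documents round-robin across contexts until the word budget is reached."""
    rng = random.Random(seed)
    # Shuffle a copy of each context's documents so the input is left untouched.
    pools = {ctx: rng.sample(docs, len(docs)) for ctx, docs in docs_by_context.items()}
    sampled: list[tuple[str, str]] = []
    words_used = 0
    while words_used < word_budget and any(pools.values()):
        for ctx in pools:
            if not pools[ctx]:
                continue  # this context is exhausted
            doc = pools[ctx].pop()
            sampled.append((ctx, doc))
            words_used += len(doc.split())
            if words_used >= word_budget:
                break
    return sampled


# Tiny made-up corpus with three contexts.
corpus = {
    "professional": ["quarterly plan draft"] * 5,
    "personal": ["see you"] * 5,
    "creative": ["the rain kept falling"] * 5,
}
sample = sample_corpus(corpus, word_budget=12)
print(sorted({ctx for ctx, _ in sample}))
```

Round-robin sampling favors coverage over proportionality: a small context gets the same number of turns as a large one until it runs out of documents.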

### Lexical Analysis (Corpus Characterization)

Empath provides 200 lexical categories shown to correlate highly (r=0.906) with corresponding LIWC (Linguistic Inquiry and Word Count) categories. It was originally included as a fourth inference method (weight: 0.10) to cross-check self-report against actual word usage patterns. However, analysis revealed that Empath measures *what you write about* (topic and genre) rather than *who you are* (personality). Across all five Big Five dimensions, Empath produced scores compressed into a narrow range regardless of the dimension's actual extremity, far less spread than LLM inference produced for the same constructs.

Empath is now used for comparative corpus characterization only: z-scored lexical frequencies show how word-usage patterns vary across communication contexts (e.g., messaging vs. academic writing vs. AI conversations). This is genuinely useful for understanding register effects on personality *expression*, but it doesn't measure personality itself. The [research paper](/research/convergent-personality-assessment) documents the full root-cause analysis.
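As a rough illustration of what z-scoring across contexts means here, the sketch below takes per-context category frequencies as plain dictionaries and standardizes each category across contexts. The input numbers are made up and the function name is hypothetical; in practice the frequencies would come from a lexical tool such as Empath.

```python
from statistics import mean, pstdev


def zscore_across_contexts(freqs: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    """freqs[context][category] is a normalized frequency; returns z-scores per category."""
    categories = sorted({cat for per_ctx in freqs.values() for cat in per_ctx})
    z: dict[str, dict[str, float]] = {ctx: {} for ctx in freqs}
    for cat in categories:
        values = [freqs[ctx].get(cat, 0.0) for ctx in freqs]
        mu, sigma = mean(values), pstdev(values)
        for ctx in freqs:
            value = freqs[ctx].get(cat, 0.0)
            # A category with no variation across contexts carries no contrast.
            z[ctx][cat] = 0.0 if sigma == 0 else (value - mu) / sigma
    return z


# Made-up frequencies for one category in three contexts.
example = {
    "messaging": {"affection": 0.03},
    "academic": {"affection": 0.01},
    "ai_conversations": {"affection": 0.02},
}
z_scores = zscore_across_contexts(example)
print(z_scores["messaging"]["affection"] > z_scores["academic"]["affection"])
```

The z-scores are relative to the person's own contexts, so they describe register shifts within one corpus and say nothing about how the person compares to anyone else.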

### Conversational Interview (Weight: 0.20, Qualitative)

A semi-structured interview protocol consisting of ten open-ended questions that probe values, self-concept, relationships, stress responses, and decision-making patterns. The questions are designed to elicit behavioral examples more than trait labels: closer to "Tell me about a time when..." than "How would you describe yourself?", though both styles appear in the protocol.

The interview captures what instruments cannot: narrative self-understanding, characteristic phrases, behavioral examples in specific contexts, and the texture of how someone thinks about their own thinking. A person's answer to "What does your mind do when everything falls apart?" provides data that no Likert scale can reach.

I set the weight at 0.20, equal to LLM inference, because qualitative evidence anchors the persona model in ways that quantitative scores alone cannot. (These three methods total 0.75; the remaining weight budget is reserved for additional inference methods as the system evolves.) Numbers tell you the trait level. The interview tells you what that trait level *looks like* in practice.
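To make the weighting concrete, here is a minimal sketch of a weighted average that renormalizes over whichever methods produced an estimate. It assumes every method's evidence has already been placed on a shared 0-100 scale, including a coded version of the qualitative interview, which is a simplification of my own for the example and not a description of how the pipeline actually combines them.

```python
from typing import Optional

# Method weights as stated in this post; they sum to 0.75 by design.
WEIGHTS = {"self_report": 0.35, "llm_inference": 0.20, "interview": 0.20}


def combine(estimates: dict[str, Optional[float]]) -> float:
    """Weighted mean of per-method trait estimates, skipping missing methods."""
    available = {m: s for m, s in estimates.items() if s is not None}
    if not available:
        raise ValueError("no method produced an estimate")
    total_weight = sum(WEIGHTS[m] for m in available)
    # Dividing by the weight actually used keeps the result on the input scale.
    return sum(WEIGHTS[m] * s for m, s in available.items()) / total_weight


# Made-up estimates for one trait.
print(combine({"self_report": 80, "llm_inference": 60, "interview": 70}))
print(combine({"self_report": 80, "llm_inference": 60, "interview": None}))
```

Renormalizing means the reserved 0.25 never drags scores toward zero. It also means self-report's effective share is 0.35 / 0.75, a little under half.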

## The Instrument Battery

Phase 1 established the core: IPIP-NEO-120 (Big Five with 30 facets), CRT-7 (cognitive reflection), NCS-18 (need for cognition), Rosenberg (self-esteem), SD3 (dark triad), PHQ-9 + GAD-7 (depression and anxiety screening), and a conversational interview. Seven instruments, 198 items, roughly 45 minutes.
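The counts are easy to check. Both figures refer to the seven questionnaires, with PHQ-9 and GAD-7 counted separately and the interview left out. The item counts below are the standard lengths of each published scale; the dictionary name is my own.

```python
# Standard item counts for the Phase 1 questionnaires (interview not included).
PHASE_1_ITEMS = {
    "IPIP-NEO-120": 120,
    "CRT-7": 7,
    "NCS-18": 18,
    "Rosenberg": 10,
    "SD3": 27,
    "PHQ-9": 9,
    "GAD-7": 7,
}

print(len(PHASE_1_ITEMS), "instruments,", sum(PHASE_1_ITEMS.values()), "items")
```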

Phase 2 added ten instruments targeting the constructs that Phase 1 couldn't reach. Why seventeen instruments and not seven? Because Big Five domains are too coarse for behavioral prediction, and cognitive style alone doesn't capture how someone handles conflict, forms relationships, or persists through difficulty.

The additions: IPIP-NEO-300 (Big Five at 10 items per facet, providing the granularity needed for behavioral mapping at the facet level), HEXACO-60 (adds Honesty-Humility, absent from the Big Five, among other structural differences), ECR-R (attachment dimensions: anxiety and avoidance), ERQ-10 (emotion regulation strategies), IRI-28 (empathy decomposition into four distinct processes), Self-Monitoring-18 (self-presentation adaptation), LOC IE-4 (locus of control), Grit-S (perseverance and interest consistency), a 48-item RIASEC marker scale (vocational interests across Holland's six domains), and a 9-item Basic Psychological Needs scale (autonomy, competence, relatedness).

Thirty minutes of additional testing, with each instrument selected because it provides data that no other instrument in the battery captures, and because the construct maps onto specific behavioral predictions that an AI assistant could act on.

## From Scores to Behavior

This is where most approaches to personality in AI prompts fail. They convert instrument scores to trait labels and inject the labels into the system prompt. "User is High Openness, Low Agreeableness, Moderate Conscientiousness." The model reads these labels. On this post's working hypothesis, its output barely changes.

The problem is one of category: trait labels describe the *person* but don't instruct the *model*. "High Openness" is a description. "Prefers probabilistic framing over categorical claims" is an instruction. "Skip unnecessary caveats; state conclusions directly" tells the model to behave differently than its default. The first is about the human. The second and third are about the model's output.

The key test for actionability: does this statement change what the LLM outputs? Statements about internal human processes fail this test. "Processes criticism internally before responding" is not actionable because the model cannot observe processing. It has no temporal awareness of what happens between turns. What it can do: "Deliver critical feedback without hedging; don't backpedal if user responds tersely." That's actionable because it specifies model behavior.

Scoring at the facet level makes this conversion possible. The NEO-300 provides 30 facets at 10 items each, enough granularity to distinguish between someone who is high on Openness-to-Ideas but low on Openness-to-Aesthetics, or high on Conscientiousness-Dutifulness but low on Conscientiousness-Order. These distinctions matter for behavioral specification. "Wants thorough analysis" (Ideas) and "doesn't care about formatting" (Order) are facet-level instructions that the domain scores for Openness and Conscientiousness would blur.
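A facet-to-instruction mapping can be as plain as a rule table. The sketch below is a toy version under stated assumptions: the facet keys, the percentile cutoffs of 70 and 30, and the instruction wording are all hypothetical, chosen only to mirror the two examples in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FacetRule:
    facet: str        # hypothetical facet key
    direction: str    # "high" or "low"
    instruction: str  # text aimed at the model, not a description of the user


RULES = [
    FacetRule("openness_ideas", "high", "Give thorough analysis and show the reasoning."),
    FacetRule("conscientiousness_order", "low", "Don't spend effort on formatting polish."),
]


def instructions_for(
    percentiles: dict[str, float], high: float = 70, low: float = 30
) -> list[str]:
    """Return the instructions whose facet score clears the (assumed) cutoff."""
    selected = []
    for rule in RULES:
        score = percentiles.get(rule.facet)
        if score is None:
            continue  # facet not measured, so no instruction is emitted
        if (rule.direction == "high" and score >= high) or (
            rule.direction == "low" and score <= low
        ):
            selected.append(rule.instruction)
    return selected


print(instructions_for({"openness_ideas": 90, "conscientiousness_order": 15}))
```

Mid-range scores emit nothing, which is deliberate in this sketch: an instruction only earns a place in the prompt when the facet is far enough from the middle to justify changing the model's default.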

Validation across methods adds another layer. When self-report and text analysis agree on a trait, confidence is high. When they diverge (someone rates themselves as agreeable but uses confrontational language), the trait is flagged as context-dependent. The behavioral specification reflects this: "defaults to diplomatic framing in novel contexts; shifts to direct confrontation in established relationships." The divergence is information, not noise.
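The agree-or-flag logic can be sketched in a few lines. This assumes both estimates sit on the same 0-100 scale and uses a tolerance of 15 points, a number I picked for illustration; the function name and status labels are hypothetical too.

```python
def flag_divergence(self_report: float, text_based: float, tolerance: float = 15.0) -> dict:
    """Compare two estimates of one trait and label the result.

    The tolerance is an assumed cutoff, not a value from the Psyche pipeline.
    """
    gap = abs(self_report - text_based)
    status = "convergent" if gap <= tolerance else "context_dependent"
    return {"status": status, "gap": gap}


# Made-up Agreeableness estimates from the two methods.
print(flag_divergence(self_report=78, text_based=41))
print(flag_divergence(self_report=62, text_based=58))
```

A flagged trait is not discarded. The gap is kept so the synthesis step can write a conditional instruction instead of a single averaged one.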

## The Persona Model

The synthesis generates a ten-dimension persona model mapping psychometric scores to behavioral predictions:

**Communication** draws from Self-Monitoring (audience adaptation), RIASEC (analogy domains, since an Investigative type thinks in systems while an Artistic type thinks in metaphors), and Big Five Openness facets (depth preference, rationale needs).

**Decision-making** draws from Locus of Control (attribution style), Grit (persistence patterns), and the interview (concrete examples of how decisions actually get made, not how the person thinks they get made).

**Conflict response** draws from ECR-R (responses to criticism that are shaped by attachment), ERQ (regulation strategy under conflict, whether reappraisal or suppression), and HEXACO Honesty-Humility (tolerance for manipulation).

**Motivation** draws from BPNS (which basic needs are satisfied and which are frustrated), Grit subscales (perseverance versus interest consistency, which are distinct), and RIASEC (domain preferences that indicate where effort is most naturally deployed).

**Epistemic style** draws from CRT (analytical versus intuitive tendency), NCS (engagement with effortful thinking), and the interview (how the person actually gathers and evaluates information).

The remaining five dimensions (Interpersonal, Stress Response, Relationships, Flow States, Self-Concept) follow the same pattern: multiple instruments converge on behavioral predictions that no single instrument could produce alone.

The model also extracts characteristic phrases from the interview transcript and behavioral examples from the corpus, which ground the persona in specific language patterns rather than just dimensional scores.
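The five dimensions spelled out above can be written down as a lookup table, which also makes it easy to see which instruments feed more than one dimension. The dictionary below restates only what this section says; the key names and the `dimensions_by_source` helper are illustrative and not the repo's schema.

```python
from collections import defaultdict

# Sources for the five dimensions described in this section.
DIMENSION_SOURCES = {
    "communication": ["Self-Monitoring-18", "RIASEC", "Big Five Openness facets"],
    "decision_making": ["LOC IE-4", "Grit-S", "interview"],
    "conflict_response": ["ECR-R", "ERQ-10", "HEXACO Honesty-Humility"],
    "motivation": ["BPNS", "Grit-S", "RIASEC"],
    "epistemic_style": ["CRT-7", "NCS-18", "interview"],
}


def dimensions_by_source(sources: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert the mapping: for each instrument, list the dimensions it informs."""
    index: dict[str, list[str]] = defaultdict(list)
    for dimension, instruments in sources.items():
        for instrument in instruments:
            index[instrument].append(dimension)
    return dict(index)


by_source = dimensions_by_source(DIMENSION_SOURCES)
shared = {name: dims for name, dims in by_source.items() if len(dims) > 1}
print(shared)
```

The inverted view shows where a single noisy instrument would ripple through several dimensions at once, which is one reason the interview and Grit-S scores deserve extra scrutiny.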

## What It Gets Wrong

Self-report reflects self-perception. Self-perception is not behavior. Someone who genuinely believes they are low in Neuroticism may exhibit high anxiety in contexts they don't recognize as anxiety-producing. I weighted self-report highest (0.35), which means self-perception errors propagate into the final profile.

LLM inference inherits training biases. If the training data associates certain linguistic patterns with personality traits based on demographic correlations in the source literature, the inference will reproduce those correlations regardless of their validity for the specific individual. The r ~ .44 ceiling may partly reflect this bias.

Empath has already paid this price: originally a fourth inference method at weight 0.10, it was demoted to corpus characterization once analysis showed it measures register rather than personality, as covered in the Lexical Analysis section above.

The persona mapping that converts scores to behavioral predictions is heuristic, not ground truth. The mapping from "ECR-R Anxiety = 67, Avoidance = 9" to "Seeks reassurance when stressed but doesn't withdraw; prefers direct emotional engagement" is an interpretive act. It's informed by the clinical literature on attachment dimensions, but it's still an inference about behavior from a score that measures self-reported tendency. The persona model is a useful approximation. It is not a truth.

Some behavioral statements that sound actionable actually describe internal processes the model can't observe. "Experiences a competence gap between felt and demonstrated ability" is not something the model can detect. What it can do: "Don't attribute hesitation to lack of knowledge; the user likely knows more than their confidence suggests." The temporal awareness problem lurks in every behavioral specification that references internal states.

## The Repo

[Psyche is open-source on GitHub](https://github.com/AshitaOrbis/psyche). The web app runs the instrument battery (React 19, TypeScript, Vite, with single-item display navigated by keyboard). The analysis pipeline runs the three inference methods and produces the unified profile (Python, Anthropic SDK as an optional dependency). Empath is included for corpus characterization but excluded from personality synthesis.

You need: a text corpus (the more diverse the better), 75 minutes for the instruments, and an Anthropic API key if you want LLM inference (Empath's corpus characterization runs locally, no API needed).

You get: a structured profile JSON, a 2,000-4,000 word narrative report, and a ~500 word behavioral snippet designed for injection into a system prompt.

## What's Next

We haven't tested whether the behavioral snippet actually changes Claude's responses in measurable ways. The claim that behavioral specifications outperform trait labels is theoretically motivated but empirically unvalidated.

There's a clean experimental design for that. Reverse the Peters, Cerf, and Matz methodology: instead of predicting personality from text, test whether personality context shifts AI behavior toward the profiled person's preferences. Four conditions (no context, trait labels, behavioral snippet, full persona model) rated blind by the profiled person across thirty diverse prompts.
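The blinding step is the part most worth getting right, so here is a minimal sketch of it: for every prompt, the four conditions are shuffled and shown under opaque labels, with the answer key stored separately. The condition identifiers and function name are my own shorthand, and the design document in the repo may structure this differently.

```python
import random

CONDITIONS = ["no_context", "trait_labels", "behavioral_snippet", "full_persona"]


def build_blind_trials(prompts: list[str], seed: int = 0) -> list[dict]:
    """For each prompt, assign the four conditions to labels A-D in random order."""
    rng = random.Random(seed)  # fixed seed so the answer key is reproducible
    trials = []
    for prompt in prompts:
        order = rng.sample(CONDITIONS, len(CONDITIONS))
        trials.append({"prompt": prompt, "key": dict(zip("ABCD", order))})
    return trials


# Thirty placeholder prompts stand in for the real prompt set.
trials = build_blind_trials([f"prompt {i}" for i in range(30)])
print(len(trials) * len(CONDITIONS), "responses to rate")
```

Shuffling per prompt, instead of once for the whole study, keeps the rater from learning that a given label always corresponds to the same condition.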

The prediction: trait labels perform no better than baseline. Behavioral snippets perform substantially better. The full persona model adds marginal improvement over the snippet, if any. The design document is in the repo.

*Update (June 2026): a version of this experiment has since run as PsycheEval ([pilot](/posts/042-psycheeval-pilot), [v0.2](/posts/046-psycheeval-v0_2)). The first prediction was wrong in an instructive way: in the pilot's within-author pairwise judgments, the trait-label condition beat the no-profile baseline rather than matching it. The v0.2 run then added position- and length-bias controls that weakened several first-pass readings, so the condition ranking is still being settled, but "trait labels do nothing" did not survive contact with the data. The battery has also grown from 17 to 39 instruments since this post was written; see the repo for the current tier structure.*

The interesting failure mode would be if trait labels actually do help, if the model can translate "High Openness" into different behavior without explicit instructions. That would mean the bottleneck isn't the label format but the model's ability to operationalize abstract descriptions. Worth testing. Worth being wrong about.