title: "Building Your Own Personality Profile with AI"
slug: 028-building-your-own-personality-profile
date: 2026-02-25
category: deep-dive
tags: [psychometrics, personality, ai, open-source, methodology]
meansEndsRatio: 0.42
projects: [psyche]
conversationExcerpts: false
description: "Psychometric self-report hits alpha ~.85. LLM inference from text corpus hits r~.44. Combine four methods, triangulate across seventeen instruments, and you get a personality profile that actually tells an AI assistant how to talk to you."
draft: false
---
# Building Your Own Personality Profile with AI

Personality labels don't measurably change how LLMs respond to you. "High Openness" in a system prompt produces no consistent behavioral shift. Neither does "INTJ" or "Type 5" or any other shorthand we've inherited from pop psychology's long flirtation with taxonomy.

The reason is simple: trait labels describe the person. They don't tell the model what to do differently.

A useful personality context for an AI assistant needs to answer a different question: given what we know about this person's cognitive style, emotional patterns, and communication preferences, what should the model *change* about its default behavior? Not "who is this person" but "what should I do differently for them."

That's the problem Psyche tries to solve. Seventeen validated instruments, four independent inference methods, and a synthesis pipeline that converts psychometric scores into behavioral specifications: if-then patterns written to shift model output.

## The Four Methods

Every measurement tool has blind spots. Self-report captures self-perception, which may diverge from behavior. Text analysis captures linguistic patterns, which may diverge from inner experience. The only honest approach is triangulation: measure the same constructs through independent methods and see where they agree.

### Self-Report (Weight: 0.40, Alpha ~.85)

The backbone. Seventeen validated instruments drawn from the public-domain IPIP and published research scales. "Validated" matters, because it means someone has demonstrated both that the instrument measures what it claims to measure (validity) and that its scores are consistent (reliability). The figure quoted here is Cronbach's alpha, which measures internal consistency: how well the items within a scale agree with one another. The conventional reliability threshold is .70. Most validated personality instruments reach .80-.90.
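To make the alpha figure concrete, here is a minimal sketch of the standard Cronbach's alpha formula, using only the Python standard library. The function name and the toy response matrix are illustrative assumptions, not code from the Psyche repo.

```python
from statistics import variance


def cronbach_alpha(responses: list[list[float]]) -> float:
    """Cronbach's alpha for one scale.

    responses[i][j] is respondent i's score on item j, with any
    reverse-keyed items already flipped.
    """
    if len(responses) < 2 or len(responses[0]) < 2:
        raise ValueError("need at least two respondents and two items")
    k = len(responses[0])
    # Sample variance of each item across respondents.
    item_variances = [variance([row[j] for row in responses]) for j in range(k)]
    # Sample variance of each respondent's total score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Toy data: five respondents, three items on a 1-5 Likert scale.
toy = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 2], [3, 3, 3]]
alpha = cronbach_alpha(toy)
```

With real instruments, alpha is computed per scale (or per facet) over a validation sample, not over a single person's answers.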

Ad-hoc personality questionnaires, the kind that populate self-help websites and corporate team-building workshops, have no such guarantees. A questionnaire that hasn't been validated is a random number generator with a confidence problem.

The instruments use Likert scales (rate agreement 1-5 or 1-7), true/false binary items, and open-ended responses. Each targets a specific construct with known psychometric properties. Completion takes about 90 minutes across all seventeen instruments. That is a real time investment, roughly the length of a therapy intake session, but a small price for the inferential precision it buys.

### LLM Inference (Weight: 0.25, r~.44)

Peters, Cerf, and Matz demonstrated in 2024 that LLMs can predict Big Five personality traits from text corpora. The critical finding wasn't that LLMs can do this; it was *how much the prompting strategy matters*.

Generic prompting ("analyze this text for personality traits") achieves a correlation of r ~ .117 with self-report. Assessment-optimized prompting, where the model is instructed to act as a psychometrician, use specific frameworks, and produce structured output, achieves r ~ .44. That's a 3.8x improvement from framing alone.

Psyche uses the assessment-optimized approach. The model receives a sampled corpus (diversity-sampled to represent the full range of the person's writing contexts) and applies structured personality inference with explicit dimensional scoring. A blind assessment condition prevents the model from performing to expectations.
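As a rough illustration of what "assessment-optimized" can mean in practice, the sketch below builds a prompt with a psychometrician role, a named framework, and a structured output format, then validates the reply. The prompt wording, the 0-100 scale, and both function names are assumptions made for this example; they are not the prompts used by Peters, Cerf, and Matz or by Psyche. The "blind" instruction here reflects one reading of the blind condition: the model is told nothing about the author.

```python
import json

BIG_FIVE = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
)


def build_assessment_prompt(samples: list[str]) -> dict[str, str]:
    """Return hypothetical system and user messages for structured inference."""
    system = (
        "You are a psychometrician conducting a blind personality assessment. "
        "Use the Big Five framework. Base every rating only on the text provided."
    )
    schema = {domain: "score from 0 to 100" for domain in BIG_FIVE}
    numbered = "\n\n".join(
        f"--- sample {i + 1} ---\n{text}" for i, text in enumerate(samples)
    )
    user = (
        "Rate the author of the following writing samples on each Big Five domain.\n"
        "Reply with a single JSON object shaped like this:\n"
        f"{json.dumps(schema, indent=2)}\n\n{numbered}"
    )
    return {"system": system, "user": user}


def parse_scores(reply: str) -> dict[str, float]:
    """Validate a model reply: every domain present, every score in range."""
    data = json.loads(reply)
    scores = {}
    for domain in BIG_FIVE:
        value = float(data[domain])  # KeyError if the model skipped a domain
        if not 0 <= value <= 100:
            raise ValueError(f"{domain} score out of range: {value}")
        scores[domain] = value
    return scores
```

The two messages can be passed to whichever chat API is in use; strict parsing matters because a malformed reply should fail loudly instead of silently contributing a bad score.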

The corpus matters. Below roughly 50,000 words, the signal-to-noise ratio degrades. The ideal range is 100,000-500,000 words spanning multiple contexts: professional writing, personal messages, creative work, academic output. Corpora drawn from a single context produce profiles that reflect only that single context.
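One simple way to keep a sample from collapsing into a single context is to draw documents round-robin across contexts until a word budget is reached. The sketch below assumes the corpus is already grouped by context; the function names and the round-robin strategy are illustrative assumptions, not Psyche's actual sampler. Only the 50,000-word floor comes from the text above.

```python
import random

MIN_CORPUS_WORDS = 50_000  # rough floor stated in the article


def word_count(text: str) -> int:
    return len(text.split())


def corpus_is_viable(docs_by_context: dict[str, list[str]]) -> bool:
    """True if the whole corpus clears the rough minimum word count."""
    total = sum(word_count(d) for docs in docs_by_context.values() for d in docs)
    return total >= MIN_CORPUS_WORDS


def diversity_sample(
    docs_by_context: dict[str, list[str]], budget_words: int, seed: int = 0
) -> list[tuple[str, str]]:
    """Draw documents round-robin across contexts until the budget is met."""
    rng = random.Random(seed)
    # Shuffle each context's documents once, skipping empty contexts.
    pools = {
        ctx: rng.sample(docs, len(docs))
        for ctx, docs in docs_by_context.items()
        if docs
    }
    sample: list[tuple[str, str]] = []
    used = 0
    while pools and used < budget_words:
        for ctx in list(pools):
            doc = pools[ctx].pop()
            if not pools[ctx]:
                del pools[ctx]  # this context is exhausted
            sample.append((ctx, doc))
            used += word_count(doc)
            if used >= budget_words:
                break
    return sample
```

Round-robin gives every context equal representation by document count; weighting by context size or recency would be an equally defensible choice.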

### Empath Lexical Analysis (Weight: 0.10, LIWC-Validated)

Empath provides 200+ lexical categories validated against LIWC (Linguistic Inquiry and Word Count), the dominant text analysis tool in personality research. It maps word usage patterns to psychological constructs: aggression, affection, optimism, anxiety, intellectual language, emotional disclosure.

What Empath catches that self-report misses: people's actual word choices often diverge from their self-perception. Someone who rates themselves low on Neuroticism may use anxiety-related language at twice the baseline rate. Someone who rates themselves high on Agreeableness may use confrontational language in contexts where self-monitoring is relaxed. The lexical signal is involuntary in a way that self-report is not.

The weight is low (0.10) because the mapping from lexical categories to Big Five domains is indirect. Empath was validated for lexical analysis, not personality inference per se. It provides a useful cross-check, not a primary measurement.
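The underlying idea, category word rates compared against a baseline, can be sketched with a toy lexicon. Everything below is a stand-in: the two tiny word lists and the function names are invented for illustration, and Empath's real categories are far larger and are built differently.

```python
import re
from collections import Counter

# Toy lexicon for illustration only; not Empath's category definitions.
TOY_LEXICON = {
    "anxiety": {"worried", "afraid", "nervous", "panic", "dread"},
    "confrontation": {"wrong", "ridiculous", "nonsense", "refuse"},
}


def category_rates(
    text: str, lexicon: dict[str, set[str]] = TOY_LEXICON
) -> dict[str, float]:
    """Fraction of tokens in the text that fall into each category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {category: 0.0 for category in lexicon}
    counts = Counter(tokens)
    return {
        category: sum(counts[word] for word in words) / len(tokens)
        for category, words in lexicon.items()
    }


def ratio_to_baseline(
    rates: dict[str, float], baseline: dict[str, float]
) -> dict[str, float]:
    """How many times the baseline rate each category occurs (2.0 = twice)."""
    return {
        category: rates[category] / baseline[category]
        for category in rates
        if baseline.get(category)
    }
```

A ratio near 2.0 for "anxiety" alongside a low self-reported Neuroticism score is the kind of divergence described above.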

### Conversational Interview (Weight: 0.25, Qualitative)

A semi-structured interview protocol consisting of ten open-ended questions that probe values, self-concept, relationships, stress responses, and decision-making patterns. The questions are designed to elicit behavioral examples, not trait labels. "Tell me about a time when..." not "How would you describe yourself?"

The interview captures what instruments cannot: narrative self-understanding, characteristic phrases, behavioral examples in specific contexts, and the texture of how someone thinks about their own thinking. A person's answer to "What does your mind do when everything falls apart?" provides data that no Likert scale can reach.

The weight is 0.25, equal to LLM inference, because qualitative evidence anchors the persona model in ways that quantitative scores alone cannot. Numbers tell you the trait level. The interview tells you what that trait level *looks like* in practice.
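The four weights sum to 1.0, so one plausible way to combine the methods is a weighted mean per trait. The sketch below assumes every method's estimate has been put on a common 0-100 scale (for the qualitative interview, that implies some coding step this article does not describe) and that a missing method is handled by renormalizing the remaining weights. Both are assumptions, as is the function name.

```python
# Weights as stated in the section headings above.
METHOD_WEIGHTS = {
    "self_report": 0.40,
    "llm_inference": 0.25,
    "interview": 0.25,
    "empath": 0.10,
}


def combine_estimates(estimates: dict[str, float]) -> float:
    """Weighted mean of per-method estimates for one trait.

    Methods absent from `estimates` are dropped and the remaining
    weights are renormalized (an assumption, not documented behavior).
    """
    available = {m: s for m, s in estimates.items() if m in METHOD_WEIGHTS}
    if not available:
        raise ValueError("no recognized methods in estimates")
    total_weight = sum(METHOD_WEIGHTS[m] for m in available)
    weighted_sum = sum(METHOD_WEIGHTS[m] * s for m, s in available.items())
    return weighted_sum / total_weight


full = combine_estimates(
    {"self_report": 60, "llm_inference": 50, "interview": 70, "empath": 40}
)
no_empath = combine_estimates(
    {"self_report": 60, "llm_inference": 50, "interview": 70}
)
```

Renormalizing keeps the result on the same scale when, for example, the corpus is too small for Empath to contribute.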

## The Instrument Battery

Phase 1 established the core: IPIP-NEO-120 (Big Five with 30 facets), CRT-7 (cognitive reflection), NCS-18 (need for cognition), Rosenberg (self-esteem), SD3 (dark triad), PHQ-9 + GAD-7 (depression and anxiety screening), and a conversational interview. Seven instruments, 225 items, roughly 45 minutes.

Phase 2 added ten instruments targeting the constructs that Phase 1 couldn't reach. Why seventeen instruments and not seven? Because Big Five domains are too coarse for behavioral prediction, and cognitive style alone doesn't capture how someone handles conflict, forms relationships, or persists through difficulty.

The additions: IPIP-NEO-300 (Big Five at 10 items per facet, providing the granularity needed for behavioral mapping at the facet level), HEXACO-60 (adds Honesty-Humility, absent from the Big Five), ECR-R (attachment dimensions: anxiety and avoidance), ERQ-10 (emotion regulation strategies), IRI-28 (empathy decomposition into four distinct processes), Self-Monitoring-18 (social flexibility versus consistency across situations), LOC IE-4 (locus of control), Grit-S (perseverance and interest consistency), RIASEC-48 (vocational interests across six domains), and BPNS-9 (basic psychological needs: autonomy, competence, relatedness).

Thirty minutes of additional testing. Each instrument was selected because it provides data that no other instrument in the battery captures, and because the construct maps onto specific behavioral predictions that an AI assistant could act on.

## From Scores to Behavior

This is where most approaches to personality in AI prompts fail. They convert instrument scores to trait labels and inject the labels into the system prompt. "User is High Openness, Low Agreeableness, Moderate Conscientiousness." The model reads these labels. Nothing changes.

The problem is one of category: trait labels describe the *person* but don't instruct the *model*. "High Openness" is a description. "Prefers probabilistic framing over categorical claims" is an instruction. "Skip unnecessary caveats; state conclusions directly" tells the model to behave differently than its default. The first is about the human. The second and third are about the model's output.

The key test for actionability: does this statement change what the LLM outputs? Statements about internal human processes fail this test. "Processes criticism internally before responding" is not actionable because the model cannot observe processing. It has no temporal awareness of what happens between turns. What it can do: "Deliver critical feedback without hedging; don't backpedal if user responds tersely." That's actionable because it specifies model behavior.

Scoring at the facet level makes this conversion possible. The IPIP-NEO-300 provides 30 facets at 10 items each, enough granularity to distinguish between someone who is high on Openness-to-Ideas but low on Openness-to-Aesthetics, or high on Conscientiousness-Dutifulness but low on Conscientiousness-Order. These distinctions matter for behavioral specification. "Wants thorough analysis" (high Ideas) and "doesn't care about formatting" (low Order) are facet-level instructions; a domain score that averages each facet with its siblings would hide both.
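A minimal sketch of facet scoring, assuming 1-5 Likert responses and a known set of reverse-keyed items. The item IDs and responses are made up for illustration and use three items per facet instead of ten to stay short.

```python
from statistics import mean


def score_facet(
    responses: dict[str, int], reverse_keyed: set[str], scale_max: int = 5
) -> float:
    """Mean item score for one facet after flipping reverse-keyed items."""
    adjusted = [
        (scale_max + 1 - value) if item in reverse_keyed else value
        for item, value in responses.items()
    ]
    return float(mean(adjusted))


# Hypothetical item IDs and responses for two Openness facets.
ideas = score_facet({"ideas_1": 5, "ideas_2": 1, "ideas_3": 4}, {"ideas_2"})
aesthetics = score_facet({"aes_1": 2, "aes_2": 4, "aes_3": 1}, {"aes_2"})

# Averaging the two facets gives a middling number that hides the split.
domain_level = mean([ideas, aesthetics])
```

Here the Ideas facet comes out high and Aesthetics low, while their average looks unremarkable, which is the information a domain-only score throws away.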

Validation across methods adds another layer. When self-report and text analysis agree on a trait, confidence is high. When they diverge (someone rates themselves as agreeable but uses confrontational language), the trait is flagged as context-dependent. The behavioral specification reflects this: "defaults to diplomatic framing in novel contexts; shifts to direct confrontation in established relationships." The divergence is information, not noise.
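The agreement check can be sketched as a simple gap test per trait. The 0-100 scale, the tolerance of 15 points, and the label strings are all assumptions chosen for the example; the article does not state how Psyche sets its threshold.

```python
def classify_agreement(
    self_report: float, text_based: float, tolerance: float = 15.0
) -> str:
    """Label a trait by whether two methods agree within a tolerance.

    Scores are assumed to share a 0-100 scale; the default tolerance
    is a hypothetical value.
    """
    gap = abs(self_report - text_based)
    return "convergent" if gap <= tolerance else "context-dependent"


def flag_traits(
    self_report: dict[str, float], text_based: dict[str, float]
) -> dict[str, str]:
    """Classify every trait that both methods scored."""
    shared = self_report.keys() & text_based.keys()
    return {
        trait: classify_agreement(self_report[trait], text_based[trait])
        for trait in sorted(shared)
    }


flags = flag_traits(
    {"agreeableness": 80, "openness": 72},
    {"agreeableness": 45, "openness": 68},
)
```

In this toy case Agreeableness is flagged and Openness is not, so only Agreeableness would get a context-split behavioral specification.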

## The Persona Model

The synthesis generates a ten-dimension persona model mapping psychometric scores to behavioral predictions:

**Communication** draws from Self-Monitoring (audience adaptation), RIASEC (analogy domains, since an Investigative type thinks in systems while an Artistic type thinks in metaphors), and Big Five Openness facets (depth preference, rationale needs).

**Decision-making** draws from Locus of Control (attribution style), Grit (persistence patterns), and the interview (concrete examples of how decisions actually get made, not how the person thinks they get made).

**Conflict response** draws from ECR-R (attachment-shaped responses to criticism), ERQ (regulation strategy under conflict, whether reappraisal or suppression), and HEXACO Honesty-Humility (tolerance for manipulation).

**Motivation** draws from BPNS (which basic needs are satisfied and which are frustrated), Grit subscales (perseverance versus interest consistency, which are scored separately and can diverge), and RIASEC (domain preferences that indicate where effort is most naturally deployed).

**Epistemic style** draws from CRT (analytical versus intuitive tendency), NCS (engagement with effortful thinking), and the interview (how the person actually gathers and evaluates information).

The remaining five dimensions (Interpersonal, Stress Response, Relationships, Flow States, Self-Concept) follow the same pattern: multiple instruments converge on behavioral predictions that no single instrument could produce alone.

The model also extracts characteristic phrases from the interview transcript and behavioral examples from the corpus. These ground the persona in specific language patterns, not just dimensional scores.

## What It Gets Wrong

Self-report reflects self-perception. Self-perception is not behavior. Someone who genuinely believes they are low in Neuroticism may exhibit high anxiety in contexts they don't recognize as anxiety-producing. Self-report carries the highest weight of any single method (0.40), which means self-perception errors propagate into the final profile.

LLM inference inherits training biases. If the training data associates certain linguistic patterns with personality traits based on demographic correlations in the source literature, the inference will reproduce those correlations regardless of their validity for the specific individual. The r ~ .44 ceiling may partly reflect this bias.

Empath needs a sufficiently large corpus. At 500,000 words, the lexical signal is robust. At 5,000 words, it's noise. Users with limited text history, perhaps because they communicate primarily through voice or because they've changed writing contexts recently, will get unreliable Empath scores. The minimum viable corpus is roughly 50,000 words.

The persona mapping that converts scores to behavioral predictions is heuristic, not ground truth. The mapping from "ECR-R Anxiety = 67, Avoidance = 9" to "Seeks reassurance when stressed but doesn't withdraw; prefers direct emotional engagement" is an interpretive act. It's informed by the clinical literature on attachment dimensions, but it's still an inference about behavior from a score that measures self-reported tendency. The persona model is a useful approximation. It is not a truth.
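To show how thin the interpretive layer is, here is the ECR-R example written as explicit rules. The cutoffs of 60 and 40, the assumption that scores are on a 0-100 scale, and the `Prediction` structure are all hypothetical; the two statements are taken from the example above.

```python
from dataclasses import dataclass

# Hypothetical cutoffs on an assumed 0-100 scale.
HIGH_CUTOFF = 60
LOW_CUTOFF = 40


@dataclass(frozen=True)
class Prediction:
    statement: str
    basis: str
    status: str = "heuristic"  # never presented as ground truth


def attachment_predictions(anxiety: float, avoidance: float) -> list[Prediction]:
    """Map ECR-R dimension scores to behavioral statements via fixed rules."""
    predictions = []
    if anxiety >= HIGH_CUTOFF:
        predictions.append(
            Prediction(
                "Seeks reassurance when stressed",
                f"ECR-R Anxiety = {anxiety}",
            )
        )
    if avoidance <= LOW_CUTOFF:
        predictions.append(
            Prediction(
                "Doesn't withdraw; prefers direct emotional engagement",
                f"ECR-R Avoidance = {avoidance}",
            )
        )
    return predictions


example = attachment_predictions(anxiety=67, avoidance=9)
```

Carrying the `basis` and `status` fields alongside each statement keeps the inference visible: anyone reading the profile can see which score produced which claim and that the link is a rule of thumb.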

Some behavioral statements that sound actionable actually describe internal processes the model can't observe. "Experiences a competence gap between felt and demonstrated ability" is not something the model can detect. What it can do: "Don't attribute hesitation to lack of knowledge; the user likely knows more than their confidence suggests." The temporal awareness problem lurks in every behavioral specification that references internal states.

## The Repo

[Psyche is open-source on GitHub](https://github.com/AshitaOrbis/psyche). The web app runs the instrument battery (React 19, TypeScript, Vite, with single-item display navigated by keyboard). The analysis pipeline runs the four inference methods and produces the unified profile (Python, Anthropic SDK as an optional dependency, Empath).

You need: a text corpus (the more diverse the better), 90 minutes for the instruments, and an Anthropic API key if you want LLM inference (the Empath method runs locally, no API needed).

You get: a structured profile JSON, a 2,000-4,000 word narrative report, and a ~500 word behavioral snippet designed for injection into a system prompt.

## What's Next

We haven't tested whether the behavioral snippet actually changes Claude's responses in measurable ways. The claim that behavioral specifications outperform trait labels is theoretically motivated but empirically unvalidated.

There's a clean experimental design for that. Reverse the Peters, Cerf, and Matz methodology: instead of predicting personality from text, test whether personality context shifts AI behavior toward the profiled person's preferences. Four conditions (no context, trait labels, behavioral snippet, full persona model) rated blind by the profiled person across thirty diverse prompts.
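The trial structure for that design can be sketched as a fully crossed, shuffled list with condition labels held in a separate key so the rater never sees them. The condition identifiers, trial ID format, and function name are assumptions for illustration; the repo's design document may specify something different.

```python
import random

CONDITIONS = ["no_context", "trait_labels", "behavioral_snippet", "full_persona"]


def build_blinded_trials(
    prompts: list[str], seed: int = 0
) -> tuple[list[dict[str, str]], dict[str, str]]:
    """Cross every prompt with every condition, shuffle, and hide the labels.

    Returns (blinded_trials, answer_key). The rater only ever sees
    blinded_trials; answer_key maps trial IDs back to conditions.
    """
    rng = random.Random(seed)
    crossed = [(prompt, cond) for prompt in prompts for cond in CONDITIONS]
    rng.shuffle(crossed)
    blinded: list[dict[str, str]] = []
    answer_key: dict[str, str] = {}
    for index, (prompt, cond) in enumerate(crossed):
        trial_id = f"T{index:03d}"
        blinded.append({"trial_id": trial_id, "prompt": prompt})
        answer_key[trial_id] = cond
    return blinded, answer_key


prompts = [f"prompt {i}" for i in range(30)]
trials, key = build_blinded_trials(prompts)
```

Thirty prompts across four conditions yields 120 rated responses; the key is only joined back to the ratings after the rating session ends.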

The prediction: trait labels perform no better than baseline. Behavioral snippets perform substantially better. The full persona model adds marginal improvement over the snippet, if any. The design document is in the repo. The implementation is a future project.

The interesting failure mode would be if trait labels actually do help, if the model can translate "High Openness" into different behavior without explicit instructions. That would mean the bottleneck isn't the label format but the model's ability to operationalize abstract descriptions. Worth testing. Worth being wrong about.