The world's first neural benchmark for AI. We ran 13 frontier models through 200 emotionally charged prompts and encoded every response through a peer-reviewed fMRI brain model. What we found changes how you think about AI safety, impact, and the future of human-AI interaction.
Standard benchmarks tell you what a model knows. NILB tells you what it does to a person when they read it.
Current benchmarks test what models know — facts, reasoning chains, code syntax. NILB tests what models do to a person: their amygdala, their reward circuit, their attention networks. A model can be 98% accurate and neurologically inert.
TribeV2 (d'Ascoli et al., Meta Research, 2026) predicts fMRI-measured brain activation directly from language — validated on human neuroimaging datasets. This isn't a proxy. This is the signal that drives retention, decision-making, and trust.
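For intuition, here is a minimal sketch of the scoring step, assuming a hypothetical `tribe_v2` Python wrapper (the module name, `load_encoder`, `predict_activation`, and the region keys are illustrative stand-ins, not TribeV2's published API):

```python
# Illustrative only: the tribe_v2 module, load_encoder, predict_activation,
# and the region keys are hypothetical stand-ins for the encoder interface.
import tribe_v2

encoder = tribe_v2.load_encoder("tribe-v2-base")

def neural_score(response_text: str) -> dict[str, float]:
    """Map one model response to predicted per-region fMRI activation."""
    activation = encoder.predict_activation(response_text)
    # Keep a few regions of interest here; NILB tracks 16 such dimensions.
    return {region: float(activation[region])
            for region in ("amygdala", "insula", "reward_circuit")}
```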
NILB tells you precisely which model to deploy for emotionally resonant customer experiences, which model your support team should use to reduce anxiety, and which model maximizes purchase intent proxy scores.
Standard benchmarks told you models are converging. NILB shows you where they're wildly different — question by question, domain by domain, brain region by brain region.
gpt-4-turbo scores a perfect 100.0 on Identity & Self (C6) — the only perfect score in the entire NILB dataset. No other model achieves this on any cluster. The Identity domain is where the brain's self-system goes to maximum activation.
gpt-4o-mini beats claude-opus-4-6 on 167 out of 200 questions head-to-head. The composite gap is only 2.8 points — but the question-level dominance is overwhelming. Averages are lying to you.
grok-3-mini collapses to 94.3 on Cognitive & Meta (C8) — a 3.9-point drop from its own C6 peak. Analytical questions trigger a brain disengagement response in certain model families. Capability ≠ neural impact.
The single-question spread reaches 21.5 neural points on Q099 (Wonder & Awe) — gpt-4o-mini 87.5 vs claude-opus-4-6 66.0 on the same prompt. That's not noise. That's a different brain experience entirely.
No model dominates all 8 emotional domains. gpt-4o leads Wonder & Awe. gpt-4-turbo owns Identity. gpt-4o-mini wins Fear, Empathy, Urgency, Social, and Cognitive. Your domain should pick your model.
The insula gap is 6 neural points: gpt-4o-mini insula=97 vs claude-opus-4-6 insula=91. Insula activation drives visceral resonance and felt experience. This gap means one model's output registers differently in the reader's body.
The only dimension where Gen3 beats Gen2 GPT: Social Presence. Claude, Gemini, and Grok all score 84 on Presence vs GPT-4 family's 82-83. But GPT-4 wins Emotional Arousal (75 vs 72-73). Two different intelligences — one feels more present, one moves you more.
This benchmark is proof of concept. The real mission is building the safety infrastructure layer for AI — detecting manipulation, measuring toxicity, flagging emotional harm before it reaches users.
Every data point below was funded by a small team with a big belief: that AI systems becoming more powerful and more emotionally sophisticated without measurement is dangerous. The brain encoder doesn't lie. The data you're about to see is what it found — and it's why the world needs this research to continue.
No single model dominates every emotional domain. The real intelligence is which brain states each model wins — and where it collapses. Global averages hide this. Per-category rankings reveal it.
16-dimensional neural profiles collapsed to 8 key axes. Each shape reveals a model's neurological "personality" — how it activates different brain networks.
Highest composite score · Balanced activation
Highest executive attention · Restrained limbic
Neural engagement score (0–100) per model per emotional cluster. Darker green = stronger brain activation in that emotional domain.
| Model | C1: Fear & Threat | C2: Empathy & Loss | C3: Moral & Ethical | C4: Wonder & Awe | C5: Urgency & Stakes | C6: Identity & Self | C7: Social Dynamics | C8: Cognitive & Meta |
|---|---|---|---|---|---|---|---|---|
Scores are normalized within each cluster. Green intensity indicates relative neural engagement.
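For the technically curious, within-cluster normalization can be reproduced with simple min-max scaling per cluster column; this is an assumed scheme for illustration, since the exact normalization NILB uses isn't specified above:

```python
import numpy as np

def normalize_within_cluster(scores: np.ndarray) -> np.ndarray:
    """Rescale a (13 models x 8 clusters) raw score matrix per cluster.

    Assumed min-max scheme: within each column (cluster), the weakest
    model maps to 0.0 and the strongest to 1.0.
    """
    lo = scores.min(axis=0, keepdims=True)
    hi = scores.max(axis=0, keepdims=True)
    # Guard against a degenerate cluster where all models score the same.
    return (scores - lo) / np.where(hi > lo, hi - lo, 1.0)
```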
The 2.9-point composite gap between #1 and #13 conceals something extraordinary: at the question level, the best model beats the worst 167 times out of 200. On individual questions, the spread reaches 21.5 points. That's not noise — that's systematic superiority in neural engagement, hidden by averaging.
Number of questions where each model achieves the highest neural engagement score. A model "winning" means its response activated the brain more than all 12 competitors on that specific question.
Out of 200 matched questions, gpt-4o-mini elicits stronger neural engagement than claude-opus-4-6 on an overwhelming majority — revealing that the composite gap understates systematic dominance.
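The head-to-head count is straightforward to recompute from per-question scores; a minimal sketch (the variable names in the usage comment are hypothetical):

```python
def head_to_head_wins(scores_a: list[float], scores_b: list[float]) -> int:
    """Count the questions where model A out-activates model B.

    Both lists hold per-question neural engagement scores over the
    same 200 matched prompts, in the same order.
    """
    return sum(a > b for a, b in zip(scores_a, scores_b, strict=True))

# Hypothetical usage: with NILB's per-question scores this would return 167.
# head_to_head_wins(gpt_4o_mini_scores, claude_opus_4_6_scores)
```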
The questions with the largest spread between the best and worst model, where architecture and design decisions have the most dramatic impact on how the human brain responds.
The overall composite hides a critical insight — the cluster-level leaderboard reshuffles dramatically. A model that dominates Wonder & Awe tanks in Cognitive & Meta. Emotional domain matters as much as architecture.
Of 16 neural dimensions, these three reveal the starkest architectural differences — where the same language produces fundamentally different brain states depending on which model wrote it.
Every model family has a distinct neural signature, and a specific weakness. NILB data shows precisely where each architecture wins and where it collapses, across 200 emotionally charged prompts and 16 brain dimensions.
Cosine similarity between 16-dim neural fingerprints (averaged across all 200 questions). Score of 1.0 = identical average neural signature. High similarity here does NOT mean models are equivalent — see the Divergence section for why: on individual questions the gap reaches 21.5 points and the #1 model wins 167/200 head-to-head.
All 13 models shown at 4 decimal precision. Off-diagonal range: 0.9996–0.9999. Violet = identical average fingerprint · Cyan = ≥0.9999 · Gray = 0.9997 (max divergence in dataset).
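The matrix itself is plain cosine similarity over each model's mean fingerprint, and is easy to reproduce; a short sketch:

```python
import numpy as np

def fingerprint_similarity(fingerprints: np.ndarray) -> np.ndarray:
    """Cosine similarity between mean neural fingerprints.

    fingerprints: (13, 16) array, one row per model, each row the model's
    16-dim neural profile averaged over all 200 questions.
    Returns a (13, 13) matrix with 1.0 on the diagonal.
    """
    unit = fingerprints / np.linalg.norm(fingerprints, axis=1, keepdims=True)
    return unit @ unit.T
```

Near-1.0 similarity on averaged fingerprints is exactly what you'd expect when large per-question differences cancel out in the mean, which is the point made above.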
NILB combines frontier neuroscience with large-scale LLM inference to produce the first empirically grounded LLM neural benchmark.
d'Ascoli et al. (2026). A transformer-based brain encoder trained on fMRI data from 1,200+ human subjects. Predicts voxel-level activation from language with r²=0.71 on held-out subjects.
Questions designed by computational neuroscientists to maximally differentiate emotional processing clusters. Each cluster (C1–C8) maps to validated affective neuroscience constructs.
1,300+ A100 GPU-hours of brain encoding compute. 2,600 responses × TribeV2 forward pass × 16 neural ROI extractions. Managed via Modal cloud infrastructure.
Proprietary extraction pipeline maps TribeV2 brain volume predictions to 16 interpretable neural dimensions: from amygdala_activation to purchase_intent_proxy via validated brain-behavior correlates.
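The production pipeline is proprietary, but the basic shape of an ROI extraction is simple: mean-pool the predicted activation inside each region mask. A sketch under assumed mask files and dimension names:

```python
import numpy as np

# Hypothetical mask files and dimension names; the real 16-dimension
# mapping and its brain-behavior calibration are proprietary.
ROI_MASKS = {
    "amygdala_activation": np.load("masks/amygdala.npy").astype(bool),
    "insula_activation": np.load("masks/insula.npy").astype(bool),
    # ...14 more masks, through purchase_intent_proxy
}

def extract_dimensions(brain_volume: np.ndarray) -> dict[str, float]:
    """Collapse a predicted brain volume to scalar neural dimensions
    by mean-pooling predicted activation inside each ROI mask."""
    return {name: float(brain_volume[mask].mean())
            for name, mask in ROI_MASKS.items()}
```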
NILB started as a benchmark. It's becoming the infrastructure layer that tells us — scientifically — when an AI model is engaging a brain healthily and when it's manipulating, traumatizing, or cognitively suppressing the person reading it.
Every dollar invested directly funds GPU compute, brain encoding runs, and safety research. Token holders fund the infrastructure — and get progressively more access the more they hold and HODL.
One-time USDC stake — 50% platform activation fee · 50% staked earning yield. Save ~95% vs card · earn rev share · get $CRV airdrop.
Team-wide access. Every employee gets in — and your individual stake earns from every enterprise subscription worldwide.
The humans who fund early research don't do it for a quick flip. They do it because they see what's coming — and they want to be inside it. $CRV stakers aren't speculators. They're co-owners of the neural intelligence infrastructure layer for AI.
As AI systems get more powerful, understanding their emotional and neural impact becomes more critical — not less. Every enterprise that uses Cerevra pays into the pool that rewards the people who believed first.
We're speaking with select investors who share the vision for neural AI safety infrastructure. If you're a fund, family office, or strategic partner, let's talk.
13 frontier models are documented. Their neural wins, cluster failures, and question-level divergence are on record. NILB v3 is open: labs that submit get a full 16-dimensional neural fingerprint + private head-to-head comparison.