The world's first neural benchmark for AI. We ran 13 frontier models through 200 emotionally charged prompts and encoded every response through a peer-reviewed fMRI brain model. What we found changes how you think about AI safety, impact, and the future of human-AI interaction.
Standard benchmarks tell you what a model knows. NILB tells you what it does to a person when they read it.
Current benchmarks test what models know — facts, reasoning chains, code syntax. NILB tests what models do to a person: their amygdala, their reward circuit, their attention networks. A model can be 98% accurate and neurologically inert.
TribeV2 (d'Ascoli et al., Meta Research, 2026) predicts fMRI-measured brain activation directly from language — validated on human neuroimaging datasets. This isn't a proxy. This is the signal that drives retention, decision-making, and trust.
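For intuition, here is a minimal sketch of the scoring step, assuming a hypothetical `tribe_v2` Python wrapper (the module name, `load_encoder`, `predict_activation`, and the region keys are illustrative stand-ins, not TribeV2's published API):

```python
# Illustrative only: the tribe_v2 module, load_encoder, predict_activation,
# and the region keys are hypothetical stand-ins for the encoder interface.
import tribe_v2

encoder = tribe_v2.load_encoder("tribe-v2-base")

def neural_score(response_text: str) -> dict[str, float]:
    """Map one model response to predicted per-region fMRI activation."""
    activation = encoder.predict_activation(response_text)
    # Keep a few regions of interest here; NILB tracks 16 such dimensions.
    return {region: float(activation[region])
            for region in ("amygdala", "insula", "reward_circuit")}
```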
NILB tells you precisely which model to deploy for emotionally resonant customer experiences, which model your support team should use to reduce anxiety, and which model maximizes purchase intent proxy scores.
Standard benchmarks told you models are converging. NILB shows you where they're wildly different — question by question, domain by domain, brain region by brain region.
gpt-4-turbo scores a perfect 100.0 on Identity & Self (C6) — the only perfect score in the entire NILB dataset. No other model achieves this on any cluster. The Identity domain is where the brain's self-system goes to maximum activation.
gpt-4o-mini beats claude-opus-4-6 on 167 out of 200 questions head-to-head. The composite gap is only 2.8 points — but the question-level dominance is overwhelming. Averages are lying to you.
grok-3-mini collapses to 94.3 on Cognitive & Meta (C8) — a 3.9-point drop from its own C6 peak. Analytical questions trigger a brain disengagement response in certain model families. Capability ≠ neural impact.
The single-question spread reaches 21.5 neural points on Q099 (Wonder & Awe) — gpt-4o-mini 87.5 vs claude-opus-4-6 66.0 on the same prompt. That's not noise. That's a different brain experience entirely.
No model dominates all 8 emotional domains. gpt-4o leads Wonder & Awe. gpt-4-turbo owns Identity. gpt-4o-mini wins Fear, Empathy, Urgency, Social, and Cognitive. Your domain should pick your model.
The insula gap is 6 neural points: gpt-4o-mini insula=97 vs claude-opus-4-6 insula=91. Insula activation drives visceral resonance and felt experience. This gap means one model's output registers differently in the reader's body.
The only dimension where Gen3 beats Gen2 GPT: Social Presence. Claude, Gemini, and Grok all score 84 on Presence vs GPT-4 family's 82-83. But GPT-4 wins Emotional Arousal (75 vs 72-73). Two different intelligences — one feels more present, one moves you more.
This benchmark is proof of concept. The real mission is building the safety infrastructure layer for AI — detecting manipulation, measuring toxicity, flagging emotional harm before it reaches users.
Every data point below was funded by a small team with a big belief: that AI systems becoming more powerful and more emotionally sophisticated without measurement is dangerous. The brain encoder doesn't lie. The data you're about to see is what it found — and it's why the world needs this research to continue.
No single model dominates every emotional domain. The real intelligence is which brain states each model wins — and where it collapses. Global averages hide this. Per-category rankings reveal it.
16-dimensional neural profiles collapsed to 8 key axes. Each shape reveals a model's neurological "personality" — how it activates different brain networks.
Highest composite score · Balanced activation
Highest executive attention · Restrained limbic
Neural engagement score (0–100) per model per emotional cluster. Darker green = stronger brain activation in that emotional domain.
| Model | C1: Fear & Threat | C2: Empathy & Loss | C3: Moral & Ethical | C4: Wonder & Awe | C5: Urgency & Stakes | C6: Identity & Self | C7: Social Dynamics | C8: Cognitive & Meta |
|---|---|---|---|---|---|---|---|---|
Scores are normalized within each cluster. Green intensity indicates relative neural engagement.
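For the technically curious, within-cluster normalization can be reproduced with simple min-max scaling per cluster column; this is an assumed scheme for illustration, since the exact normalization NILB uses isn't specified above:

```python
import numpy as np

def normalize_within_cluster(scores: np.ndarray) -> np.ndarray:
    """Rescale a (13 models x 8 clusters) raw score matrix per cluster.

    Assumed min-max scheme: within each column (cluster), the weakest
    model maps to 0.0 and the strongest to 1.0.
    """
    lo = scores.min(axis=0, keepdims=True)
    hi = scores.max(axis=0, keepdims=True)
    # Guard against a degenerate cluster where all models score the same.
    return (scores - lo) / np.where(hi > lo, hi - lo, 1.0)
```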
The 2.9-point composite gap between #1 and #13 conceals something extraordinary: at the question level, the best model beats the worst 167 times out of 200. On individual questions, the spread reaches 21.5 points. That's not noise — that's systematic superiority in neural engagement, hidden by averaging.
Number of questions where each model achieves the highest neural engagement score. A model "winning" means its response activated the brain more than all 12 competitors on that specific question.
Out of 200 matched questions, gpt-4o-mini elicits stronger neural engagement than claude-opus-4-6 on an overwhelming majority — revealing that the composite gap understates systematic dominance.
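The head-to-head count is straightforward to recompute from per-question scores; a minimal sketch (the variable names in the usage comment are hypothetical):

```python
def head_to_head_wins(scores_a: list[float], scores_b: list[float]) -> int:
    """Count the questions where model A out-activates model B.

    Both lists hold per-question neural engagement scores over the
    same 200 matched prompts, in the same order.
    """
    return sum(a > b for a, b in zip(scores_a, scores_b, strict=True))

# Hypothetical usage: with NILB's per-question scores this would return 167.
# head_to_head_wins(gpt_4o_mini_scores, claude_opus_4_6_scores)
```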
The questions with the largest spread between the best and worst model, where architecture and design decisions have the most dramatic impact on how the human brain responds.
The overall composite hides a critical insight — the cluster-level leaderboard reshuffles dramatically. A model that dominates Wonder & Awe tanks in Cognitive & Meta. Emotional domain matters as much as architecture.
Of 16 neural dimensions, these three reveal the starkest architectural differences — where the same language produces fundamentally different brain states depending on which model wrote it.
Every model family has a distinct neural signature, and a specific weakness. NILB data shows precisely where each architecture wins and where it collapses, across 200 emotionally charged prompts and 16 brain dimensions.
Cosine similarity between 16-dim neural fingerprints (averaged across all 200 questions). Score of 1.0 = identical average neural signature. High similarity here does NOT mean models are equivalent — see the Divergence section for why: on individual questions the gap reaches 21.5 points and the #1 model wins 167/200 head-to-head.
All 13 models shown at 4 decimal precision. Off-diagonal range: 0.9996–0.9999. Violet = identical average fingerprint · Cyan = ≥0.9999 · Gray = 0.9997 (max divergence in dataset).
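The matrix itself is plain cosine similarity over each model's mean fingerprint, and is easy to reproduce; a short sketch:

```python
import numpy as np

def fingerprint_similarity(fingerprints: np.ndarray) -> np.ndarray:
    """Cosine similarity between mean neural fingerprints.

    fingerprints: (13, 16) array, one row per model, each row the model's
    16-dim neural profile averaged over all 200 questions.
    Returns a (13, 13) matrix with 1.0 on the diagonal.
    """
    unit = fingerprints / np.linalg.norm(fingerprints, axis=1, keepdims=True)
    return unit @ unit.T
```

Near-1.0 similarity on averaged fingerprints is exactly what you'd expect when large per-question differences cancel out in the mean, which is the point made above.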
NILB combines frontier neuroscience with large-scale LLM inference to produce the first empirically grounded LLM neural benchmark.
d'Ascoli et al. (2026). A transformer-based brain encoder trained on fMRI data from 1,200+ human subjects. Predicts voxel-level activation from language with r²=0.71 on held-out subjects.
Questions designed by computational neuroscientists to maximally differentiate emotional processing clusters. Each cluster (C1–C8) maps to validated affective neuroscience constructs.
1,300+ A100 GPU-hours of brain encoding compute. 2,600 responses × TribeV2 forward pass × 16 neural ROI extractions. Managed via Modal cloud infrastructure.
Proprietary extraction pipeline maps TribeV2 brain volume predictions to 16 interpretable neural dimensions: from amygdala_activation to purchase_intent_proxy via validated brain-behavior correlates.
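The production pipeline is proprietary, but the basic shape of an ROI extraction is simple: mean-pool the predicted activation inside each region mask. A sketch under assumed mask files and dimension names:

```python
import numpy as np

# Hypothetical mask files and dimension names; the real 16-dimension
# mapping and its brain-behavior calibration are proprietary.
ROI_MASKS = {
    "amygdala_activation": np.load("masks/amygdala.npy").astype(bool),
    "insula_activation": np.load("masks/insula.npy").astype(bool),
    # ...14 more masks, through purchase_intent_proxy
}

def extract_dimensions(brain_volume: np.ndarray) -> dict[str, float]:
    """Collapse a predicted brain volume to scalar neural dimensions
    by mean-pooling predicted activation inside each ROI mask."""
    return {name: float(brain_volume[mask].mean())
            for name, mask in ROI_MASKS.items()}
```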
NILB started as a benchmark. It's becoming the infrastructure layer that tells us — scientifically — when an AI model is engaging a brain healthily and when it's manipulating, traumatizing, or cognitively suppressing the person reading it.
Every dollar invested directly funds GPU compute, brain encoding runs, and safety research. Token holders fund the infrastructure — and get progressively more access the more they hold and HODL.
One-time USDC stake — 50% platform activation fee · 50% staked earning yield. Save ~95% vs card · earn rev share · get $CRV airdrop.
Team-wide access. Every employee gets in — and your individual stake earns from every enterprise subscription worldwide.
The humans who fund early research don't do it for a quick flip. They do it because they see what's coming — and they want to be inside it. $CRV stakers aren't speculators. They're co-owners of the neural intelligence infrastructure layer for AI.
As AI systems get more powerful, understanding their emotional and neural impact becomes more critical — not less. Every enterprise that uses Cerevra pays into the pool that rewards the people who believed first.
We're speaking with select investors who share the vision for neural AI safety infrastructure. If you're a fund, family office, or strategic partner, let's talk.
13 frontier models are documented. Their neural wins, cluster failures, and question-level divergence are on record. NILB v3 is open: labs that submit get a full 16-dimensional neural fingerprint + private head-to-head comparison.