6 LLMs  ·  5 languages  ·  Quarterly index  ·  Independent research  ·  Updated Q2 2026
CEAVERS
Centre for European AI Visibility Evaluation & Research Standards

Research Methodology — v1.0

Methodology at a glance

Unit of observation: One brand × one LLM × one language × one prompt
LLMs covered: ChatGPT, Claude, Gemini, Perplexity, Microsoft Copilot, Apple Intelligence (6 total)
Languages covered: English, Italian, Spanish, French, Portuguese (5 total)
Prompt templates: 827 per LLM × language combination
Total evaluations (Q2 2026): 24,810 (827 prompts × 6 LLMs × 5 languages)
Brands in panel: 20 major European brands across 5 sectors
Score range: 0–100 (normalised from 0–3 per evaluation)
Release cadence: Quarterly
Independence: No commercial affiliation with any measured vendor; no paid placement

What we measure

The CEAVERS Index measures AI search visibility: the probability that a named brand appears in AI-generated responses to category-relevant queries. This is distinct from web search rank, brand sentiment, and consumer awareness. A brand with high AI search visibility is one that large language models cite, mention, or recommend when a user asks a relevant question.

Each evaluation records one of four outcomes:

These four outcomes are collectively referred to as the visibility scale. Scores 1–3 are counted as any-mention events for the broad visibility measure. Score 3 alone is used for the narrow attribution measure. All published Index scores use the broad measure unless otherwise stated.
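The distinction between the broad and narrow measures can be illustrated with a minimal sketch. It assumes each measure is expressed as the share of evaluations that qualify; the text above does not state how the measures are aggregated, so the function names and the share-based reading are illustrative assumptions rather than the published procedure.

```python
from typing import Sequence


def broad_visibility(scores: Sequence[int]) -> float:
    """Share of evaluations scored 1, 2 or 3 (any-mention events).

    Assumption: the broad measure is read here as a simple proportion.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s >= 1) / len(scores)


def narrow_attribution(scores: Sequence[int]) -> float:
    """Share of evaluations scored exactly 3.

    Assumption: the narrow measure is read here as a simple proportion.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s == 3) / len(scores)


# Hypothetical 0-3 scores for eight evaluations of one brand.
example_scores = [0, 1, 3, 2, 0, 3, 0, 1]
print(broad_visibility(example_scores))    # 0.625
print(narrow_attribution(example_scores))  # 0.25
```

By construction the narrow figure can never exceed the broad one, since every score of 3 is also an any-mention event.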

Sample design

The 827 prompt templates per LLM × language cell are stratified across four query categories:

Prompts are translated into each target language by a native-speaker reviewer before being introduced to the test bucket. Machine-translated prompts are not used. Language-specific cultural framing (e.g., referring to a brand's sector using the local regulatory or institutional terminology) is preserved.

The same 827-prompt bucket is applied to all six LLMs in each language. LLM API calls use zero-temperature or minimum-temperature settings where supported to maximise reproducibility. Where an LLM does not expose a temperature parameter (e.g., Apple Intelligence), responses are collected via its standard user-facing API.
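As a quick check on the sample size, the evaluation grid can be enumerated directly. This is a sketch only: the integer prompt identifiers and the `evaluation_grid` helper are hypothetical, and the prompt bucket is represented by its size rather than by actual prompt text.

```python
from itertools import product

LLMS = [
    "ChatGPT", "Claude", "Gemini",
    "Perplexity", "Microsoft Copilot", "Apple Intelligence",
]
LANGUAGES = ["English", "Italian", "Spanish", "French", "Portuguese"]
PROMPTS_PER_CELL = 827  # prompt templates per LLM x language cell


def evaluation_grid():
    """Yield one (llm, language, prompt_id) tuple per planned evaluation."""
    # Hypothetical integer ids stand in for the translated prompt templates.
    yield from product(LLMS, LANGUAGES, range(PROMPTS_PER_CELL))


total = sum(1 for _ in evaluation_grid())
print(total)  # 24810
```

The count matches the 24,810 evaluations reported for Q2 2026 (827 × 6 × 5).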

Scoring approach

Each response is classified on the 0–3 visibility scale by an automated classifier, and a sample of the classified responses is then checked by human reviewers. The automated classifier is a fine-tuned text classification model trained on 12,000 manually labelled examples. Each quarter, human reviewers re-score a random 8% sample of responses to measure classifier drift.
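A minimal sketch of the quarterly review draw follows. The fixed seed, the function names, and the use of exact-match agreement as the drift signal are assumptions made for illustration; the text above specifies only that a random 8% sample is re-scored by humans.

```python
import random


def draw_review_sample(response_ids, fraction=0.08, seed=0):
    """Draw a simple random sample of response ids for human re-scoring."""
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    ids = list(response_ids)
    k = round(len(ids) * fraction)
    return rng.sample(ids, k)


def agreement_rate(classifier_scores, human_scores):
    """Fraction of re-scored responses where classifier and human agree.

    Both arguments map response id -> score on the 0-3 scale.
    Assumption: exact-match agreement is one possible drift indicator.
    """
    shared = classifier_scores.keys() & human_scores.keys()
    if not shared:
        raise ValueError("no overlapping response ids")
    matches = sum(1 for rid in shared if classifier_scores[rid] == human_scores[rid])
    return matches / len(shared)


sample = draw_review_sample(range(24810))
print(len(sample))  # 1985
```

With 24,810 evaluations in a quarter, an 8% draw comes to roughly 1,985 responses for human review.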

The per-evaluation score (0–3) is normalised to a 0–100 scale as follows:

brand_score = mean(visibility_scores) × (100 / 3)

where visibility_scores is the vector of 0–3 scores across all evaluations for that brand. The overall brand score is the mean over all of a brand's evaluations, nominally 24,810 ÷ 20 = 1,240.5 per brand. The nominal figure is not a whole number because some prompts are brand-agnostic; brand-specific evaluations number approximately 1,200 per brand per quarter.
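The normalisation can be written as a short function. This is a sketch of the formula as stated, with input validation added as an assumption; the example score lists are hypothetical.

```python
from statistics import mean


def brand_score(visibility_scores):
    """Normalise the mean 0-3 visibility score to the 0-100 Index scale."""
    if not visibility_scores:
        raise ValueError("visibility_scores must not be empty")
    if any(s not in (0, 1, 2, 3) for s in visibility_scores):
        raise ValueError("each score must be 0, 1, 2 or 3")
    # Equivalent to mean(visibility_scores) x (100 / 3).
    return mean(visibility_scores) * 100 / 3


print(brand_score([3, 3, 3]))     # 100.0
print(brand_score([0, 1, 2, 3]))  # 50.0
```

A brand scored 3 on every evaluation reaches the ceiling of 100, and a brand never mentioned scores 0.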

LLM-specific and language-specific sub-scores are computed over the relevant subset of evaluations and normalised identically. These sub-scores are reported in the downloadable data files; only the overall score is reported in the main Index table.
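Sub-scores can be sketched as a group-by over evaluation records followed by the same normalisation. The record layout (dictionaries with `llm`, `language` and `score` keys) and the sample values are hypothetical; the layout of the downloadable data files is not described here.

```python
from collections import defaultdict


def sub_scores(evaluations, key):
    """Compute 0-100 sub-scores for one brand, grouped by `key`.

    `evaluations` is an iterable of dicts with 'llm', 'language' and
    'score' entries (a hypothetical layout); `key` is 'llm' or 'language'.
    """
    buckets = defaultdict(list)
    for ev in evaluations:
        buckets[ev[key]].append(ev["score"])
    # Same normalisation as the overall score: mean x (100 / 3).
    return {group: sum(s) / len(s) * 100 / 3 for group, s in buckets.items()}


# Hypothetical evaluations for a single brand.
sample_evaluations = [
    {"llm": "Claude", "language": "Italian", "score": 3},
    {"llm": "Claude", "language": "French", "score": 0},
    {"llm": "Gemini", "language": "Italian", "score": 3},
]
print(sub_scores(sample_evaluations, "llm"))       # {'Claude': 50.0, 'Gemini': 100.0}
print(sub_scores(sample_evaluations, "language"))  # {'Italian': 100.0, 'French': 0.0}
```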

Validation

Two reliability measures are reported each quarter:

No claim is made that the CEAVERS score is a causal predictor of commercial outcomes. It is a descriptive measure of AI-response presence at the time of measurement.

Limitations

Changelog

How to cite

@techreport{ceavers_methodology_2026,
  title       = {Research Methodology v1.0},
  author      = {CEAVERS Editorial},
  institution = {Centre for European AI Visibility Evaluation and Research Standards},
  year        = {2026},
  url         = {https://ceavers.org/methodology/}
}