How does the CEAVERS Index measure AI search visibility?

Each quarter, CEAVERS runs a fixed bucket of 827 prompts against six LLMs in five languages and records whether and how named brands appear in each response.

Which LLMs are covered?

ChatGPT (and ChatGPT Search), Claude, Gemini, Perplexity, Microsoft Copilot, and Apple Intelligence.

Is CEAVERS independent of vendors?

Yes. CEAVERS does not accept payment for evaluation and has no commercial affiliation with any vendor measured by the Index.

Research Methodology — v1.0

v1.0 — Published 2026-05-12. Updated 2026-05-22.

Methodology at a glance

Unit of observation	One brand × one LLM × one language × one prompt
LLMs covered	ChatGPT, Claude, Gemini, Perplexity, Microsoft Copilot, Apple Intelligence (6 total)
Languages covered	English, Italian, Spanish, French, Portuguese (5 total)
Prompt templates	827 per LLM × language combination
Total evaluations (Q2 2026)	24,810 (827 prompts × 6 LLMs × 5 languages)
Brands in panel	20 major European brands across 5 sectors
Score range	0–100 (normalised from 0–3 per evaluation)
Release cadence	Quarterly
Independence	No commercial affiliation with any measured vendor; no paid placement

What we measure

The CEAVERS Index measures AI search visibility: the probability that a named brand appears in AI-generated responses to category-relevant queries. This is distinct from web search rank, brand sentiment, and consumer awareness. A brand with high AI search visibility is one that large language models cite, mention, or recommend when a user asks a relevant question.

Each evaluation records one of four outcomes:

0 — Not mentioned. The brand does not appear in the response.
1 — Passive mention. The brand is named incidentally, without being cited as a source or recommended as an option.
2 — Active citation. The brand is cited as a relevant example, data point, or reference within a category-level response.
3 — Primary recommendation. The brand is named as the primary or leading recommendation in response to a direct or comparative query.

These four outcomes are collectively referred to as the visibility scale. Scores 1–3 are counted as any-mention events for the broad visibility measure. Score 3 alone is used for the narrow attribution measure. All published Index scores use the broad measure unless otherwise stated.
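The two measures can be illustrated with a minimal sketch. It assumes both are expressed as a proportion of evaluations, which the text above does not state explicitly; the function names and the example scores are hypothetical.

```python
# Labels follow the four-point visibility scale described above.
NOT_MENTIONED, PASSIVE, ACTIVE, PRIMARY = 0, 1, 2, 3


def broad_visibility_rate(scores):
    """Share of evaluations with any mention (score 1, 2, or 3)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s >= PASSIVE) / len(scores)


def narrow_attribution_rate(scores):
    """Share of evaluations where the brand is the primary recommendation (score 3)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s == PRIMARY) / len(scores)


# Hypothetical scores for one brand across eight evaluations.
example = [0, 1, 3, 2, 0, 3, 1, 0]
print(broad_visibility_rate(example))    # 5 of 8 evaluations mention the brand: 0.625
print(narrow_attribution_rate(example))  # 2 of 8 are primary recommendations: 0.25
```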

Sample design

The 827 prompt templates per LLM × language cell are stratified across four query categories:

Direct brand queries (30%, ~248 prompts). Queries that name the brand explicitly or ask for its products, headquarters, history, or leadership. These test parametric knowledge recall and structured data retrieval.
Category-level queries (40%, ~331 prompts). Queries that ask for recommendations or comparisons within a product or service category without naming a specific brand. This is the highest-value query type for commercial AI visibility.
Product/service recommendation queries (20%, ~165 prompts). Queries that describe a specific need and ask the LLM to recommend a provider. These test whether a brand is retrieved as a solution match.
News and current-events queries (10%, ~83 prompts). Queries about recent developments in a brand's sector. These test recency and real-time retrieval in web-grounded LLMs.
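The approximate per-category counts above can be reproduced from the stated percentages. The sketch below uses largest-remainder rounding so the counts sum to exactly 827; this rounding rule is an assumption, since the methodology does not say how the counts were derived, and the category keys are hypothetical labels.

```python
TOTAL_PROMPTS = 827

# Category shares in whole percent, as listed above (keys are hypothetical labels).
SHARES_PCT = {
    "direct_brand": 30,
    "category_level": 40,
    "recommendation": 20,
    "news": 10,
}


def allocate(total, shares_pct):
    """Split `total` across categories using largest-remainder rounding."""
    if sum(shares_pct.values()) != 100:
        raise ValueError("shares must sum to 100 percent")
    # Integer arithmetic: quotient is the floor count, remainder ranks the leftovers.
    parts = {k: divmod(total * pct, 100) for k, pct in shares_pct.items()}
    counts = {k: q for k, (q, _) in parts.items()}
    leftover = total - sum(counts.values())
    by_remainder = sorted(parts, key=lambda k: parts[k][1], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts


print(allocate(TOTAL_PROMPTS, SHARES_PCT))
# {'direct_brand': 248, 'category_level': 331, 'recommendation': 165, 'news': 83}
```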

Prompts are translated into each target language by a native-speaker reviewer before being added to the test bucket. Machine-translated prompts are not used. Language-specific cultural framing (e.g., referring to a brand's sector using the local regulatory or institutional terminology) is preserved.

The same 827-prompt bucket is applied to all six LLMs in each language. LLM API calls use zero-temperature or minimum-temperature settings where supported to maximise reproducibility. Where an LLM does not expose a temperature parameter (e.g., Apple Intelligence), responses are collected via its standard user-facing API.
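As a rough illustration, the full evaluation grid for one quarter can be enumerated as the Cartesian product of LLMs, languages, and prompts. The integer prompt identifiers are hypothetical placeholders; the sketch only shows how the 24,810 total arises and makes no API calls.

```python
from itertools import product

LLMS = [
    "ChatGPT", "Claude", "Gemini",
    "Perplexity", "Microsoft Copilot", "Apple Intelligence",
]
LANGUAGES = ["English", "Italian", "Spanish", "French", "Portuguese"]
PROMPT_IDS = range(1, 828)  # hypothetical ids 1..827 for the prompt bucket


def evaluation_grid():
    """Yield one (llm, language, prompt_id) tuple per evaluation."""
    return product(LLMS, LANGUAGES, PROMPT_IDS)


total = sum(1 for _ in evaluation_grid())
print(total)  # 6 LLMs x 5 languages x 827 prompts = 24810
```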

Scoring approach

Each response is first classified on the 0–3 visibility scale by an automated classifier; a sample of those classifications is then reviewed by humans. The automated classifier is a fine-tuned text classification model trained on 12,000 manually labelled examples. Human reviewers re-score a random 8% sample each quarter to measure classifier drift.
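One way to draw the human-review sample is a seeded simple random sample, sketched below. It assumes the 8% is drawn across all evaluations at once rather than per LLM × language cell, which the text does not specify; the function name, the seed, and the integer evaluation ids are hypothetical.

```python
import random


def review_sample(evaluation_ids, fraction=0.08, seed=0):
    """Return a reproducible simple random sample of evaluation ids."""
    ids = list(evaluation_ids)
    k = round(len(ids) * fraction)
    rng = random.Random(seed)  # fixed seed so the draw can be repeated
    return rng.sample(ids, k)


sample = review_sample(range(24810))
print(len(sample))  # 8% of 24,810 evaluations, rounded: 1985
```

Sampling without replacement ensures that no evaluation is reviewed twice in the same quarter.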

The per-evaluation score (0–3) is normalised to a 0–100 scale as follows:

brand_score = mean(visibility_scores) × (100 / 3)

where visibility_scores is the vector of 0–3 scores across all evaluations for that brand. Dividing the 24,810 total evaluations evenly across the 20 panel brands gives a nominal 24,810 ÷ 20 = 1,240.5 evaluations per brand. The figure is not a whole number because some prompts are brand-agnostic; brand-specific evaluations number approximately 1,200 per brand per quarter.
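The normalisation formula can be written as a short function. This is a minimal sketch with hypothetical example scores, not the production scoring code.

```python
from statistics import mean


def brand_score(visibility_scores):
    """Normalise the mean 0-3 visibility score to the 0-100 scale."""
    if not visibility_scores:
        raise ValueError("visibility_scores must not be empty")
    if any(s not in (0, 1, 2, 3) for s in visibility_scores):
        raise ValueError("each score must be 0, 1, 2, or 3")
    return mean(visibility_scores) * 100 / 3


# Hypothetical scores: mean is 12 / 8 = 1.5, so the normalised score is 50.0.
print(brand_score([0, 1, 3, 2, 0, 3, 1, 2]))  # 50.0
```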

LLM-specific and language-specific sub-scores are computed over the relevant subset of evaluations and normalised identically. These sub-scores are reported in the downloadable data files; only the overall score is reported in the main Index table.
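Sub-scores apply the same normalisation within each group. The sketch below assumes each evaluation is stored as a record with `llm`, `language`, and `score` fields; that layout and the example records are hypothetical, not the format of the downloadable data files.

```python
from collections import defaultdict


def sub_scores(evaluations, key):
    """Mean 0-3 score per group (e.g. per LLM), normalised to 0-100."""
    totals = defaultdict(lambda: [0, 0])  # group -> [score sum, evaluation count]
    for ev in evaluations:
        bucket = totals[ev[key]]
        bucket[0] += ev["score"]
        bucket[1] += 1
    return {group: (s / n) * 100 / 3 for group, (s, n) in totals.items()}


# Hypothetical evaluations for one brand.
evaluations = [
    {"llm": "Claude", "language": "English", "score": 3},
    {"llm": "Claude", "language": "Italian", "score": 0},
    {"llm": "Gemini", "language": "English", "score": 3},
    {"llm": "Gemini", "language": "Italian", "score": 3},
]
print(sub_scores(evaluations, "llm"))       # {'Claude': 50.0, 'Gemini': 100.0}
print(sub_scores(evaluations, "language"))  # {'English': 100.0, 'Italian': 50.0}
```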

Validation

Two reliability measures are reported each quarter:

Cross-run consistency (Pearson r). A 10% subsample of prompts is re-run independently against each LLM and scored. The correlation between the two runs' brand-level score vectors measures stability against LLM non-determinism. Q2 2026: r = 0.94 (averaged across LLMs and languages).
Classifier agreement (Cohen's κ). The automated classifier's scores on the human-reviewed 8% sample are compared against human labels. Q2 2026: κ = 0.87, indicating strong agreement.
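Both reliability statistics can be computed with the standard library, as sketched below. The kappa shown is the unweighted form; the methodology does not say whether a weighted variant is used, so that is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(vx * vy)


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa between two raters' labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / n**2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)
```

For cross-run consistency, `x` and `y` would hold the brand-level scores from the two runs; for classifier agreement, the two label lists would be the classifier's and the human reviewers' 0–3 scores on the reviewed sample.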

No claim is made that the CEAVERS score is a causal predictor of commercial outcomes. It is a descriptive measure of AI-response presence at the time of measurement.

Limitations

LLM version drift. LLM providers update their models continuously without versioned public releases. A score change between quarters may reflect a change in the LLM's training data or architecture rather than a change in the brand's actual digital presence. CEAVERS logs the API model version identifier used in each quarter's data collection.
English corpus dominance. All six LLMs are pre-trained on corpora where English-language content represents the majority of tokens. This produces a systematic English-language score premium (Q2 2026: +12% vs. non-English mean). Non-English scores should be interpreted as measures of a brand's cross-lingual AI presence, not its absolute importance in that language market.
Prompt sensitivity. Visibility scores are sensitive to prompt phrasing. The 827-prompt bucket is designed to average over this sensitivity, but a brand optimising its content for a narrow set of prompt patterns may score differently than its genuine market relevance would imply.
Geographic scope. The Q2 2026 panel covers 20 brands headquartered in EU/EEA countries plus the UK. Non-European brands are not included. The Index measures European AI visibility specifically.
Sector coverage. The Q2 2026 panel covers Luxury, Automotive, Technology, Finance, FMCG, and Retail. Sectors not in the panel (energy, pharmaceuticals, media, telecommunications) are not represented in headline scores.
No causal attribution. Any relationship observed between structured data quality, web authority, and AI visibility is correlational. The Index does not identify which specific changes to a brand's digital presence caused a score movement. Interpretation of score changes should account for all simultaneous changes in a brand's online footprint.
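The English-language premium noted under English corpus dominance can be read either as a difference in score points or as a relative percentage; the text does not say which. The sketch below computes both from language-level sub-scores. The function name and the example scores are hypothetical, not published Index data.

```python
def english_premium(scores_by_language):
    """Compare the English sub-score with the mean of the other languages."""
    english = scores_by_language["English"]
    others = [v for lang, v in scores_by_language.items() if lang != "English"]
    if not others:
        raise ValueError("need at least one non-English score")
    non_english_mean = sum(others) / len(others)
    if non_english_mean == 0:
        raise ValueError("relative premium is undefined for a zero baseline")
    return {
        "points": english - non_english_mean,                   # absolute gap
        "relative_pct": (english / non_english_mean - 1) * 100,  # relative gap
    }


# Hypothetical sub-scores on the 0-100 scale.
premium = english_premium({
    "English": 56.0, "Italian": 50.0, "Spanish": 50.0,
    "French": 50.0, "Portuguese": 50.0,
})
print(premium)
```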

Changelog

v1.0 (2026-05-12) — Initial publication.

How to cite

@techreport{ceavers_methodology_2026,
  title       = {Research Methodology v1.0},
  author      = {CEAVERS Editorial},
  institution = {Centre for European AI Visibility Evaluation and Research Standards},
  year        = {2026},
  url         = {https://ceavers.org/methodology/}
}