Research Methodology — v1.0
Methodology at a glance
| Unit of observation | One brand × one LLM × one language × one prompt |
| LLMs covered | ChatGPT, Claude, Gemini, Perplexity, Microsoft Copilot, Apple Intelligence (6 total) |
| Languages covered | English, Italian, Spanish, French, Portuguese (5 total) |
| Prompt templates | 827 per LLM × language combination |
| Total evaluations (Q2 2026) | 24,810 (827 prompts × 6 LLMs × 5 languages) |
| Brands in panel | 20 major European brands across 5 sectors |
| Score range | 0–100 (normalised from 0–3 per evaluation) |
| Release cadence | Quarterly |
| Independence | No commercial affiliation with any measured vendor; no paid placement |
What we measure
The CEAVERS Index measures AI search visibility: the probability that a named brand appears in AI-generated responses to category-relevant queries. This is distinct from web search rank, brand sentiment, and consumer awareness. A brand with high AI search visibility is one that large language models cite, mention, or recommend when a user asks a relevant question.
Each evaluation records one of four outcomes:
- 0 — Not mentioned. The brand does not appear in the response.
- 1 — Passive mention. The brand is named incidentally, without being cited as a source or recommended as an option.
- 2 — Active citation. The brand is cited as a relevant example, data point, or reference within a category-level response.
- 3 — Primary recommendation. The brand is named as the primary or leading recommendation in response to a direct or comparative query.
These four outcomes are collectively referred to as the visibility scale. Scores 1–3 are counted as any-mention events for the broad visibility measure. Score 3 alone is used for the narrow attribution measure. All published Index scores use the broad measure unless otherwise stated.
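The split between the broad and narrow measures can be sketched in a few lines. This is a minimal illustration, assuming each evaluation is stored as an integer from 0 to 3 and that each measure is expressed as a share of evaluations; the function names `broad_rate` and `narrow_rate` are hypothetical and not part of any published CEAVERS tooling.

```python
from typing import Sequence

def broad_rate(scores: Sequence[int]) -> float:
    """Share of evaluations with any mention (score 1, 2, or 3)."""
    return sum(1 for s in scores if s >= 1) / len(scores)

def narrow_rate(scores: Sequence[int]) -> float:
    """Share of evaluations where the brand is the primary recommendation (score 3)."""
    return sum(1 for s in scores if s == 3) / len(scores)

# Made-up scores for one brand across eight evaluations.
example = [0, 0, 1, 2, 2, 3, 3, 0]
print(broad_rate(example))   # 5 of 8 evaluations mention the brand
print(narrow_rate(example))  # 2 of 8 are primary recommendations
```

A score of 3 counts toward both measures, so the narrow rate can never exceed the broad rate.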
Sample design
The 827 prompt templates per LLM × language cell are stratified across four query categories:
- Direct brand queries (30%, ~248 prompts). Queries that name the brand explicitly or ask for its products, headquarters, history, or leadership. These test parametric knowledge recall and structured data retrieval.
- Category-level queries (40%, ~331 prompts). Queries that ask for recommendations or comparisons within a product or service category without naming a specific brand. These are the highest-value query type for commercial AI visibility.
- Product/service recommendation queries (20%, ~165 prompts). Queries that describe a specific need and ask the LLM to recommend a provider. These test whether a brand is retrieved as a solution match.
- News and current-events queries (10%, ~83 prompts). Queries about recent developments in a brand's sector. These test recency and real-time retrieval in web-grounded LLMs.
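The approximate prompt counts in the list above follow from applying each category share to the 827-prompt bucket and rounding to the nearest whole prompt. The sketch below reproduces that arithmetic; the dictionary keys are illustrative labels, not official category identifiers.

```python
TOTAL_PROMPTS = 827

# Category shares in whole percent, as listed above.
SHARES = {
    "direct_brand": 30,
    "category_level": 40,
    "recommendation": 20,
    "news": 10,
}

# Round each stratum to the nearest whole prompt.
allocation = {name: round(TOTAL_PROMPTS * pct / 100) for name, pct in SHARES.items()}

print(allocation)
# The rounded strata should add back up to the full bucket.
assert sum(allocation.values()) == TOTAL_PROMPTS
```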
Prompts are translated into each target language by a native-speaker reviewer before being added to the test bucket. Machine-translated prompts are not used. Language-specific cultural framing (e.g., referring to a brand's sector using the local regulatory or institutional terminology) is preserved.
The same 827-prompt bucket is applied to all six LLMs in each language. LLM API calls use zero-temperature or minimum-temperature settings where supported to maximise reproducibility. Where an LLM does not expose a temperature parameter (e.g., Apple Intelligence), responses are collected via its standard user-facing API.
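Because the same bucket is applied to every LLM in every language, the evaluation grid is a full cross product. The sketch below enumerates it to confirm the Q2 2026 total; the sequential prompt numbering is an assumption made for illustration.

```python
from itertools import product

LLMS = ["ChatGPT", "Claude", "Gemini", "Perplexity", "Microsoft Copilot", "Apple Intelligence"]
LANGUAGES = ["English", "Italian", "Spanish", "French", "Portuguese"]
PROMPT_IDS = range(1, 828)  # 827 prompt templates; the numbering scheme is assumed

# Every (LLM, language, prompt) triple is one cell of the evaluation grid.
grid = list(product(LLMS, LANGUAGES, PROMPT_IDS))
print(len(grid))  # 6 * 5 * 827
```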
Scoring approach
Each response is classified on the 0–3 visibility scale by an automated classifier, and a sample of those classifications is then checked by human reviewers. The automated classifier is a fine-tuned text classification model trained on 12,000 manually labelled examples. Each quarter, human reviewers re-score a random 8% sample of responses to measure classifier drift.
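One way to draw the 8% human-review sample reproducibly is shown below. It is a sketch under stated assumptions: the sample is taken as a simple random draw over all evaluations in the quarter, and a fixed seed is used so the draw can be audited. Neither detail is specified in this methodology, and `review_sample` is a hypothetical name.

```python
import random

def review_sample(evaluation_ids: list[int], fraction: float = 0.08, seed: int = 0) -> list[int]:
    """Draw a reproducible simple random sample of evaluations for human re-scoring."""
    k = round(len(evaluation_ids) * fraction)
    rng = random.Random(seed)  # fixed seed (assumed) so the draw can be repeated
    return sorted(rng.sample(evaluation_ids, k))

ids = list(range(24_810))  # one id per evaluation in the Q2 2026 run
sample = review_sample(ids)
print(len(sample))
```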
The per-evaluation score (0–3) is normalised to a 0–100 scale as follows:
brand_score = mean(visibility_scores) × (100 / 3)
where visibility_scores is the vector of 0–3 scores across all evaluations for that brand. Dividing the 24,810 evaluations evenly across the 20 brands gives a nominal 24,810 ÷ 20 = 1,240.5 evaluations per brand. The figure is not a whole number because some prompts are brand-agnostic; brand-specific evaluations number approximately 1,200 per brand per quarter.
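The normalisation formula above translates directly into code. The sketch below is a minimal version; the function name mirrors the formula, and the empty-input check is an added safeguard rather than something this methodology specifies.

```python
from statistics import mean

def brand_score(visibility_scores: list[int]) -> float:
    """Normalise the mean 0-3 visibility score to the 0-100 Index scale."""
    if not visibility_scores:
        raise ValueError("at least one evaluation is required")
    # Same as mean * (100 / 3); multiplying first keeps the arithmetic exact here.
    return mean(visibility_scores) * 100 / 3

print(brand_score([0, 1, 2, 3]))  # a mean of 1.5 sits at the midpoint of the scale
print(brand_score([3, 3, 3]))     # a brand recommended every time reaches the ceiling
```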
LLM-specific and language-specific sub-scores are computed over the relevant subset of evaluations and normalised identically. These sub-scores are reported in the downloadable data files; only the overall score is reported in the main Index table.
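Sub-scores apply the same normalisation to a filtered subset of evaluations. The sketch below groups a handful of made-up records by LLM and by language; the record layout and the `sub_scores` helper are assumptions for illustration, not the format of the downloadable data files.

```python
from collections import defaultdict
from statistics import mean

# Each record: (llm, language, visibility score 0-3). Values are made up.
records = [
    ("ChatGPT", "English", 3),
    ("ChatGPT", "Italian", 1),
    ("Claude", "English", 2),
    ("Claude", "Italian", 0),
]

def sub_scores(rows, key_index):
    """Group records by one field and normalise each group's mean to 0-100."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return {key: mean(vals) * 100 / 3 for key, vals in groups.items()}

by_llm = sub_scores(records, key_index=0)
by_language = sub_scores(records, key_index=1)
print(by_llm)
print(by_language)
```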
Validation
Two reliability measures are reported each quarter:
- Cross-run consistency (Pearson r). A 10% subsample of prompts is re-run independently against each LLM and scored. The correlation between the two runs' brand-level score vectors measures stability against LLM non-determinism. Q2 2026: r = 0.94 (averaged across LLMs and languages).
- Classifier agreement (Cohen's κ). The automated classifier's scores on the human-reviewed 8% sample are compared against human labels. Q2 2026: κ = 0.87, indicating strong agreement.
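Both reliability statistics can be computed with the standard library alone. The sketch below uses made-up inputs and assumes unweighted Cohen's κ; this methodology does not state whether a weighted variant is used for the ordinal 0–3 scale.

```python
from collections import Counter
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Made-up brand-level scores from two runs, and made-up labels for ten responses.
run_1 = [62.0, 48.5, 71.2, 30.4, 55.0]
run_2 = [60.1, 50.2, 69.8, 33.0, 54.1]
classifier = [0, 1, 2, 3, 0, 1, 2, 3, 0, 0]
human = [0, 1, 2, 3, 0, 1, 2, 2, 0, 1]

print(round(pearson_r(run_1, run_2), 3))
print(round(cohens_kappa(classifier, human), 3))
```

Unlike raw percentage agreement, κ discounts the agreement two raters would reach by chance given how often each uses every label.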
No claim is made that the CEAVERS score is a causal predictor of commercial outcomes. It is a descriptive measure of AI-response presence at the time of measurement.
Limitations
- LLM version drift. LLM providers update their models continuously, often without versioned public releases. A score change between quarters may reflect a change in the LLM's training data or architecture rather than a change in the brand's actual digital presence. CEAVERS logs the API model version identifier used in each quarter's data collection.
- English corpus dominance. All six LLMs are pre-trained on corpora where English-language content represents the majority of tokens. This produces a systematic English-language score premium (Q2 2026: +12% vs. non-English mean). Non-English scores should be interpreted as measures of a brand's cross-lingual AI presence, not its absolute importance in that language market.
- Prompt sensitivity. Visibility scores are sensitive to prompt phrasing. The 827-prompt bucket is designed to average over this sensitivity, but a brand optimising its content for a narrow set of prompt patterns may score differently than its genuine market relevance would imply.
- Geographic scope. The Q2 2026 panel covers 20 brands headquartered in EU/EEA countries plus the UK. Non-European brands are not included. The Index therefore measures the AI visibility of European brands specifically.
- Sector coverage. The Q2 2026 panel covers Luxury, Automotive, Technology, Finance, FMCG, and Retail. Sectors not in the panel (energy, pharmaceuticals, media, telecommunications) are not represented in headline scores.
- No causal attribution. The CEAVERS Index is a descriptive measure. Any association observed between structured data quality, web authority, and AI visibility is correlational. The Index does not identify which specific changes to a brand's digital presence caused a score movement. Interpretation of score changes should account for all simultaneous changes in a brand's online footprint.
Changelog
- v1.0 (2026-05-12) — Initial publication.
How to cite
@techreport{ceavers_methodology_2026,
title = {Research Methodology v1.0},
author = {CEAVERS Editorial},
institution = {Centre for European AI Visibility Evaluation and Research Standards},
year = {2026},
url = {https://ceavers.org/methodology/}
}