Multilingual AI Search in Europe: Citation Patterns Across Languages

AI search is not language-neutral. When a user submits the same query in English, Italian, Spanish, French, and Portuguese, the responses they receive — and the sources cited within them — differ in ways that go beyond translation. The differences reflect the underlying structure of LLM training data: its language distribution, its geographic biases, and the coverage gaps that multilingual knowledge graphs have only partially closed. For European brands whose customers increasingly reach them through AI-generated responses, these differences are commercially material.

The English corpus dominance effect

Every major LLM in commercial deployment today was pre-trained on a corpus in which English-language content made up by far the largest share of tokens. Common Crawl data, the web scrape that underpins most open and commercial training corpora, was approximately 45–55% English by token count in the periods covered by current models (Common Crawl Foundation crawl statistics; Xue et al., mT5, 2021). The four other CEAVERS panel languages (Italian, French, Spanish, and Portuguese), together with German, which is outside the current panel, collectively account for approximately 15%.

This imbalance produces measurable effects in generated responses. When a model answers a query in Italian, it is not simply translating an English response. It generates from a shared cross-lingual internal representation in which Italian-language evidence is substantially thinner. As a result, Italian-language responses draw more heavily on knowledge extracted from English documents about Italian topics, carried over through the model’s multilingual representations. The response may be grammatically correct Italian, but it cites fewer Italian-specific sources and is more likely to default to internationally recognised entities with strong English-language coverage.

CEAVERS Q2 2026 data quantifies this gap for the European panel: Italian-language prompts yield scores averaging 4.8% below the cross-language mean, and Portuguese-language prompts average 10.2% below. French and Spanish, which have larger corpora and stronger Wikipedia coverage, are within 1% of the mean.
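The arithmetic behind a "percent below the cross-language mean" figure can be sketched as follows. This is a minimal illustration, not the CEAVERS method: the scores are invented placeholders rather than panel data, and the sketch assumes the gap is expressed relative to the mean (not in percentage points), which the text above does not specify.

```python
from statistics import mean

# Hypothetical per-language visibility scores. These are NOT CEAVERS data;
# they are round numbers chosen only to make the arithmetic easy to follow.
scores = {"en": 60.0, "fr": 50.0, "es": 50.0, "it": 45.0, "pt": 45.0}


def deviation_from_mean(scores: dict[str, float]) -> dict[str, float]:
    """Return each language's deviation from the cross-language mean, in percent.

    Assumption: the deviation is relative to the mean, i.e.
    (score - mean) / mean * 100, rounded to one decimal place.
    """
    cross_mean = mean(scores.values())
    return {
        lang: round((score - cross_mean) / cross_mean * 100, 1)
        for lang, score in scores.items()
    }


deviations = deviation_from_mean(scores)
for lang, dev in sorted(deviations.items(), key=lambda item: item[1]):
    print(f"{lang}: {dev:+.1f}% vs cross-language mean")
```

Because the deviations are taken against the mean of the same panel, a penalty for some languages implies a matching premium for others, and the deviations sum to zero before rounding.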

Knowledge graph coverage as a mediating factor

The single largest countervailing force against corpus imbalance is multilingual knowledge graph coverage, particularly Wikidata. Unlike Wikipedia, whose language editions are editorially independent and vary dramatically in coverage, Wikidata is a single structured dataset shared across languages. A brand that has a well-populated Wikidata entity (with sitelinks to the relevant national-language Wikipedia articles, P856 official website, P18 image, industry codes, and founding date) is more likely to be correctly resolved across all language contexts, because entity resolution in multilingual RAG systems frequently falls back to structured knowledge bases when unstructured evidence is sparse.
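A brand team could audit its own entity with something like the sketch below. It works offline on an entity dictionary shaped like Wikidata's JSON entity format (top-level "labels", "claims", and "sitelinks" keys). P856 and P18 come from the text above; P452 (industry) and P571 (inception) are the Wikidata properties for industry and founding date. The choice of required properties, the panel language list, and the `coverage_gaps` helper are assumptions made for illustration.

```python
# Properties treated as "well-populated" for this sketch (an assumption):
# P856 official website, P18 image, P452 industry, P571 inception.
REQUIRED_PROPERTIES = ("P856", "P18", "P452", "P571")

# Language codes for the five panel languages discussed in this article.
PANEL_LANGUAGES = ("en", "it", "es", "fr", "pt")


def coverage_gaps(entity: dict) -> dict[str, list[str]]:
    """List missing properties, labels, and Wikipedia sitelinks for an entity.

    `entity` is expected to follow the Wikidata JSON entity layout:
    {"labels": {...}, "claims": {...}, "sitelinks": {...}}.
    """
    claims = entity.get("claims", {})
    labels = entity.get("labels", {})
    sitelinks = entity.get("sitelinks", {})
    return {
        # A property counts as present only if it has at least one statement.
        "missing_properties": [p for p in REQUIRED_PROPERTIES if not claims.get(p)],
        "missing_labels": [lang for lang in PANEL_LANGUAGES if lang not in labels],
        # Wikipedia sitelinks are keyed as "<language code>wiki", e.g. "itwiki".
        "missing_sitelinks": [
            lang for lang in PANEL_LANGUAGES if f"{lang}wiki" not in sitelinks
        ],
    }


# Hypothetical, heavily abbreviated entity for a fictional brand.
example_entity = {
    "labels": {
        "en": {"language": "en", "value": "Example Bank"},
        "it": {"language": "it", "value": "Example Bank"},
    },
    "claims": {"P856": [{}], "P571": [{}]},
    "sitelinks": {"enwiki": {"site": "enwiki", "title": "Example Bank"}},
}

gaps = coverage_gaps(example_entity)
print(gaps)
```

For the fictional entity above, the audit flags the image and industry statements, the Spanish, French, and Portuguese labels, and every non-English Wikipedia sitelink as missing.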

Research on cross-lingual entity linking in retrieval-augmented systems (arXiv:2602.03417) found that adding a Wikidata Q-identifier to structured metadata improved cross-language entity resolution recall by 23–41% depending on the entity type and language pair. For European brands without this identifier, language-pair-specific hallucination risk — the model confabulating details about the entity that are plausible but false — is substantially higher in non-English queries.
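One common way to attach a Wikidata identifier to a brand's structured metadata is schema.org JSON-LD with a `sameAs` link to the entity's Wikidata page. The sketch below assumes that form; the cited study's exact metadata format is not described here. The brand name, URL, and Q-identifier are placeholders, and `organization_jsonld` is a hypothetical helper.

```python
import json


def organization_jsonld(name: str, url: str, wikidata_qid: str) -> str:
    """Build schema.org Organization markup that points at a Wikidata entity."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        # sameAs links the page's entity to its Wikidata item.
        "sameAs": [f"https://www.wikidata.org/wiki/{wikidata_qid}"],
    }
    return json.dumps(data, indent=2, ensure_ascii=False)


# Placeholder values: "Q00000000" is not a real identifier.
markup = organization_jsonld(
    name="Example Bank S.p.A.",
    url="https://www.example.com",
    wikidata_qid="Q00000000",
)
print(markup)
```

The resulting JSON would typically be embedded in a page inside a `script` element of type `application/ld+json`.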

Language-specific citation hierarchy

In English, LLM citation hierarchies for commercial entities follow a reasonably consistent pattern: Wikipedia/Wikidata, major English-language business press (Financial Times, Bloomberg, Reuters), sector-specific trade publications, and the brand’s own structured press releases and annual reports.

In Italian, the hierarchy shifts: Italian-language Wikipedia (where coverage is thinner), Italian financial press (Il Sole 24 Ore, the economics pages of Corriere della Sera), and sector regulators and industry bodies (CONSOB and Banca d’Italia for finance; the trade association ANFIA for automotive). Brands that publish content in Italian and maintain accurate Italian Wikipedia entries gain a systematic advantage in Italian-language AI responses. According to the CEAVERS data, that advantage is enough to partially close the language penalty for brands that invest in it.

Spanish-language citation hierarchies show a similar structure but with a wider geographic scope: Spanish-language responses draw on Latin American sources as well as European ones, which can introduce noise for specifically European brands. A French bank’s Spanish-language AI visibility may be partially determined by its presence in Argentine and Mexican financial media, a relationship that is largely irrelevant to its European market position.

Structural differences in LLM response behaviour by language

Beyond citation hierarchy, there are structural differences in response behaviour by language that affect visibility measurement. English-language responses tend to be longer, cite more sources, and contain more explicit attributions (“according to [Brand]’s 2026 annual report…”). Italian and Portuguese responses tend to be shorter and rely more on implicit knowledge, naming entities without attribution.

This stylistic difference means that direct citation counts understate Italian-language visibility relative to English: an Italian response that names a brand without attribution may reflect knowledge just as confident as an English response that names it with a citation. The CEAVERS scoring rubric accounts for this by reporting two measures: a broad measure of brand mention (categories 1–3) and a narrow measure of attribution (category 3 only). The scores reported in the Q2 2026 release use the broad measure (any mention), which is comparable across languages.
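The relationship between the two measures can be sketched as below. This is an illustration under stated assumptions, not the CEAVERS rubric itself: category 0 ("brand not mentioned") is a hypothetical addition, since the text defines only categories 1–3, and the per-response labels are invented.

```python
from collections import Counter


def visibility_measures(categories: list[int]) -> tuple[float, float]:
    """Return (broad, narrow) visibility rates for a list of response labels.

    Assumed encoding: 0 = brand not mentioned (hypothetical),
    1-3 = brand mentioned, 3 = brand mentioned with explicit attribution.
    """
    if not categories:
        return 0.0, 0.0
    counts = Counter(categories)
    total = len(categories)
    broad = (counts[1] + counts[2] + counts[3]) / total  # any mention
    narrow = counts[3] / total                            # attributed mentions only
    return broad, narrow


# Invented labels for eight responses per language, for illustration only.
italian = [0, 1, 2, 2, 1, 3, 0, 2]
english = [0, 3, 3, 2, 3, 3, 0, 1]

print("it:", visibility_measures(italian))
print("en:", visibility_measures(english))
```

In this invented example both languages mention the brand in six of eight responses, so the broad measure is identical, while the narrow measure differs fourfold. That is the pattern that would make a citation-only count misleading across languages.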

Practical implications for European brands

Three factors — corpus size, knowledge graph coverage, and language-specific publishing — jointly determine a brand’s multilingual AI visibility. Of these, corpus size is outside a brand’s direct control (it reflects the historical internet, not current publication). Knowledge graph coverage (Wikidata, multilingual Wikipedia) is directly controllable and has an outsized effect on non-English performance. Language-specific publishing — press releases, research notes, structured metadata, and editorial content in Italian, Spanish, French, and Portuguese — compounds the knowledge graph effect by providing raw training data for future model generations.

CEAVERS will continue to track multilingual citation patterns quarterly. The Q2 2026 data establishes the baseline from which improvement in non-English AI visibility — one of the most underinvested dimensions of European digital presence — can be measured.