6 LLMs  ·  5 languages  ·  Quarterly index  ·  Independent research  ·  Updated Q2 2026
CEAVERS
Centre for European AI Visibility Evaluation & Research Standards

Glossary

Common Crawl

Last reviewed: 2026-05-22

Common Crawl is a non-profit project that publishes a free monthly crawl of the public web. The resulting corpus seeds the training data of most large language models. Being crawled by CCBot, Common Crawl's crawler, is a foundational prerequisite for training-corpus recall.

What Common Crawl is

Founded in 2007 and crawling since 2008, Common Crawl runs a web crawler (CCBot) that downloads and archives public web content in roughly monthly snapshots. The resulting dataset, tens of billions of web pages in compressed WARC format, is made freely available on Amazon S3. It is the single largest open training dataset for language models and underlies GPT-4, Claude, Gemini, Llama, and most open-source models, either directly or via derived filtered datasets such as C4, The Pile, and RefinedWeb.

The English-language dominance effect

Common Crawl data was approximately 45–55% English by token count in the periods covered by current commercial models. The five major European languages (Italian, French, Spanish, Portuguese, German) collectively account for approximately 15% of tokens. This imbalance has direct, measurable consequences for AI visibility: models trained on Common Crawl know more about entities documented in English than about entities documented mainly in other European languages, and they answer more accurately about the former regardless of the language of the query.

CEAVERS Q2 2026 data quantifies this gap: Italian-language prompts yield brand visibility scores averaging 4.8% below the cross-language mean, and Portuguese-language prompts average 10.2% below. English-language prompts average 12% above the mean. The gap is caused not by the language of the query but by the language mix of the training data.

What this means for European brands

An entity that exists only in Italian-language web content is substantially less likely to be cited by any LLM than an entity with equivalent English-language coverage. This is true even for queries submitted in Italian. The model generates Italian text, but its internal representation of entities is shaped by the English-dominant training corpus.

The practical consequence is that European brands publishing content only in their national language are systematically disadvantaged relative to their English-language counterparts. The cause is not a query-time retrieval failure but thinner underlying parametric knowledge. Publishing in English, building Wikidata entity coverage, and earning cross-language press coverage are the most direct countermeasures.

CCBot and robots.txt

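Common Crawl’s crawler identifies as CCBot/2.0 and is matched in robots.txt by the user-agent token CCBot. To allow inclusion, ensure your robots.txt does not block CCBot, either by name or through a blanket Disallow rule that applies to all crawlers. Blocking CCBot reduces parametric visibility in models trained on future Common Crawl snapshots, and the effect compounds over training cycles.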
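One way to check a robots.txt file against CCBot before deploying it is sketched below, using Python's standard-library robots.txt parser. Both sample files and the example.com URLs are hypothetical, and the standard-library parser may not match CCBot's own rule evaluation in every edge case, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that allows CCBot everywhere and keeps
# /private/ off-limits for all other crawlers.
ALLOWING_ROBOTS_TXT = """\
User-agent: CCBot
Allow: /

User-agent: *
Disallow: /private/
"""

# Hypothetical robots.txt that blocks CCBot from the whole site.
BLOCKING_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /
"""


def ccbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text permits CCBot to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)
```

Calling ccbot_allowed(ALLOWING_ROBOTS_TXT, "https://example.com/products/") reports the page as fetchable, while the same call with BLOCKING_ROBOTS_TXT reports it as blocked. Passing the text of a live robots.txt file instead of the samples gives the same check for a real site.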

Frequently asked

What is Common Crawl?
Common Crawl is a non-profit project that publishes a free monthly crawl of the public web. The resulting corpus seeds the training data of most large language models, including GPT, Claude, Gemini, and most open-source models.
Why does Common Crawl matter for AEO?
If your domain is not in Common Crawl, it is effectively invisible to the training-corpus recall layer of every major LLM. CCBot inclusion is a foundational prerequisite for being cited.
How do I verify my site is in Common Crawl?
Query the Common Crawl URL index at index.commoncrawl.org for your domain to see whether and when it has been crawled. If it is missing, allow CCBot in robots.txt and check again once the next monthly snapshot is published.
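The verification step can be scripted. The following is a minimal sketch, assuming the public URL index is served from index.commoncrawl.org, that each snapshot exposes an endpoint named after its crawl ID with an -index suffix, and that results come back as one JSON object per line. The crawl ID in the usage note is only an example, the function names are illustrative, and treating an HTTP 404 as "no captures" is an assumption about the server's behaviour.

```python
import json
from urllib.error import HTTPError
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed host of the Common Crawl URL index.
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id: str, domain: str) -> str:
    """Build a query URL for every capture under a domain in one snapshot."""
    params = urlencode({"url": f"{domain}/*", "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_captures(body: str) -> list[dict]:
    """Parse a line-delimited JSON response into a list of capture records."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def fetch_captures(crawl_id: str, domain: str) -> list[dict]:
    """Fetch capture records for a domain; an empty list means none found."""
    try:
        with urlopen(build_index_query(crawl_id, domain), timeout=30) as resp:
            return parse_captures(resp.read().decode("utf-8"))
    except HTTPError as err:
        # Assumption: the index answers 404 when a domain has no captures.
        if err.code == 404:
            return []
        raise
```

For example, fetch_captures("CC-MAIN-2024-10", "example.com") would return the captures recorded for example.com in that snapshot, each carrying fields such as the captured URL and a timestamp. Replace the crawl ID with the identifier of the snapshot you want to inspect.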

Related terms