Glossary
Common Crawl
Last reviewed: 2026-05-22
Common Crawl is a non-profit project that publishes a free monthly crawl of the public web. The resulting corpus seeds the training data of most large language models. CCBot inclusion is a foundational prerequisite for training-corpus recall.
What Common Crawl is
Founded in 2007, Common Crawl has collected crawl data since 2008. Its crawler (CCBot) downloads and archives public web content on a roughly monthly schedule. The resulting dataset, billions of web pages per crawl in compressed WARC format, is made freely available on Amazon S3. It is the single largest open training dataset for language models. GPT-3 and Llama document their use of it, and it is widely assumed to feed GPT-4, Claude, Gemini, and most open-source models, either directly or via derived filtered datasets such as C4, The Pile, and RefinedWeb.
The English-language dominance effect
Common Crawl data was approximately 45–55% English by token count in the periods covered by current commercial models. The five major continental European languages (Italian, French, Spanish, Portuguese, German) collectively account for approximately 15% of tokens. This imbalance has direct, measurable consequences for AI visibility: models trained on Common Crawl know more about entities covered in English than about entities covered mainly in other European languages, and they answer more accurately about the former regardless of the query language.
CEAVERS Q2 2026 data quantifies this gap: Italian-language prompts yield brand visibility scores averaging 4.8% below the cross-language mean, and Portuguese-language prompts average 10.2% below. English-language prompts average 12% above the mean. The gap is not caused by language at the query level — it is caused by language at the training-data level.
What this means for European brands
An entity that exists only in Italian-language web content is substantially less likely to be cited by any LLM than an entity with equivalent English-language coverage. This is true even for queries submitted in Italian. The model generates Italian text, but its internal representation of entities is shaped by the English-dominant training corpus.
The practical consequence is that European brands publishing content only in their national language are systematically disadvantaged relative to their English-language counterparts — not because of query-time retrieval failures, but because the underlying parametric knowledge is thinner. Publishing in English, building Wikidata entity coverage, and earning cross-language press coverage are the most direct countermeasures.
CCBot and robots.txt
Common Crawl’s crawler identifies as CCBot/2.0. To allow inclusion, ensure your robots.txt does not block CCBot. Blocking CCBot reduces parametric visibility in models trained on future Common Crawl snapshots — an effect that compounds over training cycles.
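As a minimal sketch of that check, Python's standard-library urllib.robotparser can report whether a given robots.txt permits the CCBot user agent. The sample rules, the helper name ccbot_allowed, and the example.com URLs below are illustrative assumptions, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keeps CCBot out of /private/ only,
# and leaves every other crawler unrestricted.
SAMPLE_ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /private/

User-agent: *
Disallow:
"""


def ccbot_allowed(robots_txt: str, url: str) -> bool:
    """Return True if the given robots.txt text lets CCBot fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("CCBot", url)


if __name__ == "__main__":
    for url in ("https://example.com/", "https://example.com/private/page"):
        print(url, ccbot_allowed(SAMPLE_ROBOTS_TXT, url))
```

To test a live site, the same parser can load the file over HTTP with set_url() and read() instead of parse(). A rule such as "User-agent: CCBot" followed by "Disallow: /" would make the helper return False for every URL.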
Frequently asked
- What is Common Crawl?
- Common Crawl is a non-profit project that publishes a free monthly crawl of the public web. The resulting corpus seeds the training data of most large language models, including GPT, Claude, Gemini, and most open-source models.
- Why does Common Crawl matter for AEO?
- If your domain is not in Common Crawl, it is likely missing from the training data behind most major LLMs, and therefore from what they can recall without live retrieval. CCBot inclusion is a foundational prerequisite for being cited from that parametric knowledge.
- How do I verify my site is in Common Crawl?
- Query the Common Crawl URL index at index.commoncrawl.org for your domain to see whether and when it has been crawled. Allow CCBot in robots.txt and check inclusion in the next monthly snapshot.
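The verification step in the last answer can be sketched in Python, assuming the public URL index served at index.commoncrawl.org and its newline-delimited JSON output. The crawl ID is a placeholder (current IDs are listed at https://index.commoncrawl.org/collinfo.json), and the helper names and sample record are hypothetical. The network fetch is left out so the sketch runs offline.

```python
import json
from urllib.parse import urlencode

# Assumption: placeholder crawl ID; pick a current one from
# https://index.commoncrawl.org/collinfo.json
CRAWL_ID = "CC-MAIN-2024-10"


def build_index_url(domain: str, crawl_id: str = CRAWL_ID, limit: int = 5) -> str:
    """Build a URL-index query for captures of any page under `domain`."""
    params = urlencode({"url": f"{domain}/*", "output": "json", "limit": limit})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"


def parse_captures(body: str) -> list[tuple[str, str]]:
    """Turn a newline-delimited JSON response into (timestamp, url) pairs."""
    records = (json.loads(line) for line in body.splitlines() if line.strip())
    return [(record["timestamp"], record["url"]) for record in records]


if __name__ == "__main__":
    print(build_index_url("example.com"))
    # Hypothetical response line, shaped like an index record.
    sample = '{"timestamp": "20240301120000", "url": "https://example.com/"}\n'
    print(parse_captures(sample))
```

Each returned timestamp indicates when a page was captured in that crawl. A domain with no records in one crawl may still appear in another, so it is worth repeating the query against several crawl IDs.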