Common Crawl Foundation @CommonCrawl
Common Crawl is a non-profit foundation dedicated to the Open Web.
commoncrawl.org
San Francisco, CA
Joined February 2010
Tweets: 1K
Followers: 8K
Following: 2K
Likes: 665
commoncrawl.org/blog/june-2026… commoncrawl.org/blog/host--and… commoncrawl.github.io/cc-crawl-stati… commoncrawl.github.io/cc-webgraph-st…
Hi everyone, Our June 2026 Crawl Archive and corresponding Web Graph are now available. The June 2026 crawl consists of 2.10 billion web pages (or 354 TiB of uncompressed content). Captures are from 40.8 million hosts or 33.6 million registered domains. The corresponding Web Graph release consists of 247.3 million nodes and 6.3 billion edges at the host level, and 121.1 million nodes and 3.9 billion edges at the domain level. Live long and prosper! Luca / @whitenoise
The May 2026 crawl archive (CC-MAIN-2026-21) is now also available on our HF bucket. 🤗 huggingface.co/buckets/common…
Common Crawl Foundation at IIPC-WAC 2026
Common Crawl was well represented with contributions at the 2026 IIPC Web Archiving Conference and General Assembly.
CommonLID Update: New Tools, Growing Impact
CommonLID, a community-built language ID benchmark, has a new website and interactive leaderboard. Its paper was accepted to ACL 2026, with a poster session on 7 July. Source code, a PyPI package, and the dataset are now available.
The Columnar Index Is Now the URL Index
We have renamed the Columnar Index to the URL Index, to be clearer about its purpose and to pave the way for more datasets in a columnar format. commoncrawl.org/blog/the-colum…
Introducing the AI Visibility Audit
A free guide for SEOs and GEOs on how to check whether AI systems can actually reach a site, and how to stay visible in the crawl that trains them.
RSVP and join speakers Laurie Burchell and Pedro Ortiz Suarez from the Common Crawl Foundation and Kostis Saitas Zarkias and Robert Pugh from Mozilla Data Collective for a truly hands-on session. Thursday, June 4th, 6 PM CEST | 12 PM ET | 9 AM PDT. Register via Zoom: zoom.us/meeting/regist…
Under-represented languages deserve better tools! On June 4th, the Common Crawl Foundation and Mozilla Data Collective will host a webinar to test language identification for the languages you care about.
May 2026 Crawl Archive Now Available
We are happy to announce the release of the May 2026 crawl archive, consisting of 2.16 billion web pages, or 365.56 TiB of uncompressed content.
As an early experiment in distributing Common Crawl data through another channel, the April 2026 crawl archive is now available in a Hugging Face Storage Bucket, alongside its existing home on AWS S3.
You can now build directly on Common Crawl from the browser
Browsers can now fetch Common Crawl data directly, with no backend needed. Build SQL explorers, snapshot viewers, and diff tools as static pages.
Have you ever seen a user agent named "CCBot"? If so, you were visited by @CommonCrawl, a non-profit that crawls the internet and publishes an open-source dataset of 10+ petabytes. I think it's beautiful that humanity shares this data. It means that anyone with minimal resources has access to the data required to build their own AI models. It also means we don't have to crawl the entire internet thousands of times, once for each research project, which saves large amounts of bandwidth and resources.
Sorry, now with the actual links. commoncrawl.org/blog/april-202… blog.commoncrawl.org/blog/host--and… commoncrawl.github.io/cc-crawl-stati… commoncrawl.github.io/cc-webgraph-st…
April 2026 Crawl Announcement | April 2026 Web Graph Announcement | Crawl Statistics | Web Graph Statistics
Live long and prosper!
Our April 2026 Crawl Archive and corresponding Web Graph are now available. The April 2026 crawl consists of 2.19 billion web pages (or 379.2 TiB of uncompressed content). Captures are from 43.2 million hosts or 35.4 million registered domains and include 660.5 million new URLs, not visited in any of our prior crawls.
Jeremy Howard @jeremyphoward
319K Followers 7K Following 🇦🇺 Co-founder: @AnswerDotAI/@FastDotAI ; Prev: Professor@UQ; @kaggle founding president; founder @fastmail/@enlitic/… https://t.co/16UBFTX7mo
Antonio García Martí... @antoniogm
225K Followers 17K Following Director, @base growth. Founder @spindl_xyz (acq. @coinbase). Wrote bestseller 'Chaos Monkeys'. This too shall pass 🇺🇸🇪🇸
Michael Nielsen @michael_nielsen
119K Followers 5K Following Searching for the numinous 🇦🇺 🇨🇦, currently live in 🇺🇸 Research @AsteraInstitute https://t.co/maezekzRUb https://t.co/2dWwZKrvrn
Michael L. Nelson @phonedude_mln
2K Followers 957 Following Professor: @WebSciDL @ODUcs @ODUVMASC @ODUDataScience (2002-now); Engineer: @NASA_Langley (1991-2002); Postdoc: @UNCSILS (2000-2001)
Peter Wang 🦋 @pwang
48K Followers 2K Following Chief AI & Co-founder @AnacondaInc; invented @pyscript_dev, @PyData @Bokeh @Datashader. Former physicist. A student of the human condition. bsky: @wang.social
👩‍💻 Paige Bai... @DynamicWebPaige
75K Followers 2K Following ✨ AI should be about empowering humans, building understanding, and making dreams realities. 👩‍💻 DevX Eng. Lead @GoogleDeepMind ex-@GitHub || views = my own!
Chaitanya @ChaitanyaDes45
8 Followers 1K Following Center for Multidisciplinary Education Undergrad @IIT Bombay
SaaS Browser @SaaSBrowser
114 Followers 258 Following Discover great SaaS using the Internet's largest SaaS search engine. What SaaS will you find?
LUNATiHKAL ^_^(,,©,,... @LUNATiHKAL
266 Followers 6K Following ^_^(,,$,,)~* LUNATiHKAL Sound Designer = Martin Flower aka Mort B. Lumen | #Music composer. Traveler. Satirist. Art nerd & science geek. #Psychedelics & #sport.
Roman_Cryptonaut @RomanCryptonaut
9 Followers 1K Following
Mikeumus @mikeumus
2K Followers 2K Following CoFounder of @DivinciAI. Release management and quality assurance platform for custom AIs.
technoprenerd @technoprenerd
29 Followers 686 Following
Divij Chawla @DivijChawla4
23 Followers 246 Following Incoming @AnthropicAI Research Fellow | CS @UW Seattle | SPAR fellow
J Rosser @jrosseruk
1K Followers 2K Following Research Associate with @NeelNanda5 @ Adecco supporting @GoogleDeepMind | ex-@cohere | DPhil @UniofOxford @j_foerst
neşesine @JasonVoorh80815
9 Followers 521 Following
Jackie Singh @HackingButLegal
92K Followers 7K Following Investigative Journalism & Infosec | @KinexisAI | Jobs: @USArmy @GDMS @Mandiant @Intel @JoeBiden @STOPSpyingNY | Seen: BBC, WaPo, Bloomberg, WSJ, NBC, USA Today
Terry Braintree @T_Braintree
35 Followers 224 Following
Bob Kazamakis @kazamakis_
178 Followers 7K Following
Patrick-léon @gkpl0010
81 Followers 1K Following Software Engineering Student, Cybersecurity Enthusiast
Mometic @mometicmobile
2K Followers 519 Following Doing the data scrubbing and algo making so you can find that 10x trade before everyone else. Try the new MOMO Vector indicator or our Aura AI Advisor today.
Didier DeCock @didierdecock
0 Followers 54 Following
Bruch Thibault @BruchThibault
0 Followers 27 Following
Alejandro Sánchez Po... @Asperjasp
155 Followers 4K Following 🇨🇴 CS @UNALOficial & @simg_UNAL 🧑‍🔬 🏥 Growth @aimedic_ia 🎵🎻 Building @JikanleOficial 乐 Connecting music & languages 🇪🇸🇯🇵🇨🇳🇺🇸 Time Counts twice
Nina Mäki-Kihniä @NinaScience
512 Followers 2K Following Translator. Writer. Author. Animal welfare. Scicomm. This is my notebook on some topics.
AE @aert_12
25 Followers 597 Following
Nick Sharkey @sharkey
1K Followers 3K Following Data scientist and related data tomfoolery. Former: strat comms, presidential campaigns, political committees, NBA, airlines. Always: Yooper, Detroit Lions
Gopinath Varadharajan @varadhg65
18 Followers 632 Following
Castani @arteyco
920 Followers 6K Following ⚡ CryptoCulture, GenArt, AI, https://t.co/wNYahEqvq2, https://t.co/O7hbKOV7C8, https://t.co/pqoXZraF05, https://t.co/o8y25YKUNr, DecaGlyph #127:)!
Boris van Buren @borisbegood
2 Followers 31 Following
Yeyito @im_yeyito
25 Followers 295 Following WMs better than DEs, nvim is ok, and threading is almost never the solution.
stechert @stechert
863 Followers 6K Following Wrote search @Splunk, planner/schedulers/vision/datamining @NASAJPL, robotics @UCLA. Opinions mine.
Daniel Lougen @DJLougen
2K Followers 649 Following PhD @ UofT | Visual Neuroscience | Gestalt Labs | Qwen Dev Ambassador
alex @var_vulf
680 Followers 3K Following PE at Amazon, enjoyer of information retrieval and "the humanities" :)
Daniel van Strien @vanstriendaniel
6K Followers 2K Following Machine Learning Librarian @huggingface 🤗 I like datasets.
Luca Baggi @baggiponte
537 Followers 2K Following 📈 AI Engineer @ https://t.co/Du2lQ9AFgU 🗞 I wrote explainers @ilpost 🎓 MSc Econ & Stats @LaStatale 🎓 BA Philosophy @UniBergamo & @SorbonneParis1
𝖯𝖾𝗍𝖾𝗋 ... @outlog
428 Followers 2K Following 𝚝𝚑𝚎 𝚙𝚛𝚘𝚍𝚞𝚌𝚝 𝚐𝚞𝚢, 𝚛̷𝚞̷𝚋̷𝚢̷ 𝚎𝚕𝚒𝚡𝚒𝚛 𝚍𝚎𝚟𝚎𝚕𝚘𝚙𝚎𝚛, 𝚍𝚎𝚟𝚒𝚌𝚎/𝚎𝚎 𝚐𝚎𝚎𝚔, 𝚕𝚎𝚐𝚘 𝚋𝚞𝚒𝚕𝚍𝚎𝚛, 𝚍𝚎𝚟𝚘𝚙𝚜, 𝚖𝚞𝚕𝚝𝚒𝚝𝚎𝚗𝚊𝚗𝚝 𝚜𝚑𝚘𝚙𝚜, 𝚕𝚒𝚟𝚎 𝚘𝚛𝚍𝚎𝚛𝚒𝚗𝚐, 𝚙𝚊𝚢𝚖𝚎𝚗𝚝𝚜, 𝚙𝚑𝚒𝚕𝚘𝚜𝚘𝚙𝚑𝚢
Paweł Wrona @Vrona89
329 Followers 3K Following One of the many researchers whom science has lost to business.
Ask H @scale_effects
33 Followers 174 Following https://t.co/YkU6O5Dvfc - find .AI sites https://t.co/cUcQXATZxj - collection of web data I'm creating https://t.co/J5nERWzfOK - misc other projects/personal page
Ayomide Ayodele-Soyeb... @AyomideA_S
221 Followers 3K Following ALX Software Engineer Graduate | (ISC)² Certified in Cybersecurity | DataCamp Certified Data Scientist | 2x MongoDB Certified
Елисей Лубк... @redgreenfree
6 Followers 796 Following Clairvoyant, tarot reader, psychic, parapsychologist, medium, astrologer, exorcist, homeopath, esotericist, mage. Paid chat consultation: 100 rubles
MoneyLineSolana @MoneyLineSolana
123 Followers 615 Following
François Chollet @fchollet
702K Followers 826 Following Co-founder @ndea. Co-founder @arcprize. Creator of Keras and ARC-AGI. Author of 'Deep Learning with Python'.
Yann LeCun @ylecun
1.2M Followers 787 Following Professor at NYU & Executive Chairman at AMI Labs. Ex-Chief AI Scientist at Meta. Researcher in AI, Machine Learning, Robotics, etc. ACM Turing Award Laureate.
Bojan Tunguz @tunguz
292K Followers 8K Following Founder and CEO @tabul_ai. Creator of @trainxgb. ML ex Nvidia. Data Scientist. Physicist. Catholic. Husband. Father. Stanford Alum. Memelord. e/xgb. AMDG.
Andrew Ng @AndrewYNg
1.6M Followers 1K Following Co-Founder of Coursera; Stanford CS adjunct faculty. Former head of Baidu AI Group/Google Brain. #ai #machinelearning, #deeplearning #MOOCs
Jason Scott @textfiles
52K Followers 633 Following Proprietor of https://t.co/sdyjXHCZF7, historian, filmmaker, archivist, storyteller. Works on/for the Internet Archive. Rank Amateur. Pitiful Man.
Jeff Atwood @codinghorror
263K Followers 2 Following Indoor enthusiast. Co-founder https://t.co/e62S5uByfO / https://t.co/Tuh5wHPHTI. Let’s be kind to each other. I am no longer on twitter. Find me @[email protected]
GitHub @github
2.7M Followers 333 Following The AI-powered developer platform to build, scale, and deliver secure software.
Percy Liang @percyliang
108K Followers 425 Following professor of computer science @Stanford @stanfordnlp, co-founder of @togethercompute, creator of https://t.co/7R5THVogW2, co-founder of @simile_ai, pianist
Internet Archive @internetarchive
458K Followers 1K Following Internet Archive is a non-profit research library preserving web pages, books, movies & audio for public access. Explore web history via the @waybackmachine.
Jimmy Lin @lintool
15K Followers 864 Following I profess CS-ly at @UWaterloo. Previously, I monkeyed code for @Twitter, slides for @Cloudera, and scienced for @yupp_ai.
Mike Hutu @MichaelHutu
112 Followers 228 Following AI Native Dev | Open-source hunter | Focused on building with- and on top of foundation & local models | Keeping up with the AI era |
Tristan Rhodes @tristanbob
4K Followers 8K Following AI engineer, senior vibe coder, open source advocate, network engineer, game dev, trumpeter, pickleball, and dad. https://t.co/kq2xKmtJDl
Eva (Bettina) de Paul... @britneyscripts
1K Followers 522 Following Product Manager | 8+ YOE |MBA Candidate @icmc_usp | UTFPR & ESPM Alumni | Researching Agentic Commerce ✍🏼 https://t.co/DBlZs8mYLs
Andrew Piper @_akpiper
6K Followers 3K Following Using #AI and #NLP to study storytelling at McGillU. Director of .txtlab and author of the new book, Why You Should Read More Fiction.
Judd Burton @burtonbeyond
5K Followers 131 Following Dr. Judd H. Burton is an historian, anthropologist, & archaeologist. His research covers a range of supernatural & conventional topics within the humanities.
Lutfi Zuchri @lutfizuchri
1K Followers 940 Following Nose-picking enthusiast. JRPG aficionado. E-commerce, Digital Strategy, and accidental Data Science guy. ex @TripAdvisor @Teespring
Ariane Bibel 🍵 @ArianeKonzepte
820 Followers 844 Following Communication and media scholar 👋 | Aspiring concept developer | Marketing strategies for: technology, engineering, research & SMEs | Curator of @medienfeed
👽 @ag_dlo
212 Followers 362 Following
Joe Devon @joedevon
6K Followers 6K Following CoFounder: #GAAD (Global Accessibility Awareness Day) 200M+ social media reach https://t.co/OVSwb5wEuN, an accessibility benchmark @a11yaudits my co. @A11yGenAI my podcast
EleutherAI @AiEleuther
28K Followers 102 Following A non-profit research lab focused on interpretability, alignment, and ethics of AI. Creators of Pythia, VQGAN-CLIP, and using SAEs for interp
Matt Perault @MattPerault
3K Followers 631 Following Head of AI Policy @a16z. Fmr director of @UNC_TechPolicy and Facebook public policy.
Snibby @ItsSnibby
26K Followers 11K Following research → conviction → chaos portfolio powered by pure denial 🦅
Tommaso Green @tommasogreen
193 Followers 1K Following PhD Student 👨‍🎓 in NLP 📜 at the University of Mannheim.
Joseph Imperial @ ICM... @josephimperial_
2K Followers 6K Following AI safety, legal compliance, and alignment. Technical Governance Fellow @pivotal_org. UKRI PhD Candidate @ARTAIBath 🇬🇧
Ankur Gupta @getpy
37K Followers 3K Following Python Dev, Parent. Author - https://t.co/5lts7q9z7R Curator - https://t.co/wr74oHNs8O Creator - MapToPoster https://t.co/YQt2CoiupJ 🖖
Quentin Lhoest 🤗 @lhoestq
5K Followers 328 Following Datasets @huggingface | Open Source + HF Dataset Hub
Jason Ho @cod1r
48 Followers 780 Following a nerd for computers aspiring game developer/better programmer 418 I'm a teapot
THE CYBER CHUPA² @jmj29540
2K Followers 5K Following It’s all about living the best life I can and helping others along the journey! I AM A TRUE BELIEVER IN @INTRANA, $INT TOKEN, BUY HERE: https://t.co/RthpeV7Llf
OpenClaw🦞 @openclaw
540K Followers 24 Following The AI that does things. Emails, calendar, home automation, from your favorite chat app. Your machine, your rules. New shell, same lobster soul. 🦞
Stephen Burns @Burnszilla
220 Followers 209 Following
Andrea Volpini @cyberandy
6K Followers 3K Following One of the better-known cyberandy. Passionate about Semantic SEO and AI. I am co-founder and CEO of WordLift and insideout10.
James F Gibbons @jamesfgibbons
2K Followers 2K Following All about SEO, search strategy, ai, saas, product, gpt, ar, demand gen; plus travel. If currently working in the martech saas space
Jess Joyce AI 👩... @jessjoyce
7K Followers 2K Following Organic Growth (SEO, AI Search, AEO? GEO?) 🇨🇦🌲👩‍💻🎧 = helping humans w/ organic marketing (she/her) https://t.co/cIIZZ5rVy9 ...It's very weird right now...
Citizen Zoo @citizen_zoo
116 Followers 735 Following
Chris Long @chris_nectiv
13K Followers 475 Following Co-founder at Nectiv. AEO/SEO for $30M+ ARR B2B and Technology companies.
US Tech Force @USTechForce
14K Followers 167 Following The US Tech Force is an elite corps of top engineering talent building the future of American government technology.
D@RWIN @DarwinSantosNYC
3K Followers 7K Following AI | Product | Tech SEO. I find, build, and share solutions. Founder @addtocartai - @aistudiolab. E-commerce x AI | SEO x AI
Charles Floate 📈 @Charles_SEO
95K Followers 650 Following British SEO 🐐 - Over A Decade In Digital Marketing. BIG THINGS COMING SOON 🏗️
Wei Ping @_weiping
4K Followers 398 Following Distinguished Research Scientist @NVIDIA | LLM post-training, reasoning, agent, and multimodality
George Nimeh @iboy
6K Followers 1K Following TED AI Ambassador :: CEO of N&P, The Agency for Changing Times
Nick Vincent @nickmvincent
1K Followers 903 Following Assistant professor @SFU_CompSci. Data-centric AI and HCI; research to support healthy data markets and ecosystems for better AI and less power concentration.
bookofjoe @bookofjoe
7K Followers 85 Following Blog: https://t.co/3VPJphwntk YouTube: https://t.co/Az73H4XbNe Neurosurgical anesthesia: https://t.co/cRLM67pfIX…
Amy James ☀️ @AmyofAlexandria
1K Followers 1K Following Executive Director @web3wg Cofounder @Alexandria & @OpenIndexProto Buidl for internet freedom ✌️
GreenLine Trading @GreenLineAI
64 Followers 336 Following An easier way to develop, test, and deploy algo trading strategies Be among the first to get access: https://t.co/WHDPBd01Ht
David Ifeoluwa Adelan... @davlanade
3K Followers 1K Following Assistant Professor @mcgillu, Core Academic Member @Mila_Quebec, Canada CIFAR AI Chair @CIFAR_News | interested in multilingual NLP | Disciple of Jesus
The Upsider² @TheUpsiderAI
3K Followers 52 Following The AI agent for @Conste11ation Network and @base!
Elizabeth Salesky @esalesk
1K Followers 810 Following Research Scientist @GoogleDeepMind・PhD @jhuclsp・I like bicycles, tokens, and linguistic diversity・https://t.co/x2ZlH1xWty