
A new web crawler launched by Meta last month is quietly scraping the internet for AI training data

By Kali Hays
August 20, 2024, 6:59 PM ET
Meta CEO Mark Zuckerberg is betting big on AI. Photo: Jason Henry, Bloomberg/Getty Images

Meta has quietly unleashed a new web crawler to scour the internet and collect data en masse to feed its AI models.

The crawler, named the Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or “scrapes,” all the data that is publicly displayed on websites, for example the text in news articles or the conversations in online discussion groups.

A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.

Meta, the parent company of Facebook, Instagram, and WhatsApp, updated a corporate website for developers with a tab disclosing the existence of the new scraper in late July, according to a version history found using the Internet Archive. Besides updating the page, Meta has not publicly announced the new crawler.

A Meta spokesperson said the company has had a crawler under a different name “for years,” although this crawler, dubbed Facebook External Hit, “has been used for different purposes over time, like sharing link previews.”

“Like other companies, we train our generative AI models on content that is publicly available online,” the spokesperson said. “We recently updated our guidance regarding the best way for publishers to exclude their domains from being crawled by Meta’s AI-related crawlers.”

Scraping web data to train AI models is a controversial practice that has led to numerous lawsuits by artists, writers, and others, who say AI companies used their content and intellectual property without their consent. Some AI companies like OpenAI and Perplexity have struck deals in recent months that pay content providers for access to their data (Coins2Day was among several news providers that announced a revenue-sharing deal with Perplexity in July).

Flying under the radar

While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.

To attempt to block a web scraper, a website must publish a robots.txt file, a plain-text file at the root of the site that tells crawlers which parts of the site they should not access. Typically, though, the rules must name the specific scraper bot in order to target it, which is difficult if the name has not been openly disclosed. An operator of a scraper bot can also simply choose to ignore robots.txt; compliance is voluntary, and the file is not enforceable or legally binding in any way.
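As a rough illustration of how a per-bot robots.txt rule works, the sketch below uses Python's standard-library robots.txt parser to check what a well-behaved crawler would conclude from a sample file. The user-agent token `meta-externalagent`, the example URL, and the helper name `is_allowed` are assumptions made for illustration; site owners should confirm the exact token in Meta's developer documentation. The check only models a crawler that chooses to honor the file.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt. "GPTBot" is OpenAI's crawler token; "meta-externalagent"
# is an ASSUMED token for the Meta External Agent, used here for illustration.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/news/story.html"  # hypothetical page
    for agent in ("GPTBot", "meta-externalagent", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, url))
```

In this sample file the two named bots are disallowed everywhere, while any bot not listed by name falls through to the wildcard rule and is allowed. That is why a crawler whose name has not been disclosed is hard to exclude without blocking every bot.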

Such scrapers are used to pull massive amounts of data and written text from the web, to be used as training data for generative AI models, also referred to as large language models or LLMs, and related tools. Meta’s Llama is one of the largest LLMs available, and it powers things like Meta AI, an AI chatbot that now appears on various Meta platforms. While the company did not disclose the training data used for the latest version of the model, Llama 3, the initial version of the model used large datasets put together by other sources, like Common Crawl.

Earlier this year, Mark Zuckerberg, Meta’s cofounder and longtime CEO, boasted on an earnings call that his company’s social platforms had amassed a data set for AI training that was even “greater than the Common Crawl,” an entity that has scraped roughly 3 billion web pages each month since 2011.

The existence of the new crawler suggests Meta’s vast trove of data may no longer be enough, however, as the company continues to work on updating Llama and expanding Meta AI. LLMs typically need new, high-quality training data to keep improving in functionality. Meta is on track to spend up to $40 billion this year, mostly on AI infrastructure and related costs.

Are you a Meta employee or someone with insight or a tip to share? Contact Kali Hays securely through Signal at +1-949-280-0267 or at [email protected].

© 2025 Coins2Day Media IP Limited. All Rights Reserved.