Researchers have revealed that a major AI training data set is likely to contain hundreds of millions of personal documents. It’s a legal and ethical quagmire, says Tim Green, director of MEF’s Data and Identity programme.
Remember when you posted your resume to an online recruiter? Or scanned your birth certificate for a loan application? Or submitted your passport number to an airline?
Did you think that was private?

It may not have been. Take this week, for example.
An academic paper has just revealed that millions of images of passports, credit cards, birth certificates, and other personally identifiable information (PII) have been found in DataComp CommonPool – an open-source training set for image generation scraped from the web.
The researchers looked at just 0.1 percent of the data. But based on this, they reckon the number of scraped PII documents/images is in the hundreds of millions. They checked a sample of the data and confirmed (through LinkedIn and other sites) that it corresponds to real people.
DataComp CommonPool was launched in 2023, and comprises 12.8 billion data samples. It is the follow-up to the LAION-5B data set, which was used to train models including Stable Diffusion and Midjourney. However, the researchers believe that many more models are trained on its data set.
The story serves as a reminder that the law has not yet risen to the challenge posed by AI scraping. And there’s no shared set of ethical assumptions to guide people either. For now, we’re in a grey area. And it all feels very uncomfortable.
This is a complex area. There’s lots to digest. So let’s simplify the topic with a list of observations. That’s what an AI agent would do. (I promise you AI did not write this article. Really.)
Don’t expect training set curators to blur all PII
Curators would prefer to redact PII, but with hundreds of millions of documents in the set, it’s not practical.
Metadata complicates the problem
Even if you could delete PII, the training sets contain metadata that might be used to infer a person’s identity.
It’s hard for users to do their own checks
Ideally, individuals would search a database for their own info and delete it. Again, this is not realistic. And anyway, deleting from a training set only goes so far. What about the models and services already trained on it?
The law is unclear on consent
CommonPool was built on web data scraped from 2014 onwards. How could anyone consent (or not consent) to share data with a service that didn’t exist then?
Regional laws complicate the challenge
Europe has GDPR, California has CCPA. Other regions have nothing. It’s the age-old question: to what extent do laws apply to models based in other locations? And can the training model firms claim the ‘legitimate interest’ defence?
As with many issues, it will probably take a bunch of lawsuits to establish what is the right and legal thing to do about scraping. In 2023, class action lawsuits were brought against Microsoft, OpenAI and Google for alleged misuse of personal information. Meanwhile, the lawsuits filed by aggrieved copyright owners are also piling up.
And it seems as if these actions are having some effect. On July 1, cloud connectivity firm Cloudflare set a precedent when it announced it will block AI crawlers that access content without permission or compensation.
Things are moving fast. AI is moving into the mainstream. Just days ago, OpenAI confirmed it is close to releasing an AI-powered web browser that will challenge Chrome, Safari and the rest. After 30 years of domination, a new model has emerged to challenge web search as the way people find information.
Should this information include people’s scraped home addresses and social security numbers? Obviously not.
Clearly, the industry has a lot of self-examination to do. MEF members are well-placed to play a leading role. We look forward to some very interesting conversations.
Find out more about the themes discussed – Join the MEF ID & Data Interest Group.