
AI Giants Accused of Secretly "Appropriating" Data

ZhangJiaXin Thu, Apr 18 2024 11:06 AM EST

The rapid advancement of artificial intelligence (AI) depends heavily on the data used to train models. However, a scarcity of high-quality data and closed data ecosystems in certain fields appear to be hindering AI development.

According to multiple media reports, companies such as OpenAI, Google, and Meta have been scouring the internet for data to train their latest AI systems, disregarding established policies, deliberately rewriting their own rules, and attempting to circumvent copyright law in the process.

Shortcutting Data Collection

The Financial Times recently pointed out that tech giants have been taking "shortcuts" in collecting training data for their AI systems. OpenAI developed a speech recognition tool called Whisper, which transcribes the audio of YouTube videos into plain-text documents, creating a source of conversational text for training GPT-4, its next-generation text-based model.
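Whisper has been released as an open-source Python package, so the transcription step described in these reports can be illustrated with a minimal sketch; the model size and file name below are placeholder assumptions, not details from the reports:

    # Minimal sketch: turning an audio track into plain text with the
    # open-source openai-whisper package (pip install openai-whisper).
    # The "base" model size and the file name are illustrative assumptions.
    import whisper

    model = whisper.load_model("base")            # load a pretrained Whisper model
    result = model.transcribe("video_audio.mp3")  # transcribe an extracted audio file
    print(result["text"])                         # plain-text transcript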

Business Insider reported that YouTube explicitly prohibits applications outside its ecosystem from using its video content. Moreover, OpenAI's data collection was no accident.

In fact, OpenAI employees were aware that the approach ventured into a legal gray area, and OpenAI president Greg Brockman personally took part in collecting the videos used. Nevertheless, OpenAI considered the practice justifiable and ultimately transcribed more than one million hours of video.

The biggest mystery lies in how OpenAI accessed a sufficient number of YouTube videos to complete this task.

When asked if the company used YouTube videos to train Sora, OpenAI's Chief Technology Officer, Mira Murati, expressed uncertainty. When questioned again about the source of training data, she declined to disclose details.

The New York Times reported that, like OpenAI, Google also transcribed YouTube videos to collect text for its AI models, potentially infringing on the copyrights of video creators. Last year, Google also changed its terms of service, and the motive seems clear: to allow its AI to train on publicly available Google Docs documents and other material, such as restaurant reviews uploaded to Google Maps.

Facing a "Data Bottleneck"

For tech companies, vast quantities of data are the "fertilizer" that nourishes generative AI and the essential raw material for building large models. Only with sufficient data can these systems generate humanlike text, images, audio, and video in real time and continue to improve.

However, as AI develops, the finite amount of information on the internet, the shortage of high-quality text, and the concentration of high-quality data in the hands of a few tech giants may leave AI short of "nutrients." Google and Meta have billions of users who generate search queries and social media posts every day, but much of that data is off-limits for AI training because of privacy laws and the companies' own policies.

These tech companies now find themselves in a tight spot. According to the AI research firm Epoch, they are expected to exhaust the internet's supply of high-quality data by 2026, because they are consuming data faster than it is being produced.

Meta, too, faces limits on the availability of training data. The company has considered measures such as paying for book licenses or even acquiring a large publishing house outright. Privacy-focused changes Meta has made also clearly restrict how it can use consumer data.

In the face of a shortage of human-generated data, many companies are even attempting to feed AI with AI. Companies including Microsoft and OpenAI are feeding the output of large models, known as "synthetic data," to smaller models. However, some research suggests that training on synthetic data can ultimately backfire, with models degrading as they feed on machine-generated content.
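In practice, this usually means using a large model to generate answers and saving them as training examples for a smaller model. The sketch below illustrates the idea using the OpenAI Python SDK and its JSONL fine-tuning format; the model name, prompts, and output file are assumptions for illustration, not details reported about any company's pipeline:

    # Sketch: generating "synthetic data" with a large model and saving it
    # as JSONL training examples for fine-tuning a smaller model.
    # Model name, prompts, and the file name are illustrative assumptions.
    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    prompts = [
        "Explain copyright fair use in one paragraph.",
        "Summarize why high-quality training data is scarce.",
    ]

    with open("synthetic_train.jsonl", "w", encoding="utf-8") as f:
        for prompt in prompts:
            response = client.chat.completions.create(
                model="gpt-4o",  # the larger "teacher" model
                messages=[{"role": "user", "content": prompt}],
            )
            answer = response.choices[0].message.content
            # One fine-tuning example per line, in chat format.
            example = {"messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": answer},
            ]}
            f.write(json.dumps(example, ensure_ascii=False) + "\n")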

Challenged by Copyright Lawsuits

Last year, The New York Times sued OpenAI and Microsoft, alleging that they used its copyrighted news articles without authorization to train AI chatbots. OpenAI and Microsoft responded that the practice constitutes "fair use," permitted under copyright law because the works were transformed for a different purpose.

Last year, more than 10,000 trade groups, authors, companies, and individuals submitted comments to the U.S. Copyright Office regarding the use of creative works by AI models.

The rapid rise of generative AI has sparked a global competition for high-quality data. However, in this new field, there are no clear regulations regarding what is legal or ethical.

Business Insider notes that Google, OpenAI, and other tech companies currently argue that using copyrighted content to train AI models is legal, but regulators and courts have yet to rule on the question.

American filmmaker, former actress, and writer Justine Bateman told the Copyright Office that AI models had taken the content of her works without permission or payment. She called it "the biggest theft in America."