Apple, NVIDIA, Anthropic allegedly used unauthorized YouTube data for AI training: Report


An investigation by Proof News, co-published with Wired, reveals that several tech giants, including Apple, NVIDIA, Anthropic, and others, have employed contentious methods to fuel their AI models, gathering data from books, websites, photos, and social media posts without the creators’ knowledge.

In case you missed it, last year Zoom clarified in updated terms that user data won’t be used for AI training without explicit consent, addressing concerns over potential privacy invasion.

YouTube Subtitles Dataset

Proof News uncovered that these companies utilized subtitles from 173,536 YouTube videos, sourced from over 48,000 channels, despite YouTube’s policies against such data harvesting.

The YouTube Subtitles dataset comprises transcripts from educational channels such as Khan Academy, MIT, and Harvard, along with content from media outlets like The Wall Street Journal and NPR, and entertainment shows including The Late Show and Last Week Tonight.

The collection, totaling 5.7GB in size, comprises 489 million words and encompasses videos from prominent YouTubers like MrBeast and PewDiePie, and even content promoting conspiracy theories like the flat-earth theory.

Impact on Creators

Creators like David Pakman, nearly 160 of whose videos were included in the dataset, expressed frustration over the unauthorized use of their content. Pakman, who produces daily content for his political channel, emphasized the financial and creative investment his work requires and called for compensation from AI companies using his data.

Critics, including Dave Wiskus of Nebula, argue that using creators’ work without consent is unethical and could potentially harm artists and content creators. Concerns also extend to the dataset’s content, which includes profanity and biases that may influence AI models trained on it.

Company Responses

Big tech companies including Apple and NVIDIA acknowledged using the Pile dataset, which contains YouTube Subtitles, to train AI models. Apple used it for its OpenELM model, released shortly before the company announced new AI features for iPhones and MacBooks.

Anthropic defended its use of the Pile dataset, stating that it included only a small subset of YouTube subtitles and that its use was distinct from using the YouTube platform directly, referring further queries to the dataset’s authors.

Salesforce also confirmed using the Pile for AI research purposes, releasing an AI model for public use in 2022. They acknowledged the dataset’s inclusion of profanity and biases against certain groups, highlighting potential vulnerabilities and safety concerns.

In previous interviews, YouTube CEO Neal Mohan and Google CEO Sundar Pichai both affirmed that using video content, including transcripts, to train AI violates YouTube’s terms of service.

Future Implications

AI companies competing for high-quality training data often keep their data sources secret, which heightens ethical concerns about using creators’ content without consent and strengthens calls for regulation and fair compensation.

The use of such datasets underscores ongoing debates about data ethics and copyright in AI development. As AI technologies evolve, questions persist about fair compensation for content used and the responsibility of tech giants in safeguarding creators’ rights.

This investigation highlights the complex landscape where technological advancement intersects with ethical and legal considerations, prompting calls for greater transparency and accountability in AI data sourcing and usage.

Source