Proof News Report: 173,536 YouTube Videos Feed Data Hungry AI at Major Tech Firms—Find Out the Impact

Rahul Somvanshi

Although they still have much room for improvement, artificial intelligence (AI) chatbots continue to impress with their ability to hold fluid conversations, answer questions, and analyze data, among many other tasks. To make this possible, AI companies must train the language models that power their applications on vast amounts of data. The practice is controversial because tech giants are not very forthcoming about where their training data comes from. Now, a Proof News investigation reports that firms including Apple, Anthropic, Nvidia, and Salesforce used data derived from YouTube.

YouTube Subtitles for AI Model Training:

The report states that a nonprofit organization called EleutherAI collected subtitles from 173,536 YouTube videos drawn from more than 48,000 channels. The collected data contains no video imagery, only the raw text of the subtitles, often with translations into different languages, and was used to create a dataset called “YouTube Subtitles.” This dataset includes material from content creators like MrBeast and Marques Brownlee, as well as from educational channels such as Khan Academy, MIT, and Harvard. It is part of “the Pile,” a training set made up of 22 datasets that also includes material from the European Parliament, English Wikipedia, and more.

The Pile is publicly accessible, and a large number of academics and companies have used it for their AI-related work. Among them are the U.S. tech companies named above, which did not take the data from YouTube directly but relied on the work done by EleutherAI to train some of their AI models.


YouTube’s Terms of Service and Data Usage Controversies:

In the second quarter of the year, YouTube CEO Neal Mohan gave a notable response when asked whether he believed OpenAI was training Sora with material from the video platform. Mohan said that while certain YouTube content, such as a video’s title, channel name, or creator’s name, can be scraped so that it appears in search engines, current rules do not permit downloading videos or their transcripts. Doing so, he said, is a “clear violation” of the platform’s terms of service. This raises the question of what role YouTube’s terms of service play in how AI training data is acquired.

Proof News found that identifying the exact source of the videos in the dataset was complex. Its reporters took the video IDs from the dataset and consulted YouTube’s publicly accessible tools to obtain detailed metadata such as titles, channels, and categories.

Companies such as Anthropic and Salesforce have confirmed using training datasets like the Pile but deny any wrongdoing. Nvidia representatives declined to comment, while Apple, Databricks, and Bloomberg did not respond to requests for comment.

This discovery underscores the AI industry’s growing dependence on large amounts of high-quality data to train models that mimic human language. This data comes from a variety of sources, including books, blogs, and, in this case, popular video platforms like YouTube, frequently without the knowledge of the original creators. YouTube, for its part, has said it does not want OpenAI to use its videos to train Sora. The use of YouTube subtitles for AI training has sparked controversy over potential copyright infringement: the content is publicly accessible, but it was not always used with the explicit consent of the people who made it. This raises questions about the ethics and legality of using such data.
