Big Tech's Data Woes: AI Giants Accused of Harvesting Data from Video Platforms

Thu, Mar 21 2024 08:15 AM EST

Source: Global Times

[Global Times Special Correspondents in the US and Germany Feng Yarong, Zhaodong, Global Times Reporters Wang Dong, Zhen Xiang] "There are many experts who believe that OpenAI is using data from public video-sharing websites to train its large models." American business news website Business Insider reported on the 18th that the data acquisition practices of this leading artificial intelligence (AI) startup are sparking controversy. OpenAI is not the only one; recently, several major US tech companies have faced similar disputes. The legality of data sources for training large AI models and the proper boundaries for corporate use of public data are emerging as issues that countries around the world will need to address as they refine their AI regulations.

OpenAI under Fire

Business Insider's article cites Sora, a popular AI-powered video generation tool developed by OpenAI, as an example. Sora's training relies on vast datasets, which are widely believed to have been scraped from Google's video-sharing site YouTube. YouTube, however, has a long-standing policy against using automated tools to download videos in bulk and prohibits the use of YouTube videos for commercial purposes, employing measures such as throttling to combat scraping tools. The article notes that it is unclear what technical means OpenAI has employed to bypass YouTube's safeguards.

OpenAI's Data Sources

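  • Asked what data was used to train Sora, OpenAI CTO Mira Murati said the company used "publicly available and licensed data."
  • But when asked whether that included YouTube video content, she said she was "not sure."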


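  • Large AI models fall into two groups: general-purpose models and vertical, industry-specific models.
  • OpenAI is building a general-purpose model, which scrapes data from public platforms such as YouTube.
  • Copyright in images and video tends to be more clearly defined than copyright in text, so it is more likely to give rise to disputes.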


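  • Many startups are scrambling to collect high-quality data to train AI models.
  • OpenAI has reportedly assigned a "secret team" to obtain training data and does not look closely into where the data comes from.
  • Major tech companies appear to have reached a tacit consensus: as long as they can scrape other people's data, they accept by default that others may do the same.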


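  • This "consensus" may be a hidden risk that the AI industry needs to watch.
  • The rise of generative AI has set off a global technology race, but no clear rules have yet been established on what is legal and ethical.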


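  • There are many concerns about the potential harms of generative AI.
  • Business Insider said the legal disputes described above may drive changes in regulation.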


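  • Some tech giants remain silent about the sources of the vast amounts of data used to train AI.
  • A bill introduced in the US Congress, the AI Foundation Model Transparency Act, would require disclosure of training data sources.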


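  • Regulations on acquiring data to train large AI models differ from country to country.
  • Some lean toward openness of information, while others lean toward information security.
  • The common ground is that the data selected must not include personal private data.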


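  • China issued regulations on the management of large AI models last year.
  • Europe has introduced the Artificial Intelligence Act to ensure that the use of AI does not infringe fundamental rights.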


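  • Unlike in China, most large AI models in the United States are trained on internal data.
  • The United States is a case-law country, where data can be purchased from data platforms or scraped from publicly available sources.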