Source: Global Times
[Global Times Special Correspondents in the US and Germany Feng Yarong, Zhaodong, Global Times Reporters Wang Dong, Zhen Xiang] "There are many experts who believe that OpenAI is using data from public video-sharing websites to train its large models." American business news website Business Insider reported on the 18th that the data acquisition practices of this leading artificial intelligence (AI) startup are sparking controversy. OpenAI is not the only one; recently, several major US tech companies have faced similar disputes. The legality of data sources for training large AI models and the proper boundaries for corporate use of public data are emerging as issues that countries around the world will need to address as they refine their AI regulations.
OpenAI under Fire
Business Insider's article cites Sora, a popular AI-powered video generation tool developed by OpenAI, as an example. Sora's training relies on vast datasets, which are widely believed to have been scraped from Google's video-sharing site YouTube. In fact, YouTube has a long-standing policy against the use of automated tools to download videos in bulk and prohibits the use of YouTube videos for commercial purposes, employing measures like throttling to combat scraping tools. The article notes that it is unclear what technical means OpenAI has employed to bypass YouTube's safeguards.
OpenAI's Data Sources
Large Models and Copyright Risks
The "Consensus" on AI Data Collection
Hidden Dangers of Rapidly Developing Technology
Regulatory Changes
Murky Data Sources
Differences in Regulations Across Countries
Regulation in China and Europe
Case Law in the United States