"After Sora, seeing is not necessarily believing anymore."
The sentiment above is likely shared by many. From nothing more than a text description, Sora can generate a minute-long, high-definition video that is remarkably lifelike and seamlessly coherent. The visuals are so realistic that it has become exceedingly difficult for people to tell with the naked eye alone whether such videos were crafted by AI.
Screenshots from Sora-generated videos. Source: OpenAI official website.
AI can not only generate videos but also "tamper" with them. Recently, a research team at Xiaopeng Motors introduced a universal video simulation framework called "Any Object in Any Scene," which can seamlessly insert an arbitrary object into an existing dynamic video. The result is likewise difficult to distinguish with the naked eye.
With truth and falsehood so hard to tell apart, more and more people are becoming concerned about the chaos AI-generated videos might cause. For instance, video evidence may no longer be reliable: "In the future, you might find yourself sitting in the defendant's seat in court, watching a 'crime video' that even you didn't know existed."
Dong Jing, a researcher at the Institute of Automation, Chinese Academy of Sciences, focuses on AI content security and adversarial techniques such as image tampering and deepfakes, and many of her team's findings have been applied to multimedia intelligent authentication. Given AI's growing capabilities, what technical methods and means exist to counter it? How can the general public be more cautious when consuming video content so as not to be deceived? To address these concerns, the Chinese Science Bulletin interviewed Dong Jing.
Authentication Still in a Reactive State
"Dong Jing says, "Using magic to defeat magic." Currently, there are mainly two intelligent detection methods to identify whether a video segment is generated by AI.
The first is based on data-driven learning. It typically requires collecting forged and authentic videos (preferably paired data) as training datasets to train powerful deep networks. As long as the model can "remember" anomalies or traces in the video frames, such as image noise or discontinuous motion trajectories between frames, it can tell real from fake.
Dong Jing says this method is relatively universal: once the detection model's parameters are fixed, it is simple to deploy and performs well in batch detection. However, it relies heavily on the volume and completeness of the training data and often fails on unknown data it was never trained on.
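As a rough, hedged illustration of this data-driven approach (a generic sketch, not Dong Jing's actual system), the snippet below fine-tunes a small pretrained classifier on video frames labeled real or fake; the dataset layout, model choice, and hyperparameters are all assumptions.

```python
# Minimal sketch of data-driven real/fake frame classification with PyTorch.
# Assumes torchvision is installed and frames are stored as
# frames/real/*.png and frames/fake/*.png (a hypothetical layout).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# ImageFolder expects one subdirectory per class, here "real" and "fake".
dataset = datasets.ImageFolder("frames", transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# A small pretrained backbone with a two-class head stands in for the
# "powerful deep network" described above.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(3):  # a few epochs, for illustration only
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```

As the article notes, such a classifier is only as good as its training data: frames produced by a generator it has never seen may carry none of the traces it has "remembered."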
The other method is based on specific clues. It first defines visual "clues" in the video that are inconsistent or illogical, such as inconsistent lighting, missing physiological signals expected in facial videos, or lip movements that are out of sync with the speaker's speech. Corresponding algorithms are then designed to extract and localize these clues as evidence. This method is more interpretable and performs well in targeted detection of video segments, but it generalizes less well across diverse data.
The video "tampered" by the Xiaopeng Motors team can be identified using this method. Dong Jing says their team's preliminary analysis revealed slight changes in color and texture across different frames after "inserting" the target object, "which can serve as clues for training and testing after collecting relevant data."
However, Dong Jing points out that with tools like Sora enhancing AI-generated video details and diversification capabilities, explicit forgery traces in generated videos will become fewer. Relying solely on traditional video analysis and forgery detection methods to distinguish the authenticity of video content will undoubtedly become more challenging.
"The progress of targeted technologies is still relatively preliminary, and it is necessary to strengthen the development and optimization of various detection technologies," Dong Jing told Chinese Science Bulletin. Currently, technologically, the approach still follows conventional detection techniques, so it is necessary to improve the recognition capabilities of models based on constructing a new dataset of forged videos.
At the same time, existing video detection models need to be updated in a timely manner to keep pace with new generations of video synthesis algorithms. In addition, techniques such as digital watermarking, digital signatures, and video retrieval can strengthen the tracking and management of generated video data over its lifecycle.
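As one simplified instance of such lifecycle tracking (a generic illustration, not a description of any deployed system), the sketch below records a SHA-256 digest of a generated video file so that later copies can be checked against the original record; a production scheme would add public-key signatures and a trusted registry.

```python
# Sketch: register and later verify a SHA-256 digest of a video file.
# File paths are placeholders; a real system would sign the digest with a
# private key and store it in a trusted registry.
import hashlib

def file_digest(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

registered = file_digest("generated_video.mp4")      # recorded at generation time
print("registered digest:", registered)

# Later, when a copy resurfaces, recompute and compare.
if file_digest("downloaded_copy.mp4") != registered:
    print("the file differs from the registered original (altered or re-encoded)")
```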
"In general, the authentication of video content is still relatively passive, requiring a game of verification against continuously iterated and upgraded video synthesis algorithms," Dong Jing said. Although it's becoming increasingly difficult, AI videos will inevitably produce specific patterns or traces during the generation process, and related detection technologies will continue to utilize these imperceptible clues for countermeasures, analysis, and authentication.
She and her team have proposed new detection algorithms from various angles. These algorithms, based on reconstruction errors, multimodal contrastive learning, or the purification of forged features, are all ongoing attempts to uncover "new specific authentication clues."
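To make the reconstruction-error idea concrete (a generic sketch, not the team's published algorithms), the snippet below scores frames by how poorly a small autoencoder, trained on real footage only, reconstructs them; unusually high error can indicate content outside the distribution the model learned. The architecture and threshold are assumptions.

```python
# Sketch: reconstruction-error scoring with a small convolutional autoencoder (PyTorch).
# The model is assumed to have been trained on real frames only; the anomaly
# threshold below is a placeholder.
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = TinyAutoencoder()
# ... training on real frames omitted; assume `model` has been fitted ...

def anomaly_score(frame: torch.Tensor) -> float:
    """Mean squared reconstruction error of a single (3, H, W) frame in [0, 1]."""
    with torch.no_grad():
        recon = model(frame.unsqueeze(0))
        return torch.mean((recon - frame.unsqueeze(0)) ** 2).item()

frame = torch.rand(3, 128, 128)  # stand-in for a decoded video frame
print("suspicious" if anomaly_score(frame) > 0.02 else "consistent with training data")
```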
Promoting the Establishment of Internationally Consensus Standards and Norms
To prevent such chaos, "source control" and other non-technical solutions are frequently mentioned. For example, some propose reaching agreements with AIGC providers such as OpenAI to embed AI-generation marks at the moment a video is generated.
Dong Jing told Chinese Science Bulletin that embedding markers is currently one of the recommended strategies, but it still faces technological challenges and limitations, such as the reliability, concealment, and universality of markers, while considering factors like privacy and security.
Compared with passive detection, watermarking or marking is a form of active defense. Dong Jing told reporters that her team is conducting research on visual generative watermarking: they hope to add a "robust watermark embedding module" to current generative models so that the videos they produce carry visible or invisible digital watermarks.
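As a toy illustration of invisible watermarking (not the team's "robust watermark embedding module"), the snippet below hides a bit pattern in the least significant bits of one frame's blue channel and reads it back. A genuinely robust watermark must survive compression and editing, which this toy scheme would not.

```python
# Sketch: least-significant-bit watermark embedding and extraction on one frame.
# Assumes NumPy and a BGR frame layout; a production watermark would use
# frequency-domain or learned embedding so it survives re-encoding.
import numpy as np

def embed_bits(frame: np.ndarray, bits: np.ndarray) -> np.ndarray:
    """Write `bits` (0/1) into the blue-channel LSBs of the first pixels of the top row."""
    marked = frame.copy()
    n = bits.size
    marked[0, :n, 0] = (marked[0, :n, 0] & 0xFE) | bits
    return marked

def extract_bits(frame: np.ndarray, n: int) -> np.ndarray:
    return frame[0, :n, 0] & 1

frame = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)   # stand-in frame
payload = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=np.uint8)        # hypothetical "AI-generated" marker bits
marked = embed_bits(frame, payload)
assert np.array_equal(extract_bits(marked, payload.size), payload)
```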
They have also recently tried introducing "adversarial noise" into real images and videos to prevent generative models from synthesizing new content from these source data.
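The general idea can be sketched as a gradient-based perturbation (a textbook FGSM-style example, not the team's method): a small, nearly invisible noise is added to an image so that a model consuming it produces degraded output. The victim model, loss, and epsilon below are placeholders.

```python
# Sketch: FGSM-style protective perturbation (PyTorch). `victim_model` stands in
# for a generative or editing model someone might apply to the image; the loss
# and epsilon are illustrative placeholders.
import torch
import torch.nn as nn

def protect(image: torch.Tensor, victim_model: nn.Module, epsilon: float = 4 / 255) -> torch.Tensor:
    """Return a copy of `image` with a small perturbation that pushes the victim
    model's output away from the input, keeping pixel changes within +/- epsilon."""
    image = image.clone().detach().requires_grad_(True)
    output = victim_model(image)
    # Increasing this loss makes the model reproduce the image less faithfully.
    loss = nn.functional.mse_loss(output, image)
    loss.backward()
    perturbed = image + epsilon * image.grad.sign()
    return perturbed.clamp(0, 1).detach()

# Usage with a stand-in "victim": a tiny convolutional layer.
victim = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))
clean = torch.rand(1, 3, 128, 128)
protected = protect(clean, victim)
print("max pixel change:", (protected - clean).abs().max().item())
```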
In addition to technical means, Dong Jing also mentioned some non-technical measures.
"People need to improve AI data governance and regulatory laws on AI tool usage, while conducting popular science education, strengthening industry norms, and public awareness of relevant precautions," Dong Jing said. For overseas AI generation service entities like OpenAI, "we call for the establishment of internationally consensual AI data technology standards and norms, forming a coordinated marking and supervision scheme to address generated video rationally."
Dong Jing believes that regulating the use of new video generation tools like Sora, for example by managing the source datasets on which they are trained, standardizing the output and security testing of generated videos that may contain sensitive or false content, and applying governance and control measures, can reduce the risk of abuse of AI-generated videos, so that "the difficulty of identification will not continue to increase."
Enhancing Immunity to False Videos
Although she agrees that "identifying whether a video is AI-generated should not be left to the public," Dong Jing maintains that ordinary people can still "be a little more vigilant when facing video content" to avoid being deceived. She offers a few pointers:
Firstly, examine the logical authenticity of video details, such as whether the actions of individuals in the video, background settings, etc., align with the objective reality, and whether physiological features of individuals (such as teeth, fingers, skin texture, iris color, etc.) are plausible.
She noted that it is not yet known whether algorithms like Sora can easily generate high-quality images and videos in large quantities. From the clips released so far, however, flaws in motion can still be spotted with careful observation.
Secondly, assess whether the video's quality and clarity are consistent. AI-generated videos often show flaws in picture quality and clarity, such as blurring or frame jitter.
Lastly, examine whether the content logic of the video is reasonable, such as whether the content and plot are logical and coherent. If there are doubts, further verification can be done by checking the credibility or consistency of information such as video sources, publishing platforms, comments, format, and production time. Additionally, one can use specialized tools and software designed to detect AI-generated videos for cross-validation.
Dong Jing suggests that in interactive scenarios like video chats, one can actively request the other party to turn their face to the side, move closer or farther from the camera for discernment, as current forgery techniques have relatively poor predictive and generative effects on significant motion changes.
Furthermore, Dong Jing reminds us that in today's complex media and public opinion environment, the general public should actively learn the relevant knowledge and gain a reasonable understanding of how AI generation works and where its loopholes lie, so that it can be drawn upon when needed.
"It functions much like getting regularly vaccinated against the latest flu strains, enhancing immunity against false videos," Dong Jing told reporters. "Although personally, I believe the public shouldn't bear the burden of identifying AI-generated content, it is our duty, both publicly and privately, to elevate the overall digital literacy and awareness of cyber safety, to minimize the spread of false information, economic fraud, misinformation, and to foster social trust."