TMTPOST: New Insights into Future Business and Life — October 21, 16:46
China's Vidu Q2 AI Video Model Launches, Challenging OpenAI's Sora and Google's Veo

Chinese startup ShengShu Technology has released Vidu Q2, its latest AI video generation model, built to compete with OpenAI's Sora 2 and Google's Veo 3.1. Vidu Q2 delivers marked improvements in consistency, narrative control, and creative flexibility, letting users upload and blend up to seven reference images into a single video while preserving the distinctive features of each visual element. CEO Luo Yihang said Vidu Q2 marks a new chapter in AI video creation, with the goal of an AI that not only creates videos but also acts, reacts, and tells stories alongside human creators. Vidu Q2 also introduces transition animations similar to Google's Veo 3.1 and ships with an API for integration by enterprises and studios. ShengShu stresses that Vidu Q2 matches Sora 2 and Veo 3.1 in visual quality at faster speed and lower cost, and that the Vidu line has accumulated 30 million users and generated more than 400 million videos.

🚀 ShengShu Technology has released Vidu Q2, a new AI video generation model that directly challenges industry leaders OpenAI's Sora 2 and Google's Veo 3.1, marking China's rapid progress in multimodal generative AI.

🖼️ Vidu Q2 introduces a "multi-entity consistency" feature that lets users upload up to seven reference images (faces, scenes, or props) and blend them seamlessly into a single video while preserving each element's distinctive characteristics, effectively reducing the distortions and blending errors common in existing models.

🎬 The model supports transition animations similar to Google's Veo 3.1: users upload only the first and last frames of a scene, and Vidu Q2 generates coherent motion in between, giving tighter control over narrative flow and pacing — a capability particularly valued in film and advertising production.

💰 ShengShu stresses that Vidu Q2 matches the visual quality of Sora 2 and Veo 3.1 while generating faster and at lower operating cost, thanks to its localized infrastructure and optimized compression algorithms, which could make high-quality generative video creation more widely accessible.

🌍 Since its founding in March 2023, ShengShu has grown rapidly: its Vidu models have accumulated 30 million users across more than 200 countries and regions and generated over 400 million videos, signaling strong potential in AI video generation.

AI-generated image

TMTPOST -- ShengShu Technology, one of China’s fastest-growing multimodal generative artificial intelligence startups, has unveiled a new version of its AI video generation model aimed squarely at challenging OpenAI’s Sora 2 and Google’s Veo 3.1, two of the world’s most advanced text-to-video systems.

The Beijing-based firm said on Tuesday that its new release, Vidu Q2, significantly improves consistency, narrative control, and creative flexibility, marking a step forward in the company’s ambition to compete globally in the emerging field of AI-driven video creation.

According to ShengShu, Vidu Q2 allows creators to upload and merge up to seven reference images—covering faces, scenes, or props—into a single coherent video. The model’s new “multi-entity consistency” feature blends these visual elements with text prompts while maintaining the unique characteristics of each reference, reducing the distortions and blending errors that often appear in existing models.

“Vidu Q2 marks a new chapter in AI video creation,” said Luo Yihang, ShengShu’s chief executive officer, during the product announcement. “We’re entering an era where AI doesn’t just create videos but acts, reacts, and tells stories alongside human creators. This launch goes beyond simple generation—it’s about teaching AI to perform and express emotion.”

Luo said the company’s goal is not to replace human creativity but to expand it. “With each release, we bring technology and imagination closer together,” he said. “Our aim is to make creativity more accessible—turning imagination into visible, emotional storytelling.”

The Vidu Q2 model introduces several new features that position it directly against Western rivals. Like Google’s Veo 3.1, Vidu Q2 supports transition animations that allow users to upload only the first and last frames of a scene, letting the model generate the in-between motion. This offers creators enhanced control over narrative flow and pacing—a capability particularly valued in film and advertising production.

The company also released a Vidu Q2 application programming interface (API), allowing enterprises and studios to integrate the model into their workflows for automated or customized content generation.
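To illustrate what such an integration might look like, here is a minimal sketch of a client-side request builder. The endpoint URL, field names, and validation logic are all assumptions for illustration — ShengShu has not published these details in this article — but the seven-reference-image limit reflects the capability described above.

```python
import json

# Placeholder endpoint: the real Vidu Q2 API URL and schema are not
# documented in this article and would come from ShengShu's API docs.
VIDU_API_URL = "https://api.example.com/v1/vidu-q2/generate"
MAX_REFERENCE_IMAGES = 7  # Vidu Q2 accepts up to seven reference images


def build_generation_request(prompt: str, reference_images: list[str]) -> str:
    """Build a JSON request body for a text-plus-references generation call.

    `reference_images` would typically be image URLs or base64 strings;
    the field names here are hypothetical.
    """
    if len(reference_images) > MAX_REFERENCE_IMAGES:
        raise ValueError(
            f"Vidu Q2 supports at most {MAX_REFERENCE_IMAGES} reference "
            f"images, got {len(reference_images)}"
        )
    payload = {
        "prompt": prompt,
        "reference_images": reference_images,
    }
    return json.dumps(payload)


# Example: three references (a face, a scene, a prop) plus a text prompt.
body = build_generation_request(
    "A battery module moving along a factory conveyor belt",
    ["face.png", "factory.png", "robot.png"],
)
```

A studio pipeline would POST this body to the API endpoint and poll for the finished video; enforcing the reference limit client-side avoids a round trip for requests the service would reject anyway.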

ShengShu emphasized that its new system delivers comparable visual quality to Sora 2 and Veo 3.1 at a faster speed and lower cost, potentially making high-quality generative video creation more accessible to independent creators and small businesses.

Industry insiders told Yicai Global that the pricing advantage could prove decisive. While U.S.-based models require extensive cloud resources and expensive compute credits, ShengShu’s localized infrastructure and optimized compression algorithms make Vidu Q2 considerably cheaper to operate.

In one scenario, Vidu Q2 was prompted to generate a video depicting a blade battery module moving along a conveyor belt inside a Chinese electric vehicle factory, being scanned by a yellow Siasun industrial robot, with a digital screen showing "99.92" alongside simplified Chinese characters.

The system successfully fused all visual elements—the battery, robotic arm, Siasun logo, and Chinese text—into a smooth, stable sequence. Observers said the video maintained high visual fidelity, especially in rendering Chinese characters accurately, demonstrating the strength of the multi-entity consistency feature.

In comparison, Google’s Veo 3.1, which supports up to three reference images, failed to reproduce the Chinese text correctly. OpenAI’s Sora 2 handled the text accurately but mistakenly changed the Siasun logo to that of Nissan Motor, showing the difficulty of managing multiple distinct references across frames.

A second test involved a short dialogue scene: a Chinese chairman angrily asking, “The battery caught fire, are you messing with me?” followed by an American CEO replying in English, “Not me, it’s them,” in a Shanghai boardroom setting.

Vidu Q2 generated the scene using reference images for the characters’ expressions. The video demonstrated accurate lip synchronization in both languages and convincing facial animation for anger and frustration. However, the emotional tone of the accompanying audio was relatively flat, lagging behind the natural expressiveness achieved by Veo 3.1.

Despite that, analysts said the results highlight ShengShu’s progress in cross-lingual emotional modeling and multimodal consistency—areas considered technically challenging even for global leaders.

Founded in March 2023 by researchers from Tsinghua University’s Institute for AI Industry Research, ShengShu has quickly risen to prominence in China’s fast-evolving generative AI industry. The startup launched Vidu 1.0 in April 2024 and has since accumulated 30 million users across more than 200 countries and regions, generating over 400 million videos to date.

Vidu’s early versions could produce five- to eight-second clips at 1080p resolution from text or image prompts in either Chinese or English. The Q2 update builds on that base with improved realism, narrative capability, and expanded creative control.

Analysts say the company’s trajectory mirrors China’s broader push to narrow the technological gap with U.S. AI developers. “China’s AI ecosystem is catching up fast,” said an industry expert at a Beijing venture capital firm. “ShengShu’s focus on multimodal integration—especially with localized features like Chinese text and cultural nuances—gives it an edge in domestic and Asian markets.”

Generative video has become one of the most competitive frontiers in AI development. Since OpenAI’s Sora first stunned the industry with its photorealistic videos in early 2024, companies worldwide have raced to build their own models capable of producing complex, cinematic sequences directly from text prompts.

Google’s Veo 3.1 and Anthropic’s experimental systems have set the bar high for quality and consistency, but Chinese players such as ShengShu, Kuaishou’s Kling, and Tencent’s Hunyuan Video are rapidly improving.

“The next phase of competition is not just about realism,” said Luo. “It’s about emotional intelligence—how well AI can understand and express human feelings through visual storytelling.”

With Vidu Q2, ShengShu aims to establish itself as a major global player in AI video, blending scientific precision with artistic expression. Luo summed it up: “We want to make imagination visible. This is where technology and emotion finally meet.”

