Apart from the “submergence” of AI toward localized applications, the biggest recent change in the AI sector is the technological breakthrough in multimodal video generation: it has evolved from pure text-to-video into fully integrated generation that combines text, images, and audio.
Here are a few recent breakthroughs that give a sense of where the technology stands:
1) ByteDance has open-sourced the EX-4D framework: a monocular video can be instantly turned into free-viewpoint 4D content, with a reported user acceptance rate of 70.7%. In other words, from an ordinary video, the AI can automatically generate views from any angle, something that previously required a professional 3D modeling team.
2) Baidu’s “Hui Xiang” platform: generates a 10-second video from a single image and claims “movie-level” quality. Whether that is marketing exaggeration won’t be clear until the Pro version update in August.
3) Google DeepMind Veo: generates 4K video with synchronized environmental audio. The key technical highlight is the “synchronization” itself: previously, video and audio came from two separate systems stitched together. Reaching true semantic-level matching is hard; in a complex scene, for example, the walking motion on screen has to line up with the corresponding footstep sounds.
4) Douyin ContentV: 8 billion parameters, 1080p video generated in 2.3 seconds, at a cost of 3.67 yuan per 5 seconds. Frankly, the cost control is quite good, but the generation quality still falls short in complex scenes.
Why do these cases matter? Because they mark breakthroughs on three fronts: video quality, production cost, and application scenarios.
1. In terms of technical value: the complexity of generating multimodal video compounds rapidly across dimensions. A single frame contains roughly 10^6 pixels, a video must maintain temporal coherence across at least 100 frames, audio must stay synchronized at roughly 10^4 samples per second, and 3D spatial consistency must hold throughout.
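As a rough, purely illustrative back-of-envelope calculation using the ballpark figures above (the 24 fps frame rate is an added assumption, not from the text):

```python
# Back-of-envelope scale of a short multimodal clip, using the ballpark
# figures from the paragraph above; the frame rate is an assumption.

frames = 100               # at least ~100 frames for temporal coherence
pixels_per_frame = 10**6   # ~1e6 pixels per frame
audio_rate = 10**4         # ~1e4 audio samples per second
fps = 24                   # assumed frame rate, not stated in the text

duration_s = frames / fps                      # ~4.2 seconds of footage
video_values = frames * pixels_per_frame * 3   # RGB values that must stay coherent
audio_samples = int(duration_s * audio_rate)   # samples that must stay in sync

print(f"video values to keep coherent: {video_values:,}")   # 300,000,000
print(f"audio samples to keep in sync: {audio_samples:,}")  # 41,666
```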
In short, the technical complexity is not low. The old approach was one super-large model brute-forcing every task; Sora is said to have burned through tens of thousands of H100s to reach its video-generation capability. Now the same result can be achieved through modular decomposition and collaboration between models. ByteDance’s EX-4D, for example, breaks the complex task down into a depth estimation module, a viewpoint transformation module, a temporal interpolation module, a rendering optimization module, and so on, with each module specializing in one job and a coordination mechanism tying them together.
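To make the modular idea concrete, here is a minimal hypothetical sketch of such a pipeline; the module functions and coordinator below are illustrative stand-ins, not ByteDance’s actual EX-4D code or API:

```python
# Hypothetical sketch of modular decomposition for a video task.
# Module names mirror the description above; logic is placeholder.

from typing import Callable

# Each stage is a plain function: it takes the shared working state
# (a dict of intermediate artifacts) and returns an updated state.
Stage = Callable[[dict], dict]

def depth_estimation(state: dict) -> dict:
    state["depth_maps"] = f"depth({state['input_video']})"
    return state

def viewpoint_transform(state: dict) -> dict:
    state["novel_views"] = f"reproject({state['depth_maps']}, angle={state['target_angle']})"
    return state

def temporal_interpolation(state: dict) -> dict:
    state["smooth_views"] = f"interpolate({state['novel_views']})"
    return state

def rendering_optimization(state: dict) -> dict:
    state["output_video"] = f"render({state['smooth_views']})"
    return state

def run_pipeline(stages: list[Stage], state: dict) -> dict:
    """Minimal coordinator: run specialized stages in order over shared state."""
    for stage in stages:
        state = stage(state)
    return state

result = run_pipeline(
    [depth_estimation, viewpoint_transform, temporal_interpolation, rendering_optimization],
    {"input_video": "clip.mp4", "target_angle": 35},
)
print(result["output_video"])
```

The point is the shape of the design: each stage owns one narrow problem, and the coordinator only passes shared state between them, so individual modules can be swapped, scaled, or optimized independently instead of one giant model carrying everything.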
2. In terms of cost reduction: the gains come from optimizing the inference architecture itself, including a layered generation strategy (generate a low-resolution skeleton first, then enhance it into high-resolution content), a cache-and-reuse mechanism (reuse results across similar scenes), and dynamic resource allocation (adjust model depth to match the complexity of the content); a rough code sketch of these three ideas follows below.
It is this set of optimizations that gets Douyin ContentV down to 3.67 yuan per 5 seconds.
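A minimal sketch of how the three optimizations above could fit together; the function names, cache key, and complexity threshold are assumptions for illustration, not any platform’s actual implementation:

```python
# Hypothetical sketch of the three inference optimizations described above.

scene_cache: dict[str, str] = {}  # cache-and-reuse: keyed by a scene signature

def generate(prompt: str, scene_key: str, complexity: float) -> str:
    # 1) Cache reuse: skip regeneration for scenes we have already rendered.
    if scene_key in scene_cache:
        return scene_cache[scene_key]

    # 2) Dynamic resource allocation: shallower model pass for simple content.
    depth = "shallow" if complexity < 0.5 else "full"

    # 3) Layered generation: low-resolution skeleton first, then refine/upscale.
    skeleton = f"lowres_skeleton({prompt}, depth={depth})"
    video = f"superres_refine({skeleton})"

    scene_cache[scene_key] = video
    return video

print(generate("a cat walking on a beach", scene_key="beach_cat", complexity=0.3))
print(generate("a cat walking on a beach", scene_key="beach_cat", complexity=0.3))  # cache hit
```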
3. In terms of application impact: traditional video production is a capital-intensive game of equipment, venues, actors, and post-production, where a 30-second advertisement routinely costs hundreds of thousands. AI compresses the whole process into a prompt plus a few minutes of waiting, and can deliver camera angles and effects that are hard to achieve with a physical shoot.

This shifts the barrier in video production from technology and capital to creativity and taste, and that may trigger a reshuffling of the entire creator economy.
The question, then, is: what do these demand-side shifts in web2 AI have to do with web3 AI?
1. First, the structure of computing-power demand changes. AI used to be a pure scale contest: whoever had the bigger homogeneous GPU cluster won. Multimodal video generation instead calls for a diverse mix of compute, which could create demand for distributed idle computing power as well as for various distributed fine-tuning, algorithm, and inference platforms.
2. Second, demand for data labeling will also strengthen. Producing a professional-grade video requires precise scene descriptions, reference images, audio styles, camera-movement trajectories, lighting conditions, and more, and these become new professional labeling requirements. Web3-style incentives can encourage photographers, sound engineers, 3D artists, and others to contribute these professional data elements, strengthening AI video generation with specialized vertical labeling.
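For illustration, a professional labeling record of the kind described might carry fields like the following; the schema and field names are hypothetical, not an existing standard:

```python
# Hypothetical schema for a professional video-annotation record of the kind
# described above; field names are illustrative, not an existing standard.

from dataclasses import dataclass, field

@dataclass
class VideoAnnotation:
    scene_description: str                       # precise textual description of the scene
    reference_images: list[str] = field(default_factory=list)   # file names or content hashes
    audio_style: str = ""                        # e.g. "light rain, distant traffic"
    camera_trajectory: list[tuple[float, float, float]] = field(default_factory=list)  # (x, y, z) keyframes
    lighting: str = ""                           # e.g. "golden hour, soft key from the left"
    annotator: str = ""                          # contributor ID used for incentive payout

sample = VideoAnnotation(
    scene_description="A cyclist crosses a rain-slicked intersection at dusk",
    reference_images=["ref_frame_001.png"],
    audio_style="light rain, tire hiss, distant thunder",
    camera_trajectory=[(0.0, 1.6, 0.0), (2.0, 1.6, -1.0)],
    lighting="overcast dusk, neon reflections",
    annotator="annotator_042",
)
print(sample.scene_description)
```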
3. Finally, it is worth noting that as AI shifts from centralized, large-scale resource allocation to modular collaboration, this in itself creates new demand for decentralized platforms. At that point, computing power, data, models, and incentives will together form a self-reinforcing flywheel, which will in turn drive the convergence of web3 AI and web2 AI scenarios.