TF-T2V

Versatile technology for video content creation.

A Recipe for Scaling up Text-to-Video Generation
with Text-free Videos

Xiang Wang1, Shiwei Zhang2, Hangjie Yuan3, Zhiwu Qing1, Biao Gong2, Yingya Zhang2,
Yujun Shen4, Changxin Gao1, Nong Sang1

1Huazhong University of Science and Technology, 2Alibaba Group, 3Zhejiang University, 4Ant Group

Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation. One of the key reasons is the limited scale of publicly available data (e.g., 10M video-text pairs in WebVid10M vs. 5B image-text pairs in LAION), considering the high cost of video captioning. Instead, it could be far easier to collect unlabeled clips from video platforms like YouTube. Motivated by this, we come up with a novel text-to-video generation framework, termed TF-T2V, which can directly learn with text-free videos. The rationale is to separate the process of text decoding from that of temporal modeling. To this end, we employ a content branch and a motion branch, which are jointly optimized with weights shared. Following such a pipeline, we study the effect of doubling the scale of the training set (i.e., video-only WebVid10M) with some randomly collected text-free videos and are encouraged to observe the performance improvement (FID from 9.67 to 8.19 and FVD from 484 to 441), demonstrating the scalability of our approach. We also find that our model could enjoy a sustained performance gain (FID from 8.19 to 7.64 and FVD from 441 to 366) after reintroducing some text labels for training. Finally, we validate the effectiveness and generalizability of our approach on both native text-to-video generation and compositional video synthesis paradigms. Code and models will be made publicly available.
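To put the reported numbers in perspective, the short Python sketch below converts the FID and FVD values quoted in the abstract into relative reductions (lower is better for both metrics). The helper name `relative_reduction` and the stage labels are illustrative choices, not part of the released code; only the four metric pairs come from the abstract.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the percentage by which a lower-is-better metric decreased."""
    return (before - after) / before * 100.0


# (metric, value before, value after) as quoted in the abstract.
stages = {
    "doubling the training set with text-free videos": [
        ("FID", 9.67, 8.19),
        ("FVD", 484.0, 441.0),
    ],
    "reintroducing some text labels": [
        ("FID", 8.19, 7.64),
        ("FVD", 441.0, 366.0),
    ],
}

for stage, metrics in stages.items():
    for name, before, after in metrics:
        pct = relative_reduction(before, after)
        print(f"{stage}: {name} {before} -> {after} ({pct:.1f}% lower)")
```

Under this reading, adding text-free videos lowers FID by about 15.3% and FVD by about 8.9%, while reintroducing text labels lowers FID by a further 6.7% and FVD by a further 17.0%.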

Overview: Summary of the Generated Videos
The real source of art is your imagination, and TF-T2V is the solution for bringing it to life.
Ability 1: Text-to-Video Generation (Resolution: 448x256)
You can generate videos flexibly in any style that you can imagine.
Ability 2: Text-to-Video Generation (Resolution: 896x512)
You can generate high-resolution videos flexibly in any style that you can imagine.
Ability 3: Compositional Video Synthesis (Resolution: 448x256)
You can generate videos flexibly with diverse structural conditions.
Ability 4: Compositional Video Synthesis (Resolution: >= 1280x640)
You can generate high-resolution videos that are consistent with the input structure guidance.
Evaluation on Semi-supervised Setting
In the semi-supervised setting, you can generate videos that follow the motion-related text.
Extension: VideoLCM based on TF-T2V

VideoLCM: Video Latent Consistency Model

Xiang Wang1, Shiwei Zhang2, Han Zhang3, Yu Liu2, Yingya Zhang2, Changxin Gao1, Nong Sang1

1Huazhong University of Science and Technology, 2Alibaba Group, 3Shanghai Jiao Tong University

Consistency models have demonstrated powerful capability in efficient image generation and allowed synthesis within a few sampling steps, alleviating the high computational cost in diffusion models. However, the use of consistency models in the more challenging and resource-consuming task of video generation is still underexplored. In this report, we present the VideoLCM framework to fill this gap, which leverages the concept of consistency models from image generation to efficiently synthesize videos with minimal steps while maintaining high quality. VideoLCM builds upon existing latent video diffusion models and incorporates distillation techniques for training the latent consistency model. Experimental results reveal the effectiveness of our VideoLCM in terms of computational efficiency, fidelity and temporal consistency. Notably, VideoLCM achieves high-fidelity and smooth video synthesis with only 4 sampling steps, showcasing the potential for real-time synthesis. We hope that VideoLCM can serve as a simple yet effective baseline for subsequent research work.
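As a rough illustration of what sampling "with only 4 steps" means for a consistency model, the sketch below implements generic multistep consistency sampling: each step maps a noisy sample directly to a clean estimate, and all but the last step are followed by re-noising to a smaller noise level. This is a simplified, assumption-laden toy and not VideoLCM's actual sampler: `toy_consistency_fn` is a hypothetical stand-in for a trained network, the noise levels are arbitrary, and the re-noising rule is simplified to adding Gaussian noise of scale `t`.

```python
import random


def toy_consistency_fn(x_t, t):
    """Hypothetical stand-in for a trained consistency model f(x_t, t).

    A real model is a neural network that predicts the clean latent;
    here the noisy sample is simply shrunk according to the noise level.
    """
    return [v / (1.0 + t) for v in x_t]


def multistep_consistency_sample(consistency_fn, dim, noise_levels, seed=0):
    """Alternate one-call denoising with re-noising at decreasing noise levels.

    Returns the final clean estimate and the number of model calls made.
    """
    rng = random.Random(seed)

    # Start from pure Gaussian noise at the largest noise level.
    t_max = noise_levels[0]
    x_t = [t_max * rng.gauss(0.0, 1.0) for _ in range(dim)]
    x0 = consistency_fn(x_t, t_max)
    calls = 1

    for t in noise_levels[1:]:
        # Re-noise the current estimate to level t, then denoise in one call.
        x_t = [v + t * rng.gauss(0.0, 1.0) for v in x0]
        x0 = consistency_fn(x_t, t)
        calls += 1

    return x0, calls


# Four noise levels give four model calls, i.e. four sampling steps.
sample, num_calls = multistep_consistency_sample(
    toy_consistency_fn, dim=8, noise_levels=[80.0, 20.0, 5.0, 1.0]
)
print(num_calls, len(sample))
```

The point of the sketch is the cost structure: the number of network evaluations equals the number of noise levels, so a 4-step sampler needs only four forward passes regardless of how the underlying model was trained.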

Extension: Video Demo of VideoLCM
Efficient video generation based on TF-T2V: just four denoising steps are enough for good results.
Extension: Video Examples Generated by VideoLCM with 4 Steps
You can generate videos within merely 4 inference steps.

@article{TFT2V,
 title={A Recipe for Scaling up Text-to-Video Generation with Text-free Videos},
 author={Wang, Xiang and Zhang, Shiwei and Yuan, Hangjie and Qing, Zhiwu and Gong, Biao and Zhang, Yingya and Shen, Yujun and Gao, Changxin and Sang, Nong},
 journal={arXiv preprint arXiv:2312.15770},
 year={2023}
}

@article{wang2023videolcm,
 title={VideoLCM: Video Latent Consistency Model},
 author={Wang, Xiang and Zhang, Shiwei and Zhang, Han and Liu, Yu and Zhang, Yingya and Gao, Changxin and Sang, Nong},
 journal={arXiv preprint arXiv:2312.09109},
 year={2023}
}

@article{videocomposer,
 title={VideoComposer: Compositional Video Synthesis with Motion Controllability},
 author={Wang, Xiang and Yuan, Hangjie and Zhang, Shiwei and Chen, Dayou and Wang, Jiuniu and Zhang, Yingya and Shen, Yujun and Zhao, Deli and Zhou, Jingren},
 journal={NeurIPS},
 year={2023}
}