Extending Text2Video-Zero for Multi-ControlNet
Ben Backen
Committee: Humphrey Shi
Honors Bachelor's Thesis (June 2023)
Keywords: text-to-video, Stable Diffusion, ControlNet, machine learning, generative models, AI

This thesis presents an extension to the Text2Video-Zero (T2V0) generative model that augments its synthesis of video from textual and video inputs. The project focuses on enhancing the functionality and accessibility of T2V0 by integrating Stable Diffusion's (SD) support for multiple ControlNets, implementing frame-wise masking for selective ControlNet application, and introducing memory optimizations that enable the model to run on consumer-grade hardware. The paper also provides a high-level overview of SD, explores experimental features, and offers practical tips for generating videos with these tools. Additionally, we include a demonstration video showcasing T2V0 with Multi-ControlNet, highlighting the early potential of text-to-video models for storytelling. Ultimately, the study strives to expand the capabilities and accessibility of T2V0, increasing users' control over their generated outputs while upholding the democratic principles of open-source AI.
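To give a concrete sense of the frame-wise masking idea described above, the sketch below shows one way such a mechanism could work: a per-frame boolean mask decides whether a ControlNet's conditioning output is kept or zeroed for each frame. This is a minimal, hypothetical illustration in plain Python (real ControlNet residuals are tensors, and the thesis's actual implementation may differ); the function name and data layout are assumptions, not the thesis's API.

```python
def apply_framewise_mask(control_residuals, frame_mask):
    """Selectively apply ControlNet conditioning per frame.

    control_residuals: list of per-frame residuals (here, flat lists of floats
                       standing in for the ControlNet's output tensors)
    frame_mask: list of bools, one per frame; True keeps the ControlNet's
                influence on that frame, False suppresses it
    """
    return [
        [value if keep else 0.0 for value in frame]
        for frame, keep in zip(control_residuals, frame_mask)
    ]

# Four frames of dummy residuals; apply the ControlNet only to frames 0 and 2.
residuals = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
mask = [True, False, True, False]
masked = apply_framewise_mask(residuals, mask)
# Frames where the mask is False contribute no conditioning signal.
```

In a full pipeline, the suppressed frames would fall back to the unconditioned denoising path, so a single ControlNet can guide only a chosen subset of the video's frames.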