Using a single reference image, Lumiere can generate videos in the target style by leveraging fine-tuned text-to-image model weights.
We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
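To make the architectural idea concrete, below is a minimal sketch of a Space-Time U-Net-style network that downsamples and upsamples in both space and time, so every output frame is produced in a single forward pass. Lumiere's code is not public; the module names, channel sizes, and the factorized (2+1)D convolution design here are illustrative assumptions, not the released architecture.

```python
# Illustrative sketch only: a toy U-Net with BOTH spatial and temporal
# down-/up-sampling, echoing the Space-Time U-Net idea. All names and
# sizes are assumptions for demonstration, not Lumiere's implementation.
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    """Factorized space-time convolution: a 2D spatial conv applied per
    frame, followed by a 1D temporal conv applied per pixel -- a common
    way to add temporal processing on top of an image backbone."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        self.act = nn.SiLU()

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return self.act(self.temporal(self.act(self.spatial(x))))

class TinySTUNet(nn.Module):
    """Toy U-Net whose bottleneck sees a shorter, lower-resolution clip
    (downsampled in time AND space), while one forward pass still emits
    all output frames at once."""
    def __init__(self, ch=32):
        super().__init__()
        self.enc1 = SpaceTimeBlock(3, ch)
        self.down = nn.Conv3d(ch, ch * 2, kernel_size=2, stride=2)  # halves T, H, W
        self.mid = SpaceTimeBlock(ch * 2, ch * 2)
        self.up = nn.ConvTranspose3d(ch * 2, ch, kernel_size=2, stride=2)
        self.dec1 = SpaceTimeBlock(ch * 2, ch)  # ch*2 from the skip-connection concat
        self.out = nn.Conv3d(ch, 3, kernel_size=1)

    def forward(self, x):
        h1 = self.enc1(x)
        h = self.up(self.mid(self.down(h1)))
        h = self.dec1(torch.cat([h, h1], dim=1))
        return self.out(h)

# All 16 frames are processed together in one pass, in contrast to
# pipelines that generate sparse keyframes and then temporally
# super-resolve them.
video = torch.randn(1, 3, 16, 64, 64)  # (batch, RGB, frames, H, W)
print(TinySTUNet()(video).shape)       # torch.Size([1, 3, 16, 64, 64])
```

In a diffusion setting, a network of this shape would serve as the denoiser, so global temporal consistency is handled inside the model rather than stitched together across keyframes afterwards.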
With Lumiere, off-the-shelf text-based image editing methods can be used for consistent video editing.
Lumiere can animate the content of an image within a user-provided region.
(*) Equal first co-authors; (†) Core technical contribution.
Work was done while O. Bar-Tal and H. Chefer were interns at Google.
We would like to thank Ronny Votel, Orly Liba, Hamid Mohammadi, April Lehman, Bryan Seybold, David Ross, Dan Goldman, Hartwig Adam, Xuhui Jia, Xiuye Gu, Mehek Sharma, Keyu Zhang, Rachel Hornung, Oran Lang, Jess Gallegos, William T. Freeman and David Salesin for their collaboration, helpful discussions, feedback and support.
We thank the owners of the images and videos used in our experiments (attribution links below) for sharing their valuable assets.
* Attributions:
Mona Lisa, public domain.
Pillars of Creation, public domain.
Raising the Flag on Iwo Jima, public domain.
Mask of Tutankhamun, CC BY-SA 3.0.
Girl with a Pearl Earring, public domain.
Isaac Newton, public domain.
Starry Night, public domain.
The Great Wave off Kanagawa, public domain.
Our primary goal in this work is to enable novice users to generate visual content in a creative and flexible way. However, our technology could be misused to create fake or harmful content, and we believe it is crucial to develop and apply tools for detecting biases and malicious use cases in order to ensure safe and fair use.