Motivation experiments highlight the identified problems and demonstrate the effectiveness of the proposed solutions: K-RNR and Stochastic Latent Modulation.
To illustrate the zero terminal SNR problem discussed in Section 4.1, we present an example in the context of video inversion. It exposes the limitations of standard video reconstruction methods and demonstrates the effectiveness of our proposed solution, K-RNR.
Reference Video
Reverse diffusion process starts with DDIM inverted latent as the initial noise.
VAE encoded original video is noised with random standard‑normal noise through the formula: √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
Random standard‑normal noise initialization and Key–Value sharing of DDIM inverted latent along the sequence dimension.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=3.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=4.
Reference Video
Reverse diffusion process starts with DDIM inverted latent as the initial noise.
VAE encoded original video is noised with DDIM inverted latent through the formula: √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
Random standard‑normal noise initialization and Key–Value sharing of DDIM inverted latent along the sequence dimension.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=2.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=3.
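The noising formula in the captions above is the standard diffusion forward process, x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε. A minimal sketch, where the variable names are illustrative (not from the paper's code) and `eps` can be either random standard‑normal noise or a DDIM-inverted latent:

```python
import numpy as np

def noise_latent(x0: np.ndarray, alpha_bar_t: float, eps: np.ndarray) -> np.ndarray:
    """Forward-noise a clean latent x0 to timestep t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# eps may be i.i.d. standard-normal noise (the baseline) or a
# DDIM-inverted latent (the variants compared in this experiment).
x0 = np.random.randn(4, 8, 8).astype(np.float32)   # toy latent
eps = np.random.randn(*x0.shape).astype(np.float32)
xt = noise_latent(x0, alpha_bar_t=0.5, eps=eps)
```

At ᾱ_t = 1 the formula returns the clean latent unchanged; at ᾱ_t = 0 it returns pure `eps`, which is exactly where the zero terminal SNR issue arises.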
This motivation experiment, corresponding to Figure 2 in the paper and supporting Section 4.1, demonstrates common workarounds for the zero terminal SNR collapse problem and shows how they fall short in the Dynamic Video Synthesis task.
Reference Video
3D point cloud rendering of the reference video with unseen regions.
The strength parameter is set to 0.88, which shortens the diffusion path and merely reconstructs the unseen regions rather than completing them.
The strength parameter is set to 0.95, which improves propagation into unseen regions but causes identity drift.
VAE encoded original video is noised with DDIM inverted latent through the formula: √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
Our K-RNR with K=6 and Stochastic Latent Modulation preserves identity fidelity and plausibly completes newly visible regions.
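The strength parameter in the captions above follows the usual image-to-image convention: it sets how far along the noise schedule denoising begins, so a higher strength means a longer diffusion path. A minimal sketch of that mapping (the helper name and the 50-step schedule are illustrative assumptions, not the paper's code):

```python
def strength_to_num_steps(strength: float, num_inference_steps: int) -> int:
    """Number of denoising steps actually run for a given strength.

    strength = 1.0 runs the full schedule (pure-noise start); lower
    strength skips the earliest, noisiest steps, preserving more of
    the source video but limiting how much new content (e.g. unseen
    regions) can be synthesized.
    """
    return min(int(num_inference_steps * strength), num_inference_steps)

steps_088 = strength_to_num_steps(0.88, 50)  # shorter diffusion path
steps_095 = strength_to_num_steps(0.95, 50)  # longer diffusion path
```

This is why 0.88 preserves identity but cannot fill unseen regions, while 0.95 synthesizes more freely at the cost of identity drift.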
This motivation experiment, corresponding to Figure 7 in the paper and supporting Section 4.4, highlights how VAE-encoded latents and DDIM-inverted latents behave differently under physically implausible, out-of-distribution scenarios, shedding light on their respective capacities for scene representation.
Reference Video
3D point cloud rendering of the reference video with unseen regions.
Smoothed mask of the 3D point cloud render video.
Unseen regions in the processed mask are repeatedly filled with the top-left 60×60 pixels of the reference video.
Both VAE encoding in the reverse diffusion and DDIM inversion are applied to the filled render video.
VAE encoding in the reverse diffusion is applied to the filled render video, and DDIM inversion is applied to the original render video.
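The fill procedure described above (repeating the top-left 60×60 patch of the reference frame into the masked, unseen regions) can be sketched as follows; `frame` and `mask` are illustrative names, and the patch size comes from the caption:

```python
import numpy as np

def fill_unseen(frame: np.ndarray, mask: np.ndarray, patch: int = 60) -> np.ndarray:
    """Fill masked (unseen) pixels by tiling the top-left patch of the frame.

    frame: (H, W, C) image; mask: (H, W) boolean, True where unseen.
    """
    h, w = frame.shape[:2]
    tile = frame[:patch, :patch]
    # Tile the patch over the full frame size, then crop to (H, W).
    reps = (int(np.ceil(h / patch)), int(np.ceil(w / patch)), 1)
    tiled = np.tile(tile, reps)[:h, :w]
    out = frame.copy()
    out[mask] = tiled[mask]
    return out

frame = np.random.rand(120, 180, 3)
mask = np.zeros((120, 180), dtype=bool)
mask[60:, 120:] = True  # pretend the bottom-right block is unseen
filled = fill_unseen(frame, mask)
```

The filled video is deliberately out-of-distribution, which is what lets the experiment separate how VAE encoding and DDIM inversion each represent implausible content.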
Comprehensive showcase of our method's performance across diverse scenarios and challenging cases. All experiments are conducted with the CogVideoX-5B I2V architecture.
Experiments demonstrating the identity and motion preservation capabilities of our method on Sora-generated synthetic videos.
A corgi sits on a blue beach towel, holding a selfie stick with a GoPro
A corgi sits on a blue beach towel, holding a selfie stick with a GoPro
A corgi sits on a blue beach towel, holding a selfie stick with a GoPro
A corgi sits on a blue beach towel, holding a selfie stick with a GoPro
A corgi sits on a blue beach towel, holding a selfie stick with a GoPro
A corgi sits on a blue beach towel, holding a selfie stick with a GoPro
Cat wearing thick round glasses sits on a crimson velvet armchair
Cat wearing thick round glasses sits on a crimson velvet armchair
Cat wearing thick round glasses sits on a crimson velvet armchair
Cat wearing thick round glasses sits on a crimson velvet armchair
Van Gogh sits at a grand wooden desk
Van Gogh sits at a grand wooden desk
A monkey wearing a red cap and a puffy blue vest sits atop a bench
A monkey wearing a red cap and a puffy blue vest sits atop a bench
A confident corgi walks along the shoreline
A confident corgi walks along the shoreline
Experiments investigating the effectiveness of our method in real-world scenarios, specifically movie scenes. Creating a dynamic view of a movie scene requires preserving the identity and complex mouth, hand, and body motions.
A masked figure in a suit stands against the backdrop of a modern cityscape
A masked figure in a suit stands against the backdrop of a modern cityscape
Two men are engaged in a serious conversation in front of an ornate building
Two men are engaged in a serious conversation in front of an ornate building
A police officer in uniform, accompanied by two men in mid-20th century overcoats and hats
Two men in mid-20th century formal attire, including overcoats and fedoras, are on a boat
A cluttered mid-sized corporate office filled with standard office furnishings and supplies
A man in a gray suit and striped tie is seen balancing a large, carved pumpkin on his head
Two hobbits dressed in worn cloaks sit among rugged, rocky terrain under a muted, overcast light
Two officers dressed in elaborate 18th-century naval uniforms stand in an outdoor port setting.
Experiments focusing on human faces and body motions. OpenVid-1M is a large-scale dataset of humans performing various actions. Creating a dynamic view of OpenVid-1M videos requires preserving the identity and complex mouth, hand, and body motions.
A woman with light brown hair and a white headscarf is captured mid-song
A middle-aged man with graying hair, dressed in a dark coat and a purple shirt
A man in a beige suit and green shirt is seated at a table, engaging in a serious conversation
A middle-aged man with glasses, dressed in a light pink shirt, is seen standing indoors, possibly in a living room
A man with shoulder-length brown hair, wearing a dark blazer over a purple shirt, is seated in a professional setting
A middle-aged man with a full beard and graying hair is seen speaking into a microphone with a serious expression
A cluttered mid-sized corporate office filled with standard office furnishings and supplies
A man with short, dark hair, wearing a black turtleneck sweater, is seated in the driver's seat of a car
A woman with long, straight blonde hair and bangs is seated in a dark green booth, wearing a black top
A man with dark curly hair and a serious expression, wearing black-rimmed glasses and a dark jacket over a light shirt,
Visual demonstrations of ablation studies showing the impact of different components in our method.
Experiments demonstrating noise‑initialization strategies and how effectively each preserves the original video's spatial and temporal details.
Reference Video
3D point cloud rendering of the reference video with unseen regions.
Random standard normal noise is directly used as the initial latent.
Random standard‑normal noise initialization and Key–Value sharing of DDIM inverted latent along the sequence dimension.
VAE encoded original video is noised with random standard‑normal noise through the formula: √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
DDIM inverted latent is directly used as the initial latent.
VAE encoded original video is noised with the DDIM inverted latent through the formula: √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=2.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=3.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=4.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=5.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=6.
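The exact K-RNR update is defined in the paper; as a loose, hypothetical sketch of an "order K" recursive noise initialization, one can re-apply the noising formula K times, each time substituting the previous result for the noise term and starting from the DDIM-inverted latent. All names below are illustrative assumptions, not the paper's code:

```python
import numpy as np

def k_rnr_init(x0, inv_latent, alpha_bar_t, K):
    """Hypothetical order-K recursive noising sketch:
    z_0 = DDIM-inverted latent; z_k = sqrt(a) * x0 + sqrt(1 - a) * z_{k-1}."""
    a, b = np.sqrt(alpha_bar_t), np.sqrt(1.0 - alpha_bar_t)
    z = inv_latent
    for _ in range(K):
        z = a * x0 + b * z
    return z
```

Under this reading, unrolling the recursion gives z_K = a · (1 − b^K)/(1 − b) · x_0 + b^K · z_0 with a = √ᾱ_t and b = √(1 − ᾱ_t), so larger K weights the clean latent more heavily relative to the inverted latent, consistent with the K-sweep shown above.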
Experiments investigating the effectiveness of Stochastic Latent Modulation (SLM) in inpainting unseen regions.
Reference Video
3D point cloud rendering of the reference video with unseen regions.
K-RNR with K=3 is applied without SLM in the noise initialization phase, yielding mere reconstructions of the unseen regions.
K-RNR with K=4 is applied without SLM in the noise initialization phase, yielding mere reconstructions of the unseen regions.
K-RNR with K=3 is applied with SLM in the noise initialization phase, yielding plausible inpainting of unseen regions.
K-RNR with K=4 is applied with SLM in the noise initialization phase, yielding plausible inpainting of unseen regions.
Experiments investigating various scale adaptation strategies for the DDIM inverted latent and their impact on the generated video.
Reference Video
3D point cloud rendering of the reference video with unseen regions.
DDIM inverted latent is min-max rescaled to minimum 0 and maximum 1 in the noise initialization phase.
DDIM inverted latent is normalized to mean 0 and standard deviation 1 in the noise initialization phase.
K-RNR with K=20 is applied without adaptive normalization in the noise initialization phase, yielding high contrast generation.
K-RNR with K=20 is applied with adaptive normalization in the noise initialization phase, yielding more natural generation.
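The two scale adaptations compared above are plain min-max rescaling and mean/standard-deviation normalization of the latent; a minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def minmax_rescale(z: np.ndarray) -> np.ndarray:
    """Rescale latent values to the range [0, 1]."""
    return (z - z.min()) / (z.max() - z.min())

def standardize(z: np.ndarray) -> np.ndarray:
    """Shift and scale latent values to mean 0, standard deviation 1."""
    return (z - z.mean()) / z.std()

z = np.random.randn(4, 8, 8) * 3.0 + 1.5  # toy DDIM-inverted latent
z_mm = minmax_rescale(z)
z_std = standardize(z)
```

Matching the latent's scale to what the denoiser expects at the start of sampling is what distinguishes the high-contrast K=20 result from the more natural one with adaptive normalization.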
Side-by-side comparisons of our method with state-of-the-art approaches across diverse scenarios.
Reference Video
Render Video
Stable Video Diffusion
Stable Video Diffusion
CogVideoX
CogVideoX
Wan2.1
CogVideoX
CogVideoX
Reference Video
Render Video
Stable Video Diffusion
Stable Video Diffusion
CogVideoX
CogVideoX
Wan2.1
CogVideoX
CogVideoX
Reference Video
Render Video
Stable Video Diffusion
Stable Video Diffusion
CogVideoX
CogVideoX
Wan2.1
CogVideoX
CogVideoX
Reference Video
Render Video
Stable Video Diffusion
Stable Video Diffusion
CogVideoX
CogVideoX
Wan2.1
CogVideoX
CogVideoX
Reference Video
Render Video
Stable Video Diffusion
Stable Video Diffusion
CogVideoX
CogVideoX
Wan2.1
CogVideoX
CogVideoX