Motivation experiments highlight the identified problems and demonstrate the effectiveness of the proposed solutions: K-RNR and Stochastic Latent Modulation.
To illustrate the zero terminal SNR problem discussed in Section 4.1, we present an example in the context of video inversion. It exposes the limitations of standard video reconstruction methods and demonstrates the effectiveness of our proposed solution, K-RNR.
Reference Video
Reverse diffusion process starts with DDIM inverted latent as the initial noise.
VAE encoded original video is noised with random standard‑normal noise through the formula: √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
Random standard‑normal noise initialization and Key–Value sharing of DDIM inverted latent along the sequence dimension.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=3.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=4.
Reference Video
Reverse diffusion process starts with DDIM inverted latent as the initial noise.
VAE encoded original video is noised with DDIM inverted latent through the formula: √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
Random standard‑normal noise initialization and Key–Value sharing of DDIM inverted latent along the sequence dimension.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=2.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=3.
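The noising formula in the captions above is the standard diffusion forward process, x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε. A minimal sketch, where the variable names are illustrative (not from the paper's code) and `eps` can be either random standard‑normal noise or a DDIM-inverted latent:

```python
import numpy as np

def noise_latent(x0: np.ndarray, alpha_bar_t: float, eps: np.ndarray) -> np.ndarray:
    """Forward-noise a clean latent x0 to timestep t:
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# eps may be i.i.d. standard-normal noise (the baseline) or a
# DDIM-inverted latent (the variants compared in this experiment).
x0 = np.random.randn(4, 8, 8).astype(np.float32)   # toy latent
eps = np.random.randn(*x0.shape).astype(np.float32)
xt = noise_latent(x0, alpha_bar_t=0.5, eps=eps)
```

At ᾱ_t = 1 the formula returns the clean latent unchanged; at ᾱ_t = 0 it returns pure `eps`, which is exactly where the zero terminal SNR issue arises.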
This motivation experiment, corresponding to Figure 2 in the paper and supporting Section 4.1, demonstrates common workarounds for the zero terminal SNR collapse problem and shows how they fall short in the Dynamic Video Synthesis task.
Reference Video
3D point cloud rendering of the reference video with unseen regions.
The strength parameter is set to 0.88, which shortens the diffusion path and merely reconstructs the unseen regions rather than completing them.
The strength parameter is set to 0.95, which improves propagation into unseen regions but causes identity drift.
VAE encoded original video is noised with DDIM inverted latent through the formula: √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
Our K-RNR with K=6 and Stochastic Latent Modulation preserves identity fidelity and plausibly completes newly visible regions.
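The strength parameter in the captions above follows the usual image-to-image convention: it sets how far along the noise schedule denoising begins, so a higher strength means a longer diffusion path. A minimal sketch of that mapping (the helper name and the 50-step schedule are illustrative assumptions, not the paper's code):

```python
def strength_to_num_steps(strength: float, num_inference_steps: int) -> int:
    """Number of denoising steps actually run for a given strength.

    strength = 1.0 runs the full schedule (pure-noise start); lower
    strength skips the earliest, noisiest steps, preserving more of
    the source video but limiting how much new content (e.g. unseen
    regions) can be synthesized.
    """
    return min(int(num_inference_steps * strength), num_inference_steps)

steps_088 = strength_to_num_steps(0.88, 50)  # shorter diffusion path
steps_095 = strength_to_num_steps(0.95, 50)  # longer diffusion path
```

This is why 0.88 preserves identity but cannot fill unseen regions, while 0.95 synthesizes more freely at the cost of identity drift.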
This motivation experiment, corresponding to Figure 7 in the paper and supporting Section 4.4, highlights how VAE-encoded latents and DDIM-inverted latents behave differently under physically implausible, out-of-distribution scenarios, shedding light on their respective capacities for scene representation.
Reference Video
3D point cloud rendering of the reference video with unseen regions.
Smoothed mask of the 3D point cloud render video.
Unseen regions in the processed mask are repeatedly filled with the top-left 60×60 pixels of the reference video.
Both VAE encoding in the reverse diffusion and DDIM inversion are applied to the filled render video.
VAE encoding in the reverse diffusion is applied to the filled render video, and DDIM inversion is applied to the original render video.
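The fill procedure described above (repeating the top-left 60×60 patch of the reference frame into the masked, unseen regions) can be sketched as follows; `frame` and `mask` are illustrative names, and the patch size comes from the caption:

```python
import numpy as np

def fill_unseen(frame: np.ndarray, mask: np.ndarray, patch: int = 60) -> np.ndarray:
    """Fill masked (unseen) pixels by tiling the top-left patch of the frame.

    frame: (H, W, C) image; mask: (H, W) boolean, True where unseen.
    """
    h, w = frame.shape[:2]
    tile = frame[:patch, :patch]
    # Tile the patch over the full frame size, then crop to (H, W).
    reps = (int(np.ceil(h / patch)), int(np.ceil(w / patch)), 1)
    tiled = np.tile(tile, reps)[:h, :w]
    out = frame.copy()
    out[mask] = tiled[mask]
    return out

frame = np.random.rand(120, 180, 3)
mask = np.zeros((120, 180), dtype=bool)
mask[60:, 120:] = True  # pretend the bottom-right block is unseen
filled = fill_unseen(frame, mask)
```

The filled video is deliberately out-of-distribution, which is what lets the experiment separate how VAE encoding and DDIM inversion each represent implausible content.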
Comprehensive showcase of our method's performance across diverse scenarios and challenging cases. All experiments are conducted with the CogVideoX-5B I2V architecture.
Experiments demonstrating the identity and motion preservation capabilities of our method on Sora-generated synthetic videos.
A corgi sits on a blue beach towel, holding a selfie stick with a GoPro
A corgi sits on a blue beach towel, holding a selfie stick with a GoPro
A corgi sits on a blue beach towel, holding a selfie stick with a GoPro
A corgi sits on a blue beach towel, holding a selfie stick with a GoPro
A corgi sits on a blue beach towel, holding a selfie stick with a GoPro
A corgi sits on a blue beach towel, holding a selfie stick with a GoPro
Cat wearing thick round glasses sits on a crimson velvet armchair
Cat wearing thick round glasses sits on a crimson velvet armchair
Cat wearing thick round glasses sits on a crimson velvet armchair
Cat wearing thick round glasses sits on a crimson velvet armchair
Van Gogh sits at a grand wooden desk
Van Gogh sits at a grand wooden desk
A monkey wearing a red cap and a puffy blue vest sits atop a bench
A monkey wearing a red cap and a puffy blue vest sits atop a bench
A confident corgi walks along the shoreline
A confident corgi walks along the shoreline
Experiments investigating the effectiveness of our method in real-world scenarios, specifically movie scenes. Creating a dynamic view of a movie scene requires preserving the identity and complex mouth, hand, and body motions.
A masked figure in a suit stands against the backdrop of a modern cityscape
A masked figure in a suit stands against the backdrop of a modern cityscape
Two men are engaged in a serious conversation in front of an ornate building
Two men are engaged in a serious conversation in front of an ornate building
A police officer in uniform, accompanied by two men in mid-20th century overcoats and hats
Two men in mid-20th century formal attire, including overcoats and fedoras, are on a boat
A cluttered mid-sized corporate office filled with standard office furnishings and supplies
A man in a gray suit and striped tie is seen balancing a large, carved pumpkin on his head
Two hobbits dressed in worn cloaks sit among rugged, rocky terrain under a muted, overcast light
Two officers dressed in elaborate 18th-century naval uniforms stand in an outdoor port setting.
Experiments focusing on human faces and body motions. OpenVid-1M is a large-scale dataset of humans performing various actions. Creating a dynamic view of OpenVid-1M videos requires preserving the identity and complex mouth, hand, and body motions.
A woman with light brown hair and a white headscarf is captured mid-song
A middle-aged man with graying hair, dressed in a dark coat and a purple shirt
A man in a beige suit and green shirt is seated at a table, engaging in a serious conversation
A middle-aged man with glasses, dressed in a light pink shirt, is seen standing indoors, possibly in a living room
A man with shoulder-length brown hair, wearing a dark blazer over a purple shirt, is seated in a professional setting
A middle-aged man with a full beard and graying hair is seen speaking into a microphone with a serious expression
A cluttered mid-sized corporate office filled with standard office furnishings and supplies
A man with short, dark hair, wearing a black turtleneck sweater, is seated in the driver's seat of a car
A woman with long, straight blonde hair and bangs is seated in a dark green booth, wearing a black top
A man with dark curly hair and a serious expression, wearing black-rimmed glasses and a dark jacket over a light shirt,
Visual demonstrations of ablation studies showing the impact of different components in our method.
Experiments demonstrating noise‑initialization strategies and how effectively each preserves the original video's spatial and temporal details.
Reference Video
3D point cloud rendering of the reference video with unseen regions.
Random standard normal noise is directly used as the initial latent.
Random standard‑normal noise initialization and Key–Value sharing of DDIM inverted latent along the sequence dimension.
VAE encoded original video is noised with random standard‑normal noise through the formula: √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
DDIM inverted latent is directly used as the initial latent.
VAE encoded original video is noised with the DDIM inverted latent through the formula: √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=2.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=3.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=4.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=5.
VAE encoded original video is noised with DDIM inverted latent with our noise initialization using order K=6.
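The exact K-RNR update is defined in the paper; as a loose, hypothetical sketch of an "order K" recursive noise initialization, one can re-apply the noising formula K times, each time substituting the previous result for the noise term and starting from the DDIM-inverted latent. All names below are illustrative assumptions, not the paper's code:

```python
import numpy as np

def k_rnr_init(x0, inv_latent, alpha_bar_t, K):
    """Hypothetical order-K recursive noising sketch:
    z_0 = DDIM-inverted latent; z_k = sqrt(a) * x0 + sqrt(1 - a) * z_{k-1}."""
    a, b = np.sqrt(alpha_bar_t), np.sqrt(1.0 - alpha_bar_t)
    z = inv_latent
    for _ in range(K):
        z = a * x0 + b * z
    return z
```

Under this reading, unrolling the recursion gives z_K = a · (1 − b^K)/(1 − b) · x_0 + b^K · z_0 with a = √ᾱ_t and b = √(1 − ᾱ_t), so larger K weights the clean latent more heavily relative to the inverted latent, consistent with the K-sweep shown above.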
Experiments investigating the effectiveness of Stochastic Latent Modulation (SLM) in inpainting unseen regions.
Reference Video
3D point cloud rendering of the reference video with unseen regions.
K-RNR with K=3 is applied without SLM in the noise initialization phase, yielding mere reconstructions of the unseen regions.
K-RNR with K=4 is applied without SLM in the noise initialization phase, yielding mere reconstructions of the unseen regions.
K-RNR with K=3 is applied with SLM in the noise initialization phase, yielding plausible inpainting of unseen regions.
K-RNR with K=4 is applied with SLM in the noise initialization phase, yielding plausible inpainting of unseen regions.
Experiments investigating various scale adaptation strategies for the DDIM inverted latent and their impact on the generated video.
Reference Video
3D point cloud rendering of the reference video with unseen regions.
DDIM inverted latent is min-max rescaled to minimum 0 and maximum 1 in the noise initialization phase.
DDIM inverted latent is normalized to mean 0 and standard deviation 1 in the noise initialization phase.
K-RNR with K=20 is applied without adaptive normalization in the noise initialization phase, yielding high contrast generation.
K-RNR with K=20 is applied with adaptive normalization in the noise initialization phase, yielding more natural generation.
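The two scale adaptations compared above are plain min-max rescaling and mean/standard-deviation normalization of the latent; a minimal sketch (function and variable names are illustrative):

```python
import numpy as np

def minmax_rescale(z: np.ndarray) -> np.ndarray:
    """Rescale latent values to the range [0, 1]."""
    return (z - z.min()) / (z.max() - z.min())

def standardize(z: np.ndarray) -> np.ndarray:
    """Shift and scale latent values to mean 0, standard deviation 1."""
    return (z - z.mean()) / z.std()

z = np.random.randn(4, 8, 8) * 3.0 + 1.5  # toy DDIM-inverted latent
z_mm = minmax_rescale(z)
z_std = standardize(z)
```

Matching the latent's scale to what the denoiser expects at the start of sampling is what distinguishes the high-contrast K=20 result from the more natural one with adaptive normalization.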
Side-by-side comparisons of our method with state-of-the-art approaches across diverse scenarios.
Reference Video
Render Video
Stable Video Diffusion
Stable Video Diffusion
CogVideoX
CogVideoX
Wan2.1
CogVideoX
CogVideoX
Reference Video
Render Video
Stable Video Diffusion
Stable Video Diffusion
CogVideoX
CogVideoX
Wan2.1
CogVideoX
CogVideoX
Reference Video
Render Video
Stable Video Diffusion
Stable Video Diffusion
CogVideoX
CogVideoX
Wan2.1
CogVideoX
CogVideoX
Reference Video
Render Video
Stable Video Diffusion
Stable Video Diffusion
CogVideoX
CogVideoX
Wan2.1
CogVideoX
CogVideoX
Reference Video
Render Video
Stable Video Diffusion
Stable Video Diffusion
CogVideoX
CogVideoX
Wan2.1
CogVideoX
CogVideoX