LDF-VFI employs holistic modeling for generative video frame interpolation (VFI), enabling stable handling of large and complex motion. We present input videos (top) and 16x VFI results produced by LDF-VFI (bottom) on the SNU-FILM dataset.
Existing video frame interpolation (VFI) methods typically follow a frame-centric paradigm, processing videos as short, independent segments (e.g., triplets). This often leads to temporal inconsistencies and motion artifacts, especially for long sequences with large motion. We propose a holistic, video-centric framework named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI), which models the entire video sequence to ensure long-range temporal coherence.
Our approach is built upon a diffusion transformer that synthesizes all frames within a temporal chunk jointly and connects consecutive chunks auto-regressively. To mitigate error accumulation during auto-regressive inference, we introduce a skip-concatenate sampling strategy that periodically resets and reconnects context, leading to stable generation over long sequences. Furthermore, LDF-VFI combines sparse, local attention with tiled VAE encoding to efficiently support high resolutions (e.g., 4K) without retraining, and employs a conditional VAE decoder that leverages multi-scale features from the input low-frame-rate video to improve reconstruction fidelity.
Empirically, LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per-frame quality and temporal consistency compared with both optical-flow-based and diffusion-based VFI baselines.
Traditional VFI methods typically interpolate frames in isolation (e.g., predicting a single intermediate frame from a triplet), often neglecting long-term temporal dependencies. In contrast, LDF-VFI adopts a holistic, video-centric paradigm. By modeling the entire high-frame-rate video sequence conditioned on the complete low-frame-rate input, our approach explicitly captures temporal correlations across all interpolated frames and leverages more of the input information.
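As a rough illustration of this video-centric setup, the sketch below (our own simplification, not the actual pipeline) interleaves the observed low-frame-rate frames with noise placeholders so that the whole high-frame-rate target sequence, together with a mask of known positions, is exposed to the model at once; the interpolation factor, tensor layout, and function name are assumptions.

```python
import torch

def build_holistic_input(lfr_frames: torch.Tensor, factor: int):
    """Interleave low-frame-rate frames with noise placeholders so the model
    sees the whole target sequence at once (hypothetical layout, for illustration).

    lfr_frames: (T, C, H, W) observed input frames; factor: interpolation multiple.
    Returns the full-length sequence and a mask marking known (input) positions.
    """
    T, C, H, W = lfr_frames.shape
    total = (T - 1) * factor + 1                   # length of the high-frame-rate sequence
    seq = torch.randn(total, C, H, W)              # noise at positions to be synthesized
    known = torch.zeros(total, dtype=torch.bool)
    seq[::factor] = lfr_frames                     # keep the observed frames at their slots
    known[::factor] = True
    return seq, known

# Example: a 17-frame clip interpolated 8x -> 129-frame target sequence.
lfr = torch.randn(17, 3, 256, 256)
seq, known = build_holistic_input(lfr, factor=8)
print(seq.shape, known.sum().item())               # torch.Size([129, 3, 256, 256]) 17
```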
Our architecture is built upon a 3D Diffusion Transformer (DiT) optimized for video generation. Inputs are encoded into a compact latent space using a spatially tiled and temporally non-overlapping VAE encoder, which ensures constant memory usage and discrete temporal units for auto-regressive processing.
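The sketch below illustrates the general idea of spatially tiled, temporally non-overlapping encoding with a stand-in encoder; the downsampling factors, tile and chunk sizes, and the absence of overlap blending at tile borders are simplifying assumptions, not the actual VAE configuration.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for the VAE encoder: 8x spatial, 4x temporal downsampling (assumed factors)."""
    def __init__(self, in_ch=3, z_ch=16):
        super().__init__()
        self.net = nn.Conv3d(in_ch, z_ch, kernel_size=(4, 8, 8), stride=(4, 8, 8))

    def forward(self, x):                          # x: (B, C, T, H, W)
        return self.net(x)

@torch.no_grad()
def tiled_encode(video, encoder, tile=256, chunk=16):
    """Encode a long, high-resolution video with bounded memory by splitting it
    into non-overlapping temporal chunks and spatial tiles (simplified: no
    overlap or blending at tile borders, unlike a production tiled VAE)."""
    B, C, T, H, W = video.shape
    t_lat, s_lat = chunk // 4, tile // 8           # latent extent per chunk / tile
    out = torch.empty(B, 16, T // 4, H // 8, W // 8)
    for t0 in range(0, T, chunk):                  # temporally non-overlapping chunks
        for y0 in range(0, H, tile):
            for x0 in range(0, W, tile):           # spatial tiles
                patch = video[:, :, t0:t0 + chunk, y0:y0 + tile, x0:x0 + tile]
                z = encoder(patch)
                out[:, :, t0 // 4:t0 // 4 + t_lat,
                    y0 // 8:y0 // 8 + s_lat, x0 // 8:x0 // 8 + s_lat] = z
    return out

video = torch.randn(1, 3, 32, 512, 512)            # (B, C, T, H, W), divisible by chunk/tile sizes
latents = tiled_encode(video, ToyEncoder())
print(latents.shape)                               # torch.Size([1, 16, 8, 64, 64])
```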
To efficiently handle high-resolution videos (e.g., 4K), we employ a hybrid sparse attention mechanism: chunk-based sliding window attention in the spatial dimension to reduce computational complexity, and full attention in the temporal dimension to capture complex, non-local motion dynamics. This design allows the model to scale to high resolutions while maintaining strong temporal modeling capabilities.
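A minimal sketch of this hybrid scheme is shown below, using block-local spatial attention (a simplification of the chunk-based sliding window) followed by full temporal attention on a latent token grid; projections, positional encodings, and the real window layout are omitted, and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def hybrid_sparse_attention(x, window=8, heads=4):
    """Illustrative hybrid attention over a latent token grid x: (B, T, H, W, C).
    Spatial pass: attention restricted to non-overlapping window x window blocks.
    Temporal pass: full attention along T at every spatial location.
    (Self-attention without learned projections, just to show the token grouping.)"""
    B, T, H, W, C = x.shape
    d = C // heads

    def attend(tokens):                            # tokens: (..., L, C) -> same shape
        q = tokens.unflatten(-1, (heads, d)).transpose(-3, -2)   # (..., heads, L, d)
        out = F.scaled_dot_product_attention(q, q, q)
        return out.transpose(-3, -2).flatten(-2)

    # Spatial block-local attention within each frame.
    xs = x.reshape(B * T, H // window, window, W // window, window, C)
    xs = xs.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    xs = attend(xs)
    xs = xs.reshape(B * T, H // window, W // window, window, window, C)
    xs = xs.permute(0, 1, 3, 2, 4, 5).reshape(B, T, H, W, C)

    # Full temporal attention across all frames at each spatial position.
    xt = xs.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
    xt = attend(xt)
    return xt.reshape(B, H, W, T, C).permute(0, 3, 1, 2, 4)

tokens = torch.randn(1, 8, 32, 32, 64)             # (B, T, H, W, C) latent tokens
print(hybrid_sparse_attention(tokens).shape)       # torch.Size([1, 8, 32, 32, 64])
```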
To enhance reconstruction quality, we design a conditional VAE decoder. A dedicated conditional encoder extracts multi-scale spatio-temporal features from the low-frame-rate input video. These features are injected into the VAE decoder through zero-initialized convolutional layers, similar to ControlNet, providing fine-grained guidance that preserves the original details and textures of the input frames.
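The following sketch shows the ControlNet-style injection pattern with zero-initialized convolutions; the channel sizes and the simple per-scale addition are illustrative assumptions rather than the exact decoder design.

```python
import torch
import torch.nn as nn

class ZeroConvInjector(nn.Module):
    """ControlNet-style feature injection: multi-scale condition features are
    added to the decoder features through zero-initialized 1x1 convolutions,
    so training starts from the unmodified decoder behavior. Channel sizes
    here are illustrative, not the actual configuration."""
    def __init__(self, channels=(512, 256, 128)):
        super().__init__()
        self.zero_convs = nn.ModuleList(nn.Conv2d(c, c, kernel_size=1) for c in channels)
        for conv in self.zero_convs:               # zero init: injection starts as a no-op
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, decoder_feats, cond_feats):
        # decoder_feats / cond_feats: lists of per-scale features with matching shapes
        return [f + conv(c) for f, c, conv in zip(decoder_feats, cond_feats, self.zero_convs)]

injector = ZeroConvInjector()
dec = [torch.randn(1, 512, 32, 32), torch.randn(1, 256, 64, 64), torch.randn(1, 128, 128, 128)]
cond = [torch.randn_like(f) for f in dec]
fused = injector(dec, cond)
print(torch.allclose(fused[0], dec[0]))            # True at initialization (zero-weight convs)
```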
To handle videos of arbitrary length, LDF-VFI generates temporal chunks auto-regressively. During training, we employ a chunk-level diffusion-forcing strategy, teaching the model to synthesize the current chunk conditioned on previously generated context.
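A toy version of chunk-level diffusion forcing is sketched below: each temporal chunk of the latent video receives an independently sampled noise level, so the model learns to denoise a chunk while its context sits at arbitrary (including near-clean) noise levels. The noise schedule, chunk layout, and tensor shapes are assumptions for illustration.

```python
import torch

def chunk_diffusion_forcing_noise(latents, num_chunks, num_timesteps=1000):
    """Chunk-level diffusion forcing (sketch): each temporal chunk of the latent
    video gets its own independently sampled timestep, so earlier chunks can act
    as (nearly) clean context while later chunks are heavily noised.
    latents: (B, C, T, H, W); assumes T divides evenly into num_chunks."""
    B, C, T, H, W = latents.shape
    chunk_len = T // num_chunks
    t = torch.randint(0, num_timesteps, (B, num_chunks))            # one timestep per chunk
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_timesteps) ** 2  # toy cosine noise schedule
    # Broadcast the per-chunk signal level over the frames inside each chunk.
    a = alpha_bar.repeat_interleave(chunk_len, dim=1).view(B, 1, T, 1, 1)
    noise = torch.randn_like(latents)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise
    return noisy, noise, t

latents = torch.randn(2, 16, 12, 32, 32)           # 12 latent frames = 4 chunks of 3
noisy, noise, t = chunk_diffusion_forcing_noise(latents, num_chunks=4)
print(noisy.shape, t.shape)                        # torch.Size([2, 16, 12, 32, 32]) torch.Size([2, 4])
```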
For inference, we introduce a skip-concatenate sampling scheme to mitigate error accumulation. The model periodically generates a "skip" chunk that is independent of the immediately preceding predictions, followed by a "concatenate" chunk that bridges the timeline using both the skip chunk and the earlier context. This strategy effectively bounds long-term error propagation while preserving smooth and coherent motion dynamics.
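The schedule below sketches one possible reading of this skip-concatenate scheme, with chunk indices standing in for generated latents and a placeholder generation call; the reset period and the exact choice of conditioning context are assumptions, not the actual sampler.

```python
from typing import Callable, List

def skip_concatenate_sampling(
    num_chunks: int,
    generate_chunk: Callable[[List[int]], int],
    period: int = 4,
):
    """Sketch of a skip-concatenate schedule (our reading; the period and the
    context selection are assumptions). `generate_chunk` stands in for one
    auto-regressive diffusion call and receives the indices of the chunks used
    as conditioning context; chunks are represented by their indices here."""
    chunks: List[int] = []
    for i in range(num_chunks):
        if i > 0 and i % period == 0:
            # "Skip" chunk: generated without the immediately preceding predictions,
            # which periodically resets accumulated error.
            context = chunks[:1]                   # e.g. only the earliest / anchor context
        elif i > 0 and i % period == 1:
            # "Concatenate" chunk: bridges the timeline using both the fresh skip
            # chunk and the earlier context so motion stays coherent across the reset.
            context = chunks[:1] + chunks[-1:]
        else:
            context = chunks[-1:]                  # ordinary auto-regressive step
        chunks.append(generate_chunk(context))
    return chunks

# Dummy generator that just records which context each chunk was conditioned on.
log = []
skip_concatenate_sampling(8, lambda ctx: (log.append(ctx), len(log) - 1)[1])
for i, ctx in enumerate(log):
    print(f"chunk {i}: conditioned on chunks {ctx}")
```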
LDF-VFI maintains temporal consistency over long durations by employing a skip-concatenate sampling strategy, which effectively prevents error accumulation and drift during auto-regressive inference. We present a video of over 2000 frames generated by our method. Top: input video. Bottom: video generated by LDF-VFI.