LDF-VFI employs holistic modeling for generative video frame interpolation (VFI), enabling stable handling of large and complex motion. We present input videos (top) and 16x VFI results produced by LDF-VFI (bottom) on the SNU-FILM dataset.
Existing video frame interpolation (VFI) methods typically follow a frame-centric paradigm, processing videos as short, independent segments (e.g., triplets). This often leads to temporal inconsistencies and motion artifacts, especially for long sequences with large motion. We propose a holistic, video-centric framework named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI), which models the entire video sequence to ensure long-range temporal coherence.
Our approach is built upon a diffusion transformer that synthesizes all frames within a temporal chunk jointly and connects consecutive chunks auto-regressively. To mitigate error accumulation during auto-regressive inference, we introduce a skip-concatenate sampling strategy that periodically resets and reconnects context, leading to stable generation over long sequences. Furthermore, LDF-VFI combines sparse, local attention with tiled VAE encoding to efficiently support high resolutions (e.g., 4K) without retraining, and employs a conditional VAE decoder that leverages multi-scale features from the input low-frame-rate video to improve reconstruction fidelity.
Empirically, LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per-frame quality and temporal consistency compared with both optical-flow-based and diffusion-based VFI baselines.
Traditional VFI methods typically interpolate frames in isolation (e.g., predicting a single intermediate frame from a triplet), often neglecting long-term temporal dependencies. In contrast, LDF-VFI adopts a holistic, video-centric paradigm. By modeling the entire high-frame-rate video sequence conditioned on the complete low-frame-rate input, our approach explicitly captures temporal correlations across all interpolated frames and leverages more of the input information.
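As a rough illustration of this video-centric setup, the sketch below (our own simplification, not the actual pipeline) interleaves the observed low-frame-rate frames with noise placeholders so that the whole high-frame-rate target sequence, together with a mask of known positions, is exposed to the model at once; the interpolation factor, tensor layout, and function name are assumptions.

```python
import torch

def build_holistic_input(lfr_frames: torch.Tensor, factor: int):
    """Interleave low-frame-rate frames with noise placeholders so the model
    sees the whole target sequence at once (hypothetical layout, for illustration).

    lfr_frames: (T, C, H, W) observed input frames; factor: interpolation multiple.
    Returns the full-length sequence and a mask marking known (input) positions.
    """
    T, C, H, W = lfr_frames.shape
    total = (T - 1) * factor + 1                   # length of the high-frame-rate sequence
    seq = torch.randn(total, C, H, W)              # noise at positions to be synthesized
    known = torch.zeros(total, dtype=torch.bool)
    seq[::factor] = lfr_frames                     # keep the observed frames at their slots
    known[::factor] = True
    return seq, known

# Example: a 17-frame clip interpolated 8x -> 129-frame target sequence.
lfr = torch.randn(17, 3, 256, 256)
seq, known = build_holistic_input(lfr, factor=8)
print(seq.shape, known.sum().item())               # torch.Size([129, 3, 256, 256]) 17
```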
Our architecture is built upon a 3D Diffusion Transformer (DiT) optimized for video generation. Inputs are encoded into a compact latent space using a spatially tiled and temporally non-overlapping VAE encoder, which ensures constant memory usage and discrete temporal units for auto-regressive processing.
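The sketch below illustrates the general idea of spatially tiled, temporally non-overlapping encoding with a stand-in encoder; the downsampling factors, tile and chunk sizes, and the absence of overlap blending at tile borders are simplifying assumptions, not the actual VAE configuration.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Stand-in for the VAE encoder: 8x spatial, 4x temporal downsampling (assumed factors)."""
    def __init__(self, in_ch=3, z_ch=16):
        super().__init__()
        self.net = nn.Conv3d(in_ch, z_ch, kernel_size=(4, 8, 8), stride=(4, 8, 8))

    def forward(self, x):                          # x: (B, C, T, H, W)
        return self.net(x)

@torch.no_grad()
def tiled_encode(video, encoder, tile=256, chunk=16):
    """Encode a long, high-resolution video with bounded memory by splitting it
    into non-overlapping temporal chunks and spatial tiles (simplified: no
    overlap or blending at tile borders, unlike a production tiled VAE)."""
    B, C, T, H, W = video.shape
    t_lat, s_lat = chunk // 4, tile // 8           # latent extent per chunk / tile
    out = torch.empty(B, 16, T // 4, H // 8, W // 8)
    for t0 in range(0, T, chunk):                  # temporally non-overlapping chunks
        for y0 in range(0, H, tile):
            for x0 in range(0, W, tile):           # spatial tiles
                patch = video[:, :, t0:t0 + chunk, y0:y0 + tile, x0:x0 + tile]
                z = encoder(patch)
                out[:, :, t0 // 4:t0 // 4 + t_lat,
                    y0 // 8:y0 // 8 + s_lat, x0 // 8:x0 // 8 + s_lat] = z
    return out

video = torch.randn(1, 3, 32, 512, 512)            # (B, C, T, H, W), divisible by chunk/tile sizes
latents = tiled_encode(video, ToyEncoder())
print(latents.shape)                               # torch.Size([1, 16, 8, 64, 64])
```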
To efficiently handle high-resolution videos (e.g., 4K), we employ a hybrid sparse attention mechanism: chunk-based sliding window attention in the spatial dimension to reduce computational complexity, and full attention in the temporal dimension to capture complex, non-local motion dynamics. This design allows the model to scale to high resolutions while maintaining strong temporal modeling capabilities.
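A minimal sketch of this hybrid scheme is shown below, using block-local spatial attention (a simplification of the chunk-based sliding window) followed by full temporal attention on a latent token grid; projections, positional encodings, and the real window layout are omitted, and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def hybrid_sparse_attention(x, window=8, heads=4):
    """Illustrative hybrid attention over a latent token grid x: (B, T, H, W, C).
    Spatial pass: attention restricted to non-overlapping window x window blocks.
    Temporal pass: full attention along T at every spatial location.
    (Self-attention without learned projections, just to show the token grouping.)"""
    B, T, H, W, C = x.shape
    d = C // heads

    def attend(tokens):                            # tokens: (..., L, C) -> same shape
        q = tokens.unflatten(-1, (heads, d)).transpose(-3, -2)   # (..., heads, L, d)
        out = F.scaled_dot_product_attention(q, q, q)
        return out.transpose(-3, -2).flatten(-2)

    # Spatial block-local attention within each frame.
    xs = x.reshape(B * T, H // window, window, W // window, window, C)
    xs = xs.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    xs = attend(xs)
    xs = xs.reshape(B * T, H // window, W // window, window, window, C)
    xs = xs.permute(0, 1, 3, 2, 4, 5).reshape(B, T, H, W, C)

    # Full temporal attention across all frames at each spatial position.
    xt = xs.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
    xt = attend(xt)
    return xt.reshape(B, H, W, T, C).permute(0, 3, 1, 2, 4)

tokens = torch.randn(1, 8, 32, 32, 64)             # (B, T, H, W, C) latent tokens
print(hybrid_sparse_attention(tokens).shape)       # torch.Size([1, 8, 32, 32, 64])
```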
To enhance reconstruction quality, we design a conditional VAE decoder. A dedicated conditional encoder extracts multi-scale spatio-temporal features from the low-frame-rate input video. These features are injected into the VAE decoder through zero-initialized convolutional layers, similar to ControlNet, providing fine-grained guidance that preserves the original details and textures of the input frames.
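The following sketch shows the ControlNet-style injection pattern with zero-initialized convolutions; the channel sizes and the simple per-scale addition are illustrative assumptions rather than the exact decoder design.

```python
import torch
import torch.nn as nn

class ZeroConvInjector(nn.Module):
    """ControlNet-style feature injection: multi-scale condition features are
    added to the decoder features through zero-initialized 1x1 convolutions,
    so training starts from the unmodified decoder behavior. Channel sizes
    here are illustrative, not the actual configuration."""
    def __init__(self, channels=(512, 256, 128)):
        super().__init__()
        self.zero_convs = nn.ModuleList(nn.Conv2d(c, c, kernel_size=1) for c in channels)
        for conv in self.zero_convs:               # zero init: injection starts as a no-op
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, decoder_feats, cond_feats):
        # decoder_feats / cond_feats: lists of per-scale features with matching shapes
        return [f + conv(c) for f, c, conv in zip(decoder_feats, cond_feats, self.zero_convs)]

injector = ZeroConvInjector()
dec = [torch.randn(1, 512, 32, 32), torch.randn(1, 256, 64, 64), torch.randn(1, 128, 128, 128)]
cond = [torch.randn_like(f) for f in dec]
fused = injector(dec, cond)
print(torch.allclose(fused[0], dec[0]))            # True at initialization (zero-weight convs)
```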
To handle videos of arbitrary length, LDF-VFI generates temporal chunks auto-regressively. During training, we employ a chunk-level diffusion-forcing strategy, teaching the model to synthesize the current chunk conditioned on previously generated context.
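A toy version of chunk-level diffusion forcing is sketched below: each temporal chunk of the latent video receives an independently sampled noise level, so the model learns to denoise a chunk while its context sits at arbitrary (including near-clean) noise levels. The noise schedule, chunk layout, and tensor shapes are assumptions for illustration.

```python
import torch

def chunk_diffusion_forcing_noise(latents, num_chunks, num_timesteps=1000):
    """Chunk-level diffusion forcing (sketch): each temporal chunk of the latent
    video gets its own independently sampled timestep, so earlier chunks can act
    as (nearly) clean context while later chunks are heavily noised.
    latents: (B, C, T, H, W); assumes T divides evenly into num_chunks."""
    B, C, T, H, W = latents.shape
    chunk_len = T // num_chunks
    t = torch.randint(0, num_timesteps, (B, num_chunks))            # one timestep per chunk
    alpha_bar = torch.cos(0.5 * torch.pi * t / num_timesteps) ** 2  # toy cosine noise schedule
    # Broadcast the per-chunk signal level over the frames inside each chunk.
    a = alpha_bar.repeat_interleave(chunk_len, dim=1).view(B, 1, T, 1, 1)
    noise = torch.randn_like(latents)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise
    return noisy, noise, t

latents = torch.randn(2, 16, 12, 32, 32)           # 12 latent frames = 4 chunks of 3
noisy, noise, t = chunk_diffusion_forcing_noise(latents, num_chunks=4)
print(noisy.shape, t.shape)                        # torch.Size([2, 16, 12, 32, 32]) torch.Size([2, 4])
```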
For inference, we introduce a skip-concatenate sampling scheme to mitigate error accumulation. The model periodically generates a "skip" chunk that is independent of the immediately preceding predictions, followed by a "concatenate" chunk that bridges the timeline using both the skip chunk and the earlier context. This strategy effectively bounds long-term error propagation while preserving smooth and coherent motion dynamics.
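The schedule below sketches one possible reading of this skip-concatenate scheme, with chunk indices standing in for generated latents and a placeholder generation call; the reset period and the exact choice of conditioning context are assumptions, not the actual sampler.

```python
from typing import Callable, List

def skip_concatenate_sampling(
    num_chunks: int,
    generate_chunk: Callable[[List[int]], int],
    period: int = 4,
):
    """Sketch of a skip-concatenate schedule (our reading; the period and the
    context selection are assumptions). `generate_chunk` stands in for one
    auto-regressive diffusion call and receives the indices of the chunks used
    as conditioning context; chunks are represented by their indices here."""
    chunks: List[int] = []
    for i in range(num_chunks):
        if i > 0 and i % period == 0:
            # "Skip" chunk: generated without the immediately preceding predictions,
            # which periodically resets accumulated error.
            context = chunks[:1]                   # e.g. only the earliest / anchor context
        elif i > 0 and i % period == 1:
            # "Concatenate" chunk: bridges the timeline using both the fresh skip
            # chunk and the earlier context so motion stays coherent across the reset.
            context = chunks[:1] + chunks[-1:]
        else:
            context = chunks[-1:]                  # ordinary auto-regressive step
        chunks.append(generate_chunk(context))
    return chunks

# Dummy generator that just records which context each chunk was conditioned on.
log = []
skip_concatenate_sampling(8, lambda ctx: (log.append(ctx), len(log) - 1)[1])
for i, ctx in enumerate(log):
    print(f"chunk {i}: conditioned on chunks {ctx}")
```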
LDF-VFI maintains temporal consistency over long durations by employing a skip-concatenate sampling strategy, which effectively prevents error accumulation and drift during auto-regressive inference. We present a video of over 2000 frames generated by our method. Top: input video. Bottom: video generated by LDF-VFI.