ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction

1 Guangdong Provincial Key Laboratory of Ultra High Definition Immersive Media Technology
Shenzhen Graduate School, Peking University
2 Pengcheng Lab, 3 Peking University, 4 MIGU Video Co., Ltd.
CVPR 2026
Training Pipeline

Abstract

Dynamic 3D scene reconstruction is essential for immersive media such as VR, MR, and XR, yet remains challenging for long multi-view sequences with large-scale motion. Existing dynamic Gaussian approaches fall into two camps: Frame-Stream methods, which scale to long sequences but suffer from poor temporal stability, and Clip methods, which achieve local consistency at the cost of high memory use and limited sequence length. We propose ClipGStream, a hybrid reconstruction framework that performs stream optimization at the clip level rather than the frame level. The sequence is divided into short clips. Within each clip, dynamic motion is modeled using clip-independent spatio-temporal fields and residual anchor compensation, which capture local variations efficiently. Across clips, inherited anchors and decoders maintain structural consistency. This Clip-Stream design enables scalable, flicker-free reconstruction of long dynamic videos with high temporal coherence and reduced memory overhead. Extensive experiments demonstrate that ClipGStream achieves state-of-the-art reconstruction quality and efficiency.
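To make the clip-level partitioning concrete, here is a minimal sketch of the bookkeeping only, not the authors' implementation. It assumes uniform, non-overlapping clips. The clip length of 10 is inferred from the Long 360 comparison on this page (1400 frames shown as 140 clips), and the function names `split_into_clips` and `locate_frame` are hypothetical.

```python
from typing import List, Tuple


def split_into_clips(num_frames: int, clip_len: int) -> List[Tuple[int, int]]:
    """Split a sequence into consecutive half-open [start, end) frame ranges.

    Assumption: clips are uniform and non-overlapping; the last clip may be
    shorter when num_frames is not a multiple of clip_len.
    """
    if num_frames < 0 or clip_len <= 0:
        raise ValueError("num_frames must be >= 0 and clip_len must be > 0")
    return [
        (start, min(start + clip_len, num_frames))
        for start in range(0, num_frames, clip_len)
    ]


def locate_frame(frame: int, clip_len: int) -> Tuple[int, int]:
    """Map a global frame index to (clip index, frame offset inside the clip).

    With uniform clips this is a constant-time lookup, which is one simple way
    a clip-based layout can support random access to any frame.
    """
    if frame < 0 or clip_len <= 0:
        raise ValueError("frame must be >= 0 and clip_len must be > 0")
    return divmod(frame, clip_len)


# Assumed setting: 1400 frames, 10 frames per clip (inferred, see lead-in).
clips = split_into_clips(1400, 10)
clip_index, offset = locate_frame(1234, 10)
```

Under these assumptions, a viewer that jumps to an arbitrary frame only needs the data belonging to one clip, rather than replaying a per-frame stream from the start of the sequence.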

Long 360 Dataset (Large-Scale Motion, 1400 frames)

Our method is the first to achieve temporal consistency on long-sequence, large-motion datasets while also supporting efficient random access.

Ours

LocalDyGS (One Clip)

LocalDyGS (140 Clips)

4DGaussian (One Clip)

Ours (140 Clips)

VRU Dataset GZ scene (Large-Scale Motion)

Our method achieves higher fidelity for both dynamic objects (e.g., athletes) and static regions (e.g., floor textures) in scenes with large motion.

3DGStream

SpaceTimeGS

4DGaussian

Ours

N3DV Dataset (Fine-Scale Motion)

Our method models fine-grained motion more accurately, such as the man's hand movements and the facial expressions of the dogs.

4DGaussian

Grid4D

LocalDyGS

Ours

Method
