Dynamic 3D scene reconstruction is essential for immersive media such as VR, MR, and XR, yet it remains challenging for long multi-view sequences with large-scale motion. Existing dynamic Gaussian approaches are either Frame-Stream methods, which offer scalability but poor temporal stability, or Clip methods, which achieve local consistency at the cost of high memory usage and limited sequence length. We propose ClipGStream, a hybrid reconstruction framework that performs stream optimization at the clip level rather than the frame level. The sequence is divided into short clips; within each clip, dynamic motion is modeled by a clip-independent spatio-temporal field with residual anchor compensation to capture local variations efficiently, while anchors and decoders inherited across clips maintain structural consistency. This Clip-Stream design enables scalable, flicker-free reconstruction of long dynamic videos with high temporal coherence and reduced memory overhead. Extensive experiments demonstrate that ClipGStream achieves state-of-the-art reconstruction quality and efficiency.
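Because the abstract compresses the pipeline into a single sentence, a minimal sketch of the intended control flow may help. This is a sketch under stated assumptions, not the authors' implementation: it assumes a simple list-based anchor representation, and every name in it (ClipState, fit_clip_field, fit_residual_anchors) is a hypothetical placeholder.

```python
# A minimal control-flow sketch of the Clip-Stream idea described in the
# abstract. All names below are hypothetical placeholders, not the
# authors' actual API.
from dataclasses import dataclass, field


@dataclass
class ClipState:
    """State inherited between clips to keep structure consistent."""
    anchors: list = field(default_factory=list)   # inherited structural anchors
    decoders: dict = field(default_factory=dict)  # shared feature decoders


def fit_clip_field(clip, anchors):
    """Placeholder: optimize a clip-independent spatio-temporal field."""
    return {"frames": len(clip), "anchors_used": len(anchors)}


def fit_residual_anchors(clip, st_field, anchors):
    """Placeholder: residual anchors compensating for local variations
    the spatio-temporal field misses."""
    return [f"residual@{clip[0]}"]


def reconstruct(frames, clip_len=10):
    """Split the sequence into short clips and stream-optimize clip by clip."""
    state = ClipState()
    results = []
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        st_field = fit_clip_field(clip, state.anchors)      # local motion
        residuals = fit_residual_anchors(clip, st_field, state.anchors)
        state.anchors += residuals   # inherit anchors for the next clip
        results.append((st_field, list(state.anchors)))
    return results


if __name__ == "__main__":
    # Toy sequence of 35 "frames" split into four clips of up to 10 frames.
    for st_field, anchors in reconstruct(list(range(35)), clip_len=10):
        print(st_field, anchors)
```

Under this reading, per-clip fields bound peak optimization memory to a single clip, the inherited anchor/decoder state is what suppresses inter-clip flicker, and random access reduces to loading the state of the clip containing the queried timestamp.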
Figure: Teaser comparison on a long, large-motion sequence (panels: Ours; LocalDyGS (one clip); LocalDyGS (140 clips); 4DGaussian (one clip); Ours (140 clips)). Unlike prior approaches, our method is the first to achieve temporal consistency on long-sequence, large-motion datasets while supporting efficient random access, attaining higher fidelity for both dynamic objects (e.g., athletes) and static regions (e.g., floor textures).
Figure: Qualitative comparison (panels: 3DGStream; SpaceTimeGS; 4DGaussian; Ours). Our method achieves better dynamic modeling of fine-grained motion, such as the man's hand movements and the dogs' facial expressions.
Figure: Additional qualitative comparison (panels: 4DGaussian; Grid4D; LocalDyGS; Ours).