TLDR: DynamicVerse is a physical-scale, multi-modal 4D modeling framework for real-world video, comprising a novel automated data curation pipeline and a corresponding large-scale 4D dataset.

We develop DynamicGen, a novel automated data curation pipeline designed to generate physically-aware multi-modal 4D data at scale. The pipeline has two main stages: (1) recovery of metric-scale geometry and moving objects from raw videos, and (2) generation of detailed hierarchical semantic captions at three granularities (i.e., object, camera, and scene). Powered by foundation models (i.e., VFMs, VLMs, and LLMs), DynamicGen efficiently generates 4D data at scale, addressing the scalability, physical reality, and modality diversity limitations of traditional 4D data curation.
We introduce DynamicVerse, a large-scale 4D dataset featuring diverse dynamic scenes accompanied by rich multi-modal annotations, including metric-scale point maps, camera parameters, object masks with corresponding categories, and detailed descriptive captions. DynamicVerse encompasses 100K+ 4D scenes coupled with 800K+ masklets, sourced from a combination of massive 2D video datasets and existing 4D datasets. This represents a significant improvement in data scale, scene diversity, and modality diversity over prior 4D datasets.

Demo Video

We provide a demo video (Raw video➡️Moving Object Recovery➡️Dynamic Point Cloud) to showcase the metric-scale 4D reconstruction capability of DynamicGen. The generated fine-grained semantic annotations can be found in the subsequent sections.

Motivation

🔷Limited Data Diversity and Realism: existing data is mostly confined to indoor scenes or autonomous driving, or suffers from a "simulation-to-real" gap.
🔷Lack of Physical Scale & Rich Semantics: no metric-scale geometry or detailed descriptive captions.
🔷Non-Scalability: capturing data with multiple sensors is not a scalable process.

DynamicGen Pipeline

The DynamicGen pipeline contains two main stages: (1) recovery of metric-scale geometry and moving objects (i.e., object category and mask) from raw videos, and (2) detailed hierarchical caption generation for dynamic content (i.e., object, camera, and scene). The pipeline consists primarily of five steps: 4D scene curation, data filtering, moving object recovery, dynamic bundle adjustment, and dynamic content caption generation.
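The five steps named above can be pictured as a sequential chain with an early exit when a clip is filtered out. The following is only a minimal sketch of that control flow: every class, function, and field name is hypothetical, the steps are placeholders that record that they ran, and none of it reflects the actual DynamicGen implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Sample:
    """Hypothetical container for one video clip moving through the pipeline."""
    video_id: str
    annotations: Dict[str, object] = field(default_factory=dict)
    keep: bool = True


Step = Callable[[Sample], Sample]


def make_step(name: str) -> Step:
    """Build a placeholder step that only records that it ran."""
    def step(sample: Sample) -> Sample:
        sample.annotations[name] = "placeholder"
        return sample
    return step


def data_filtering(sample: Sample) -> Sample:
    """Placeholder filter: rejects clips with an empty id (illustrative rule only)."""
    sample.keep = bool(sample.video_id)
    sample.annotations["data_filtering"] = sample.keep
    return sample


# Step order follows the five steps listed in the text; names are assumptions.
STEPS: List[Step] = [
    make_step("scene_curation"),
    data_filtering,
    make_step("moving_object_recovery"),
    make_step("dynamic_bundle_adjustment"),
    make_step("caption_generation"),
]


def run_pipeline(sample: Sample, steps: List[Step] = STEPS) -> Sample:
    """Apply each step in order and stop early once a sample is filtered out."""
    for step in steps:
        sample = step(sample)
        if not sample.keep:
            break
    return sample
```

Calling `run_pipeline(Sample("clip_0001"))` runs all five placeholder steps, while a sample rejected by the filter never reaches the recovery, bundle adjustment, or captioning steps.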

Stage 1: Moving Object and Metric-scale Geometry Recovery

We provide visual comparisons of the moving object segmentation and metric-scale geometry recovery results, along with dynamic point cloud reconstruction results on additional in-the-wild data.
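To make the link between metric-scale depth, camera parameters, and a point map concrete, here is a minimal sketch of standard pinhole back-projection. It assumes an ideal pinhole camera without lens distortion and produces points in the camera frame; the source does not state which camera model or coordinate frame DynamicGen uses, and the function names are hypothetical.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth_m: float,
                      fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Lift pixel (u, v) with metric depth (meters) to a camera-frame 3D point.

    Assumes a pinhole model with focal lengths (fx, fy) and principal
    point (cx, cy), all in pixels.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def depth_to_point_map(depth: List[List[float]],
                       fx: float, fy: float, cx: float, cy: float
                       ) -> List[List[Point3D]]:
    """Convert an H x W metric depth map into an H x W map of 3D points."""
    return [
        [backproject_pixel(u, v, d, fx, fy, cx, cy) for u, d in enumerate(row)]
        for v, row in enumerate(depth)
    ]
```

Because the depth is expressed in meters, the resulting points carry physical scale; a depth map known only up to an unknown scale factor would yield geometry with the same shape but arbitrary size.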

Stage 2: Dynamic Content Captioning

We provide comprehensive captions at three levels: moving object, dynamic scene, and camera motion.
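One way to picture the three caption levels is as a single record per clip. The sketch below is an assumed layout for illustration only: the class name, the field names, and the idea of keying object captions by an integer masklet id are all hypothetical and do not describe the released annotation format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class HierarchicalCaption:
    """Hypothetical per-clip caption record with the three levels from the text."""
    # One caption per moving object, keyed by an assumed integer masklet id.
    object_captions: Dict[int, str] = field(default_factory=dict)
    # One caption describing the dynamic scene as a whole.
    scene_caption: str = ""
    # One caption describing the camera motion.
    camera_caption: str = ""

    def to_text(self) -> str:
        """Flatten the record into tagged lines, objects first in id order."""
        lines = [
            f"[object {masklet_id}] {text}"
            for masklet_id, text in sorted(self.object_captions.items())
        ]
        lines.append(f"[scene] {self.scene_caption}")
        lines.append(f"[camera] {self.camera_caption}")
        return "\n".join(lines)
```

Keeping the levels in separate fields lets a consumer select only the granularity it needs, for example camera-motion captions alone.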

DynamicVerse Dataset

We provide the statistics and data sources of DynamicVerse. We also compare DynamicVerse with large-scale 2D video datasets and existing 4D scene datasets. Compared to prior work, DynamicVerse expands both data scale and annotation richness.

Conclusion

In this work, we addressed the critical limitations of traditional 4D data curation concerning scalability, physical reality, and modality diversity. We introduced DynamicGen, a novel automated pipeline that leverages foundation models for video filtering, metric-scale geometry and moving object recovery, and hierarchical detailed semantic captioning from raw videos. We rigorously validated the capabilities of DynamicGen through standard benchmarks for video depth and camera pose/intrinsics estimation, qualitative generalization analysis on diverse web videos, and human/LLM-assisted evaluations confirming the high quality of the generated captions. Using DynamicGen, we constructed DynamicVerse, a large-scale 4D dataset offering over 100K dynamic scenes with rich physically-aware multi-modal annotations. Collectively, this work provides both a robust and scalable methodology for 4D data generation and a comprehensive new resource, DynamicVerse, to drive future research in dynamic 4D scene understanding.

BibTeX

        @misc{wen2025dynamicverse,
            title={DynamicVerse: A Physically-Aware Multimodal Framework for 4D World Modeling}, 
            author={Kairun Wen and Yuzhi Huang and Runyu Chen and Hui Zheng and Yunlong Lin and Panwang Pan and Chenxin Li and Wenyan Cong and Jian Zhang and Junbin Lu and Chenguo Lin and Dilin Wang and Zhicheng Yan and Hongyu Xu and Justin Theiss and Yue Huang and Xinghao Ding and Rakesh Ranjan and Zhiwen Fan},
            year={2025},
            eprint={2512.03000},
            archivePrefix={arXiv},
            primaryClass={cs.CV},
            url={https://arxiv.org/abs/2512.03000}, 
        }
