DynamicVerse

Physically-Aware Multimodal Modeling for Dynamic 4D Worlds



Abstract

Understanding the dynamic physical world, characterized by its evolving 3D structure, real-world motion, and semantic content with textual descriptions, is crucial for human-agent interaction and enables embodied agents to perceive and act within real environments with human‑like capabilities. However, existing datasets are often derived from limited simulators or utilize traditional Structure-from-Motion for up-to-scale annotation and offer limited descriptive captioning, which restricts the capacity of foundation models to accurately interpret real-world dynamics from monocular videos, commonly sourced from the internet.
To bridge these gaps, we introduce DynamicVerse, a physical‑scale, multimodal 4D modeling framework for real-world video. We employ large vision, geometric, and multimodal models to interpret metric-scale static geometry, real-world dynamic motion, instance-level masks, and holistic descriptive captions. By integrating window-based Bundle Adjustment with global optimization, our method converts long real-world video sequences into a comprehensive 4D multimodal format. DynamicVerse delivers a large-scale dataset consisting of 100K+ videos with 800K+ annotated masks and 10M+ frames from internet videos. Experimental evaluations on three benchmark tasks (video depth estimation, camera pose estimation, and camera intrinsics estimation) validate that our 4D modeling achieves superior performance in capturing physical-scale measurements with greater global accuracy than existing methods.

DynamicVerse: Dataset Overview

Overview of the DynamicVerse dataset.

DynamicGen: Pipeline Overview

DynamicGen, the physically-aware multimodal 4D data generation pipeline.

DynamicVerse: Data Statistics

Statistics and data sources of the DynamicVerse dataset.

Key Capabilities

Moving Object Recovery

Isolates moving objects from the static background in each video frame at the pixel level.

Metric-Scale 4D Reconstruction

Reconstructs dynamic point clouds from videos by estimating per-frame depth and camera poses.

Dynamic Content Captioning

Generates textual descriptions that explain the actions, events, or changes occurring within dynamic visual content such as videos.
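To make the first two capabilities concrete, the sketch below lifts one frame to world-space 3D points from a metric depth map, pinhole intrinsics, and a camera pose, then splits the points into static and moving sets using a per-pixel motion mask. It is an illustration only, not the DynamicGen implementation: the function name `backproject_frame`, its parameters, the pinhole camera model, and the camera-to-world pose convention are all assumptions made for this example. It is written in plain Python for readability; a real pipeline would vectorize it.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def backproject_frame(
    depth: Sequence[Sequence[float]],        # metric depth per pixel (H x W), e.g. meters
    moving_mask: Sequence[Sequence[bool]],   # True where the pixel lies on a moving object
    fx: float, fy: float, cx: float, cy: float,  # assumed pinhole intrinsics, in pixels
    rotation: Sequence[Sequence[float]],     # assumed 3x3 camera-to-world rotation
    translation: Sequence[float],            # assumed camera position in the world frame
) -> Tuple[List[Point], List[Point]]:
    """Lift one frame to world-space points, returned as (static, moving)."""
    static_pts: List[Point] = []
    moving_pts: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:  # skip pixels without a valid depth
                continue
            # Pinhole back-projection of pixel (u, v) into the camera frame.
            cam = ((u - cx) * z / fx, (v - cy) * z / fy, z)
            # Rigid transform into the world frame: X_w = R * X_c + t.
            world = tuple(
                sum(rotation[i][k] * cam[k] for k in range(3)) + translation[i]
                for i in range(3)
            )
            (moving_pts if moving_mask[v][u] else static_pts).append(world)
    return static_pts, moving_pts


# Toy 2x2 frame: one invalid pixel, one pixel flagged as moving.
depth = [[2.0, 2.0], [2.0, 0.0]]
mask = [[False, True], [False, False]]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
static_pts, moving_pts = backproject_frame(
    depth, mask, 1.0, 1.0, 0.0, 0.0, identity, [0.0, 0.0, 1.0]
)
```

Repeating this for every frame and concatenating the results gives a static background cloud plus a time-indexed set of moving points. Because the depth is metric, the resulting coordinates are in physical units rather than up to an unknown scale.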

Moving Object Recovery


Metric-Scale 4D Reconstruction

For more visualizations, please refer to the anonymous link: More Amazing Results.

Dynamic Content Captioning
