TLDR: 3D-R1 is an open-source generalist model that enhances the reasoning of 3D VLMs for unified scene understanding.
This and the following visualizations show the zero-shot results of 3D-R1 in various complex scenes, demonstrating its incredible generalizability and state-of-the-art performance.
3D-R1 is a pioneering 3D-VLM that leverages reinforcement learning and dynamic view selection to enhance reasoning in 3D scene understanding.
A high-quality 30K scene CoT dataset is constructed with a data engine based on Gemini-Pro and existing 3D-VL datasets.
Extensive experiments demonstrate that 3D-R1 achieves an average improvement of 10% across 7 downstream tasks and various 3D scene benchmarks.
3D-R1 systematically outperforms traditional 3D-VLMs on scene understanding by combining high-quality training data with reinforcement-learning-based policy optimization.
Diversity: Scene-30K contains diverse scene categories and question types.
Multi-task performance: 3D-R1 demonstrates strong performance across various tasks.
Generalizability: 3D-R1 exhibits remarkable generalizability with enhanced reasoning capabilities.
Distribution of Question Types
Multi-task Performance
Generalizability
Our 3D-R1 model is built on Qwen2.5-VL-7B-Instruct and trained on the high-quality synthetic Scene-30K dataset. It takes text, multi-view images, 3D point clouds, and depth maps as input and formulates comprehensive 3D tasks as autoregressive sequence prediction.
3D-R1's Architecture
The point cloud of a scene is first sent to a scene description generator to obtain a description of the scene. Based on this description, we then use Gemini-Pro to synthesize CoT data.
Scene-30K Data Engine
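The two-stage flow above (point cloud to scene description, then description to CoT data) can be sketched roughly as follows. This is a minimal illustration, not the actual data engine: `CoTSample`, `build_cot_samples`, the prompt wording, and the two injected callables `describe_scene` and `generate_cot` are all hypothetical names. In particular, `generate_cot` stands in for whatever call is made to Gemini-Pro, and no real API is invoked here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class CoTSample:
    """One synthesized chain-of-thought record (hypothetical schema)."""
    scene_id: str
    question: str
    description: str
    reasoning: str


def build_cot_samples(
    scenes: Dict[str, Sequence[Point]],
    questions: Dict[str, List[str]],
    describe_scene: Callable[[Sequence[Point]], str],
    generate_cot: Callable[[str], str],
) -> List[CoTSample]:
    """Stage 1: describe each scene. Stage 2: ask an LLM for CoT per question."""
    samples: List[CoTSample] = []
    for scene_id, points in scenes.items():
        # Stage 1: point cloud -> textual scene description.
        description = describe_scene(points)
        for question in questions.get(scene_id, []):
            # Stage 2: description + question -> chain-of-thought text.
            # The prompt template is an assumption made for illustration.
            prompt = (
                f"Scene description: {description}\n"
                f"Question: {question}\n"
                "Reason step by step, then give the final answer."
            )
            samples.append(
                CoTSample(scene_id, question, description, generate_cot(prompt))
            )
    return samples
```

Passing the describer and the LLM call in as plain callables keeps the sketch testable with stubs and makes no claim about which captioning model or client library is used in practice.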
The policy model generates N outputs from a point cloud and a question. Perception (IoU), semantic (CLIP-similarity), and format-adherence rewards are then computed for each output, grouped, and combined with a KL term against a frozen reference model to update the policy.
RL Rewards
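The reward computation described above can be sketched in a few lines. This is a simplified illustration under several assumptions that the text does not confirm: boxes are axis-aligned and given as `(xmin, ymin, zmin, xmax, ymax, zmax)`, the semantic reward is a cosine similarity over precomputed embedding vectors (standing in for CLIP features), the expected output format is a `<think>...</think><answer>...</answer>` template, the three rewards are summed with equal weight, and "grouped" means normalizing rewards within the N samples for one prompt. All function names are hypothetical, and the KL term is omitted.

```python
import math
import re
from statistics import mean, pstdev


def box_iou_3d(a, b):
    """IoU of two axis-aligned 3D boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])
        hi = min(a[i + 3], b[i + 3])
        if hi <= lo:  # no overlap along this axis
            return 0.0
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


def cosine_similarity(u, v):
    """Cosine similarity of two embedding vectors (stand-in for CLIP similarity)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


# Assumed output template; the real format specification may differ.
FORMAT_RE = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)


def format_reward(text):
    """1.0 if the output follows the assumed template, else 0.0."""
    return 1.0 if FORMAT_RE.match(text.strip()) else 0.0


def total_reward(pred_box, gt_box, pred_emb, gt_emb, text):
    """Equal-weight sum of the three rewards (the weighting is an assumption)."""
    return (
        box_iou_3d(pred_box, gt_box)
        + cosine_similarity(pred_emb, gt_emb)
        + format_reward(text)
    )


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of N sampled outputs."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

In a full training loop, these group-normalized advantages would weight the policy-gradient term, with the KL term against the frozen reference model added to the loss separately.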
3D-R1 is a generalist model capable of handling various downstream tasks and applications in a zero-shot manner with incredible generalizability, significantly reducing the need for expensive adaptation.
Various Applications