3D-R1: Enhancing Reasoning in 3D VLMs for Unified Scene Understanding

¹Shanghai University of Engineering Science, ²Peking University
*Equal contribution. Project lead. Corresponding author.

TLDR: 3D-R1 is an open-source generalist model that enhances the reasoning of 3D VLMs for unified scene understanding.

This and the following visualizations show zero-shot results of 3D-R1 in a variety of complex scenes, demonstrating its strong generalizability and state-of-the-art performance.

🚀 Key Contributions

Foundation Model

A pioneering 3D-VLM that leverages reinforcement learning and dynamic view selection to enhance reasoning capabilities in 3D scene understanding.

Scene-30K Data Engine

A high-quality 30K scene CoT dataset is constructed with a data engine based on Gemini-Pro and existing 3D-VL datasets.

SOTA Performance

Extensive experiments demonstrate that 3D-R1 achieves an average improvement of 10% across 7 downstream tasks and various 3D scene benchmarks.

Applications

3D Scene Dense Captioning (3D-DC)

3D Object Captioning

3D Visual Grounding (3D-VG)

3D Question Answering (3D-QA)

3D Dialogue

3D Reasoning

3D Planning

Method

Features

3D-R1 consistently outperforms traditional 3D-VLMs on scene understanding by combining high-quality training data with reinforcement-learning-based policy optimization.

1. Diversity: Scene-30K contains diverse scene categories and question types.

2. Multi-task performance: 3D-R1 demonstrates strong performance across various tasks.

3. Generalizability: 3D-R1 exhibits remarkable generalizability with enhanced reasoning capabilities.

Scene-30K

Distribution of Question Types

Multi-task Performance

Generalizability

Architecture

3D-R1 is built on Qwen2.5-VL-7B-Instruct and trained on the high-quality synthetic Scene-30K dataset. It takes text, multi-view images, 3D point clouds, and depth maps as input and formulates a comprehensive set of 3D tasks as autoregressive sequence prediction.


3D-R1's Architecture
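
As a rough illustration of what "autoregressive sequence prediction" means for the architecture described above, the standard factorization can be written as below. The symbols are assumptions chosen for illustration and are not notation from this page: x stands for the combined inputs (text, multi-view images, point cloud, and depth maps), y_{1:T} for the output token sequence, and θ for the model parameters.

```latex
p_\theta(y_{1:T} \mid x) \;=\; \prod_{t=1}^{T} p_\theta\left(y_t \mid y_{<t},\, x\right)
```

Under this view, tasks such as captioning, question answering, and planning differ only in the text prompt contained in x and in how the generated tokens y are interpreted.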

CoT Data Engine

The point cloud of a scene is first sent to a scene description generator to obtain a description of the scene. Based on this description, we then use Gemini-Pro to synthesize CoT data.


Scene-30K Data Engine
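
The two-stage data engine described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: `describe_scene` and `llm` are hypothetical callables standing in for the scene description generator and for a Gemini-Pro call, and the prompt wording and the `<think>`/`<answer>` tag format are assumptions made for the example.

```python
import re
from typing import Any, Callable, Optional

# Assumed prompt template; the real prompt used for Scene-30K is not given here.
PROMPT_TEMPLATE = (
    "You are given a textual description of a 3D scene.\n"
    "Scene description:\n{description}\n\n"
    "Question: {question}\n"
    "Reason step by step inside <think>...</think>, "
    "then give the final answer inside <answer>...</answer>."
)

# Assumed output format: one reasoning block followed by one answer block.
_COT_PATTERN = re.compile(
    r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL
)


def build_cot_prompt(description: str, question: str) -> str:
    """Fill the prompt template with a scene description and a question."""
    return PROMPT_TEMPLATE.format(
        description=description.strip(), question=question.strip()
    )


def parse_cot(response: str) -> Optional[dict]:
    """Extract reasoning and answer; return None if the format is violated."""
    match = _COT_PATTERN.search(response)
    if match is None:
        return None
    reasoning, answer = match.group(1).strip(), match.group(2).strip()
    if not reasoning or not answer:
        return None
    return {"reasoning": reasoning, "answer": answer}


def synthesize_cot(
    point_cloud: Any,
    question: str,
    describe_scene: Callable[[Any], str],  # hypothetical description generator
    llm: Callable[[str], str],             # hypothetical wrapper around an LLM API
) -> Optional[dict]:
    """Point cloud -> scene description -> LLM prompt -> parsed CoT sample."""
    description = describe_scene(point_cloud)
    response = llm(build_cot_prompt(description, question))
    sample = parse_cot(response)
    if sample is None:
        return None  # drop malformed generations instead of repairing them
    return {"question": question, "description": description, **sample}
```

Passing the two stages in as callables keeps the sketch independent of any particular captioning model or API client; malformed generations are simply discarded, which is one plausible way to keep a synthetic dataset clean.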

Reinforcement Learning

The policy model generates N outputs from a point cloud and a question. Perception (IoU), semantic (CLIP-similarity), and format-adherence rewards are then computed for each output, grouped, and combined with a KL term against a frozen reference model to update the policy.


RL Rewards
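
To make the three reward signals and the grouping step above more concrete, here is a minimal sketch under several stated assumptions: boxes are axis-aligned and given as (xmin, ymin, zmin, xmax, ymax, zmax); the semantic reward is the cosine similarity between two embedding vectors that are assumed to come from a CLIP text encoder (not computed here); the format check expects `<think>`/`<answer>` tags; the rewards are combined by a weighted sum; and "grouped" is read as normalizing rewards within the N outputs sampled for one question, in the style of group-relative policy optimization. The KL term against the frozen reference model belongs in the policy loss and is not shown.

```python
import math
import re
from typing import List, Sequence

Box = Sequence[float]  # assumed layout: (xmin, ymin, zmin, xmax, ymax, zmax)


def box_iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for i in range(3):
        overlap = min(a[i + 3], b[i + 3]) - max(a[i], b[i])
        if overlap <= 0:
            return 0.0  # no overlap along this axis
        inter *= overlap
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


# Assumed output format for the format-adherence reward.
_FORMAT = re.compile(
    r"^\s*<think>.+?</think>\s*<answer>.+?</answer>\s*$", re.DOTALL
)


def format_reward(text: str) -> float:
    """1.0 if the output follows the assumed tag format, else 0.0."""
    return 1.0 if _FORMAT.match(text) else 0.0


def total_reward(iou: float, sim: float, fmt: float,
                 weights: Sequence[float] = (1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three rewards; equal weights are an assumption."""
    return weights[0] * iou + weights[1] * sim + weights[2] * fmt


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of N sampled outputs."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]
```

In this reading, each of the N outputs for a question gets a scalar from `total_reward`, and `group_advantages` turns those scalars into signed advantages so that outputs better than the group average are reinforced and worse ones are discouraged.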

Multi-Task Generalist

3D-R1 is a generalist model that handles a range of downstream tasks and applications in a zero-shot manner with strong generalizability, significantly reducing the need for expensive task-specific adaptation.


Various Applications