Abstract

Transformer-based 3D reconstruction has emerged as a powerful paradigm for recovering geometry and appearance from multi-view observations, offering strong performance across challenging visual conditions. As these models scale to larger backbones and higher-resolution inputs, improving their efficiency becomes increasingly important for practical deployment. However, modern 3D transformer pipelines face two coupled challenges: dense multi-view attention creates substantial token-mixing overhead, and low-precision execution can destabilize geometry-sensitive representations and degrade depth, pose, and 3D consistency. To address the first challenge, we propose Lite3R, a model-agnostic teacher-student framework that replaces dense attention with Sparse Linear Attention, preserving important geometric interactions while reducing attention cost. To address the second, we introduce a parameter-efficient FP8-aware quantization-aware training (FP8-aware QAT) strategy with partial attention distillation, which freezes the vast majority of pretrained backbone parameters and trains only lightweight linear-branch projection layers, enabling stable low-precision deployment while retaining pretrained geometric priors. We evaluate Lite3R on two representative backbones, VGGT and DA3-Large, across BlendedMVS and DTU64, showing that it substantially reduces latency (1.7-2.0×) and memory usage (1.9-2.4×) while preserving competitive reconstruction quality. These results demonstrate that Lite3R offers an effective algorithm-system co-design approach to practical transformer-based 3D reconstruction.

Teaser figure: a reconstructed scene from multiple views, with efficiency comparisons on BlendedMVS (Original → Lite3R). Inference latency (ms): VGGT 483 → 274; DA3-Large 187 → 95 (1.7~2.0× faster). Memory footprint (MB): VGGT 5706 → 2455; DA3-Large 2713 → 1368 (1.9~2.4× memory saved).

Visualization

Qualitative comparison on the Ornate Facade scene (12 input views): GT views alongside reconstructions from Lite3R (VGGT), Lite3R (DA3-L), QuantVGGT, VGGT, and DA3-L.

Method: Lite3R


Overall framework of Lite3R. Starting from a dense pretrained 3D reconstruction teacher, Lite3R constructs a lite student by replacing dense attention with Sparse Linear Attention, freezing the inherited backbone projections, and training only lightweight linear-branch projection layers under FP8-aware quantization-aware training. Partial attention distillation preserves intermediate geometric priors, and the resulting student is converted to an efficient FP8-compatible deployment model.
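
To make the sparse-plus-linear design concrete, the following is a minimal PyTorch sketch of a student attention block in the spirit of Lite3R. It is an illustration under stated assumptions, not the released implementation: the top-k sparsification rule, the elu-kernel linear branch, and the names SparseLinearAttention, lin_q, lin_k, and topk are our own choices. As in the framework above, the inherited qkv/proj projections are frozen, and only the linear-branch projections remain trainable.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseLinearAttention(nn.Module):
    """Hypothetical student block: a frozen top-k sparse attention path
    plus a trainable O(N) linear-attention branch."""
    def __init__(self, dim, num_heads=8, topk=64):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.topk = topk
        # Projections inherited from the dense teacher: frozen during QAT.
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        for p in list(self.qkv.parameters()) + list(self.proj.parameters()):
            p.requires_grad = False
        # Lightweight linear-branch projections: the only trainable weights.
        self.lin_q = nn.Linear(dim, dim)
        self.lin_k = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                # each (B, H, N, d)

        # Sparse path: keep only the top-k attention logits per query.
        # (Materialized densely here for clarity; an efficient kernel
        # would never form the full N x N matrix.)
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        kth = logits.topk(min(self.topk, N), dim=-1).values[..., -1:]
        sparse_out = F.softmax(
            logits.masked_fill(logits < kth, float('-inf')), dim=-1) @ v

        # Linear branch: elu feature map, cost linear in sequence length.
        ql = F.elu(self.lin_q(x).reshape(B, N, self.num_heads, self.head_dim)
                   .transpose(1, 2)) + 1                    # (B, H, N, d)
        kl = F.elu(self.lin_k(x).reshape(B, N, self.num_heads, self.head_dim)
                   .transpose(1, 2)) + 1
        kv = kl.transpose(-2, -1) @ v                       # (B, H, d, d)
        denom = ql @ kl.sum(dim=2, keepdim=True).transpose(-2, -1) + 1e-6
        linear_out = (ql @ kv) / denom                      # (B, H, N, d)

        out = (sparse_out + linear_out).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

Partial attention distillation can likewise be sketched as a loss that matches the student's attention maps to the dense teacher's on a subset of layers; the KL form and the layer indices below are assumptions for illustration, not the paper's choices.

def partial_attn_distill_loss(teacher_attn, student_attn, layers=(4, 11, 17, 23)):
    """teacher_attn / student_attn: dicts mapping layer index -> (B, H, N, N)
    attention maps. Only the selected layers contribute (hence "partial")."""
    loss = 0.0
    for l in layers:
        t = teacher_attn[l].detach()                # teacher provides the target
        s = student_attn[l].clamp_min(1e-8)
        loss = loss + F.kl_div(s.log(), t, reduction='batchmean')
    return loss / len(layers)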

Results

BlendedMVS, latency (ms, Original → Lite3R): VGGT 483.3 → 274.4 (1.76× faster); DA3-Large 188.0 → 95.3 (1.97× faster).
DTU64, latency (ms, Original → Lite3R): VGGT 482.4 → 276.0 (1.75× faster); DA3-Large 186.4 → 99.5 (1.87× faster).
BlendedMVS, memory (MB, Original → Lite3R): VGGT 5706 → 2455 (2.32× less); DA3-Large 2713 → 1368 (1.98× less).
DTU64, memory (MB, Original → Lite3R): VGGT 5701 → 2452 (2.33× less); DA3-Large 2709 → 1364 (1.99× less).
Depth quality, AbsRel (↓, Original → Lite3R): VGGT 0.0184 → 0.0271; DA3-Large 0.0862 → 0.0889.
Pose quality, rotation error (↓, Original → Lite3R): VGGT 1.93 → 2.23; DA3-Large 9.48 → 10.74.
Geometry quality, F-score @ 5cm (↑, Original → Lite3R): VGGT 0.2005 → 0.2029; DA3-Large 0.1149 → 0.1210.

Summary of Lite3R's main results across nine experimental settings, covering the key quality and efficiency metrics reported in the main paper and comparing Lite3R with the corresponding higher-precision baselines across backbones, datasets, and evaluation dimensions.

Component Ablation

(a) Quality (lower is better), AbsRel / rotation error: Original 0.0184 / 1.93; QAT only (no SLA) 0.0238 / 2.14; SLA only (no QAT) 0.0243 / 2.19; Full (SLA+QAT) 0.0271 / 2.23.
(b) Latency (lower is better): Original 483 ms (1.00×); QAT only (no SLA) 400 ms (1.21×); SLA only (no QAT) 377 ms (1.28×); Full (SLA+QAT) 274 ms (1.76×).
(c) Memory (lower is better): Original 5706 MB (1.00×); QAT only (no SLA) 3068 MB (1.86×); SLA only (no QAT) 4196 MB (1.36×); Full (SLA+QAT) 2455 MB (2.32×).

Component-ablation results for VGGT on BlendedMVS. Either component in isolation recovers only part of the efficiency gains; the full quality-efficiency tradeoff requires combining SLA with FP8-aware QAT.
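
For readers wondering what the FP8-aware QAT component amounts to operationally, below is a hedged PyTorch sketch of fake quantization to FP8 E4M3 with a straight-through estimator. The per-tensor scale and the name fake_quant_fp8_e4m3 are assumptions; the granularity Lite3R actually uses (per-tensor vs. per-channel) is not specified on this page.

import torch

def fake_quant_fp8_e4m3(x: torch.Tensor) -> torch.Tensor:
    """Round x to the nearest representable FP8 E4M3 value (max normal 448)
    in the forward pass while keeping full-precision gradients (STE).
    Requires PyTorch >= 2.1 for torch.float8_e4m3fn."""
    # Per-tensor scale maps the dynamic range onto [-448, 448],
    # so the cast below never overflows.
    scale = x.detach().abs().amax().clamp(min=1e-12) / 448.0
    x_q = (x.detach() / scale).to(torch.float8_e4m3fn).to(x.dtype) * scale
    # Straight-through estimator: quantized forward, identity backward.
    return x + (x_q - x.detach())

During QAT such a wrapper would be applied to the trainable linear-branch weights (and optionally activations); at deployment the simulated rounding is replaced by native FP8 kernels, which is where the measured latency and memory savings come from.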

Adaptation Sensitivity


Analysis of Lite3R adaptation sensitivity. Left: layer-wise quantization sensitivity of VGGT, showing that different backbone stages respond unevenly to low-precision perturbations. Right: the change pattern of the linear-branch projection layers during training, illustrating how prolonged FP8-aware QAT increases drift in the small trainable subspace, which helps explain why longer schedules are less stable.
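
The layer-wise sensitivity curve on the left can be reproduced in spirit with a simple probe: fake-quantize one block at a time and measure how far the output drifts from the full-precision reference. The sketch below reuses fake_quant_fp8_e4m3 from the previous section and assumes the model maps one tensor to one tensor; real backbones like VGGT return richer outputs, so the error metric would need adapting.

import copy
import torch

@torch.no_grad()
def layer_sensitivity(model, blocks, x):
    """Hypothetical probe: relative output error when only one block's
    weights are quantized; larger values mean higher FP8 sensitivity."""
    ref = model(x)
    errors = []
    for i, block in enumerate(blocks):
        saved = copy.deepcopy(block.state_dict())     # full-precision backup
        for p in block.parameters():
            p.copy_(fake_quant_fp8_e4m3(p))
        errors.append((i, ((model(x) - ref).norm() / ref.norm()).item()))
        block.load_state_dict(saved)                  # restore before next probe
    return errors

For a ViT-style backbone, blocks would typically be something like model.blocks; the returned list can be plotted directly as a sensitivity profile over depth.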

Citation


@misc{lite3r,
  title={Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction},
  author={Haoyu Zhang and Zeyu Zhang and Zedong Zhou and Yang Zhao and Hao Tang},
  note={Tech Report},
  year={2026}
}