Abstract

Transformer-based 3D reconstruction has emerged as a powerful paradigm for recovering geometry and appearance from multi-view observations, offering strong performance across challenging visual conditions. As these models scale to larger backbones and higher-resolution inputs, improving their efficiency becomes increasingly important for practical deployment. However, modern 3D transformer pipelines face two coupled challenges: dense multi-view attention creates substantial token-mixing overhead, and low-precision execution can destabilize geometry-sensitive representations and degrade depth, pose, and 3D consistency. To address the first challenge, we propose Lite3R, a model-agnostic teacher-student framework that replaces dense attention with Sparse Linear Attention, preserving important geometric interactions while reducing attention cost. To address the second, we introduce a parameter-efficient FP8-aware quantization-aware training (FP8-aware QAT) strategy with partial attention distillation, which freezes the vast majority of pretrained backbone parameters and trains only lightweight linear-branch projection layers, enabling stable low-precision deployment while retaining pretrained geometric priors. We evaluate Lite3R on two representative backbones, VGGT and DA3-Large, across BlendedMVS and DTU64, showing that it substantially reduces latency (1.7-2.0×) and memory usage (1.9-2.4×) while preserving competitive reconstruction quality. These results demonstrate that Lite3R offers an effective algorithm-system co-design approach to practical transformer-based 3D reconstruction.

Teaser figure: a reconstructed scene from multiple views, with efficiency comparisons on BlendedMVS (Original → Lite3R). Inference latency (ms): VGGT 483 → 274; DA3-Large 187 → 95 (1.7~2.0× faster). Memory footprint (MB): VGGT 5706 → 2455; DA3-Large 2713 → 1368 (1.9~2.4× memory saved).

Visualization

Qualitative comparison on the Ornate Facade scene (12 input views): GT views alongside reconstructions from Lite3R (VGGT), Lite3R (DA3-L), QuantVGGT, VGGT, and DA3-L.

Method: Lite3R


Overall framework of Lite3R. Starting from a dense pretrained 3D reconstruction teacher, Lite3R constructs a lite student by replacing dense attention with Sparse Linear Attention, freezing the inherited backbone projections, and training only lightweight linear-branch projection layers under FP8-aware quantization-aware training. Partial attention distillation preserves intermediate geometric priors, and the resulting student is converted to an efficient FP8-compatible deployment model.
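
To make the sparse-plus-linear design concrete, the following is a minimal PyTorch sketch of a student attention block in the spirit of Lite3R. It is an illustration under stated assumptions, not the released implementation: the top-k sparsification rule, the elu-kernel linear branch, and the names SparseLinearAttention, lin_q, lin_k, and topk are our own choices. As in the framework above, the inherited qkv/proj projections are frozen, and only the linear-branch projections remain trainable.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseLinearAttention(nn.Module):
    """Hypothetical student block: a frozen top-k sparse attention path
    plus a trainable O(N) linear-attention branch."""
    def __init__(self, dim, num_heads=8, topk=64):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.topk = topk
        # Projections inherited from the dense teacher: frozen during QAT.
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        for p in list(self.qkv.parameters()) + list(self.proj.parameters()):
            p.requires_grad = False
        # Lightweight linear-branch projections: the only trainable weights.
        self.lin_q = nn.Linear(dim, dim)
        self.lin_k = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                # each (B, H, N, d)

        # Sparse path: keep only the top-k attention logits per query.
        # (Materialized densely here for clarity; an efficient kernel
        # would never form the full N x N matrix.)
        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        kth = logits.topk(min(self.topk, N), dim=-1).values[..., -1:]
        sparse_out = F.softmax(
            logits.masked_fill(logits < kth, float('-inf')), dim=-1) @ v

        # Linear branch: elu feature map, cost linear in sequence length.
        ql = F.elu(self.lin_q(x).reshape(B, N, self.num_heads, self.head_dim)
                   .transpose(1, 2)) + 1                    # (B, H, N, d)
        kl = F.elu(self.lin_k(x).reshape(B, N, self.num_heads, self.head_dim)
                   .transpose(1, 2)) + 1
        kv = kl.transpose(-2, -1) @ v                       # (B, H, d, d)
        denom = ql @ kl.sum(dim=2, keepdim=True).transpose(-2, -1) + 1e-6
        linear_out = (ql @ kv) / denom                      # (B, H, N, d)

        out = (sparse_out + linear_out).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

Partial attention distillation can likewise be sketched as a loss that matches the student's attention maps to the dense teacher's on a subset of layers; the KL form and the layer indices below are assumptions for illustration, not the paper's choices.

def partial_attn_distill_loss(teacher_attn, student_attn, layers=(4, 11, 17, 23)):
    """teacher_attn / student_attn: dicts mapping layer index -> (B, H, N, N)
    attention maps. Only the selected layers contribute (hence "partial")."""
    loss = 0.0
    for l in layers:
        t = teacher_attn[l].detach()                # teacher provides the target
        s = student_attn[l].clamp_min(1e-8)
        loss = loss + F.kl_div(s.log(), t, reduction='batchmean')
    return loss / len(layers)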

Results

BlendedMVS, latency (ms, Original → Lite3R): VGGT 483.3 → 274.4 (1.76× faster); DA3-Large 188.0 → 95.3 (1.97× faster).
DTU64, latency (ms, Original → Lite3R): VGGT 482.4 → 276.0 (1.75× faster); DA3-Large 186.4 → 99.5 (1.87× faster).
BlendedMVS, memory (MB, Original → Lite3R): VGGT 5706 → 2455 (2.32× less); DA3-Large 2713 → 1368 (1.98× less).
DTU64, memory (MB, Original → Lite3R): VGGT 5701 → 2452 (2.33× less); DA3-Large 2709 → 1364 (1.99× less).
Depth quality, AbsRel (↓, Original → Lite3R): VGGT 0.0184 → 0.0271; DA3-Large 0.0862 → 0.0889.
Pose quality, rotation error (↓, Original → Lite3R): VGGT 1.93 → 2.23; DA3-Large 9.48 → 10.74.
Geometry quality, F-score @ 5cm (↑, Original → Lite3R): VGGT 0.2005 → 0.2029; DA3-Large 0.1149 → 0.1210.

Summary of Lite3R's main results across nine experimental settings, covering the key quality and efficiency metrics reported in the main paper and comparing Lite3R with the corresponding higher-precision baselines across backbones, datasets, and evaluation dimensions.

Component Ablation

(a) Quality (lower is better), AbsRel / rotation error: Original 0.0184 / 1.93; QAT only (no SLA) 0.0238 / 2.14; SLA only (no QAT) 0.0243 / 2.19; Full (SLA+QAT) 0.0271 / 2.23.
(b) Latency (lower is better): Original 483 ms (1.00×); QAT only (no SLA) 400 ms (1.21×); SLA only (no QAT) 377 ms (1.28×); Full (SLA+QAT) 274 ms (1.76×).
(c) Memory (lower is better): Original 5706 MB (1.00×); QAT only (no SLA) 3068 MB (1.86×); SLA only (no QAT) 4196 MB (1.36×); Full (SLA+QAT) 2455 MB (2.32×).

Component-ablation results for VGGT on BlendedMVS. Either component in isolation recovers only part of the efficiency gains; the full quality-efficiency tradeoff requires combining SLA with FP8-aware QAT.
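
For readers wondering what the FP8-aware QAT component amounts to operationally, below is a hedged PyTorch sketch of fake quantization to FP8 E4M3 with a straight-through estimator. The per-tensor scale and the name fake_quant_fp8_e4m3 are assumptions; the granularity Lite3R actually uses (per-tensor vs. per-channel) is not specified on this page.

import torch

def fake_quant_fp8_e4m3(x: torch.Tensor) -> torch.Tensor:
    """Round x to the nearest representable FP8 E4M3 value (max normal 448)
    in the forward pass while keeping full-precision gradients (STE).
    Requires PyTorch >= 2.1 for torch.float8_e4m3fn."""
    # Per-tensor scale maps the dynamic range onto [-448, 448],
    # so the cast below never overflows.
    scale = x.detach().abs().amax().clamp(min=1e-12) / 448.0
    x_q = (x.detach() / scale).to(torch.float8_e4m3fn).to(x.dtype) * scale
    # Straight-through estimator: quantized forward, identity backward.
    return x + (x_q - x.detach())

During QAT such a wrapper would be applied to the trainable linear-branch weights (and optionally activations); at deployment the simulated rounding is replaced by native FP8 kernels, which is where the measured latency and memory savings come from.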

Adaptation Sensitivity


Analysis of Lite3R adaptation sensitivity. Left: layer-wise quantization sensitivity of VGGT, showing that different backbone stages respond unevenly to low-precision perturbations. Right: the change pattern of the linear-branch projection layers during training, illustrating how prolonged FP8-aware QAT increases drift in the small trainable subspace, which helps explain why longer schedules are less stable.
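
The layer-wise sensitivity curve on the left can be reproduced in spirit with a simple probe: fake-quantize one block at a time and measure how far the output drifts from the full-precision reference. The sketch below reuses fake_quant_fp8_e4m3 from the previous section and assumes the model maps one tensor to one tensor; real backbones like VGGT return richer outputs, so the error metric would need adapting.

import copy
import torch

@torch.no_grad()
def layer_sensitivity(model, blocks, x):
    """Hypothetical probe: relative output error when only one block's
    weights are quantized; larger values mean higher FP8 sensitivity."""
    ref = model(x)
    errors = []
    for i, block in enumerate(blocks):
        saved = copy.deepcopy(block.state_dict())     # full-precision backup
        for p in block.parameters():
            p.copy_(fake_quant_fp8_e4m3(p))
        errors.append((i, ((model(x) - ref).norm() / ref.norm()).item()))
        block.load_state_dict(saved)                  # restore before next probe
    return errors

For a ViT-style backbone, blocks would typically be something like model.blocks; the returned list can be plotted directly as a sensitivity profile over depth.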

Citation


@misc{lite3r,
  title={Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction},
  author={Haoyu Zhang and Zeyu Zhang and Zedong Zhou and Yang Zhao and Hao Tang},
  note={Tech Report},
  year={2026}
}