UniMesh: Unifying 3D Mesh Understanding and Generation

Zeyu Zhang^2*†

¹Boston University

²Peking University

^*Equal contribution. ^†Project lead. ^‡Corresponding author.


Mesh Generation and Editing Results

Images	Meshes	Edited images	Edited meshes

Framework of UniMesh

Framework of UniMesh. Given a text prompt or modification instruction, BAGEL with Qwen generates an image latent, which is transformed by the Mesh Head into a conditioning latent for Hunyuan3D to produce a 3D mesh. The reference image latent of the generated mesh can be fed back into BAGEL for iterative refinement via Chain-of-Mesh, while self-reflection enables semantic feedback loops for understanding tasks.
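The data flow in this caption can be illustrated with a minimal Python sketch. It is not the UniMesh implementation: the `Mesh` type, `generate_mesh`, and the three `toy_*` functions are hypothetical stand-ins that only mirror the roles of BAGEL with Qwen, the Mesh Head, and Hunyuan3D, and latents are modeled as plain lists of floats rather than tensors.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Latent = List[float]  # assumption: a plain list stands in for a latent tensor


@dataclass
class Mesh:
    vertices: List[tuple]
    faces: List[tuple]
    reference_latent: Latent  # latent of the mesh's reference image


def generate_mesh(
    prompt: str,
    image_model: Callable[[str, Optional[Latent]], Latent],  # role of BAGEL with Qwen
    mesh_head: Callable[[Latent], Latent],                   # role of the Mesh Head
    mesh_model: Callable[[Latent], Mesh],                    # role of Hunyuan3D
    feedback: Optional[Latent] = None,
) -> Mesh:
    """Text (plus optional feedback latent) -> image latent -> conditioning -> mesh."""
    image_latent = image_model(prompt, feedback)
    conditioning = mesh_head(image_latent)
    return mesh_model(conditioning)


# Toy stand-ins (hypothetical), present only to make the data flow executable.
def toy_image_model(prompt: str, feedback: Optional[Latent]) -> Latent:
    base = [float(len(prompt)), float(prompt.count(" "))]
    if feedback is None:
        return base
    return [b + f for b, f in zip(base, feedback)]


def toy_mesh_head(latent: Latent) -> Latent:
    return [0.5 * x for x in latent]


def toy_mesh_model(conditioning: Latent) -> Mesh:
    triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    return Mesh(vertices=triangle, faces=[(0, 1, 2)], reference_latent=conditioning)


mesh = generate_mesh("blue motorcycle", toy_image_model, toy_mesh_head, toy_mesh_model)
```

Passing the three stages in as callables keeps the sketch independent of any real model API. The optional `feedback` argument marks where a reference image latent would re-enter the first stage during iterative refinement.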

Chain of Mesh

Chain of Mesh. A closed-loop "latent, prompting, and re-generation" cycle.
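As a rough illustration of such a closed loop, the sketch below threads the latent from one generation step into the next, alongside a new prompt. The names `chain_of_mesh` and `toy_generate` are hypothetical, and the toy generator only records which prompts have influenced each result. It makes no claim about how UniMesh represents latents or meshes.

```python
from typing import Callable, List, Optional, Tuple

Latent = List[str]  # assumption: the toy latent is just the prompt history


def chain_of_mesh(
    initial_prompt: str,
    edit_instructions: List[str],
    generate: Callable[[str, Optional[Latent]], Tuple[str, Latent]],
) -> List[str]:
    """Run generation once, then re-generate once per edit instruction.

    Each step receives the latent produced by the previous step, which
    closes the latent -> prompting -> re-generation loop.
    """
    feedback: Optional[Latent] = None
    meshes: List[str] = []
    for prompt in [initial_prompt, *edit_instructions]:
        mesh, feedback = generate(prompt, feedback)
        meshes.append(mesh)
    return meshes


# Toy generator (hypothetical): the "mesh" is a string naming every prompt
# that contributed to it, so the dependency on earlier steps is visible.
def toy_generate(prompt: str, feedback: Optional[Latent]) -> Tuple[str, Latent]:
    latent = (feedback or []) + [prompt]
    return " -> ".join(latent), latent


meshes = chain_of_mesh("blue motorcycle", ["red motorcycle"], toy_generate)
print(meshes)
```

The second result depends on the first through the fed-back latent, which is what distinguishes an edit from a fresh generation in this sketch.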

Self-Reflection

Pipeline of Self-Reflection. The pipeline progresses from a 3D object through rendering and view selection to model captioning. The Reflexion agent iteratively corrects errors, proposes improvements, and eventually provides the final answer.
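The loop described in this caption can be sketched as follows, assuming the object has already been rendered into candidate views. Everything here is illustrative: `self_reflect_caption`, the scoring function, and the toy captioner and critic are hypothetical stand-ins, and the stopping rule (a round limit, or a critic that reports no remaining issue) is an assumption rather than a detail taken from UniMesh.

```python
from typing import Callable, List, Optional

View = dict  # assumption: a rendered view is a dict with an "attrs" list


def self_reflect_caption(
    views: List[View],
    score_view: Callable[[View], float],
    caption: Callable[[List[View], List[str]], str],
    critique: Callable[[List[View], str], Optional[str]],
    num_views: int = 4,
    max_rounds: int = 3,
) -> str:
    """Select the best views, caption them, then revise until the critic is satisfied."""
    selected = sorted(views, key=score_view, reverse=True)[:num_views]
    feedback: List[str] = []
    answer = caption(selected, feedback)
    for _ in range(max_rounds):
        issue = critique(selected, answer)
        if issue is None:  # no remaining error: accept the current answer
            break
        feedback.append(issue)
        answer = caption(selected, feedback)
    return answer


# Toy components (hypothetical), only to make the loop executable.
def toy_score(view: View) -> float:
    return len(view["attrs"])


def toy_caption(selected: List[View], feedback: List[str]) -> str:
    return ", ".join(["motorcycle"] + feedback)


def toy_critique(selected: List[View], answer: str) -> Optional[str]:
    # Report the first visible attribute that the caption fails to mention.
    for view in selected:
        for attr in view["attrs"]:
            if attr not in answer:
                return attr
    return None


views = [
    {"name": "front", "attrs": ["red"]},
    {"name": "side", "attrs": ["red", "two wheels"]},
    {"name": "bottom", "attrs": []},
]
final = self_reflect_caption(views, toy_score, toy_caption, toy_critique, num_views=2)
print(final)
```

Bounding the number of rounds keeps the loop from running indefinitely when the critic and captioner disagree.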

Mesh Generation and Editing

UniMesh enables semantic-aware 3D mesh generation and editing. From a single text prompt (top row), UniMesh generates high-fidelity 3D meshes. Leveraging its unified understanding-generation architecture, it further supports iterative semantic edits (bottom row), such as changing object color ("blue motorcycle" to "red motorcycle"), adding attributes ("astronaut" to "astronaut holding the Moon"), or modifying structure ("tracks" to "wheels"). These edits demonstrate the synergy between 3D understanding and generation within the Chain-of-Mesh mechanism.

Object Captioning

Captions generated by UniMesh. Each box shows four good views of a 3D object together with the caption UniMesh generated for it. The captions are detailed and attribute-rich, describing not only object identity but also color combinations, structural elements, and other attributes.