TL;DR: We present UniVid, an open-source unified video model for both understanding and generation tasks. Our model requires only a small amount of high-quality data for fine-tuning, achieving competitive results across various tasks.
Unified video modeling that combines generation and understanding capabilities is increasingly important but faces two key challenges: (1) maintaining semantic faithfulness in flow-based generation, which suffers from text-visual token imbalance and from applying uniform cross-modal attention across the flow trajectory, and (2) efficiently extending image-centric MLLMs to video without costly retraining. We present UniVid, a unified architecture that couples an MLLM with a diffusion decoder through a lightweight adapter, enabling both video understanding and generation. We introduce Temperature Modality Alignment to improve prompt adherence and Pyramid Reflection for efficient temporal reasoning via dynamic keyframe selection. Extensive experiments on standard benchmarks demonstrate state-of-the-art performance, achieving a 2.2% improvement on the VBench-Long total score compared to EasyAnimateV5.1, and 1.0% and 3.3% accuracy gains on MSVD-QA and ActivityNet-QA, respectively, compared with the best prior 7B baselines.
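The abstract only names Pyramid Reflection and dynamic keyframe selection without spelling out the procedure, so the snippet below is not the paper's algorithm. It is a minimal, generic illustration of dynamic keyframe selection: greedily dropping the most redundant frame (highest cosine similarity to its predecessor) until a frame budget is met. All names and shapes are illustrative.

```python
import torch


def select_keyframes(frame_feats: torch.Tensor, budget: int) -> torch.Tensor:
    """frame_feats: (T, D) per-frame features; returns indices of <= budget keyframes."""
    idx = torch.arange(frame_feats.size(0))
    while idx.numel() > budget:
        feats = torch.nn.functional.normalize(frame_feats[idx], dim=-1)
        # Redundancy score: cosine similarity of each frame to the previous kept frame.
        sim = (feats[1:] * feats[:-1]).sum(dim=-1)
        # Drop the single most redundant frame this round (the first frame is always kept).
        drop = sim.argmax().item() + 1
        keep = torch.ones(idx.numel(), dtype=torch.bool)
        keep[drop] = False
        idx = idx[keep]
    return idx


# Example: reduce 32 random frame features to 8 keyframes.
feats = torch.randn(32, 256)
print(select_keyframes(feats, budget=8))
```

In a video-QA setting, only the selected keyframes would be passed to the MLLM, which is what makes temporal reasoning over long videos cheaper.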
Overall architecture of our proposed UniVid for unified video understanding and generation. UniVid couples an autoregressive MLLM with a DiT-based diffusion decoder. The MLLM's outputs pass through a lightweight adapter into the Wan2.2-TI2V-5B backbone, forming the generation branch, and through the Pyramid Reflection module into the LLM, forming the understanding branch.
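To make the two-branch wiring above concrete, here is a minimal structural sketch in PyTorch. The module names (LightweightAdapter, UniVidSketch), dimensions, and call signatures are assumptions for illustration, not the released UniVid API; the MLLM, DiT decoder, Pyramid Reflection module, and LLM head are injected as opaque modules.

```python
import torch
import torch.nn as nn


class LightweightAdapter(nn.Module):
    """Projects MLLM hidden states into the conditioning space of the diffusion decoder."""

    def __init__(self, mllm_dim: int = 4096, cond_dim: int = 3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(mllm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.proj(h)


class UniVidSketch(nn.Module):
    """Couples an MLLM with a DiT decoder (generation) and an LLM head (understanding)."""

    def __init__(self, mllm: nn.Module, adapter: nn.Module, dit_decoder: nn.Module,
                 pyramid_reflection: nn.Module, llm: nn.Module):
        super().__init__()
        self.mllm = mllm
        self.adapter = adapter                        # bridges MLLM features to the DiT backbone
        self.dit_decoder = dit_decoder                # e.g. a Wan2.2-TI2V-5B-style diffusion decoder
        self.pyramid_reflection = pyramid_reflection  # dynamic keyframe selection
        self.llm = llm

    def generate(self, prompt_tokens, noisy_latents, timestep):
        h = self.mllm(prompt_tokens)                  # multimodal prompt features
        cond = self.adapter(h)                        # adapt to the decoder's conditioning space
        return self.dit_decoder(noisy_latents, timestep, cond)

    def understand(self, video_frames, question_tokens):
        keyframes = self.pyramid_reflection(video_frames)  # keep only informative frames
        h = self.mllm(question_tokens, visual=keyframes)
        return self.llm(h)                            # answer logits / tokens
```

The sketch shows the design choice the caption describes: a single shared MLLM feeds both branches, so generation and understanding reuse the same multimodal representation rather than maintaining two separate backbones.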