DeepSeek's Janus-Pro: A New Era in Unified Multimodal AI

DeepSeek's Janus-Pro unifies image understanding and generation by decoupling them into two specialized encoders, and it outperforms models such as DALL-E 3 on image generation benchmarks. Its streamlined training process further cuts cost while improving performance and adaptability.

In the early hours of January 28, 2025, DeepSeek unveiled two cutting-edge multimodal frameworks—Janus-Pro and JanusFlow. Among these, Janus-Pro stands out as a truly innovative leap forward.

Janus-Pro is a next-generation framework for unified multimodal understanding and generation, an upgraded version of the original Janus. By decoupling visual encoding, Janus-Pro significantly enhances the model's adaptability and performance across tasks. It excels on image generation benchmarks, outperforming OpenAI's text-to-image model DALL-E 3. As with the earlier Janus series, DeepSeek has made Janus-Pro open-source, releasing two models with 1.5 billion and 7 billion parameters (Janus-Pro-1.5B and Janus-Pro-7B).

Upon reviewing DeepSeek's technical report, a key observation stood out: their approach shares striking similarities with MetaMorph, the project led by Yann LeCun and Saining Xie. DeepSeek's approach, however, takes the concept further, exploring this promising direction more thoroughly.

Two leading players in the open-source model space, DeepSeek and Meta, are now set to reshape the paradigm of unified multimodal models. As Yann LeCun predicted, the triumph of open-source models is unfolding before our eyes.

The "Eyes" Revolution: Achieving Unity Through "Specialization"

The concept of a unified multimodal model was first introduced by Google, with Gemini being its flagship product. The core design of such models is to use Transformer architectures to process data from multiple modalities—such as text, images, and audio—enabling unified understanding and generation.

This breakthrough allows a single model to both "understand" and "generate" images, tasks that were previously handled by separate models. Systems like Stable Diffusion and DALL-E rely on a separate model to interpret text before generating images; such split pipelines demand more storage and compute and cannot share knowledge between their components.

On the other hand, GPT-4V (OpenAI's multimodal large model) can understand images and describe them in text, but it cannot generate images. If unified multimodal models are so powerful, why does OpenAI still rely on separate models like GPT-4V and DALL-E?

The answer lies in the complexity and inefficiency of training unified models. Initially, DeepSeek adopted a unified Transformer architecture with a single visual encoder serving both tasks. The theory was elegant: one model, one encoder, capable of both interpreting images and generating them. In practice, however, the two tasks demand different representations: understanding needs high-level semantics, while generation needs fine-grained detail. Forcing one encoder to serve both created significant performance bottlenecks.

DeepSeek solved this by giving the model two distinct "eyes": the first "eye" (a SigLIP encoder) focuses on understanding, extracting high-level semantic features and context from images, much like an experienced art critic. The second "eye" (a VQ tokenizer) specializes in creation, converting images into discrete token sequences, akin to an artist focused on fine detail.

While these two "eyes" work independently, they share a single "brain" (the Transformer), which fuses both streams of visual information with language and routes each task to its own prediction head. Unlike Meta's approach, which fine-tuned an existing language model to "awaken" its visual comprehension, DeepSeek went further, building dedicated, decoupled pathways for image understanding and image generation. This truly unified multimodal framework departs from the traditional single-encoder design.
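To make the decoupled design concrete, here is a minimal, hypothetical PyTorch sketch of the idea: two independent visual pathways feeding one shared Transformer with separate output heads. The class and module names (JanusStyleModel, understanding_encoder, generation_codebook) are illustrative stand-ins, not DeepSeek's actual implementation, and the shared backbone here is a toy encoder stack rather than the autoregressive language model Janus-Pro actually uses.

```python
# Toy sketch of a decoupled dual-encoder, shared-backbone design.
# All names and dimensions are illustrative assumptions, not Janus-Pro's code.
import torch
import torch.nn as nn


class JanusStyleModel(nn.Module):
    def __init__(self, hidden_dim=512, text_vocab=32000, image_vocab=16384):
        super().__init__()
        # "Eye" 1: semantic encoder for understanding (SigLIP-like), stood in
        # for here by a simple patchify-and-project convolution.
        self.understanding_encoder = nn.Conv2d(3, hidden_dim, kernel_size=16, stride=16)
        # "Eye" 2: discrete image tokens for generation (VQ-like), represented
        # here only by a learned codebook embedding over token ids.
        self.generation_codebook = nn.Embedding(image_vocab, hidden_dim)
        # Shared "brain": one Transformer over whichever token stream it receives.
        self.shared_transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Separate output heads: one predicts text tokens, one predicts image tokens.
        self.text_head = nn.Linear(hidden_dim, text_vocab)
        self.image_head = nn.Linear(hidden_dim, image_vocab)

    def understand(self, images):
        # images: (B, 3, H, W) -> patch features -> text-token logits
        feats = self.understanding_encoder(images).flatten(2).transpose(1, 2)
        return self.text_head(self.shared_transformer(feats))

    def generate_step(self, image_token_ids):
        # image_token_ids: (B, N) discrete codes -> image-token logits
        embeds = self.generation_codebook(image_token_ids)
        return self.image_head(self.shared_transformer(embeds))


# Quick shape check with random inputs.
model = JanusStyleModel()
print(model.understand(torch.randn(1, 3, 64, 64)).shape)            # (1, 16, 32000)
print(model.generate_step(torch.randint(0, 16384, (1, 16))).shape)  # (1, 16, 16384)
```

The point of the sketch is the routing, not the layers: understanding and generation never share a visual encoder, yet every token ultimately flows through the same backbone.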

Interestingly, the name "Janus" is a nod to the two-faced Roman god, a fitting symbol for DeepSeek's approach of achieving unity through specialization.

Training Breakthroughs: DeepSeek’s Efficiency Miracle

For DeepSeek, architectural innovation is not the only key to success; the efficiency of their training process plays a significant role. Through meticulous control over their training stages, DeepSeek was able to significantly reduce costs and boost model performance.

Stage 1: Unlocking Performance Through Fixed Parameters

In traditional multimodal AI training, the first stage is generally treated as a mere warm-up, with the visual encoder trained to extract basic features; it typically takes up about 15% of total training time. DeepSeek, however, made a counterintuitive discovery: even with the language model's parameters frozen, training only the adapters was enough for the model to learn complex pixel dependencies. This discovery not only reduced training complexity but also delivered a significant performance boost.

As a result, DeepSeek extended this stage to 25-30% of the total training time, leading to a substantial improvement in the model’s foundational visual understanding.
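A minimal sketch of that Stage-1 recipe, assuming a model object that exposes language_model and adapters submodules (hypothetical names used purely for illustration): freeze the language backbone and hand only the adapter parameters to the optimizer.

```python
# Hedged sketch of the Stage-1 idea: keep the pretrained language model frozen
# and train only the lightweight adapters that connect the visual encoders to it.
# `model.language_model` and `model.adapters` are assumed attribute names.
import torch


def stage1_trainable_parameters(model):
    # Freeze every parameter of the language-model backbone.
    for param in model.language_model.parameters():
        param.requires_grad = False
    # Keep only the adapter parameters trainable.
    for param in model.adapters.parameters():
        param.requires_grad = True
    return list(model.adapters.parameters())


# Usage (hypothetical): the optimizer only ever sees the adapter weights.
# optimizer = torch.optim.AdamW(stage1_trainable_parameters(model), lr=1e-4)
```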

Stage 2: Moving Beyond ImageNet

The second phase in multimodal AI training typically focuses on modality alignment, where both visual and language models are trained together to align their capabilities. Traditionally, ImageNet has played a central role in this process, with up to 67% of training steps devoted to it. DeepSeek, however, made a bold move: they abandoned ImageNet entirely, recognizing that its data distribution didn’t reflect real-world applications.

Instead, they used actual text-to-image data, which resulted in a 40% reduction in training time, a 35% improvement in image generation quality, and better model adaptability to real-world scenarios.
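As a rough illustration of that change, a hypothetical Stage-2 data configuration might simply drop the ImageNet-style source and sample entirely from real captioned text-to-image pairs. The source names and weights below are placeholders, not DeepSeek's actual recipe.

```python
# Illustrative only: expressing the Stage-2 data mix once ImageNet-style
# category data is dropped in favor of real text-to-image pairs.
stage2_sources = {
    # "imagenet_categories": 0.67,    # old recipe: most steps spent here
    "real_text_to_image_pairs": 1.0,  # new recipe: train on captioned data directly
}


def normalized_sampling_weights(sources):
    total = sum(sources.values())
    return {name: weight / total for name, weight in sources.items()}


print(normalized_sampling_weights(stage2_sources))
# -> {'real_text_to_image_pairs': 1.0}
```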

Stage 3: The Eastern Secret to Training Efficiency

In the final stage, the model undergoes task-specific fine-tuning, which plays a crucial role in optimizing its performance. Through extensive experimentation, DeepSeek found a more effective data ratio for fine-tuning: 5:1:4, instead of the traditional 7:3:10 distribution of multimodal, pure text, and text-to-image data. This new ratio included a 1:1 mix of synthetic and real-world text-to-image data, leading to faster convergence, more stable results, and significantly better aesthetic quality in generated images.
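To see what that 5:1:4 split means in practice, here is a small sketch that converts the stated ratios into sampling probabilities; the even synthetic/real split within text-to-image follows the 1:1 mix mentioned above, and the function name is just illustrative.

```python
# Turn the reported Stage-3 fine-tuning ratios into sampling probabilities.
# 5:1:4 = multimodal : pure-text : text-to-image, with text-to-image split 1:1
# between synthetic and real data (per the article).
def mixture_probabilities(multimodal=5, pure_text=1, text_to_image=4):
    total = multimodal + pure_text + text_to_image
    return {
        "multimodal": multimodal / total,
        "pure_text": pure_text / total,
        "text_to_image_synthetic": (text_to_image / total) / 2,
        "text_to_image_real": (text_to_image / total) / 2,
    }


print(mixture_probabilities())
# {'multimodal': 0.5, 'pure_text': 0.1,
#  'text_to_image_synthetic': 0.2, 'text_to_image_real': 0.2}
```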

Thanks to these methods, DeepSeek managed to train the Janus-Pro-7B model in just 14 days on 32 nodes (256 A100 GPUs in total), a remarkable achievement in AI training efficiency.
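For a sense of scale, a quick back-of-the-envelope calculation from those figures (purely arithmetic, not taken from DeepSeek's report):

```python
# Back-of-the-envelope scale check from the figures quoted above.
nodes, total_gpus, days = 32, 256, 14
gpus_per_node = total_gpus // nodes   # 8 A100s per node
a100_hours = total_gpus * days * 24   # 86,016 GPU-hours in total
print(gpus_per_node, a100_hours)      # 8 86016
```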

Shifting the Paradigm of Unified Models

DeepSeek’s Janus-Pro-7B has proven that "understanding" and "generation" can coexist in a single, unified framework, each reaching its optimal performance. Interestingly, while traditional unified models claim to be inspired by the human brain, they fail to account for the brain's essential anatomical features—namely, functional specialization and integration.

In evolutionary terms, the human brain’s left hemisphere is responsible for language processing and logical analysis, while the right hemisphere focuses on spatial perception, artistic creation, and holistic cognition. This functional division is not simple isolation; rather, it is the integration of these specialized functions through the corpus callosum that enables a unified cognitive experience.

DeepSeek’s Janus-Pro model seems to mirror this architecture: its image-understanding encoder takes on the left-brain’s analytical role, while its image-generation encoder reflects the creative capacity of the right brain. The Transformer acts as the corpus callosum, integrating the two streams of information into a unified whole.

Perhaps, by embracing this model of specialization and integration, DeepSeek is pointing the way toward the next evolution of unified multimodal AI.

Original: https://mp.weixin.qq.com/s/dJ1grS0daUImIghGaLwELg