When Models Learn to Think Before Painting
- Juan Manuel Ortiz de Zarate

- Dec 6, 2025
- 9 min read
The landscape of image generation has been rewritten several times over the past five years, but few milestones feel as consequential as Tencent’s HunyuanImage 3.0 [1]. Not only does it push the frontier of open-source text-to-image systems, it also represents a methodological shift: a native multimodal model that unifies text, image understanding, image generation, and reasoning within a single autoregressive framework. In a world increasingly dominated by proprietary multimodal giants, HunyuanImage 3.0 signals that open research still has the power to lead.
The technical report describes a system trained at enormous scale: an 80-billion-parameter Mixture-of-Experts backbone (with 13B active parameters per token), deeply integrated with a dual-image-encoder design, a native Chain-of-Thought (CoT) workflow, and an ambitious post-training pipeline involving SFT, DPO [6], GRPO variants, and novel reward-distribution alignment. The result rivals state-of-the-art closed-source models—GPT-Image [5], Seedream 4.0 [3], Nano Banana [4]—while remaining fully open.
This article unpacks the core ideas behind the model: its data pipeline, architecture, training strategy, evaluative benchmarks, and scientific insights. It also situates the model in the broader evolution of multimodal AI.
1. Why HunyuanImage 3.0 Matters
The report presents the model as the culmination of a native multimodal framework rather than a traditional diffusion model [2] bolted onto an LLM. Most recent industry systems still separate “language understanding” from “image generation,” often using different encoders, tokenization schemes, and reasoning modules. HunyuanImage collapses these boundaries: it predicts text tokens autoregressively and image tokens via diffusion, but both exist inside the same transformer sequence.
This has implications beyond image generation:
multimodal CoT becomes possible,
instructions can flow through text and visual cues without switching models,
image editing and interleaving become natural extensions rather than add-ons,
reasoning before generation emerges as a trained behavior.
In the benchmark results presented in the report—particularly GSB and SSAE—HunyuanImage 3.0 performs at or near the top among leading systems, while remaining the only fully open model of comparable scale.
2. Building a Massive, High-Quality Data Pipeline
One of the most striking parts of the report is how much effort went into constructing the dataset. The authors start with over 10 billion images and retain less than 45% after a three-stage filtering process. This is unusually aggressive, reflecting a belief that generative quality depends more on data cleanliness than sheer volume.

2.1 Technical cleansing and deduplication
Low-resolution, corrupted, oversaturated, and exposure-imbalanced images are removed. MD5 hashing handles basic deduplication.
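As a rough illustration of this step, the sketch below shows exact-duplicate removal via MD5 hashing; the report only states that MD5 is used, so the streaming details here are merely a plausible implementation.

```python
import hashlib
from pathlib import Path

def md5_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through MD5 so large images do not need to fit in memory."""
    h = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def deduplicate(image_paths):
    """Keep only the first occurrence of each byte-identical image."""
    seen, kept = set(), []
    for path in image_paths:
        digest = md5_digest(path)
        if digest not in seen:
            seen.add(digest)
            kept.append(path)
    return kept
```

Hash matching only removes byte-identical copies, which is presumably why the report calls this step “basic” deduplication.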
2.2 A sophisticated network of filtering operators
The second stage is more novel. It combines:
Objective detectors: watermarks, logos, collages, borders, and AI-generated content. The paper notes that AIGC contamination is a real threat to model convergence, so the authors combine model-based AIGC detectors with the removal of entire sources known to contain large proportions of synthetic images.
Subjective scoring models: image clarity, aesthetics, and artistic criteria defined by expert curators. Their aesthetic scoring is explicitly broken down into color, light & shadow, and composition, anchoring model preference to human-interpretable dimensions.
2.3 Dataset augmentation for diversity
The surviving dataset is complemented with:
knowledge-augmented samples,
heavy-text images (for OCR robustness),
stylized content,
graphic design material,
a 100+ million multi-image dataset created from clusters and video.
This last point is central because the model is multimodal from the start: clustering yields images exhibiting editing relationships, while video segmentation produces natural temporal variations—ideal for interleaving tasks.
2.4 A sophisticated captioning ecosystem
Captions are not simple descriptions. The pipeline includes:
a bilingual hierarchical schema (short → ultra-long),
structured fields for style, composition, atmosphere,
named-entity extraction,
OCR-based grounding,
a bidirectional verification loop ensuring captions and detected text/entities mutually agree.
For paired images, they introduce an image difference captioner, which explicitly learns how one frame differs from another—useful for image editing prompts.
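To make the schema more concrete, here is a hypothetical caption record in the spirit of the pipeline described above; every field name is an illustration, not the report’s actual format.

```python
# Hypothetical structured caption record; field names are illustrative only.
caption_record = {
    "caption_short": "A red vintage bicycle leaning against a brick wall.",
    "caption_long": (
        "A red vintage bicycle with a wicker basket leans against a weathered "
        "brick wall at golden hour, warm side lighting, shallow depth of field."
    ),
    "style": "film photography",
    "composition": "rule of thirds, subject left of center",
    "atmosphere": "nostalgic, warm",
    "named_entities": [],
    "ocr_text": [],      # detected in-image text, cross-checked against the caption
    "language": "en",    # the bilingual schema would pair this with a zh record
}
```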
3. Reasoning for Image Generation
A central innovation of HunyuanImage 3.0 is its ability to reason before generating an image. Instead of mapping a prompt directly to pixels, the model follows a deliberate, multi-step workflow:
Interpret the user prompt, identifying key objects, attributes, and stylistic cues.
Produce an internal Chain-of-Thought (CoT) that rewrites, clarifies, or expands the prompt.
Translate that reasoning into a structured “visual specification” describing composition, lighting, relationships, and scene layout.
Generate the final image using this refined plan.
This resembles how a human artist thinks: first understanding the request, then planning, then illustrating.
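Conceptually, the workflow can be sketched as the pipeline below. The method names (`model.generate`, `model.generate_image`) are hypothetical; in the actual model the reasoning trace, the visual plan, and the image tokens are all emitted within a single autoregressive sequence rather than through separate calls.

```python
def generate_with_reasoning(model, user_prompt: str):
    """Expository sketch of the reason-then-render workflow.
    `model.generate` and `model.generate_image` are hypothetical methods;
    HunyuanImage 3.0 performs all steps inside one autoregressive sequence."""
    # 1. Interpret the prompt and expand it into a chain of thought.
    cot = model.generate(f"Analyze and expand this image request:\n{user_prompt}")
    # 2. Convert the reasoning into a structured visual specification.
    spec = model.generate(
        f"Write a visual plan (composition, lighting, layout, relationships) for:\n{cot}"
    )
    # 3. Generate the image from the refined plan.
    return model.generate_image(spec)
```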
Two specialized datasets make this behavior possible:
3.1 Text-to-Text Reasoning Corpus
This corpus consists of real prompts from diverse domains—photography, illustration, UI design, technical diagrams, and more. The model is trained to produce a coherent reasoning trace for each prompt, learning to:
disambiguate vague instructions,
expand short prompts into richer descriptions,
break complex scenes into logical components.
This improves the model’s ability to follow nuanced or multi-attribute instructions.
3.2 Text-to-Text-to-Image (T2TI)
The second dataset pairs each image with:
a short caption,
an expanded caption,
and a full reasoning trace.
These examples teach the model how a conceptual explanation maps onto an actual visual outcome. By repeatedly seeing “concept → reasoning → image,” the model internalizes a structured generation pipeline.
Together, T2T and T2TI encourage HunyuanImage 3.0 to form an internal plan before synthesizing pixels, resulting in better compositional accuracy, more faithful prompt alignment, and fewer logical inconsistencies in the final images.
4. Architecture: A Multimodal Transformer With Dual Encoders
The next figure shows the full architecture: a Transformer built on Hunyuan-A13B, an MoE language model with 80B total parameters but only 13B active per token.

4.1 Mixture-of-Experts Backbone
The MoE design includes 64 experts, with 8 activated per token, plus a shared MLP. This reduces compute while allowing specialization—an effect confirmed later in the expert-activation analysis.
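A minimal sketch of this routing pattern is shown below, assuming standard top-k gating with a softmax router and one always-on shared expert; load balancing, capacity limits, and expert parallelism are omitted, and the production implementation will differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative top-k MoE block: 64 routed experts (8 active per token)
    plus one shared MLP that sees every token. Omits load balancing and
    capacity limits used in production systems."""
    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.shared = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):                                   # x: (num_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)      # (num_tokens, top_k)
        weights = weights / weights.sum(-1, keepdim=True)   # renormalize over active experts
        out = self.shared(x)                                # shared expert processes every token
        for k in range(self.top_k):
            for e in idx[:, k].unique().tolist():           # dispatch tokens to their k-th expert
                mask = idx[:, k] == e
                out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out
```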
4.2 Dual Image Encoders
A key architectural novelty is using:
a VAE for latent pixel representation (downsampling factor 16),
a vision encoder (ViT) for semantic image understanding.
Unlike most prior unified models, HunyuanImage concatenates both into a single multimodal sequence.
This dual-encoder fusion enables:
image understanding tasks (classification, Q&A, captioning),
text-to-image generation,
multimodal reasoning,
image editing by conditioning on both VAE and semantic features.
4.3 Two projection modules
In the VAE branch, latent features pass through a timestep-modulated residual block that injects the diffusion conditioning; the ViT branch uses a simple MLP projector. This separation allows the model to treat generative and semantic features differently while still unifying them in one sequence.
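The report does not spell out the exact block, but a timestep-modulated residual projector could plausibly look like the AdaLN-style sketch below; treat the shapes and the shift/scale/gate scheme as assumptions.

```python
import torch
import torch.nn as nn

class TimestepModulatedProjector(nn.Module):
    """Plausible shape of a timestep-modulated residual projector for VAE latents
    (AdaLN-style shift/scale/gate). Assumed design, not the published block."""
    def __init__(self, d_latent: int, d_model: int, d_time: int):
        super().__init__()
        self.proj_in = nn.Linear(d_latent, d_model)
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.to_mod = nn.Linear(d_time, 3 * d_model)  # shift, scale, gate from the timestep embedding

    def forward(self, latents, t_emb):                # latents: (B, N, d_latent), t_emb: (B, d_time)
        h = self.proj_in(latents)
        shift, scale, gate = self.to_mod(t_emb).chunk(3, dim=-1)
        mod = self.norm(h) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        return h + gate.unsqueeze(1) * self.mlp(mod)  # residual path gated by the timestep
```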
4.4 Generalized Causal Attention
A conceptually elegant piece of the model, described in the next figure. The idea:
Text tokens obey standard causal attention.
Image tokens may attend to all previous multimodal tokens and all tokens within the same image segment—effectively a structured blend of autoregressive and full spatial attention.
This hybrid attention makes it possible to:
generate images autoregressively (via diffusion),
but still capture global spatial relationships.
In multi-image training sequences, the mask includes “holes” to prevent later tokens from attending to earlier generated images—preserving causal correctness during training.
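The mask itself is easy to express. The sketch below builds the boolean attention mask for a single sequence, assuming text tokens are strictly causal and image tokens additionally see their whole image segment; the training-time “holes” for earlier generated images are omitted for brevity.

```python
import torch

def generalized_causal_mask(segment_ids: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
    """segment_ids: (seq,) int ids marking which image segment a token belongs to
    (values on text tokens are ignored); is_image: (seq,) bool flags.
    Returns a (seq, seq) boolean mask where True means "query may attend to key"."""
    seq = segment_ids.shape[0]
    pos = torch.arange(seq)
    causal = pos[:, None] >= pos[None, :]                     # standard causal attention
    same_image = (segment_ids[:, None] == segment_ids[None, :]) \
        & is_image[:, None] & is_image[None, :]               # full attention inside one image
    return causal | same_image

# Example: 3 text tokens, a 4-token image (segment 1), then 2 more text tokens.
seg = torch.tensor([0, 0, 0, 1, 1, 1, 1, 0, 0])
img = torch.tensor([False, False, False, True, True, True, True, False, False])
mask = generalized_causal_mask(seg, img)
```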
4.5 Generalized 2D Rotary Position Embedding
The paper adopts Su’s generalized 2D RoPE, mapping 1D embeddings to a 2D coordinate system for image tokens. This preserves backward compatibility: text remains 1D RoPE, images become 2D RoPE, but the model interprets both seamlessly.
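A common way to realize this, and a reasonable reading of the paper, is to rotate half of each head’s channels by the row coordinate and the other half by the column coordinate, so that 1D text positions remain a special case. The sketch below follows that split-half formulation, which may differ in detail from the report.

```python
import torch

def rope_1d(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard 1D RoPE. x: (..., seq, dim) with even dim; positions: (seq,)."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions[:, None].float() * inv_freq[None, :]   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: torch.Tensor, rows: torch.Tensor, cols: torch.Tensor) -> torch.Tensor:
    """2D extension for image tokens: one half of the channels is rotated by the
    row index, the other half by the column index."""
    half = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :half], rows), rope_1d(x[..., half:], cols)], dim=-1)
```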
4.6 Automatic Image Resolution Selection
The authors extend the tokenizer with shape tokens for:
image resolution (<img_size_256>, <img_size_768>, …),
aspect ratio (<img_ratio_0> … <img_ratio_32>).
During inference, the model predicts its own target image dimensions—unless the user specifies a ratio (e.g., “vertical 3:4”). This is particularly powerful for chat-like interfaces.
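On the ratio side, the mechanics reduce to snapping a requested or predicted shape onto the nearest discrete token. The bucket table below is made up for illustration; only the token naming comes from the report.

```python
# Hypothetical illustration of how an aspect ratio could be snapped to a discrete
# <img_ratio_*> token. The actual ratio table used by HunyuanImage 3.0 is not
# reproduced here; the buckets below are illustrative only.
RATIO_BUCKETS = {  # token index -> height/width ratio (made-up values)
    0: 1.0,     # square
    1: 4 / 3,   # vertical 3:4
    2: 3 / 4,   # horizontal 4:3
    3: 16 / 9,
    4: 9 / 16,
}

def ratio_token(height: int, width: int) -> str:
    """Pick the <img_ratio_*> token whose bucket is closest to height/width."""
    target = height / width
    idx = min(RATIO_BUCKETS, key=lambda i: abs(RATIO_BUCKETS[i] - target))
    return f"<img_ratio_{idx}>"

print(ratio_token(1024, 768))   # vertical 3:4 -> "<img_ratio_1>" under these buckets
```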

5. A Four-Stage Training Strategy
The training pipeline behind HunyuanImage 3.0 follows a carefully staged progression, where each phase targets a different aspect of multimodal competence. Rather than training all components simultaneously from the start, the authors structure the process so that the model gradually acquires linguistic alignment, visual understanding, high-resolution generative ability, and finally reasoning. This curriculum-like design ensures stability and makes efficient use of the billions of training samples.
Stage I: Low-Resolution Multitask Pretraining
The first stage establishes the foundation of the model’s multimodal capabilities. The VAE operates at 256px resolution, which keeps training efficient while exposing the model to vast amounts of image–text data. Three tasks run in parallel:
Text-to-Image generation (T2I)
Language modeling (LM)
Multimodal understanding (MMU)
During this phase, the Transformer backbone is actively trained, while the ViT encoder remains frozen. This allows the Transformer to learn how to interpret captions, align them with latent image features, and build a coherent multimodal representation space. Because the dataset is at billion scale during this stage, the model develops strong general alignment between textual semantics and visual structure.
Stage II: Visual Understanding Refinement
Once the Transformer has learned stable multimodal representations, training shifts focus to the visual encoder. In this stage:
The Transformer is frozen to preserve the generative foundations learned in Stage I.
The ViT encoder and its projector are fine-tuned exclusively on MMU tasks.
The purpose is to strengthen the model’s visual understanding—object recognition, scene reasoning, spatial relations—without destabilizing the generative behavior the Transformer has already learned. This isolated refinement step gives the model a cleaner, more reliable visual backbone for downstream multimodal tasks.
Stage III: High-Resolution Multimodal Training
With both components now aligned, Stage III reunites them. The VAE resolution increases to 512px, giving the model access to richer visual detail. At this point:
Both the Transformer and ViT train jointly,
The dataset shifts to higher-quality image subsets,
And new training regimes are introduced: interleaved text–image sequences (INTL), image editing, and image-to-image generation.
These additional tasks transform the model from a simple caption-conditioned generator into a flexible multimodal system capable of understanding and producing complex, sequential, or edited imagery. The jump in resolution also trains the diffusion module to handle finer textures, sharper edges, and more intricate compositions.
Stage IV: Ultra-High-Resolution + Reasoning
The final stage is where HunyuanImage becomes truly distinctive. The VAE now operates at 1024px, exposing the model to ultra-high-definition samples. More importantly, this is the phase where reasoning data—including the T2T and T2TI datasets—are introduced.
Here, the model learns to:
integrate Chain-of-Thought reasoning directly into its multimodal workflow,
refine prompts into internal visual plans before generating images,
maintain semantic coherence at larger resolutions.
Both understanding and generation improve as the model receives higher-fidelity visual feedback and learns how reasoning traces map to actual images.
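The whole curriculum can be condensed into the summary below. It is a reading aid rather than a released configuration, and details the report leaves implicit (for example, Stage II keeping the 256px resolution) are marked as assumptions.

```python
# Compact summary of the four-stage curriculum; not a configuration file from the release.
STAGES = {
    "I":   {"vae_res": 256,   "trainable": "Transformer (ViT frozen)",
            "tasks": "T2I + LM + MMU"},
    "II":  {"vae_res": 256,   # assumed unchanged; the report does not state it
            "trainable": "ViT + projector (Transformer frozen)",
            "tasks": "MMU only"},
    "III": {"vae_res": 512,   "trainable": "Transformer + ViT",
            "tasks": "adds INTL, image editing, image-to-image"},
    "IV":  {"vae_res": 1024,  "trainable": "Transformer + ViT",
            "tasks": "adds T2T and T2TI reasoning data"},
}
```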
Instruction Tuning
After the four pretraining stages, the final step is to adapt the model for interactive use. The researchers convert:
Text-to-Image tasks,
Language modeling tasks,
Chain-of-Thought reasoning tasks
into instruction-style templates, enabling the model to behave naturally in conversational interfaces. This tuning ensures that the model not only generates images well but also interprets instructions, follows stylistic cues, and responds coherently to user dialogue.
6. Post-Training: SFT → DPO → MixGRPO → SRPO → ReDA
Few technical reports describe post-training in such detail. Here’s a synthesis of the methods:
6.1 Supervised Fine-Tuning (SFT)
The model is fine-tuned on high-quality, human-annotated images spanning many categories; multi-stage SFT progressively raises the quality bar of the training images.
6.2 Direct Preference Optimization (DPO)
Used to correct structural distortions—an issue common in large diffusion models. The team generates a large set of images, labels high- vs low-quality pairs, and trains with DPO to suppress errors such as:
limb distortions,
incorrect geometry,
malformed objects.
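The underlying objective is the Diffusion-DPO loss of Wallace et al. [6], which compares how well the trained policy and a frozen reference denoise the preferred versus the rejected image. The sketch below folds the timestep weighting into a single beta, so it is schematic rather than the authors’ exact formulation.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(err_w_theta, err_w_ref, err_l_theta, err_l_ref, beta=0.1):
    """Schematic Diffusion-DPO objective (Wallace et al. [6]).
    Each err_* is the per-sample denoising MSE ||eps - eps_model(x_t, t)||^2
    for the preferred (w) or rejected (l) image, under the trained policy
    (theta) or a frozen reference. Timestep weighting is folded into beta."""
    inner = (err_w_theta - err_w_ref) - (err_l_theta - err_l_ref)
    return -F.logsigmoid(-beta * inner).mean()
```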
6.3 MixGRPO
A hybrid ODE-SDE gradient-based RL optimization technique [7] adapted for flow-based models. It optimizes for:
aesthetics,
composition,
lighting consistency,
text-image alignment.
6.4 SRPO
A novel one-step denoising RL technique: noise is injected into latent space and optimized during the early denoising interval, improving:
realism,
skin texture,
lighting coherence,
oversaturation mitigation.
6.5 Reward Distribution Alignment (ReDA)
A new algorithm aligning generated images with the distribution of high-reward samples. Unlike pointwise rewards, ReDA attempts to minimize divergence between the generator’s output distribution and a curated high-quality dataset.
7. Evaluating Image Quality: SSAE and GSB
7.1 SSAE: Structured Semantic Alignment Evaluation
A new metric designed to overcome limitations in CLIP-based evaluation.
SSAE uses:
500 prompts,
3,500 key semantic points extracted via LLMs,
Chain-of-Thought reasoning from an MLLM during scoring,
12 semantic fields (nouns, attributes, actions, scene attributes, style, composition, etc.).
The model’s alignment is evaluated with:
Mean Image Accuracy,
Global Accuracy.
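A reasonable reading of these two scores (the report does not print the formulas here) is that Mean Image Accuracy averages the per-image fraction of satisfied key points, while Global Accuracy pools all key points across images:

```python
def ssae_accuracies(points_per_image):
    """points_per_image: list of lists of 0/1 judgments, one inner list per image,
    each entry saying whether a key semantic point was satisfied.
    Assumed definitions: mean image accuracy averages the per-image fraction of
    satisfied points; global accuracy pools all points across images."""
    per_image = [sum(p) / len(p) for p in points_per_image if p]
    total = [x for p in points_per_image for x in p]
    mean_image_acc = sum(per_image) / len(per_image)
    global_acc = sum(total) / len(total)
    return mean_image_acc, global_acc
```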
As shown in the next figure, HunyuanImage 3.0 matches or exceeds competitors across all fields.
7.2 GSB: Human Comparative Evaluation
A “Good / Same / Bad” study across 1,000 prompts and over 100 human evaluators. The findings:
Beats HunyuanImage 2.1 by 14.10%, a significant leap over the previous open-source generation.
Slightly surpasses Seedream 4.0, Nano Banana, and GPT-Image, with relative win rates of 1–5%.
This suggests HunyuanImage 3.0 has caught up to the best global models, despite being fully open.

8. Scientific Insight: Expert Specialization in MoE Models
The report includes an analysis (Figure 8, p.11) showing:
Experts become increasingly specialized for text vs. image tokens as depth increases.
KL divergence between image-activated and text-activated expert distributions rises with layer depth.
Early layers share representations; later layers diverge strongly, suggesting natural modality specialization.
This supports the hypothesis that MoE architectures are naturally suited for multimodal integration—letting experts “choose” their preferred modality.
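This kind of measurement is straightforward to reproduce on any MoE checkpoint: record which expert each token is routed to, build per-modality usage histograms at every layer, and compare them with a KL divergence, as sketched below (this mirrors the analysis rather than reusing the authors’ script).

```python
import torch
import torch.nn.functional as F

def expert_usage_kl(image_routing, text_routing, n_experts=64, eps=1e-8):
    """image_routing / text_routing: 1-D tensors of expert indices chosen for
    image and text tokens at one layer. Returns KL(P_image || P_text) between
    the two expert-usage histograms; a larger value means the layer routes the
    two modalities to more disjoint sets of experts."""
    p = torch.bincount(image_routing, minlength=n_experts).float()
    q = torch.bincount(text_routing, minlength=n_experts).float()
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return F.kl_div(q.log(), p, reduction="sum")   # = sum p * log(p / q)
```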
9. Conclusion: A Blueprint for Future Multimodal Models
HunyuanImage 3.0 is more than a text-to-image generator. It is:
a unified multimodal model,
trained with progressive curriculum strategies,
equipped with native reasoning abilities,
optimized through a sophisticated post-training suite,
evaluated with richer semantic metrics,
and fully open-source.
The report emphasizes that only text-to-image functionality is currently released, but image-to-image capabilities are under active development and will follow.
In an ecosystem where closed foundation models increasingly dominate, HunyuanImage 3.0 demonstrates that innovation and openness can coexist. Its architecture—dual encoders, generalized causal attention, multimodal CoT—will likely influence future multimodal research for years to come.
References
[1] Cao, S., Chen, H., Chen, P., Cheng, Y., Cui, Y., Deng, X., ... & Zhong, Z. (2025). HunyuanImage 3.0 technical report. arXiv preprint arXiv:2509.23951.
[2] Diffusion Models: From Noise to Masterpiece, Transcendent AI
[3] Seedream, T., Chen, Y., Gao, Y., Gong, L., Guo, M., Guo, Q., ... & Zhu, W. (2025). Seedream 4.0: Toward next-generation multimodal image generation. arXiv preprint arXiv:2509.20427.
[4] Google. Nano Banana, 2025. URL https://developers.googleblog.com/en/introdu
[5] OpenAI. GPT-Image, 2025. URL https://platform.openai.com/docs/models/gpt-i
[6] Wallace, B., Dang, M., Rafailov, R., Zhou, L., Lou, A., Purushwalkam, S., ... & Naik, N. (2024). Diffusion model alignment using direct preference optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8228-8238).
[7] Li, J., Cui, Y., Huang, T., Ma, Y., Fan, C., Yang, M., & Zhong, Z. (2025). MixGRPO: Unlocking flow-based GRPO efficiency with mixed ODE-SDE. arXiv preprint arXiv:2507.21802.