Introducing RepFusion, a text-to-image model that repurposes frozen multimodal LLMs as noisy latent encoders for denoising in representation space.
Conditional encoders in text-to-image models are often large, frozen, and knowledgeable, yet they are reduced to one-shot text embedding modules.
Representation-space diffusion creates latent spaces that are much closer to what MLLMs already understand.
MLLMs read noisy representations at each step, so the condition changes with the denoising trajectory.
At matched inference FLOPs, a frozen MLLM reading noisy representations plus a small DiT beats baselines that spend the budget on larger newly initialized denoisers.
VAE latents are optimized for reconstruction, while representations from a representation autoencoder (RAE) preserve much richer visual semantics. This gives the generator a denoising space closer to the feature spaces used by MLLMs. RepFusion reuses the MLP projector that aligns visual representations with an LLM, and applies it to encode noisy representations during generation. The frozen MLLM processes the prompt and the current noisy representation, and its outputs then condition a DiT denoiser.
The architecture stays simple: train the projector and DiT, keep the LLM backbone frozen, and recompute the conditioning signal as the noisy representation evolves.
Because the selected MLLM hidden states are token-aligned with DiT tokens, the MLLM output can modulate corresponding DiT tokens through AdaLN.
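To make the token-wise modulation concrete, here is a minimal NumPy sketch of AdaLN where each DiT token gets its own scale and shift from the MLLM hidden state at the same position. The linear map `w`, `b`, the `(1 + scale)` convention, and all shapes are assumptions for illustration, not details taken from RepFusion.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token over its feature dimension (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def tokenwise_adaln(dit_tokens, mllm_hidden, w, b):
    """Modulate each DiT token with the MLLM hidden state at the same position.

    dit_tokens:  (num_tokens, d_dit)
    mllm_hidden: (num_tokens, d_mllm), token-aligned with dit_tokens
    w, b:        hypothetical linear map from d_mllm to 2 * d_dit
    """
    if dit_tokens.shape[0] != mllm_hidden.shape[0]:
        raise ValueError("MLLM hidden states must be token-aligned with DiT tokens")
    scale_shift = mllm_hidden @ w + b                  # (num_tokens, 2 * d_dit)
    scale, shift = np.split(scale_shift, 2, axis=-1)   # each (num_tokens, d_dit)
    return layer_norm(dit_tokens) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
num_tokens, d_dit, d_mllm = 4, 8, 16                   # toy sizes
x = rng.normal(size=(num_tokens, d_dit))
h = rng.normal(size=(num_tokens, d_mllm))
w = np.zeros((d_mllm, 2 * d_dit))                      # zero init: plain LayerNorm
b = np.zeros(2 * d_dit)
out = tokenwise_adaln(x, h, w, b)
```

Because the scale and shift are computed per position, changing one MLLM hidden state changes only the matching DiT token, unlike a single pooled conditioning vector shared by all tokens.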
The important distinction: recomputation only helps when the encoder sees the evolving noisy representation. The learnable-query baseline reaches 0.55 GenEval; making the queries timestep-dependent to match RepFusion's inference FLOPs reaches 0.54. RepFusion reaches 0.70 because the MLLM observes the current denoising state.
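The difference between a static condition and a recomputed one can be sketched as a sampling loop in which the encoder is called at every step with the current noisy state. This is an illustrative Euler sampler under assumed conventions (t=0 is noise, t=1 is the clean representation); `encode_condition` and `predict_velocity` are hypothetical stand-ins for the projector plus frozen MLLM and for the DiT, and the toy functions below are not real models.

```python
import numpy as np

def sample(prompt, encode_condition, predict_velocity, shape, num_steps=8, seed=0):
    """Euler sampler that re-encodes the condition from the current noisy state.

    encode_condition(prompt, x_t, t) -> condition   (stand-in: projector + frozen MLLM)
    predict_velocity(x_t, t, condition) -> velocity (stand-in: DiT)
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=shape)                 # start from pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        cond = encode_condition(prompt, x, t)  # recomputed every step from x
        x = x + dt * predict_velocity(x, t, cond)
    return x

# Toy stand-ins so the loop can run end to end.
calls = []
target = np.full((4, 8), 0.5)

def toy_encoder(prompt, x_t, t):
    calls.append(t)
    # The toy ignores the prompt text and returns the remaining offset to a
    # fixed target, so the condition depends on the current state x_t.
    return target - x_t

def toy_dit(x_t, t, cond):
    # Exact velocity for a straight-line path from noise to the target.
    return cond / (1.0 - t)

result = sample("a red cube", toy_encoder, toy_dit, shape=(4, 8), num_steps=8)
```

A learnable-query or text-embedding baseline would correspond to an `encode_condition` that ignores `x_t`: it can be called at every step, but the extra calls add compute without giving the denoiser any information about the current state.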
The next two comparisons ask what makes the MLLM prior useful for denoising: a representation space it can interpret, and a capacity allocation that lets that prior participate in the generation process.
RAEs help all methods, but RepFusion benefits most because the noisy representation is now an input the frozen MLLM can interpret.
Once the representation space is compatible, a frozen pretrained conditional encoder can beat spending nearly the whole parameter budget on newly initialized denoising modules.
Both ablation paths point to the same ingredients: a compatible representation space, noisy RAE latents as encoder input, and a preserved perception-pretrained MLLM backbone.
With only around 30M image-caption pairs, RepFusion achieves strong T2I prompt alignment across GenEval, GenEval++, GenEval2, and DPG-Bench.
| Methods | GenEval ↑ | GenEval++ ↑ | GenEval2 ↑ | DPG-Bench ↑ |
|---|---|---|---|---|
| Prior Work | | | | |
| Transfusion | 0.63 | – | – | – |
| MetaQuery-XL | 0.80† | – | – | 82.05 |
| BLIP-3o 8B | 0.84 | 0.307 | – | 81.60 |
| OmniGen2 | 0.80 | 0.325 | – | 83.57 |
| BAGEL | 0.82 | 0.371 | 23.1† | 84.03 |
| Scale-RAE | 0.83 | – | – | 79.70 |
| RepFusion | | | | |
| RepFusion w/ RAE Decoder | 0.73 | 0.432 | 30.2 | 82.75 |
| RepFusion w/ Diffusion Decoder | 0.78 | 0.443 | 29.9 | 84.41 |
| RepFusion-SFT w/ RAE Decoder | 0.85 | 0.707 | 35.1 | 84.17 |
| RepFusion-SFT w/ Diffusion Decoder | 0.87 | 0.669 | 34.9 | 85.11 |
Similar to learnable-query methods, RepFusion can leverage the capabilities of a frozen LLM to follow prompts requiring world knowledge and reasoning.
| Model | Cultural | Time | Space | Biology | Physics | Chemistry | Overall |
|---|---|---|---|---|---|---|---|
| MetaQuery-XL | 0.56 | 0.55 | 0.62 | 0.49 | 0.63 | 0.41 | 0.55 |
| BLIP-3o 8B | – | – | – | – | – | – | 0.62 |
| BAGEL | 0.44 | 0.55 | 0.68 | 0.44 | 0.60 | 0.39 | 0.52 |
| RepFusion-SFT w/ RAE Decoder | 0.55 | 0.53 | 0.70 | 0.51 | 0.57 | 0.41 | 0.55 |
| RepFusion-SFT w/ Diff. Decoder | 0.65 | 0.63 | 0.79 | 0.63 | 0.67 | 0.44 | 0.64 |
The gains above depend on the conditional encoder being able to interpret structured visual representations along the denoising trajectory. Replacing a language-only LLM with a perception-pretrained MLLM improves the denoising setup under matched denoiser and token budgets.
Preserve the perception prior. Fine-tuning helps when starting from a language-only LLM, but can degrade performance when the backbone is already multimodally pretrained. For RepFusion, freezing preserves that prior.
RepFusion has two scaling axes: the frozen MLLM that repeatedly reads evolving noisy representations, and the DiT denoiser that predicts the velocity.
The iso-FLOPs comparison answers a within-family question: among RepFusion variants, allocating more compute to the DiT is generally more favorable. The comparison with TextEmbed answers a different question. TextEmbed spends nearly all sampling compute on the DiT, but its condition is still a static text embedding. RepFusion remains stronger because part of its test-time compute goes to repeated MLLM conditioning over evolving noisy representations, giving the denoiser a changing, input-dependent condition.
| LLM Size | DiT Size | LLM FLOPs Share | DiT FLOPs Share | GenEval ↑ | GenEval++ ↑ | GenEval2 ↑ | DPG-Bench ↑ |
|---|---|---|---|---|---|---|---|
| ~280T inference FLOPs | | | | | | | |
| 1.0B | 3.2B | 26% | 74% | 0.70 | 0.289 | 31.18 | 82.68 |
| 3.0B | 1.3B | 71% | 28% | 0.67 | 0.282 | 26.97 | 81.31 |
| ~540T inference FLOPs | | | | | | | |
| 7.0B (TextEmbed) | 8.0B | 3% | 97% | 0.64 | 0.321 | 26.60 | 81.34 |
| 1.0B | 7.3B | 13% | 87% | 0.70 | 0.443 | 30.84 | 82.58 |
| 3.0B | 5.5B | 37% | 63% | 0.69 | 0.382 | 30.67 | 82.20 |
| 7.0B | 1.3B | 85% | 15% | 0.70 | 0.321 | 24.84 | 82.08 |
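For intuition about how a FLOPs split like the one in the table arises, the sketch below uses the common rule of thumb of roughly 2 FLOPs per parameter per token for a forward pass. It is a first-order estimate only: the step count, token counts, and the `flops_split` helper are assumptions for illustration, it ignores attention cost and the projector, and it is not the accounting used to produce the reported numbers.

```python
def forward_flops(params, tokens):
    """Rough forward-pass cost: about 2 FLOPs per parameter per token."""
    return 2.0 * params * tokens

def flops_split(llm_params, dit_params, steps, llm_tokens, dit_tokens, llm_calls=None):
    """Return (llm_share, dit_share) of total sampling FLOPs.

    llm_calls defaults to `steps` (encoder re-run at every denoising step);
    set llm_calls=1 for a one-shot text-embedding encoder.
    """
    if llm_calls is None:
        llm_calls = steps
    llm = llm_calls * forward_flops(llm_params, llm_tokens)
    dit = steps * forward_flops(dit_params, dit_tokens)
    total = llm + dit
    return llm / total, dit / total

# Hypothetical settings: 50 sampling steps, 256 tokens for both modules.
recomputed = flops_split(1.0e9, 3.2e9, steps=50, llm_tokens=256, dit_tokens=256)
one_shot = flops_split(7.0e9, 8.0e9, steps=50, llm_tokens=256, dit_tokens=256, llm_calls=1)
```

Under these assumptions, an encoder that runs at every step takes a share set mostly by the parameter ratio, while a one-shot encoder's share shrinks as the number of sampling steps grows, which is why a static text-embedding setup spends nearly all sampling compute on the DiT.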
RAEs make denoising latents semantic enough for pretrained multimodal priors to matter.
Repeated encoder compute only helps when the encoder sees the evolving noisy representation.
Preserving the multimodal perception prior can be better than joint optimization for generation.
DiT capacity remains valuable, while repeated MLLM conditioning is useful when it reads evolving noisy representations.
RepFusion turns the conditional encoder from a static prompt reader into an active denoising-time module. The broader message is simple: once the generation space is semantic enough, frozen MLLMs can contribute as priors over evolving visual representations, not just as text encoders.
@article{pan2026repfusion,
title={RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space},
author={Pan, Xichen and Singh, Aashu and Shukla, Satya Narayan and Fan, Xiangjun and Mishra, Shlok Kumar and Xie, Saining},
journal={arXiv preprint arXiv:2606.14700},
year={2026}
}