Problem

LLMs are underused in text-to-image (T2I) generation

They are often large, frozen, and knowledgeable, yet they are reduced to one-shot text-embedding modules.

Opportunity

Representations are readable

Representation-space diffusion creates latent spaces that are much closer to what MLLMs already understand.

Mechanism

Condition on evolving noisy representations

MLLMs read noisy representations at each step, so the condition changes with the denoising trajectory.

Evidence

Pretrained priors can outperform newly initialized denoisers

At matched inference FLOPs, a frozen MLLM reading noisy representations plus a small DiT beats baselines that spend the budget on larger newly initialized denoisers.

Method
Let the MLLM Read the Noisy Representation

VAE latents are optimized for reconstruction, whereas representation autoencoder (RAE) latents preserve much richer visual semantics. This gives the generator a denoising space closer to the feature spaces that multimodal LLMs (MLLMs) already use. RepFusion reuses the MLP projector that aligns visual representations with an LLM and applies it to encode noisy representations during generation. At each denoising step, the frozen MLLM processes the prompt together with the current noisy representation, and its outputs condition a diffusion transformer (DiT) denoiser.
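One way to write a single denoising step in symbols is given below. The notation is an assumption made here for illustration and is not taken from the paper: $y$ is the caption, $z_t$ the noisy representation at step $t$, $P_\phi$ the projector, $M$ the frozen MLLM, and $D_\theta$ the DiT denoiser.

$$
h_t = M\big(y,\ P_\phi(z_t)\big), \qquad \hat{o}_t = D_\theta\big(z_t,\ t,\ h_t\big)
$$

Here $h_t$ is the conditioning signal and $\hat{o}_t$ the denoiser output. Only $\phi$ and $\theta$ would be trained in this reading, while $M$ stays fixed, and $h_t$ depends on $z_t$, so it has to be recomputed whenever $z_t$ changes.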

RepFusion architecture — **Figure 1:** Overview of RepFusion. Blue modules are frozen; red modules are trainable. A noisy representation is projected into the MLLM input space, the frozen MLLM reads it together with the caption, and its outputs condition the DiT.

The architecture stays simple: train the projector and DiT, keep the LLM backbone frozen, and recompute the conditioning signal as the noisy representation evolves.

Because the selected MLLM hidden states are token-aligned with DiT tokens, the MLLM output can modulate corresponding DiT tokens through AdaLN.
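The token-wise modulation can be sketched in a few lines of plain Python. This is a minimal illustration and not the authors' implementation: the `x * (1 + scale) + shift` form is the common AdaLN convention and is assumed here, and `scales` and `shifts` are hypothetical stand-ins for values that a learned projection of each MLLM hidden state would produce.

```python
import math


def layer_norm(token, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]


def tokenwise_adaln(dit_tokens, scales, shifts):
    """Modulate each DiT token with the (scale, shift) of its aligned MLLM token.

    Assumption: scales[i] and shifts[i] were derived from the MLLM hidden state
    aligned with dit_tokens[i]; here they are passed in directly.
    """
    if not (len(dit_tokens) == len(scales) == len(shifts)):
        raise ValueError("token alignment needs one (scale, shift) pair per DiT token")
    out = []
    for token, scale, shift in zip(dit_tokens, scales, shifts):
        normed = layer_norm(token)
        out.append([n * (1.0 + s) + b for n, s, b in zip(normed, scale, shift)])
    return out


tokens = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]]
zeros = [[0.0] * 4, [0.0] * 4]
# Shift only the first token: the second token is left purely normalized.
modulated = tokenwise_adaln(tokens, zeros, [[1.0] * 4, [0.0] * 4])
```

Because each DiT token gets its own scale and shift, the condition can differ across spatial positions, unlike a single global AdaLN vector shared by all tokens.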

MetaQuery vs RepFusion conditioning — **Figure 2:** A new axis for test-time scaling. MetaQuery (left) runs the conditional encoder once and reuses a static condition across denoising steps. RepFusion (right) feeds evolving noisy representations into the MLLM, making the conditioning signal change along the denoising trajectory and making per-step MLLM recomputation useful.

The important distinction: recomputation only helps when the encoder sees the evolving noisy representation. The learnable-query baseline reaches 0.55 on GenEval; making its queries timestep-dependent, which matches RepFusion's inference FLOPs, yields 0.54. RepFusion reaches 0.70 because the MLLM observes the current denoising state.
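The difference in compute pattern between a static condition and a state-reading condition can be illustrated with a toy sampler. Everything below is a hypothetical stand-in: the scalar state, the Euler update, the t=0 (noise) to t=1 (data) convention, and the encoder and velocity functions are assumptions chosen to keep the sketch runnable, not the paper's sampler or models.

```python
PROMPT_FEATURE = 2.0  # toy stand-in for what the encoder extracts from the caption


def encode_prompt_only(x, t):
    """Static-condition stand-in: ignores the noisy state entirely."""
    return (PROMPT_FEATURE, None)


def encode_prompt_and_state(x, t):
    """State-reading stand-in: the condition also carries the current noisy state."""
    return (PROMPT_FEATURE, x)


def toy_velocity(x, t, condition):
    """Toy denoiser: straight-line velocity toward the prompt feature."""
    target, _state = condition
    return (target - x) / (1.0 - t)


def sample(x0, num_steps, encode, velocity, recompute_each_step):
    """Euler sampler; returns the final state, encoder call count, and conditions used."""
    x, dt = x0, 1.0 / num_steps
    cached, encoder_calls, conditions = None, 0, []
    for step in range(num_steps):
        t = step * dt
        if recompute_each_step or cached is None:
            cached = encode(x, t)
            encoder_calls += 1
        conditions.append(cached)
        x = x + dt * velocity(x, t, cached)
    return x, encoder_calls, conditions


_, static_calls, static_conds = sample(0.0, 4, encode_prompt_only, toy_velocity, False)
_, evolving_calls, evolving_conds = sample(0.0, 4, encode_prompt_and_state, toy_velocity, True)
print(static_calls, len(set(static_conds)))      # one encoder call, one distinct condition
print(evolving_calls, len(set(evolving_conds)))  # one call per step, a new condition each step
```

The toy velocity ignores the state part of the condition, so both runs land on the same value. The sketch only shows where the extra encoder compute goes and that the condition changes per step, not the quality gain reported above.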

Key Evidence
Why the Representation Input Matters

The next two comparisons ask what makes the MLLM prior useful for denoising: a representation space it can interpret, and a capacity allocation that lets that prior participate in the generation process.

RepFusion motivation figure — **Figure 3:** Moving from VAEs to RAEs unlocks pretrained multimodal priors. The RAE transition separates RepFusion from static text-embedding and unified-architecture baselines, suggesting that the representation space is what lets the MLLM prior become useful.

RAEs help all methods, but RepFusion benefits most because the noisy representation is now an input the frozen MLLM can interpret.

RepFusion parameter efficiency — **Figure 4:** Capacity allocation under similar inference FLOPs. Circle diameter denotes total parameters; the inner disk denotes trainable parameters. TextEmbed uses a 7B frozen MLLM text encoder with an 8B DiT; Transfusion uses an 8B joint denoising transformer; RepFusion uses the same 7B frozen MLLM with a 1.3B DiT.

Once the representation space is compatible, a frozen pretrained conditional encoder can beat spending nearly the whole parameter budget on newly initialized denoising modules.
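The capacity split in Figure 4 can be made concrete with a small calculation. The parameter counts come from the caption; treating the Transfusion transformer as fully trainable and ignoring the projector (whose size is not stated here) are assumptions of this sketch.

```python
def trainable_fraction(frozen_b, trainable_b):
    """Share of total parameters that are trained (both arguments in billions)."""
    return trainable_b / (frozen_b + trainable_b)


# (frozen, trainable) parameter counts in billions, read off the Figure 4 caption.
configs = {
    "TextEmbed": (7.0, 8.0),    # frozen MLLM text encoder + DiT
    "Transfusion": (0.0, 8.0),  # joint denoising transformer, assumed fully trainable
    "RepFusion": (7.0, 1.3),    # frozen MLLM + DiT, projector ignored
}

for name, (frozen_b, trainable_b) in configs.items():
    total = frozen_b + trainable_b
    share = trainable_fraction(frozen_b, trainable_b)
    print(f"{name}: {total:.1f}B total, {share:.0%} trainable")
```

Under these assumptions, RepFusion trains about 16% of its parameters, compared with roughly half for TextEmbed and all of them for Transfusion.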

Putting the Pieces Together

Both ablation paths point to the same ingredients: a semantic representation space, noisy RAE latents as input to the MLLM, and a preserved perception-pretrained MLLM backbone.

Roadmap from TextEmbed — **Figure 5:** Roadmap of building RepFusion. Top: path from TextEmbed. Bottom: path from Transfusion. Bars show GenEval scores; a hatched bar means the modification is not adopted in the final model.


T2I Results
Text-to-Image Generation

With only around 30M image-caption pairs, RepFusion achieves strong T2I prompt alignment across GenEval, GenEval++, GenEval2, and DPG-Bench.

Prompt alignment. We evaluate the largest configuration, a 7B MLLM with a 3.2B DiT, on standard prompt-alignment benchmarks.

Robust evaluation. GenEval2 uses Soft-TIFA to reduce benchmark drift from synthetic-data SFT and benchmark-specific optimization.

Reasoning prompts. WISE evaluates world-knowledge reasoning capability.

| Methods | GenEval ↑ | GenEval++ ↑ | GenEval2 ↑ | DPG-Bench ↑ |
| --- | --- | --- | --- | --- |
| **Prior Work** | | | | |
| Transfusion | 0.63 | – | – | – |
| MetaQuery-XL | 0.80† | – | – | 82.05 |
| BLIP-3o 8B | 0.84 | 0.307 | – | 81.60 |
| OmniGen2 | 0.80 | 0.325 | – | 83.57 |
| BAGEL | 0.82 | 0.371 | 23.1† | 84.03 |
| Scale-RAE | 0.83 | – | – | 79.70 |
| **RepFusion** | | | | |
| RepFusion w/ RAE Decoder | 0.73 | 0.432 | 30.2 | 82.75 |
| RepFusion w/ Diffusion Decoder | 0.78 | 0.443 | 29.9 | 84.41 |
| RepFusion-SFT w/ RAE Decoder | 0.85 | 0.707 | 35.1 | 84.17 |
| RepFusion-SFT w/ Diffusion Decoder | 0.87 | 0.669 | 34.9 | 85.11 |

Table 1: Text-to-image generation results. † denotes rewritten prompts. For GenEval2, we report the prompt-level Soft-TIFA_GM metric.

RepFusion text-to-image samples — **Figure 6:** Visual samples of text-to-image generation. RepFusion follows prompts involving fine-grained attributes, object relations, camera motion, and rendered text.

Reasoning-Based Generation

Similar to learnable-query methods, RepFusion can leverage the capabilities of a frozen LLM to follow prompts requiring world knowledge and reasoning.

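| Model | Cultural | Time | Space | Biology | Physics | Chemistry | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MetaQuery-XL | 0.56 | 0.55 | 0.62 | 0.49 | 0.63 | 0.41 | 0.55 |
| BLIP-3o 8B | – | – | – | – | – | – | 0.62 |
| BAGEL | 0.44 | 0.55 | 0.68 | 0.44 | 0.60 | 0.39 | 0.52 |
| RepFusion-SFT w/ RAE Decoder | 0.55 | 0.53 | 0.70 | 0.51 | 0.57 | 0.41 | 0.55 |
| RepFusion-SFT w/ Diff. Decoder | 0.65 | 0.63 | 0.79 | 0.63 | 0.67 | 0.44 | 0.64 |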

Table 2: Reasoning-based generation on WISE.

Analysis & Scaling
Multimodal Perception Pretraining

The gains above depend on the conditional encoder being able to interpret structured visual representations along the denoising trajectory. Replacing a language-only LLM with a perception-pretrained MLLM improves the denoising setup under matched denoiser and token budgets.

Effect of perception pretraining — **Figure 7:** Effect of multimodal perception pretraining. Replacing a language-only LLM with a perception-pretrained MLLM improves both Transfusion-RAE and RepFusion, indicating that perception pretraining is a transferable prior for diffusion in representation space.

Preserve the perception prior. Fine-tuning helps when starting from a language-only LLM, but can degrade performance when the backbone is already multimodally pretrained. For RepFusion, freezing preserves that prior.

Effect of freezing vs fine-tuning the LLM — **Figure 8:** Freezing vs. fine-tuning the LLM backbone. Fine-tuning helps a language-only LLM, but freezing is better for a perception-pretrained MLLM in RepFusion-RAE.

Scaling
Scaling Behavior & Compute Allocation

RepFusion has two scaling axes: the frozen MLLM that repeatedly reads evolving noisy representations, and the DiT denoiser that predicts the velocity.

**Figure 9:** MLLM and DiT co-scaling. RepFusion benefits from scaling both axes, with the clearest trends on GenEval and GenEval++.

The iso-FLOPs comparison answers a within-family question: among RepFusion variants, allocating more compute to the DiT is generally more favorable. The comparison with TextEmbed answers a different question. TextEmbed spends nearly all sampling compute on the DiT, but its condition is still a static text embedding. RepFusion remains stronger because part of its test-time compute goes to repeated MLLM conditioning over evolving noisy representations, giving the denoiser a changing, input-dependent condition.

| LLM Size | DiT Size | FLOPs Split (LLM \| DiT) | GenEval ↑ | GenEval++ ↑ | GenEval2 ↑ | DPG-Bench ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| **~280T inference FLOPs** | | | | | | |
| 1.0B | 3.2B | 26% \| 74% | 0.70 | 0.289 | 31.18 | 82.68 |
| 3.0B | 1.3B | 71% \| 28% | 0.67 | 0.282 | 26.97 | 81.31 |
| **~540T inference FLOPs** | | | | | | |
| 7.0B (TextEmbed) | 8.0B | 3% \| 97% | 0.64 | 0.321 | 26.60 | 81.34 |
| 1.0B | 7.3B | 13% \| 87% | 0.70 | 0.443 | 30.84 | 82.58 |
| 3.0B | 5.5B | 37% \| 63% | 0.69 | 0.382 | 30.67 | 82.20 |
| 7.0B | 1.3B | 85% \| 15% | 0.70 | 0.321 | 24.84 | 82.08 |

Table 3: Iso-FLOPs comparison of compute allocation between the MLLM and DiT within RepFusion, with TextEmbed as a reference baseline.
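To build intuition for how such a split arises, the sketch below estimates each module's share of sampling compute. It is a back-of-the-envelope approximation and not how the table was computed: it assumes forward FLOPs of about 2 × parameters × tokens per call, ignores attention cost and caption tokens, and uses a hypothetical 30 sampling steps and 256 tokens.

```python
def flops_split(llm_params, dit_params, llm_calls, dit_calls, llm_tokens, dit_tokens):
    """Return (llm_share, dit_share) of total forward FLOPs.

    Assumption: each forward call costs about 2 * params * tokens FLOPs.
    """
    llm = 2.0 * llm_params * llm_tokens * llm_calls
    dit = 2.0 * dit_params * dit_tokens * dit_calls
    total = llm + dit
    return llm / total, dit / total


STEPS, TOKENS = 30, 256  # hypothetical values, not taken from the paper

# Per-step conditioning: the MLLM runs at every denoising step alongside the DiT.
per_step = flops_split(1.0e9, 3.2e9, STEPS, STEPS, TOKENS, TOKENS)

# Static conditioning: the encoder runs once, the DiT runs at every step.
static = flops_split(7.0e9, 8.0e9, 1, STEPS, TOKENS, TOKENS)

print(f"per-step 1.0B + 3.2B: LLM {per_step[0]:.0%}, DiT {per_step[1]:.0%}")
print(f"static   7.0B + 8.0B: LLM {static[0]:.0%}, DiT {static[1]:.0%}")
```

With equal token counts and per-step recomputation, the split reduces to the parameter ratio, which is why a 7B MLLM paired with a 1.3B DiT spends most of its sampling compute on conditioning, while a static encoder of the same size accounts for only a few percent.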

Takeaways
The Design Principle

Use a representation space the MLLM can understand

RAEs make denoising latents semantic enough for pretrained multimodal priors to matter.

Make the condition input-dependent

Repeated encoder compute only helps when the encoder sees the evolving noisy representation.

Preserve strong perception priors

Preserving the multimodal perception prior can be better than joint optimization for generation.

Scale the encoder and denoiser together

DiT capacity remains valuable, while repeated MLLM conditioning is useful when it reads evolving noisy representations.

RepFusion turns the conditional encoder from a static prompt reader into an active denoising-time module. The broader message is simple: once the generation space is semantic enough, frozen MLLMs can contribute as priors over evolving visual representations, not just as text encoders.

BibTeX

@article{pan2026repfusion,
  title={RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space},
  author={Pan, Xichen and Singh, Aashu and Shukla, Satya Narayan and Fan, Xiangjun and Mishra, Shlok Kumar and Xie, Saining},
  journal={arXiv preprint arXiv:2606.14700},
  year={2026}
}

RepFusion

Leveraging Multimodal Priors for Denoising in Representation Space