RepFusion

Leveraging Multimodal Priors for Denoising in Representation Space

Introducing RepFusion, a text-to-image model that repurposes frozen multimodal LLMs (MLLMs) as noisy latent encoders for denoising in representation space.

Representation-space generation: Representation autoencoders (RAEs) move denoising into latent spaces that are much closer to what MLLMs already understand.
From clean perception to noisy encoding: RepFusion repurposes a frozen MLLM to encode noisy visual representations, then uses its outputs to condition a diffusion transformer (DiT) denoiser.
Test-time compute scaling: Repeated MLLM conditioning becomes useful because the conditioning signal evolves along the denoising trajectory.
Problem

LLMs are underused in T2I

They are often large, frozen, and knowledgeable, but reduced to one-shot text embedding modules.

Opportunity

Representations are readable

Representation-space diffusion creates latent spaces that are much closer to what MLLMs already understand.

Mechanism

Condition on evolving noisy representations

MLLMs read noisy representations at each step, so the condition changes with the denoising trajectory.

Evidence

Pretrained priors can outperform newly initialized denoisers

At matched inference FLOPs, a frozen MLLM reading noisy representations plus a small DiT beats baselines that spend the budget on larger newly initialized denoisers.



Method
Let the MLLM Read the Noisy Representation

VAE latents are optimized for reconstruction, while RAE representations preserve much richer visual semantics. This gives the generator a denoising space closer to the feature spaces used by MLLMs. RepFusion reuses the MLP projector that aligns visual representations with an LLM, and applies it to encode noisy representations during generation. The frozen MLLM processes the prompt and current noisy representation, then its outputs condition a DiT denoiser.

Figure 1: Overview of RepFusion. Blue modules are frozen; red modules are trainable. A noisy representation is projected into the MLLM input space, the frozen MLLM reads it together with the caption, and its outputs condition the DiT.

The architecture stays simple: train the projector and DiT, keep the LLM backbone frozen, and recompute the conditioning signal as the noisy representation evolves.

Because the selected MLLM hidden states are token-aligned with DiT tokens, the MLLM output can modulate corresponding DiT tokens through AdaLN.
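The token-aligned modulation described above can be sketched in a few lines. This is a minimal pure-Python illustration, assuming the common AdaLN form `(1 + scale) * LayerNorm(x) + shift`. The function names, the linear maps `w_scale` and `w_shift`, and the toy dimensions are hypothetical and are not taken from the RepFusion implementation.

```python
import math

def layer_norm(token, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]

def linear(vec, weight):
    """Apply a weight matrix given as a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def tokenwise_adaln(dit_tokens, mllm_states, w_scale, w_shift):
    """Modulate each DiT token with a scale and shift from its aligned MLLM hidden state."""
    assert len(dit_tokens) == len(mllm_states), "tokens must be aligned one-to-one"
    out = []
    for x, h in zip(dit_tokens, mllm_states):
        scale = linear(h, w_scale)  # per-token scale, one value per DiT channel
        shift = linear(h, w_shift)  # per-token shift, one value per DiT channel
        normed = layer_norm(x)
        out.append([(1.0 + s) * n + b for s, n, b in zip(scale, normed, shift)])
    return out

# Toy example: 2 tokens, DiT width 2, MLLM width 3 (all sizes are hypothetical).
dit_tokens = [[1.0, 3.0], [2.0, 2.0]]
mllm_states = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_scale = [[0.5, 0.0, 0.0], [0.5, 0.0, 0.0]]  # only the first MLLM feature drives scale
w_shift = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]  # only the second MLLM feature drives shift
modulated = tokenwise_adaln(dit_tokens, mllm_states, w_scale, w_shift)
```

In this toy setup the two tokens receive different modulation because their aligned hidden states differ, which is the property that a single pooled conditioning vector would not provide.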

Figure 2: A new axis for test-time scaling. MetaQuery (left) runs the conditional encoder once and reuses a static condition across denoising steps. RepFusion (right) feeds evolving noisy representations into the MLLM, making the conditioning signal change along the denoising trajectory and making per-step MLLM recomputation useful.

The important distinction: recomputation only helps when the encoder sees the evolving noisy representation. The learnable-query baseline reaches 0.55 GenEval; making the queries timestep-dependent to match RepFusion's inference FLOPs reaches 0.54. RepFusion reaches 0.70 because the MLLM observes the current denoising state.
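The difference between a static and an evolving condition can be made concrete with a toy sampling loop. The sketch below is illustrative only: `ToyEncoder` and `toy_velocity` are hypothetical stand-ins for the frozen MLLM and the DiT, the Euler update over a predicted velocity is an assumed sampler, and the numbers carry no meaning beyond showing how often the encoder runs and what it sees.

```python
class ToyEncoder:
    """Hypothetical stand-in for the frozen MLLM; counts its forward passes."""

    def __init__(self, dim):
        self.dim = dim
        self.calls = 0

    def __call__(self, prompt, noisy=None):
        self.calls += 1
        base = 0.1 * len(prompt.split())
        if noisy is None:
            # Prompt-only conditioning: the output cannot change during sampling.
            return [base] * self.dim
        # Conditioning that depends on the current noisy representation.
        return [base + 0.5 * z for z in noisy]


def toy_velocity(z, t, cond):
    """Hypothetical stand-in for the DiT: a velocity pointing from z toward cond."""
    return [c - zi for zi, c in zip(z, cond)]


def sample(cond_fn, z_init, num_steps):
    """Euler integration of the predicted velocity; cond_fn supplies the condition."""
    z = list(z_init)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        cond = cond_fn(z)
        v = toy_velocity(z, t, cond)
        z = [zi + dt * vi for zi, vi in zip(z, v)]
    return z


prompt = "a red cube"
z0 = [1.0, -1.0]
steps = 8

# Static conditioning: encode the prompt once and reuse it at every step.
static_encoder = ToyEncoder(dim=len(z0))
static_cond = static_encoder(prompt)
z_static = sample(lambda z: static_cond, z0, steps)

# Evolving conditioning: re-encode the prompt plus the current noisy state at every step.
evolving_encoder = ToyEncoder(dim=len(z0))
z_evolving = sample(lambda z: evolving_encoder(prompt, z), z0, steps)

print(static_encoder.calls, evolving_encoder.calls)  # prints: 1 8
```

Re-running the static encoder at every step would raise its call count without changing its output, since its input never changes. That is the toy analogue of why extra encoder compute only pays off when the encoder reads the evolving noisy representation.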

Key Evidence
Why the Representation Input Matters

The next two comparisons ask what makes the MLLM prior useful for denoising: a representation space it can interpret, and a capacity allocation that lets that prior participate in the generation process.

Figure 3: Moving from VAEs to RAEs unlocks pretrained multimodal priors. The RAE transition separates RepFusion from static text-embedding and unified-architecture baselines, suggesting that the representation space is what lets the MLLM prior become useful.

RAEs help all methods, but RepFusion benefits most because the noisy representation is now an input the frozen MLLM can interpret.

Figure 4: Capacity allocation under similar inference FLOPs. Circle diameter denotes total parameters; the inner disk denotes trainable parameters. TextEmbed uses a 7B frozen MLLM text encoder with an 8B DiT; Transfusion uses an 8B joint denoising transformer; RepFusion uses the same 7B frozen MLLM with a 1.3B DiT.

Once the representation space is compatible, a frozen pretrained conditional encoder can beat spending nearly the whole parameter budget on newly initialized denoising modules.
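The capacity allocation in Figure 4 can be tallied with simple arithmetic. The sketch below uses only the sizes quoted in the caption. It ignores the projector and any other small modules, whose sizes are not stated here, and it assumes the Transfusion transformer is fully trainable, so the trainable fractions are approximations.

```python
# Sizes in billions of parameters, as quoted in the Figure 4 caption.
# Assumption: frozen modules contribute no trainable parameters, and the
# projector and other small modules are ignored because their sizes are not given.
configs = {
    "TextEmbed":   {"frozen": 7.0, "trainable": 8.0},
    "Transfusion": {"frozen": 0.0, "trainable": 8.0},  # assumed fully trainable
    "RepFusion":   {"frozen": 7.0, "trainable": 1.3},
}

def summarize(cfg):
    """Return total size (billions) and the fraction of parameters that are trained."""
    total = cfg["frozen"] + cfg["trainable"]
    return {"total_b": total, "trainable_frac": cfg["trainable"] / total}

summary = {name: summarize(cfg) for name, cfg in configs.items()}
for name, s in summary.items():
    print(f"{name}: {s['total_b']:.1f}B total, {s['trainable_frac']:.0%} trainable")
```

Under these assumptions RepFusion and Transfusion have similar total sizes (8.3B and 8.0B), but RepFusion trains only about one sixth of its parameters.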

Putting the Pieces Together

Both ablation paths point to the same ingredients: a representation space, noisy RAE latents as input, and a preserved perception-pretrained MLLM backbone.

Figure 5: Roadmap of building RepFusion. Top: path from TextEmbed. Bottom: path from Transfusion. Bars show GenEval scores; a hatched bar means the modification is not adopted in the final model.

T2I Results
Text-to-Image Generation

With only around 30M image-caption pairs, RepFusion achieves strong T2I prompt alignment across GenEval, GenEval++, GenEval2, and DPG-Bench.

Prompt alignment: We evaluate the largest configuration, a 7B MLLM with a 3.2B DiT, on standard prompt-alignment benchmarks.
Robust evaluation: GenEval2 uses Soft-TIFA to reduce benchmark drift from synthetic-data SFT and benchmark-specific optimization.
Reasoning prompts: WISE evaluates world-knowledge reasoning capability.
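
| Method | GenEval ↑ | GenEval++ ↑ | GenEval2 ↑ | DPG-Bench ↑ |
|---|---|---|---|---|
| Prior Work | | | | |
| Transfusion | 0.63 | - | - | - |
| MetaQuery-XL | 0.80 | - | - | 82.05 |
| BLIP-3o 8B | 0.84 | 0.307 | - | 81.60 |
| OmniGen2 | 0.80 | 0.325 | - | 83.57 |
| BAGEL | 0.82 | 0.371 | 23.1 | 84.03 |
| Scale-RAE | 0.83 | - | - | 79.70 |
| RepFusion | | | | |
| RepFusion w/ RAE Decoder | 0.73 | 0.432 | 30.2 | 82.75 |
| RepFusion w/ Diffusion Decoder | 0.78 | 0.443 | 29.9 | 84.41 |
| RepFusion-SFT w/ RAE Decoder | 0.85 | 0.707 | 35.1 | 84.17 |
| RepFusion-SFT w/ Diffusion Decoder | 0.87 | 0.669 | 34.9 | 85.11 |

Table 1: Text-to-image generation results. Some entries use rewritten prompts, indicated by a marker in the original table. For GenEval2, we report the prompt-level Soft-TIFA_GM metric.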
Figure 6: Visual samples of text-to-image generation. RepFusion follows prompts involving fine-grained attributes, object relations, camera motion, and rendered text.

Reasoning-Based Generation

Similar to learnable-query methods, RepFusion can leverage the capabilities of a frozen LLM to follow prompts requiring world knowledge and reasoning.

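| Model | Cultural | Time | Space | Biology | Physics | Chemistry | Overall |
|---|---|---|---|---|---|---|---|
| MetaQuery-XL | 0.56 | 0.55 | 0.62 | 0.49 | 0.63 | 0.41 | 0.55 |
| BLIP-3o 8B | - | - | - | - | - | - | 0.62 |
| BAGEL | 0.44 | 0.55 | 0.68 | 0.44 | 0.60 | 0.39 | 0.52 |
| RepFusion-SFT w/ RAE Decoder | 0.55 | 0.53 | 0.70 | 0.51 | 0.57 | 0.41 | 0.55 |
| RepFusion-SFT w/ Diff. Decoder | 0.65 | 0.63 | 0.79 | 0.63 | 0.67 | 0.44 | 0.64 |

Table 2: Reasoning-based generation on WISE.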

Analysis & Scaling
Multimodal Perception Pretraining

The gains above depend on the conditional encoder being able to interpret structured visual representations along the denoising trajectory. Replacing a language-only LLM with a perception-pretrained MLLM improves the denoising setup under matched denoiser and token budgets.

Figure 7: Effect of multimodal perception pretraining. Replacing a language-only LLM with a perception-pretrained MLLM improves both Transfusion-RAE and RepFusion, indicating that perception pretraining is a transferable prior for diffusion in representation space.

Preserve the perception prior. Fine-tuning helps when starting from a language-only LLM, but can degrade performance when the backbone is already multimodally pretrained. For RepFusion, freezing preserves that prior.

Figure 8: Freezing vs. fine-tuning the LLM backbone. Fine-tuning helps a language-only LLM, but freezing is better for a perception-pretrained MLLM in RepFusion-RAE.

Scaling
Scaling Behavior & Compute Allocation

RepFusion has two scaling axes: the frozen MLLM that repeatedly reads evolving noisy representations, and the DiT denoiser that predicts the velocity.

Figure 9: MLLM and DiT co-scaling. RepFusion benefits from scaling both axes, with the clearest trends on GenEval and GenEval++.

The iso-FLOPs comparison answers a within-family question: among RepFusion variants, allocating more compute to the DiT is generally more favorable. The comparison with TextEmbed answers a different question. TextEmbed spends nearly all sampling compute on the DiT, but its condition is still a static text embedding. RepFusion remains stronger because part of its test-time compute goes to repeated MLLM conditioning over evolving noisy representations, giving the denoiser a changing, input-dependent condition.

| LLM Size | DiT Size | FLOPs Split (LLM / DiT) | GenEval ↑ | GenEval++ ↑ | GenEval2 ↑ | DPG-Bench ↑ |
|---|---|---|---|---|---|---|
| ~280T inference FLOPs | | | | | | |
| 1.0B | 3.2B | 26% / 74% | 0.70 | 0.289 | 31.18 | 82.68 |
| 3.0B | 1.3B | 71% / 28% | 0.67 | 0.282 | 26.97 | 81.31 |
| ~540T inference FLOPs | | | | | | |
| 7.0B (TextEmbed) | 8.0B | 3% / 97% | 0.64 | 0.321 | 26.60 | 81.34 |
| 1.0B | 7.3B | 13% / 87% | 0.70 | 0.443 | 30.84 | 82.58 |
| 3.0B | 5.5B | 37% / 63% | 0.69 | 0.382 | 30.67 | 82.20 |
| 7.0B | 1.3B | 85% / 15% | 0.70 | 0.321 | 24.84 | 82.08 |

Table 3: Iso-FLOPs comparison of compute allocation between the MLLM and DiT within RepFusion, with TextEmbed as a reference baseline.
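
To read the splits in Table 3 as absolute compute, the percentages can be multiplied by the approximate budget of each group. The sketch below does only that arithmetic. The row values are copied from the table, the budgets are the rounded ~280T and ~540T figures, and the helper name `absolute_split` is hypothetical.

```python
# Rows from Table 3: (budget in TFLOPs, LLM size, DiT size, LLM share %, DiT share %).
rows = [
    (280, "1.0B", "3.2B", 26, 74),
    (280, "3.0B", "1.3B", 71, 28),
    (540, "7.0B (TextEmbed)", "8.0B", 3, 97),
    (540, "1.0B", "7.3B", 13, 87),
    (540, "3.0B", "5.5B", 37, 63),
    (540, "7.0B", "1.3B", 85, 15),
]

def absolute_split(budget_tflops, llm_pct, dit_pct):
    """Convert a percentage split into approximate TFLOPs for the LLM and the DiT."""
    return budget_tflops * llm_pct / 100, budget_tflops * dit_pct / 100

for budget, llm, dit, llm_pct, dit_pct in rows:
    llm_tf, dit_tf = absolute_split(budget, llm_pct, dit_pct)
    print(f"{budget}T  LLM {llm:<17} {llm_tf:6.1f}T   DiT {dit:<5} {dit_tf:6.1f}T")
```

Read this way, the same 7B MLLM accounts for roughly 16T of the ~540T budget in TextEmbed but roughly 459T in RepFusion, which reflects how much of the test-time compute goes to repeated MLLM conditioning.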

Takeaways
The Design Principle

Use a representation space the MLLM can understand

RAEs make denoising latents semantic enough for pretrained multimodal priors to matter.

Make the condition input-dependent

Repeated encoder compute only helps when the encoder sees the evolving noisy representation.

Preserve strong perception priors

Preserving the multimodal perception prior can be better than joint optimization for generation.

Scale the encoder and denoiser together

DiT capacity remains valuable, while repeated MLLM conditioning is useful when it reads evolving noisy representations.

RepFusion turns the conditional encoder from a static prompt reader into an active denoising-time module. The broader message is simple: once the generation space is semantic enough, frozen MLLMs can contribute as priors over evolving visual representations, not just as text encoders.

BibTeX

@article{pan2026repfusion,
  title={RepFusion: Leveraging Multimodal Priors for Denoising in Representation Space},
  author={Pan, Xichen and Singh, Aashu and Shukla, Satya Narayan and Fan, Xiangjun and Mishra, Shlok Kumar and Xie, Saining},
  journal={arXiv preprint arXiv:2606.14700},
  year={2026}
}