Introducing MetaQuery, a minimal recipe for building state-of-the-art unified multimodal understanding (text output) and generation (pixel output) models:
We introduce MetaQueries, a set of learnable queries that efficiently connect autoregressive multimodal LLMs (MLLMs) with diffusion-based image generators. This simple approach enables the generative decoder to directly leverage the rich semantic understanding, reasoning capabilities, and world knowledge inherent in MLLMs for knowledge-augmented image generation.
This blog post is structured around three key components.
A british shorthair wearing sunglasses
An old rusted robot wearing pants and a jacket riding skis in a supermarket.
A giant humanoid, made of fluffy blue cotton candy, stomping on the ground, and roaring to the sky, clear blue sky behind them.
The word 'START' written on a street surface.
Close-up of a bright blue parrot's feathers glittering in the light, showing its unique plumage and vibrant colors.
A sunken ship at the bottom of the ocean.
Top view of the same berry bowl
The same robot in Minecraft
The same model but a real one in New York city
The skyline view of the city from this building
The statue in the same city
A logo for the same teapot
The national flag of the country where Yellowstone National Park is located.
The animal associated with having (2+7) lives.
The flower celebrated in spring festivals in the country where sushi originated.
The tallest building dominates the skyline of the city known as the City of Light.
A phone with a drained battery.
A night sky on a new moon night.
Add a chef hat to the dog
Replace the dog with a golden retriever
Change to cartoon style
Change it into lineart style
Change the bird to a blue one
Replace the fries with salad
MetaQuery bridges frozen MLLMs with diffusion models. Randomly initialized learnable queries extract the conditioning signals for generation from the MLLM. For simplicity and compatibility, we keep causal masking over the entire sequence. The resulting conditions are fed into a trainable connector that aligns them with the input space of text-to-image diffusion models. The whole model is trained with the original generation objective on paired data.
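To make this concrete, below is a minimal PyTorch sketch of the idea. It is not the authors' implementation: the class name and hyperparameters are illustrative, and it assumes a HuggingFace-style MLLM backbone that accepts `inputs_embeds` and returns `last_hidden_state`.

```python
import torch
import torch.nn as nn

class MetaQueryBridge(nn.Module):
    """Minimal sketch: learnable queries bridge a frozen MLLM and a diffusion decoder.

    `mllm` is assumed to be a HuggingFace-style transformer backbone that accepts
    `inputs_embeds` and returns `last_hidden_state`; `mllm_dim` is its hidden size
    and `cond_dim` is the conditioning dimension expected by the diffusion decoder.
    """

    def __init__(self, mllm, mllm_dim: int, cond_dim: int,
                 num_queries: int = 64, connector_layers: int = 24):
        super().__init__()
        self.mllm = mllm
        for p in self.mllm.parameters():  # keep the MLLM backbone frozen
            p.requires_grad_(False)

        # Randomly initialized learnable queries, appended after the prompt tokens.
        self.queries = nn.Parameter(torch.randn(num_queries, mllm_dim) * 0.02)

        # Enc-Proj connector: a transformer encoder at the MLLM width,
        # followed by a projection into the diffusion decoder's input space.
        layer = nn.TransformerEncoderLayer(d_model=mllm_dim, nhead=8, batch_first=True)
        self.connector = nn.Sequential(
            nn.TransformerEncoder(layer, num_layers=connector_layers),
            nn.Linear(mllm_dim, cond_dim),
        )

    def forward(self, prompt_embeds: torch.Tensor) -> torch.Tensor:
        # Concatenate the learnable queries after the (multimodal) prompt embeddings;
        # the MLLM keeps its usual causal mask over the whole sequence.
        batch = prompt_embeds.size(0)
        queries = self.queries.unsqueeze(0).expand(batch, -1, -1)
        seq = torch.cat([prompt_embeds, queries], dim=1)
        hidden = self.mllm(inputs_embeds=seq).last_hidden_state

        # Only the hidden states at the query positions condition the diffusion decoder.
        return self.connector(hidden[:, -self.queries.size(0):])  # (B, num_queries, cond_dim)
```

The output sequence plays the same role as text-encoder embeddings in a standard text-to-image diffusion model, so it can be plugged in wherever the decoder expects cross-attention conditioning.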
The proposed architecture involves two design choices: using learnable queries and keeping the MLLM backbone frozen. We explain why we adopted these choices and how they impact performance, reporting FID on MJHQ-30K for visual aesthetic quality, and GenEval and DPG-Bench (both without prompt rewriting) for prompt alignment.
Learnable Queries: While many models use the (M)LLM's last layer embedding of input tokens for image generation, this approach limits unified modeling capabilities such as in-context learning and multimodal outputs. Our experiments show that learnable queries with just 64 tokens achieve comparable image generation quality to using last layer embeddings while unlocking the MLLM's in-context learning capability. Increasing to 512 tokens further improves performance, even surpassing the last layer embedding approach.
Methods | Number of Tokens | MJHQ-30K FID ↓ | GenEval ↑ | DPG-Bench ↑ |
---|---|---|---|---|
LLM last layer embedding* | - | 7.49 | 0.55 | 78.41 |
Learnable queries | 64 | 7.43 | 0.56 | 75.35 |
Learnable queries | 512 | 7.34 | 0.56 | 78.43 |
Frozen MLLMs: We keep the MLLM backbone frozen to preserve its understanding capabilities while avoiding complex training. Our experiments show that frozen MLLMs perform comparably to fully tuned models, with slightly better visual quality but lower prompt alignment. This suggests that MetaQuery offers a simpler yet effective training strategy as an alternative to fine-tuning the entire MLLM; a minimal sketch of the corresponding parameter setup follows the table below.
Methods | Train LLM | Train DiT | MJHQ-30K FID ↓ | GenEval ↑ | DPG-Bench ↑ |
---|---|---|---|---|---|
MLLM tuning | ✓ | ✗ | 7.75 | 0.58 | 78.97 |
E2E tuning | ✓ | ✓ | 6.28 | 0.61 | 79.39 |
Frozen MLLM | ✗ | ✗ | 7.43 | 0.56 | 75.35 |
Frozen MLLM | ✗ | ✓ | 6.06 | 0.61 | 76.66 |
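As a concrete illustration of these rows, here is a hedged sketch (a hypothetical helper, building on the `MetaQueryBridge` sketch above) of the best-performing setup in the table: frozen MLLM, trainable queries and connector, and a trainable DiT decoder. The optimizer settings are placeholders, not the paper's values.

```python
import torch

def configure_optimizer(bridge, diffusion_decoder, train_dit: bool = True):
    """Freeze the MLLM, train the queries/connector, and optionally unfreeze the DiT."""
    # Freeze everything inside the bridge first (this includes the MLLM backbone).
    for p in bridge.parameters():
        p.requires_grad_(False)

    # Re-enable the MetaQuery-specific parameters: learnable queries + connector.
    bridge.queries.requires_grad_(True)
    for p in bridge.connector.parameters():
        p.requires_grad_(True)

    # The diffusion decoder is frozen or trained depending on the configuration row.
    for p in diffusion_decoder.parameters():
        p.requires_grad_(train_dit)

    trainable = [p for module in (bridge, diffusion_decoder)
                 for p in module.parameters() if p.requires_grad]
    # Placeholder hyperparameters; the paper does not specify them here.
    return torch.optim.AdamW(trainable, lr=1e-4, weight_decay=0.01)
```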
We further study key training options for the two main components of MetaQuery: the number of tokens and connector design.
Number of Tokens: We observe promising scaling behavior on both text-to-image generation and image reconstruction as the number of learnable tokens increases.
Connector Design: We study two connector designs: Projection Before Encoder (Proj-Enc) and Projection After Encoder (Enc-Proj). Enc-Proj first aligns conditions in the MLLM hidden dimension and only then projects to the diffusion decoder's input dimension, achieving better performance with fewer parameters than Proj-Enc; a sketch of both orderings follows the table below.
Architecture | # of Layers | Dims | # of Params | Rel. Wall Time | MJHQ-30K FID ↓ | GenEval ↑ | DPG-Bench ↑ |
---|---|---|---|---|---|---|---|
Proj-Enc | 6 | 2304 | 517M | 1.06x | 7.80 | 0.53 | 73.37 |
Proj-Enc | 24 | 2304 | 2046M | 1.23x | 7.41 | 0.51 | 73.75 |
Enc-Proj | 6 | 896 | 84M | 1x | 7.73 | 0.49 | 71.39 |
Enc-Proj | 24 | 896 | 316M | 1.06x | 7.43 | 0.56 | 75.35 |
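A sketch of the two orderings, with illustrative dimensions taken from the table (896 for the MLLM hidden size, 2304 for the diffusion conditioning width), might look like this; the Enc-Proj branch mirrors the connector in the `MetaQueryBridge` sketch above.

```python
import torch.nn as nn

def build_connector(kind: str, mllm_dim: int = 896, cond_dim: int = 2304,
                    num_layers: int = 24, nhead: int = 8) -> nn.Module:
    """Proj-Enc projects first and encodes at the larger width (more parameters);
    Enc-Proj encodes at the MLLM width first and projects last (fewer parameters)."""
    def encoder(dim: int) -> nn.Module:
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        return nn.TransformerEncoder(layer, num_layers=num_layers)

    if kind == "proj-enc":
        return nn.Sequential(nn.Linear(mllm_dim, cond_dim), encoder(cond_dim))
    if kind == "enc-proj":
        return nn.Sequential(encoder(mllm_dim), nn.Linear(mllm_dim, cond_dim))
    raise ValueError(f"unknown connector kind: {kind}")
```

The parameter gap in the table comes directly from this ordering: the encoder dominates the parameter count, and Enc-Proj runs it at the smaller MLLM width.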
Instead of depending on human-created pairs or synthetically generated data, we use a scalable data curation pipeline that directly leverages naturally occurring image pairs from web corpora. These image pairs often exhibit meaningful associations and specific relationships spanning a broad spectrum.
We first cluster images with similar captions and designate one image in each pair as the target; this process yields 2.4M image pairs. We then employ Qwen2.5-VL 3B to generate an instruction for each pair describing how to transform the source image into the target image.
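The following is a simplified sketch of this mining step under stated assumptions: `embed_caption` and `instruct_vlm` are hypothetical callables (the latter standing in for a Qwen2.5-VL 3B wrapper), and the exhaustive pairwise loop would be replaced by approximate nearest-neighbor clustering at the 2.4M-pair scale.

```python
import itertools
import torch
import torch.nn.functional as F

def mine_instruction_pairs(samples, embed_caption, instruct_vlm, sim_threshold=0.9):
    """Pair images with near-duplicate captions, then ask a VLM for an edit instruction.

    `samples` is a list of dicts with "image" and "caption"; `embed_caption` returns a
    1-D caption embedding; `instruct_vlm(source_image, target_image)` returns a free-form
    instruction describing how to turn the source into the target.
    """
    embeds = F.normalize(
        torch.stack([embed_caption(s["caption"]) for s in samples]), dim=-1)

    pairs = []
    for i, j in itertools.combinations(range(len(samples)), 2):
        # Group images whose captions are close in embedding space, then designate
        # one image as the source and the other as the target.
        if embeds[i] @ embeds[j] >= sim_threshold:
            source, target = samples[i], samples[j]
            pairs.append({
                "source": source["image"],
                "target": target["image"],
                "instruction": instruct_vlm(source["image"], target["image"]),
            })
    return pairs
```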
Finally, we train our models on three MLLM backbones of different sizes: Base (LLaVA-OneVision 0.5B), Large (Qwen2.5-VL 3B), and X-Large (Qwen2.5-VL 7B). We set the number of tokens to 256 for all models and use a 24-layer connector with the Enc-Proj architecture. For image generation heads, we test two diffusion models: Stable Diffusion v1.5 and Sana-1.6B. Our model family demonstrates strong capabilities across both understanding and generation tasks: all model sizes exhibit competitive performance on the understanding benchmarks, and for image generation, MetaQuery achieves SOTA visual quality on MJHQ-30K and closely matches the SOTA prompt alignment results on GenEval and DPG-Bench.
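Summarized as a configuration sketch (the field names are ours, not the authors'), the released family looks roughly like this:

```python
# Illustrative summary of the model family described above; all variants share
# 256 learnable tokens and a 24-layer Enc-Proj connector, and each can be paired
# with either Stable Diffusion v1.5 or Sana-1.6B as the generation head.
METAQUERY_FAMILY = {
    "Base":    {"mllm": "LLaVA-OneVision 0.5B", "num_tokens": 256,
                "connector": {"kind": "enc-proj", "num_layers": 24}},
    "Large":   {"mllm": "Qwen2.5-VL 3B",        "num_tokens": 256,
                "connector": {"kind": "enc-proj", "num_layers": 24}},
    "X-Large": {"mllm": "Qwen2.5-VL 7B",        "num_tokens": 256,
                "connector": {"kind": "enc-proj", "num_layers": 24}},
}
GENERATION_HEADS = ["Stable Diffusion v1.5", "Sana-1.6B"]
```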
Methods | Base (M)LLM | MME-P | MMB | SEED | MMMU | MM-Vet | COCO FID ↓ | MJHQ FID ↓ | GenEval ↑ | DPG-Bench ↑ |
---|---|---|---|---|---|---|---|---|---|---|
Emu | LLaMA 13B | - | - | - | - | - | 11.66 | - | - | - |
DreamLLM | Vicuna 7B | - | - | - | - | 36.6 | 8.46 | - | - | - |
Chameleon | From Scratch 7B | - | - | - | 22.4 | 8.3 | 26.74 | - | 0.39 | - |
Show-o-512 | Phi-1.5 1.3B | 1097.2 | - | - | 26.7 | - | 9.24 | 15.18 | 0.68 | - |
VILA-U | LLaMA-2 7B | 1401.8 | - | 59.0 | - | 33.5 | - | 7.69 | - | - |
Emu3 | From Scratch 7B | - | 58.5 | 68.2 | 31.6 | 37.2 | 12.80 | - | 0.66† | 80.60 |
MetaMorph | LLaMA-3 8B | - | 75.2 | 71.8 | - | - | 11.8 | - | - | - |
TokenFlow-XL | Qwen-2.5 14B | 1551.1 | 76.8 | 72.6 | 43.2 | 48.2 | - | - | 0.63† | 73.38 |
Transfusion | From Scratch 7B | - | - | - | - | - | 8.70 | - | 0.63 | - |
LMFusion | LLaVA-Next 8B | 1603.7 | 72.1 | 72.5 | 41.7 | - | 8.20 | - | - | - |
Janus | DeepSeek-LLM 1.5B | 1338.0 | 69.4 | 63.7 | 30.5 | 34.3 | 8.53 | 10.10 | 0.61 | - |
JanusFlow | DeepSeek-LLM 1.5B | 1333.1 | 74.9 | 70.5 | 29.3 | 30.9 | - | 9.51 | 0.63 | 80.09 |
Janus-Pro-1B | DeepSeek-LLM 1.5B | 1444.0 | 75.5 | 68.3 | 36.3 | 39.8 | - | 14.33‡ | 0.73 | 82.63 |
Janus-Pro-7B | DeepSeek-LLM 7B | 1567.1 | 79.2 | 72.1 | 41.0 | 50.0 | - | 13.48‡ | 0.80 | 84.19 |
MetaQuery-B | LLaVA-ov 0.5B | 1238.0 | 58.5 | 66.6 | 31.4 | 29.1 | 8.91 | 6.28 | 0.74† | 80.04 |
MetaQuery-L | Qwen2.5-VL 3B | 1574.3 | 78.6 | 73.8 | 53.1 | 63.2 | 8.87 | 6.35 | 0.78† | 81.10 |
MetaQuery-XL | Qwen2.5-VL 7B | 1685.2 | 83.5 | 76.9 | 58.6 | 66.6 | 8.69 | 6.02 | 0.80† | 82.05 |
We demonstrate that MetaQuery can be easily fine-tuned for image reconstruction tasks with a frozen MLLM. Our model achieves comparable quality to SOTA models.
We demonstrate that MetaQuery can transfer its image reconstruction capability to perform image editing. We keep the MLLM backbone frozen and fine-tune our pre-trained Base model for only 1,000 steps on publicly available image editing data. Qualitative results demonstrate that our model performs effectively in these image-editing scenarios.
We show that after being instruction-tuned on the proposed 2.4M dataset, MetaQuery can achieve impressive zero-shot subject-driven generation performance, producing coherent results even with multiple highly customized subjects (the first row of Figure 8). Using various supervision signals, the instruction-tuned MetaQuery model surprisingly unlocks novel capabilities like visual association and logo design that go beyond copy-pasting (the second row of Figure 8).
Our learnable queries effectively leverage the frozen LLM's capabilities, enabling better understanding of complex prompts requiring real-world knowledge and reasoning.
We evaluate MetaQuery's world knowledge reasoning capability on the WISE benchmark, which contains test cases similar to the knowledge-augmented generation examples shown in Figure 9. MetaQuery achieves SOTA performance, significantly outperforming all other unified models. To our knowledge, it is the first unified model to successfully transfer the advanced capabilities of frozen MLLMs to image generation while exceeding the performance of SOTA text-to-image models.
Methods | Cultural | Time | Space | Biology | Physics | Chemistry | Overall |
---|---|---|---|---|---|---|---|
GPT-4o | 0.94 | 0.64 | 0.98 | 0.93 | 0.98 | 0.95 | 0.89 |
Text-to-Image Models | |||||||
SD-v1-5 | 0.34 | 0.35 | 0.32 | 0.28 | 0.29 | 0.21 | 0.32 |
SD-XL | 0.43 | 0.48 | 0.47 | 0.44 | 0.45 | 0.27 | 0.43 |
PixArt-Alpha | 0.45 | 0.50 | 0.48 | 0.49 | 0.56 | 0.34 | 0.47 |
playground-v2.5 | 0.49 | 0.58 | 0.55 | 0.43 | 0.48 | 0.33 | 0.49 |
SD-3.5-large | 0.44 | 0.50 | 0.58 | 0.44 | 0.52 | 0.31 | 0.46 |
FLUX.1-dev | 0.48 | 0.58 | 0.62 | 0.42 | 0.51 | 0.35 | 0.50 |
Unified Models | |||||||
show-o-512 | 0.28 | 0.40 | 0.48 | 0.30 | 0.46 | 0.30 | 0.35 |
vila-u-7b-256 | 0.26 | 0.33 | 0.37 | 0.35 | 0.39 | 0.23 | 0.31 |
Emu3 | 0.34 | 0.45 | 0.48 | 0.41 | 0.45 | 0.27 | 0.39 |
Janus-1.3B | 0.16 | 0.26 | 0.35 | 0.28 | 0.30 | 0.14 | 0.23 |
JanusFlow-1.3B | 0.13 | 0.26 | 0.28 | 0.20 | 0.19 | 0.11 | 0.18 |
Janus-Pro-1B | 0.20 | 0.28 | 0.45 | 0.24 | 0.32 | 0.16 | 0.26 |
Janus-Pro-7B | 0.30 | 0.37 | 0.49 | 0.36 | 0.42 | 0.26 | 0.35 |
MetaQuery-B | 0.44 | 0.49 | 0.58 | 0.41 | 0.49 | 0.34 | 0.46 |
MetaQuery-L | 0.56 | 0.57 | 0.62 | 0.48 | 0.63 | 0.42 | 0.55 |
MetaQuery-XL | 0.56 | 0.55 | 0.62 | 0.49 | 0.63 | 0.41 | 0.55 |
We also quantitatively evaluate MetaQuery's commonsense reasoning capability on the CommonsenseT2I benchmark. Results show that MetaQuery significantly improves the performance of the base Sana model, achieving SOTA performance.
Methods | w/o Neg. Prompt | w/ Neg. Prompt |
---|---|---|
DALL-E 3 w/ rewrite | 40.17 | N/A |
SD-XL | 26.00 | 44.83 |
SD-3-medium | 26.17 | 47.17 |
FLUX.1-dev | 24.50 | 22.50 |
Sana-1.6B | 25.17 | 43.33 |
MetaQuery-B | 27.33 | 51.50 |
MetaQuery-L | 28.83 | 57.67 |
While our learnable-queries approach matches the image quality of using the LLM's last-layer embeddings, the latter treats the LLM merely as a text encoder, limiting in-context learning. As shown in Figure 9 and confirmed by the WiScore and CommonsenseT2I benchmarks, MetaQuery significantly outperforms the last-layer embedding approach by natively integrating with the LLM and leveraging its reasoning capabilities to generate appropriate images.
Methods | MJHQ-30K FID ↓ | GenEval ↑ | DPG-Bench ↑ | WiScore ↑ | CommonsenseT2I ↑ |
---|---|---|---|---|---|
Ours-L w/ Last Layer Embed* | 6.41 | 0.78 | 81.23 | 0.48 | 52.83 |
Ours-L w/ MetaQuery | 6.35 | 0.78 | 81.10 | 0.55 | 57.67 |
We presented MetaQueries, a simple interface connecting MLLMs (for understanding) and diffusion decoders (for generation), effective even when the MLLM is frozen. This approach yields state-of-the-art understanding and generation performance with straightforward implementation. By enabling transfer between modalities, MetaQueries successfully channels MLLM knowledge and reasoning into multimodal generation. While effective, we hypothesize that bridging the remaining gap to leading proprietary systems may primarily involve further data scaling. We hope MetaQueries provides a powerful, accessible baseline for future unified multimodal model development.
@article{pan2025transfer,
title={Transfer between Modalities with MetaQueries},
author={Pan, Xichen and Shukla, Satya Narayan and Singh, Aashu and Zhao, Zhuokai and Mishra, Shlok Kumar and Wang, Jialiang and Xu, Zhiyang and Chen, Jiuhai and Li, Kunpeng and Juefei-Xu, Felix and Hou, Ji and Xie, Saining},
journal={arXiv preprint arXiv:2504.06256},
year={2025}
}