
Meta’s AI Image Generator: CM3leon, The Chameleon of Generative Models

Introduction to CM3leon

In the rapidly evolving world of AI, Meta has introduced a state-of-the-art generative model, CM3leon. The model stands out for its ability to perform both text-to-image and image-to-text generation, making it a versatile tool in the realm of generative AI.

CM3leon: A Multimodal Model

CM3leon is the first multimodal model trained with a recipe adapted from text-only language models: a large-scale retrieval-augmented pre-training stage followed by a second multitask supervised fine-tuning (SFT) stage. The model is both efficient and strong, showing that tokenizer-based transformers can be trained as effectively as existing diffusion-based generative models.

Performance of CM3leon

CM3leon achieves state-of-the-art performance for text-to-image generation despite being trained with five times less compute than previous transformer-based methods. It offers the versatility and effectiveness of autoregressive models while keeping training costs low and inference efficient. It is called a causal masked mixed-modal (CM3) model because it can generate sequences of text and images conditioned on arbitrary sequences of other image and text content.
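
To make the CM3 idea a bit more concrete, here is a minimal sketch (not Meta's code; the sentinel token and span selection are assumptions) of causally masked training data: a span of tokens is cut out, replaced with a mask sentinel, and appended to the end of the sequence, so a left-to-right model can learn to infill it.

```python
import random

MASK = "<mask:0>"  # hypothetical sentinel token; the real tokenizer's ids differ

def cm3_mask(tokens, rng=random):
    """Move one randomly chosen span to the end of the sequence,
    leaving a mask sentinel behind, so a causal LM can learn infilling."""
    if len(tokens) < 3:
        return list(tokens)
    start = rng.randrange(0, len(tokens) - 1)
    end = rng.randrange(start + 1, len(tokens))
    span = tokens[start:end]
    # prefix + sentinel + suffix, then the sentinel again followed by the masked span
    return tokens[:start] + [MASK] + tokens[end:] + [MASK] + span

# toy example with mixed "text" and "image" tokens
seq = ["a", "photo", "of", "<img:17>", "<img:902>", "<img:54>"]
print(cm3_mask(seq))
```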

Multitask Instruction Tuning

CM3leon has undergone large-scale multitask instruction tuning for both image and text generation. This significantly improves its performance on tasks such as image caption generation, visual question answering, text-based editing, and conditional image generation, and it shows that the scaling recipes developed for text-only models generalize directly to tokenization-based image generation models.
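
As a hedged illustration of what multitask supervised fine-tuning data could look like once every task is flattened into one interleaved text-and-image token stream (the prompt templates and tags below are assumptions, not Meta's published format):

```python
# Hypothetical SFT examples: each task becomes one interleaved sequence of
# text tokens and image-token placeholders. Tags like <image:...> are assumed.
sft_examples = [
    {   # image captioning: image in, text out
        "input":  "<image:tokens_of_photo> Describe this image briefly.",
        "target": "A small dog running across a grassy field.",
    },
    {   # visual question answering: image + question in, answer out
        "input":  "<image:tokens_of_photo> Question: What color is the dog?",
        "target": "Brown.",
    },
    {   # text-to-image: text in, image tokens out
        "input":  "A watercolor painting of a lighthouse at dusk.",
        "target": "<image:generated_tokens>",
    },
]

def to_training_sequence(example):
    """Concatenate input and target so a single decoder-only model can be
    fine-tuned on every task with the same next-token prediction loss."""
    return f"{example['input']} <sep> {example['target']}"

for ex in sft_examples:
    print(to_training_sequence(ex))
```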

Benchmark Performance

On the most widely used image generation benchmark (zero-shot MS-COCO), CM3leon achieves an FID (Fréchet Inception Distance) score of 4.88, establishing a new state of the art in text-to-image generation. It even outperforms Google’s text-to-image model, Parti. This result underscores the potential of retrieval augmentation and highlights the impact of scaling strategies on the performance of autoregressive models.
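
For reference, FID measures how close the Inception-feature statistics of generated images are to those of real images (lower is better). With feature means mu_r, mu_g and covariances Sigma_r, Sigma_g for real and generated images:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```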

CM3leon’s Capabilities

CM3leon produces more coherent imagery that follows the input prompts more faithfully, and it performs strongly at recovering both global shapes and local details. It excels at text-guided image generation and editing, and it can generate short or long captions and answer questions about an image.

Structure-guided Image Editing

Because CM3leon can interpret not only textual instructions but also structural or layout information provided as input, it can create visually coherent and contextually appropriate edits to an image while adhering to the given structure or layout guidelines.
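
One way to picture this, as a sketch under assumed token names rather than Meta's published interface, is a single conditioning sequence that interleaves the text instruction, the structure (for example, a segmentation map), and the source image tokens:

```python
def build_edit_prompt(instruction, structure_tokens, source_image_tokens):
    """Assemble one conditioning sequence for structure-guided editing.
    The <text>/<struct>/<image> tags are illustrative placeholders."""
    return (
        ["<text>"] + instruction.split()
        + ["<struct>"] + structure_tokens     # e.g. tokens from a segmentation map
        + ["<image>"] + source_image_tokens   # tokens of the image to be edited
    )

prompt = build_edit_prompt(
    "replace the sky with a sunset",
    structure_tokens=["seg_3", "seg_3", "seg_7"],
    source_image_tokens=["img_41", "img_98", "img_12"],
)
print(prompt)
```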

Super-resolution Results

A common trick for image generation is to add a separately trained super-resolution stage to produce higher-resolution images from the original model outputs. This works very well with CM3leon too, as shown in the examples for the text-to-image generation task.
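
A minimal sketch of that two-stage pipeline, with both models replaced by placeholders (the real base model and super-resolution model are learned networks, not the stand-ins below):

```python
import numpy as np

def generate_base_image(prompt: str, size: int = 256) -> np.ndarray:
    """Placeholder for the base text-to-image model (returns random pixels here)."""
    return np.random.rand(size, size, 3)

def super_resolve(image: np.ndarray, scale: int = 4) -> np.ndarray:
    """Placeholder for a separately trained super-resolution model.
    Here we just nearest-neighbour upsample to show the data flow."""
    return image.repeat(scale, axis=0).repeat(scale, axis=1)

low_res = generate_base_image("a red fox in the snow")   # e.g. 256x256
high_res = super_resolve(low_res, scale=4)                # e.g. 1024x1024
print(low_res.shape, "->", high_res.shape)
```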

Building CM3leon

CM3leon’s architecture uses a decoder-only transformer akin to well-established text-based models. What sets CM3leon apart is its ability to take both text and images as input and to generate both text and images as output, which lets it handle a wide variety of tasks.
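
As a rough sketch of the idea (the vocabulary sizes and model dimensions are illustrative, not CM3leon's real configuration), a decoder-only transformer only needs a single embedding table and output head that cover both text tokens and discrete image-tokenizer codes:

```python
import torch
import torch.nn as nn

TEXT_VOCAB = 50_000   # illustrative, not CM3leon's real tokenizer size
IMAGE_VOCAB = 8_192   # discrete codes from an image tokenizer (e.g. a VQ model)
VOCAB = TEXT_VOCAB + IMAGE_VOCAB

class TinyMixedModalDecoder(nn.Module):
    """A minimal decoder-only transformer over a shared text+image vocabulary."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2, max_len=512):
        super().__init__()
        self.tok = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, ids):
        # ids: (batch, seq) of mixed text and image token ids
        seq_len = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(seq_len, device=ids.device))
        # causal mask: each position may only attend to earlier positions
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        x = self.blocks(x, mask=causal)
        return self.head(x)  # next-token logits over text *and* image codes

model = TinyMixedModalDecoder()
logits = model(torch.randint(0, VOCAB, (1, 16)))
print(logits.shape)  # torch.Size([1, 16, 58192])
```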

Training CM3leon

Following recent work, CM3leon’s training is retrieval-augmented, which greatly improves the efficiency and controllability of the resulting model. It was also instruction fine-tuned on a wide range of image and text generation tasks.
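
A hedged sketch of what retrieval augmentation can look like at training time (the toy word-overlap retriever and document format below are simplifications; a real system would use dense multimodal retrieval): relevant documents are retrieved for each training example and prepended to its sequence, so the model learns to condition on them.

```python
def retrieve(query_doc, memory_bank, k=2):
    """Toy retriever: rank by word overlap with the query document.
    A real system would use dense multimodal embeddings and nearest-neighbour search."""
    q = set(query_doc.split())
    scored = sorted(memory_bank, key=lambda d: -len(q & set(d.split())))
    return scored[:k]

def build_retrieval_augmented_sequence(query_doc, memory_bank):
    """Prepend retrieved documents to the training example, separated by a tag."""
    retrieved = retrieve(query_doc, memory_bank)
    return " <doc> ".join(retrieved + [query_doc])

memory_bank = [
    "a chameleon on a branch <image:tokens_a>",
    "a lizard in the desert <image:tokens_b>",
    "a city skyline at night <image:tokens_c>",
]
print(build_retrieval_augmented_sequence(
    "a chameleon changing color <image:tokens_q>", memory_bank
))
```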

Conclusion

As the AI industry continues to evolve, generative models like CM3leon are becoming increasingly sophisticated. These models learn the relationship between visuals and text by training on millions of example images, but they can also reflect any biases present in the training data. Meta says that by making its work transparent, it hopes to encourage collaboration and innovation in the field of generative AI.

FAQs

  1. What is CM3leon? CM3leon is a state-of-the-art generative model introduced by Meta that can perform both text-to-image and image-to-text generation.
  2. How does CM3leon perform compared to other models? CM3leon achieves state-of-the-art performance for text-to-image generation, outperforming even Google’s text-to-image model, Parti, on the most widely used image generation benchmark.
  3. What tasks can CM3leon handle? CM3leon can handle a variety of tasks including text-guided image generation and editing, image caption generation, visual question answering, and conditional image generation.
  4. How was CM3leon trained? CM3leon’s training is retrieval augmented, improving efficiency and controllability of the resulting model. It was also subjected to instruction fine-tuning on a wide range of different image and text generation tasks.

Sign Up For The Neuron AI Newsletter

Join 450,000+ professionals from top companies like Microsoft, Apple, & Tesla and get the AI trends and tools you need to know to stay ahead of the curve 👇