Manzano: A New Multimodal Model That Understands Both Image and Text
Amsterdam, Thursday, 25 September 2025.
Researchers have developed Manzano, a simple and scalable multimodal model capable of understanding and generating visual content. With a hybrid image tokenizer and a well-curated training method, Manzano achieves top results in both image-to-text comprehension and text-to-image generation. Recent test results show an accuracy of 95% in image-to-text conversions, making the model stand out compared to specialized models, especially in evaluations with extensive text.
Technology Behind Manzano
Manzano combines a hybrid image tokenizer with a well-curated training method, which together enable the model to both understand and generate visual content. A single shared visual encoder feeds two lightweight adapters. Within a common semantic space, one adapter produces continuous embeddings for image-to-text comprehension and the other produces discrete tokens for text-to-image generation. A unified autoregressive large language model (LLM) predicts high-level semantics in the form of text and image tokens, and a diffusion decoder then translates the image tokens into pixels [1][2][3].
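The shared-encoder, two-adapter layout described above can be illustrated with a small toy sketch. This is not Manzano's implementation: the class name, the dimensions, the random weights and the nearest-neighbour codebook lookup used for the discrete branch are all assumptions chosen for illustration, since the sources cited here do not describe how the adapters are built. The sketch only shows how a single set of encoder features can yield both continuous embeddings and discrete token ids.

```python
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix standing in for trained parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class HybridTokenizerSketch:
    """Hypothetical toy model: one shared encoder feeding two adapters."""

    def __init__(self, patch_dim=4, feat_dim=8, llm_dim=6, codebook_size=16, seed=0):
        rng = random.Random(seed)
        # Shared visual encoder (assumed here to be a single linear map).
        self.encoder = make_matrix(feat_dim, patch_dim, rng)
        # Continuous adapter: projects features into the LLM embedding space.
        self.continuous_adapter = make_matrix(llm_dim, feat_dim, rng)
        # Discrete adapter: a codebook searched by nearest neighbour (assumption).
        self.codebook = make_matrix(codebook_size, feat_dim, rng)

    def encode(self, patches):
        """Run every image patch through the shared encoder."""
        return [matvec(self.encoder, patch) for patch in patches]

    def continuous_embeddings(self, patches):
        """Understanding branch: one continuous vector per patch."""
        return [matvec(self.continuous_adapter, f) for f in self.encode(patches)]

    def discrete_tokens(self, patches):
        """Generation branch: one codebook index per patch."""
        tokens = []
        for feature in self.encode(patches):
            distances = [
                sum((a - b) ** 2 for a, b in zip(feature, code))
                for code in self.codebook
            ]
            tokens.append(distances.index(min(distances)))
        return tokens


# Three made-up "patches" of four values each.
patches = [[0.1, 0.2, 0.3, 0.4], [0.9, 0.8, 0.7, 0.6], [0.0, 0.5, 0.5, 0.0]]
tokenizer = HybridTokenizerSketch()
embeddings = tokenizer.continuous_embeddings(patches)
token_ids = tokenizer.discrete_tokens(patches)
print(len(embeddings), len(embeddings[0]))  # 3 6
print(token_ids)
```

Both branches call the same `encode` method, which is the point of the design as described: the continuous and discrete outputs are derived from the same features rather than from two separate encoders. In a real system the LLM would consume the embeddings for comprehension and predict token ids for generation, with a separate decoder turning those ids into pixels.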
Development and Test Results
Development of Manzano began in January 2025, and the model was first presented on 19 September 2025. Recent test results show 95% accuracy in image-to-text conversions, which sets the model apart from specialized models, especially in text-heavy evaluations [1][2]. Dr. Lisa Van der Meer, Head of Research, states that Manzano has the potential to revolutionize communication between humans and machines [1].
Application in Journalism
In journalism, Manzano can play a significant role by improving both the production and consumption of news. The model can, for example, be used to automatically generate images for news articles, which can enhance visual appeal and reader engagement. Additionally, Manzano can assist in quickly understanding and categorizing visual content, thereby increasing the efficiency of editors [2][3].
Advantages and Potential Drawbacks
One of Manzano's main advantages is its flexibility and scalability. It can easily be adapted to various applications and has the potential to significantly improve the quality of multimedia content. However, there are also potential drawbacks and ethical considerations. One of the biggest concerns is the possibility of misuse, such as generating fake news or deceptive images. Moreover, automating certain journalistic tasks could lead to job losses and reduce human influence in the news process [2][4].
Future Developments
According to the developers, Manzano will be available for commercial applications within three months of its launch, that is, by December 2025. The next phase of development will focus on improving real-time conversion capabilities, which could significantly enhance the model’s usability in practical applications [1][2].