Canadian AI startup Cohere launched in 2019 with a specific focus on the enterprise, but independent research has shown it has so far struggled to gain much market share among third-party developers compared with rival proprietary U.S. model providers such as OpenAI and Anthropic, not to mention the rise of Chinese open source competitor DeepSeek.
Yet Cohere continues to bolster its offerings. Today, its non-profit research division Cohere For AI announced Aya Vision, the company’s first vision model: an open-weight multimodal AI model that integrates language and vision capabilities and supports inputs in 23 languages spoken by what Cohere’s official blog post calls “half the world’s population,” giving it appeal to a wide global audience.
Aya Vision is designed to enhance AI’s ability to interpret images, generate text, and translate visual content into natural language, making multilingual AI more accessible and effective. That should be especially helpful for enterprises and organizations operating in multiple markets around the world with different language preferences.
It’s available now on Cohere’s website and on AI code communities Hugging Face and Kaggle under a Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license, allowing researchers and developers to freely use, modify, and share the model for non-commercial purposes as long as proper attribution is given.
That non-commercial clause, unfortunately, limits its use for enterprises and as an engine for paid apps or moneymaking workflows.
In addition, Aya Vision is available through WhatsApp, allowing users to interact with the model directly in a familiar environment.
It comes in 8-billion and 32-billion parameter versions (parameters refer to the number of internal settings in an AI model, including its weights and biases, with a higher count usually denoting a more powerful and performant model).
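For researchers and developers who want to try the open weights locally, a minimal sketch of loading the 8B model with the Hugging Face transformers library might look like the following. The repository id (CohereForAI/aya-vision-8b), the example image URL, and the message format are assumptions based on the standard image-text-to-text workflow, so check the model card for the authoritative usage:

```python
# Minimal sketch (not Cohere's official example): load Aya Vision 8B from
# Hugging Face and ask a question about an image in French.
# The repository id and message format below are assumptions; consult the
# model card for exact identifiers and prompt conventions.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "CohereForAI/aya-vision-8b"  # assumed repository id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype=torch.float16
)

# One multilingual, multimodal turn: an image plus a question in French.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/product_label.jpg"},  # hypothetical image
            {"type": "text", "text": "Décris cette étiquette de produit en français."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.tokenizer.decode(output[0], skip_special_tokens=True))
```

Because the weights are released under CC BY-NC, any such experimentation must remain non-commercial.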
Supports 23 languages and counting
Even though leading AI models from rivals can understand text across multiple languages, extending this capability to vision-based tasks is a challenge.
But Aya Vision overcomes this by allowing users to generate image captions, answer visual questions, translate images, and perform text-based language tasks in a diverse set of languages including:
1. English
2. French
3. German
4. Spanish
5. Italian
6. Portuguese
7. Japanese
8. Korean
9. Chinese
10. Arabic
11. Greek
12. Persian
13. Polish
14. Indonesian
15. Czech
16. Hebrew
17. Hindi
18. Dutch
19. Romanian
20. Russian
21. Turkish
22. Ukrainian
23. Vietnamese
In its blog post, Cohere showed how Aya Vision can analyze imagery and text on product packaging and provide translations or explanations. It can also identify and describe art styles from different cultures, helping users learn about objects and traditions through AI-powered visual understanding.
Aya Vision’s capabilities have broad implications across multiple fields:
• Language Learning and Education: Users can translate and describe images in multiple languages, making educational content more accessible.
• Cultural Preservation: The model can generate detailed descriptions of art, landmarks, and historical artifacts, supporting cultural documentation in underrepresented languages.
• Accessibility Tools: Vision-based AI can assist visually impaired users by providing detailed image descriptions in their native language.
• Global Communication: Real-time multimodal translation enables organizations and individuals to communicate across languages more effectively.
Strong performance and high efficiency across leading benchmarks
One of Aya Vision’s standout features is its efficiency and performance relative to model size. Despite being significantly smaller than some leading multimodal models, Aya Vision has outperformed much larger alternatives in several key benchmarks.
• Aya Vision 8B outperforms Llama 90B, which is 11 times larger.
• Aya Vision 32B outperforms Qwen 72B, Llama 90B, and Molmo 72B, all of which are at least twice its size.
• Benchmarking results on AyaVisionBench and m-WildVision show Aya Vision 8B achieving win rates of up to 79%, and Aya Vision 32B reaching 72% win rates in multilingual image understanding tasks.
As shown in Cohere’s efficiency vs. performance trade-off graph, Aya Vision 8B and 32B deliver best-in-class performance relative to their parameter size, outperforming much larger models while maintaining computational efficiency.
The tech innovations powering Aya Vision
Cohere For AI attributes Aya Vision’s performance gains to several key innovations:
• Synthetic Annotations: The model leverages synthetic data generation to enhance training on multimodal tasks.
• Multilingual Data Scaling: By translating and rephrasing data across languages, the model gains a broader understanding of multilingual contexts.
• Multimodal Model Merging: Advanced techniques combine insights from both vision and language models, improving overall performance.
These advancements allow Aya Vision to process images and text with greater accuracy while maintaining strong multilingual capabilities.
The step-by-step performance improvement chart showcases how incremental innovations, including supervised fine-tuning (SFT) on synthetic data, model merging, and scaling, contributed to Aya Vision’s high win rates.
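Cohere has not published the exact merging recipe, but the general idea behind the model merging step can be illustrated with a simple weight-space interpolation between two checkpoints that share an architecture. The sketch below is purely illustrative and uses hypothetical file names; it is not Cohere’s method:

```python
# Illustrative sketch of weight-space model merging: linearly interpolate the
# parameters of two checkpoints with identical architectures. This only shows
# the general idea; it is NOT Cohere's published recipe.
import torch

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Return a new state dict where each tensor is alpha*a + (1 - alpha)*b."""
    return {
        name: alpha * tensor_a + (1.0 - alpha) * sd_b[name]
        for name, tensor_a in sd_a.items()
    }

if __name__ == "__main__":
    # Hypothetical checkpoints: one tuned for vision tasks, one for language tasks.
    sd_vision = torch.load("vision_tuned.pt", map_location="cpu")
    sd_language = torch.load("language_tuned.pt", map_location="cpu")
    merged = merge_state_dicts(sd_vision, sd_language, alpha=0.5)
    torch.save(merged, "merged_checkpoint.pt")
```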
Implications for enterprise decision makers
Despite Cohere’s ostensible enterprise focus, businesses may have a hard time making much use of Aya Vision given its restrictive non-commercial licensing terms.
Nonetheless, CEOs, CTOs, IT leaders, and AI researchers may use the models to explore AI-driven multilingual and multimodal capabilities within their organizations—particularly in research, prototyping, and benchmarking.
Enterprises can still use it for internal research and development, evaluating multilingual AI performance, and experimenting with multimodal applications.
CTOs and AI teams will find Aya Vision valuable as a highly efficient, open-weight model that outperforms much larger alternatives while requiring fewer computational resources.
This makes it a useful tool for benchmarking against proprietary models, exploring potential AI-driven solutions, and testing multilingual multimodal interactions before committing to a commercial deployment strategy.
For data scientists and AI researchers, Aya Vision is much more useful.
Its open-weight release and rigorous benchmarks provide a transparent foundation for studying model behavior, fine-tuning in non-commercial settings, and contributing to open AI advancements.
Whether used for internal research, academic collaborations, or AI ethics evaluations, Aya Vision serves as a cutting-edge resource for enterprises looking to stay at the forefront of multilingual and multimodal AI—without the constraints of proprietary, closed-source models.
Open source research and collaboration
Aya Vision is part of Aya, a broader initiative by Cohere focused on making AI and related tech more multilingual.
Since its inception in February 2024, the Aya initiative has engaged a global research community of over 3,000 independent researchers across 119 countries, working together to improve language AI models.
To further its commitment to open science, Cohere has released the open weights for both Aya Vision 8B and 32B on Kaggle and Hugging Face, ensuring researchers worldwide can access and experiment with the models. In addition, Cohere For AI has introduced AyaVisionBench, a new multilingual vision evaluation set designed to provide a rigorous assessment framework for multimodal AI.
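Teams that want to inspect the evaluation set can pull it with the Hugging Face datasets library; a minimal sketch is below. The dataset id is an assumption, so confirm the exact repository name and any required configuration on Hugging Face first:

```python
# Minimal sketch: download the AyaVisionBench evaluation set for inspection.
# The dataset id is an assumption; verify the exact repository name and any
# required configuration on Hugging Face before running this.
from datasets import load_dataset

bench = load_dataset("CohereForAI/AyaVisionBench")
print(bench)  # list the available splits and their sizes

# Inspect the column schema of the first split instead of assuming field names.
first_split = next(iter(bench.values()))
print(first_split.features)
```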
The availability of Aya Vision as an open-weight model marks an important step in making multilingual AI research more inclusive and accessible.
Aya Vision builds on the success of Aya Expanse, another LLM family from Cohere For AI focused on multilingual AI. By expanding its focus to multimodal AI, Cohere For AI is positioning Aya Vision as a key tool for researchers, developers, and businesses looking to integrate multilingual AI into their workflows.
As the Aya initiative continues to evolve, Cohere For AI has also announced plans to launch a new collaborative research effort in the coming weeks. Researchers and developers interested in contributing to multilingual AI advancements can join the open science community or apply for research grants.
For now, Aya Vision’s release represents a significant leap in multilingual multimodal AI, offering a high-performance, open-weight solution that challenges the dominance of larger, closed-source models. By making these advancements available to the broader research community, Cohere For AI continues to push the boundaries of what is possible in AI-driven multilingual communication.