Meta’s Recent AI Release Is Capable of Both Seeing and Reading

Meta unveils Llama 3.2, an open-source AI model that processes both images and text, empowering developers to build advanced multimodal applications.

Tech giant Meta has recently unveiled an open-source AI model capable of processing both images and text. The new model, dubbed Llama 3.2, comes just two months after the company’s release of Llama 3.1 and is set to empower developers to create more sophisticated AI applications. 

Potential uses include augmented reality apps with real-time video comprehension, visual search engines that categorise images based on content, and document analysis tools that can summarise lengthy texts.

Ahmad Al-Dahle, Meta's vice president of generative AI, emphasised the model's user-friendly nature, stating that developers would need to do little more than “add this new multimodality and be able to show Llama images and have it communicate.”
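To illustrate what that workflow might look like in practice, here is a minimal sketch that prompts the 11-billion-parameter vision model with a local image via the Hugging Face transformers library. The model ID, class names and the placeholder image path follow Hugging Face's published examples and are assumptions for illustration, not anything prescribed by Meta's announcement.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, MllamaForConditionalGeneration

# Assumed Hugging Face checkpoint name for the 11B vision-instruct model.
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"

model = MllamaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to reduce memory use
    device_map="auto",           # spread layers across available GPU(s)/CPU
)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder path: any local image you want the model to describe.
image = Image.open("photo.jpg")

# Build a chat-style prompt that pairs the image with a text question.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe what is happening in this photo."},
    ]}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, prompt, add_special_tokens=False, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(output[0], skip_special_tokens=True))
```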

Meta wasn't the first to introduce multimodal models; competitors such as OpenAI and Google got a head start by launching theirs last year. Even so, the addition of vision support is crucial for Meta's ongoing development of AI capabilities in hardware such as its Ray-Ban Meta smart glasses.

Llama 3.2 comprises two vision models (11 billion and 90 billion parameters) and two lightweight text-only models (1 billion and 3 billion parameters). These small language models (SLMs) have the exciting potential to run on mobile devices and tablets, as they don’t require the same extensive resources as large language models (LLMs).
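As a rough illustration of how lightweight the text-only variants are to work with, the sketch below loads an assumed 1-billion-parameter instruct checkpoint through the Hugging Face transformers pipeline. The model name and prompt are illustrative assumptions, and downloading a gated checkpoint may require a Hugging Face access token.

```python
import torch
from transformers import pipeline

# Assumed Hugging Face checkpoint name for the 1B instruction-tuned model.
model_id = "meta-llama/Llama-3.2-1B-Instruct"

# A 1B-parameter model in bfloat16 needs roughly 2-3 GB of memory,
# which is why these variants are plausible candidates for on-device use.
generator = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "Summarise in one sentence why small language models suit mobile devices."
result = generator(prompt, max_new_tokens=60, do_sample=False)
print(result[0]["generated_text"])
```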

Despite this exciting new release, Meta's Llama 3.1, which includes a 405-billion-parameter version, remains relevant: the latest, smaller models may struggle with complex text-generation tasks that still call for a larger model.