Google is introducing Gemma 4 12B, a new multimodal AI model designed to run locally on consumer hardware while supporting text, image, and audio inputs. Positioned between the company’s smaller Gemma E4B model and the larger 26B Mixture of Experts model, Gemma 4 12B is built to deliver advanced reasoning capabilities with a reduced memory footprint suitable for laptops.
Gemma 4 12B is the first mid-sized Gemma model to support native audio input. According to Google, the model can run locally on systems with 16GB of VRAM or unified memory, bringing multimodal AI capabilities and agent-focused workflows to devices without requiring cloud-based inference.
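To put the 16GB figure in context, the back-of-the-envelope sketch below estimates how much memory the weights alone would occupy at a few common precisions. It assumes "12B" means roughly 12 billion parameters and ignores the KV cache, activations, and runtime overhead. The article does not say which precision Google's 16GB figure assumes, so the precision table here is illustrative only.

```python
# Rough weight-only memory estimate. Assumptions (not from the article):
# "12B" is read as 12 billion parameters, and the precisions listed below
# are common choices, not Google's stated configuration.
PARAMS = 12e9

BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """Return the weight footprint in gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


def fits(budget_gb: float, params: float = PARAMS) -> dict:
    """Report, per precision, whether the weights alone fit in the budget."""
    return {
        name: weight_memory_gb(params, size) <= budget_gb
        for name, size in BYTES_PER_PARAM.items()
    }


if __name__ == "__main__":
    for name, size in BYTES_PER_PARAM.items():
        print(f"{name}: {weight_memory_gb(PARAMS, size):.1f} GB")
    print(fits(16.0))
```

Under these assumptions, full 16-bit weights would not fit in a 16GB budget, while 8-bit or 4-bit weights would leave headroom for the cache and activations.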
Google said Gemma models have now surpassed 150 million downloads, with developers using them for projects ranging from assistive robotics to enterprise security applications. The company is releasing Gemma 4 12B under an Apache 2.0 license and supporting deployment across a broad range of development tools and platforms.
A key feature of Gemma 4 12B is its unified architecture. Unlike many multimodal models that rely on separate encoders to process images and audio before passing data to a language model, Gemma 4 12B handles those inputs directly within the model’s core architecture. Google said this approach reduces memory requirements and helps lower latency.
For visual processing, the company replaced the traditional vision encoder with a lightweight embedding module that allows the language model backbone to perform image understanding tasks directly. Audio processing has been simplified further, with raw audio signals projected into the same dimensional space as text tokens rather than being routed through a dedicated audio encoder.
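The idea of projecting raw audio into the same space as text tokens can be illustrated with a toy example. The sketch below is not Google's implementation: the frame size, embedding width, random weights, and helper names (`frame_audio`, `project`) are all hypothetical, chosen only to show how a single linear projection can turn audio frames into vectors that sit alongside text embeddings in one sequence.

```python
import random

# Hypothetical sizes for illustration; real models use far larger values.
FRAME_SIZE = 4   # audio samples per frame
EMBED_DIM = 8    # width shared by text and audio vectors


def frame_audio(samples, frame_size):
    """Split a 1-D list of samples into non-overlapping frames, dropping any remainder."""
    count = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(count)]


def project(frames, weight):
    """Multiply each frame (length F) by weight (F x D), giving one D-dim vector per frame."""
    dim = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(frame)) for j in range(dim)]
        for frame in frames
    ]


rng = random.Random(0)
# Stand-in for a learned projection matrix (random here).
weight = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(FRAME_SIZE)]
# Stand-in for three text-token embeddings.
text_embeddings = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in range(3)]
# Stand-in for a short raw audio clip.
audio = [rng.uniform(-1, 1) for _ in range(18)]

audio_embeddings = project(frame_audio(audio, FRAME_SIZE), weight)
# Both modalities now share one width, so they can form a single input sequence.
sequence = text_embeddings + audio_embeddings
```

Because every vector in `sequence` has the same width, a single backbone could in principle consume text and audio together without a separate audio encoder, which is the design the article describes.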
Google said the model delivers benchmark performance approaching that of its larger 26B Mixture of Experts model while using less than half the memory. The company also equipped Gemma 4 12B with Multi-Token Prediction drafters, a feature designed to reduce response latency.
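The article does not explain how the Multi-Token Prediction drafters work. As general background, drafter-based decoding usually follows a draft-and-verify pattern: a cheap drafter proposes several tokens, and the main model keeps only the ones it agrees with. The sketch below shows that generic pattern with toy stand-in models (`toy_target`, `toy_drafter`, both hypothetical). It should not be read as Gemma's actual mechanism.

```python
from typing import Callable, List

# A "model" here is any function mapping a token context to the next token.
Model = Callable[[List[int]], int]


def speculative_step(target: Model, drafter: Model, context: List[int], k: int) -> List[int]:
    """Draft k tokens, keep the prefix the target agrees with, then add one target token."""
    draft: List[int] = []
    for _ in range(k):
        draft.append(drafter(context + draft))

    accepted: List[int] = []
    # A real system would check all drafted tokens in one batched pass;
    # this toy version calls the target sequentially for clarity.
    for token in draft:
        expected = target(context + accepted)
        if token != expected:
            accepted.append(expected)  # the target's token replaces the first mismatch
            return accepted
        accepted.append(token)
    accepted.append(target(context + accepted))  # extra token when every draft matched
    return accepted


def generate(target: Model, drafter: Model, context: List[int], n_tokens: int, k: int = 4) -> List[int]:
    """Generate n_tokens after context using repeated draft-and-verify steps."""
    out = list(context)
    while len(out) - len(context) < n_tokens:
        out.extend(speculative_step(target, drafter, out, k))
    return out[: len(context) + n_tokens]


def toy_target(ctx: List[int]) -> int:
    """Toy main model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[int]) -> int:
    """Toy drafter: agrees with the target except right after token 2."""
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10
```

In this greedy form, the output matches what the main model would have produced on its own. The latency benefit comes from accepting several drafted tokens per verification step whenever the drafter guesses correctly.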
Developers can access the model through platforms including Hugging Face, Kaggle, LM Studio, Ollama, Google AI Edge Gallery, and Google’s AI Edge Eloquent app. Support is also available through frameworks such as Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM.
Alongside the model release, Google is launching an official Gemma Skills Repository, a collection of skills intended to support agent development using Gemma models. The company is also offering deployment options through Google Cloud services, including Model Garden, Cloud Run, and Google Kubernetes Engine.
With Gemma 4 12B, Google is expanding the capabilities available to developers who want to build multimodal and agent-based AI applications locally, combining audio, image, and text processing in a model designed to run on everyday hardware.
About this article: This article was generated with AI assistance and reviewed by our editorial team to ensure it follows our editorial standards for accuracy and independence. We maintain strict fact-checking protocols and cite all sources.