Microsoft has introduced three new multimodal AI models—MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2—marking an expansion of its in-house AI development as it builds out its own model stack alongside its partnership with OpenAI.
The models, announced Thursday by Microsoft AI, span speech-to-text, voice generation, and visual content creation. MAI-Transcribe-1 supports transcription across 25 languages and, according to the company, operates 2.5 times faster than its Azure Fast alternative. MAI-Voice-1 can generate up to 60 seconds of audio in one second and includes tools for creating custom voices. MAI-Image-2, an image-generating model, launched on March 19 through the company’s MAI Playground and is now being rolled out more broadly.