
Multimodal Model

A model that can process and generate across multiple data types such as text, images, and audio.

Full Definition

Multimodal models accept and produce more than one modality — typically combining text with images, audio, video, or code within a single architecture. Early multimodal work stitched together separate encoders (e.g., CLIP for images, a language model for text), but modern models like GPT-4o, Gemini, and Claude 3 process all modalities through unified transformers. This enables cross-modal reasoning: answering questions about images, generating image captions, describing audio, or combining visual and textual instructions. Native multimodality is increasingly the default for frontier models because real-world tasks rarely involve text alone.
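Cross-modal prompting usually means packing more than one content type into a single request. As a minimal sketch, assuming an OpenAI-style chat format where a user message is a list of typed content parts (the function name and the model-agnostic payload shape here are illustrative, not a specific product's API):

```python
import base64

def build_multimodal_message(question: str, image_bytes: bytes) -> dict:
    """Combine a text question and an image into one chat message:
    a single user turn whose content mixes a text part and an image part."""
    image_b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_b64}"},
            },
        ],
    }

# A few fake bytes stand in for a real photo here.
fake_png = b"\x89PNG\r\n\x1a\n"
msg = build_multimodal_message(
    "What component is likely causing the short circuit?", fake_png
)
print(msg["content"][0]["type"], msg["content"][1]["type"])  # → text image_url
```

The key idea is that both modalities travel in the same message, so the model can reason over them jointly rather than handling the image and the question in separate passes.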

Examples

1. Uploading a photo of a broken circuit board to GPT-4 Vision and asking "What component is likely causing the short circuit?"

2. Using Gemini to transcribe and summarise a one-hour meeting recording, outputting structured meeting notes.


Related Terms

Vision-Language Model

A model capable of jointly reasoning over both images and text.


Large Language Model

A neural network with billions of parameters trained on text to understand and generate language.


Foundation Model

A large model trained on broad data that can be adapted to many downstream tasks.
