Multimodal
Expand your context by sending images, audio, and documents via native multimodal adapters for OpenAI and Anthropic.
How It Works
Multimodal support in Orka AI is built on the ChatMessage and ContentPart types. Instead of sending a plain string prompt, you compose messages with mixed content parts: text, images (URL or base64), and audio.
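For instance, a user message that mixes a text part and an image part can be composed as a plain object (a minimal sketch using the content-part shapes described on this page):

```typescript
// Minimal sketch of a mixed-content message using the ChatMessage /
// ContentPart shapes described on this page.
const message = {
  role: 'user',
  content: [
    { type: 'text', text: 'Describe this picture.' },
    { type: 'image_url', image_url: { url: 'https://example.com/cat.jpg', detail: 'auto' } }
  ]
};

console.log(message.content.length); // 2 content parts
```

Each element of `content` is one `ContentPart`; the adapter translates the array into the provider's native multimodal request format.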
- **Textual Content** (`text`): Standard UTF-8 plain text for prompts and system instructions.
- **Remote Vision** (`image_url`): Reference images via URL. Supports granularity control (`auto`, `low`, `high`).
- **Embedded Image** (`image_base64`): Direct binary encoding (PNG, JPEG, WebP). Ideal for transient data.
- **Aural Data** (`audio`): Native WAV/MP3 processing for speech-to-text or sound analysis (OpenAI).

# Image Analysis (URL)
The simplest way to analyze an image is to pass its URL. The LLM will download and process the image automatically. This works with both OpenAI (GPT-4o, GPT-4o-mini) and Anthropic (Claude 3.5 Sonnet, Claude 3 Opus).
```typescript
import { createOrka } from '@orka-js/core';
import { OpenAIAdapter } from '@orka-js/openai';

const orka = createOrka({
  llm: new OpenAIAdapter({
    apiKey: process.env.OPENAI_API_KEY!,
    model: 'gpt-4o' // Must use a vision-capable model
  })
});

const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What do you see in this image? Describe it in detail.' },
        {
          type: 'image_url',
          image_url: {
            url: 'https://example.com/photo.jpg',
            detail: 'high' // 'auto' | 'low' | 'high'
          }
        }
      ]
    }
  ]
});

console.log(result.content);
// "The image shows a sunset over the ocean with..."
```

The `detail` option controls how much image resolution the model uses:

- `auto`: The model decides the detail level based on the image size. Best default choice.
- `low`: Faster and cheaper. Uses a 512×512 thumbnail. Good for simple classification.
- `high`: Full-resolution analysis. Best for OCR, detailed descriptions, and small-text reading.
# Image Analysis (Base64)
For local files or dynamically generated images, encode them in base64. This avoids the need for a public URL and works with both OpenAI and Anthropic.
```typescript
import { readFileSync } from 'fs';

// Read local image file
const imageBuffer = readFileSync('./screenshot.png');
const base64Image = imageBuffer.toString('base64');

const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Extract all text from this screenshot.' },
        {
          type: 'image_base64',
          data: base64Image,
          mimeType: 'image/png' // 'image/png' | 'image/jpeg' | 'image/gif' | 'image/webp'
        }
      ]
    }
  ]
});

console.log(result.content);
```

# Multiple Images
You can send multiple images in a single message for comparison, analysis, or multi-page document processing.
```typescript
const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Compare these two UI designs. Which one is better and why?' },
        { type: 'image_url', image_url: { url: 'https://example.com/design-a.png', detail: 'high' } },
        { type: 'image_url', image_url: { url: 'https://example.com/design-b.png', detail: 'high' } }
      ]
    }
  ]
});
```

# Audio Input (OpenAI)
OpenAI's GPT-4o models support audio input. Send audio data in WAV or MP3 format for transcription, analysis, or voice-based interaction.
```typescript
import { readFileSync } from 'fs';

const audioBuffer = readFileSync('./recording.wav');
const base64Audio = audioBuffer.toString('base64');

const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Transcribe this audio and summarize the key points.' },
        {
          type: 'audio',
          data: base64Audio,
          format: 'wav' // 'wav' | 'mp3'
        }
      ]
    }
  ]
});

console.log(result.content);
// "The speaker discusses three main topics: ..."
```

⚠️ Audio Limitations
- Audio input is currently supported only by OpenAI (audio-capable GPT-4o models)
- Anthropic (Claude) does not currently accept audio input through this API
- Maximum audio length depends on the model and your API plan
# With System Prompt
Combine multimodal content with system prompts for specialized analysis tasks.
```typescript
const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'system',
      content: 'You are an expert radiologist. Analyze medical images with precision and provide structured reports.'
    },
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Please analyze this X-ray image.' },
        { type: 'image_url', image_url: { url: 'https://example.com/xray.jpg', detail: 'high' } }
      ]
    }
  ]
});
```

# Multi-turn Conversations
Build multi-turn conversations that reference previously shared images.
```typescript
const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Here is a photo of my living room.' },
        { type: 'image_url', image_url: { url: 'https://example.com/room.jpg' } }
      ]
    },
    {
      role: 'assistant',
      content: 'I can see a modern living room with a gray sofa, wooden coffee table...'
    },
    {
      role: 'user',
      content: 'What color should I paint the walls to complement the furniture?'
    }
  ]
});
```

Provider Compatibility
| Media Capability | Technical Context | Provider Availability |
|---|---|---|
| Image (URL / Base64) | Standard vision processing for OCR & analysis. | OpenAI, Anthropic, Mistral, Ollama |
| Audio Processing | Native speech analysis and sound recognition. | OpenAI only |
| Multi-Image Support | Comparative vision or multi-page document analysis. | OpenAI, Anthropic, Mistral, Ollama |
Use Cases
Document OCR
Extract high-accuracy text from scans, receipts, and handwritten notes using Vision LLMs.
UI/UX Analysis
Audit screenshots for accessibility, design consistency, and component mapping.
Chart Extraction
Convert visual charts and complex tables into structured JSON for analytical processing.
Voice Intelligence
Native audio transcription with context-aware summarization of meetings and memos.
TypeScript Types
```typescript
import type { ChatMessage, ContentPart } from 'orkajs';

// ChatMessage
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string | ContentPart[];
}

// ContentPart: union type
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string; detail?: 'auto' | 'low' | 'high' } }
  | { type: 'image_base64'; data: string; mimeType: 'image/png' | 'image/jpeg' | 'image/gif' | 'image/webp' }
  | { type: 'audio'; data: string; format: 'wav' | 'mp3' };
```

Best Practices
1. Choose the Right Detail Level
Use 'low' for simple classification tasks to save tokens and cost. Use 'high' for OCR and detailed analysis.
2. Optimize Image Size
Resize large images before sending to reduce token usage. Most models work well with images under 2048×2048.
3. Use Base64 for Sensitive Data
For private or sensitive images, use base64 encoding instead of URLs to avoid exposing data publicly.
Multimodal Package
For advanced multimodal workflows, use the dedicated @orka-js/multimodal package. It provides specialized agents, utilities for vision and audio processing, and cross-modal workflows.
```bash
npm install @orka-js/multimodal
```

# Vision Utilities
High-level functions for common vision tasks: image analysis, description, OCR, and comparison.
```typescript
import { analyzeImage, describeImage, extractTextFromImage, compareImages } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const llm = new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY!, model: 'gpt-4o' });

// Analyze an image
const analysis = await analyzeImage(llm, {
  type: 'url',
  url: 'https://example.com/photo.jpg'
}, { prompt: 'What emotions are expressed in this image?' });

console.log(analysis.analysis);
// "The image conveys a sense of joy and celebration..."

// Get structured description
const description = await describeImage(llm, {
  type: 'url',
  url: 'https://example.com/photo.jpg'
});

console.log(description);
// { description: "A sunset over the ocean", objects: ["sun", "ocean", "clouds"], colors: ["orange", "purple"], scene: "outdoor" }

// Extract text (OCR)
const ocr = await extractTextFromImage(llm, {
  type: 'base64',
  data: base64Image,
  mimeType: 'image/png'
});

console.log(ocr.text);
// "Invoice #12345\nDate: 2024-01-15..."

// Compare two images
const comparison = await compareImages(llm,
  { type: 'url', url: 'https://example.com/before.jpg' },
  { type: 'url', url: 'https://example.com/after.jpg' }
);

console.log(comparison.analysis);
// "The main differences are..."
```

# Audio Utilities (Whisper & TTS)
Transcribe audio with OpenAI Whisper and generate speech with TTS. The OpenAI adapter now includes built-in audio methods.
```typescript
import { OpenAIAdapter } from '@orka-js/openai';
import { transcribeAudio, synthesizeSpeech } from '@orka-js/multimodal';
import { readFileSync, writeFileSync } from 'fs';

const adapter = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  whisperModel: 'whisper-1',
  ttsModel: 'tts-1',
  ttsVoice: 'nova'
});

// Direct adapter methods
const transcription = await adapter.transcribe(
  readFileSync('./audio.wav'),
  { language: 'en', responseFormat: 'verbose_json' }
);

console.log(transcription.text);
// "Hello, this is a test recording..."
console.log(transcription.words);
// [{ word: "Hello", start: 0.0, end: 0.5 }, ...]

// Text-to-Speech
const audioBuffer = await adapter.textToSpeech(
  'Welcome to OrkaJS! This is a test of the text-to-speech feature.',
  { voice: 'nova', responseFormat: 'mp3', speed: 1.0 }
);

writeFileSync('./output.mp3', Buffer.from(audioBuffer));

// Using multimodal utilities
const result = await transcribeAudio(adapter, {
  type: 'base64',
  data: base64Audio,
  format: 'wav'
}, { includeTimestamps: true });

const speech = await synthesizeSpeech(adapter, 'Hello world!', {
  voice: 'alloy',
  format: 'mp3'
});
```

# VisionAgent
A specialized agent for image understanding tasks with batch processing support.
```typescript
import { VisionAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const agent = new VisionAgent({
  llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY!, model: 'gpt-4o' }),
  systemPrompt: 'You are an expert image analyst.',
  detail: 'high',
  temperature: 0.3
});

// Ask questions about images
const answer = await agent.ask(
  { type: 'url', url: 'https://example.com/chart.png' },
  'What trend does this chart show?'
);

// Run batch tasks
const results = await agent.runTasks([
  { type: 'analyze', image: { type: 'url', url: 'https://example.com/1.jpg' } },
  { type: 'ocr', image: { type: 'url', url: 'https://example.com/document.png' } },
  { type: 'describe', image: { type: 'url', url: 'https://example.com/photo.jpg' } }
]);

results.forEach(r => console.log(r.task, r.result));
```

# AudioAgent
A specialized agent for audio processing: transcription, text-to-speech, and audio workflows.
```typescript
import { AudioAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const agent = new AudioAgent({
  adapter: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
  defaultLanguage: 'en',
  defaultVoice: 'nova',
  defaultFormat: 'mp3'
});

// Transcribe audio
const transcription = await agent.transcribe({
  type: 'url',
  url: 'https://example.com/meeting.mp3'
});

console.log(transcription.result);
// "In today's meeting, we discussed..."

// Generate speech
const speech = await agent.speak('Hello, how can I help you today?');
// speech.result is an ArrayBuffer

// Transcribe and process
const processed = await agent.transcribeAndProcess(
  { type: 'base64', data: audioBase64 },
  async (text) => {
    // Process the transcription (e.g., summarize with an LLM)
    return `Summary: ${text.slice(0, 100)}...`;
  }
);
```

# MultimodalAgent
Combines vision and audio capabilities for complex multimodal workflows. Automatically transcribes audio and processes images together.
```typescript
import { MultimodalAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const llm = new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY!, model: 'gpt-4o' });

const agent = new MultimodalAgent({
  llm,
  audioAdapter: llm, // OpenAI adapter supports both
  systemPrompt: 'You are a helpful multimodal assistant.',
  maxTokens: 2048
});

// Process mixed inputs
const result = await agent.process({
  text: 'Analyze this image and audio together.',
  images: [{ type: 'url', url: 'https://example.com/presentation.png' }],
  audio: [{ type: 'base64', data: audioBase64 }]
});

console.log(result.response);
// "Based on the presentation slide and the audio explanation..."
console.log(result.transcriptions);
// ["The speaker explains that..."]

// Ask with context
const answer = await agent.ask(
  'What are the key points?',
  {
    images: [{ type: 'url', url: 'https://example.com/slide1.png' }],
    audio: [{ type: 'base64', data: voiceNote }]
  }
);

// Analyze multiple images
const imageAnalysis = await agent.analyzeImages(
  [
    { type: 'url', url: 'https://example.com/before.jpg' },
    { type: 'url', url: 'https://example.com/after.jpg' }
  ],
  'Compare these two images and describe the changes.'
);
```

✅ Capability Detection
Use helper functions to check if an adapter supports specific capabilities:
```typescript
import { isVisionCapable, isAudioCapable } from '@orka-js/multimodal';

if (isVisionCapable(llm)) {
  // Safe to use vision features
  const result = await analyzeImage(llm, image);
}

if (isAudioCapable(adapter)) {
  // Safe to use audio features
  const transcription = await transcribeAudio(adapter, audio);
}
```