Multimodal Processing
Build applications that understand images, audio, and text together using VisionAgent, AudioAgent, and MultimodalAgent.
Multimodal AI combines vision, audio, and text processing to create rich, context-aware applications. OrkaJS provides specialized agents and utilities for each modality.
[Diagram: OrkaJS multimodal pipeline architecture — 📸 image, 🎙️ audio, and 📝 text inputs flow into the VisionAgent (analyzeImage(), extractText(), describeImage()) and the AudioAgent (transcribe() and speak(), backed by the Whisper API). The MultimodalAgent combines vision + audio + text context and sends it to a vision-capable LLM (GPT-4o / Claude 3.5) to produce a ✨ multimodal response.]
- 🖼️ Vision — image analysis, OCR, comparison
- 🎙️ Audio — Whisper transcription, TTS
- 🔀 Cross-modal — combined vision + audio workflows
Document Analysis with VisionAgent
Extract text from and analyze document images using the VisionAgent.

```typescript
import { VisionAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';
import { readFileSync } from 'fs';

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o'
});

const visionAgent = new VisionAgent({
  llm,
  systemPrompt: 'You are an expert document analyst. Extract information accurately.',
  detail: 'high',
  temperature: 0.1
});

// Process an invoice
const invoiceImage = readFileSync('./invoice.png');
const base64 = invoiceImage.toString('base64');

const ocrResult = await visionAgent.extractText({
  type: 'base64',
  data: base64,
  mimeType: 'image/png'
});

console.log('Extracted text:', ocrResult.result);

// Ask specific questions about the document
const answer = await visionAgent.ask(
  { type: 'base64', data: base64, mimeType: 'image/png' },
  'What is the total amount and due date on this invoice?'
);

console.log('Invoice details:', answer);

// Batch process multiple documents
const results = await visionAgent.runTasks([
  { type: 'ocr', image: { type: 'url', url: 'https://example.com/doc1.png' } },
  { type: 'ocr', image: { type: 'url', url: 'https://example.com/doc2.png' } },
  { type: 'describe', image: { type: 'url', url: 'https://example.com/chart.png' } }
]);

results.forEach((r, i) => {
  console.log(`Document ${i + 1} (${r.task}):`, r.result);
});
```

Meeting Transcription with AudioAgent
Transcribe meetings and generate audio responses using the AudioAgent.
```typescript
import { AudioAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';
import { readFileSync, writeFileSync } from 'fs';

const adapter = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  whisperModel: 'whisper-1',
  ttsModel: 'tts-1-hd',
  ttsVoice: 'nova'
});

const audioAgent = new AudioAgent({
  adapter,
  defaultLanguage: 'en',
  defaultVoice: 'nova',
  defaultFormat: 'mp3'
});

// Transcribe a meeting recording
const meetingAudio = readFileSync('./meeting.mp3');
const transcription = await audioAgent.transcribe(
  { type: 'buffer', data: meetingAudio.buffer },
  { includeTimestamps: true }
);

console.log('Meeting transcript:', transcription.result);
console.log('Duration:', transcription.metadata?.duration, 'seconds');

// Generate a voice summary
const summaryText = 'The meeting covered three main topics: Q4 results, 2024 roadmap, and team expansion.';
const voiceSummary = await audioAgent.speak(summaryText, {
  voice: 'onyx',
  speed: 1.1
});

writeFileSync('./meeting-summary.mp3', Buffer.from(voiceSummary.result as ArrayBuffer));

// Transcribe and process in one step
const processed = await audioAgent.transcribeAndProcess(
  { type: 'buffer', data: meetingAudio.buffer },
  async (text) => {
    // You could use an LLM here to summarize
    const textSentences = text.split('. ');
    return `Key points (${textSentences.length} sentences): ${textSentences.slice(0, 3).join('. ')}...`;
  }
);

console.log('Processed:', processed.processed);
```

Presentation Analysis with MultimodalAgent
Analyze presentations by combining slides (images) with speaker notes (audio).
```typescript
import { MultimodalAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o'
});

const multimodalAgent = new MultimodalAgent({
  llm,
  audioAdapter: llm,
  systemPrompt: `You are an expert presentation analyst. Analyze slides and speaker audio together to provide comprehensive insights.
Focus on: key messages, data points, and recommendations.`,
  maxTokens: 4096
});

// Analyze a presentation with slides and audio
const result = await multimodalAgent.process({
  text: 'Analyze this presentation. What are the key takeaways?',
  images: [
    { type: 'url', url: 'https://example.com/slide1.png' },
    { type: 'url', url: 'https://example.com/slide2.png' },
    { type: 'url', url: 'https://example.com/slide3.png' }
  ],
  audio: [
    // speakerAudioBase64: base64-encoded speaker recording, loaded elsewhere
    { type: 'base64', data: speakerAudioBase64 }
  ]
});

console.log('Analysis:', result.response);
console.log('Transcribed audio:', result.transcriptions);
console.log('Tokens used:', result.usage.totalTokens);

// Follow-up questions
const followUp = await multimodalAgent.ask(
  'What specific metrics were mentioned in the presentation?',
  { images: [{ type: 'url', url: 'https://example.com/slide2.png' }] }
);

console.log('Metrics:', followUp);

// Compare before/after slides
const comparison = await multimodalAgent.analyzeImages(
  [
    { type: 'url', url: 'https://example.com/q3-results.png' },
    { type: 'url', url: 'https://example.com/q4-results.png' }
  ],
  'Compare Q3 and Q4 performance. What improved and what declined?'
);

console.log('Comparison:', comparison);
```

Customer Support Bot
Build a support bot that can understand screenshots, voice messages, and text.
```typescript
import { MultimodalAgent, isVisionCapable, isAudioCapable } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o'
});

// Verify capabilities
console.log('Vision support:', isVisionCapable(llm));
console.log('Audio support:', isAudioCapable(llm));

const supportBot = new MultimodalAgent({
  llm,
  audioAdapter: llm,
  systemPrompt: `You are a helpful customer support agent for a software product.
When users share screenshots, identify the issue and provide step-by-step solutions.
When users share voice messages, transcribe and respond appropriately.
Be concise, friendly, and solution-oriented.`
});

// Handle a support request with screenshot
async function handleSupportRequest(request: {
  text?: string;
  screenshot?: string;   // base64
  voiceMessage?: string; // base64
}) {
  const images = request.screenshot
    ? [{ type: 'base64' as const, data: request.screenshot, mimeType: 'image/png' as const }]
    : undefined;

  const audio = request.voiceMessage
    ? [{ type: 'base64' as const, data: request.voiceMessage }]
    : undefined;

  const result = await supportBot.process({
    text: request.text || 'Please help me with this issue.',
    images,
    audio
  });

  return {
    response: result.response,
    transcription: result.transcriptions?.[0],
    processingTime: result.latencyMs
  };
}

// Example usage
const response = await handleSupportRequest({
  text: 'I keep getting this error when I try to export',
  // errorScreenshotBase64: base64-encoded screenshot captured elsewhere
  screenshot: errorScreenshotBase64
});

console.log('Support response:', response.response);
```

💡 Tips for Production
- Use detail: 'low' for simple classification and 'high' for OCR
- Compress images before sending to reduce costs
- Cache transcriptions for repeated audio content
- Use isVisionCapable() and isAudioCapable() to check adapter support
- Set appropriate timeouts for large audio files
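The transcription-caching tip above can be sketched as a small content-addressed cache: key each audio buffer by its SHA-256 hash and only call the transcription backend on a miss. This is an illustrative helper built on Node's standard library, not part of the OrkaJS API — `TranscriptionCache` and its method names are hypothetical; you would pass `audioAgent.transcribe(...)` as the callback.

```typescript
import { createHash } from 'crypto';

// Hypothetical helper: caches transcripts keyed by a hash of the audio bytes,
// so identical audio uploaded twice only hits the Whisper API once.
class TranscriptionCache {
  private store = new Map<string, string>();

  // Derive a stable cache key from the raw audio bytes.
  key(audio: Buffer): string {
    return createHash('sha256').update(audio).digest('hex');
  }

  // Return the cached transcript, or invoke the supplied transcriber and cache it.
  async getOrTranscribe(
    audio: Buffer,
    transcribe: (audio: Buffer) => Promise<string>
  ): Promise<{ text: string; cached: boolean }> {
    const k = this.key(audio);
    const hit = this.store.get(k);
    if (hit !== undefined) return { text: hit, cached: true };
    const text = await transcribe(audio); // e.g. wrap audioAgent.transcribe(...)
    this.store.set(k, text);
    return { text, cached: false };
  }
}
```

For production you would swap the in-memory `Map` for Redis or a database, but the hashing scheme stays the same.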
Full Example: Image-to-Audio Pipeline
```typescript
import { VisionAgent, AudioAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';
import { writeFileSync } from 'fs';

const adapter = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o',
  ttsModel: 'tts-1-hd',
  ttsVoice: 'nova'
});

const visionAgent = new VisionAgent({ llm: adapter });
const audioAgent = new AudioAgent({ adapter });

// Pipeline: Image → Description → Audio
async function imageToAudio(imageUrl: string): Promise<ArrayBuffer> {
  // Step 1: Analyze the image
  const description = await visionAgent.describe({ type: 'url', url: imageUrl });
  console.log('Image description:', description.result);

  // Step 2: Generate audio narration
  const narration = typeof description.result === 'object'
    ? (description.result as { description: string }).description
    : String(description.result);

  const audio = await audioAgent.speak(
    `This image shows: ${narration}`,
    { voice: 'nova', speed: 0.9 }
  );

  return audio.result as ArrayBuffer;
}

// Usage
const audioBuffer = await imageToAudio('https://example.com/landscape.jpg');
writeFileSync('./image-narration.mp3', Buffer.from(audioBuffer));
console.log('Audio narration saved!');
```