Multimodal Processing
Build applications that understand images, audio, and text together using VisionAgent, AudioAgent, and MultimodalAgent.
Multimodal AI combines vision, audio, and text processing to create rich, context-aware applications. OrkaJS provides specialized agents and utilities for each modality.
[Diagram: OrkaJS multimodal pipeline architecture — 📸 image, 🎙️ audio, and 📝 text inputs flow into the VisionAgent (analyzeImage(), extractText(), describeImage()) and the AudioAgent (transcribe() and speak(), backed by the Whisper API). The MultimodalAgent combines vision + audio + text context and sends it to a vision-capable LLM (GPT-4o / Claude 3.5) to produce a ✨ multimodal response.]
- 🖼️ Vision — image analysis, OCR, comparison
- 🎙️ Audio — Whisper transcription, TTS
- 🔀 Cross-modal — combined vision + audio workflows
Document Analysis with VisionAgent
Extract text from and analyze document images using the VisionAgent.

```typescript
import { VisionAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';
import { readFileSync } from 'fs';

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o'
});

const visionAgent = new VisionAgent({
  llm,
  systemPrompt: 'You are an expert document analyst. Extract information accurately.',
  detail: 'high',
  temperature: 0.1
});

// Process an invoice
const invoiceImage = readFileSync('./invoice.png');
const base64 = invoiceImage.toString('base64');

const ocrResult = await visionAgent.extractText({
  type: 'base64',
  data: base64,
  mimeType: 'image/png'
});

console.log('Extracted text:', ocrResult.result);

// Ask specific questions about the document
const answer = await visionAgent.ask(
  { type: 'base64', data: base64, mimeType: 'image/png' },
  'What is the total amount and due date on this invoice?'
);

console.log('Invoice details:', answer);

// Batch process multiple documents
const results = await visionAgent.runTasks([
  { type: 'ocr', image: { type: 'url', url: 'https://example.com/doc1.png' } },
  { type: 'ocr', image: { type: 'url', url: 'https://example.com/doc2.png' } },
  { type: 'describe', image: { type: 'url', url: 'https://example.com/chart.png' } }
]);

results.forEach((r, i) => {
  console.log(`Document ${i + 1} (${r.task}):`, r.result);
});
```

Meeting Transcription with AudioAgent
Transcribe meetings and generate audio responses using the AudioAgent.
```typescript
import { AudioAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';
import { readFileSync, writeFileSync } from 'fs';

const adapter = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  whisperModel: 'whisper-1',
  ttsModel: 'tts-1-hd',
  ttsVoice: 'nova'
});

const audioAgent = new AudioAgent({
  adapter,
  defaultLanguage: 'en',
  defaultVoice: 'nova',
  defaultFormat: 'mp3'
});

// Transcribe a meeting recording
const meetingAudio = readFileSync('./meeting.mp3');
const transcription = await audioAgent.transcribe(
  { type: 'buffer', data: meetingAudio.buffer },
  { includeTimestamps: true }
);

console.log('Meeting transcript:', transcription.result);
console.log('Duration:', transcription.metadata?.duration, 'seconds');

// Generate a voice summary
const summaryText = 'The meeting covered three main topics: Q4 results, 2024 roadmap, and team expansion.';
const voiceSummary = await audioAgent.speak(summaryText, {
  voice: 'onyx',
  speed: 1.1
});

writeFileSync('./meeting-summary.mp3', Buffer.from(voiceSummary.result as ArrayBuffer));

// Transcribe and process in one step
const processed = await audioAgent.transcribeAndProcess(
  { type: 'buffer', data: meetingAudio.buffer },
  async (text) => {
    // You could use an LLM here to summarize
    const textSentences = text.split('. ');
    return `Key points (${textSentences.length} sentences): ${textSentences.slice(0, 3).join('. ')}...`;
  }
);

console.log('Processed:', processed.processed);
```

Presentation Analysis with MultimodalAgent
Analyze presentations by combining slides (images) with speaker notes (audio).
```typescript
import { MultimodalAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o'
});

const multimodalAgent = new MultimodalAgent({
  llm,
  audioAdapter: llm,
  systemPrompt: `You are an expert presentation analyst. Analyze slides and speaker audio together to provide comprehensive insights.
Focus on: key messages, data points, and recommendations.`,
  maxTokens: 4096
});

// Analyze a presentation with slides and audio
const result = await multimodalAgent.process({
  text: 'Analyze this presentation. What are the key takeaways?',
  images: [
    { type: 'url', url: 'https://example.com/slide1.png' },
    { type: 'url', url: 'https://example.com/slide2.png' },
    { type: 'url', url: 'https://example.com/slide3.png' }
  ],
  audio: [
    // speakerAudioBase64: base64-encoded speaker recording, loaded elsewhere
    { type: 'base64', data: speakerAudioBase64 }
  ]
});

console.log('Analysis:', result.response);
console.log('Transcribed audio:', result.transcriptions);
console.log('Tokens used:', result.usage.totalTokens);

// Follow-up questions
const followUp = await multimodalAgent.ask(
  'What specific metrics were mentioned in the presentation?',
  { images: [{ type: 'url', url: 'https://example.com/slide2.png' }] }
);

console.log('Metrics:', followUp);

// Compare before/after slides
const comparison = await multimodalAgent.analyzeImages(
  [
    { type: 'url', url: 'https://example.com/q3-results.png' },
    { type: 'url', url: 'https://example.com/q4-results.png' }
  ],
  'Compare Q3 and Q4 performance. What improved and what declined?'
);

console.log('Comparison:', comparison);
```

Customer Support Bot
Build a support bot that can understand screenshots, voice messages, and text.
```typescript
import { MultimodalAgent, isVisionCapable, isAudioCapable } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o'
});

// Verify capabilities
console.log('Vision support:', isVisionCapable(llm));
console.log('Audio support:', isAudioCapable(llm));

const supportBot = new MultimodalAgent({
  llm,
  audioAdapter: llm,
  systemPrompt: `You are a helpful customer support agent for a software product.
When users share screenshots, identify the issue and provide step-by-step solutions.
When users share voice messages, transcribe and respond appropriately.
Be concise, friendly, and solution-oriented.`
});

// Handle a support request with screenshot
async function handleSupportRequest(request: {
  text?: string;
  screenshot?: string;   // base64
  voiceMessage?: string; // base64
}) {
  const images = request.screenshot
    ? [{ type: 'base64' as const, data: request.screenshot, mimeType: 'image/png' as const }]
    : undefined;

  const audio = request.voiceMessage
    ? [{ type: 'base64' as const, data: request.voiceMessage }]
    : undefined;

  const result = await supportBot.process({
    text: request.text || 'Please help me with this issue.',
    images,
    audio
  });

  return {
    response: result.response,
    transcription: result.transcriptions?.[0],
    processingTime: result.latencyMs
  };
}

// Example usage
const response = await handleSupportRequest({
  text: 'I keep getting this error when I try to export',
  // errorScreenshotBase64: base64-encoded screenshot captured elsewhere
  screenshot: errorScreenshotBase64
});

console.log('Support response:', response.response);
```

💡 Tips for Production
- Use detail: 'low' for simple classification and 'high' for OCR
- Compress images before sending to reduce costs
- Cache transcriptions for repeated audio content
- Use isVisionCapable() and isAudioCapable() to check adapter support
- Set appropriate timeouts for large audio files
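The transcription-caching tip above can be sketched as a small content-addressed cache: key each audio buffer by its SHA-256 hash and only call the transcription backend on a miss. This is an illustrative helper built on Node's standard library, not part of the OrkaJS API — `TranscriptionCache` and its method names are hypothetical; you would pass `audioAgent.transcribe(...)` as the callback.

```typescript
import { createHash } from 'crypto';

// Hypothetical helper: caches transcripts keyed by a hash of the audio bytes,
// so identical audio uploaded twice only hits the Whisper API once.
class TranscriptionCache {
  private store = new Map<string, string>();

  // Derive a stable cache key from the raw audio bytes.
  key(audio: Buffer): string {
    return createHash('sha256').update(audio).digest('hex');
  }

  // Return the cached transcript, or invoke the supplied transcriber and cache it.
  async getOrTranscribe(
    audio: Buffer,
    transcribe: (audio: Buffer) => Promise<string>
  ): Promise<{ text: string; cached: boolean }> {
    const k = this.key(audio);
    const hit = this.store.get(k);
    if (hit !== undefined) return { text: hit, cached: true };
    const text = await transcribe(audio); // e.g. wrap audioAgent.transcribe(...)
    this.store.set(k, text);
    return { text, cached: false };
  }
}
```

For production you would swap the in-memory `Map` for Redis or a database, but the hashing scheme stays the same.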
Full Example: Image-to-Audio Pipeline
```typescript
import { VisionAgent, AudioAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';
import { writeFileSync } from 'fs';

const adapter = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o',
  ttsModel: 'tts-1-hd',
  ttsVoice: 'nova'
});

const visionAgent = new VisionAgent({ llm: adapter });
const audioAgent = new AudioAgent({ adapter });

// Pipeline: Image → Description → Audio
async function imageToAudio(imageUrl: string): Promise<ArrayBuffer> {
  // Step 1: Analyze the image
  const description = await visionAgent.describe({ type: 'url', url: imageUrl });
  console.log('Image description:', description.result);

  // Step 2: Generate audio narration
  const narration = typeof description.result === 'object'
    ? (description.result as { description: string }).description
    : String(description.result);

  const audio = await audioAgent.speak(
    `This image shows: ${narration}`,
    { voice: 'nova', speed: 0.9 }
  );

  return audio.result as ArrayBuffer;
}

// Usage
const audioBuffer = await imageToAudio('https://example.com/landscape.jpg');
writeFileSync('./image-narration.mp3', Buffer.from(audioBuffer));
console.log('Audio narration saved!');
```