OrkaJS

Multimodal Processing

Build applications that understand images, audio, and text together using VisionAgent, AudioAgent, and MultimodalAgent.

Multimodal AI combines vision, audio, and text processing to create rich, context-aware applications. OrkaJS provides specialized agents and utilities for each modality.

ORKA — MULTIMODAL PIPELINE ARCHITECTURE

[Diagram: 📸 image, 🎙️ audio, and 📝 text inputs feed the VisionAgent (analyzeImage, extractText, describeImage) and the AudioAgent (transcribe, speak, via the Whisper API); the MultimodalAgent combines vision, audio, and text context through a vision-capable LLM (GPT-4o / Claude 3.5) to produce a multimodal response.]

  • 🖼️ Vision: image analysis, OCR, comparison
  • 🎙️ Audio: Whisper transcription, TTS
  • 🔀 Cross-modal: combined vision + audio workflows

Document Analysis with VisionAgent

Extract text and analyze documents from images using the VisionAgent.

import { VisionAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';
import { readFileSync } from 'fs';

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o'
});

const visionAgent = new VisionAgent({
  llm,
  systemPrompt: 'You are an expert document analyst. Extract information accurately.',
  detail: 'high',
  temperature: 0.1
});

// Process an invoice
const invoiceImage = readFileSync('./invoice.png');
const base64 = invoiceImage.toString('base64');

const ocrResult = await visionAgent.extractText({
  type: 'base64',
  data: base64,
  mimeType: 'image/png'
});

console.log('Extracted text:', ocrResult.result);

// Ask specific questions about the document
const answer = await visionAgent.ask(
  { type: 'base64', data: base64, mimeType: 'image/png' },
  'What is the total amount and due date on this invoice?'
);

console.log('Invoice details:', answer);

// Batch process multiple documents
const results = await visionAgent.runTasks([
  { type: 'ocr', image: { type: 'url', url: 'https://example.com/doc1.png' } },
  { type: 'ocr', image: { type: 'url', url: 'https://example.com/doc2.png' } },
  { type: 'describe', image: { type: 'url', url: 'https://example.com/chart.png' } }
]);

results.forEach((r, i) => {
  console.log(`Document ${i + 1} (${r.task}):`, r.result);
});

Meeting Transcription with AudioAgent

Transcribe meetings and generate audio responses using the AudioAgent.

import { AudioAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';
import { readFileSync, writeFileSync } from 'fs';

const adapter = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  whisperModel: 'whisper-1',
  ttsModel: 'tts-1-hd',
  ttsVoice: 'nova'
});

const audioAgent = new AudioAgent({
  adapter,
  defaultLanguage: 'en',
  defaultVoice: 'nova',
  defaultFormat: 'mp3'
});

// Transcribe a meeting recording.
// Slice to the file's exact byte range: a Buffer's underlying ArrayBuffer can
// be larger than the data itself when Node pools small allocations.
const meetingAudio = readFileSync('./meeting.mp3');
const audioData = meetingAudio.buffer.slice(
  meetingAudio.byteOffset,
  meetingAudio.byteOffset + meetingAudio.byteLength
);

const transcription = await audioAgent.transcribe(
  { type: 'buffer', data: audioData },
  { includeTimestamps: true }
);

console.log('Meeting transcript:', transcription.result);
console.log('Duration:', transcription.metadata?.duration, 'seconds');

// Generate a voice summary
const summaryText = 'The meeting covered three main topics: Q4 results, 2024 roadmap, and team expansion.';
const voiceSummary = await audioAgent.speak(summaryText, {
  voice: 'onyx',
  speed: 1.1
});

writeFileSync('./meeting-summary.mp3', Buffer.from(voiceSummary.result as ArrayBuffer));

// Transcribe and process in one step
const processed = await audioAgent.transcribeAndProcess(
  { type: 'buffer', data: audioData },
  async (text) => {
    // You could use an LLM here to summarize
    const sentences = text.split('. ');
    return `Key points (${sentences.length} sentences): ${sentences.slice(0, 3).join('. ')}...`;
  }
);

console.log('Processed:', processed.processed);
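Transcription of long recordings can take a while, so it is worth bounding calls like the ones above with a client-side timeout. A minimal sketch (`withTimeout` is a hypothetical helper, not part of OrkaJS):

```typescript
// Reject if `promise` has not settled within `ms` milliseconds.
// `withTimeout` is an illustrative helper, not an OrkaJS export.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`Operation timed out after ${ms} ms`)),
      ms
    );
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}
```

You would then wrap a call such as `audioAgent.transcribe(...)` in `withTimeout(..., 120_000)` to fail fast instead of hanging on a stalled upload.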

Presentation Analysis with MultimodalAgent

Analyze presentations by combining slides (images) with speaker notes (audio).

import { MultimodalAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o'
});

const multimodalAgent = new MultimodalAgent({
  llm,
  audioAdapter: llm,
  systemPrompt: `You are an expert presentation analyst.
Analyze slides and speaker audio together to provide comprehensive insights.
Focus on: key messages, data points, and recommendations.`,
  maxTokens: 4096
});

// Analyze a presentation with slides and audio.
// `speakerAudioBase64` is assumed to hold the base64-encoded speaker recording.
const result = await multimodalAgent.process({
  text: 'Analyze this presentation. What are the key takeaways?',
  images: [
    { type: 'url', url: 'https://example.com/slide1.png' },
    { type: 'url', url: 'https://example.com/slide2.png' },
    { type: 'url', url: 'https://example.com/slide3.png' }
  ],
  audio: [
    { type: 'base64', data: speakerAudioBase64 }
  ]
});

console.log('Analysis:', result.response);
console.log('Transcribed audio:', result.transcriptions);
console.log('Tokens used:', result.usage.totalTokens);

// Follow-up questions
const followUp = await multimodalAgent.ask(
  'What specific metrics were mentioned in the presentation?',
  {
    images: [{ type: 'url', url: 'https://example.com/slide2.png' }]
  }
);

console.log('Metrics:', followUp);

// Compare before/after slides
const comparison = await multimodalAgent.analyzeImages(
  [
    { type: 'url', url: 'https://example.com/q3-results.png' },
    { type: 'url', url: 'https://example.com/q4-results.png' }
  ],
  'Compare Q3 and Q4 performance. What improved and what declined?'
);

console.log('Comparison:', comparison);

Customer Support Bot

Build a support bot that can understand screenshots, voice messages, and text.

import { MultimodalAgent, isVisionCapable, isAudioCapable } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const llm = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o'
});

// Verify capabilities
console.log('Vision support:', isVisionCapable(llm));
console.log('Audio support:', isAudioCapable(llm));

const supportBot = new MultimodalAgent({
  llm,
  audioAdapter: llm,
  systemPrompt: `You are a helpful customer support agent for a software product.
When users share screenshots, identify the issue and provide step-by-step solutions.
When users share voice messages, transcribe and respond appropriately.
Be concise, friendly, and solution-oriented.`
});

// Handle a support request with an optional screenshot and voice message
async function handleSupportRequest(request: {
  text?: string;
  screenshot?: string; // base64
  voiceMessage?: string; // base64
}) {
  const images = request.screenshot
    ? [{ type: 'base64' as const, data: request.screenshot, mimeType: 'image/png' as const }]
    : undefined;

  const audio = request.voiceMessage
    ? [{ type: 'base64' as const, data: request.voiceMessage }]
    : undefined;

  const result = await supportBot.process({
    text: request.text || 'Please help me with this issue.',
    images,
    audio
  });

  return {
    response: result.response,
    transcription: result.transcriptions?.[0],
    processingTime: result.latencyMs
  };
}

// Example usage.
// `errorScreenshotBase64` is assumed to hold a base64-encoded screenshot.
const response = await handleSupportRequest({
  text: 'I keep getting this error when I try to export',
  screenshot: errorScreenshotBase64
});

console.log('Support response:', response.response);

💡 Tips for Production

  • Use 'low' detail for simple classification, 'high' for OCR
  • Compress images before sending to reduce costs
  • Cache transcriptions for repeated audio content
  • Use isVisionCapable() and isAudioCapable() to check adapter support
  • Set appropriate timeouts for large audio files
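The caching tip can be as simple as keying transcripts by a content hash of the audio bytes, so identical uploads never hit the API twice. A minimal in-memory sketch (the `Map` and `transcribeCached` helper are illustrative, not part of OrkaJS):

```typescript
import { createHash } from 'node:crypto';

// In-memory transcription cache keyed by a SHA-256 hash of the audio bytes.
// Swap the Map for Redis or disk storage in production.
const transcriptionCache = new Map<string, string>();

async function transcribeCached(
  audio: Buffer,
  transcribe: (audio: Buffer) => Promise<string>
): Promise<string> {
  const key = createHash('sha256').update(audio).digest('hex');
  const hit = transcriptionCache.get(key);
  if (hit !== undefined) return hit; // repeated content: skip the API call
  const text = await transcribe(audio);
  transcriptionCache.set(key, text);
  return text;
}
```

In practice `transcribe` would wrap `audioAgent.transcribe(...)` and return the transcript text.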

Full Example: Image-to-Audio Pipeline

import { VisionAgent, AudioAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';
import { writeFileSync } from 'fs';

const adapter = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o',
  ttsModel: 'tts-1-hd',
  ttsVoice: 'nova'
});

const visionAgent = new VisionAgent({ llm: adapter });
const audioAgent = new AudioAgent({ adapter });

// Pipeline: Image → Description → Audio
async function imageToAudio(imageUrl: string): Promise<ArrayBuffer> {
  // Step 1: Analyze the image
  const description = await visionAgent.describe({ type: 'url', url: imageUrl });

  console.log('Image description:', description.result);

  // Step 2: Generate audio narration
  const narration = typeof description.result === 'object'
    ? (description.result as { description: string }).description
    : String(description.result);

  const audio = await audioAgent.speak(
    `This image shows: ${narration}`,
    { voice: 'nova', speed: 0.9 }
  );

  return audio.result as ArrayBuffer;
}

// Usage
const audioBuffer = await imageToAudio('https://example.com/landscape.jpg');
writeFileSync('./image-narration.mp3', Buffer.from(audioBuffer));
console.log('Audio narration saved!');