OrkaJS

Multimodal

Expand your context by sending images, audio, and documents via native multimodal adapters for OpenAI and Anthropic.

How It Works

Multimodal support in OrkaJS is built on the ChatMessage and ContentPart types. Instead of sending a plain string prompt, you compose messages from mixed content parts: text, images (URL or base64), and audio.

There are four content part types:

  • `text` (Textual Content): standard UTF-8 plain text for prompts and system instructions.
  • `image_url` (Remote Vision): reference images via URL; supports granularity control (auto, low, high).
  • `image_base64` (Embedded Image): direct binary encoding (PNG, JPEG, WebP); ideal for transient data.
  • `audio` (Aural Data): native WAV/MP3 processing for speech-to-text or sound analysis (OpenAI).
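Put together, a user turn that mixes text and an image is simply an array of parts. Here is a minimal sketch; the full ChatMessage and ContentPart shapes are listed under TypeScript Types later on this page.

```typescript
// Minimal sketch: a user message mixing a text part and an image part.
// The shapes mirror the ChatMessage / ContentPart types documented below.
const message = {
  role: 'user' as const,
  content: [
    { type: 'text' as const, text: 'What is in this picture?' },
    {
      type: 'image_url' as const,
      image_url: { url: 'https://example.com/photo.jpg', detail: 'auto' as const }
    }
  ]
};

console.log(message.content.length);
// 2
```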

# Image Analysis (URL)

The simplest way to analyze an image is to pass its URL. The LLM will download and process the image automatically. This works with both OpenAI (GPT-4o, GPT-4o-mini) and Anthropic (Claude 3.5 Sonnet, Claude 3 Opus).

import { createOrka } from '@orka-js/core';
import { OpenAIAdapter } from 'orkajs';

const orka = createOrka({
  llm: new OpenAIAdapter({
    apiKey: process.env.OPENAI_API_KEY!,
    model: 'gpt-4o' // Must use a vision-capable model
  })
});

const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'What do you see in this image? Describe it in detail.' },
        {
          type: 'image_url',
          image_url: {
            url: 'https://example.com/photo.jpg',
            detail: 'high' // 'auto' | 'low' | 'high'
          }
        }
      ]
    }
  ]
});

console.log(result.content);
// "The image shows a sunset over the ocean with..."

  • auto: the model decides the detail level based on the image size. Best default choice.
  • low: faster and cheaper; uses a 512×512 thumbnail. Good for simple classification.
  • high: full-resolution analysis. Best for OCR, detailed descriptions, and small-text reading.
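If you pick the level per task, a tiny helper can encode this guidance. This is purely illustrative; detailFor is not part of the OrkaJS API.

```typescript
type Detail = 'auto' | 'low' | 'high';

// Hypothetical helper mapping a task to a detail level,
// following the guidance above. Not part of OrkaJS.
function detailFor(task: 'classify' | 'ocr' | 'describe'): Detail {
  switch (task) {
    case 'classify': return 'low';  // a 512×512 thumbnail is enough
    case 'ocr':      return 'high'; // small text needs full resolution
    default:         return 'auto'; // let the model decide
  }
}

console.log(detailFor('ocr'));
// "high"
```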

# Image Analysis (Base64)

For local files or dynamically generated images, encode them in base64. This avoids the need for a public URL and works with both OpenAI and Anthropic.

import { readFileSync } from 'fs';

// Read local image file
const imageBuffer = readFileSync('./screenshot.png');
const base64Image = imageBuffer.toString('base64');

const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Extract all text from this screenshot.' },
        {
          type: 'image_base64',
          data: base64Image,
          mimeType: 'image/png' // 'image/png' | 'image/jpeg' | 'image/gif' | 'image/webp'
        }
      ]
    }
  ]
});

console.log(result.content);

# Multiple Images

You can send multiple images in a single message for comparison, analysis, or multi-page document processing.

const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Compare these two UI designs. Which one is better and why?' },
        {
          type: 'image_url',
          image_url: { url: 'https://example.com/design-a.png', detail: 'high' }
        },
        {
          type: 'image_url',
          image_url: { url: 'https://example.com/design-b.png', detail: 'high' }
        }
      ]
    }
  ]
});

# Audio Input (OpenAI)

OpenAI's GPT-4o models support audio input. Send audio data in WAV or MP3 format for transcription, analysis, or voice-based interaction.

import { readFileSync } from 'fs';

const audioBuffer = readFileSync('./recording.wav');
const base64Audio = audioBuffer.toString('base64');

const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Transcribe this audio and summarize the key points.' },
        {
          type: 'audio',
          data: base64Audio,
          format: 'wav' // 'wav' | 'mp3'
        }
      ]
    }
  ]
});

console.log(result.content);
// "The speaker discusses three main topics: ..."

⚠️ Audio Limitations

  • Audio input via the audio content part is currently supported only by OpenAI (GPT-4o models)
  • Anthropic (Claude) adds audio input starting with Claude 4.6 Sonnet
  • For Gemini, the best models for audio are 3.1 Pro and 3 Flash
  • Maximum audio length depends on the model and your API plan

# With System Prompt

Combine multimodal content with system prompts for specialized analysis tasks.

const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'system',
      content: 'You are an expert radiologist. Analyze medical images with precision and provide structured reports.'
    },
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Please analyze this X-ray image.' },
        {
          type: 'image_url',
          image_url: { url: 'https://example.com/xray.jpg', detail: 'high' }
        }
      ]
    }
  ]
});

# Multi-turn Conversations

Build multi-turn conversations that reference previously shared images.

const result = await orka.getLLM().generate('', {
  messages: [
    {
      role: 'user',
      content: [
        { type: 'text', text: 'Here is a photo of my living room.' },
        { type: 'image_url', image_url: { url: 'https://example.com/room.jpg' } }
      ]
    },
    {
      role: 'assistant',
      content: 'I can see a modern living room with a gray sofa, wooden coffee table...'
    },
    {
      role: 'user',
      content: 'What color should I paint the walls to complement the furniture?'
    }
  ]
});

Provider Compatibility

| Media Capability | Technical Context | Provider Availability |
| --- | --- | --- |
| Image (URL / Base64) | Standard vision processing for OCR & analysis. | OpenAI, Anthropic, Mistral, Ollama |
| Audio Processing | Native speech analysis and sound recognition. | OpenAI, Anthropic, Mistral, Ollama |
| Multi-Image Support | Comparative vision or multi-page document analysis. | OpenAI, Anthropic, Mistral, Ollama |

Use Cases

  • Document OCR (Vision / Data): extract high-accuracy text from scans, receipts, and handwritten notes using Vision LLMs.
  • UI/UX Analysis (Vision / Design): audit screenshots for accessibility, design consistency, and component mapping.
  • Chart Extraction (Vision / Analytics): convert visual charts and complex tables into structured JSON for analytical processing.
  • Voice Intelligence (Audio / NLP): native audio transcription with context-aware summarization of meetings and memos.

TypeScript Types

import type { ChatMessage, ContentPart } from 'orkajs';

// ChatMessage
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string | ContentPart[];
}

// ContentPart — union type
type ContentPart =
  | { type: 'text'; text: string }
  | { type: 'image_url'; image_url: { url: string; detail?: 'auto' | 'low' | 'high' } }
  | { type: 'image_base64'; data: string; mimeType: 'image/png' | 'image/jpeg' | 'image/gif' | 'image/webp' }
  | { type: 'audio'; data: string; format: 'wav' | 'mp3' };

Best Practices

1. Choose the Right Detail Level

Use 'low' for simple classification tasks to save tokens and cost. Use 'high' for OCR and detailed analysis.

2. Optimize Image Size

Resize large images before sending to reduce token usage. Most models work well with images under 2048×2048.
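The target dimensions can be computed before handing the file to your image library of choice. The 2048-pixel cap below mirrors the rule of thumb above; fitWithin is an illustrative helper, not part of OrkaJS.

```typescript
// Illustrative helper: scale (w, h) down so the longest side
// fits within `max` pixels, preserving aspect ratio.
function fitWithin(w: number, h: number, max = 2048): [number, number] {
  const longest = Math.max(w, h);
  if (longest <= max) return [w, h]; // already small enough
  const scale = max / longest;
  return [Math.round(w * scale), Math.round(h * scale)];
}

console.log(fitWithin(4096, 3072));
// [2048, 1536]
```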

3. Use Base64 for Sensitive Data

For private or sensitive images, use base64 encoding instead of URLs to avoid exposing data publicly.

Multimodal Package

For advanced multimodal workflows, use the dedicated @orka-js/multimodal package. It provides specialized agents, utilities for vision and audio processing, and cross-modal workflows.

npm install @orka-js/multimodal

# Vision Utilities

High-level functions for common vision tasks: image analysis, description, OCR, and comparison.

import { analyzeImage, describeImage, extractTextFromImage, compareImages } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const llm = new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY!, model: 'gpt-4o' });

// Analyze an image
const analysis = await analyzeImage(llm, {
  type: 'url',
  url: 'https://example.com/photo.jpg'
}, { prompt: 'What emotions are expressed in this image?' });

console.log(analysis.analysis);
// "The image conveys a sense of joy and celebration..."

// Get structured description
const description = await describeImage(llm, {
  type: 'url',
  url: 'https://example.com/photo.jpg'
});

console.log(description);
// { description: "A sunset over the ocean", objects: ["sun", "ocean", "clouds"], colors: ["orange", "purple"], scene: "outdoor" }

// Extract text (OCR)
const ocr = await extractTextFromImage(llm, {
  type: 'base64',
  data: base64Image,
  mimeType: 'image/png'
});

console.log(ocr.text);
// "Invoice #12345\nDate: 2024-01-15..."

// Compare two images
const comparison = await compareImages(llm,
  { type: 'url', url: 'https://example.com/before.jpg' },
  { type: 'url', url: 'https://example.com/after.jpg' }
);

console.log(comparison.analysis);
// "The main differences are..."

# Audio Utilities (Whisper & TTS)

Transcribe audio with OpenAI Whisper and generate speech with TTS. The OpenAI adapter now includes built-in audio methods.

import { OpenAIAdapter } from '@orka-js/openai';
import { transcribeAudio, synthesizeSpeech } from '@orka-js/multimodal';
import { readFileSync, writeFileSync } from 'fs';

const adapter = new OpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  whisperModel: 'whisper-1',
  ttsModel: 'tts-1',
  ttsVoice: 'nova'
});

// Direct adapter methods
const transcription = await adapter.transcribe(
  readFileSync('./audio.wav'),
  { language: 'en', responseFormat: 'verbose_json' }
);

console.log(transcription.text);
// "Hello, this is a test recording..."
console.log(transcription.words);
// [{ word: "Hello", start: 0.0, end: 0.5 }, ...]

// Text-to-Speech
const audioBuffer = await adapter.textToSpeech(
  'Welcome to OrkaJS! This is a test of the text-to-speech feature.',
  { voice: 'nova', responseFormat: 'mp3', speed: 1.0 }
);

writeFileSync('./output.mp3', Buffer.from(audioBuffer));

// Using multimodal utilities
const result = await transcribeAudio(adapter, {
  type: 'base64',
  data: base64Audio,
  format: 'wav'
}, { includeTimestamps: true });

const speech = await synthesizeSpeech(adapter, 'Hello world!', {
  voice: 'alloy',
  format: 'mp3'
});

# VisionAgent

A specialized agent for image understanding tasks with batch processing support.

import { VisionAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const agent = new VisionAgent({
  llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY!, model: 'gpt-4o' }),
  systemPrompt: 'You are an expert image analyst.',
  detail: 'high',
  temperature: 0.3
});

// Ask questions about images
const answer = await agent.ask(
  { type: 'url', url: 'https://example.com/chart.png' },
  'What trend does this chart show?'
);

// Run batch tasks
const results = await agent.runTasks([
  { type: 'analyze', image: { type: 'url', url: 'https://example.com/1.jpg' } },
  { type: 'ocr', image: { type: 'url', url: 'https://example.com/document.png' } },
  { type: 'describe', image: { type: 'url', url: 'https://example.com/photo.jpg' } }
]);

results.forEach(r => console.log(r.task, r.result));

# AudioAgent

A specialized agent for audio processing: transcription, text-to-speech, and audio workflows.

import { AudioAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const agent = new AudioAgent({
  adapter: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
  defaultLanguage: 'en',
  defaultVoice: 'nova',
  defaultFormat: 'mp3'
});

// Transcribe audio
const transcription = await agent.transcribe({
  type: 'url',
  url: 'https://example.com/meeting.mp3'
});

console.log(transcription.result);
// "In today's meeting, we discussed..."

// Generate speech
const speech = await agent.speak('Hello, how can I help you today?');
// speech.result is an ArrayBuffer

// Transcribe and process
const processed = await agent.transcribeAndProcess(
  { type: 'base64', data: audioBase64 },
  async (text) => {
    // Process the transcription (e.g., summarize with LLM)
    return `Summary: ${text.slice(0, 100)}...`;
  }
);

# MultimodalAgent

Combines vision and audio capabilities for complex multimodal workflows. Automatically transcribes audio and processes images together.

import { MultimodalAgent } from '@orka-js/multimodal';
import { OpenAIAdapter } from '@orka-js/openai';

const llm = new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY!, model: 'gpt-4o' });

const agent = new MultimodalAgent({
  llm,
  audioAdapter: llm, // OpenAI adapter supports both
  systemPrompt: 'You are a helpful multimodal assistant.',
  maxTokens: 2048
});

// Process mixed inputs
const result = await agent.process({
  text: 'Analyze this image and audio together.',
  images: [{ type: 'url', url: 'https://example.com/presentation.png' }],
  audio: [{ type: 'base64', data: audioBase64 }]
});

console.log(result.response);
// "Based on the presentation slide and the audio explanation..."
console.log(result.transcriptions);
// ["The speaker explains that..."]

// Ask with context
const answer = await agent.ask(
  'What are the key points?',
  {
    images: [{ type: 'url', url: 'https://example.com/slide1.png' }],
    audio: [{ type: 'base64', data: voiceNote }]
  }
);

// Analyze multiple images
const imageAnalysis = await agent.analyzeImages(
  [
    { type: 'url', url: 'https://example.com/before.jpg' },
    { type: 'url', url: 'https://example.com/after.jpg' }
  ],
  'Compare these two images and describe the changes.'
);

Capability Detection

Use helper functions to check if an adapter supports specific capabilities:

import { isVisionCapable, isAudioCapable } from '@orka-js/multimodal';

if (isVisionCapable(llm)) {
  // Safe to use vision features
  const result = await analyzeImage(llm, image);
}

if (isAudioCapable(adapter)) {
  // Safe to use audio features
  const transcription = await transcribeAudio(adapter, audio);
}