OCR & Document Extraction

Extract text from images, PDFs, and scanned documents. Supports local (Tesseract) and cloud (OpenAI Vision) OCR engines.

80% Dark Data

Unlocking Enterprise Knowledge

Much enterprise knowledge is locked in 'dark data' formats: scanned PDFs, signed contracts, and invoices. Without OCR, this content is invisible to your LLM. OrkaJS converts these images into structured, searchable text, turning static archives into usable RAG context.

Scanned PDFs

Handwritten

Invoices

Screenshots

Installation

# Install OCR package
npm install @orka-js/ocr
 
# For local OCR (Tesseract) - GDPR-friendly
npm install tesseract.js
 
# For cloud OCR (OpenAI Vision) - no additional deps needed

Quick Start

import { OCR } from '@orka-js/ocr';
 
// Using Tesseract (local, GDPR-friendly)
const ocr = new OCR();
const result = await ocr.process('./document.png');
console.log(result.text);
 
// Using OpenAI Vision (cloud, high precision)
const cloudOcr = new OCR({
  type: 'openai-vision',
  config: { apiKey: process.env.OPENAI_API_KEY! },
});
const cloudResult = await cloudOcr.process('./invoice.pdf');

OCR Engines

Tesseract OCR

Privacy-First / Local Engine

"The open-source standard for on-premise document processing."

GDPR Compliance

Data never leaves your server

Zero Opex

No per-page API fees

Lower accuracy on messy scans

GPT-4o Vision

Intelligence-First / Cloud

"State-of-the-art vision for complex layout analysis."

SOTA Accuracy

Deep document understanding

Structural Logic

Extracts tables & forms

Requires external data transfer

Tesseract Engine

import { TesseractEngine } from '@orka-js/ocr';
 
const engine = new TesseractEngine({
  cacheWorker: true, // Reuse worker for better performance
});
 
const result = await engine.process('./document.png', {
  languages: ['eng', 'fra'], // English + French
  minConfidence: 0.7,
});
 
console.log(result.text);
console.log(`Confidence: ${result.confidence}`);
 
// Don't forget to cleanup when done
await engine.terminate();
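If `process` throws, the `terminate()` call in the example above is never reached. One way to guard against that is a `try`/`finally` wrapper. This is a minimal sketch: `withEngine` and `Terminable` are hypothetical names, and only the `terminate()` method is taken from the example above.

```typescript
// Minimal shape of anything that owns a worker. Only `terminate()` comes
// from the Tesseract example; the interface name is an assumption.
interface Terminable {
  terminate(): Promise<void>;
}

// Runs `task` with the engine and always terminates the engine afterwards,
// whether `task` resolves or throws.
async function withEngine<E extends Terminable, R>(
  engine: E,
  task: (engine: E) => Promise<R>,
): Promise<R> {
  try {
    return await task(engine);
  } finally {
    await engine.terminate();
  }
}
```

A call could then look like `await withEngine(new TesseractEngine({ cacheWorker: true }), (e) => e.process('./document.png'))`.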

OpenAI Vision Engine

import { OpenAIVisionEngine } from '@orka-js/ocr';
 
const engine = new OpenAIVisionEngine({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o', // or 'gpt-4-vision-preview'
  maxTokens: 4096,
});
 
const result = await engine.process('./complex-form.pdf', {
  extractTables: true,
  extractFields: true,
});
 
console.log(result.text);
console.log('Tables:', result.tables);
console.log('Fields:', result.fields);

Structured Document Extraction

Extract structured data from documents using a schema definition. The extractor combines OCR with an LLM to map the recognized text onto the schema fields.

import { DocumentExtractor } from '@orka-js/ocr';
import { OpenAIAdapter } from '@orka-js/openai';
 
const extractor = new DocumentExtractor();
 
const result = await extractor.extract({
  file: './invoice.pdf',
  schema: {
    invoiceNumber: 'string',
    date: 'date',
    total: 'number',
    client: {
      type: 'object',
      properties: {
        name: { type: 'string', required: true },
        address: { type: 'string' },
      },
    },
    items: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          description: { type: 'string' },
          quantity: { type: 'number' },
          unitPrice: { type: 'number' },
        },
      },
    },
  },
  llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
  includeRawText: true,
});
 
console.log(result.data);
// {
//   invoiceNumber: 'INV-2024-001',
//   date: '2024-03-15',
//   total: 1250.00,
//   client: { name: 'Acme Corp', address: '123 Main St' },
//   items: [
//     { description: 'Widget A', quantity: 10, unitPrice: 50 },
//     { description: 'Widget B', quantity: 5, unitPrice: 150 },
//   ]
// }
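LLM-based extraction can misread digits, so it can help to cross-check the result arithmetically. The sketch below is illustrative only: the type names and `totalsMatch` are assumptions, and it ignores taxes, discounts, and shipping, which real invoices often include.

```typescript
// Shapes mirror the example output above; the type names are assumptions.
interface InvoiceItem {
  description: string;
  quantity: number;
  unitPrice: number;
}

interface InvoiceData {
  invoiceNumber: string;
  date: string;
  total: number;
  items: InvoiceItem[];
}

// Returns true when the line items add up to the stated total,
// within a small tolerance for floating-point and rounding noise.
function totalsMatch(invoice: InvoiceData, tolerance = 0.01): boolean {
  const sum = invoice.items.reduce(
    (acc, item) => acc + item.quantity * item.unitPrice,
    0,
  );
  return Math.abs(sum - invoice.total) <= tolerance;
}

const sampleInvoice: InvoiceData = {
  invoiceNumber: 'INV-2024-001',
  date: '2024-03-15',
  total: 1250,
  items: [
    { description: 'Widget A', quantity: 10, unitPrice: 50 },
    { description: 'Widget B', quantity: 5, unitPrice: 150 },
  ],
};

const sampleIsConsistent = totalsMatch(sampleInvoice); // 10*50 + 5*150 = 1250
```

An invoice that fails this check is a reasonable candidate for manual review rather than automatic ingestion.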

Schema Field Types

Type	Parsing Logic	Example Value
`string`	Extracted text, kept as-is	`"OrkaJS SDK"`
`number`	Converted to a float or integer	`42.50`
`date`	Normalized to an ISO 8601 date string	`"2026-03-14"`
`boolean`	Converted to `true` or `false`	`true`
`array`	Each element parsed according to `items`	`['tag1', 'tag2']`
`object`	Nested fields parsed according to `properties`	`{ id: 1 }`
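To make the conversions in the table concrete, here is a minimal sketch of how raw OCR strings might be coerced into the four scalar types. It is not OrkaJS's actual parsing logic: `coerceField` is a hypothetical helper, numbers are assumed to use a dot as the decimal separator, and dates are assumed to arrive as `YYYY-MM-DD`.

```typescript
type ScalarFieldType = 'string' | 'number' | 'date' | 'boolean';

// Converts a raw extracted string into the target type, or returns null
// when the value cannot be interpreted.
function coerceField(
  type: ScalarFieldType,
  raw: string,
): string | number | boolean | null {
  const value = raw.trim();
  switch (type) {
    case 'string':
      return value;
    case 'number': {
      // Drop currency symbols and thousands separators, keep digits, dot, minus.
      const cleaned = value.replace(/[^0-9.\-]/g, '');
      const parsed = Number(cleaned);
      return cleaned !== '' && Number.isFinite(parsed) ? parsed : null;
    }
    case 'date': {
      const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(value);
      if (!match) return null;
      const ms = Date.UTC(Number(match[1]), Number(match[2]) - 1, Number(match[3]));
      // Round-trip check rejects impossible dates such as 2026-02-30.
      return new Date(ms).toISOString().slice(0, 10) === value ? value : null;
    }
    case 'boolean': {
      const lower = value.toLowerCase();
      if (lower === 'true' || lower === 'yes') return true;
      if (lower === 'false' || lower === 'no') return false;
      return null;
    }
  }
}
```

Returning `null` instead of throwing lets a caller decide per field whether a missing value is acceptable.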

OCR Result Structure

interface OCRResult {
  text: string;           // Full extracted text
  pages: OCRPage[];       // Per-page results
  confidence: number;     // Overall confidence (0-1)
  tables?: ExtractedTable[];  // Extracted tables
  fields?: ExtractedField[];  // Form fields (key-value)
  metadata: {
    engine: string;
    processingTimeMs: number;
    pageCount: number;
    language?: string;
  };
}
 
interface OCRPage {
  pageNumber: number;
  text: string;
  blocks: OCRBlock[];     // Text blocks/paragraphs
  confidence: number;
}
 
interface OCRBlock {
  text: string;
  lines: OCRLine[];
  confidence: number;
  type?: 'text' | 'table' | 'figure' | 'header' | 'footer';
}
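The block `type` field makes it possible to drop repeated headers and footers before indexing. The following sketch builds one record per page from body blocks only. `BlockLike`, `PageLike`, `PageRecord`, and `toPageRecords` are hypothetical names; the local types copy only the fields used here and omit `lines`, since `OCRLine` is not defined in this section.

```typescript
type BlockKind = 'text' | 'table' | 'figure' | 'header' | 'footer';

// Local, reduced copies of OCRBlock and OCRPage (assumed shapes).
interface BlockLike {
  text: string;
  confidence: number;
  type?: BlockKind;
}

interface PageLike {
  pageNumber: number;
  text: string;
  blocks: BlockLike[];
  confidence: number;
}

interface PageRecord {
  pageNumber: number;
  content: string;
  confidence: number;
}

// Builds one record per page, skipping header/footer blocks and
// dropping pages that end up with no body text.
function toPageRecords(pages: PageLike[]): PageRecord[] {
  return pages
    .map((page) => ({
      pageNumber: page.pageNumber,
      content: page.blocks
        .filter((block) => block.type !== 'header' && block.type !== 'footer')
        .map((block) => block.text.trim())
        .filter((text) => text.length > 0)
        .join('\n\n'),
      confidence: page.confidence,
    }))
    .filter((record) => record.content.length > 0);
}
```

Keeping `pageNumber` on each record lets search results point back to the exact page of the source document.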

Integration with Knowledge (RAG)

Use OCR to process scanned documents before ingesting them into your knowledge base:

import { OCR } from '@orka-js/ocr';
import { createOrka } from '@orka-js/core';
 
const ocr = new OCR();
const orka = createOrka({ /* config */ });
 
// Process scanned documents
async function ingestScannedDocument(filePath: string, knowledgeName: string) {
  // Step 1: Extract text with OCR
  const ocrResult = await ocr.process(filePath);
 
  if (ocrResult.confidence < 0.7) {
    console.warn(`Low confidence OCR: ${ocrResult.confidence}`);
  }
 
  // Step 2: Add to knowledge base
  await orka.knowledge.add(knowledgeName, {
    content: ocrResult.text,
    metadata: {
      source: filePath,
      ocrEngine: ocrResult.metadata.engine,
      ocrConfidence: ocrResult.confidence,
      pageCount: ocrResult.metadata.pageCount,
    },
  });
}
 
// Process a list of scanned PDFs
const files = ['./docs/invoice1.pdf', './docs/contract.pdf'];
for (const file of files) {
  await ingestScannedDocument(file, 'company-docs');
}
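The example above logs a warning on low confidence but still ingests the document. A stricter pipeline could route each result instead. This is a minimal sketch: `decideIngestion` is a hypothetical helper, the `0.7` threshold comes from the example above, and the `0.4` rejection threshold is an assumption to tune on your own documents.

```typescript
type IngestDecision = 'ingest' | 'review' | 'reject';

// Decides what to do with an OCR result before it reaches the knowledge base.
// - empty text is never ingested, whatever the confidence
// - confidence >= acceptAt goes straight in
// - confidence between rejectBelow and acceptAt is queued for human review
function decideIngestion(
  confidence: number,
  text: string,
  acceptAt = 0.7,
  rejectBelow = 0.4,
): IngestDecision {
  if (text.trim().length === 0) return 'reject';
  if (confidence >= acceptAt) return 'ingest';
  return confidence >= rejectBelow ? 'review' : 'reject';
}
```

Inside `ingestScannedDocument`, the call to `orka.knowledge.add` would then run only when the decision is `'ingest'`.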

Deployment Best Practices

Standard Operating Procedures for OCR Pipelines

Security & Compliance

Use Tesseract for GDPR-sensitive data so that documents never leave your own infrastructure.

Cognitive Depth

Deploy OpenAI Vision for high-stakes analysis of complex tables and forms.

Quality Assurance

Programmatically validate confidence scores before committing to your Vector DB.

Performance & Cost

Implement a caching layer for OCR results so the same document is never billed or processed twice.

Semantic Structuring

Use schemas for invoices and contracts so extraction yields typed fields rather than free text.

Resource Hygiene

Explicitly terminate Tesseract workers to prevent memory leaks in production.
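The caching recommendation under Performance & Cost can be sketched as a cache keyed by file content plus engine settings. Everything here is an assumption for illustration: `fnv1a`, `ocrCacheKey`, and `OcrResultCache` are hypothetical names, the cache is in-memory only, and the 32-bit FNV-1a hash is used purely to keep the sketch dependency-free. A production cache should prefer a cryptographic hash such as SHA-256.

```typescript
// FNV-1a 32-bit hash over the file bytes, returned as 8 hex characters.
function fnv1a(bytes: Uint8Array): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i];
    hash = Math.imul(hash, 0x01000193);
  }
  return ('00000000' + (hash >>> 0).toString(16)).slice(-8);
}

// The key covers everything that changes the OCR output: content, engine,
// and languages (sorted, so ['eng','fra'] and ['fra','eng'] share a key).
function ocrCacheKey(bytes: Uint8Array, engine: string, languages: string[]): string {
  const langs = languages.slice().sort().join('+');
  return `${engine}:${langs}:${bytes.length}:${fnv1a(bytes)}`;
}

class OcrResultCache<T> {
  private readonly store = new Map<string, Promise<T>>();
  hits = 0;

  // Stores the in-flight promise so concurrent requests for the same
  // document share a single OCR call. Failed calls are evicted so a
  // later retry can run again.
  getOrCompute(key: string, compute: () => Promise<T>): Promise<T> {
    const cached = this.store.get(key);
    if (cached) {
      this.hits += 1;
      return cached;
    }
    const pending = compute().catch((err) => {
      this.store.delete(key);
      throw err;
    });
    this.store.set(key, pending);
    return pending;
  }
}
```

Keying on content rather than file path means a renamed or re-uploaded copy of the same scan still hits the cache.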