OCR & Document Extraction
Extract text from images, PDFs, and scanned documents. Supports local (Tesseract) and cloud (OpenAI Vision) OCR engines.
Unlocking Enterprise Knowledge
Most enterprise knowledge sits in "dark data" formats such as scanned PDFs, signed contracts, and invoices. Without OCR, this content is invisible to your LLM. OrkaJS converts these images into structured, searchable text, turning dead archives into live RAG context.
Installation
```bash
# Install the OCR package
npm install @orka-js/ocr

# For local OCR (Tesseract), GDPR-friendly
npm install tesseract.js

# For cloud OCR (OpenAI Vision), no additional dependencies needed
```

Quick Start
```typescript
import { OCR } from '@orka-js/ocr';

// Using Tesseract (local, GDPR-friendly)
const ocr = new OCR();
const result = await ocr.process('./document.png');
console.log(result.text);

// Using OpenAI Vision (cloud, high precision)
const cloudOcr = new OCR({
  type: 'openai-vision',
  config: { apiKey: process.env.OPENAI_API_KEY! },
});
const cloudResult = await cloudOcr.process('./invoice.pdf');
```

OCR Engines
Tesseract OCR
Privacy-First / Local Engine
"The open-source standard for on-premise document processing."
GDPR Compliance: Data never leaves your server
Zero Opex: No per-page API fees

GPT-4o Vision
Intelligence-First / Cloud
"State-of-the-art vision for complex layout analysis."
SOTA Accuracy: Deep document understanding
Structural Logic: Extracts tables & forms
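The trade-off between the two engines can be sketched as a small routing function. This is only an illustration: `DocumentTraits`, `chooseEngine`, and the `'tesseract'` label are hypothetical names introduced here, not part of `@orka-js/ocr`; only the `'openai-vision'` type string appears in the Quick Start.

```typescript
// Hypothetical labels for the two engines described above.
type EngineChoice = 'tesseract' | 'openai-vision';

// Hypothetical description of a document, filled in by the caller.
interface DocumentTraits {
  containsPersonalData: boolean; // must stay on your own servers
  hasTablesOrForms: boolean;     // complex layout that benefits from a vision model
}

// Privacy takes priority over accuracy: sensitive documents always stay local.
function chooseEngine(doc: DocumentTraits): EngineChoice {
  if (doc.containsPersonalData) return 'tesseract';
  return doc.hasTablesOrForms ? 'openai-vision' : 'tesseract';
}

console.log(chooseEngine({ containsPersonalData: false, hasTablesOrForms: true }));
```

The returned label could then decide between `new OCR()` and `new OCR({ type: 'openai-vision', ... })` as shown in the Quick Start.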
Tesseract Engine
```typescript
import { TesseractEngine } from '@orka-js/ocr';

const engine = new TesseractEngine({
  cacheWorker: true, // Reuse worker for better performance
});

const result = await engine.process('./document.png', {
  languages: ['eng', 'fra'], // English + French
  minConfidence: 0.7,
});

console.log(result.text);
console.log(`Confidence: ${result.confidence}`);

// Don't forget to clean up when done
await engine.terminate();
```

OpenAI Vision Engine
```typescript
import { OpenAIVisionEngine } from '@orka-js/ocr';

const engine = new OpenAIVisionEngine({
  apiKey: process.env.OPENAI_API_KEY!,
  model: 'gpt-4o', // or 'gpt-4-vision-preview'
  maxTokens: 4096,
});

const result = await engine.process('./complex-form.pdf', {
  extractTables: true,
  extractFields: true,
});

console.log(result.text);
console.log('Tables:', result.tables);
console.log('Fields:', result.fields);
```

Structured Document Extraction
Extract structured data from documents using a schema definition. The extractor combines OCR with an LLM: OCR produces the raw text, and the LLM maps it onto the schema fields.
```typescript
import { DocumentExtractor } from '@orka-js/ocr';
import { OpenAIAdapter } from '@orka-js/openai';

const extractor = new DocumentExtractor();

const result = await extractor.extract({
  file: './invoice.pdf',
  schema: {
    invoiceNumber: 'string',
    date: 'date',
    total: 'number',
    client: {
      type: 'object',
      properties: {
        name: { type: 'string', required: true },
        address: { type: 'string' },
      },
    },
    items: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          description: { type: 'string' },
          quantity: { type: 'number' },
          unitPrice: { type: 'number' },
        },
      },
    },
  },
  llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
  includeRawText: true,
});

console.log(result.data);
// {
//   invoiceNumber: 'INV-2024-001',
//   date: '2024-03-15',
//   total: 1250.00,
//   client: { name: 'Acme Corp', address: '123 Main St' },
//   items: [
//     { description: 'Widget A', quantity: 10, unitPrice: 50 },
//     { description: 'Widget B', quantity: 5, unitPrice: 150 },
//   ]
// }
```

Schema Field Types
| Field Type | Parsing Logic | Example Value |
|---|---|---|
| string | Raw UTF-8 text extraction | "OrkaJS SDK" |
| number | Float/integer conversion | 42.50 |
| date | ISO 8601 date parsing | "2026-03-14" |
| boolean | True/false evaluation | true |
| array | Maps each item to the item schema | ['tag1', 'tag2'] |
| object | Recursive nested structuring | { id: 1 } |
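To make the parsing column concrete, here is a minimal sketch of how the four scalar types might be coerced from raw extracted strings. It is not OrkaJS's actual implementation: `coerceField` and `ScalarType` are hypothetical names, the number branch assumes `.` as the decimal separator and `,` as the thousands separator, and the date branch accepts only `YYYY-MM-DD`. The `array` and `object` types would apply the same idea recursively and are omitted here.

```typescript
// Hypothetical subset of the schema types from the table above.
type ScalarType = 'string' | 'number' | 'date' | 'boolean';

// Returns the typed value, or null when the raw text cannot be parsed.
function coerceField(type: ScalarType, raw: string): string | number | boolean | null {
  const value = raw.trim();
  switch (type) {
    case 'string':
      return value;
    case 'number': {
      // Assumption: '.' is the decimal separator; currency symbols,
      // spaces, and ',' thousands separators are stripped.
      const cleaned = value.replace(/[^0-9.-]/g, '');
      const parsed = Number(cleaned);
      return cleaned !== '' && Number.isFinite(parsed) ? parsed : null;
    }
    case 'date': {
      // Assumption: only ISO 8601 calendar dates (YYYY-MM-DD) are accepted.
      if (!/^\d{4}-\d{2}-\d{2}$/.test(value)) return null;
      const ms = Date.parse(value);
      if (Number.isNaN(ms)) return null;
      // Round-trip check rejects impossible dates such as 2026-02-30.
      return new Date(ms).toISOString().slice(0, 10) === value ? value : null;
    }
    case 'boolean': {
      const lower = value.toLowerCase();
      if (lower === 'true' || lower === 'yes') return true;
      if (lower === 'false' || lower === 'no') return false;
      return null;
    }
  }
}

console.log(coerceField('number', '1,250.00'));
console.log(coerceField('date', '2026-03-14'));
```

Returning `null` instead of throwing lets the caller decide whether a missing value is fatal, for example when the schema marks a field as `required`.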
OCR Result Structure
```typescript
interface OCRResult {
  text: string;               // Full extracted text
  pages: OCRPage[];           // Per-page results
  confidence: number;         // Overall confidence (0-1)
  tables?: ExtractedTable[];  // Extracted tables
  fields?: ExtractedField[];  // Form fields (key-value)
  metadata: {
    engine: string;
    processingTimeMs: number;
    pageCount: number;
    language?: string;
  };
}

interface OCRPage {
  pageNumber: number;
  text: string;
  blocks: OCRBlock[];         // Text blocks/paragraphs
  confidence: number;
}

interface OCRBlock {
  text: string;
  lines: OCRLine[];
  confidence: number;
  type?: 'text' | 'table' | 'figure' | 'header' | 'footer';
}
```

Integration with Knowledge (RAG)
Use OCR to process scanned documents before ingesting them into your knowledge base:
```typescript
import { OCR } from '@orka-js/ocr';
import { createOrka } from '@orka-js/core';

const ocr = new OCR();
const orka = createOrka({ /* config */ });

// Process a scanned document and add it to a knowledge base
async function ingestScannedDocument(filePath: string, knowledgeName: string) {
  // Step 1: Extract text with OCR
  const ocrResult = await ocr.process(filePath);

  if (ocrResult.confidence < 0.7) {
    console.warn(`Low confidence OCR: ${ocrResult.confidence}`);
  }

  // Step 2: Add to knowledge base
  await orka.knowledge.add(knowledgeName, {
    content: ocrResult.text,
    metadata: {
      source: filePath,
      ocrEngine: ocrResult.metadata.engine,
      ocrConfidence: ocrResult.confidence,
      pageCount: ocrResult.metadata.pageCount,
    },
  });
}

// Process a list of scanned PDFs
const files = ['./docs/invoice1.pdf', './docs/contract.pdf'];
for (const file of files) {
  await ingestScannedDocument(file, 'company-docs');
}
```

Deployment Best Practices
Standard Operating Procedures for OCR Pipelines
Security & Compliance: Use Tesseract for GDPR-sensitive data so that documents never leave your infrastructure.
Cognitive Depth: Use OpenAI Vision for high-stakes analysis of complex tables and forms.
Quality Assurance: Validate confidence scores programmatically before committing text to your vector database.
Performance & Cost: Cache OCR results so the same document is never billed or processed twice.
Semantic Structuring: Use schemas for invoices and contracts to turn pixels into structured business data.
Resource Hygiene: Explicitly terminate Tesseract workers to prevent memory leaks in production.
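Two of these practices, the confidence check and the cache, can be sketched in plain TypeScript. Everything below is an assumption made for illustration: `OcrResultLike` mirrors only the `confidence` and `pages` fields of `OCRResult`, while `qualityGate` and `OcrCache` are hypothetical helpers, not exports of `@orka-js/ocr`. The cache key is supplied by the caller, for example a SHA-256 digest of the file bytes combined with the engine name.

```typescript
// Minimal local shape mirroring the OCRResult fields used here.
interface OcrResultLike {
  text: string;
  confidence: number; // overall confidence, 0-1
  pages: { pageNumber: number; confidence: number }[];
}

// Hypothetical gate: accept a result only if the overall score and
// every page score reach the threshold.
function qualityGate(result: OcrResultLike, minConfidence = 0.7) {
  const weakPages = result.pages
    .filter((page) => page.confidence < minConfidence)
    .map((page) => page.pageNumber);
  return {
    accepted: result.confidence >= minConfidence && weakPages.length === 0,
    weakPages,
  };
}

// Hypothetical in-memory cache: stores the pending promise per key, so
// repeated and concurrent requests for the same document run OCR once.
class OcrCache<T> {
  private readonly entries = new Map<string, Promise<T>>();

  getOrCompute(key: string, compute: () => Promise<T>): Promise<T> {
    const cached = this.entries.get(key);
    if (cached) return cached;
    const pending = compute().catch((err) => {
      this.entries.delete(key); // failures are not cached, so a retry can succeed
      throw err;
    });
    this.entries.set(key, pending);
    return pending;
  }
}

const gate = qualityGate({
  text: 'Invoice',
  confidence: 0.9,
  pages: [
    { pageNumber: 1, confidence: 0.95 },
    { pageNumber: 2, confidence: 0.5 },
  ],
});
console.log(gate.accepted, gate.weakPages);
```

Checking individual pages as well as the overall score catches a single unreadable page that a good average would otherwise hide. For worker cleanup, wrapping `engine.process(...)` in `try { ... } finally { await engine.terminate(); }` ensures the worker is released even when processing throws.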