OrkaJS

Document Loaders

Load data from various sources: PDF, CSV, JSON, Markdown, plain text files, and entire folders.

Overview

Document loaders transform raw data from different sources into a unified Document format that Orka AI can process. Each loader handles specific file types and extraction logic.

What is a Document?

All loaders return an array of Document objects with this structure:

interface Document {
  id: string;                // Unique identifier
  content: string;           // The actual text content
  metadata: {
    source?: string;         // File path or URL
    loader?: string;         // Loader name
    [key: string]: unknown;  // Your custom metadata fields
  };
}
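Because every loader emits this same shape, downstream code never needs loader-specific handling. A minimal standalone sketch of filtering by metadata (the Document shape is repeated here so the snippet runs on its own):

```typescript
// Same shape as the Document interface above, repeated so this
// snippet is self-contained.
interface Document {
  id: string;
  content: string;
  metadata: Record<string, unknown>;
}

// Filter documents by a metadata field, regardless of which loader produced them.
function filterBySource(docs: Document[], source: string): Document[] {
  return docs.filter((d) => d.metadata.source === source);
}

const docs: Document[] = [
  { id: '1', content: 'Intro', metadata: { source: 'documentation', loader: 'TextLoader' } },
  { id: '2', content: 'Row 1', metadata: { source: 'catalog', loader: 'CSVLoader' } },
];

const fromDocs = filterBySource(docs, 'documentation');
// fromDocs holds only the TextLoader document
```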

# TextLoader

Load plain text files with custom encoding support. The simplest loader — reads a file and returns its content as a single document.

import { TextLoader } from '@orka-js/tools';
 
const loader = new TextLoader('./document.txt', {
  encoding: 'utf-8',
  metadata: { source: 'documentation' }
});
 
const documents = await loader.load();
// [{ id: '...', content: '...', metadata: { source: 'documentation', loader: 'TextLoader' } }]

Constructor Parameters

path: string

Absolute or relative path to the text file.

options.encoding?: string

Character encoding (default: 'utf-8'). Supports 'utf-8', 'ascii', 'latin1', etc.

options.metadata?: Record<string, unknown>

Custom metadata to attach to the document. Useful for categorization, filtering, or tracking.

# CSVLoader

Parse CSV files with support for custom separators, column selection, and content extraction. Each row becomes a separate document, making it perfect for loading structured data like product catalogs, user lists, or FAQ databases.

import { CSVLoader } from '@orka-js/tools';
 
// Option 1: Use specific column as content
const loader = new CSVLoader('./data.csv', {
  separator: ',',
  contentColumn: 'description', // Use this column as document content
  metadata: { type: 'product_data' }
});
 
// Option 2: Combine multiple columns
const loader2 = new CSVLoader('./users.csv', {
  columns: ['name', 'bio', 'interests'], // Combine these columns
});
 
const documents = await loader.load();
// Each row becomes a separate document

Advanced CSV Parsing

Engineered for structured data ingestion:

- Quoted integrity (RFC 4180): accurately parses fields containing commas or line breaks wrapped in quotes.
- Dynamic delimiters: full support for custom separators such as commas, semicolons, or tabs.
- Auto-metadata (RAG-ready): automatically converts columns into searchable metadata tags for your vector store.
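To make the quoted-integrity rule concrete, here is a minimal standalone sketch of RFC 4180-style quote handling. It illustrates the rule only; it is not CSVLoader's actual implementation:

```typescript
// Minimal RFC 4180-style line parser: quoted fields may contain the
// separator, and "" inside quotes escapes a literal quote.
function parseCsvLine(line: string, separator = ','): string[] {
  const fields: string[] = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"') {
        if (line[i + 1] === '"') { current += '"'; i++; } // escaped quote
        else inQuotes = false;                            // closing quote
      } else current += ch;                               // separator allowed here
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === separator) {
      fields.push(current);
      current = '';
    } else current += ch;
  }
  fields.push(current);
  return fields;
}

// parseCsvLine('"Acme, Inc.",widget') keeps the comma inside the first field
```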

# JSONLoader

Load JSON files or objects with JSONPath support for nested data extraction. Handles both single objects and arrays, with flexible field mapping for content and metadata.

import { JSONLoader } from '@orka-js/tools';
 
// Load from file
const loader = new JSONLoader('./data.json', {
  contentField: 'text',                // Use this field as content
  metadataFields: ['author', 'date'],  // Extract these as metadata
  jsonPath: '$.articles'               // Extract nested array
});
 
// Load from object
const data = [
  { text: 'Article 1', author: 'Alice' },
  { text: 'Article 2', author: 'Bob' }
];
const loader2 = new JSONLoader(data, {
  contentField: 'text'
});
 
const documents = await loader.load();
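The field mapping can be pictured with a standalone sketch. The field names mirror the example above, but the helper itself is hypothetical and stands in for the loader's internal logic; JSONPath resolution (`$.articles`) is elided by starting from the already-extracted array:

```typescript
type Row = Record<string, unknown>;

// Hypothetical sketch of the contentField / metadataFields mapping.
function toDocuments(
  records: Row[],
  contentField: string,
  metadataFields: string[] = []
): { content: string; metadata: Row }[] {
  return records.map((r) => ({
    // The chosen field becomes the document text.
    content: String(r[contentField] ?? ''),
    // Only the listed fields are copied into metadata.
    metadata: Object.fromEntries(
      metadataFields
        .filter((f) => f in r)
        .map((f) => [f, r[f]] as [string, unknown])
    ),
  }));
}

const data = {
  articles: [{ text: 'Article 1', author: 'Alice', draft: true }],
};
const docs = toDocuments(data.articles, 'text', ['author']);
// docs[0].content is 'Article 1'; only 'author' is copied into metadata
```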

# MarkdownLoader

Load Markdown files with frontmatter extraction and header parsing. Perfect for documentation, blog posts, or any content with YAML frontmatter metadata.

import { MarkdownLoader } from '@orka-js/tools';
 
const loader = new MarkdownLoader('./README.md', {
  removeFrontmatter: true, // Strip frontmatter from content (its fields still land in metadata)
  includeHeaders: true,    // Extract all headers as metadata
  metadata: { type: 'documentation' }
});
 
const documents = await loader.load();
// Frontmatter fields are added to metadata
// Headers are available in metadata.headers

Example Markdown with Frontmatter

---
title: Getting Started
author: Alice
date: 2024-01-15
---
 
# Introduction
 
This is the content...
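Conceptually, frontmatter handling splits the leading `---` block from the body and turns its key/value pairs into metadata. A minimal standalone sketch (the real loader presumably uses a full YAML parser; this handles only flat `key: value` lines):

```typescript
// Split a leading `--- ... ---` frontmatter block from a Markdown string
// and parse simple `key: value` lines. Illustration only.
function splitFrontmatter(md: string): { meta: Record<string, string>; body: string } {
  const match = md.match(/^---\n([\s\S]*?)\n---\n?/);
  if (!match) return { meta: {}, body: md };
  const meta: Record<string, string> = {};
  for (const line of match[1].split('\n')) {
    const idx = line.indexOf(':');
    if (idx > 0) meta[line.slice(0, idx).trim()] = line.slice(idx + 1).trim();
  }
  // Body is everything after the closing delimiter.
  return { meta, body: md.slice(match[0].length) };
}

const md = '---\ntitle: Getting Started\nauthor: Alice\n---\n\n# Introduction\n';
const { meta, body } = splitFrontmatter(md);
// meta.title is 'Getting Started'; body begins at '# Introduction'
```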

# PDFLoader

Extract text from PDF files with page selection and metadata extraction. Each page becomes a separate document with page number tracking, ideal for research papers, reports, or manuals.

📦 Installation Required

PDFLoader requires the pdf-parse package:

npm install pdf-parse

import { PDFLoader } from '@orka-js/tools';
 
// Load entire PDF
const loader = new PDFLoader('./document.pdf', {
  metadata: { source: 'research_paper' }
});
 
// Load specific pages
const loader2 = new PDFLoader('./report.pdf', {
  pages: [1, 2, 3], // Only pages 1, 2, 3
  maxPages: 10      // Or limit to first 10 pages
});
 
const documents = await loader.load();
// Each page becomes a separate document with page number in metadata

# DirectoryLoader

Recursively load all supported files from a directory with automatic loader detection. The most powerful loader — scans entire directory trees, detects file types, and applies the appropriate loader automatically.

import { DirectoryLoader } from '@orka-js/tools';
 
const loader = new DirectoryLoader('./docs', {
  recursive: true,                   // Scan subdirectories
  glob: '*.md',                      // Filter by pattern
  exclude: ['node_modules', '.git'], // Exclude folders
  metadata: { project: 'orka-docs' }
});
 
const documents = await loader.load();
// Automatically detects and uses the right loader for each file type

| File Extension | Internal Loader | Strategy |
| --- | --- | --- |
| .txt | TextLoader | Plain Text |
| .md, .mdx | MarkdownLoader | Structure Aware |
| .csv | CSVLoader | Row-to-Document |
| .json, .jsonl | JSONLoader | Object Mapping |
| .ts, .js, .py | TextLoader | Raw Code Ingestion |

Using with Knowledge

Loaders integrate seamlessly with Orka's Knowledge system. Here's a complete pipeline from loading documents to creating a searchable knowledge base:

create-knowledge-base.ts
import { createOrka } from 'orkajs';
import { DirectoryLoader, RecursiveCharacterTextSplitter } from '@orka-js/tools';
 
const orka = createOrka({ /* config */ });
 
// Load documents
const loader = new DirectoryLoader('./knowledge-base');
const documents = await loader.load();
 
// Split into chunks
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200
});
const chunks = splitter.splitDocuments(documents);
 
// Create knowledge base from the chunks (not the raw documents)
await orka.knowledge.create({
  name: 'my-knowledge',
  source: chunks.map(d => ({ text: d.content, metadata: d.metadata }))
});
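What chunkSize and chunkOverlap mean can be shown with a toy splitter: a window of chunkSize characters that advances by chunkSize minus chunkOverlap each step. RecursiveCharacterTextSplitter is smarter about where it cuts (it prefers natural separators like paragraph breaks), but the windowing arithmetic is the same idea:

```typescript
// Toy character splitter: fixed-size windows with overlap.
// Illustration of the parameters only, not the library's algorithm.
function splitWithOverlap(text: string, chunkSize: number, overlap: number): string[] {
  if (overlap >= chunkSize) throw new Error('overlap must be smaller than chunkSize');
  const step = chunkSize - overlap; // how far the window advances each iteration
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

const chunks = splitWithOverlap('a'.repeat(2500), 1000, 200);
// 1000-char windows starting at 0, 800, 1600; the last chunk is shorter
```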

Tree-shaking Imports

Import only what you need to minimize bundle size:

// ❌ Imports everything
import { CSVLoader, PDFLoader } from 'orkajs';
 
// ✅ Tree-shakeable - only bundles CSVLoader
import { CSVLoader } from '@orka-js/tools';
 
// ✅ Import all loaders from index
import { CSVLoader, PDFLoader, JSONLoader } from '@orka-js/tools';

API-Based Loaders

Load data from external services via their APIs. These loaders require authentication credentials.

# NotionLoader

Load pages and databases from Notion. Extracts rich text content and properties.

import { NotionLoader } from '@orka-js/tools';
 
const loader = new NotionLoader({
  apiKey: process.env.NOTION_API_KEY!,
  pageIds: ['page-id-1', 'page-id-2'],
  // Or load from databases
  databaseIds: ['database-id'],
  includeChildPages: true,
  maxDepth: 3,
});
 
const documents = await loader.load();
// Each page becomes a document with title, content, and Notion properties

# SlackLoader

Load messages from Slack channels with thread support and date filtering.

import { SlackLoader } from '@orka-js/tools';
 
const loader = new SlackLoader({
  token: process.env.SLACK_BOT_TOKEN!,
  channelIds: ['C123456', 'C789012'],
  startDate: new Date('2024-01-01'),
  endDate: new Date(),
  includeThreads: true,
  includeFiles: false,
  limit: 1000,
});
 
const documents = await loader.load();
// Each message becomes a document with channel info and timestamp

# GitHubLoader

Load files from GitHub repositories. Perfect for loading documentation, code, or README files.

import { GitHubLoader } from '@orka-js/tools';
 
const loader = new GitHubLoader({
  token: process.env.GITHUB_TOKEN, // Optional for public repos
  owner: 'orka-ai',
  repo: 'orkajs',
  branch: 'main',
  path: 'docs',
  recursive: true,
  fileExtensions: ['.md', '.txt', '.ts'],
  excludePaths: ['node_modules', 'dist'],
  includeReadme: true,
});
 
const documents = await loader.load();
// Each file becomes a document with path, SHA, and repo info

# GoogleDriveLoader

Load files from Google Drive. Supports Google Docs, Sheets, and regular files.

import { GoogleDriveLoader } from '@orka-js/tools';
 
const loader = new GoogleDriveLoader({
  credentials: {
    clientId: process.env.GOOGLE_CLIENT_ID!,
    clientSecret: process.env.GOOGLE_CLIENT_SECRET!,
    refreshToken: process.env.GOOGLE_REFRESH_TOKEN!,
  },
  folderId: 'folder-id', // Load all files from folder
  // Or specific files
  fileIds: ['file-id-1', 'file-id-2'],
  recursive: true,
  maxFiles: 100,
  mimeTypes: ['text/plain', 'application/vnd.google-apps.document'],
});
 
const documents = await loader.load();
// Google Docs are exported as text, regular files are downloaded

Security Note: Never hardcode API keys or tokens. Use environment variables and follow the principle of least privilege when creating API credentials.