# Text Splitters
Split documents into chunks sized for effective embedding and retrieval, using strategies that respect document structure.
# Why Split Text?
Large documents need to be split into smaller chunks for effective semantic search and context injection. Good splitting preserves meaning and respects document structure.
- **Optimal Sizing** (`chunk_size`): 500-1000 characters for precise retrieval; around 2,000 for broader context.
- **Contextual Overlap** (`overlap_ratio`): a 10-20% buffer ensures meaning isn't lost at split boundaries.
- **Structural Logic** (`semantic_split`): prioritize paragraph and sentence breaks over fixed character counts.
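As a back-of-the-envelope check, chunk size and overlap together determine how many chunks a document produces, and therefore how much you index. The helper below is illustrative only; it is not part of `@orka-js/tools`:

```typescript
// Estimate how many chunks a document yields for a given chunk size
// and overlap. Each chunk after the first advances by only
// (chunkSize - overlap) characters, so larger overlaps mean more chunks.
function estimateChunkCount(
  docLength: number,
  chunkSize: number,
  overlap: number
): number {
  if (docLength <= chunkSize) return 1;
  const stride = chunkSize - overlap;
  return 1 + Math.ceil((docLength - chunkSize) / stride);
}

// A 10,000-character document, 1,000-char chunks, 15% overlap:
const count = estimateChunkCount(10_000, 1_000, 150); // → 12 chunks
```

Doubling the overlap ratio noticeably increases index size, which is why 10-20% is the usual sweet spot.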
# RecursiveCharacterTextSplitter
The most versatile splitter. Uses hierarchical separators to split text while preserving semantic boundaries. Tries to split on paragraphs first, then sentences, then words, and finally characters as a last resort.
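The hierarchical fallback can be sketched in a few lines. This is a minimal illustration of the strategy, not the library's implementation: the real splitter also merges small pieces back up toward `chunkSize`, honors `keepSeparator`, and applies overlap, all of which this sketch omits:

```typescript
// Illustrative sketch: try the coarsest separator first, and only
// recurse to finer separators for pieces that are still too large.
function recursiveSplit(
  text: string,
  separators: string[],
  chunkSize: number
): string[] {
  if (text.length <= chunkSize) return [text];
  const [sep, ...finer] = separators;
  if (sep === undefined) {
    // Last resort: hard cut at fixed character offsets.
    const chunks: string[] = [];
    for (let i = 0; i < text.length; i += chunkSize) {
      chunks.push(text.slice(i, i + chunkSize));
    }
    return chunks;
  }
  // Split on the current separator, then recurse into oversized pieces.
  return text
    .split(sep)
    .filter((piece) => piece.length > 0)
    .flatMap((piece) => recursiveSplit(piece, finer, chunkSize));
}

const pieces = recursiveSplit(
  'First paragraph.\n\nA second, much longer paragraph that needs splitting...',
  ['\n\n', '\n', '. ', ' '],
  50
);
```

Note that a plain `String.prototype.split` discards the separator, which is why the real splitter offers `keepSeparator` to retain it.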
```typescript
import { RecursiveCharacterTextSplitter } from '@orka-js/tools';

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,       // Target chunk size in characters
  chunkOverlap: 200,     // Overlap between chunks
  separators: ['\n\n', '\n', '. ', ' ', ''], // Try these in order
  keepSeparator: true,   // Keep separators in chunks
  trimWhitespace: true   // Remove leading/trailing whitespace
});

const text = `Long document content here...`;
const chunks = splitter.split(text);

// Or split multiple documents
const documents = [
  { id: '1', content: 'Doc 1...', metadata: {} },
  { id: '2', content: 'Doc 2...', metadata: {} }
];
const allChunks = splitter.splitDocuments(documents);
```

The separators are tried in priority order:

1. **Paragraph Integrity** (`\n\n`): the primary goal is keeping full paragraphs together, since they are the strongest semantic units.
2. **Line Breaks** (`\n`): if a paragraph exceeds the size limit, the splitter falls back to individual lines.
3. **Sentence Logic** (`. `): the third fallback splits at periods so complete thoughts stay together.
4. **Word Continuity** (space): splits on spaces to avoid breaking words in half, maintaining readability.
5. **Contextual Buffer** (overlap): as a safety net, overlap is applied at every split so context flows smoothly between consecutive chunks.

# MarkdownTextSplitter
Specialized splitter for Markdown that respects document structure: headers, code blocks, and lists. Perfect for documentation where maintaining the hierarchy and code examples is crucial.
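The core idea of header-aware splitting can be approximated with a lookahead split on heading lines. This is an illustration only, and unlike `MarkdownTextSplitter` it does not protect headings that happen to appear inside fenced code blocks:

```typescript
// Illustrative only: break a Markdown document into sections at H2/H3
// headings, keeping each heading attached to the content that follows it.
function splitAtHeadings(markdown: string): string[] {
  return markdown
    .split(/(?=^#{2,3} )/m)   // zero-width split before '## ' or '### ' lines
    .map((section) => section.trim())
    .filter((section) => section.length > 0);
}

const sections = splitAtHeadings('## Intro\nSome text.\n### Details\nMore text.');
```

Because the lookahead is zero-width, each heading stays at the start of its own section rather than being consumed by the split.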
```typescript
import { MarkdownTextSplitter } from '@orka-js/tools';

const splitter = new MarkdownTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200
});

const markdown = `## Introduction

This is a paragraph...

### Subsection

More content here...

\`\`\`typescript
const code = 'example';
\`\`\`
`;

const chunks = splitter.split(markdown);
// Splits at headers, preserving structure
```

Separators are tried in priority order:

1. **H2 Sectioning** (`\n## `): preserves major document sections; ideal for high-level thematic consistency.
2. **H3 Sub-sectioning** (`\n### `): keeps sub-topics together, ensuring detailed explanations stay within context.
3. **Code Block Integrity** (````\n```\n````): ensures snippets and their syntax remain undivided for accurate code interpretation.
4. **Thematic Breaks** (`\n---\n`): uses horizontal rules as natural boundaries between distinct logical concepts.
5. **Paragraph Flow** (`\n\n`): the standard unit of meaning; the fallback when larger structures exceed chunk limits.

# CodeTextSplitter
Language-aware splitter that respects code structure: classes, functions, and blocks. Uses language-specific separators to split at natural boundaries like function definitions, class declarations, and import statements.
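Conceptually, a language-aware splitter is the recursive strategy driven by per-language separator tables: coarse structural boundaries first, whitespace last. The tables below are plausible sketches, not the actual internals of `CodeTextSplitter`:

```typescript
// Hypothetical per-language separator priorities. The library's real
// tables may differ; this only illustrates the shape of the data.
const codeSeparators: Record<string, string[]> = {
  typescript: [
    '\nexport class ', '\nclass ',
    '\nexport function ', '\nfunction ',
    '\nconst ', '\n\n', '\n', ' '
  ],
  python: ['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' '],
  go: ['\nfunc ', '\ntype ', '\nvar ', '\n\n', '\n', ' '],
};

// The same recursive fallback used for prose can then consume these:
const tsSeparators = codeSeparators['typescript'];
```

Ordering matters: putting `\nclass ` before `\nfunction ` means whole classes are kept together before the splitter ever considers cutting between methods.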
```typescript
import { CodeTextSplitter } from '@orka-js/tools';

const splitter = new CodeTextSplitter({
  language: 'typescript', // or 'python', 'javascript', 'java', etc.
  chunkSize: 1000,
  chunkOverlap: 200
});

const code = `export class MyClass {
  constructor() {}

  method1() {
    // implementation
  }

  method2() {
    // implementation
  }
}

export function helperFunction() {
  // implementation
}`;

const chunks = splitter.split(code);
// Splits at class/function boundaries
```

Supported language families:

- **Web Ecosystem** (`*.ts`, `*.js`; full support): deep awareness of TypeScript and JavaScript structures, classes, and arrow functions.
- **Backend & Systems** (`*.py`, `*.go`; native parsing): respects Pythonic indentation and Go's structural patterns to prevent logic fragmentation.
- **Low-Level Languages** (`*.rs`, `*.cpp`): optimized for Rust and C++ source files, maintaining macro and block integrity.
- **Enterprise Logic** (`*.java`; strict structure): handles verbose Java class structures and annotations without breaking context.
- **Declarative Styles** (`*.html`, `*.css`; tag-aware): preserves HTML tag nesting and CSS rule blocks for accurate layout analysis.

# TokenTextSplitter
Split text based on estimated token count, useful for staying within LLM context limits. Uses a character-to-token ratio estimation (default: 4 chars per token for English) to ensure chunks fit within model constraints.
```typescript
import { TokenTextSplitter } from '@orka-js/tools';

const splitter = new TokenTextSplitter({
  chunkSize: 500,              // Target tokens per chunk
  chunkOverlap: 50,            // Overlap in tokens
  estimatedTokensPerChar: 0.25 // ~4 chars per token (English)
});

const text = `Long document...`;
const chunks = splitter.split(text);
// Each chunk is approximately 500 tokens
```

> ⚠️ **Token Estimation**: this splitter uses character-based estimation. For precise token counting, consider a tokenizer library such as tiktoken or gpt-tokenizer.
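When the 4-characters-per-token heuristic is acceptable, the budget arithmetic is simple. These helpers are illustrative, not part of the library:

```typescript
// Convert a token budget into a character budget using the
// chars-per-token heuristic (0.25 tokens/char ≈ 4 chars/token).
function charBudget(tokenLimit: number, tokensPerChar = 0.25): number {
  return Math.floor(tokenLimit / tokensPerChar);
}

// Estimate the token count of a string with the same heuristic.
function estimateTokens(text: string, tokensPerChar = 0.25): number {
  return Math.ceil(text.length * tokensPerChar);
}

const budget = charBudget(500); // 500 tokens ≈ 2000 characters
```

The heuristic skews for code, CJK text, and heavily punctuated prose, which is exactly when a real tokenizer pays off.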
# Comparison
| Splitter | Best For | Preserves |
|---|---|---|
| RecursiveCharacter | General text, articles, books | Paragraphs, sentences |
| Markdown | Documentation, READMEs | Headers, code blocks |
| Code | Source code files | Classes, functions |
| Token | LLM context limits | Token boundaries |
# Complete Example
Here's a complete pipeline showing how to load, split, and index documents:
```typescript
import { createOrka } from '@orka-js/core';
import { MarkdownLoader, RecursiveCharacterTextSplitter } from '@orka-js/tools';

const orka = createOrka({ /* config */ });

// 1. Load documents
const loader = new MarkdownLoader('./docs/guide.md');
const documents = await loader.load();

// 2. Split into chunks
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200
});
const chunks = splitter.splitDocuments(documents);

// 3. Create knowledge base
await orka.knowledge.create({
  name: 'documentation',
  source: chunks.map(c => ({ text: c.content, metadata: c.metadata }))
});

// 4. Query
const result = await orka.ask({
  knowledge: 'documentation',
  question: 'How do I configure Orka AI?'
});
```

# Tree-shaking Imports
Import only what you need to minimize bundle size:
```typescript
// ✅ Import only what you need
import { RecursiveCharacterTextSplitter } from '@orka-js/tools';
import { MarkdownTextSplitter } from '@orka-js/tools';

// ✅ Or import from index
import { RecursiveCharacterTextSplitter, CodeTextSplitter } from '@orka-js/tools';
```