# Evaluation Metrics
Measure response quality with built-in and custom metrics, ensuring production-grade reliability.
# Why Evaluation?
LLM outputs are non-deterministic and can vary in quality. Evaluation metrics help you measure and track the quality of your RAG system, detect regressions, and compare different configurations. Orka provides both built-in metrics and the ability to create custom ones.
```typescript
import { createOrka } from '@orka-js/core';
import { OpenAIAdapter } from '@orka-js/openai';
import { MemoryVectorAdapter } from '@orka-js/memory';

const orka = createOrka({
  llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
  vectorDB: new MemoryVectorAdapter(),
});

// Define your evaluation dataset
const dataset = [
  {
    input: 'What is Orka AI?',
    expectedOutput: 'A TypeScript framework for building LLM applications.',
    knowledge: 'docs', // Knowledge base to use for RAG
  },
  {
    input: 'How do I install Orka?',
    expectedOutput: 'Run npm install orkajs',
    knowledge: 'docs',
  },
];

// Run evaluation with multiple metrics
const summary = await orka.evaluate({
  dataset,
  metrics: ['relevance', 'correctness', 'faithfulness', 'hallucination'],
});

console.log(summary.metrics);
// {
//   relevance: { average: 0.95, min: 0.9, max: 1.0 },
//   correctness: { average: 0.88, min: 0.85, max: 0.92 },
//   faithfulness: { average: 0.92, min: 0.88, max: 0.96 },
//   hallucination: { average: 0.05, min: 0.0, max: 0.1 },
// }

console.log(summary.results); // Detailed per-case results
console.log(summary.passed);  // true if all thresholds met
```

# Built-in Metrics
Orka provides five built-in metrics that cover the most important aspects of RAG quality:
- **`relevance` (Contextual Relevance):** Measures the alignment between the user's intent and the generated response.
- **`correctness` (Semantic Correctness):** Validates factual accuracy against ground truth, tolerating linguistic variations.
- **`faithfulness` (Grounded Faithfulness):** Ensures the response is derived exclusively from the retrieved document chunks.
- **`hallucination` (Hallucination Rate):** Identifies fabricated information not present in the source context.
- **`cost` (Operational Cost):** Aggregated token consumption (input + output) per inference cycle.
# RAGAS Metrics
The RAGAS suite provides four production-grade metrics for RAG evaluation. Import them from `@orka-js/evaluation`; they share the same `MetricFn` interface as the built-in metrics.
- **`contextPrecision` (Context Precision):** LLM judge: what fraction of retrieved context chunks are actually useful for the answer?
- **`contextRecall` (Context Recall):** LLM judge: does the retrieved context cover all aspects of the expected answer?
- **`answerRelevance` (Answer Relevance):** Embedding-based: cosine similarity between the question embedding and the answer embedding. A high score means the answer stays on topic.
- **`semanticSimilarity` (Semantic Similarity):** Embedding-based: cosine similarity between the generated output and expected output embeddings. Requires `expectedOutput`.
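Both embedding-based metrics bottom out in cosine similarity. A minimal self-contained re-implementation (equivalent in spirit to the exported `cosineSimilarity` helper, not its actual source):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in [-1, 1] for real vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosine([1, 0], [1, 0])); // 1  (identical direction)
console.log(cosine([1, 0], [0, 1])); // 0  (orthogonal)
```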
```typescript
import {
  contextPrecision,
  contextRecall,
  answerRelevance,
  semanticSimilarity,
  cosineSimilarity,
  ragasMetrics,
} from '@orka-js/evaluation';
import { OpenAIAdapter } from '@orka-js/openai';

const llm = new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! });

// Use individually
const precision = await contextPrecision({
  input: 'What is the capital of France?',
  output: 'Paris is the capital of France.',
  context: ['France is a country in Europe. Its capital is Paris.', 'Unrelated text.'],
  llm,
});
console.log(precision.score); // e.g. 0.87

// Use the ragasMetrics bundle with orka.evaluate()
const summary = await orka.evaluate({
  dataset,
  metrics: Object.values(ragasMetrics),
});

// cosineSimilarity helper
const sim = cosineSimilarity([0.1, 0.9, 0.3], [0.2, 0.8, 0.4]);
console.log(sim); // ≈ 0.984
```

# Custom Metrics
Create custom metrics for domain-specific quality checks. A custom metric is an async function that receives the evaluation context and returns a score.
```typescript
import type { MetricFn } from 'orkajs';

// Custom metric: check professionalism of tone
const toneCheck: MetricFn = async ({ input, output, context, llm }) => {
  const result = await llm.generate(
    `Rate the professionalism of this response on a scale of 0.0 to 1.0.
Question: ${input}
Response: ${output}
Reply with ONLY a number.`,
    { temperature: 0, maxTokens: 10 }
  );
  const score = parseFloat(result.content.trim());
  return {
    name: 'professionalism',
    score: isNaN(score) ? 0 : Math.min(1, Math.max(0, score)),
  };
};

// Custom metric: check response length
const lengthCheck: MetricFn = async ({ output }) => {
  const wordCount = output.split(/\s+/).length;
  // Penalize very short or very long responses
  const idealLength = 50;
  const score = Math.max(0, 1 - Math.abs(wordCount - idealLength) / idealLength);
  return { name: 'length_appropriateness', score };
};

// Use alongside built-in metrics
const summary = await orka.evaluate({
  dataset: [...],
  metrics: ['relevance', 'faithfulness', toneCheck, lengthCheck],
});
```

# Evaluation Context Interface
`MetricFn` Schema & LLM-as-Judge Readiness

- **`input` / `output`** (core data, type `string`): The core inference pair: original prompt and generated response. The primary source for semantic drift analysis.
- **`expectedOutput`** (ground truth, type `string | undefined`): Ground-truth reference. Essential for calculating exact match, ROUGE scores, or correctness.
- **`context`** (retrieval, type `ChunkResult[]`): The retrieval evidence. Used to measure faithfulness (hallucination check) and context precision.
- **`llm`** (judge, type `LLMAdapter`): Enables the LLM-as-judge pattern, allowing evaluation with a secondary high-reasoning model.
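As an illustration of using the `context` field, here is a crude lexical-overlap grounding check (a sketch only: real faithfulness scoring uses an LLM judge, and the plain-string `context` shape here is simplified from `ChunkResult[]`):

```typescript
// Fraction of output tokens that also appear in the retrieved context.
// A rough grounding proxy: 1.0 means every word is lexically supported.
function lexicalGrounding(output: string, context: string[]): number {
  const contextWords = new Set(
    context.join(' ').toLowerCase().split(/\W+/).filter(Boolean)
  );
  const outputWords = output.toLowerCase().split(/\W+/).filter(Boolean);
  if (outputWords.length === 0) return 0;
  const supported = outputWords.filter((w) => contextWords.has(w)).length;
  return supported / outputWords.length;
}

console.log(lexicalGrounding('Paris is the capital', ['The capital of France is Paris'])); // 1
```

A metric like this catches obvious fabrication cheaply, but scores paraphrases poorly, which is exactly why the embedding- and judge-based metrics above exist.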
# Evaluation Result
```typescript
interface EvaluationSummary {
  metrics: {
    [metricName: string]: {
      average: number; // Average score across all cases
      min: number;     // Minimum score
      max: number;     // Maximum score
      stdDev: number;  // Standard deviation
    };
  };
  results: EvaluationResult[]; // Per-case detailed results
  passed: boolean;             // True if all thresholds met
  totalCases: number;          // Number of test cases
  totalLatencyMs: number;      // Total evaluation time
  totalTokens: number;         // Total tokens consumed
}
```

# Quality Assurance Framework
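The per-metric aggregates you gate on (`average`, `min`, `max`, `stdDev`) are plain descriptive statistics over per-case scores; computing them yourself looks roughly like this (a sketch, not Orka's internals):

```typescript
type Stats = { average: number; min: number; max: number; stdDev: number };

// Aggregate a metric's per-case scores into summary statistics.
function aggregate(scores: number[]): Stats {
  const n = scores.length;
  const average = scores.reduce((sum, x) => sum + x, 0) / n;
  // Population standard deviation of the per-case scores
  const variance = scores.reduce((sum, x) => sum + (x - average) ** 2, 0) / n;
  return {
    average,
    min: Math.min(...scores),
    max: Math.max(...scores),
    stdDev: Math.sqrt(variance),
  };
}

console.log(aggregate([0.9, 1.0, 0.95]).average.toFixed(2)); // 0.95
```

A high average with a large `stdDev` or a low `min` usually signals a few badly failing cases hiding behind a good mean, so gate on `min` as well as `average` for strict requirements.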
Production Readiness Standards

Target benchmarks:

- Strict factual grounding (faithfulness).
- A maximum tolerated deviation (hallucination).
Strategic Best Practices

- **Deploy holistic evaluation:** mix relevance, correctness, and faithfulness for a 360° view.
- **Domain customization:** tailor metrics to domain-specific constraints (tone, PII, format).
- **CI/CD gatekeeping:** run automated regression testing at every commit to prevent drift.
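For example, the PII constraint could be enforced with a simple regex-based score (a hypothetical sketch; the patterns cover only obvious email and US-style phone shapes and are no substitute for a real PII detector):

```typescript
// Score 1.0 if the output contains no obvious PII, 0.0 otherwise.
function piiSafe(output: string): number {
  const patterns = [
    /\b[\w.+-]+@[\w-]+\.[\w.]+\b/,     // email addresses
    /\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b/, // US-style phone numbers
  ];
  return patterns.some((p) => p.test(output)) ? 0 : 1;
}

console.log(piiSafe('Contact support for help.'));     // 1
console.log(piiSafe('Email me at jane@example.com.')); // 0
```

Wrapped in the `MetricFn` shape shown earlier, a binary metric like this makes a natural hard gate in CI: any leaked identifier drives the average below 1.0 and fails the run.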
# Tree-shaking Imports

```typescript
// ✅ Import evaluation types
import type { MetricFn, EvaluationSummary } from 'orkajs';

// ✅ Import built-in metrics individually
import { relevance, correctness, faithfulness, hallucination } from '@orka-js/evaluation';
```