OrkaJS

Evaluation Metrics

Measure response quality with built-in and custom metrics, ensuring production-grade reliability.

# Why Evaluation?

LLM outputs are non-deterministic and can vary in quality. Evaluation metrics help you measure and track the quality of your RAG system, detect regressions, and compare different configurations. Orka provides both built-in metrics and the ability to create custom ones.

dataset-evaluation.ts
import { createOrka } from '@orka-js/core';
import { OpenAIAdapter } from '@orka-js/openai';
import { MemoryVectorAdapter } from '@orka-js/memory';

const orka = createOrka({
  llm: new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! }),
  vectorDB: new MemoryVectorAdapter(),
});

// Define your evaluation dataset
const dataset = [
  {
    input: 'What is Orka AI?',
    expectedOutput: 'A TypeScript framework for building LLM applications.',
    knowledge: 'docs', // Knowledge base to use for RAG
  },
  {
    input: 'How do I install Orka?',
    expectedOutput: 'Run npm install orkajs',
    knowledge: 'docs',
  },
];

// Run evaluation with multiple metrics
const summary = await orka.evaluate({
  dataset,
  metrics: ['relevance', 'correctness', 'faithfulness', 'hallucination'],
});

console.log(summary.metrics);
// {
//   relevance: { average: 0.95, min: 0.9, max: 1.0 },
//   correctness: { average: 0.88, min: 0.85, max: 0.92 },
//   faithfulness: { average: 0.92, min: 0.88, max: 0.96 },
//   hallucination: { average: 0.05, min: 0.0, max: 0.1 },
// }

console.log(summary.results); // Detailed per-case results
console.log(summary.passed);  // true if all thresholds met

# Built-in Metrics

Orka provides five built-in metrics that cover the most important aspects of RAG quality:

relevance (Contextual Relevance)
Measures the alignment between the user's intent and the generated response.
Target: > 0.8

correctness (Semantic Correctness)
Validates factual accuracy against ground truth, tolerating linguistic variations.
Target: > 0.7

faithfulness (Grounded Faithfulness)
Ensures the response is derived exclusively from the retrieved document chunks.
Target: > 0.85

hallucination (Hallucination Rate)
Identifies fabricated information not present in the source context.
Target: < 0.15

cost (Operational Cost)
Aggregated token consumption (input + output) per inference cycle.
Target: monitor
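The targets above can be enforced programmatically against the `metrics` map that `orka.evaluate()` returns. A minimal sketch of such a check (the `meetsTargets` helper and its `targets` shape are illustrative, not part of the Orka API):

```typescript
type MetricStats = { average: number; min: number; max: number };

// Check each metric's average against a lower and/or upper bound.
function meetsTargets(
  metrics: Record<string, MetricStats>,
  targets: Record<string, { min?: number; max?: number }>,
): boolean {
  return Object.entries(targets).every(([name, t]) => {
    const stats = metrics[name];
    if (!stats) return false; // missing metric counts as a failure
    if (t.min !== undefined && stats.average < t.min) return false;
    if (t.max !== undefined && stats.average > t.max) return false;
    return true;
  });
}

// Using the recommended targets from this section:
const ok = meetsTargets(
  {
    relevance: { average: 0.95, min: 0.9, max: 1.0 },
    hallucination: { average: 0.05, min: 0.0, max: 0.1 },
  },
  { relevance: { min: 0.8 }, hallucination: { max: 0.15 } },
);
console.log(ok); // true
```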

# RAGAS Metrics

The RAGAS suite provides four production-grade metrics for RAG evaluation. Import them from @orka-js/evaluation — they share the same MetricFn interface as built-in metrics.

contextPrecision (Context Precision)
LLM judge: what fraction of retrieved context chunks are actually useful for the answer?
Target: > 0.75

contextRecall (Context Recall)
LLM judge: does the retrieved context cover all aspects of the expected answer?
Target: > 0.70

answerRelevance (Answer Relevance)
Embedding-based: cosine similarity between the question embedding and the answer embedding. A high score means the answer stays on topic.
Target: > 0.80

semanticSimilarity (Semantic Similarity)
Embedding-based: cosine similarity between the generated output and the expected output embeddings. Requires expectedOutput.
Target: > 0.85
ragas-evaluation.ts
import {
  contextPrecision,
  contextRecall,
  answerRelevance,
  semanticSimilarity,
  cosineSimilarity,
  ragasMetrics,
} from '@orka-js/evaluation';
import { OpenAIAdapter } from '@orka-js/openai';

const llm = new OpenAIAdapter({ apiKey: process.env.OPENAI_API_KEY! });

// Use individually
const precision = await contextPrecision({
  input: 'What is the capital of France?',
  output: 'Paris is the capital of France.',
  context: ['France is a country in Europe. Its capital is Paris.', 'Unrelated text.'],
  llm,
});
console.log(precision.score); // e.g. 0.87

// Use the ragasMetrics bundle with orka.evaluate()
const summary = await orka.evaluate({
  dataset,
  metrics: Object.values(ragasMetrics),
});

// cosineSimilarity helper
const sim = cosineSimilarity([0.1, 0.9, 0.3], [0.2, 0.8, 0.4]);
console.log(sim); // ≈ 0.98
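For intuition, the helper computes the standard dot-product formula. A self-contained sketch of the same computation (the real export in @orka-js/evaluation may handle edge cases such as zero vectors differently):

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|)
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([0.1, 0.9, 0.3], [0.2, 0.8, 0.4])); // ≈ 0.9836
```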

# Custom Metrics

Create custom metrics for domain-specific quality checks. A custom metric is an async function that receives the evaluation context and returns a score.

custom-metrics.ts
import type { MetricFn } from 'orkajs';

// Custom metric: check professionalism of tone
const toneCheck: MetricFn = async ({ input, output, context, llm }) => {
  const result = await llm.generate(
    `Rate the professionalism of this response on a scale of 0.0 to 1.0.
Question: ${input}
Response: ${output}
Reply with ONLY a number.`,
    { temperature: 0, maxTokens: 10 }
  );

  const score = parseFloat(result.content.trim());
  return {
    name: 'professionalism',
    score: isNaN(score) ? 0 : Math.min(1, Math.max(0, score)),
  };
};

// Custom metric: check response length
const lengthCheck: MetricFn = async ({ output }) => {
  const wordCount = output.split(/\s+/).length;
  // Penalize very short or very long responses
  const idealLength = 50;
  const score = Math.max(0, 1 - Math.abs(wordCount - idealLength) / idealLength);
  return { name: 'length_appropriateness', score };
};

// Use alongside built-in metrics
const summary = await orka.evaluate({
  dataset: [...],
  metrics: ['relevance', 'faithfulness', toneCheck, lengthCheck],
});
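Custom metrics do not have to call the LLM at all. A purely deterministic sketch that scores keyword coverage (the `keywordCoverage` factory is hypothetical, shown only to illustrate the metric return shape):

```typescript
// Factory producing a metric that checks which required keywords
// appear in the output, returning the covered fraction as the score.
const keywordCoverage = (required: string[]) =>
  async ({ output }: { output: string }) => {
    const lower = output.toLowerCase();
    const hits = required.filter((k) => lower.includes(k.toLowerCase())).length;
    return {
      name: 'keyword_coverage',
      score: required.length ? hits / required.length : 1,
    };
  };

keywordCoverage(['Orka', 'TypeScript'])({ output: 'Orka is a TypeScript framework.' })
  .then((r) => console.log(r.score)); // 1
```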

# Evaluation Context Interface

MetricFn Schema & LLM-as-Judge Readiness

input / output (Core Data)

Type: string

The core inference pair: original prompt vs. generated response. Primary source for semantic drift analysis.

expectedOutput (Ground Truth)

Type: string | undefined

Ground truth reference. Essential for calculating exact match, ROUGE scores, or correctness.

context (Retrieval)

Type: ChunkResult[]

The retrieval evidence. Used to measure faithfulness (hallucination check) and context precision.

llm (Judge)

Type: LLMAdapter

Enables the LLM-as-judge pattern: recursive evaluation using a secondary high-reasoning model.
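Putting the fields above together, the context a metric receives can be sketched as a TypeScript interface (field names follow this section; the exact types exported by orkajs may differ):

```typescript
// Minimal shapes implied by this section, not the library's own definitions.
interface ChunkResult {
  content: string;
  score: number;
}

interface MetricContext {
  input: string;            // original prompt
  output: string;           // generated response
  expectedOutput?: string;  // ground truth, when available
  context: ChunkResult[];   // retrieved chunks
  llm: { generate(prompt: string, opts?: object): Promise<{ content: string }> };
}

type MetricFn = (ctx: MetricContext) => Promise<{ name: string; score: number }>;

// A metric that only needs ground truth: exact match after trimming.
const exactMatch: MetricFn = async ({ output, expectedOutput }) => ({
  name: 'exact_match',
  score:
    expectedOutput !== undefined && output.trim() === expectedOutput.trim()
      ? 1
      : 0,
});
```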

# Evaluation Result

interface EvaluationSummary {
  metrics: {
    [metricName: string]: {
      average: number; // Average score across all cases
      min: number;     // Minimum score
      max: number;     // Maximum score
      stdDev: number;  // Standard deviation
    };
  };
  results: EvaluationResult[]; // Per-case detailed results
  passed: boolean;             // True if all thresholds met
  totalCases: number;          // Number of test cases
  totalLatencyMs: number;      // Total evaluation time
  totalTokens: number;         // Total tokens consumed
}
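The per-metric stats can be derived from raw per-case scores with standard aggregation. A sketch (population standard deviation shown here; Orka's actual implementation may use the sample variant):

```typescript
// Collapse a list of per-case scores into the stats shape above.
function aggregate(scores: number[]) {
  const average = scores.reduce((a, b) => a + b, 0) / scores.length;
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  const variance =
    scores.reduce((a, s) => a + (s - average) ** 2, 0) / scores.length;
  return { average, min, max, stdDev: Math.sqrt(variance) };
}

console.log(aggregate([0.9, 1.0, 0.95]));
```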

# Quality Assurance Framework

Production Readiness Standards

Target Benchmarks

Faithfulness: > 0.85
Strict factual grounding.

Hallucination: < 0.15
Maximum tolerated deviance.

Strategic Best Practices

  • Deploy Holistic Evaluation: Mix relevance, correctness, and faithfulness for a 360° view.

  • Domain Customization: Tailor metrics for specific constraints (Tone, PII, Format).

  • CI/CD Gatekeeping: Automated regression testing at every commit to prevent drift.
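The CI/CD gatekeeping practice reduces to a small gate over the summary returned by orka.evaluate(). A sketch using the `passed` flag and `metrics` map documented on this page (the `ciGate` helper itself is illustrative):

```typescript
// Only the fields the gate needs from the full EvaluationSummary.
interface EvaluationSummaryLite {
  passed: boolean;
  metrics: Record<string, { average: number }>;
}

// Print each metric's average and turn the pass/fail flag into an exit code.
function ciGate(summary: EvaluationSummaryLite): number {
  for (const [name, stats] of Object.entries(summary.metrics)) {
    console.log(`${name}: ${stats.average.toFixed(2)}`);
  }
  return summary.passed ? 0 : 1; // non-zero exit fails the pipeline
}

// e.g. process.exitCode = ciGate(await orka.evaluate({ dataset, metrics }));
console.log(ciGate({ passed: true, metrics: { faithfulness: { average: 0.9 } } })); // 0
```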

# Tree-shaking Imports

// ✅ Import evaluation types
import type { MetricFn, EvaluationSummary } from 'orkajs';
 
// ✅ Import built-in metrics individually
import { relevance, correctness, faithfulness, hallucination } from '@orka-js/evaluation';