AI Evals: Grading Your Model's Homework

You've built a prompt, or even a full RAG pipeline. It seems to work well on the few examples you've tested. But how do you know it's good? How do you prevent it from getting worse when you make a change?

This is where evaluations (or "evals") come in. An eval is like a unit test for your AI. It's an automated way to "grade the homework" of your model on a standardized set of test cases, ensuring its performance is consistent, measurable, and high-quality.

Why You Can't Skip Evals

  • Prevent Regressions: Evals act as a safety net. When you change a prompt or update a RAG system, you can run your evals to ensure you haven't accidentally broken something.
  • Objective Measurement: They replace "it feels better" with hard numbers. You can objectively compare two different prompts and know which one performs better across a wide range of inputs.
  • Faster Iteration: With a solid eval suite, you can experiment with new ideas rapidly, confident that you can instantly measure their impact.
  • Build Trust: For both your team and your users, having a robust evaluation process builds confidence in the reliability of your AI features.

A Simple Evaluation Framework

Let's build a basic evaluation system in TypeScript. Our goal is to test a "summarizer" prompt and grade it based on whether the output contains certain keywords we expect to see.

1. Define Your Test Cases

An evaluation starts with a good set of test cases. These should cover a variety of scenarios, including edge cases.

// lib/evals/test-cases.ts
export const summarizerTestCases = [
  {
    id: 'tech-update',
    input: "The engineering team has refactored the authentication service, " +
           "migrating from legacy JWT tokens to a modern OAuth2 implementation. " +
           "This change improves security and simplifies third-party integrations.",
    expectedKeywords: ['security', 'OAuth2', 'integrations'],
  },
  {
    id: 'marketing-launch',
    input: "Our Q3 marketing campaign, 'Summer of Growth', resulted in a 20% increase " +
           "in user sign-ups and a 15% uplift in social media engagement. The " +
           "campaign focused on video content and influencer partnerships.",
    expectedKeywords: ['sign-ups', 'engagement', 'campaign'],
  },
  {
    id: 'customer-support',
    input: "Customer support response times have decreased by 25% since we " +
           "implemented the new AI-powered triage system. The system automatically " +
           "categorizes tickets and routes them to the correct agent.",
    expectedKeywords: ['triage system', 'response times', '25%'],
  },
];

2. Create the Evaluator

Next, we need a function that runs our model against a test case and grades the result. Here, our "grader" is a simple function that checks for the presence of keywords.

// lib/evals/evaluator.ts
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

interface TestCase {
  id: string;
  input: string;
  expectedKeywords: string[];
}

export interface EvalResult {
  testCaseId: string;
  input: string;
  output: string;
  score: number; // 1 for pass, 0 for fail
  passed: boolean;
  details: string;
}

// This is the prompt we are evaluating
const summarizerSystemPrompt =
  "You are an expert summarizer. Take the following text and provide a concise, one-sentence summary.";

export async function evaluateTestCase(testCase: TestCase): Promise<EvalResult> {
  // 1. Run the model to get the output
  const { text: output } = await generateText({
    model: openai('gpt-4o'),
    system: summarizerSystemPrompt,
    prompt: testCase.input,
  });

  // 2. Grade the output
  const missingKeywords = testCase.expectedKeywords.filter(
    (keyword) => !output.toLowerCase().includes(keyword.toLowerCase())
  );

  const passed = missingKeywords.length === 0;
  const score = passed ? 1 : 0;
  const details = passed
    ? 'All keywords present.'
    : `Missing keywords: ${missingKeywords.join(', ')}`;

  return {
    testCaseId: testCase.id,
    input: testCase.input,
    output,
    score,
    passed,
    details,
  };
}

3. Run the Evaluation Suite

Finally, let's create a script to run all our test cases and print a report.

// lib/evals/run-evals.ts
import { summarizerTestCases } from './test-cases';
import { evaluateTestCase, type EvalResult } from './evaluator';
import 'dotenv/config';

async function main() {
  console.log('--- Running AI Evaluations ---');
  const results: EvalResult[] = [];
  let passedCount = 0;

  for (const testCase of summarizerTestCases) {
    const result = await evaluateTestCase(testCase);
    results.push(result);
    if (result.passed) {
      passedCount++;
    }
    console.log(`\n[${result.passed ? '✅ PASS' : '❌ FAIL'}] ${result.testCaseId}`);
    console.log(`  Input: ${result.input.substring(0, 50)}...`);
    console.log(`  Output: ${result.output}`);
    console.log(`  Details: ${result.details}`);
  }

  const overallScore = (passedCount / results.length) * 100;
  console.log('\n--- Evaluation Summary ---');
  console.log(`Total Tests: ${results.length}`);
  console.log(`Passed: ${passedCount}`);
  console.log(`Failed: ${results.length - passedCount}`);
  console.log(`Overall Score: ${overallScore.toFixed(2)}%`);
  console.log('--------------------------');

  if (overallScore < 100) {
    // In a real CI/CD pipeline, you might exit with an error code
    // process.exit(1);
  }
}

main();
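
Assuming you have a TypeScript runner such as tsx installed (for example via "npm install -D tsx"), you can execute the suite with "npx tsx lib/evals/run-evals.ts". In a CI pipeline, un-commenting the process.exit(1) line turns a failing score into a failing build.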

Beyond Simple Keyword Matching

This is a basic example, but you can create much more sophisticated graders (illustrative sketches of two of these follow the list):

  • JSON Validation: Does the output conform to a specific JSON schema?
  • AI-as-a-Judge: Use a powerful model like GPT-4 to grade the output of a cheaper model based on criteria like "helpfulness," "tone," or "factual accuracy."
  • Function Call Validation: Did the model call the correct function with the correct arguments?
  • Semantic Similarity: Is the output embedding close to the embedding of a known "good" answer?
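
For example, an AI-as-a-judge grader can ask a stronger model to score a summary against a rubric. Here is a minimal sketch using the AI SDK's generateObject with a Zod schema so the verdict comes back as structured data; the rubric wording, the judgeSummary function name, and the 0.7 passing threshold are illustrative choices, not fixed conventions.

// lib/evals/llm-judge.ts (illustrative sketch)
import { generateObject } from 'ai';
import { openai } from '@ai-sdk/openai';
import { z } from 'zod';

const verdictSchema = z.object({
  score: z.number().min(0).max(1).describe('0 = unusable, 1 = excellent'),
  reasoning: z.string().describe('One or two sentences explaining the score'),
});

export async function judgeSummary(input: string, output: string) {
  // Ask a strong model to grade the candidate output against a rubric.
  const { object: verdict } = await generateObject({
    model: openai('gpt-4o'),
    schema: verdictSchema,
    prompt:
      `You are grading a one-sentence summary.\n\n` +
      `Original text:\n${input}\n\n` +
      `Summary to grade:\n${output}\n\n` +
      `Score the summary for faithfulness to the original and concision.`,
  });

  // Treat anything at or above 0.7 as a pass; tune this threshold for your use case.
  return { passed: verdict.score >= 0.7, ...verdict };
}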

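A semantic-similarity grader can instead embed the model's output and a known-good reference answer, then compare the two vectors. A minimal sketch, assuming the AI SDK's embed and cosineSimilarity helpers and an arbitrary 0.8 passing threshold:

// lib/evals/semantic-grader.ts (illustrative sketch)
import { embed, cosineSimilarity } from 'ai';
import { openai } from '@ai-sdk/openai';

export async function gradeBySimilarity(output: string, referenceAnswer: string) {
  // Embed both the model output and the known-good reference answer.
  const [{ embedding: outputEmbedding }, { embedding: referenceEmbedding }] =
    await Promise.all([
      embed({ model: openai.embedding('text-embedding-3-small'), value: output }),
      embed({ model: openai.embedding('text-embedding-3-small'), value: referenceAnswer }),
    ]);

  // Cosine similarity ranges from -1 to 1; higher means closer in meaning.
  const similarity = cosineSimilarity(outputEmbedding, referenceEmbedding);

  // The 0.8 cutoff is arbitrary; calibrate it against examples you have labeled by hand.
  return { similarity, passed: similarity >= 0.8 };
}
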
Evals are a fundamental part of building professional-grade AI applications. By creating a robust testing process, you can innovate faster, build with confidence, and deliver a product that is consistently reliable and high-quality.