RAG: Giving Your AI an Open-Book Exam
A standard Large Language Model (LLM) is like a student taking a closed-book exam. It can only answer questions based on the vast amount of information it studied during its training. But what if you need it to know about your company's specific policies, the latest project documents, or real-time information?
This is where Retrieval-Augmented Generation (RAG) comes in. RAG is like giving the AI an open-book exam. Before answering your question, it first retrieves relevant information from your specific documents and then uses that context to generate a much more accurate and relevant answer.
Why RAG is a Game-Changer
- Reduces Hallucinations: By grounding the AI in factual documents, you dramatically reduce its tendency to make things up.
- Uses Up-to-Date Information: The AI's knowledge is no longer frozen in time; it can access and use the latest data you provide.
- Provides Source Citations: You can build systems that cite their sources, allowing users to verify the information.
- Unlocks Your Private Data: It allows AI to securely leverage your company's internal knowledge base without needing to be retrained.
How RAG Works
The RAG process involves a few key steps: ingest your documents and split them into manageable chunks, embed each chunk, store the embeddings so they can be searched, retrieve the chunks most relevant to a user's question, and generate an answer grounded in that retrieved context. The rest of this post builds each of these pieces.
Building a Simple RAG System
Let's build a small-scale RAG system in TypeScript. For this example, we'll store our documents and their embeddings in memory. In a real-world application, you'd use a dedicated vector database.
First, we need a way to score how similar two embeddings are so we can find the most relevant documents; cosine similarity is the standard measure for this.
// lib/rag/similarity.ts
export function cosineSimilarity(vecA: number[], vecB: number[]): number {
  if (vecA.length !== vecB.length) {
    throw new Error('Embeddings must have the same dimensions');
  }
  let dotProduct = 0, normA = 0, normB = 0;
  for (let i = 0; i < vecA.length; i++) {
    dotProduct += vecA[i] * vecB[i];
    normA += vecA[i] ** 2;
    normB += vecB[i] ** 2;
  }
  if (normA === 0 || normB === 0) return 0;
  return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
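As a quick sanity check, here's what the helper returns for a few made-up three-dimensional vectors (real embeddings from text-embedding-3-small have 1,536 dimensions, but the behavior is the same):
// A quick, hypothetical sanity check for cosineSimilarity (not part of the RAG system itself).
import { cosineSimilarity } from './similarity';

console.log(cosineSimilarity([1, 0, 0], [1, 0, 0])); // 1: identical direction
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0])); // 0: orthogonal, nothing in common
console.log(cosineSimilarity([1, 2, 3], [2, 4, 6])); // ~1: same direction, different magnitude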
Now, let's create our RAG system. It will manage documents, find relevant context, and generate answers.
// lib/rag/simple-rag.ts
import { embed, embedMany, generateText } from 'ai';
import { openai } from '@ai-sdk/openai';
import { cosineSimilarity } from './similarity';
import 'dotenv/config';
interface Document {
  id: string;
  text: string;
  embedding: number[];
}
class SimpleRAG {
  private documents: Map<string, Document> = new Map();

  // Step 1: Ingest and embed documents
  async addDocuments(texts: string[]) {
    const { embeddings } = await embedMany({
      model: openai.embedding('text-embedding-3-small'),
      values: texts,
    });
    for (let i = 0; i < texts.length; i++) {
      const id = `doc-${this.documents.size + 1}`;
      this.documents.set(id, {
        id,
        text: texts[i],
        embedding: embeddings[i],
      });
    }
    console.log(`Indexed ${texts.length} new documents.`);
  }

  // Step 2: Retrieve relevant documents
  private async retrieve(queryEmbedding: number[], topK: number): Promise<Document[]> {
    const allDocs = Array.from(this.documents.values());
    const similarities = allDocs.map(doc => ({
      ...doc,
      similarity: cosineSimilarity(queryEmbedding, doc.embedding),
    }));
    similarities.sort((a, b) => b.similarity - a.similarity);
    return similarities.slice(0, topK);
  }

  // Step 3: Generate the answer
  async query(question: string) {
    // Embed the user's question
    const { embedding: queryEmbedding } = await embed({
      model: openai.embedding('text-embedding-3-small'),
      value: question,
    });

    // Retrieve the top 3 most relevant documents
    const contextDocs = await this.retrieve(queryEmbedding, 3);
    const contextText = contextDocs
      .map((doc, i) => `[Source ${i + 1}]: ${doc.text}`)
      .join('\n\n');

    // Generate the final response using the context
    const { text } = await generateText({
      model: openai('gpt-4o'),
      system: "You are a helpful assistant. Answer the user's question based on the provided context. Cite your sources using the [Source #] format.",
      prompt: `Context:\n${contextText}\n\nQuestion:\n${question}`,
    });

    console.log('---');
    console.log('Answer:', text);
    console.log('\nSources:');
    contextDocs.forEach((doc, i) => console.log(`- [Source ${i + 1}] ${doc.text}`));
    console.log('---');
  }
}
async function main() {
  const rag = new SimpleRAG();

  await rag.addDocuments([
    "The company's sick leave policy allows for 10 paid days per year.",
    'To submit an expense report, use the online portal and attach all receipts.',
    'The dress code is business casual, but jeans are allowed on Fridays.',
    'All employees receive a 15% discount on company products.',
    'Paid time off (PTO) requests must be submitted at least two weeks in advance.',
  ]);

  await rag.query('How many sick days do I get?');
  await rag.query('How do I ask for a day off?');
}

main().catch(console.error);
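To try this out, you'll need the ai, @ai-sdk/openai, and dotenv packages installed and an OpenAI API key available as OPENAI_API_KEY (the dotenv import loads it from a .env file). A TypeScript runner such as tsx (for example, npx tsx lib/rag/simple-rag.ts) should then execute the script, though the exact command depends on how your project is set up. The answers will vary between runs, but each one should be grounded in, and cite, the documents we indexed.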
Key Takeaways
- Ingestion: You first need to process your documents (PDFs, text files, etc.), break them into manageable chunks, and generate embeddings for each chunk; a simple chunking sketch follows this list.
- Storage: These embeddings are stored in a vector database, which is optimized for fast similarity searches.
- Retrieval: When a user asks a question, you embed their query and use it to find the most relevant document chunks from your database.
- Generation: You then pass the user's original question and the retrieved document chunks to an LLM, instructing it to answer the question based only on the provided context.
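The example above skips the chunking step because its documents are single sentences. For longer files, a common approach is to split the text into overlapping windows before embedding. Here's a minimal sketch (the file path, chunk size, and overlap are illustrative choices of ours; production pipelines often split on sentences, paragraphs, or tokens instead of raw characters):
// lib/rag/chunk.ts
// A minimal, illustrative chunker: fixed-size character windows with overlap
// so that sentences cut at a boundary still appear intact in a neighboring chunk.
export function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
  if (chunkSize <= overlap) {
    throw new Error('chunkSize must be larger than overlap');
  }
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += chunkSize - overlap) {
    chunks.push(text.slice(start, start + chunkSize));
  }
  return chunks;
}

// Usage: feed the chunks straight into SimpleRAG's addDocuments().
// const chunks = chunkText(longPolicyDocument);
// await rag.addDocuments(chunks);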
RAG is a powerful but conceptually simple pattern that lets you build incredibly smart, accurate, and useful AI applications. It's the key to unlocking the value of your own data.