🔍 RAG — Retrieval-Augmented Generation
Build knowledge bases with embeddings and vector search
What Is RAG?
Retrieval-Augmented Generation (RAG) is a technique that enhances a large language model's responses by retrieving relevant information from an external knowledge base before generating an answer. Instead of relying solely on what the model learned during training, RAG lets you inject fresh, domain-specific context into every prompt — dynamically and at query time.
Why does this matter? LLMs have a knowledge cutoff date. They do not know about your internal documents, proprietary data, or anything that happened after training ended. RAG bridges that gap by fetching the most relevant information and feeding it directly into the model's context window.
The core idea is simple: search first, then generate. This combination transforms a general-purpose LLM into a specialized assistant that can answer questions about your specific data with high accuracy.
Embeddings Explained
Before we can search for relevant content, we need a way to represent text as numbers that capture its meaning — not just its keywords. This is exactly what embeddings do.
An embedding is a dense vector (a list of floating-point numbers) that represents a piece of text in a high-dimensional space. Texts with similar meanings end up close together in this space, while unrelated texts are far apart.
// Conceptual example: generating an embedding
const text = "How do I reset my password?";
const embedding = await embeddingModel.embed(text);
// Result: [0.023, -0.041, 0.087, ..., 0.012] (e.g., 1536 dimensions)
Key properties of embeddings:
| Property | Description |
|---|---|
| Semantic similarity | "How do I reset my password?" and "I forgot my login credentials" produce vectors that are close together |
| Dimensionality | Typical embedding models produce vectors with 768 to 3072 dimensions |
| Distance metrics | Cosine similarity and dot product are the most common ways to measure closeness |
| Model-specific | Embeddings from different models are NOT interchangeable — always use the same model for indexing and querying |
Popular embedding models include OpenAI's text-embedding-3-small, Cohere's embed-english-v3.0, and open-source options like all-MiniLM-L6-v2 from Sentence Transformers.
Vector Databases
Once you have embeddings, you need a place to store them and search through them efficiently. This is the job of a vector database (or vector store).
A vector database is optimized for similarity search — given a query vector, it finds the most similar vectors in the database using approximate nearest neighbor (ANN) algorithms. Traditional databases search by exact match; vector databases search by meaning.
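To make the idea concrete, here is a minimal brute-force similarity search in TypeScript. It is a sketch rather than how a vector database works internally: it scans every stored vector with the cosineSimilarity function defined later in the Relevance Scoring section, whereas real vector databases replace the linear scan with ANN indexes such as HNSW.
// Sketch: exhaustive nearest-neighbor search over an in-memory list of vectors.
// Real vector databases replace this linear scan with an ANN index (e.g., HNSW).
interface StoredVector {
  id: string;
  embedding: number[];
  document: string;
}
function bruteForceSearch(
  queryEmbedding: number[],
  store: StoredVector[],
  topK = 5
): { id: string; document: string; score: number }[] {
  return store
    .map((item) => ({
      id: item.id,
      document: item.document,
      // cosineSimilarity is defined in the Relevance Scoring section below
      score: cosineSimilarity(queryEmbedding, item.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}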
Popular Vector Databases
| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud service | Production workloads, zero ops |
| Weaviate | Open-source, self-hosted or cloud | Hybrid search (vector + keyword) |
| ChromaDB | Open-source, lightweight | Prototyping, local development |
| Qdrant | Open-source, high performance | Large-scale production systems |
| pgvector | PostgreSQL extension | Teams already using PostgreSQL |
| FAISS | Library (Facebook AI) | Research, in-memory search |
Storing Vectors
// Example: storing a document chunk in a vector database
const client = new ChromaClient();
const collection = await client.getOrCreateCollection({
name: "company_docs",
metadata: { "hnsw:space": "cosine" },
});
await collection.add({
ids: ["doc-001-chunk-3"],
embeddings: [[0.023, -0.041, 0.087 /* ... hundreds more dimensions */]],
metadatas: [{ source: "employee_handbook.pdf", page: 12, section: "PTO Policy" }],
documents: ["Employees are entitled to 20 days of paid time off per year..."],
});
The metadata field is crucial — it lets you filter results by source, date, category, or any other attribute during retrieval.
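For example, a metadata filter can restrict retrieval to a single source document. A minimal sketch against the collection created above (it assumes queryEmbedding has already been computed with the same embedding model used for indexing):
// Retrieve only chunks that came from the employee handbook.
const filtered = await collection.query({
  queryEmbeddings: [queryEmbedding],
  nResults: 5,
  where: { source: "employee_handbook.pdf" },
});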
The RAG Pipeline
The full RAG pipeline has six stages. Understanding each stage is critical for building a system that actually works well in production.
Stage 1: Embed (Indexing Phase)
Take your source documents and convert them into embeddings. This happens once (or whenever your documents change).
async function embedDocuments(chunks: string[]): Promise<number[][]> {
const response = await fetch("https://api.openai.com/v1/embeddings", {
method: "POST",
headers: {
"Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
model: "text-embedding-3-small",
input: chunks,
}),
});
const data = await response.json();
return data.data.map((item: any) => item.embedding);
}
Stage 2: Store
Save the embeddings along with their original text and metadata into your vector database.
async function storeChunks(
collection: Collection,
chunks: string[],
embeddings: number[][],
metadata: Record<string, string>[]
): Promise<void> {
const ids = chunks.map((_, i) => `chunk-${Date.now()}-${i}`);
await collection.add({ ids, embeddings, metadatas: metadata, documents: chunks });
console.log(`Stored ${chunks.length} chunks in the vector database.`);
}
Stage 3: Query
When a user asks a question, embed their query using the same embedding model you used for indexing.
async function embedQuery(query: string): Promise<number[]> {
const response = await fetch("https://api.openai.com/v1/embeddings", {
method: "POST",
headers: {
"Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
model: "text-embedding-3-small",
input: query,
}),
});
const data = await response.json();
return data.data[0].embedding;
}
Stage 4: Retrieve
Search the vector database for the most similar chunks to the query embedding.
async function retrieveRelevant(
  collection: Collection,
  queryEmbedding: number[],
  topK: number = 5
): Promise<{ documents: string[]; scores: number[]; metadatas: any[] }> {
  const results = await collection.query({
    queryEmbeddings: [queryEmbedding],
    nResults: topK,
    include: ["documents", "metadatas", "distances"],
  });
  return {
    documents: results.documents[0] as string[],
    // Chroma reports cosine *distances* (lower = closer). Convert them to
    // similarity scores (higher = better) so they can be compared against a threshold.
    scores: (results.distances![0] as number[]).map((d) => 1 - d),
    metadatas: results.metadatas![0],
  };
}
Stage 5: Augment
Combine the retrieved context with the user's original question into a prompt for the LLM.
function buildAugmentedPrompt(query: string, retrievedDocs: string[]): string {
const context = retrievedDocs
.map((doc, i) => `[Source ${i + 1}]\n${doc}`)
.join("\n\n");
return `You are a helpful assistant. Answer the user's question based ONLY on
the provided context. If the context does not contain enough information,
say "I don't have enough information to answer that."
Context:
${context}
Question: ${query}
Answer:`;
}
Stage 6: Generate
Send the augmented prompt to Claude and get the final answer.
const anthropic = new Anthropic();
async function generateAnswer(augmentedPrompt: string): Promise<string> {
const response = await anthropic.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
messages: [{ role: "user", content: augmentedPrompt }],
});
return response.content[0].type === "text" ? response.content[0].text : "";
}
Claude and Citations with Search Results
Claude supports a dedicated content block type, search_result, that enables automatic citation generation. When you pass retrieved documents as search result blocks, Claude can cite specific sources in its response, making answers verifiable and trustworthy. The example below assumes retrievedDocs is an array of objects with text and metadata fields.
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        // Each retrieved chunk becomes a search_result content block so Claude
        // can cite it. Depending on your API version this feature may require
        // a beta header; check the Anthropic documentation for details.
        ...retrievedDocs.map((doc) => ({
          type: "search_result" as const,
          source: doc.metadata.source,
          title: doc.metadata.section ?? doc.metadata.source,
          content: [{ type: "text" as const, text: doc.text }],
          citations: { enabled: true },
        })),
        {
          type: "text" as const,
          text: "What is the company's PTO policy?",
        },
      ],
    },
  ],
});
When Claude receives search results in this format, it can generate responses with inline citations that reference the original sources. This is essential for enterprise applications where traceability and auditability matter.
Full Q&A Over Documents Example
Here is a complete, end-to-end example that ties the entire pipeline together — from loading documents to answering questions.
import { readFileSync } from "fs";
import Anthropic from "@anthropic-ai/sdk";
import { ChromaClient } from "chromadb";
// Initialize clients
const anthropic = new Anthropic();
const chroma = new ChromaClient();
// Step 1: Load and chunk the document
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
const chunks: string[] = [];
let start = 0;
while (start < text.length) {
const end = Math.min(start + chunkSize, text.length);
chunks.push(text.slice(start, end));
start += chunkSize - overlap;
}
return chunks;
}
// Step 2: Build the index
async function buildIndex(filePath: string): Promise<void> {
  // Assumes a plain-text source file; PDFs need a text-extraction step first.
  const text = readFileSync(filePath, "utf-8");
const chunks = chunkText(text);
const collection = await chroma.getOrCreateCollection({ name: "qa_docs" });
// Embed all chunks (reusing the embedDocuments helper from Stage 1)
const embeddings = await embedDocuments(chunks);
await collection.add({
ids: chunks.map((_, i) => `chunk-${i}`),
embeddings,
documents: chunks,
metadatas: chunks.map((_, i) => ({ source: filePath, chunkIndex: i })),
});
console.log(`Indexed ${chunks.length} chunks from ${filePath}`);
}
// Step 3: Answer a question
async function answerQuestion(question: string): Promise<string> {
const collection = await chroma.getCollection({ name: "qa_docs" });
const queryEmbedding = await embedQuery(question);
const results = await collection.query({
queryEmbeddings: [queryEmbedding],
nResults: 5,
});
const prompt = buildAugmentedPrompt(question, results.documents[0] as string[]);
return generateAnswer(prompt);
}
// Usage
await buildIndex("./docs/employee_handbook.pdf");
const answer = await answerQuestion("How many vacation days do I get?");
console.log(answer);
Chunking Strategies
How you split your documents into chunks has a massive impact on retrieval quality. Poor chunking leads to irrelevant results, truncated context, and confused answers.
Strategy 1: Fixed-Size Chunking
Split text into chunks of a fixed number of characters or tokens, with optional overlap.
| Pros | Cons |
|---|---|
| Simple to implement | Breaks sentences and paragraphs mid-thought |
| Predictable chunk sizes | Ignores document structure |
| Easy to control token usage | May split related information across chunks |
Strategy 2: Sentence-Based Chunking
Split on sentence boundaries, then group sentences until a target size is reached.
function sentenceChunk(text: string, maxTokens = 300): string[] {
const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
const chunks: string[] = [];
let current = "";
for (const sentence of sentences) {
    // Rough budget check: ~4 characters per token on average.
    if ((current + sentence).length > maxTokens * 4) {
if (current) chunks.push(current.trim());
current = sentence;
} else {
current += " " + sentence;
}
}
if (current.trim()) chunks.push(current.trim());
return chunks;
}
Strategy 3: Semantic Chunking
Use embeddings to detect topic shifts, and split where the semantic similarity between consecutive sentences drops below a threshold.
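A minimal sketch of this approach, reusing the embedDocuments and cosineSimilarity helpers shown elsewhere in this guide (the 0.7 threshold is an illustrative value to tune per corpus):
// Sketch: start a new chunk wherever similarity between consecutive sentences drops.
async function semanticChunk(text: string, threshold = 0.7): Promise<string[]> {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const embeddings = await embedDocuments(sentences); // one embedding per sentence
  const chunks: string[] = [];
  let current = sentences[0];
  for (let i = 1; i < sentences.length; i++) {
    if (cosineSimilarity(embeddings[i - 1], embeddings[i]) < threshold) {
      chunks.push(current.trim()); // topic shift detected
      current = sentences[i];
    } else {
      current += sentences[i];
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}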
Strategy 4: Document-Structure Chunking
Respect the document's natural structure — split by headings, sections, paragraphs, or pages. This works especially well for structured documents like manuals, FAQs, and technical documentation.
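As an illustration, a Markdown document can be split on its headings. This is a simplified sketch; a production splitter would also cap chunk sizes and carry heading text into each chunk's metadata:
// Sketch: one chunk per heading section of a Markdown document.
function splitByHeadings(markdown: string): string[] {
  return markdown
    .split(/\n(?=#{1,6}\s)/) // break before each "#", "##", ... heading line
    .map((section) => section.trim())
    .filter((section) => section.length > 0);
}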
Strategy 5: Recursive Chunking
Try splitting by the largest structural unit (sections), then paragraphs, then sentences, then characters — stopping at each level when chunks are within the target size.
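A compact sketch of the idea (it only splits and does not merge small neighbors back together, which a fuller implementation would do):
// Sketch: try progressively finer separators until every chunk fits the target size.
function recursiveChunk(
  text: string,
  maxChars = 2000,
  separators = ["\n\n\n", "\n\n", "\n", ". ", " "]
): string[] {
  if (text.length <= maxChars) return [text];
  const [separator, ...rest] = separators;
  if (separator === undefined) {
    // No separators left: fall back to a hard character split.
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += maxChars) {
      pieces.push(text.slice(i, i + maxChars));
    }
    return pieces;
  }
  return text
    .split(separator)
    .filter((part) => part.trim().length > 0)
    .flatMap((part) => recursiveChunk(part, maxChars, rest));
}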
Best practice: Always include overlap (50-200 characters) between consecutive chunks to preserve context at boundaries.
Relevance Scoring
Not all retrieved chunks are equally useful. Relevance scoring helps you filter out noise and keep only the most meaningful results.
Cosine Similarity
The most common metric. It measures the angle between two vectors, ignoring magnitude. A score of 1.0 means identical direction (perfect match), 0.0 means orthogonal (unrelated).
function cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
Relevance Threshold
Set a minimum similarity score. Discard any chunks that fall below it to avoid injecting irrelevant context.
const RELEVANCE_THRESHOLD = 0.75;
const relevantDocs = retrievedResults.filter(
(result) => result.score >= RELEVANCE_THRESHOLD
);
if (relevantDocs.length === 0) {
return "I could not find relevant information to answer your question.";
}
Re-Ranking
After the initial vector search, use a cross-encoder or a second LLM call to re-rank results by actual relevance to the query. This two-stage approach (fast retrieval followed by accurate re-ranking) is standard in production RAG systems.
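Cross-encoder models are the usual choice at scale; as a simple alternative, here is a sketch that asks Claude to score each candidate chunk. The prompt format and JSON parsing are illustrative and would need hardening for production use:
// Sketch: second-stage re-ranking with an LLM call. Assumes `anthropic` is initialized.
async function rerankWithLLM(
  query: string,
  docs: string[],
  keepTopK = 3
): Promise<string[]> {
  const prompt = `Rate how relevant each passage is to the question on a 0-10 scale.
Question: ${query}

${docs.map((d, i) => `Passage ${i + 1}:\n${d}`).join("\n\n")}

Respond with a JSON array of scores only, one per passage, in order. Example: [7, 2, 9]`;
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 256,
    messages: [{ role: "user", content: prompt }],
  });
  const text = response.content[0].type === "text" ? response.content[0].text : "[]";
  const scores: number[] = JSON.parse(text); // assumes the model returns clean JSON
  return docs
    .map((doc, i) => ({ doc, score: scores[i] ?? 0 }))
    .sort((a, b) => b.score - a.score)
    .slice(0, keepTopK)
    .map((item) => item.doc);
}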
Hybrid Search
Pure vector search sometimes misses results that contain exact keywords the user is looking for. Hybrid search combines vector similarity with traditional keyword search (like BM25) for the best of both worlds.
// Conceptual hybrid search
async function hybridSearch(
query: string,
collection: any,
topK: number = 10
): Promise<SearchResult[]> {
// Vector search — captures semantic meaning
const vectorResults = await collection.query({
queryEmbeddings: [await embedQuery(query)],
nResults: topK,
});
  // Keyword search: stands in for a true BM25 index (Chroma has no built-in BM25);
  // here we simply filter for documents that contain the raw query string
  const keywordResults = await collection.query({
    queryTexts: [query],
    nResults: topK,
    whereDocument: { $contains: query },
  });
// Combine and deduplicate using Reciprocal Rank Fusion (RRF)
return reciprocalRankFusion(vectorResults, keywordResults);
}
function reciprocalRankFusion(
...resultSets: SearchResult[][]
): SearchResult[] {
const scores = new Map<string, number>();
const k = 60; // RRF constant
for (const results of resultSets) {
results.forEach((result, rank) => {
const id = result.id;
const currentScore = scores.get(id) || 0;
scores.set(id, currentScore + 1 / (k + rank + 1));
});
}
return Array.from(scores.entries())
.sort((a, b) => b[1] - a[1])
.map(([id, score]) => ({ id, score }));
}
Hybrid search is particularly important when your documents contain domain-specific jargon, product names, or codes that embeddings might not capture well.
Cost Optimization
RAG systems can become expensive at scale. Here are practical strategies to keep costs under control.
1. Use Smaller Embedding Models
text-embedding-3-small is significantly cheaper than text-embedding-3-large and performs well for most use cases. Only upgrade if retrieval quality demands it.
2. Cache Embeddings
Never re-embed the same text twice. Store embeddings alongside the original text and only regenerate when content changes.
import { createHash } from "crypto";
async function getOrCreateEmbedding(
text: string,
cache: Map<string, number[]>
): Promise<number[]> {
const hash = createHash("sha256").update(text).digest("hex");
if (cache.has(hash)) return cache.get(hash)!;
const embedding = await embedQuery(text);
cache.set(hash, embedding);
return embedding;
}
3. Smart Chunking Reduces Token Usage
Smaller, more focused chunks mean fewer tokens sent to the LLM. A 200-token chunk that is highly relevant is better than a 2000-token chunk where only 10% is useful.
4. Limit Retrieved Context
Do not retrieve 20 chunks when 3 to 5 are enough. More context does not always mean a better answer — it can actually confuse the model and increase cost.
5. Use Caching at the Query Level
If users frequently ask similar questions, cache the final answers. Use the query embedding as a cache key and return cached responses when the cosine similarity exceeds a high threshold (e.g., 0.98).
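A minimal sketch of such a semantic answer cache, reusing the embedQuery, cosineSimilarity, and answerQuestion functions from earlier sections (the in-memory array stands in for whatever cache store you actually use):
// Sketch: cache final answers keyed by query embeddings.
interface CachedAnswer {
  embedding: number[];
  answer: string;
}
const answerCache: CachedAnswer[] = [];
const CACHE_SIMILARITY_THRESHOLD = 0.98;

async function answerWithCache(question: string): Promise<string> {
  const queryEmbedding = await embedQuery(question);
  // Serve a cached answer if a previous query was nearly identical.
  const hit = answerCache.find(
    (entry) => cosineSimilarity(entry.embedding, queryEmbedding) >= CACHE_SIMILARITY_THRESHOLD
  );
  if (hit) return hit.answer;
  const answer = await answerQuestion(question); // full RAG pipeline from the example above
  answerCache.push({ embedding: queryEmbedding, answer });
  return answer;
}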
6. Batch Embedding Requests
Embedding APIs support batch input. Always send multiple texts in a single API call rather than one at a time.
// Inefficient: one API call per chunk
for (const chunk of chunks) {
await embedSingleChunk(chunk); // N API calls
}
// Efficient: one API call for all chunks
const allEmbeddings = await embedDocuments(chunks); // 1 API call
Cost Comparison Table
| Operation | Approximate Cost | Optimization |
|---|---|---|
| Embedding 1M tokens (small model) | ~$0.02 | Batch requests, cache results |
| Embedding 1M tokens (large model) | ~$0.13 | Use small model unless quality demands it |
| Claude Sonnet per 1K input tokens | ~$0.003 | Reduce retrieved context size |
| Claude Sonnet per 1K output tokens | ~$0.015 | Set reasonable max_tokens |
| Vector DB storage (managed) | ~$0.10/month per 1M vectors | Use pgvector if already on PostgreSQL |
Summary
RAG is the most practical way to make Claude (or any LLM) an expert on your specific data. The pipeline is straightforward: embed your documents, store them in a vector database, embed the user's query, retrieve relevant chunks, augment the prompt with context, and generate an answer.
The key decisions that determine success are: choosing the right chunking strategy, setting appropriate relevance thresholds, implementing hybrid search for robustness, and optimizing costs through caching and batching. Master these fundamentals, and you can build production-grade Q&A systems, document search engines, customer support bots, and knowledge management tools.