🔍 RAG — Retrieval-Augmented Generation
Build knowledge bases with embeddings and vector search
What Is RAG?
Retrieval-Augmented Generation (RAG) is a technique that enhances a large language model's responses by retrieving relevant information from an external knowledge base before generating an answer. Instead of relying solely on what the model learned during training, RAG lets you inject fresh, domain-specific context into every prompt — dynamically and at query time.
Why does this matter? LLMs have a knowledge cutoff date. They do not know about your internal documents, proprietary data, or anything that happened after training ended. RAG bridges that gap by fetching the most relevant information and feeding it directly into the model's context window.
The core idea is simple: search first, then generate. This combination transforms a general-purpose LLM into a specialized assistant that can answer questions about your specific data with high accuracy.
Embeddings Explained
Before we can search for relevant content, we need a way to represent text as numbers that capture its meaning — not just its keywords. This is exactly what embeddings do.
An embedding is a dense vector (a list of floating-point numbers) that represents a piece of text in a high-dimensional space. Texts with similar meanings end up close together in this space, while unrelated texts are far apart.
// Conceptual example: generating an embedding
const text = "How do I reset my password?";
const embedding = await embeddingModel.embed(text);
// Result: [0.023, -0.041, 0.087, ..., 0.012] (e.g., 1536 dimensions)
Key properties of embeddings:
| Property | Description |
|---|---|
| Semantic similarity | "How do I reset my password?" and "I forgot my login credentials" produce vectors that are close together |
| Dimensionality | Typical embedding models produce vectors with 768 to 3072 dimensions |
| Distance metrics | Cosine similarity and dot product are the most common ways to measure closeness |
| Model-specific | Embeddings from different models are NOT interchangeable — always use the same model for indexing and querying |
Popular embedding models include OpenAI's text-embedding-3-small, Cohere's embed-english-v3.0, and open-source options like all-MiniLM-L6-v2 from Sentence Transformers.
Vector Databases
Once you have embeddings, you need a place to store them and search through them efficiently. This is the job of a vector database (or vector store).
A vector database is optimized for similarity search — given a query vector, it finds the most similar vectors in the database using approximate nearest neighbor (ANN) algorithms. Traditional databases search by exact match; vector databases search by meaning.
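To make the idea concrete, here is a minimal brute-force similarity search in TypeScript. It is a sketch rather than how a vector database works internally: it scans every stored vector with the cosineSimilarity function defined later in the Relevance Scoring section, whereas real vector databases replace the linear scan with ANN indexes such as HNSW.
// Sketch: exhaustive nearest-neighbor search over an in-memory list of vectors.
// Real vector databases replace this linear scan with an ANN index (e.g., HNSW).
interface StoredVector {
  id: string;
  embedding: number[];
  document: string;
}
function bruteForceSearch(
  queryEmbedding: number[],
  store: StoredVector[],
  topK = 5
): { id: string; document: string; score: number }[] {
  return store
    .map((item) => ({
      id: item.id,
      document: item.document,
      // cosineSimilarity is defined in the Relevance Scoring section below
      score: cosineSimilarity(queryEmbedding, item.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}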
Popular Vector Databases
| Database | Type | Best For |
|---|---|---|
| Pinecone | Managed cloud service | Production workloads, zero ops |
| Weaviate | Open-source, self-hosted or cloud | Hybrid search (vector + keyword) |
| ChromaDB | Open-source, lightweight | Prototyping, local development |
| Qdrant | Open-source, high performance | Large-scale production systems |
| pgvector | PostgreSQL extension | Teams already using PostgreSQL |
| FAISS | Library (Facebook AI) | Research, in-memory search |
Storing Vectors
// Example: storing a document chunk in a vector database
const client = new ChromaClient();
const collection = await client.getOrCreateCollection({
name: "company_docs",
metadata: { "hnsw:space": "cosine" },
});
await collection.add({
ids: ["doc-001-chunk-3"],
embeddings: [[0.023, -0.041, 0.087 /* ... hundreds more dimensions */]],
metadatas: [{ source: "employee_handbook.pdf", page: 12, section: "PTO Policy" }],
documents: ["Employees are entitled to 20 days of paid time off per year..."],
});
The metadata field is crucial — it lets you filter results by source, date, category, or any other attribute during retrieval.
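For example, a metadata filter can restrict retrieval to a single source document. A minimal sketch against the collection created above (it assumes queryEmbedding has already been computed with the same embedding model used for indexing):
// Retrieve only chunks that came from the employee handbook.
const filtered = await collection.query({
  queryEmbeddings: [queryEmbedding],
  nResults: 5,
  where: { source: "employee_handbook.pdf" },
});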
The RAG Pipeline
The full RAG pipeline has six stages. Understanding each stage is critical for building a system that actually works well in production.
Stage 1: Embed (Indexing Phase)
Take your source documents and convert them into embeddings. This happens once (or whenever your documents change).
async function embedDocuments(chunks: string[]): Promise<number[][]> {
const response = await fetch("https://api.openai.com/v1/embeddings", {
method: "POST",
headers: {
"Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
model: "text-embedding-3-small",
input: chunks,
}),
});
const data = await response.json();
return data.data.map((item: any) => item.embedding);
}
Stage 2: Store
Save the embeddings along with their original text and metadata into your vector database.
async function storeChunks(
collection: Collection,
chunks: string[],
embeddings: number[][],
metadata: Record<string, string>[]
): Promise<void> {
const ids = chunks.map((_, i) => `chunk-${Date.now()}-${i}`);
await collection.add({ ids, embeddings, metadatas: metadata, documents: chunks });
console.log(`Stored ${chunks.length} chunks in the vector database.`);
}
Stage 3: Query
When a user asks a question, embed their query using the same embedding model you used for indexing.
async function embedQuery(query: string): Promise<number[]> {
const response = await fetch("https://api.openai.com/v1/embeddings", {
method: "POST",
headers: {
"Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
model: "text-embedding-3-small",
input: query,
}),
});
const data = await response.json();
return data.data[0].embedding;
}
Stage 4: Retrieve
Search the vector database for the most similar chunks to the query embedding.
async function retrieveRelevant(
  collection: Collection,
  queryEmbedding: number[],
  topK: number = 5
): Promise<{ documents: string[]; scores: number[]; metadatas: any[] }> {
  const results = await collection.query({
    queryEmbeddings: [queryEmbedding],
    nResults: topK,
    include: ["documents", "metadatas", "distances"],
  });
  return {
    documents: results.documents[0] as string[],
    // Chroma reports cosine *distances* (lower = closer). Convert them to
    // similarity scores (higher = better) so they can be compared against a threshold.
    scores: (results.distances![0] as number[]).map((d) => 1 - d),
    metadatas: results.metadatas![0],
  };
}
Stage 5: Augment
Combine the retrieved context with the user's original question into a prompt for the LLM.
function buildAugmentedPrompt(query: string, retrievedDocs: string[]): string {
const context = retrievedDocs
.map((doc, i) => `[Source ${i + 1}]\n${doc}`)
.join("\n\n");
return `You are a helpful assistant. Answer the user's question based ONLY on
the provided context. If the context does not contain enough information,
say "I don't have enough information to answer that."
Context:
${context}
Question: ${query}
Answer:`;
}
Stage 6: Generate
Send the augmented prompt to Claude and get the final answer.
const anthropic = new Anthropic();
async function generateAnswer(augmentedPrompt: string): Promise<string> {
const response = await anthropic.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
messages: [{ role: "user", content: augmentedPrompt }],
});
return response.content[0].type === "text" ? response.content[0].text : "";
}
Claude and Citations with Search Results
Claude supports a dedicated content block type, search_result, that enables automatic citation generation. When you pass retrieved documents as search result blocks, Claude can cite specific sources in its response, making answers verifiable and trustworthy. The example below assumes retrievedDocs is an array of objects with text and metadata fields.
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  messages: [
    {
      role: "user",
      content: [
        // Each retrieved chunk becomes a search_result content block so Claude
        // can cite it. Depending on your API version this feature may require
        // a beta header; check the Anthropic documentation for details.
        ...retrievedDocs.map((doc) => ({
          type: "search_result" as const,
          source: doc.metadata.source,
          title: doc.metadata.section ?? doc.metadata.source,
          content: [{ type: "text" as const, text: doc.text }],
          citations: { enabled: true },
        })),
        {
          type: "text" as const,
          text: "What is the company's PTO policy?",
        },
      ],
    },
  ],
});
When Claude receives search results in this format, it can generate responses with inline citations that reference the original sources. This is essential for enterprise applications where traceability and auditability matter.
Full Q&A Over Documents Example
Here is a complete, end-to-end example that ties the entire pipeline together — from loading documents to answering questions.
import { readFileSync } from "fs";
import Anthropic from "@anthropic-ai/sdk";
import { ChromaClient } from "chromadb";
// Initialize clients
const anthropic = new Anthropic();
const chroma = new ChromaClient();
// Step 1: Load and chunk the document
function chunkText(text: string, chunkSize = 500, overlap = 50): string[] {
const chunks: string[] = [];
let start = 0;
while (start < text.length) {
const end = Math.min(start + chunkSize, text.length);
chunks.push(text.slice(start, end));
start += chunkSize - overlap;
}
return chunks;
}
// Step 2: Build the index
async function buildIndex(filePath: string): Promise<void> {
  // Assumes a plain-text source file; PDFs need a text-extraction step first.
  const text = readFileSync(filePath, "utf-8");
const chunks = chunkText(text);
const collection = await chroma.getOrCreateCollection({ name: "qa_docs" });
// Embed all chunks (reusing the embedDocuments helper from Stage 1)
const embeddings = await embedDocuments(chunks);
await collection.add({
ids: chunks.map((_, i) => `chunk-${i}`),
embeddings,
documents: chunks,
metadatas: chunks.map((_, i) => ({ source: filePath, chunkIndex: i })),
});
console.log(`Indexed ${chunks.length} chunks from ${filePath}`);
}
// Step 3: Answer a question
async function answerQuestion(question: string): Promise<string> {
const collection = await chroma.getCollection({ name: "qa_docs" });
const queryEmbedding = await embedQuery(question);
const results = await collection.query({
queryEmbeddings: [queryEmbedding],
nResults: 5,
});
const prompt = buildAugmentedPrompt(question, results.documents[0] as string[]);
return generateAnswer(prompt);
}
// Usage
await buildIndex("./docs/employee_handbook.pdf");
const answer = await answerQuestion("How many vacation days do I get?");
console.log(answer);
Chunking Strategies
How you split your documents into chunks has a massive impact on retrieval quality. Poor chunking leads to irrelevant results, truncated context, and confused answers.
Strategy 1: Fixed-Size Chunking
Split text into chunks of a fixed number of characters or tokens, with optional overlap.
| Pros | Cons |
|---|---|
| Simple to implement | Breaks sentences and paragraphs mid-thought |
| Predictable chunk sizes | Ignores document structure |
| Easy to control token usage | May split related information across chunks |
Strategy 2: Sentence-Based Chunking
Split on sentence boundaries, then group sentences until a target size is reached.
function sentenceChunk(text: string, maxTokens = 300): string[] {
const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
const chunks: string[] = [];
let current = "";
for (const sentence of sentences) {
    // Rough budget check: ~4 characters per token on average.
    if ((current + sentence).length > maxTokens * 4) {
if (current) chunks.push(current.trim());
current = sentence;
} else {
current += " " + sentence;
}
}
if (current.trim()) chunks.push(current.trim());
return chunks;
}
Strategy 3: Semantic Chunking
Use embeddings to detect topic shifts, and split where the semantic similarity between consecutive sentences drops below a threshold.
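A minimal sketch of this approach, reusing the embedDocuments and cosineSimilarity helpers shown elsewhere in this guide (the 0.7 threshold is an illustrative value to tune per corpus):
// Sketch: start a new chunk wherever similarity between consecutive sentences drops.
async function semanticChunk(text: string, threshold = 0.7): Promise<string[]> {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const embeddings = await embedDocuments(sentences); // one embedding per sentence
  const chunks: string[] = [];
  let current = sentences[0];
  for (let i = 1; i < sentences.length; i++) {
    if (cosineSimilarity(embeddings[i - 1], embeddings[i]) < threshold) {
      chunks.push(current.trim()); // topic shift detected
      current = sentences[i];
    } else {
      current += sentences[i];
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}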
Strategy 4: Document-Structure Chunking
Respect the document's natural structure — split by headings, sections, paragraphs, or pages. This works especially well for structured documents like manuals, FAQs, and technical documentation.
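As an illustration, a Markdown document can be split on its headings. This is a simplified sketch; a production splitter would also cap chunk sizes and carry heading text into each chunk's metadata:
// Sketch: one chunk per heading section of a Markdown document.
function splitByHeadings(markdown: string): string[] {
  return markdown
    .split(/\n(?=#{1,6}\s)/) // break before each "#", "##", ... heading line
    .map((section) => section.trim())
    .filter((section) => section.length > 0);
}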
Strategy 5: Recursive Chunking
Try splitting by the largest structural unit (sections), then paragraphs, then sentences, then characters — stopping at each level when chunks are within the target size.
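A compact sketch of the idea (it only splits and does not merge small neighbors back together, which a fuller implementation would do):
// Sketch: try progressively finer separators until every chunk fits the target size.
function recursiveChunk(
  text: string,
  maxChars = 2000,
  separators = ["\n\n\n", "\n\n", "\n", ". ", " "]
): string[] {
  if (text.length <= maxChars) return [text];
  const [separator, ...rest] = separators;
  if (separator === undefined) {
    // No separators left: fall back to a hard character split.
    const pieces: string[] = [];
    for (let i = 0; i < text.length; i += maxChars) {
      pieces.push(text.slice(i, i + maxChars));
    }
    return pieces;
  }
  return text
    .split(separator)
    .filter((part) => part.trim().length > 0)
    .flatMap((part) => recursiveChunk(part, maxChars, rest));
}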
Best practice: Always include overlap (50-200 characters) between consecutive chunks to preserve context at boundaries.
Relevance Scoring
Not all retrieved chunks are equally useful. Relevance scoring helps you filter out noise and keep only the most meaningful results.
Cosine Similarity
The most common metric. It measures the angle between two vectors, ignoring magnitude. A score of 1.0 means identical direction (perfect match), 0.0 means orthogonal (unrelated).
function cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
return dotProduct / (Math.sqrt(normA) * Math.sqrt(normB));
}
Relevance Threshold
Set a minimum similarity score. Discard any chunks that fall below it to avoid injecting irrelevant context.
const RELEVANCE_THRESHOLD = 0.75;
const relevantDocs = retrievedResults.filter(
(result) => result.score >= RELEVANCE_THRESHOLD
);
if (relevantDocs.length === 0) {
return "I could not find relevant information to answer your question.";
}
Re-Ranking
After the initial vector search, use a cross-encoder or a second LLM call to re-rank results by actual relevance to the query. This two-stage approach (fast retrieval followed by accurate re-ranking) is standard in production RAG systems.
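Cross-encoder models are the usual choice at scale; as a simple alternative, here is a sketch that asks Claude to score each candidate chunk. The prompt format and JSON parsing are illustrative and would need hardening for production use:
// Sketch: second-stage re-ranking with an LLM call. Assumes `anthropic` is initialized.
async function rerankWithLLM(
  query: string,
  docs: string[],
  keepTopK = 3
): Promise<string[]> {
  const prompt = `Rate how relevant each passage is to the question on a 0-10 scale.
Question: ${query}

${docs.map((d, i) => `Passage ${i + 1}:\n${d}`).join("\n\n")}

Respond with a JSON array of scores only, one per passage, in order. Example: [7, 2, 9]`;
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 256,
    messages: [{ role: "user", content: prompt }],
  });
  const text = response.content[0].type === "text" ? response.content[0].text : "[]";
  const scores: number[] = JSON.parse(text); // assumes the model returns clean JSON
  return docs
    .map((doc, i) => ({ doc, score: scores[i] ?? 0 }))
    .sort((a, b) => b.score - a.score)
    .slice(0, keepTopK)
    .map((item) => item.doc);
}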
Hybrid Search
Pure vector search sometimes misses results that contain exact keywords the user is looking for. Hybrid search combines vector similarity with traditional keyword search (like BM25) for the best of both worlds.
// Conceptual hybrid search
async function hybridSearch(
query: string,
collection: any,
topK: number = 10
): Promise<SearchResult[]> {
// Vector search — captures semantic meaning
const vectorResults = await collection.query({
queryEmbeddings: [await embedQuery(query)],
nResults: topK,
});
  // Keyword search: stands in for a true BM25 index (Chroma has no built-in BM25);
  // here we simply filter for documents that contain the raw query string
  const keywordResults = await collection.query({
    queryTexts: [query],
    nResults: topK,
    whereDocument: { $contains: query },
  });
// Combine and deduplicate using Reciprocal Rank Fusion (RRF)
return reciprocalRankFusion(vectorResults, keywordResults);
}
function reciprocalRankFusion(
...resultSets: SearchResult[][]
): SearchResult[] {
const scores = new Map<string, number>();
const k = 60; // RRF constant
for (const results of resultSets) {
results.forEach((result, rank) => {
const id = result.id;
const currentScore = scores.get(id) || 0;
scores.set(id, currentScore + 1 / (k + rank + 1));
});
}
return Array.from(scores.entries())
.sort((a, b) => b[1] - a[1])
.map(([id, score]) => ({ id, score }));
}
Hybrid search is particularly important when your documents contain domain-specific jargon, product names, or codes that embeddings might not capture well.
Cost Optimization
RAG systems can become expensive at scale. Here are practical strategies to keep costs under control.
1. Use Smaller Embedding Models
text-embedding-3-small is significantly cheaper than text-embedding-3-large and performs well for most use cases. Only upgrade if retrieval quality demands it.
2. Cache Embeddings
Never re-embed the same text twice. Store embeddings alongside the original text and only regenerate when content changes.
import { createHash } from "crypto";
async function getOrCreateEmbedding(
text: string,
cache: Map<string, number[]>
): Promise<number[]> {
const hash = createHash("sha256").update(text).digest("hex");
if (cache.has(hash)) return cache.get(hash)!;
const embedding = await embedQuery(text);
cache.set(hash, embedding);
return embedding;
}
3. Smart Chunking Reduces Token Usage
Smaller, more focused chunks mean fewer tokens sent to the LLM. A 200-token chunk that is highly relevant is better than a 2000-token chunk where only 10% is useful.
4. Limit Retrieved Context
Do not retrieve 20 chunks when 3 to 5 are enough. More context does not always mean a better answer — it can actually confuse the model and increase cost.
5. Use Caching at the Query Level
If users frequently ask similar questions, cache the final answers. Use the query embedding as a cache key and return cached responses when the cosine similarity exceeds a high threshold (e.g., 0.98).
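A minimal sketch of such a semantic answer cache, reusing the embedQuery, cosineSimilarity, and answerQuestion functions from earlier sections (the in-memory array stands in for whatever cache store you actually use):
// Sketch: cache final answers keyed by query embeddings.
interface CachedAnswer {
  embedding: number[];
  answer: string;
}
const answerCache: CachedAnswer[] = [];
const CACHE_SIMILARITY_THRESHOLD = 0.98;

async function answerWithCache(question: string): Promise<string> {
  const queryEmbedding = await embedQuery(question);
  // Serve a cached answer if a previous query was nearly identical.
  const hit = answerCache.find(
    (entry) => cosineSimilarity(entry.embedding, queryEmbedding) >= CACHE_SIMILARITY_THRESHOLD
  );
  if (hit) return hit.answer;
  const answer = await answerQuestion(question); // full RAG pipeline from the example above
  answerCache.push({ embedding: queryEmbedding, answer });
  return answer;
}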
6. Batch Embedding Requests
Embedding APIs support batch input. Always send multiple texts in a single API call rather than one at a time.
// Inefficient: one API call per chunk
for (const chunk of chunks) {
await embedSingleChunk(chunk); // N API calls
}
// Efficient: one API call for all chunks
const allEmbeddings = await embedDocuments(chunks); // 1 API call
Cost Comparison Table
| Operation | Approximate Cost | Optimization |
|---|---|---|
| Embedding 1M tokens (small model) | ~$0.02 | Batch requests, cache results |
| Embedding 1M tokens (large model) | ~$0.13 | Use small model unless quality demands it |
| Claude Sonnet per 1K input tokens | ~$0.003 | Reduce retrieved context size |
| Claude Sonnet per 1K output tokens | ~$0.015 | Set reasonable max_tokens |
| Vector DB storage (managed) | ~$0.10/month per 1M vectors | Use pgvector if already on PostgreSQL |
Summary
RAG is the most practical way to make Claude (or any LLM) an expert on your specific data. The pipeline is straightforward: embed your documents, store them in a vector database, embed the user's query, retrieve relevant chunks, augment the prompt with context, and generate an answer.
The key decisions that determine success are: choosing the right chunking strategy, setting appropriate relevance thresholds, implementing hybrid search for robustness, and optimizing costs through caching and batching. Master these fundamentals, and you can build production-grade Q&A systems, document search engines, customer support bots, and knowledge management tools.