Project: AI-Powered Search Engine
Build a search engine that indexes documents and answers questions with sources
In this project you will build a fully working document search engine from scratch. It ingests text and Markdown files, indexes them using TF-IDF keyword scoring, processes natural-language queries, and uses Claude to generate answers with source citations. Everything runs from a simple CLI — no vector database required.
What You Will Build
By the end of this lesson you will have a Node.js CLI tool that can:
- Ingest a folder of .txt and .md files.
- Index them with a lightweight TF-IDF scoring algorithm.
- Search the index given a free-text query and return the most relevant chunks.
- Generate a Claude-powered answer that references the exact sources it used.
- Output structured JSON results with title, score, and snippet for each source.
Architecture Overview
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Document   │────▶│   Indexer    │────▶│    TF-IDF    │
│    Loader    │     │  (chunker)   │     │    Index     │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
┌──────────────┐                          ┌──────┴───────┐
│  Claude API  │◀─────────────────────────│    Query     │
│  (answerer)  │                          │  Processor   │
└──────┬───────┘                          └──────────────┘
       │
       ▼
┌──────────────┐
│  Structured  │
│    Output    │
└──────────────┘
The flow is: load documents -> chunk them -> build index -> receive query -> retrieve top chunks -> send to Claude -> return structured answer.
Step 1 — Project Setup
Create a new directory and initialise the project:
mkdir ai-search-engine && cd ai-search-engine
npm init -y
npm install @anthropic-ai/sdk
npm install -D typescript @types/node tsx
npx tsc --init
Create the folder structure:
ai-search-engine/
├── docs/ # put your text/markdown files here
├── src/
│ ├── loader.ts # reads files from disk
│ ├── indexer.ts # TF-IDF indexing logic
│ ├── search.ts # query processing & ranking
│ ├── answerer.ts # Claude-powered answer generation
│ ├── types.ts # shared type definitions
│ └── cli.ts # main entry point
├── package.json
└── tsconfig.json
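One note before writing any code: the Anthropic SDK used in Step 6 reads your API key from the ANTHROPIC_API_KEY environment variable, so export it in your shell before running the CLI (the value below is a placeholder):
export ANTHROPIC_API_KEY=your-api-key-here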
Step 2 — Shared Types
Start with clear type definitions so every module speaks the same language.
// src/types.ts
export interface Document {
id: string;
title: string;
filePath: string;
content: string;
chunks: Chunk[];
}
export interface Chunk {
id: string;
documentId: string;
documentTitle: string;
text: string;
index: number;
}
export interface TFIDFIndex {
documents: Document[];
idf: Record<string, number>;
tfidf: Record<string, Record<string, number>>; // chunkId -> term -> score
}
export interface SearchResult {
chunkId: string;
documentTitle: string;
snippet: string;
score: number;
}
export interface AnswerResponse {
answer: string;
sources: {
title: string;
snippet: string;
relevance: number;
}[];
query: string;
}
Step 3 — Document Loader
The loader reads every .txt and .md file from a directory, splits each file into chunks of roughly 500 characters (respecting paragraph boundaries), and returns an array of Document objects.
// src/loader.ts
import fs from "node:fs";
import path from "node:path";
import type { Chunk, Document } from "./types";
const SUPPORTED_EXTENSIONS = [".txt", ".md"];
const CHUNK_SIZE = 500;
export function loadDocuments(dirPath: string): Document[] {
const files = fs
.readdirSync(dirPath)
.filter((f) => SUPPORTED_EXTENSIONS.includes(path.extname(f).toLowerCase()));
if (files.length === 0) {
throw new Error(`No supported files found in ${dirPath}`);
}
return files.map((file) => {
const filePath = path.join(dirPath, file);
const content = fs.readFileSync(filePath, "utf-8");
const id = path.basename(file, path.extname(file));
const title = formatTitle(id);
const chunks = chunkText(content, id, title);
return { id, title, filePath, content, chunks };
});
}
function formatTitle(filename: string): string {
return filename
.replace(/[-_]/g, " ")
.replace(/\b\w/g, (c) => c.toUpperCase());
}
function chunkText(text: string, docId: string, docTitle: string): Chunk[] {
const paragraphs = text.split(/\n\s*\n/);
const chunks: Chunk[] = [];
let buffer = "";
let index = 0;
for (const para of paragraphs) {
const trimmed = para.trim();
if (!trimmed) continue;
if (buffer.length + trimmed.length > CHUNK_SIZE && buffer.length > 0) {
chunks.push({
id: `${docId}-chunk-${index}`,
documentId: docId,
documentTitle: docTitle,
text: buffer.trim(),
index,
});
index++;
buffer = "";
}
buffer += trimmed + "\n\n";
}
if (buffer.trim().length > 0) {
chunks.push({
id: `${docId}-chunk-${index}`,
documentId: docId,
documentTitle: docTitle,
text: buffer.trim(),
index,
});
}
return chunks;
}
Step 4 — TF-IDF Indexer
TF-IDF (Term Frequency - Inverse Document Frequency) is a classic information retrieval technique. It scores each word in each chunk based on how frequently it appears in that chunk versus how rare it is across all chunks. Rare, meaningful words get higher scores than common ones.
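Concretely, the indexer below uses a smoothed IDF and a max-frequency-normalised TF:
IDF(term) = ln((totalChunks + 1) / (df(term) + 1)) + 1
TF-IDF(term, chunk) = (count(term, chunk) / maxCount(chunk)) * IDF(term)
where df(term) is the number of chunks containing the term and maxCount(chunk) is the count of the chunk's most frequent term. The +1 inside the ratio is standard smoothing, and the trailing +1 keeps every term's IDF at least 1, so even a term that appears in every chunk still contributes a small score.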
// src/indexer.ts
import type { Chunk, Document, TFIDFIndex } from "./types";
export function buildIndex(documents: Document[]): TFIDFIndex {
const allChunks: Chunk[] = documents.flatMap((doc) => doc.chunks);
const totalChunks = allChunks.length;
// Step A: compute document frequency for each term
const df: Record<string, number> = {};
for (const chunk of allChunks) {
const uniqueTerms = new Set(tokenize(chunk.text));
for (const term of uniqueTerms) {
df[term] = (df[term] || 0) + 1;
}
}
// Step B: compute IDF
const idf: Record<string, number> = {};
for (const [term, freq] of Object.entries(df)) {
idf[term] = Math.log((totalChunks + 1) / (freq + 1)) + 1;
}
// Step C: compute TF-IDF for each chunk
const tfidf: Record<string, Record<string, number>> = {};
for (const chunk of allChunks) {
const terms = tokenize(chunk.text);
const tf: Record<string, number> = {};
for (const term of terms) {
tf[term] = (tf[term] || 0) + 1;
}
// normalise TF by chunk length
const maxTf = Math.max(...Object.values(tf));
const scores: Record<string, number> = {};
for (const [term, count] of Object.entries(tf)) {
scores[term] = (count / maxTf) * (idf[term] || 0);
}
tfidf[chunk.id] = scores;
}
return { documents, idf, tfidf };
}
export function tokenize(text: string): string[] {
return text
.toLowerCase()
.replace(/[^a-z0-9\s]/g, " ")
.split(/\s+/)
.filter((t) => t.length > 2)
.filter((t) => !STOP_WORDS.has(t));
}
const STOP_WORDS = new Set([
"the", "and", "for", "are", "but", "not", "you", "all",
"can", "had", "her", "was", "one", "our", "out", "has",
"his", "how", "its", "may", "new", "now", "old", "see",
"way", "who", "did", "get", "let", "say", "she", "too",
"use", "this", "that", "with", "have", "from", "they",
"been", "will", "more", "when", "what", "your", "than",
"them", "then", "some", "into", "also", "just", "about",
"which", "would", "there", "their", "could", "other",
"very", "after", "these", "should", "where",
]);
Step 5 — Search / Query Processor
The search module takes a user query, tokenizes it the same way, scores every chunk by summing the TF-IDF values of matching terms, and returns the top results.
// src/search.ts
import type { Chunk, SearchResult, TFIDFIndex } from "./types";
import { tokenize } from "./indexer";
const TOP_K = 5;
export function search(query: string, index: TFIDFIndex): SearchResult[] {
const queryTerms = tokenize(query);
if (queryTerms.length === 0) {
return [];
}
const allChunks: Chunk[] = index.documents.flatMap((d) => d.chunks);
const scored: SearchResult[] = [];
for (const chunk of allChunks) {
const chunkScores = index.tfidf[chunk.id] || {};
let score = 0;
for (const term of queryTerms) {
score += chunkScores[term] || 0;
}
if (score > 0) {
scored.push({
chunkId: chunk.id,
documentTitle: chunk.documentTitle,
snippet: chunk.text.slice(0, 200) + (chunk.text.length > 200 ? "..." : ""),
score: Math.round(score * 1000) / 1000,
});
}
}
scored.sort((a, b) => b.score - a.score);
return scored.slice(0, TOP_K);
}
Step 6 — Claude-Powered Answer Generation
This is where the magic happens. We send the top search results to Claude along with the user query, and Claude synthesizes a clear answer with source citations.
// src/answerer.ts
import Anthropic from "@anthropic-ai/sdk";
import type { AnswerResponse, SearchResult } from "./types";
const client = new Anthropic();
export async function generateAnswer(
query: string,
results: SearchResult[]
): Promise<AnswerResponse> {
if (results.length === 0) {
return {
answer: "No relevant documents found for your query.",
sources: [],
query,
};
}
const contextBlock = results
.map(
(r, i) =>
`[Source ${i + 1}: ${r.documentTitle} (score: ${r.score})]\n${r.snippet}`
)
.join("\n\n");
const systemPrompt = `You are a precise research assistant. Answer the user's
question using ONLY the provided source documents. Follow these rules:
1. Base your answer strictly on the provided sources.
2. Cite sources using [Source N] notation inline.
3. If the sources do not contain enough information, say so clearly.
4. Keep the answer concise but thorough.
5. At the end, list each source you referenced with a one-line summary.
Respond in this exact JSON format:
{
"answer": "Your answer text with [Source N] citations...",
"sources": [
{ "title": "Document Title", "snippet": "key excerpt", "relevance": 0.95 }
]
}`;
const message = await client.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
messages: [
{
role: "user",
content: `## Sources\n\n${contextBlock}\n\n## Question\n\n${query}`,
},
],
system: systemPrompt,
});
const raw =
message.content[0].type === "text" ? message.content[0].text : "";
try {
const parsed = JSON.parse(raw);
return { ...parsed, query };
} catch {
return {
answer: raw,
sources: results.map((r) => ({
title: r.documentTitle,
snippet: r.snippet,
relevance: r.score,
})),
query,
};
}
}
Step 7 — CLI Entry Point
The CLI ties everything together. It accepts a docs directory and a query as arguments.
// src/cli.ts
import { loadDocuments } from "./loader";
import { buildIndex } from "./indexer";
import { search } from "./search";
import { generateAnswer } from "./answerer";
async function main() {
const args = process.argv.slice(2);
if (args.length < 2) {
console.log("Usage: npx tsx src/cli.ts <docs-folder> <query>");
console.log('Example: npx tsx src/cli.ts ./docs "What is TF-IDF?"');
process.exit(1);
}
const docsDir = args[0];
const query = args.slice(1).join(" ");
console.log("\n--- AI Search Engine ---\n");
console.log(`Loading documents from: ${docsDir}`);
// Step 1: Load
const documents = loadDocuments(docsDir);
console.log(`Loaded ${documents.length} document(s), ${
documents.reduce((sum, d) => sum + d.chunks.length, 0)
} chunk(s) total.\n`);
// Step 2: Index
console.log("Building TF-IDF index...");
const index = buildIndex(documents);
console.log("Index ready.\n");
// Step 3: Search
console.log(`Searching for: "${query}"\n`);
const results = search(query, index);
if (results.length === 0) {
console.log("No relevant results found.");
return;
}
console.log(`Found ${results.length} relevant chunk(s):\n`);
results.forEach((r, i) => {
console.log(` ${i + 1}. [${r.documentTitle}] score=${r.score}`);
console.log(` ${r.snippet.slice(0, 80)}...\n`);
});
// Step 4: Generate answer
console.log("Generating AI answer...\n");
const answer = await generateAnswer(query, results);
console.log("=== ANSWER ===\n");
console.log(answer.answer);
console.log("\n=== SOURCES ===\n");
answer.sources.forEach((s, i) => {
console.log(` [${i + 1}] ${s.title} (relevance: ${s.relevance})`);
console.log(` ${s.snippet.slice(0, 100)}\n`);
});
// Step 5: Structured JSON output
console.log("\n=== RAW JSON ===\n");
console.log(JSON.stringify(answer, null, 2));
}
main().catch((err) => {
console.error("Error:", err.message);
process.exit(1);
});
Step 8 — Try It Out
Create a few sample documents inside the docs/ folder:
docs/machine-learning.md
# Machine Learning Basics
Machine learning is a subset of artificial intelligence that enables
systems to learn and improve from experience without being explicitly
programmed. It focuses on developing algorithms that can access data
and use it to learn for themselves.
## Types of Machine Learning
- **Supervised Learning**: The algorithm learns from labeled training data.
- **Unsupervised Learning**: The algorithm finds patterns in unlabeled data.
- **Reinforcement Learning**: The algorithm learns by interacting with an
environment and receiving rewards or penalties.
docs/search-algorithms.txt
Search Algorithms Overview
TF-IDF (Term Frequency - Inverse Document Frequency) is a numerical
statistic that reflects how important a word is to a document in a
collection. It is commonly used as a weighting factor in information
retrieval and text mining.
The TF-IDF value increases proportionally to the number of times a
word appears in the document and is offset by the number of documents
in the collection that contain the word.
BM25 is an improvement over TF-IDF that adds document length
normalisation and term frequency saturation.
Now run the search:
npx tsx src/cli.ts ./docs "What is TF-IDF and how does it work?"
Example output:
--- AI Search Engine ---
Loading documents from: ./docs
Loaded 2 document(s), 4 chunk(s) total.
Building TF-IDF index...
Index ready.
Searching for: "What is TF-IDF and how does it work?"
Found 2 relevant chunk(s):
1. [Search Algorithms] score=4.231
TF-IDF (Term Frequency - Inverse Document Frequency) is a numerical...
2. [Machine Learning] score=1.102
Machine learning is a subset of artificial intelligence that enables...
Generating AI answer...
=== ANSWER ===
TF-IDF stands for Term Frequency - Inverse Document Frequency. It is
a numerical statistic that reflects how important a word is to a
document within a collection [Source 1]. The value increases with the
number of times a word appears in a document but is offset by how
many documents in the collection contain that word [Source 1].
=== SOURCES ===
[1] Search Algorithms (relevance: 0.95)
TF-IDF (Term Frequency - Inverse Document Frequency) is a numerical...
How the TF-IDF Scoring Works
Let us walk through the math with a concrete example.
Suppose you have 10 chunks and the word "tfidf" appears in only 2 of them.
IDF = ln((10 + 1) / (2 + 1)) + 1 = ln(3.67) + 1 = 1.30 + 1 = 2.30
Now in chunk A the word appears 3 times and the most frequent word appears 5 times:
TF = 3 / 5 = 0.6
TF-IDF = 0.6 * 2.30 = 1.38
In chunk B the word appears once and the most frequent word appears 8 times:
TF = 1 / 8 = 0.125
TF-IDF = 0.125 * 2.30 = 0.29
So chunk A would rank higher for a query containing "tfidf" — exactly what we want.
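You can sanity-check these numbers with a few lines that mirror the indexer's formula (the counts are just the ones from the example above):
// tfidf-check.ts — reproduces the worked example with the same formula as buildIndex
const totalChunks = 10;
const docFreq = 2; // "tfidf" appears in 2 of the 10 chunks
const idf = Math.log((totalChunks + 1) / (docFreq + 1)) + 1; // ≈ 2.30

const scoreA = (3 / 5) * idf; // chunk A: 3 occurrences, most frequent term appears 5 times → ≈ 1.38
const scoreB = (1 / 8) * idf; // chunk B: 1 occurrence, most frequent term appears 8 times → ≈ 0.29

console.log(idf.toFixed(2), scoreA.toFixed(2), scoreB.toFixed(2)); // 2.30 1.38 0.29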
Key Design Decisions
| Decision | Reason |
|---|---|
| TF-IDF over vector embeddings | Keeps retrieval simple and understandable; no embedding model or vector database needed |
| Chunk size of 500 chars | Balances granularity with context; too small loses meaning, too large loses precision |
| Top-5 results | Provides enough context for Claude without exceeding token limits |
| JSON structured output | Makes the tool composable with other systems |
| Paragraph-aware chunking | Avoids splitting sentences in the middle |
Extending the Project
Once you have the basics working, consider these enhancements:
- Add BM25 scoring — a more sophisticated ranking algorithm that handles document length (a minimal sketch follows this list).
- Recursive directory loading — traverse subdirectories for larger document sets.
- PDF support — use a library like pdf-parse to extract text from PDFs.
- Streaming answers — use Claude's streaming API to show the answer as it generates.
- Web interface — add an Express server and a simple HTML frontend.
- Caching — store the index to disk so you only rebuild when documents change.
- Hybrid search — combine keyword matching with semantic similarity using embeddings.
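For the BM25 enhancement mentioned above, here is a minimal sketch to start from. It reuses tokenize and the smoothed IDF already stored on the index, recomputes raw term frequencies per chunk on the fly, and uses the conventional defaults k1 = 1.2 and b = 0.75. The file name and the bm25Search helper are illustrative choices, not part of the project above.
// src/bm25.ts — hypothetical enhancement, not part of the core project
import type { SearchResult, TFIDFIndex } from "./types";
import { tokenize } from "./indexer";

const K1 = 1.2; // term-frequency saturation
const B = 0.75; // strength of document-length normalisation

export function bm25Search(query: string, index: TFIDFIndex, topK = 5): SearchResult[] {
  const chunks = index.documents.flatMap((d) => d.chunks);
  const avgLen =
    chunks.reduce((sum, c) => sum + tokenize(c.text).length, 0) / chunks.length;
  const queryTerms = tokenize(query);

  const scored = chunks.map((chunk) => {
    const tokens = tokenize(chunk.text);
    const tf: Record<string, number> = {};
    for (const t of tokens) tf[t] = (tf[t] || 0) + 1;

    let score = 0;
    for (const term of queryTerms) {
      const f = tf[term] || 0;
      if (f === 0) continue;
      // Reuse the smoothed IDF already computed by buildIndex.
      const idf = index.idf[term] || 0;
      // BM25 term weight: saturates as f grows and penalises long chunks.
      score +=
        idf * ((f * (K1 + 1)) / (f + K1 * (1 - B + B * (tokens.length / avgLen))));
    }
    return {
      chunkId: chunk.id,
      documentTitle: chunk.documentTitle,
      snippet: chunk.text.slice(0, 200) + (chunk.text.length > 200 ? "..." : ""),
      score: Math.round(score * 1000) / 1000,
    };
  });

  return scored
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}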
Summary
In this project you built a complete document search engine that:
- Loads and chunks text and Markdown files.
- Builds a TF-IDF index for fast keyword-based retrieval.
- Ranks document chunks by relevance to a user query.
- Uses Claude to generate a natural-language answer with inline source citations.
- Returns structured JSON output ready for downstream consumption.
The entire system runs locally from the command line with no external database, demonstrating how traditional information retrieval techniques pair powerfully with large language models.