Project: AI-Powered Search Engine
Build a search engine that indexes documents and answers questions with sources
In this project you will build a fully working document search engine from scratch. It ingests text and Markdown files, indexes them using TF-IDF keyword scoring, processes natural-language queries, and uses Claude to generate answers with source citations. Everything runs from a simple CLI — no vector database required.
What You Will Build
By the end of this lesson you will have a Node.js CLI tool that can:
- Ingest a folder of .txt and .md files.
- Index them with a lightweight TF-IDF scoring algorithm.
- Search the index given a free-text query and return the most relevant chunks.
- Generate a Claude-powered answer that references the exact sources it used.
- Output structured JSON results with title, score, and snippet for each source.
Architecture Overview
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Document   │────▶│   Indexer    │────▶│    TF-IDF    │
│    Loader    │     │  (chunker)   │     │    Index     │
└──────────────┘     └──────────────┘     └──────┬───────┘
                                                 │
┌──────────────┐                          ┌──────┴───────┐
│  Claude API  │◀─────────────────────────│    Query     │
│  (answerer)  │                          │  Processor   │
└──────┬───────┘                          └──────────────┘
       │
       ▼
┌──────────────┐
│  Structured  │
│    Output    │
└──────────────┘
The flow is: load documents -> chunk them -> build index -> receive query -> retrieve top chunks -> send to Claude -> return structured answer.
Step 1 — Project Setup
Create a new directory and initialise the project:
mkdir ai-search-engine && cd ai-search-engine
npm init -y
npm install @anthropic-ai/sdk
npm install -D typescript @types/node tsx
npx tsc --init
Create the folder structure:
ai-search-engine/
├── docs/ # put your text/markdown files here
├── src/
│ ├── loader.ts # reads files from disk
│ ├── indexer.ts # TF-IDF indexing logic
│ ├── search.ts # query processing & ranking
│ ├── answerer.ts # Claude-powered answer generation
│ ├── types.ts # shared type definitions
│ └── cli.ts # main entry point
├── package.json
└── tsconfig.json
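One note before writing any code: the Anthropic SDK used in Step 6 reads your API key from the ANTHROPIC_API_KEY environment variable, so export it in your shell before running the CLI (the value below is a placeholder):
export ANTHROPIC_API_KEY=your-api-key-here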
Step 2 — Shared Types
Start with clear type definitions so every module speaks the same language.
// src/types.ts
export interface Document {
id: string;
title: string;
filePath: string;
content: string;
chunks: Chunk[];
}
export interface Chunk {
id: string;
documentId: string;
documentTitle: string;
text: string;
index: number;
}
export interface TFIDFIndex {
documents: Document[];
idf: Record<string, number>;
tfidf: Record<string, Record<string, number>>; // chunkId -> term -> score
}
export interface SearchResult {
chunkId: string;
documentTitle: string;
snippet: string;
score: number;
}
export interface AnswerResponse {
answer: string;
sources: {
title: string;
snippet: string;
relevance: number;
}[];
query: string;
}
Step 3 — Document Loader
The loader reads every .txt and .md file from a directory, splits each file into chunks of roughly 500 characters (respecting paragraph boundaries), and returns an array of Document objects.
// src/loader.ts
import fs from "node:fs";
import path from "node:path";
import type { Chunk, Document } from "./types";
const SUPPORTED_EXTENSIONS = [".txt", ".md"];
const CHUNK_SIZE = 500;
export function loadDocuments(dirPath: string): Document[] {
const files = fs
.readdirSync(dirPath)
.filter((f) => SUPPORTED_EXTENSIONS.includes(path.extname(f).toLowerCase()));
if (files.length === 0) {
throw new Error(`No supported files found in ${dirPath}`);
}
return files.map((file) => {
const filePath = path.join(dirPath, file);
const content = fs.readFileSync(filePath, "utf-8");
const id = path.basename(file, path.extname(file));
const title = formatTitle(id);
const chunks = chunkText(content, id, title);
return { id, title, filePath, content, chunks };
});
}
function formatTitle(filename: string): string {
return filename
.replace(/[-_]/g, " ")
.replace(/\b\w/g, (c) => c.toUpperCase());
}
function chunkText(text: string, docId: string, docTitle: string): Chunk[] {
const paragraphs = text.split(/\n\s*\n/);
const chunks: Chunk[] = [];
let buffer = "";
let index = 0;
for (const para of paragraphs) {
const trimmed = para.trim();
if (!trimmed) continue;
if (buffer.length + trimmed.length > CHUNK_SIZE && buffer.length > 0) {
chunks.push({
id: `${docId}-chunk-${index}`,
documentId: docId,
documentTitle: docTitle,
text: buffer.trim(),
index,
});
index++;
buffer = "";
}
buffer += trimmed + "\n\n";
}
if (buffer.trim().length > 0) {
chunks.push({
id: `${docId}-chunk-${index}`,
documentId: docId,
documentTitle: docTitle,
text: buffer.trim(),
index,
});
}
return chunks;
}
Step 4 — TF-IDF Indexer
TF-IDF (Term Frequency - Inverse Document Frequency) is a classic information retrieval technique. It scores each word in each chunk based on how frequently it appears in that chunk versus how rare it is across all chunks. Rare, meaningful words get higher scores than common ones.
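Concretely, the indexer below uses a smoothed IDF and a max-frequency-normalised TF:
IDF(term) = ln((totalChunks + 1) / (df(term) + 1)) + 1
TF-IDF(term, chunk) = (count(term, chunk) / maxCount(chunk)) * IDF(term)
where df(term) is the number of chunks containing the term and maxCount(chunk) is the count of the chunk's most frequent term. The +1 inside the ratio is standard smoothing, and the trailing +1 keeps every term's IDF at least 1, so even a term that appears in every chunk still contributes a small score.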
// src/indexer.ts
import type { Chunk, Document, TFIDFIndex } from "./types";
export function buildIndex(documents: Document[]): TFIDFIndex {
const allChunks: Chunk[] = documents.flatMap((doc) => doc.chunks);
const totalChunks = allChunks.length;
// Step A: compute document frequency for each term
const df: Record<string, number> = {};
for (const chunk of allChunks) {
const uniqueTerms = new Set(tokenize(chunk.text));
for (const term of uniqueTerms) {
df[term] = (df[term] || 0) + 1;
}
}
// Step B: compute IDF
const idf: Record<string, number> = {};
for (const [term, freq] of Object.entries(df)) {
idf[term] = Math.log((totalChunks + 1) / (freq + 1)) + 1;
}
// Step C: compute TF-IDF for each chunk
const tfidf: Record<string, Record<string, number>> = {};
for (const chunk of allChunks) {
const terms = tokenize(chunk.text);
const tf: Record<string, number> = {};
for (const term of terms) {
tf[term] = (tf[term] || 0) + 1;
}
// normalise TF by chunk length
const maxTf = Math.max(...Object.values(tf));
const scores: Record<string, number> = {};
for (const [term, count] of Object.entries(tf)) {
scores[term] = (count / maxTf) * (idf[term] || 0);
}
tfidf[chunk.id] = scores;
}
return { documents, idf, tfidf };
}
export function tokenize(text: string): string[] {
return text
.toLowerCase()
.replace(/[^a-z0-9\s]/g, " ")
.split(/\s+/)
.filter((t) => t.length > 2)
.filter((t) => !STOP_WORDS.has(t));
}
const STOP_WORDS = new Set([
"the", "and", "for", "are", "but", "not", "you", "all",
"can", "had", "her", "was", "one", "our", "out", "has",
"his", "how", "its", "may", "new", "now", "old", "see",
"way", "who", "did", "get", "let", "say", "she", "too",
"use", "this", "that", "with", "have", "from", "they",
"been", "will", "more", "when", "what", "your", "than",
"them", "then", "some", "into", "also", "just", "about",
"which", "would", "there", "their", "could", "other",
"very", "after", "these", "should", "where",
]);
Step 5 — Search / Query Processor
The search module takes a user query, tokenizes it the same way, scores every chunk by summing the TF-IDF values of matching terms, and returns the top results.
// src/search.ts
import type { Chunk, SearchResult, TFIDFIndex } from "./types";
import { tokenize } from "./indexer";
const TOP_K = 5;
export function search(query: string, index: TFIDFIndex): SearchResult[] {
const queryTerms = tokenize(query);
if (queryTerms.length === 0) {
return [];
}
const allChunks: Chunk[] = index.documents.flatMap((d) => d.chunks);
const scored: SearchResult[] = [];
for (const chunk of allChunks) {
const chunkScores = index.tfidf[chunk.id] || {};
let score = 0;
for (const term of queryTerms) {
score += chunkScores[term] || 0;
}
if (score > 0) {
scored.push({
chunkId: chunk.id,
documentTitle: chunk.documentTitle,
snippet: chunk.text.slice(0, 200) + (chunk.text.length > 200 ? "..." : ""),
score: Math.round(score * 1000) / 1000,
});
}
}
scored.sort((a, b) => b.score - a.score);
return scored.slice(0, TOP_K);
}
Step 6 — Claude-Powered Answer Generation
This is where the magic happens. We send the top search results to Claude along with the user query, and Claude synthesizes a clear answer with source citations.
// src/answerer.ts
import Anthropic from "@anthropic-ai/sdk";
import type { AnswerResponse, SearchResult } from "./types";
const client = new Anthropic();
export async function generateAnswer(
query: string,
results: SearchResult[]
): Promise<AnswerResponse> {
if (results.length === 0) {
return {
answer: "No relevant documents found for your query.",
sources: [],
query,
};
}
const contextBlock = results
.map(
(r, i) =>
`[Source ${i + 1}: ${r.documentTitle} (score: ${r.score})]\n${r.snippet}`
)
.join("\n\n");
const systemPrompt = `You are a precise research assistant. Answer the user's
question using ONLY the provided source documents. Follow these rules:
1. Base your answer strictly on the provided sources.
2. Cite sources using [Source N] notation inline.
3. If the sources do not contain enough information, say so clearly.
4. Keep the answer concise but thorough.
5. At the end, list each source you referenced with a one-line summary.
Respond in this exact JSON format:
{
"answer": "Your answer text with [Source N] citations...",
"sources": [
{ "title": "Document Title", "snippet": "key excerpt", "relevance": 0.95 }
]
}`;
const message = await client.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 1024,
messages: [
{
role: "user",
content: `## Sources\n\n${contextBlock}\n\n## Question\n\n${query}`,
},
],
system: systemPrompt,
});
const raw =
message.content[0].type === "text" ? message.content[0].text : "";
try {
const parsed = JSON.parse(raw);
return { ...parsed, query };
} catch {
return {
answer: raw,
sources: results.map((r) => ({
title: r.documentTitle,
snippet: r.snippet,
relevance: r.score,
})),
query,
};
}
}
Step 7 — CLI Entry Point
The CLI ties everything together. It accepts a docs directory and a query as arguments.
// src/cli.ts
import { loadDocuments } from "./loader";
import { buildIndex } from "./indexer";
import { search } from "./search";
import { generateAnswer } from "./answerer";
async function main() {
const args = process.argv.slice(2);
if (args.length < 2) {
console.log("Usage: npx tsx src/cli.ts <docs-folder> <query>");
console.log('Example: npx tsx src/cli.ts ./docs "What is TF-IDF?"');
process.exit(1);
}
const docsDir = args[0];
const query = args.slice(1).join(" ");
console.log("\n--- AI Search Engine ---\n");
console.log(`Loading documents from: ${docsDir}`);
// Step 1: Load
const documents = loadDocuments(docsDir);
console.log(`Loaded ${documents.length} document(s), ${
documents.reduce((sum, d) => sum + d.chunks.length, 0)
} chunk(s) total.\n`);
// Step 2: Index
console.log("Building TF-IDF index...");
const index = buildIndex(documents);
console.log("Index ready.\n");
// Step 3: Search
console.log(`Searching for: "${query}"\n`);
const results = search(query, index);
if (results.length === 0) {
console.log("No relevant results found.");
return;
}
console.log(`Found ${results.length} relevant chunk(s):\n`);
results.forEach((r, i) => {
console.log(` ${i + 1}. [${r.documentTitle}] score=${r.score}`);
console.log(` ${r.snippet.slice(0, 80)}...\n`);
});
// Step 4: Generate answer
console.log("Generating AI answer...\n");
const answer = await generateAnswer(query, results);
console.log("=== ANSWER ===\n");
console.log(answer.answer);
console.log("\n=== SOURCES ===\n");
answer.sources.forEach((s, i) => {
console.log(` [${i + 1}] ${s.title} (relevance: ${s.relevance})`);
console.log(` ${s.snippet.slice(0, 100)}\n`);
});
// Step 5: Structured JSON output
console.log("\n=== RAW JSON ===\n");
console.log(JSON.stringify(answer, null, 2));
}
main().catch((err) => {
console.error("Error:", err.message);
process.exit(1);
});
Step 8 — Try It Out
Create a few sample documents inside the docs/ folder:
docs/machine-learning.md
# Machine Learning Basics
Machine learning is a subset of artificial intelligence that enables
systems to learn and improve from experience without being explicitly
programmed. It focuses on developing algorithms that can access data
and use it to learn for themselves.
## Types of Machine Learning
- **Supervised Learning**: The algorithm learns from labeled training data.
- **Unsupervised Learning**: The algorithm finds patterns in unlabeled data.
- **Reinforcement Learning**: The algorithm learns by interacting with an
environment and receiving rewards or penalties.
docs/search-algorithms.txt
Search Algorithms Overview
TF-IDF (Term Frequency - Inverse Document Frequency) is a numerical
statistic that reflects how important a word is to a document in a
collection. It is commonly used as a weighting factor in information
retrieval and text mining.
The TF-IDF value increases proportionally to the number of times a
word appears in the document and is offset by the number of documents
in the collection that contain the word.
BM25 is an improvement over TF-IDF that adds document length
normalisation and term frequency saturation.
Now run the search:
npx tsx src/cli.ts ./docs "What is TF-IDF and how does it work?"
Example output:
--- AI Search Engine ---
Loading documents from: ./docs
Loaded 2 document(s), 4 chunk(s) total.
Building TF-IDF index...
Index ready.
Searching for: "What is TF-IDF and how does it work?"
Found 2 relevant chunk(s):
1. [Search Algorithms] score=4.231
TF-IDF (Term Frequency - Inverse Document Frequency) is a numerical...
2. [Machine Learning] score=1.102
Machine learning is a subset of artificial intelligence that enables...
Generating AI answer...
=== ANSWER ===
TF-IDF stands for Term Frequency - Inverse Document Frequency. It is
a numerical statistic that reflects how important a word is to a
document within a collection [Source 1]. The value increases with the
number of times a word appears in a document but is offset by how
many documents in the collection contain that word [Source 1].
=== SOURCES ===
[1] Search Algorithms (relevance: 0.95)
TF-IDF (Term Frequency - Inverse Document Frequency) is a numerical...
How the TF-IDF Scoring Works
Let us walk through the math with a concrete example.
Suppose you have 10 chunks and the word "tfidf" appears in only 2 of them.
IDF = ln((10 + 1) / (2 + 1)) + 1 = ln(3.67) + 1 = 1.30 + 1 = 2.30
Now in chunk A the word appears 3 times and the most frequent word appears 5 times:
TF = 3 / 5 = 0.6
TF-IDF = 0.6 * 2.30 = 1.38
In chunk B the word appears once and the most frequent word appears 8 times:
TF = 1 / 8 = 0.125
TF-IDF = 0.125 * 2.30 = 0.29
So chunk A would rank higher for a query containing "tfidf" — exactly what we want.
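You can sanity-check these numbers with a few lines that mirror the indexer's formula (the counts are just the ones from the example above):
// tfidf-check.ts — reproduces the worked example with the same formula as buildIndex
const totalChunks = 10;
const docFreq = 2; // "tfidf" appears in 2 of the 10 chunks
const idf = Math.log((totalChunks + 1) / (docFreq + 1)) + 1; // ≈ 2.30

const scoreA = (3 / 5) * idf; // chunk A: 3 occurrences, most frequent term appears 5 times → ≈ 1.38
const scoreB = (1 / 8) * idf; // chunk B: 1 occurrence, most frequent term appears 8 times → ≈ 0.29

console.log(idf.toFixed(2), scoreA.toFixed(2), scoreB.toFixed(2)); // 2.30 1.38 0.29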
Key Design Decisions
| Decision | Reason |
|---|---|
| TF-IDF over vector embeddings | Keeps retrieval simple and understandable; no embedding model or vector database needed |
| Chunk size of 500 chars | Balances granularity with context; too small loses meaning, too large loses precision |
| Top-5 results | Provides enough context for Claude without exceeding token limits |
| JSON structured output | Makes the tool composable with other systems |
| Paragraph-aware chunking | Avoids splitting sentences in the middle |
Extending the Project
Once you have the basics working, consider these enhancements:
- Add BM25 scoring — a more sophisticated ranking algorithm that handles document length (a minimal sketch follows this list).
- Recursive directory loading — traverse subdirectories for larger document sets.
- PDF support — use a library like pdf-parse to extract text from PDFs.
- Streaming answers — use Claude's streaming API to show the answer as it generates.
- Web interface — add an Express server and a simple HTML frontend.
- Caching — store the index to disk so you only rebuild when documents change.
- Hybrid search — combine keyword matching with semantic similarity using embeddings.
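For the BM25 enhancement mentioned above, here is a minimal sketch to start from. It reuses tokenize and the smoothed IDF already stored on the index, recomputes raw term frequencies per chunk on the fly, and uses the conventional defaults k1 = 1.2 and b = 0.75. The file name and the bm25Search helper are illustrative choices, not part of the project above.
// src/bm25.ts — hypothetical enhancement, not part of the core project
import type { SearchResult, TFIDFIndex } from "./types";
import { tokenize } from "./indexer";

const K1 = 1.2; // term-frequency saturation
const B = 0.75; // strength of document-length normalisation

export function bm25Search(query: string, index: TFIDFIndex, topK = 5): SearchResult[] {
  const chunks = index.documents.flatMap((d) => d.chunks);
  const avgLen =
    chunks.reduce((sum, c) => sum + tokenize(c.text).length, 0) / chunks.length;
  const queryTerms = tokenize(query);

  const scored = chunks.map((chunk) => {
    const tokens = tokenize(chunk.text);
    const tf: Record<string, number> = {};
    for (const t of tokens) tf[t] = (tf[t] || 0) + 1;

    let score = 0;
    for (const term of queryTerms) {
      const f = tf[term] || 0;
      if (f === 0) continue;
      // Reuse the smoothed IDF already computed by buildIndex.
      const idf = index.idf[term] || 0;
      // BM25 term weight: saturates as f grows and penalises long chunks.
      score +=
        idf * ((f * (K1 + 1)) / (f + K1 * (1 - B + B * (tokens.length / avgLen))));
    }
    return {
      chunkId: chunk.id,
      documentTitle: chunk.documentTitle,
      snippet: chunk.text.slice(0, 200) + (chunk.text.length > 200 ? "..." : ""),
      score: Math.round(score * 1000) / 1000,
    };
  });

  return scored
    .filter((r) => r.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}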
Summary
In this project you built a complete document search engine that:
- Loads and chunks text and Markdown files.
- Builds a TF-IDF index for fast keyword-based retrieval.
- Ranks document chunks by relevance to a user query.
- Uses Claude to generate a natural-language answer with inline source citations.
- Returns structured JSON output ready for downstream consumption.
The entire system runs locally from the command line with no external database, demonstrating how traditional information retrieval techniques pair powerfully with large language models.