RAG Architecture: Beyond the Basics
Advanced techniques for building production-ready RAG applications.
Retrieval-Augmented Generation (RAG) has become the standard pattern for building LLM applications that need to answer questions over private data. The basic recipe is simple: chunk your documents, embed them in a vector database, retrieve relevant chunks at query time, and pass them to an LLM as context. But the gap between a demo and a production system is vast.
After building RAG systems for legal document analysis, medical literature search, customer support automation, and internal knowledge bases, we've identified the patterns that separate good RAG from great RAG. This article covers the advanced techniques we use to improve retrieval quality, reduce hallucinations, and handle the edge cases that break naive implementations.
The Problem with Naive RAG
The basic RAG pipeline — embed query, find top-k similar chunks, stuff them into a prompt — works surprisingly well for demos. But it fails in predictable ways in production:
- It retrieves semantically similar but factually irrelevant passages.
- It loses context when important information spans multiple chunks.
- It can't handle queries that require reasoning across multiple documents.
- It has no way to express uncertainty when the answer isn't in the knowledge base.
In our benchmarks across 5 production RAG systems, naive top-k vector retrieval had a precision of only 47% — meaning more than half the retrieved chunks were irrelevant to the query. The techniques below brought precision to 82%.
Technique 1: Hybrid Search (Dense + Sparse)
Vector embeddings capture semantic meaning but often miss exact keyword matches. If a user asks about "HIPAA Section 164.312(a)(1)" and your knowledge base has that exact reference, a purely semantic search might return a dozen vaguely related passages while missing the exact one. Hybrid search combines vector similarity (dense retrieval) with BM25 keyword matching (sparse retrieval) and fuses the results.
```typescript
import { embedQuery } from './embeddings';
import { vectorStore } from './pinecone';
import { bm25Search } from './elasticsearch';
import { getDocument } from './documents';

interface SearchResult {
  id: string;
  content: string;
  score: number;
  metadata: Record<string, unknown>;
}

export async function hybridSearch(
  query: string,
  topK: number = 10,
  alpha: number = 0.7 // Weight: 0 = pure BM25, 1 = pure vector
): Promise<SearchResult[]> {
  // Run both searches in parallel; over-fetch so fusion has candidates to work with
  const [vectorResults, bm25Results] = await Promise.all([
    vectorStore.query({
      vector: await embedQuery(query),
      topK: topK * 2,
      includeMetadata: true,
    }),
    bm25Search(query, topK * 2),
  ]);

  // Weighted Reciprocal Rank Fusion (RRF): each list contributes
  // weight / (k + rank); documents found by both lists accumulate both scores
  const scores = new Map<string, number>();
  const k = 60; // RRF constant

  vectorResults.forEach((r, i) => {
    scores.set(r.id, (scores.get(r.id) ?? 0) + alpha / (k + i + 1));
  });
  bm25Results.forEach((r, i) => {
    scores.set(r.id, (scores.get(r.id) ?? 0) + (1 - alpha) / (k + i + 1));
  });

  // Sort by fused score, hydrate content and metadata, and return the top-k
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([id, score]) => ({ ...getDocument(id), id, score }));
}
```

Technique 2: Contextual Chunking
Most tutorials chunk documents by fixed token count (e.g., 512 tokens with 50-token overlap). This is fast and simple but creates terrible chunk boundaries — splitting sentences mid-thought, separating tables from their headers, and losing the relationship between a section heading and its content.
We use hierarchical chunking: first split by document structure (headings, sections, paragraphs), then apply semantic similarity to merge small consecutive chunks that discuss the same topic. Each chunk also carries contextual metadata — the document title, section hierarchy (h1 > h2 > h3), and a generated summary of the chunk's role in the larger document.
```typescript
interface Chunk {
  id: string;
  content: string;
  metadata: {
    documentTitle: string;
    sectionPath: string[]; // ["Chapter 3", "3.2 Security", "3.2.1 Access Control"]
    chunkSummary: string; // AI-generated: "Describes role-based access control requirements"
    pageNumbers: number[];
    previousChunkId?: string;
    nextChunkId?: string; // Enables chunk chain traversal
  };
}

// splitByStructure, mergeBySemanticSimilarity, and generateSummary are
// project-specific helpers, elided here for brevity.
export function contextualChunk(document: ParsedDocument): Chunk[] {
  // 1. Split by document structure (headings, paragraphs, lists, tables)
  const structuralChunks = splitByStructure(document);

  // 2. Merge small adjacent chunks with high semantic similarity
  const mergedChunks = mergeBySemanticSimilarity(structuralChunks, {
    maxTokens: 512,
    similarityThreshold: 0.85,
  });

  // 3. Add contextual metadata
  return mergedChunks.map((chunk, i) => ({
    ...chunk,
    metadata: {
      ...chunk.metadata,
      chunkSummary: generateSummary(chunk.content),
      previousChunkId: mergedChunks[i - 1]?.id,
      nextChunkId: mergedChunks[i + 1]?.id,
    },
  }));
}
```

Technique 3: Query Transformation
User queries are often vague, ambiguous, or poorly structured for retrieval. "What's our refund policy?" might need to match a document titled "Section 8.3: Returns, Exchanges, and Refund Procedures." Query transformation uses an LLM to rewrite the user's query into multiple retrieval-optimized variants.
- HyDE (Hypothetical Document Embedding): Generate a hypothetical answer, then use it as the search query. The embedding of a well-formed answer is often closer to the actual answer document than the question embedding.
- Multi-Query: Generate 3-5 semantically diverse rewrites of the original query to increase recall. Union the results.
- Step-Back Prompting: For specific questions, generate a more general query first. "What was our Q3 2025 revenue?" → "Quarterly financial performance reports 2025"
- Query Decomposition: Break complex questions into sub-questions, retrieve for each, then synthesize.
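To make the multi-query idea concrete, here is a minimal sketch. The `Llm` and `Retriever` callbacks are hypothetical stand-ins for your own model call and search function (they are not part of the system described above):

```typescript
type Retriever = (query: string) => Promise<{ id: string; content: string }[]>;
type Llm = (prompt: string) => Promise<string>;

// Generate diverse rewrites of the query, retrieve for each variant in
// parallel, and union the results (first occurrence of an id wins).
export async function multiQueryRetrieve(
  query: string,
  llm: Llm,
  retrieve: Retriever,
  numVariants = 4
): Promise<{ id: string; content: string }[]> {
  const prompt =
    `Rewrite the following search query ${numVariants} different ways, ` +
    `one per line, varying vocabulary and specificity.\n\nQuery: ${query}`;
  const rewrites = (await llm(prompt))
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean)
    .slice(0, numVariants);

  // Always include the original query alongside the rewrites
  const variants = [query, ...rewrites];
  const resultLists = await Promise.all(variants.map((v) => retrieve(v)));

  const seen = new Map<string, { id: string; content: string }>();
  for (const results of resultLists) {
    for (const r of results) {
      if (!seen.has(r.id)) seen.set(r.id, r);
    }
  }
  return [...seen.values()];
}
```

The same skeleton works for HyDE: swap the rewrite prompt for "write a hypothetical passage that answers this question" and retrieve with the generated passage.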
Technique 4: Re-Ranking
Initial retrieval is optimized for recall (finding all potentially relevant documents), not precision (finding only the relevant ones). A re-ranking step uses a cross-encoder model to score each retrieved chunk against the original query with full attention, producing much more accurate relevance scores than the initial bi-encoder retrieval.
We use Cohere's rerank API or a locally hosted cross-encoder model (ms-marco-MiniLM-L-12-v2), depending on latency requirements. Re-ranking typically improves precision by 20-30% and dramatically reduces the noise in the LLM's context window.
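The re-ranking step itself is model-agnostic. In this sketch, `scorePair` is a hypothetical wrapper around whichever cross-encoder you call (a hosted rerank API or a local model); everything else is plain orchestration:

```typescript
interface Candidate {
  id: string;
  content: string;
  score: number;
}

// Re-score every retrieved candidate against the query with a cross-encoder,
// then keep the topN by the new score. Cross-encoder scores replace the
// original retrieval scores entirely.
export async function rerank(
  query: string,
  candidates: Candidate[],
  scorePair: (query: string, passage: string) => Promise<number>,
  topN = 5
): Promise<Candidate[]> {
  const rescored = await Promise.all(
    candidates.map(async (c) => ({
      ...c,
      score: await scorePair(query, c.content),
    }))
  );
  return rescored.sort((a, b) => b.score - a.score).slice(0, topN);
}
```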
Technique 5: Faithfulness Guards
Even with perfect retrieval, LLMs can hallucinate — generating confident answers that aren't supported by the retrieved context. We use a citation-based approach: the LLM must cite specific passages from the retrieved documents for every factual claim. A post-processing step verifies that each citation actually supports the claim it's attached to.
Never deploy a RAG system without faithfulness guardrails. In our testing, even GPT-4 hallucinated in 12% of responses when given relevant context. With citation verification, we reduced this to under 2%.
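As an illustrative sketch (not our production verifier), a lexical-overlap check catches the most blatant unsupported claims: every sentence must carry a `[n]` citation, and the cited passage must contain most of the sentence's content words. The 0.5 threshold and 4-letter word filter are arbitrary illustrations; a real verifier would use an NLI or LLM-based entailment check on top of this:

```typescript
// Flag answer sentences whose citation is missing, dangling, or whose cited
// passage shares too few content words with the claim.
export function verifyCitations(
  answer: string,
  passages: string[],
  minOverlap = 0.5
): { sentence: string; supported: boolean }[] {
  const sentences = answer.split(/(?<=[.!?])\s+/).filter(Boolean);
  return sentences.map((sentence) => {
    const match = sentence.match(/\[(\d+)\]/);
    if (!match) return { sentence, supported: false }; // no citation at all
    const passage = passages[Number(match[1]) - 1];
    if (!passage) return { sentence, supported: false }; // dangling citation
    const words =
      sentence.toLowerCase().replace(/\[\d+\]/g, '').match(/[a-z]{4,}/g) ?? [];
    const hits = words.filter((w) => passage.toLowerCase().includes(w)).length;
    return {
      sentence,
      supported: words.length > 0 && hits / words.length >= minOverlap,
    };
  });
}
```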
Putting It All Together
A production RAG pipeline combines all these techniques into a coherent system. The query comes in, gets transformed into multiple variants, each variant runs through hybrid search, results are de-duplicated and re-ranked, the top chunks are passed to the LLM with citation instructions, and the response goes through faithfulness verification before being returned to the user.
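The flow above can be sketched as a thin orchestrator. Every stage is injected here, since each one maps to a technique from earlier sections; the interface names and the fallback message are illustrative, not our production code:

```typescript
interface Retrieved {
  id: string;
  content: string;
  score: number;
}

// Each stage is injected so it can be any of the implementations discussed
// above (or a stub in tests).
interface PipelineDeps {
  transformQuery: (q: string) => Promise<string[]>;
  search: (q: string) => Promise<Retrieved[]>;
  rerank: (q: string, candidates: Retrieved[]) => Promise<Retrieved[]>;
  generate: (q: string, context: string[]) => Promise<string>;
  verify: (answer: string, context: string[]) => boolean;
}

export async function answerQuery(query: string, deps: PipelineDeps): Promise<string> {
  // 1. Transform the query into retrieval-optimized variants
  const variants = await deps.transformQuery(query);

  // 2. Hybrid search per variant, de-duplicated by chunk id
  const seen = new Map<string, Retrieved>();
  for (const results of await Promise.all(variants.map(deps.search))) {
    for (const r of results) if (!seen.has(r.id)) seen.set(r.id, r);
  }

  // 3. Re-rank the merged candidates with full query-passage attention
  const top = await deps.rerank(query, [...seen.values()]);
  const context = top.map((c) => c.content);

  // 4. Generate with citation instructions, then 5. verify faithfulness
  const draft = await deps.generate(query, context);
  if (!deps.verify(draft, context)) {
    return "I couldn't find a well-supported answer in the knowledge base.";
  }
  return draft;
}
```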
It sounds complex, and it is — but each layer addresses a specific failure mode that you'll encounter in production. Start with the basics, measure quality with a robust evaluation framework, and add complexity only where your metrics show it's needed. The best RAG system is the simplest one that meets your quality bar.
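"Measure quality" needs at least one concrete metric. A simple starting point is mean retrieval precision@k against hand-labeled relevance judgments (a real evaluation harness tracks more than this one number, and the function below is a sketch, not our harness):

```typescript
// Mean precision@k: for each query, the fraction of the top-k retrieved
// chunk ids that a human labeled relevant; averaged over the query set.
export function meanPrecisionAtK(
  retrievedIds: string[][], // ranked retrieved chunk ids, one array per query
  relevantIds: Set<string>[], // labeled relevant ids, one set per query
  k: number
): number {
  const perQuery = retrievedIds.map((ids, i) => {
    const topK = ids.slice(0, k);
    if (topK.length === 0) return 0;
    const hits = topK.filter((id) => relevantIds[i].has(id)).length;
    return hits / topK.length;
  });
  return perQuery.reduce((sum, p) => sum + p, 0) / perQuery.length;
}
```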
“RAG is not a solved problem — it's an engineering discipline. The difference between a 60% and a 95% accuracy RAG system is not one silver bullet, but ten small improvements stacked on top of each other.”
— Amar Singh, Vaarak Engineering
Amar Singh
Founder & Lead Engineer