Retrieval Augmented Generation (RAG)
Grounding model answers with targeted retrieval and evidence.
First published Dec 2025
Why RAG?
The intuition behind Retrieval Augmented Generation (RAG) is to pair a language model with a search layer. Instead of relying only on parametric memory (the world knowledge a model absorbs during training, much of it from the internet), the model pulls fresh or proprietary sources at query time and answers using that evidence. This is sometimes called grounding, and the retrieved sources can be cited as verifiable references.
So good RAG systems should:
- improve factual accuracy and reduce hallucinations, and
- be able to answer questions about private, domain-specific, or rapidly changing data without retraining.
RAG Architectures
Naive RAG - The simplest pipeline: embed the query, retrieve the top-k chunks from a vector database, and pass them directly to the LLM for generation (a sketch follows this list).
Hybrid RAG - Combines dense (vector) and sparse (keyword) retrieval with query rewriting and reranking to improve recall and relevance before generation. This covers the weaknesses of semantic similarity over vector embeddings: sparse full-text search can retrieve chunks that mention specific entities (e.g. Tan Ah Kow) or technical names where semantic search might fail (a fusion sketch also follows the list).
Agentic RAG - Uses an agent to plan, reason, and iteratively call tools (search, code, APIs) until it determines the task is complete, i.e. the question is answered.
Multimodal RAG - Extends retrieval to multiple modalities (text, images, etc.), enabling multimodal queries and context-aware generation with multimodal models. The retrieved items could be figures, technical diagrams, or photographs in your source documents.
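To make the naive pipeline concrete, here is a minimal sketch assuming the `openai` Python client; the model names, the toy two-chunk corpus, and the in-memory index are illustrative stand-ins for a real vector database.

```python
# Minimal naive RAG sketch. Assumes the `openai` client and illustrative
# model names; the in-memory "index" stands in for a real vector database.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1. Index: embed document chunks once, up front.
chunks = ["RAG pairs an LLM with retrieval.", "pgvector adds vectors to Postgres."]
chunk_vecs = embed(chunks)

def retrieve(query: str, k: int = 2) -> list[str]:
    # 2. Retrieve: cosine similarity between the query and chunk embeddings.
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]

def answer(query: str) -> str:
    # 3. Generate: pass the retrieved evidence to the model as context.
    context = "\n".join(retrieve(query))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content
```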
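For the hybrid variant, a common way to merge the dense and sparse result lists is reciprocal rank fusion (RRF). A minimal sketch, assuming each retriever returns document IDs ranked best-first (the `d1`/`d3` IDs are made up):

```python
# Reciprocal rank fusion (RRF): merge ranked result lists from dense and
# sparse retrievers. Inputs are document IDs, best-first; k=60 is the
# conventional smoothing constant from the original RRF paper.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document that is decent in both lists can outrank one that tops a
# single list, which is the robustness hybrid retrieval is after.
dense = ["d3", "d1", "d7"]    # from vector search
sparse = ["d1", "d9", "d3"]   # from BM25 / full-text search
print(rrf([dense, sparse]))   # -> ['d1', 'd3', 'd9', 'd7']
```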
Choosing a Vector Database
My recommendation is to go with something already in your ecosystem: if you already run Elasticsearch or Postgres, both offer vector search support.
However, if you're considering a dedicated vector store, Weaviate leads in popularity and integration richness (built-in modules, schema support, scalable cloud options), though it can be heavyweight for simple use cases. I wouldn't recommend FAISS except for the simplest proofs of concept, as it is a library without a managed server or built-in metadata handling. pgvector benefits from native integration with PostgreSQL, making it easy for teams already invested in relational systems, though its search performance trails specialized engines at scale. Qdrant balances strong vector performance with a developer-friendly API and good metadata filtering.
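To illustrate the "use what's already in your ecosystem" advice, here is a minimal pgvector sketch via psycopg2; the DSN, table name, and 3-dimensional toy vectors are placeholders, and in practice the embeddings would come from whatever model you already use.

```python
# Minimal pgvector sketch (PostgreSQL with the pgvector extension installed).
# The connection string, table name, and 3-dim toy vectors are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY, body text, embedding vector(3));""")

# Insert a chunk with its embedding (a vector literal is a bracketed list).
cur.execute(
    "INSERT INTO chunks (body, embedding) VALUES (%s, %s::vector);",
    ("RAG pairs an LLM with retrieval.", "[0.11, -0.42, 0.87]"),
)

# Nearest-neighbour search: <=> is pgvector's cosine-distance operator
# (<-> is L2). Ordering by it returns the most similar chunks first.
cur.execute(
    "SELECT body FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5;",
    ("[0.10, -0.40, 0.90]",),
)
print([row[0] for row in cur.fetchall()])
conn.commit()
```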
Failure Modes
RAG can fail quietly when retrieval is weak. Common issues include poor chunking, semantic drift between query and documents, stale sources, and noisy prompts that mix evidence with instructions.
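Chunking is the most controllable of these. As a baseline, a fixed-size chunker with overlap looks like the sketch below; the sizes are arbitrary, and production systems often split on sentence or section boundaries instead.

```python
# Fixed-size chunking with overlap. Overlap keeps sentences that straddle
# a chunk boundary retrievable from at least one chunk; sizes are arbitrary.
def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "some long document " * 200  # stand-in for a real document
pieces = chunk(doc)
```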
Evaluation
Evaluate retrieval quality separately from generation to avoid masking problems.
Track recall at k for known queries, measure citation coverage, and add regression tests for critical prompts. A small, curated eval set is usually more useful than a large, generic one.
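A minimal harness for the "any relevant document in the top k" variant of recall at k might look like this; the labelled query/ID pairs and the `retrieve` function are assumptions standing in for your own.

```python
# Recall@k over a small curated eval set, in the "at least one relevant
# document retrieved" sense. `retrieve` is assumed to return ranked
# document IDs; the labelled pairs below are illustrative.
def recall_at_k(eval_set: list[tuple[str, set[str]]],
                retrieve, k: int = 5) -> float:
    hits = 0
    for query, relevant_ids in eval_set:
        retrieved = set(retrieve(query)[:k])
        if retrieved & relevant_ids:  # at least one relevant doc in top k
            hits += 1
    return hits / len(eval_set)

eval_set = [
    ("how do I reset my password", {"kb-012"}),
    ("refund policy for annual plans", {"kb-031", "kb-044"}),
]
# print(recall_at_k(eval_set, retrieve, k=5))
```

Running this on every change to the chunker, embedder, or index catches silent retrieval regressions before they show up as bad answers.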