Retrieval Augmented Generation (RAG)
Grounding model answers with targeted retrieval and evidence.
First published Dec 2025
Why RAG?
The intuition behind Retrieval Augmented Generation (RAG) is to pair a language model with a search layer. Instead of relying only on parametric memory (the world knowledge a model absorbs during training, much of it from the internet), the model pulls fresh or proprietary sources at query time and answers using that evidence. This is sometimes called grounding, and the retrieved sources can be cited as verifiable references.
So good RAG systems should:
- improve factual accuracy and reduce hallucinations, and
- be able to answer questions about private, domain-specific, or rapidly changing data without retraining.
RAG Architectures
Naive RAG - The simplest pipeline: embed the query, retrieve the top-k chunks from a vector database, and pass them directly to the LLM for generation (a sketch follows this list).
Hybrid RAG - Combines dense (vector) and sparse (keyword) retrieval with query rewriting and reranking to improve recall and relevance before generation. This covers the weaknesses of semantic similarity over vector embeddings: sparse full-text search can retrieve chunks that mention specific entities (e.g. Tan Ah Kow) or technical names where semantic search might fail (a fusion sketch also follows the list).
Agentic RAG - Uses an agent to plan, reason, and iteratively call tools (search, code, APIs) until it determines the task is complete, i.e. the question is answered.
Multimodal RAG - Extends retrieval to multiple modalities (text, images, etc.), enabling multimodal queries and context-aware generation with multimodal models. The retrieved items could be figures, technical diagrams, or photographs in your source documents.
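To make the naive pipeline concrete, here is a minimal sketch assuming the `openai` Python client; the model names, the toy two-chunk corpus, and the in-memory index are illustrative stand-ins for a real vector database.

```python
# Minimal naive RAG sketch. Assumes the `openai` client and illustrative
# model names; the in-memory "index" stands in for a real vector database.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# 1. Index: embed document chunks once, up front.
chunks = ["RAG pairs an LLM with retrieval.", "pgvector adds vectors to Postgres."]
chunk_vecs = embed(chunks)

def retrieve(query: str, k: int = 2) -> list[str]:
    # 2. Retrieve: cosine similarity between the query and chunk embeddings.
    q = embed([query])[0]
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(-sims)[:k]]

def answer(query: str) -> str:
    # 3. Generate: pass the retrieved evidence to the model as context.
    context = "\n".join(retrieve(query))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using only this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return resp.choices[0].message.content
```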
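For the hybrid variant, a common way to merge the dense and sparse result lists is reciprocal rank fusion (RRF). A minimal sketch, assuming each retriever returns document IDs ranked best-first (the `d1`/`d3` IDs are made up):

```python
# Reciprocal rank fusion (RRF): merge ranked result lists from dense and
# sparse retrievers. Inputs are document IDs, best-first; k=60 is the
# conventional smoothing constant from the original RRF paper.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# A document that is decent in both lists can outrank one that tops a
# single list, which is the robustness hybrid retrieval is after.
dense = ["d3", "d1", "d7"]    # from vector search
sparse = ["d1", "d9", "d3"]   # from BM25 / full-text search
print(rrf([dense, sparse]))   # -> ['d1', 'd3', 'd9', 'd7']
```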
Choosing a Vector Database
My recommendation is to go with something already in your ecosystem: if you already run Elasticsearch or Postgres, both offer vector search support.
However, if you're considering a dedicated vector store, Weaviate leads in popularity and integration richness (built-in modules, schema support, scalable cloud options), though it can be heavyweight for simple use cases. I wouldn't recommend FAISS except for the simplest proofs of concept, as it is a library without a managed server or built-in metadata handling. pgvector benefits from native integration with PostgreSQL, making it easy for teams already invested in relational systems, though its search performance trails specialized engines at scale. Qdrant balances strong vector performance with a developer-friendly API and good metadata filtering.
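To illustrate the "use what's already in your ecosystem" advice, here is a minimal pgvector sketch via psycopg2; the DSN, table name, and 3-dimensional toy vectors are placeholders, and in practice the embeddings would come from whatever model you already use.

```python
# Minimal pgvector sketch (PostgreSQL with the pgvector extension installed).
# The connection string, table name, and 3-dim toy vectors are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # hypothetical DSN
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""CREATE TABLE IF NOT EXISTS chunks (
    id bigserial PRIMARY KEY, body text, embedding vector(3));""")

# Insert a chunk with its embedding (a vector literal is a bracketed list).
cur.execute(
    "INSERT INTO chunks (body, embedding) VALUES (%s, %s::vector);",
    ("RAG pairs an LLM with retrieval.", "[0.11, -0.42, 0.87]"),
)

# Nearest-neighbour search: <=> is pgvector's cosine-distance operator
# (<-> is L2). Ordering by it returns the most similar chunks first.
cur.execute(
    "SELECT body FROM chunks ORDER BY embedding <=> %s::vector LIMIT 5;",
    ("[0.10, -0.40, 0.90]",),
)
print([row[0] for row in cur.fetchall()])
conn.commit()
```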
Failure Modes
RAG can fail quietly when retrieval is weak. Common issues include poor chunking, semantic drift between query and documents, stale sources, and noisy prompts that mix evidence with instructions.
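Chunking is the most controllable of these. As a baseline, a fixed-size chunker with overlap looks like the sketch below; the sizes are arbitrary, and production systems often split on sentence or section boundaries instead.

```python
# Fixed-size chunking with overlap. Overlap keeps sentences that straddle
# a chunk boundary retrievable from at least one chunk; sizes are arbitrary.
def chunk(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "some long document " * 200  # stand-in for a real document
pieces = chunk(doc)
```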
Evaluation
Evaluate retrieval quality separately from generation to avoid masking problems.
Track recall at k for known queries, measure citation coverage, and add regression tests for critical prompts. A small, curated eval set is usually more useful than a large, generic one.
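A minimal harness for the "any relevant document in the top k" variant of recall at k might look like this; the labelled query/ID pairs and the `retrieve` function are assumptions standing in for your own.

```python
# Recall@k over a small curated eval set, in the "at least one relevant
# document retrieved" sense. `retrieve` is assumed to return ranked
# document IDs; the labelled pairs below are illustrative.
def recall_at_k(eval_set: list[tuple[str, set[str]]],
                retrieve, k: int = 5) -> float:
    hits = 0
    for query, relevant_ids in eval_set:
        retrieved = set(retrieve(query)[:k])
        if retrieved & relevant_ids:  # at least one relevant doc in top k
            hits += 1
    return hits / len(eval_set)

eval_set = [
    ("how do I reset my password", {"kb-012"}),
    ("refund policy for annual plans", {"kb-031", "kb-044"}),
]
# print(recall_at_k(eval_set, retrieve, k=5))
```

Running this on every change to the chunker, embedder, or index catches silent retrieval regressions before they show up as bad answers.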