RAG lets language models pull in outside information before answering, so they can work with fresh data, internal docs, or knowledge they were never trained on.
If you've ever asked a chatbot about something it clearly didn't know, you've already bumped into the problem RAG was built to solve. Language models are frozen at training time. They know what they were shown, and they don't know what came after. They also don't know about your company's internal wiki, your private codebase, or the API that shipped last Tuesday.
Retrieval-augmented generation is the workaround that became the standard.
The basic idea
Retrieval-augmented generation (RAG) is a technique that lets large language models pull in new information from external data sources before answering a question [Source 1]. The flow is straightforward: the model first consults a specified set of documents, then responds to the user's query, using those documents to supplement whatever it already learned during training [Source 1].
That's it. That's the whole concept. The model gets a cheat sheet before it has to talk.
Why this matters in practice: it lets an LLM work with domain-specific or updated information that simply wasn't in its training data [Source 1]. A chatbot built on a general-purpose model can suddenly answer questions about your internal company data, or cite authoritative sources instead of vibes [Source 1].
Why not just retrain the model?
You could, in theory, fine-tune a model every time your documentation changes. In practice, that's expensive, slow, and you'd still be chasing a moving target. Documentation gets edited. Policies change. New libraries ship. Pricing pages get updated.
RAG sidesteps all of that by separating two concerns:
What the model knows how to do (reason, write, summarize, follow instructions). That stays in the weights.
What the model knows about (specific facts, current state of the world, your private data). That lives in a retrievable knowledge store.
When the facts change, you update the store. The model itself doesn't need to be touched.
How a basic RAG pipeline actually works
A typical setup has three moving parts:
A knowledge base. Usually a collection of documents, chunked into manageable pieces and indexed so you can search them quickly. Often this is a vector database, where each chunk is stored alongside an embedding that captures its meaning.
A retriever. When the user asks something, the retriever pulls the most relevant chunks. It might use semantic similarity, keyword search, or a hybrid.
The generator. That's the LLM. It receives the user's question plus the retrieved chunks as context, and produces an answer grounded in that context.
The trick is that the model doesn't have to memorize anything. It just has to read what's been handed to it and respond well. Modern models are very good at that.
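To make that concrete, here's a minimal sketch of the retrieve-then-generate flow in Python. None of it comes from the cited sources: the embedding model, the 500-character chunks, and the prompt template are illustrative assumptions, and the final prompt would go to whichever LLM client you actually use.

```python
# Minimal retrieve-then-generate sketch (illustrative only).
# Assumes sentence-transformers is installed; chunk size, model name,
# and prompt wording are arbitrary choices, not recommendations.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Knowledge base: split documents into chunks and embed each chunk.
documents = ["...long internal doc...", "...another doc..."]
chunks = [doc[i:i + 500] for doc in documents for i in range(0, len(doc), 500)]
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)

# 2. Retriever: embed the question and take the top-k most similar chunks.
def retrieve(query: str, k: int = 3) -> list[str]:
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q              # cosine similarity (vectors are normalized)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

# 3. Generator: hand the question plus retrieved chunks to the LLM.
def build_prompt(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

# build_prompt(...) is what you'd send to your LLM of choice.
```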
Where RAG shines
The obvious case is question-answering over private or proprietary data. Customer support bots that read your help center. Legal assistants that cite your contract library. Internal tools that answer questions about your codebase using actual files instead of hallucinated APIs.
It's also useful any time you need answers tied to authoritative sources [Source 1]. If a model can point to the document it pulled an answer from, you can verify it. That's a meaningful improvement over a model that confidently makes things up.
The limits of the basic recipe
The simple version of RAG (retrieve once, generate once) works surprisingly well, but it has weak spots. Researchers have been picking at those weak spots for a couple of years now, and the basic recipe has branched into a family of more sophisticated variants.
One issue: a single retrieval at the start of generation assumes you know what you need before you start producing the answer. That's often not true. As the model writes, its needs shift. A code-generation task might start by needing API docs, then need an example, then need a specific error-handling pattern. A static, one-shot retrieval can't keep up.
This is the problem EVOR tackles in the code-generation setting. The authors point out that existing pipelines for retrieval-augmented code generation use static knowledge bases with a single source, which limits how well LLMs can adapt to domains they don't know well [Source 3]. Their pipeline, EVOR, evolves both the queries and the knowledge bases together as generation proceeds, and they test it on datasets built around frequently updated libraries and long-tail programming languages [Source 3]. The point is that retrieval shouldn't be a one-time event. It should track the work as it unfolds.
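You don't need EVOR's specifics to see the shape of the idea. Here's a generic sketch of an interleaved retrieve-and-generate loop, with the retriever, generator, and stopping check left as placeholder callables; it illustrates the pattern, not the paper's pipeline.

```python
from typing import Callable

# Generic sketch of interleaved retrieval and generation (not EVOR's actual pipeline).
# The callables are placeholders for your own retriever, LLM call, and stop check.
def iterative_rag(
    task: str,
    retrieve: Callable[[str], str],                 # query -> retrieved context
    generate_step: Callable[[str, str, str], str],  # (task, draft, context) -> new draft
    is_done: Callable[[str], bool],
    max_rounds: int = 5,
) -> str:
    draft = ""
    query = task                          # the first query is just the task itself
    for _ in range(max_rounds):
        context = retrieve(query)         # fetch what the current state of the work needs
        draft = generate_step(task, draft, context)
        if is_done(draft):
            break
        query = draft                     # later retrievals are driven by what's been produced
    return draft
```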
RAG isn't just for text
When people first hear about RAG, they think of chatbots reading PDFs. But the same idea works for other generation tasks.
AR-RAG (Autoregressive Retrieval Augmentation) applies the pattern to image generation [Source 2]. Instead of doing a single retrieval before drawing the image and conditioning the whole generation on fixed reference images, AR-RAG performs context-aware retrievals at each step, using already-generated patches as queries to fetch the most relevant patch-level visual references [Source 2]. The authors argue this lets the model respond to what the image actually needs as it develops, and avoids problems like over-copying and stylistic bias that show up when you commit to one reference up front [Source 2].
It's the same intuition as EVOR, just applied to pixels instead of code: don't lock in your retrieval before you know what the output looks like.
A mental model that actually helps
Here's how I think about it. A bare LLM is like a very well-read consultant who hasn't been in the office for a year. They know a lot. They can reason about almost anything. But they don't know what's happened recently, and they've never seen your specific situation.
RAG is the briefing you hand them before the meeting.
A good briefing is short, relevant, and well-organized. A bad briefing is a stack of unfiltered documents that buries the important stuff. Most of the engineering effort in a real RAG system goes into making sure the briefing is good: chunking documents sensibly, picking the right retrieval strategy, ranking results, deciding how much context to include.
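One of those briefing decisions, sketched below: packing ranked chunks into a fixed context budget. The budget and the rough four-characters-per-token estimate are illustrative assumptions, not tuned values.

```python
# Pack the best-ranked chunks into a fixed context budget (illustrative sketch).
def pack_context(ranked_chunks: list[str], budget_tokens: int = 2000) -> str:
    picked, used = [], 0
    for chunk in ranked_chunks:           # assumed already sorted best-first
        cost = len(chunk) // 4            # crude token estimate: ~4 characters per token
        if used + cost > budget_tokens:
            break
        picked.append(chunk)
        used += cost
    return "\n\n".join(picked)
```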
The LLM is rarely the bottleneck. The retrieval is.
What to take away
RAG is the dominant pattern for getting language models to work with information they weren't trained on, whether that's last week's news, your company's internal docs, or a programming library that didn't exist when the model was built [Source 1]. The basic version is simple: retrieve relevant documents, hand them to the model, generate an answer.
The interesting work is happening in the variants. Retrieval that evolves with the generation [Source 3]. Retrieval that operates step-by-step inside an image [Source 2]. Retrieval that pulls from multiple knowledge sources at once [Source 3]. The core idea (let the model look things up instead of memorizing everything) keeps generalizing to new domains.
If you're building anything that needs an LLM to work with specific or current information, you'll almost certainly end up with some form of RAG in your stack. It's worth understanding the basic shape before you start tuning the details.