RAG: Teaching AI Models About Recent Events
AI models have a fundamental limitation: they only know what they were trained on. Ask about last quarter’s earnings or your company’s internal processes, and you’ll get nothing useful. The model doesn’t know, and no amount of prompting will fix that.
The solution is Retrieval Augmented Generation, or RAG. The concept is simple: find relevant information and include it in your prompt. The model then generates answers based on that context. But getting from concept to working system requires understanding several moving parts.
The Document Problem
Start with your source documents. If you’re working with PDFs, stop and find the original text files. I know this seems like extra work, but PDF was designed for printing, not text extraction. The format describes where ink goes on a page, not the reading order of the text, so extraction can easily come out jumbled.
Tools like PyPDF or PyMuPDF can work on some documents and completely fail on others. PyMuPDF’s documentation even says it’s the PDF author’s responsibility to make extraction easier - which tells you everything about how reliable this process is. When I’ve needed to extract from PDFs, I’ve spent more time debugging the extraction than it would have taken to email someone for the source file.
If you’re stuck with PDF, budget time for manual verification. Jumbled text in means garbage answers out.
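If you do end up extracting from PDF, here’s roughly what that looks like - a minimal sketch using PyMuPDF, where the file name is a placeholder and the output should be spot-checked by hand:

```python
# Minimal PDF text extraction with PyMuPDF (imported as "fitz").
import fitz  # pip install pymupdf

doc = fitz.open("quarterly_report.pdf")       # hypothetical input file
pages = [page.get_text() for page in doc]     # one string per page, in order
doc.close()

print(pages[0][:500])  # eyeball the first page for jumbled or missing text
```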
Chunking: Breaking Documents Apart
You can’t just dump an entire document into your prompt. Even if it fits within the context window, you’re wasting tokens and confusing the model. Most documents contain information that’s irrelevant to any given question.
The solution is chunking - splitting documents into smaller pieces. There’s no universal best approach here. Some split by character count, others by tokens or paragraphs. Some add overlap between chunks so context isn’t lost at boundaries.
I’ve seen systems that split by section headers work well for technical documentation. For financial reports, splitting by paragraph often makes sense. The right approach depends on your documents and what questions you’re trying to answer.
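Here’s a bare-bones character-based splitter with overlap - a sketch, not a recommendation; the chunk size and overlap are arbitrary starting values you’d tune for your documents:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size character chunks that overlap at the boundaries."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

chunks = chunk_text(document_text)  # document_text: stand-in for your clean source text
```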
Embeddings: Making Text Searchable
Once you have chunks, you need a way to find relevant ones based on the user’s question. This is where embeddings come in.
An embedding is a numerical representation of text - a vector that captures semantic meaning. Run your chunk through an embedding model, and you get back an array of numbers. The length of that array depends on the model. All text embedded by the same model produces vectors of the same length.
The magic is mathematical. Vectors that are close together in this high-dimensional space represent text that’s semantically similar. When a user asks a question, you embed that question and search for the nearest chunks.
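Continuing the sketch from the chunking section, here’s what embedding and nearest-neighbor lookup look like with the sentence-transformers library; the model name is one common choice, not a recommendation - any embedding model works, as long as the question and the chunks go through the same one:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")    # this model produces 384-dim vectors

chunk_vecs = model.encode(chunks)                  # shape: (num_chunks, 384)
query_vec = model.encode(["What was revenue last quarter?"])[0]

# Cosine similarity between the question and every chunk.
scores = chunk_vecs @ query_vec / (
    np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec)
)
top_ids = np.argsort(scores)[::-1][:5]             # indices of the 5 nearest chunks
```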
Embeddings from different models aren’t interchangeable: the vectors differ in length, and even same-length vectors live in different spaces. If you switch models later, you need to re-embed everything. This is a real operational consideration - I’ve seen teams put off improving their embedding model because they didn’t want to re-process millions of chunks.
Vector Databases: Storing and Searching
You could store chunks in JSON files or a traditional database and scan them all at query time. This works for small systems. Once the corpus grows and query latency starts to matter, that brute-force scan gets slow, and a vector database earns its place.
Vector databases are optimized for the specific operation you’re doing: finding the vectors nearest to a query vector. Options include Chroma, Pinecone, Milvus, and vector extensions for databases like Postgres. At last count, there were over 50 options.
Each chunk stored in the database carries two things: the embedding (for searching) and the original text (for including in prompts). Embeddings are lossy, so you can’t reliably reconstruct the text from them - store it explicitly alongside the vector.
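Here’s the same store-and-retrieve step with Chroma, reusing the chunks and vectors from the embedding sketch; other vector databases expose a similar add/query interface:

```python
import chromadb

client = chromadb.Client()                          # in-memory; fine for a demo
collection = client.create_collection(name="docs")

collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,                               # original text, for prompts
    embeddings=[vec.tolist() for vec in chunk_vecs],  # vectors, for search
)

results = collection.query(query_embeddings=[query_vec.tolist()], n_results=5)
top_chunks = results["documents"][0]                # raw text of the 5 nearest chunks
```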
Building the Prompt
When a user asks a question, here’s what happens:
- Embed the question using the same model that embedded your chunks
- Query the vector database for the most similar chunks
- Take the top of the ranked results the database returns - usually 5-10 chunks
- Extract the raw text from these chunks
- Build a prompt: “Use the following information to answer the question: [chunks] [user’s question]”
- Send this augmented prompt to the model
The model just sees text. It doesn’t know this information came from a RAG system. From its perspective, you simply included relevant context in the prompt.
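Pulled together, the whole query path is short. In this sketch, which reuses the model and collection from the earlier examples, llm_complete() is a hypothetical stand-in for whichever model client you actually call, and the prompt is the simple template from the list above:

```python
def answer(question: str, top_k: int = 5) -> str:
    # Embed the question with the same model that embedded the chunks.
    q_vec = model.encode([question])[0].tolist()

    # Retrieve the nearest chunks and pull out their raw text.
    results = collection.query(query_embeddings=[q_vec], n_results=top_k)
    context = "\n\n".join(results["documents"][0])

    # Build the augmented prompt and send it to the model.
    prompt = (
        "Use the following information to answer the question:\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )
    return llm_complete(prompt)  # hypothetical stand-in for your model client
```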
Why This Works
RAG doesn’t make models smarter. It makes them better informed. The model still hallucinates, still has the same capabilities - but now it’s working with information it actually has access to.
Think of it like giving someone a reference manual before asking them a question. They might still misunderstand or make mistakes, but at least they have the relevant information in front of them.
I’ve built RAG systems for internal company documentation and for analyzing financial reports. The difference in answer quality is dramatic when the model has access to the right context. But the system is only as good as your chunking strategy and embedding quality.
Historical Note: The concept of augmenting language models with retrieved information isn’t new. Search engines have been doing information retrieval for decades. RAG applies these proven techniques to the specific challenge of giving language models access to information outside their training data. It’s a bridge between classical IR and modern LLMs.
The Components Matter
A RAG system has several points of failure:
- Text extraction: Garbage in, garbage out
- Chunking: Too large and you waste context, too small and you lose meaning
- Embeddings: Different models capture different aspects of meaning
- Search: How many chunks do you retrieve? What similarity threshold?
- Prompt construction: How do you present the chunks to the model?
Each of these requires tuning for your specific use case. There’s no universal configuration that works everywhere.
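One practical habit is to keep all of those knobs in a single config object so they’re easy to tune per project - a sketch with illustrative defaults, not recommended values:

```python
from dataclasses import dataclass

@dataclass
class RagConfig:
    chunk_size: int = 1000          # characters per chunk
    chunk_overlap: int = 200        # characters shared by adjacent chunks
    embedding_model: str = "all-MiniLM-L6-v2"
    top_k: int = 5                  # chunks retrieved per question
    min_similarity: float = 0.3     # discard chunks scoring below this
```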
Questions to Consider
- What information do your models actually need to know about?
- Can you get clean source text, or are you stuck with PDF extraction?
- How will you measure if your RAG system is actually improving answers?
- What happens when your source documents change - how do you keep the vector store updated?
RAG isn’t magic, but it’s a practical solution to a real limitation of AI models. Get the fundamentals right - clean text, sensible chunking, good embeddings - and you can extend what models know without retraining them.