Chris Garlick 11 min read

What is RAG? Retrieval-Augmented Generation Explained (UK Edition)

What is RAG? A plain-English UK guide to retrieval-augmented generation. How it works, why it cuts hallucinations, what it costs, and when to use it.

If you've asked ChatGPT a question about your own company's documents and watched it confidently invent the answer, you've already met the problem RAG solves.

Large language models don't know your contracts, your tax workpapers, your client onboarding policy, or last quarter's project notes. They know what was on the open internet when they finished training. Ask them about the rest, and they guess. Some of those guesses are very convincing. Enterprise losses from AI hallucinations reached an estimated $67.4 billion in 2024.

RAG (Retrieval-Augmented Generation) is the workaround. It's how you make a generic model answer questions about your specific business without retraining it from scratch. This is the plain-English version, written for UK business owners and operations leads who keep hearing "we should be doing RAG" without anyone explaining what that actually means.

The short answer

RAG is a three-step pattern: when someone asks a question, you retrieve the most relevant chunks of your own documents, augment the question with those chunks as context, then ask the LLM to generate an answer using that context.

The result is a system that can answer questions about your specific business, in real time, without the LLM needing to memorise anything. The model's job becomes "summarise these specific paragraphs into a useful answer," not "recall facts from training data." That single change is what cuts hallucinations.

RAG reduces hallucination rates by 30 to 70% across domains in general, and the gains are larger on domain-specific queries, where the median improvement across 847 production deployments was 71%. For summarisation tasks specifically, grounded retrieval drops hallucinations below 2%.

How RAG actually works, step by step

The mechanics aren't complicated. Strip away the jargon and you have four moving parts.

1. Ingestion (one-off, then repeated). You take every document the system should know about (PDFs, Word files, Notion pages, contracts, emails, knowledge base articles) and break each one into smaller chunks. Maybe a paragraph at a time. Maybe a few hundred words.

2. Embedding. Each chunk gets turned into a list of numbers (a vector) that represents what the chunk "means." Vector embeddings let you search by semantic similarity rather than exact keyword matches. A chunk about "client onboarding deadlines" and a chunk about "when must we contact new clients" will sit near each other in vector space, even though they share almost no words.

3. Storage. The vectors go into a vector database. This is just a database optimised for "find me the chunks closest in meaning to this query." pgvector is the default recommendation for teams that already use Postgres. Qdrant, Pinecone, and Weaviate are the other names you'll hear.

4. Retrieval + generation. When a user asks a question, you embed the question itself, find the closest matching chunks in the vector database, glue those chunks to the question as context, and send the whole package to an LLM. The LLM generates an answer using the retrieved context as its source material.
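The four parts above fit in a few dozen lines if you're willing to cheat on the embedding. This is a minimal sketch, not a production design: `embed` here is a word-count stand-in I've made up so the example runs with no dependencies, which means it only matches on shared words. A real embedding model is what gives you the "similar meaning, different words" behaviour. `VectorStore` is likewise a hypothetical in-memory substitute for pgvector or Qdrant, and the sample contract text is invented.

```python
import math
import re
from collections import Counter


def chunk(text: str) -> list[str]:
    """Ingestion: one chunk per paragraph (paragraphs separated by blank lines)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def embed(text: str) -> Counter:
    """Embedding stand-in (assumption): a sparse word-count vector.

    A real embedding model returns a dense vector that captures meaning;
    this toy version only captures which words appear.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class VectorStore:
    """Storage stand-in (assumption): a list searched by brute force."""

    def __init__(self) -> None:
        self.rows: list[tuple[str, Counter]] = []

    def add_document(self, text: str) -> None:
        for piece in chunk(text):
            self.rows.append((piece, embed(piece)))

    def search(self, question: str, k: int = 3) -> list[str]:
        """Retrieval: embed the question, return the k closest chunks."""
        q = embed(question)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


# Invented sample document for illustration.
store = VectorStore()
store.add_document(
    "Payment terms are 30 days from invoice.\n\n"
    "The SLA is 99.9% uptime measured monthly.\n\n"
    "Either party may terminate with 90 days notice."
)
top = store.search("What is the SLA uptime?", k=1)
```

Swapping `embed` for a call to a real embedding model and `VectorStore` for a database table changes the plumbing, not the shape.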

The user sees: "What's the SLA we offered Acme Corp?"

The system does: embed the question, retrieve the three chunks of Acme's signed contract that mention SLAs, send those chunks plus the question to Claude or GPT, and return an answer like "Under section 4.2 of the Acme MSA dated 14 March 2025, the SLA is 99.9% uptime measured monthly."

That last sentence is the difference between AI that hallucinates and AI you can actually use in a business.
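What makes that answer citeable is mostly how the prompt is assembled. Here's a rough sketch of the "augment" step, assuming each retrieved chunk carries a `source` label alongside its `text`. The function names, the prompt wording, and the `fake_retrieve` stub are all my own placeholders; `generate` stands for whichever LLM client you use and is passed in rather than tied to a specific vendor API.

```python
from typing import Callable


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Augment: glue numbered, source-labelled chunks to the question."""
    context = "\n".join(
        f"[{i}] {c['source']}: {c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer using only the context below and cite the source you used. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


def answer(
    question: str,
    retrieve: Callable[[str], list[dict]],
    generate: Callable[[str], str],
) -> str:
    """Retrieve, augment, generate. Both callables are supplied by the caller."""
    return generate(build_prompt(question, retrieve(question)))


def fake_retrieve(question: str) -> list[dict]:
    """Hypothetical retriever returning one hard-coded chunk."""
    return [
        {
            "source": "Acme MSA, 14 March 2025, section 4.2",
            "text": "The SLA is 99.9% uptime measured monthly.",
        }
    ]


question = "What's the SLA we offered Acme Corp?"
prompt = build_prompt(question, fake_retrieve(question))
```

Keeping the source label inside the context is what lets the model (and a human reviewer) point back to the clause the answer came from.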

Why RAG matters for UK businesses

Three specific reasons it's worth understanding even if you have no immediate plans to build one.

Hallucinations are the single biggest reason AI pilots fail in business. The model says something confident and wrong. Someone repeats it to a client. The trust evaporates. Stanford researchers found that even RAG-powered legal AI tools still hallucinate in 17 to 33% of queries, which is worth knowing. RAG isn't a magic fix. It's a structural reduction. Going from a 50% hallucination rate to 5% changes whether a tool is usable. Going from 5% to 0% is a different (and much harder) problem.

Your data is the moat. Most UK SMEs sit on a decade of contracts, briefs, client notes, and internal documentation that's worth a fortune as context for AI. RAG is the cheapest way to actually use it. Fine-tuning a model on the same documents costs an order of magnitude more, and the model has to be retrained every time the documents change. RAG just updates the index.

Regulatory and data residency become tractable. A RAG system can be deployed entirely inside UK or EU infrastructure. The LLM call can go to Claude in an EU region, the vector database can sit on a Hetzner box in Helsinki or AWS London, and your source documents never leave your control. For SRA-regulated law firms, FCA-regulated financial services, or NHS-adjacent admin, that's the bit that makes AI legally usable.

RAG vs fine-tuning: when each one wins

The most common question I get from UK technical leads scoping AI: "do we need RAG or do we need fine-tuning?" Different problems. Different answers.

RAG wins when:

  • Your knowledge changes often (weekly, monthly)

  • You need citations or auditability (the system can tell you which document the answer came from)

  • You're early-stage and don't know yet what queries the system will need to handle

  • Your budget is under £50,000 for the build

Fine-tuning wins when:

  • You're processing very high query volumes on a well-defined task (think 100,000+ queries a day on classification, entity extraction, or format conversion)

  • The knowledge is genuinely static

  • You need a specific style of output that prompting can't reliably produce

The cost picture is roughly this. A production RAG system serving 10,000 queries a day across a 500,000-document corpus typically costs $4,000 to $9,000 a month all-in, most of which is LLM inference. A document update that costs next to nothing in a RAG system (you re-index the changed file) can cost $500 to $5,000 with fine-tuning, because you're retraining.
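It helps to turn that monthly figure into a per-query number. The arithmetic below uses only the figures quoted above, plus one assumption of mine: a 30-day month.

```python
queries_per_day = 10_000
days_per_month = 30  # assumption: a 30-day month
monthly_queries = queries_per_day * days_per_month  # 300,000

monthly_cost_low = 4_000   # USD, low end of the quoted range
monthly_cost_high = 9_000  # USD, high end of the quoted range

cost_per_query_low = monthly_cost_low / monthly_queries    # about $0.013
cost_per_query_high = monthly_cost_high / monthly_queries  # $0.03
```

So the all-in cost works out to roughly 1.3 to 3 cents per query at that volume.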

Roughly 60% of production AI deployments in 2025 and 2026 use both RAG and fine-tuning together. The hybrid pattern is the dominant one at scale. For most UK SMEs starting today, RAG alone is the right first move. Fine-tuning is a later optimisation, not a starting point.

What it costs to build RAG for a UK business

A realistic 2026 UK pricing picture for an SME-scale RAG build.

Single-purpose RAG system (one document type, one user-facing interface, one LLM provider): £15,000 to £40,000 upfront. Two to six weeks of build time. Ongoing run cost £100 to £500 a month at low to moderate query volumes.

Multi-corpus RAG with admin UI (multiple document types, a review interface for staff to vet outputs, observability and evals): £40,000 to £80,000. Four to eight weeks. Ongoing £300 to £1,500 a month.

Multi-agent RAG with workflow integration (RAG plus orchestration into Xero, HubSpot, a CRM, document generation, the works): £80,000 to £150,000. Eight to sixteen weeks. Ongoing £500 to £3,000 a month.

The cost drivers are document quality (clean, structured data is cheap to ingest; PDFs of scanned contracts are expensive), retrieval quality requirements (70% retrieval accuracy is easy; 95% requires real engineering), and integration scope (just-a-chatbot is cheap; embedded in your actual operational software is not).

The four mistakes that kill RAG projects

I've seen these in the wild repeatedly. Each one looks small. Each one quietly degrades the system until users stop trusting it.

Treating chunking as an afterthought. Chunking strategy has at least as much impact on retrieval quality as embedding model selection, yet most teams cut documents every 512 tokens and move on. Split a contract clause in half and the retrieved chunk no longer answers the question. The gap in recall between the best and worst chunking approaches can be as large as 9%.
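To make the "clause cut in half" failure concrete, here's a small comparison of a fixed-size splitter against one that splits on clause numbers. It's a sketch under assumptions: the contract text is invented (only the 4.2 SLA wording echoes the earlier example), chunk size is counted in words rather than tokens, and the regex assumes clauses start a line with a number like `4.2`.

```python
import re

# Invented sample contract for illustration.
CONTRACT = (
    "4.1 Term. This agreement runs for twelve months from the start date.\n"
    "4.2 Service levels. The supplier will maintain 99.9% uptime measured monthly.\n"
    "4.3 Remedies. Service credits apply if the service level is missed."
)


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive approach: cut every `size` words, ignoring document structure."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def clause_chunks(text: str) -> list[str]:
    """Structure-aware approach: start a new chunk at each numbered clause."""
    parts = re.split(r"(?m)^(?=\d+\.\d+\s)", text)
    return [p.strip() for p in parts if p.strip()]


fixed = fixed_size_chunks(CONTRACT, size=10)
clauses = clause_chunks(CONTRACT)

# With 10-word windows, no single chunk holds both the clause number
# and the uptime commitment, so neither chunk answers the question alone.
split_in_half = not any("4.2" in c and "uptime" in c for c in fixed)
```

The clause-aware version keeps "4.2" and "99.9% uptime" in the same chunk, which is the one the retriever needs to find.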

Pure vector search for exact-match queries. Vector embeddings handle "what's our refund policy" beautifully. They miss "find me clause 4.2 of the Acme MSA" because exact identifiers don't have semantic meaning. The fix is hybrid search: combine BM25 keyword scoring with dense vector similarity, usually merged via Reciprocal Rank Fusion. It's two days of additional engineering. It's the difference between a toy and a tool.
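Reciprocal Rank Fusion itself is only a few lines. Each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The sketch below assumes you already have two ranked lists of document IDs (one from BM25, one from vector search). The IDs are made up, and k = 60 is the commonly used default rather than anything tuned.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of IDs into one, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for the query "clause 4.2 of the Acme MSA".
bm25_ranking = ["acme_clause_4_2", "acme_clause_4_1", "refund_policy"]
vector_ranking = ["sla_overview", "acme_clause_4_2"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
# fused == ["acme_clause_4_2", "sla_overview", "acme_clause_4_1", "refund_policy"]
```

The exact clause wins because it ranks well in both lists, even though vector search alone put something else first.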

No evaluation pipeline. "It feels like it works" is not a quality metric. A serious RAG build needs golden questions with known correct answers, run automatically every time the prompt or model changes. Without evals, you can't tell whether an update made the system better or worse.
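A golden-question harness doesn't need to be elaborate to be useful. This is a minimal sketch: `run_evals`, the `must_contain` field, and the stub answers are all names I've invented for illustration, and checking for required substrings is a deliberately crude grader. Real pipelines often use stricter or model-assisted scoring.

```python
from typing import Callable


def run_evals(answer_fn: Callable[[str], str], golden_set: list[dict]):
    """Run every golden question and report the pass rate plus what failed."""
    failures = []
    for case in golden_set:
        answer = answer_fn(case["question"]).lower()
        missing = [fact for fact in case["must_contain"] if fact.lower() not in answer]
        if missing:
            failures.append({"question": case["question"], "missing": missing})
    pass_rate = 1 - len(failures) / len(golden_set)
    return pass_rate, failures


# Hypothetical golden set: questions with facts the answer must mention.
GOLDEN = [
    {"question": "What's the SLA we offered Acme Corp?", "must_contain": ["99.9%", "4.2"]},
    {"question": "What is the notice period?", "must_contain": ["90 days"]},
]

# Stub standing in for the real RAG system; the second answer is wrong on purpose.
STUB_ANSWERS = {
    "What's the SLA we offered Acme Corp?":
        "Under section 4.2 of the Acme MSA, the SLA is 99.9% uptime measured monthly.",
    "What is the notice period?": "The notice period is 30 days.",
}

pass_rate, failures = run_evals(lambda q: STUB_ANSWERS[q], GOLDEN)
```

Run it on every prompt or model change and compare the pass rate against the previous run; a drop tells you the update made things worse before a user does.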

Treating the LLM as the bottleneck. The model is rarely the weakest link. Retrieval quality, chunking, and document cleanliness almost always matter more than whether you used Claude Sonnet 4.6 or GPT-5. Spending two weeks A/B testing models while ignoring retrieval is the classic time-sink.

Where RAG fits in a UK SME today

Three practical use cases I see working consistently in 2026.

Document Q&A for professional services. Law firms answering questions across case files. Accountancy practices querying client tax history. Surveyors searching prior reports. The shape is the same: tens of thousands of documents, internal users, accuracy matters more than speed, citations are non-negotiable. RAG is the right architecture every time.

Client-facing knowledge bots. Customer support, FAQ, onboarding. The bot answers questions using only your documented policies, not made-up ones. The legal exposure is lower because the system is constrained to answer from what your documents say.

Internal operations. Onboarding new hires by letting them query the company handbook. Searching prior project notes when scoping new work. Surfacing relevant precedent without anyone having to remember it exists. This is the cheapest RAG to build because the audience is forgiving (internal staff vs paying clients) and the data is usually already organised.

The cases where RAG is the wrong tool are also worth naming. Real-time data lookup (the database has the answer, you don't need an LLM to find it). Anything where the cost of being wrong is catastrophic and unappealable (medical diagnosis, irreversible financial transactions). Cases where the volume justifies fine-tuning a smaller model instead.

My honest take

RAG is the workhorse pattern for the next five years of business AI. It's not glamorous. It's not the bit that gets the magazine cover. It's the boring engineering that makes the magazine-cover features actually work in production.

The good news for a UK SME is that you can ship a useful RAG system in under two months for under £40,000. The bad news is that the gap between a demo RAG (impressive in a 20-minute meeting) and a production RAG (reliable across thousands of queries) is bigger than most people think. The difference is mostly in the unglamorous parts: chunking decisions, evaluation harnesses, hybrid retrieval, document cleanliness.

If you're scoping AI for a UK service business in 2026, you almost certainly need a RAG layer somewhere in the architecture. The questions worth answering before you build are: which documents matter, how often do they change, how are users going to interact with the answers, and what's the cost of a wrong answer.

If you'd like a second opinion on whether RAG is the right next step for your business (and what it would realistically cost to build), book a free scoping call. I'll tell you straight whether RAG is the answer, whether a simpler approach would do, or whether you're not ready for either yet.

Frequently asked questions

What is RAG in simple terms?

RAG (retrieval-augmented generation) is a way to make an AI model answer questions about your specific documents without retraining it. The system finds the most relevant chunks of your documents, hands them to the AI as context, and asks the AI to answer using only that context. It cuts hallucinations and grounds answers in your actual data.

What is the difference between RAG and fine-tuning?

RAG retrieves relevant documents at query time and uses them as context, so updating knowledge is as cheap as updating the index. Fine-tuning retrains the model itself on your data, baking the knowledge into the model's weights. RAG is cheaper upfront and easier to update; fine-tuning is more efficient at very high query volumes but expensive to retrain. Roughly 60% of production AI projects use both together.

How much does it cost to build a RAG system in the UK?

A single-purpose RAG system for a UK SME typically costs £15,000 to £40,000 upfront with two to six weeks of build time, plus £100 to £500 a month to run. Multi-corpus systems with admin UI run £40,000 to £80,000. The cost drivers are document quality, retrieval accuracy requirements, and how deeply the system integrates with your existing operational software.

Does RAG eliminate AI hallucinations?

No, but it reduces them substantially. RAG cuts hallucination rates by 30 to 70% across domains in general, with a larger median improvement of 71% on domain-specific queries. Stanford researchers found that even RAG-powered legal AI tools still hallucinate in 17 to 33% of queries, so a RAG system still needs human review for any high-stakes output. The structural improvement is what makes RAG usable in business; perfection is not the goal.

Which vector database should a UK SME use for RAG?

For most UK SMEs starting out, pgvector inside Postgres is the right choice. It handles up to roughly 10 million vectors comfortably, integrates with existing Postgres infrastructure, gives transactional consistency, and avoids operational complexity. Qdrant is the right next step for systems with strict latency requirements or heavy filtered search. Pinecone's managed service is worth it when you want zero infrastructure work and have the budget to pay for it.
