TL;DR

RAG architecture (retrieval-augmented generation) is a way to make a language model answer grounded in your private data: product catalog, brand guidelines, customer history, support content. The pipeline is chunk, embed, retrieve, re-rank, generate. The data foundation matters more than the model. Most teams over-chunk, skip the re-ranker, and ship without an eval set, which is why their RAG demos great and disappoints in production.

  • RAG is for grounded answers from private data.
  • The pipeline has five steps; skipping any one shows up later.
  • The data foundation is the real moat.
  • Over-chunking is the silent killer of recall.
  • No eval set, no production-grade RAG.

What RAG is in one paragraph

RAG, or retrieval-augmented generation, is a pattern where a language model is given relevant private documents at query time, fetched by a search system, so it can answer questions grounded in your data instead of just its training data. You index your content (product pages, support docs, brand guidelines, internal wiki) into a vector store, search the store at query time, return the most relevant passages, and pass them to the model along with the user's question. The model's answer is grounded in those passages, with citations possible. That is the whole pattern. Everything else is implementation detail.

When consumer brands need RAG

RAG earns its keep when the AI needs to answer questions about content the model was not trained on, or content that changes faster than the model can be retrained. For a consumer brand, that is most internal content.

Real use cases I have seen ship at consumer brands:

RAG is not the right answer when the model can already do the job on public knowledge, when the volume is so low that a human can handle it, or when the latency budget is too tight to fit a retrieval step.

The architecture: chunk, embed, retrieve, re-rank, generate

A production RAG pipeline has five steps. Skip any of them and the system has a soft spot.

1. Chunk

Break your documents into passages the retrieval system can index. The chunking strategy is one of the highest-leverage and most-misunderstood decisions in the pipeline. Chunks that are too small lose context. Chunks that are too large dilute the relevance signal. The right size depends on your content; product pages chunk differently than support articles, which chunk differently than long-form blog content.

Most teams I see chunking on a default of 500 tokens with no overlap and never re-tuning. That default is often wrong. Test a few strategies against your eval set and pick by recall, not by hunch.

2. Embed

Turn each chunk into a vector with an embedding model. Pick the embedding model the same way you pick a generation model: based on eval performance on your workload, cost, and latency. The embedding model is a swappable layer. Build the pipeline so you can re-index when a better embedding model lands.

3. Retrieve

Given a query, find the most similar chunks in the vector store. The basic version is pure semantic search via cosine similarity. The production version usually combines semantic search with keyword search (hybrid retrieval), because each catches what the other misses.

4. Re-rank

The retrieval step returns a candidate set of, say, 20 to 50 chunks. A re-ranker takes those candidates, scores them against the query with a higher-precision (and slower) model, and returns the top 3 to 10 to pass to the generation step. The re-ranker is the step most teams skip, and the step that most improves answer quality.

5. Generate

The generation model receives the user's question, the retrieved passages, and a system prompt that defines tone, constraints, and citation behavior. The model produces the answer. The choice of generation model follows the same rules as the rest of the stack (see how to pick the right LLM).

The re-ranker is the cheapest answer-quality upgrade in the pipeline. Skipping it is the most common mistake I see.

The data foundation question

RAG sits on top of your data. The quality of the foundation is the quality of the system. If your product catalog has duplicate SKUs, inconsistent attributes, and stale descriptions, your RAG will answer with duplicate SKUs, inconsistent attributes, and stale descriptions.

Before any RAG build, run the data foundation through these questions:

If any of these are unresolved, fix the foundation before you fix the retrieval. This is the same principle that drives the martech AI readiness audit: AI multiplies the data layer. Bad data plus RAG equals fast, confident, articulate wrong answers.

The cost math

RAG has predictable cost behavior at consumer-brand scale. The components, in order of typical dominance:

  1. Generation cost per query. The biggest line for most workloads. Driven by which generation model you picked and how long the context is.
  2. Embedding cost during ingestion. A one-time cost per chunk when you index. Re-ingestion happens when content changes or when you upgrade the embedding model.
  3. Embedding cost per query. A small per-query cost to embed the user's question.
  4. Vector store cost. Hosted vector databases charge by index size and query volume. At consumer-brand scale, usually modest.
  5. Re-ranker cost per query. A small per-query cost for the higher-precision scoring step.

The discipline: model the cost per query, multiply by projected daily query volume, then by 30 days. If the answer makes you wince, the issue is usually context length on the generation step, not the retrieval pipeline. Optimize the prompt and the retrieval count before you change models.

What most teams get wrong

Across the RAG builds I have advised on at Automatic and the RAG patterns inside CreativeOS, the same handful of mistakes show up.

1. Over-chunking. Chunks that are too small lose the surrounding context, and the model gets a relevant-feeling passage that does not actually carry the answer. Tune chunk size against an eval set.

2. No re-ranker. Teams ship with pure semantic retrieval and live with mediocre top-k quality for months. Adding a re-ranker is usually a one-day project for a multi-week improvement.

3. No eval set. The single highest-leverage RAG asset is a private eval set of 50 to 200 real questions, with the correct passages identified and the correct answers written. Without it, you cannot tell whether a change made things better or worse. With it, every change is a measured decision. Same eval discipline as in building your first AI agent workflow.

4. Treating retrieval and generation as the same problem. They are not. A better generation model does not fix a recall problem. A better retrieval pipeline does not fix a hallucination problem. Diagnose the failure step before you change anything.

5. Ignoring freshness. Indexed data goes stale. If your product catalog updates daily and your index updates monthly, your RAG is lying to your customers on day two of every month. Wire freshness into the architecture from the start.

6. No citation policy. If the model can answer without showing its sources, you cannot tell whether it is grounded or hallucinating. Force citations in the system prompt and verify them in the eval.

7. Skipping the V1 Framework. The same scoping discipline that applies to agents and prompts applies to RAG. Strip the use case to its essential job, decompose into steps, constrain inputs and outputs, define done. See the V1 Framework.

Your RAG is only as good as your data, your re-ranker, and your eval set. The model is the easy part.

The bottom line

RAG is the right pattern for consumer brands that need AI grounded in their own content. It is not the right pattern for every problem, and it is not a substitute for good data. The pipeline is well-understood. The mistakes are well-understood. The teams that ship build the eval set first, score recall before they score answers, and tune one step at a time.

Build the foundation. Build the pipeline. Build the eval set. Ship.


FAQ

What is RAG?

RAG, or retrieval-augmented generation, is a pattern where a language model is given relevant private documents at query time, fetched by a search system, so it can answer questions grounded in your data instead of just its training data.

When does a consumer brand need RAG?

When the AI needs to answer questions about content the model was not trained on: your product catalog, your brand guidelines, your customer history, your support knowledge base. If you want grounded, citable answers from private data, you want RAG.

Is RAG expensive?

RAG has fixed and variable costs. Fixed: the indexing infrastructure. Variable: embedding calls during ingestion, retrieval calls per query, model calls per query. At consumer-brand scale, RAG is usually affordable; the cost dominator is the generation step, not the retrieval step.

How is RAG different from fine-tuning?

Fine-tuning bakes knowledge into model weights and is good for teaching style, format, or narrow domain language. RAG injects fresh knowledge at query time and is good for facts that change. For most consumer-brand use cases, RAG is the right starting point because the data changes weekly.

What is the biggest RAG mistake?

Skipping the eval set. Teams launch RAG, the answers feel okay in demos, and they ship. Six weeks later the team realizes recall is mediocre and the model hallucinates around the gaps. An eval set catches this on day one and keeps catching it every quarter.

How long does a RAG build take?

A first usable RAG ships in three to six weeks. A production-grade RAG with re-ranking, evaluation, and monitoring takes longer, usually one to three months. The data quality work, not the retrieval engineering, is the long pole.

About the author

Nicholas Harris is an AI-native operator at the intersection of generative AI and consumer growth. He is President at CreativeOS, an AI-powered SaaS platform serving 25,000+ brands with production LLM, image generation, and AI agent workflows, and Founder at Automatic, an AI consultancy for consumer brands.

He has delivered three exits and built consumer-brand operations from SMB through nine-figure scale, including 110.6% e-commerce revenue growth at NASM. He is currently open to VP AI, AI Transformation, Head of Growth, and Fractional CTO roles. Based in Mesa, AZ.

Get in touch