If your knowledge base is smaller than 200,000 tokens (about 500 pages of material), you can include the entire knowledge base directly in the prompt, with no need for RAG or similar methods.
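As a rough illustration of the no-RAG approach, the sketch below simply concatenates the whole knowledge base into the prompt of a single model call. The file names, question, and model choice are hypothetical, and it assumes the combined documents fit within the model's context window.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

# Hypothetical file names; the whole corpus must fit in the context window.
knowledge_base = "\n\n".join(open(path).read() for path in ["handbook.md", "faq.md"])

response = client.messages.create(
    model="claude-3-haiku-20240307",   # model choice is illustrative
    max_tokens=1024,
    system=f"Answer questions using only this knowledge base:\n\n{knowledge_base}",
    messages=[{"role": "user", "content": "What was the revenue growth last quarter?"}],
)
print(response.content[0].text)
```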
How standard RAG works
RAG works by preprocessing a knowledge base using the following steps:
1. Break down the knowledge base (the "corpus" of documents) into smaller chunks of text, usually no more than a few hundred tokens.
2. Use an embedding model to convert these chunks into vector embeddings that encode meaning.
3. Store these embeddings in a vector database that allows for searching by semantic similarity.
At runtime, when a user inputs a query to the model, the vector database is used to find the most relevant chunks based on semantic similarity to the query. Then, the most relevant chunks are added to the prompt sent to the generative model.
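The sketch below walks through this pipeline in plain Python with NumPy. The `embed` function is a stand-in for whatever embedding model you use, and an in-memory array stands in for the vector database; the chunk size, 384-dimensional vectors, and example query are arbitrary choices for the sketch.

```python
import numpy as np

def embed(texts):
    """Stand-in for a real embedding model: maps each text to a unit vector.
    (Random but deterministic per text, just so the sketch runs end to end.)"""
    vecs = []
    for t in texts:
        rng = np.random.default_rng(abs(hash(t)) % (2**32))
        v = rng.normal(size=384)
        vecs.append(v / np.linalg.norm(v))
    return np.array(vecs)

def chunk(document, max_words=200):
    """Naive fixed-size chunking; real systems usually split on sentence
    or section boundaries instead."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# --- Preprocessing: chunk, embed, and "store" (here, just an in-memory array) ---
corpus = ["...full text of document one...", "...full text of document two..."]
chunks = [c for doc in corpus for c in chunk(doc)]
chunk_vectors = embed(chunks)

# --- Runtime: embed the query, find the most similar chunks, build the prompt ---
def retrieve(query, k=5):
    q = embed([query])[0]
    scores = chunk_vectors @ q                 # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

question = "What was the revenue growth?"
prompt = ("Use these excerpts to answer:\n\n"
          + "\n\n".join(retrieve(question))
          + f"\n\nQuestion: {question}")
```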
Hybrid RAG: BM25 + embeddings
RAG solutions can more accurately retrieve the most applicable chunks by combining embeddings and BM25 techniques using the following steps:
1. Break down the knowledge base (the "corpus" of documents) into smaller chunks of text, usually no more than a few hundred tokens.
2. Create TF-IDF encodings and semantic embeddings for these chunks.
3. Use BM25 to find top chunks based on exact matches.
4. Use embeddings to find top chunks based on semantic similarity.
5. Combine and deduplicate results from (3) and (4) using rank fusion techniques (a minimal sketch follows this list).
6. Add the top-K chunks to the prompt to generate the response.
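Here is a minimal sketch of the fusion step (5), assuming the BM25 and embedding retrievers have each already returned a ranked list of chunk IDs. It uses reciprocal rank fusion, one common rank-fusion technique; the chunk IDs and the k=60 constant are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one.
    Each chunk scores 1 / (k + rank) per list it appears in; k=60 is a
    conventional default that damps the influence of any single ranker."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Duplicates collapse automatically because the dict is keyed by chunk ID.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked chunk IDs from the two retrievers:
bm25_ranking = ["c7", "c2", "c9", "c4"]        # lexical / exact-match ranking
embedding_ranking = ["c2", "c7", "c5", "c1"]   # semantic-similarity ranking

top_k = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])[:3]
print(top_k)  # chunk IDs ordered by fused score
```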
Problem with traditional RAG
Traditional RAG systems, however, have a significant limitation: they often destroy context. Because the knowledge base is split into small, independent chunks, an individual chunk can lose the information needed to interpret it; a chunk stating that "revenue grew by 3% over the previous quarter," for example, no longer says which company or which quarter it refers to.
Contextual Retrieval
Contextual Retrieval solves this problem by prepending a short, chunk-specific explanatory context, derived from the full document the chunk came from, to each chunk before embedding.
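Here is a sketch of what this preprocessing step might look like, using the Anthropic Python SDK to generate the per-chunk context. The prompt wording, model choice, and token limit are assumptions for illustration; the contextualized chunk then replaces the raw chunk in the indexing pipelines described above.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

CONTEXT_PROMPT = """<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Please give a short, succinct context to situate this chunk within the overall
document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else."""

def contextualize(document, chunk):
    """Ask an LLM to describe where the chunk fits in the document,
    then prepend that description to the chunk text."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",   # a small, cheap model is enough here
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(document=document, chunk=chunk),
        }],
    )
    context = response.content[0].text.strip()
    return f"{context}\n\n{chunk}"

# The contextualized chunks are then embedded (and indexed for BM25)
# in place of the raw chunks.
```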
