If your knowledge base is smaller than 200,000 tokens (about 500 pages of material), you can include the entire knowledge base directly in the prompt, with no need for RAG or similar methods.
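As a rough illustration of the no-RAG approach, the sketch below simply concatenates the whole knowledge base into the prompt of a single model call. The file names, question, and model choice are hypothetical, and it assumes the combined documents fit within the model's context window.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

# Hypothetical file names; the whole corpus must fit in the context window.
knowledge_base = "\n\n".join(open(path).read() for path in ["handbook.md", "faq.md"])

response = client.messages.create(
    model="claude-3-haiku-20240307",   # model choice is illustrative
    max_tokens=1024,
    system=f"Answer questions using only this knowledge base:\n\n{knowledge_base}",
    messages=[{"role": "user", "content": "What was the revenue growth last quarter?"}],
)
print(response.content[0].text)
```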
How standard RAG works
RAG works by preprocessing a knowledge base using the following steps:
1. Break down the knowledge base (the "corpus" of documents) into smaller chunks of text, usually no more than a few hundred tokens.
2. Use an embedding model to convert these chunks into vector embeddings that encode meaning.
3. Store these embeddings in a vector database that allows for searching by semantic similarity.
At runtime, when a user inputs a query to the model, the vector database is used to find the most relevant chunks based on semantic similarity to the query. Then, the most relevant chunks are added to the prompt sent to the generative model.
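The sketch below walks through this pipeline in plain Python with NumPy. The `embed` function is a stand-in for whatever embedding model you use, and an in-memory array stands in for the vector database; the chunk size, 384-dimensional vectors, and example query are arbitrary choices for the sketch.

```python
import numpy as np

def embed(texts):
    """Stand-in for a real embedding model: maps each text to a unit vector.
    (Random but deterministic per text, just so the sketch runs end to end.)"""
    vecs = []
    for t in texts:
        rng = np.random.default_rng(abs(hash(t)) % (2**32))
        v = rng.normal(size=384)
        vecs.append(v / np.linalg.norm(v))
    return np.array(vecs)

def chunk(document, max_words=200):
    """Naive fixed-size chunking; real systems usually split on sentence
    or section boundaries instead."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# --- Preprocessing: chunk, embed, and "store" (here, just an in-memory array) ---
corpus = ["...full text of document one...", "...full text of document two..."]
chunks = [c for doc in corpus for c in chunk(doc)]
chunk_vectors = embed(chunks)

# --- Runtime: embed the query, find the most similar chunks, build the prompt ---
def retrieve(query, k=5):
    q = embed([query])[0]
    scores = chunk_vectors @ q                 # cosine similarity (unit vectors)
    top = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top]

question = "What was the revenue growth?"
prompt = ("Use these excerpts to answer:\n\n"
          + "\n\n".join(retrieve(question))
          + f"\n\nQuestion: {question}")
```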
Hybrid RAG: BM25 + embeddings
RAG solutions can more accurately retrieve the most applicable chunks by combining embeddings and BM25 techniques using the following steps:
1. Break down the knowledge base (the "corpus" of documents) into smaller chunks of text, usually no more than a few hundred tokens.
2. Create TF-IDF encodings and semantic embeddings for these chunks.
3. Use BM25 to find top chunks based on exact matches.
4. Use embeddings to find top chunks based on semantic similarity.
5. Combine and deduplicate results from (3) and (4) using rank fusion techniques (a minimal sketch follows this list).
6. Add the top-K chunks to the prompt to generate the response.
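Here is a minimal sketch of the fusion step (5), assuming the BM25 and embedding retrievers have each already returned a ranked list of chunk IDs. It uses reciprocal rank fusion, one common rank-fusion technique; the chunk IDs and the k=60 constant are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one.
    Each chunk scores 1 / (k + rank) per list it appears in; k=60 is a
    conventional default that damps the influence of any single ranker."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Duplicates collapse automatically because the dict is keyed by chunk ID.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked chunk IDs from the two retrievers:
bm25_ranking = ["c7", "c2", "c9", "c4"]        # lexical / exact-match ranking
embedding_ranking = ["c2", "c7", "c5", "c1"]   # semantic-similarity ranking

top_k = reciprocal_rank_fusion([bm25_ranking, embedding_ranking])[:3]
print(top_k)  # chunk IDs ordered by fused score
```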
Problem with traditional RAG
Traditional RAG systems, however, have a significant limitation: they often destroy context. Because the knowledge base is split into small, independent chunks, an individual chunk can lose the information needed to interpret it; a chunk stating that "revenue grew by 3% over the previous quarter," for example, no longer says which company or which quarter it refers to.
Contextual Retrieval
Contextual Retrieval solves this problem by prepending a short, chunk-specific explanatory context, derived from the full document the chunk came from, to each chunk before embedding.
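Here is a sketch of what this preprocessing step might look like, using the Anthropic Python SDK to generate the per-chunk context. The prompt wording, model choice, and token limit are assumptions for illustration; the contextualized chunk then replaces the raw chunk in the indexing pipelines described above.

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment

CONTEXT_PROMPT = """<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Please give a short, succinct context to situate this chunk within the overall
document for the purposes of improving search retrieval of the chunk.
Answer only with the succinct context and nothing else."""

def contextualize(document, chunk):
    """Ask an LLM to describe where the chunk fits in the document,
    then prepend that description to the chunk text."""
    response = client.messages.create(
        model="claude-3-haiku-20240307",   # a small, cheap model is enough here
        max_tokens=150,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(document=document, chunk=chunk),
        }],
    )
    context = response.content[0].text.strip()
    return f"{context}\n\n{chunk}"

# The contextualized chunks are then embedded (and indexed for BM25)
# in place of the raw chunks.
```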
