
Introducing Contextual Retrieval

Research Paper · Anthropic · Sep 19, 2024
Tags: LLM, RAG

If your knowledge base is smaller than 200,000 tokens (about 500 pages of material), you can include the entire knowledge base directly in the prompt, with no need for RAG or similar methods.


How standard RAG works

RAG works by preprocessing a knowledge base using the following steps:

  • Chunking the corpus

      Break down the knowledge base (the “corpus” of documents) into smaller chunks of text, usually no more than a few hundred tokens.

  • Embedding the chunks

      Use an embedding model to convert these chunks into vector embeddings that encode meaning.

  • Storing in a vector DB

      Store these embeddings in a vector database that allows for searching by semantic similarity.

  • Retrieval at runtime

      When a user inputs a query, the vector database is used to find the chunks most semantically similar to the query. The most relevant chunks are then added to the prompt sent to the generative model.
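The preprocessing and runtime steps above can be sketched end-to-end. In the toy example below, the bag-of-words `embed` function and the in-memory `index` list are hypothetical stand-ins for a real embedding model and a real vector database:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Preprocessing: chunk the corpus and store embeddings
# (a plain list stands in for a vector database).
corpus = [
    "The quarterly revenue grew by 3% over the previous quarter.",
    "Shipping delays were caused by a port strike in March.",
]
index = [(chunk, embed(chunk)) for chunk in corpus]

# Runtime: embed the query and return the top-k most similar chunks.
def retrieve(query, k=1):
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

print(retrieve("How much did revenue grow?"))
```

The retrieved chunks would then be prepended to the generative model's prompt.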


    Hybrid RAG: BM25 + embeddings

    RAG solutions can retrieve the most applicable chunks more accurately by combining embeddings with the BM25 exact-match technique, using the following steps:

  • Chunking

      Break down the knowledge base (the "corpus" of documents) into smaller chunks of text, usually no more than a few hundred tokens.

  • Representing chunks

      Create TF-IDF encodings and semantic embeddings for these chunks.

  • Exact-match retrieval (BM25)

      Use BM25 to find top chunks based on exact matches.

  • Semantic retrieval (embeddings)

      Use embeddings to find top chunks based on semantic similarity.

  • Rank fusion

      Combine and deduplicate the results of the BM25 and embedding searches using rank fusion techniques.

  • Augment the prompt

      Add the top-K chunks to the prompt to generate the response.
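The rank-fusion step above can be illustrated with reciprocal rank fusion (RRF), one common fusion technique (the source does not specify which one is used). Each chunk's score is the sum of 1/(k + rank) over every ranked list it appears in, which naturally deduplicates and rewards chunks that both retrievers rank highly:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one.

    `rankings` is a list of ranked lists (best first). Each chunk scores
    sum(1 / (k + rank)) across the lists it appears in; k=60 is a
    conventional smoothing constant.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["c3", "c1", "c7"]       # chunks from exact-match retrieval
embedding_top = ["c1", "c4", "c3"]  # chunks from semantic retrieval
print(reciprocal_rank_fusion([bm25_top, embedding_top]))
# c1 ranks first: it appears near the top of both lists.
```

The top-K ids from the fused list are then mapped back to their chunks and added to the prompt.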


    The problem with traditional RAG

    But these traditional RAG systems have a significant limitation: they often destroy context. When a document is split into chunks, an individual chunk can lose the information needed to interpret it; for example, a chunk stating "revenue grew by 3%" no longer says which company or which quarter it refers to.


    Contextual Retrieval

    Contextual Retrieval solves this problem by prepending chunk-specific explanatory context to each chunk before embedding.
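A minimal sketch of that prepending step is below. The prompt wording and the `generate` callable are assumptions for illustration: in practice an LLM is prompted with the full document and the chunk, and its short situating context is prepended to the chunk before it is embedded:

```python
# Hypothetical prompt, modeled on the approach described above: an LLM is
# asked to situate each chunk within its source document.
CONTEXT_PROMPT = """<document>
{doc}
</document>
Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>
Give a short, succinct context situating this chunk within the overall
document, to improve search retrieval of the chunk."""

def contextualize(doc, chunk, generate):
    """Prepend model-generated context to a chunk before embedding it.

    `generate` is a stand-in for a call to an LLM that takes a prompt
    string and returns a completion string.
    """
    context = generate(CONTEXT_PROMPT.format(doc=doc, chunk=chunk))
    return f"{context}\n{chunk}"

# Example with a stubbed model call (the document and reply are invented):
doc = "ACME Corp, Q2 2023 filing. ... Revenue grew by 3% over the previous quarter."
chunk = "Revenue grew by 3% over the previous quarter."
stub = lambda prompt: "This chunk is from ACME Corp's Q2 2023 filing."
print(contextualize(doc, chunk, stub))
```

The contextualized chunk, rather than the raw one, is what gets embedded (and indexed), so retrieval sees the situating information the raw chunk had lost.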