Chapter Notes: LLM Engineer’s Handbook - RAG Feature Pipeline

Chapter 4: RAG Feature Pipeline

  • Retrieval-augmented generation (RAG) Feature Pipeline: Qdrant vector DB for online serving and ZenML artifacts for offline training.
  • Naive RAG
    • Chunking
    • Embedding
    • Vector DBs
  • The chapter teaches you what RAG is and how to implement it.
  • The main sections of this chapter are:
    • Understanding RAG
    • An overview of advanced RAG
    • Exploring the LLM Twin’s RAG feature pipeline architecture
    • Implementing the LLM Twin’s RAG feature pipeline
  • A RAG system is composed of three main modules that are independent of each other:
    • Ingestion pipeline: A batch or streaming pipeline used to populate the vector DB
    • Retrieval pipeline: A module that queries the vector DB and retrieves entries relevant to the user’s input
    • Generation pipeline: The layer that uses the retrieved data to augment the prompt and an LLM to generate answers

Ingestion Pipeline

  • For the ingestion pipeline, first we need to collect the data.
  • This can be from DBs, APIs, or web pages, and depending on the source, the cleaning step might differ.
  • The cleaned data is then chunked (the chunk size depends on the embedding model’s input size).
  • Then the chunks are embedded.
  • The chunks, along with their metadata, are taken by the loading module.

So the flow for the ingestion pipeline is: Collect -> Clean -> Chunk -> Embed -> Load
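
As a rough sketch, that flow could look like the code below. The cleaning rules, chunk size, overlap, and the MiniLM embedding model are assumptions for illustration, not the book’s exact implementation.

```python
from sentence_transformers import SentenceTransformer

# Illustrative ingestion flow: Collect -> Clean -> Chunk -> Embed -> (Load).
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

def clean_text(raw: str) -> str:
    # Minimal cleaning: collapse whitespace; real cleaning depends on the source.
    return " ".join(raw.split())

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # Fixed-size character chunks with a small overlap between neighbours.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += chunk_size - overlap
    return chunks

def ingest(raw_documents: list[str]) -> list[dict]:
    records = []
    for doc in raw_documents:
        chunks = chunk_text(clean_text(doc))
        embeddings = model.encode(chunks)  # one vector per chunk
        for chunk, embedding in zip(chunks, embeddings):
            # Each record is later loaded into the vector DB with its metadata.
            records.append({"chunk": chunk, "embedding": embedding.tolist(), "metadata": {}})
    return records
```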

Retrieval Pipeline

  • A retrieval pipeline uses the user input to output the most similar chunks of data.
  • For this, the user input first needs to be embedded into the same vector space as the chunks of data (i.e., with the same embedding model).
  • Then we use a distance function to get the K nearest chunks to it.
  • Those chunks are used to augment the prompt.

Here, cosine distance is one of the most popular distance functions. But it is said that the right distance function depends on the data and the embedding model we use. How do we decide on the best distance function?
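As a minimal, library-free sketch of this step (the in-memory arrays stand in for the vector DB, which is an assumption here):

```python
import numpy as np

def cosine_similarity(query_vec: np.ndarray, chunk_vecs: np.ndarray) -> np.ndarray:
    # Cosine similarity between one query vector and a matrix of chunk vectors.
    query_vec = query_vec / np.linalg.norm(query_vec)
    chunk_vecs = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return chunk_vecs @ query_vec

def retrieve_top_k(query_vec: np.ndarray, chunk_vecs: np.ndarray, chunks: list[str], k: int = 3) -> list[str]:
    # The query must be embedded with the SAME model as the chunks,
    # so that both live in the same vector space.
    scores = cosine_similarity(query_vec, chunk_vecs)
    top_idx = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in top_idx]
```

A common rule of thumb is to start with the metric the embedding model was trained or evaluated with (most sentence-embedding models target cosine similarity or dot product) and then validate the choice empirically with retrieval metrics on your own data.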

Generation Pipeline

  • The final prompt results from a system prompt and a prompt template populated with the user’s query and the retrieved context. You might have a single prompt template or multiple prompt templates, depending on your application. Usually, all the prompt engineering is done at the prompt template level.
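
For example, a single prompt template could look like this (the wording below is just an illustration, not the book’s actual template):

```python
# The system prompt would be passed separately to the chat API.
SYSTEM_PROMPT = "You are an assistant that answers strictly based on the provided context."

PROMPT_TEMPLATE = """Answer the question using only the context below.
If the context is not sufficient, say that you don't know.

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, retrieved_chunks: list[str]) -> str:
    # Augment the prompt with the retrieved context before calling the LLM.
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)
```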

Critical aspects affecting the accuracy of RAGs:

  • Embedding used,
  • Similarity function used.

Embeddings

  • Algorithms for creating vector indexes: Random Projection, Hierarchical Navigable Small World (HNSW), Product Quantization (PQ), and Locality-Sensitive Hashing (LSH).
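
For example, creating a collection in Qdrant (which indexes the vectors with HNSW by default) could look like this; the collection name, vector size, and local host are assumptions:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(host="localhost", port=6333)  # assumes a locally running Qdrant

# The distance metric is chosen per collection; Qdrant builds an HNSW index over the vectors.
client.create_collection(
    collection_name="llm_twin_chunks",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
```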

The vanilla RAG framework we just presented doesn’t address many fundamental aspects that impact the quality of the retrieval and answer generation, such as:

  • Are the retrieved documents relevant to the user’s question?
  • Is the retrieved context enough to answer the user’s question?
  • Is there any redundant information that only adds noise to the augmented prompt?
  • Does the latency of the retrieval step match our requirements?
  • What do we do if we can’t generate a valid answer using the retrieved information?

Therefore, for RAG we need two things:

  • robust evaluation of retrieval
  • retrieval limitations should be addressed in the algorithm itself.

Advanced RAG

The vanilla RAG design can be optimized at three different stages:

  • Pre-retrieval
  • Retrieval
  • Post-retrieval

Pre-retrieval

Most of the data indexing techniques focus on better preprocessing and structuring of the data to improve retrieval efficiency, such as:

  • Sliding Window
  • Enhancing Data Granularity
  • Metadata
  • Optimizing index structures
  • Small-to-big (see the sketch after this list)
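
A hedged sketch of the small-to-big idea (parent/child sizes and field names are arbitrary choices for illustration): small “child” chunks are embedded for precise retrieval, while the larger “parent” chunk is kept so that the LLM receives more context.

```python
def small_to_big(documents: list[str], parent_size: int = 2000, child_size: int = 400) -> list[dict]:
    # Small-to-big indexing: embed small "child" chunks for precise retrieval,
    # but keep the larger "parent" chunk to inject into the prompt at generation time.
    records = []
    for doc_id, doc in enumerate(documents):
        for p_start in range(0, len(doc), parent_size):
            parent = doc[p_start : p_start + parent_size]
            for c_start in range(0, len(parent), child_size):
                child = parent[c_start : c_start + child_size]
                records.append({
                    "text_to_embed": child,          # goes into the vector index
                    "parent_chunk": parent,          # injected into the prompt
                    "metadata": {"doc_id": doc_id},  # metadata also enables filtered search
                })
    return records
```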

For query optimization, the main techniques are:

  • Query routing
  • Query rewriting
    • Paraphrasing
    • Synonym substitution
    • Sub-queries
    • Hypothetical document embeddings (HyDE)
  • Query Expansion (see the sketch after this list)
    • Self-Query
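
As an illustration of query expansion, the sketch below assumes some `generate(prompt)` callable that wraps your LLM and a `retrieve(query)` callable that returns chunks for a single query; both of these, and the expansion prompt wording, are assumptions:

```python
def expand_query(query: str, generate, n_variants: int = 3) -> list[str]:
    # Query expansion: ask the LLM for paraphrases of the original query.
    prompt = (
        f"Rewrite the following question in {n_variants} different ways, "
        f"one per line, keeping the meaning identical:\n{query}"
    )
    variants = [line.strip() for line in generate(prompt).splitlines() if line.strip()]
    return [query] + variants[:n_variants]

def retrieve_with_expansion(query: str, generate, retrieve) -> list[str]:
    # Retrieve with every query variant and deduplicate the union of the results.
    results: list[str] = []
    for variant in expand_query(query, generate):
        for chunk in retrieve(variant):
            if chunk not in results:
                results.append(chunk)
    return results
```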

Retrieval Pipeline Optimization

There are two main ways:

  • Improve the embedding model
    • by fine-tuning the pre-trained model (very costly computationally, and even financially)
    • by using instruction models (less costly)
  • Leveraging the DB’s filter and search features (see the sketch after this list)
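
For example, Qdrant lets you combine vector search with payload (metadata) filters. A hedged sketch, where the collection and field names are assumptions:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue

client = QdrantClient(host="localhost", port=6333)

def filtered_search(query_vector: list[float], author_id: str, k: int = 5):
    # Restrict the vector search to chunks written by a given author,
    # using the metadata stored in the point payloads.
    return client.search(
        collection_name="llm_twin_chunks",
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="author_id", match=MatchValue(value=author_id))]
        ),
        limit=k,
    )
```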

Post-Retrieval Pipeline Optimization

  • Re-ranking (see the sketch after this list)
  • Prompt compression
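
A minimal re-ranking sketch with a cross-encoder from sentence-transformers (the model name is one common public choice, not necessarily the book’s):

```python
from sentence_transformers import CrossEncoder

# A cross-encoder scores each (query, chunk) pair jointly, which is slower but
# usually more accurate than the bi-encoder used for the initial retrieval.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], keep: int = 3) -> list[str]:
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]
```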

Exploring the LLM Twin’s RAG feature pipeline

To implement the RAG feature pipeline, we have two design choices:

| Batch Pipeline | Streaming Pipeline |
| --- | --- |
| Runs at regular intervals | Runs continuously |
| Simple | Complex |
| When processing the data is not time-critical | When it is time-critical |
| Handles large volumes of data efficiently | Handles single data points as they arrive |

Core steps for RAG feature pipeline

  • Data Extraction
  • Cleaning
  • Chunking
  • Embedding
  • Data Loading: Embedding + Metadata + Chunks (see the sketch after this list)
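
The loading step could look roughly like this with Qdrant (the point IDs, collection name, and payload fields are illustrative assumptions); each point stores the embedding as the vector and keeps the raw chunk plus its metadata in the payload:

```python
import uuid

from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct

client = QdrantClient(host="localhost", port=6333)

def load_chunks(records: list[dict]) -> None:
    # `records` are the dicts produced by the ingestion sketch above:
    # {"chunk": str, "embedding": list[float], "metadata": dict}.
    points = [
        PointStruct(
            id=str(uuid.uuid4()),
            vector=record["embedding"],
            payload={"chunk": record["chunk"], **record.get("metadata", {})},
        )
        for record in records
    ]
    client.upsert(collection_name="llm_twin_chunks", points=points)
```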

Change data capture (CDC)

  • A strategy that allows you to optimally keep two or more types of data storage in sync without adding compute and I/O overhead.
  • It captures any CRUD operation done on the source DB and replicates it on a target DB.
  • Optionally, you can add preprocessing steps in between the replication.

CDC addresses the synchronization problem using two main strategies:

  • Push: The source DB actively sends changes to targets, enabling real-time updates. A messaging system buffers changes to prevent data loss if targets are unavailable.
  • Pull: The source DB logs changes, and targets fetch them periodically. This reduces source load but introduces delays; a messaging buffer ensures reliability.

The main CDC patterns that are used in the industry:

  • Timestamp-based: adds overhead to the source, as we have to query the whole table/dataset.
  • Trigger-based: similar overhead on the source.
  • Log-based: no overhead on the source system; however, since transaction logs are not standardized, we need vendor-specific implementations (see the sketch after this list).
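
As a hedged sketch of the log-based, push-style flavour, MongoDB change streams can be consumed and forwarded to a message queue (the connection string, database, and the `publish` placeholder are assumptions; change streams require MongoDB to run as a replica set):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["llm_twin"]["documents"]

def publish(event: dict) -> None:
    # Placeholder: in a real pipeline this would push the event to a message broker.
    print(event)

# Each change document describes a CRUD operation applied to the source collection.
with collection.watch() as stream:
    for change in stream:
        publish({
            "operation": change["operationType"],      # insert / update / delete / ...
            "document": change.get("fullDocument"),    # present for inserts by default
        })
```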

Why is the data stored in two snapshots?

  • After the data is cleaned: For fine-tuning LLMs
  • After the documents are chunked and embedded: For RAG
