A Guide to Building RAG

(1) What is RAG?

Retrieval-Augmented Generation (RAG) is an architectural approach that enhances the efficacy of large language model (LLM) applications by leveraging custom data. Here’s how it works:

  1. Large Language Models (LLMs), such as GPT-3 or GPT-4, are trained on vast volumes of data and use billions of parameters to generate original output for tasks like answering questions, translating languages, and completing sentences.

  2. Challenges with LLMs: An LLM's knowledge is fixed at training time, it has no access to an organisation's private or domain-specific data, and when asked about information it was not trained on it may produce confident but incorrect answers.

  3. The RAG Approach: Before the LLM generates a response, relevant information is retrieved from an external knowledge base and supplied to the model as additional context alongside the user's query.

  4. Benefits of Retrieval-Augmented Generation: Responses are grounded in current, domain-specific data, inaccurate answers are reduced, and the knowledge base can be updated without retraining the model.

In summary, RAG combines the generative power of LLMs with the precision of specialized data search mechanisms, resulting in a system that offers nuanced responses by leveraging both pre-trained models and external knowledge.

(2) What is the Architecture of RAG?

There are four main building blocks for architecting RAG solutions: ingesting the data, storing it in a suitable database, retrieving the required content from the database, and finally compiling a prompt with the context retrieved from the knowledge base and making an API call to the LLM to generate the response to the user's query. Let's briefly review each building block before getting to the RAG architecture options.

Indexing

Indexing is like arranging shelves in your local library: the books are organised physically by genre into sections and then by author within each section. Indexing ensures that information can be found quickly when needed.

Unstructured data is best split into smaller chunks before indexing; for example, a document of 2,000 words can be split into 10 chunks of 200 words each, and a 60-minute video can be split into smaller 5-minute chunks.

These chunks are transformed into embeddings, i.e. vector representations of the data, so that they can be stored and used for search. This process is also called vectorisation.

Chunking needs careful consideration, as it determines which data will be retrieved together at a later stage when the relevant chunks are accessed. For a text document, we may want to chunk by paragraph rather than by a fixed number of tokens.
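
As a minimal sketch of such paragraph-based chunking (the 200-word budget and the blank-line split rule are assumptions for illustration, not recommendations):

def chunk_by_paragraph(text, max_words=200):
    # Split on blank lines, then pack whole paragraphs into chunks of roughly max_words words.
    chunks, current, count = [], [], 0
    for paragraph in text.split("\n\n"):
        words = len(paragraph.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks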

Storing

The database must be able to store vector embeddings. There are many choices for vector databases, such as OpenSearch, PostgreSQL (with the pgvector extension), and LanceDB. Evaluate the capabilities each provides and select a database that covers your key requirements while meeting your budget and integration constraints.

Retrieving

Once the data is split into chunks and stored in a suitably selected vector database, it can be searched and retrieved.

Most modern vector databases offer hybrid search, which combines semantic (vector similarity) search with keyword search and may return multiple relevant results for a given user query. The database's ability to rank these results and select the best matches is one of the key capabilities to look for in a vector database.
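
Vector databases implement this natively, but as a toy illustration of the idea, a keyword score (TF-IDF cosine similarity) and a semantic score (dense-embedding cosine similarity) can be blended with a weight; the alpha value and variable names below are assumptions:

from sklearn.metrics.pairwise import cosine_similarity

def hybrid_scores(query_tfidf, query_embedding, doc_tfidf, doc_embeddings, alpha=0.5):
    # query_tfidf / query_embedding are single-row matrices; doc_* hold one row per document
    keyword = cosine_similarity(query_tfidf, doc_tfidf)[0]            # keyword relevance per document
    semantic = cosine_similarity(query_embedding, doc_embeddings)[0]  # semantic relevance per document
    return alpha * semantic + (1 - alpha) * keyword                   # weighted blend of the two signals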

Generating Response

To get a response from the LLM, the information retrieved from the vector database must be structured so that the generated response is as accurate and relevant to the user as possible. This means building the context from the user's initial prompt and the retrieved information, and combining them with any additional instructions, before making the API call to the LLM.
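
A minimal sketch of this assembly step (the prompt template is an assumption, and call_llm stands in for whichever LLM API you use):

def build_prompt(question, retrieved_chunks):
    # Concatenate the retrieved chunks into a single context block for the LLM.
    context = "\n\n".join(retrieved_chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# response = call_llm(build_prompt(user_question, retrieved_chunks))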

Based on the above building blocks, let's go through some options for building RAG systems.

RAG Solution Architecture Options

RAG solutions can be architected in different ways to meet the requirements of a given use case. It is important to understand that RAG is not required for every use case, and different architectures can be considered depending on the complexity of the requirements.

Option 1 - Basic RAG

A basic RAG system transforms the user's question into an embedding and searches the knowledge base to retrieve relevant external data before engineering a prompt. The prompt is then passed to the LLM, and the generated response is sent back to the user.
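
Sketched end to end (embed_query, search_kb, and call_llm are hypothetical helpers standing in for your embedding model, vector database, and LLM API; build_prompt is the assembly step shown earlier):

def basic_rag(question):
    query_vector = embed_query(question)        # 1. transform the question into an embedding
    chunks = search_kb(query_vector, top_k=5)   # 2. retrieve relevant chunks from the knowledge base
    prompt = build_prompt(question, chunks)     # 3. engineer the prompt with the retrieved context
    return call_llm(prompt)                     # 4. generate the response for the user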

You can also check out my article "Building a Gen AI APP with RAG".

Option 2 - Selective RAG

Tools for interacting with LLMs are becoming more popular, and we can let the LLM itself decide whether RAG is required to respond to the user's query. If the LLM needs more context, the query is redirected to the knowledge base; the rest follows the Basic RAG flow described in Option 1.

Option 3 - Selective Databases RAG

If RAG is required, we may want to direct the query to the right Knowledge Base (KB), for example, the Customer or Product KB. This capability could be built into the tool that interacts with the LLM so that when RAG is required, the query is directed to the correct knowledge base for retrieval.
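
A minimal sketch covering Options 2 and 3 together, where the LLM first decides whether retrieval is needed and, if so, which knowledge base to use (the prompt wording, knowledge-base names, and helper functions are assumptions):

ROUTER_PROMPT = (
    "Decide how to answer the user's query. Reply with exactly one word: "
    "NONE if no retrieval is needed, CUSTOMER for the customer knowledge base, "
    "or PRODUCT for the product knowledge base.\n\nQuery: {query}"
)

def route_and_answer(query):
    decision = call_llm(ROUTER_PROMPT.format(query=query)).strip().upper()
    if decision == "NONE":
        return call_llm(query)                            # answer directly, no retrieval required
    kb = customer_kb if decision == "CUSTOMER" else product_kb
    chunks = kb.search(embed_query(query), top_k=5)       # retrieve from the selected knowledge base
    return call_llm(build_prompt(query, chunks))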

Option 4 - Feedback RAG

When using basic RAG, the user's initial query may not be well tuned to get the best response from the LLM. The good news is that we can use the LLM to rewrite the query before moving to the next step of searching the knowledge base.
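
A minimal sketch of the rewrite step (the prompt wording is an assumption; call_llm and basic_rag are the hypothetical helpers from the earlier sketches):

REWRITE_PROMPT = (
    "Rewrite the following question so that it is self-contained and well suited "
    "to searching a knowledge base. Return only the rewritten question.\n\n"
    "Question: {query}"
)

def rewrite_query(query):
    return call_llm(REWRITE_PROMPT.format(query=query)).strip()

# answer = basic_rag(rewrite_query(user_question))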

At their core, RAG solutions work by building a prompt that combines the user's initial input with additional context retrieved from external data. This increases accuracy and reduces risks such as hallucination, and the external data is used only at query time rather than to train the LLM.

Understanding the building blocks of a RAG solution and the available architecture options increases the chance of success of the RAG system.

(3) How to build RAG?

Here's a step-by-step guide to implementing a Retrieval-Augmented Generation (RAG) system for a task such as question answering or content generation. This guide will cover the essential components: building a retriever, integrating a generator, and executing the RAG process.

Step 1: Define Requirements

  • Determine Scope: Define the type of information you need to retrieve (e.g., scientific articles, general knowledge).

  • Performance Metrics: Set goals for retrieval accuracy, speed, and scalability.

Step 2: Data Collection and Preparation

  • Gather Data: Collect a comprehensive dataset relevant to your application, such as Wikipedia dumps or domain-specific corpora.

  • Preprocessing: Clean and preprocess the data. This includes tokenization, removing stopwords, and possibly lemmatization.
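
A minimal preprocessing sketch using NLTK (whether you remove stopwords or lemmatize is task-dependent; treat this as one possible pipeline rather than a requirement):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())          # simple word-level tokenization
    tokens = [t for t in tokens if t not in stop_words]   # remove common English stopwords
    return [lemmatizer.lemmatize(t) for t in tokens]      # reduce each token to its lemma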

Step 3: Building a Sparse Retriever (TF-IDF)

  • Tokenization: Convert text into tokens or words.

  • TF-IDF Vectorization: Use libraries like scikit-learn in Python to convert the text data into TF-IDF vectors.

from sklearn.feature_extraction.text import TfidfVectorizer
# documents is the list of preprocessed text strings prepared in Step 2
vectorizer = TfidfVectorizer(stop_words='english')   # drop common English stopwords
tfidf_vectors = vectorizer.fit_transform(documents)  # sparse matrix: one row per document
  • Indexing: Store these vectors so that they can be searched efficiently. Simple array storage works for small datasets; for larger datasets, use specialized systems such as Anserini or Elasticsearch.

Step 4: Building a Dense Retriever

  • Choose Model: Select a pre-trained model appropriate for generating dense embeddings (BERT, RoBERTa, etc.).

  • Generate Embeddings: Convert the documents into vector embeddings using the chosen model.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-mpnet-base-v2')  # general-purpose sentence-embedding model
embeddings = model.encode(documents)              # numpy array: one dense vector per document
  • Indexing: Use libraries like FAISS or Annoy for efficient similarity search in high-dimensional spaces.

import faiss
index = faiss.IndexFlatL2(embeddings.shape[1])  # exact L2 (Euclidean) index over this dimensionality
index.add(embeddings)                           # add the document vectors (float32) to the index

Step 5: Implementing the Search Function

  • Query Processing: Convert the query into the same format as your documents (TF-IDF or embeddings).

query_vector = model.encode([query])  # for the dense retriever: encode the query with the same model; shape (1, dim)
  • Searching: Implement a function to search the index for the most similar documents.

D, I = index.search(query_vector, 5)  # top-5 closest vectors in FAISS; D holds distances, I holds document indices

Step 6: Retrieval Evaluation

  • Evaluation Metrics: Use precision, recall, or F1-score to evaluate how well your retriever is performing.

  • Test Queries: Run test queries to assess the performance of your retriever.
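
A minimal sketch of this evaluation using recall@k (the relevant_ids ground-truth mapping and the retrieve helper are assumptions; both would come from your own labelled test set and search function):

def recall_at_k(test_queries, relevant_ids, retrieve, k=5):
    # Fraction of queries for which at least one relevant document appears in the top-k results.
    hits = 0
    for query in test_queries:
        retrieved = retrieve(query, k)                     # list of document ids returned by the retriever
        if set(retrieved) & set(relevant_ids[query]):
            hits += 1
    return hits / len(test_queries)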

Step 7: Optimization and Tuning

  • Parameter Tuning: Optimize parameters like the number of dimensions for embeddings, the algorithm for indexing, and the k-value in nearest neighbor searches.

  • Quality of Embeddings: Experiment with different pre-trained models or fine-tune them on your specific dataset to improve retrieval quality.

Step 8: Integration

  • API Development: Develop an API that can receive queries and return the retrieved documents; a minimal sketch follows this list.

  • Interface with Generator: Ensure the retrieval mechanism can seamlessly interface with the generative model for downstream tasks.
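
A minimal sketch of such an API using FastAPI (the endpoint name, request schema, and the retrieve helper are assumptions for illustration):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    text: str
    k: int = 5

@app.post("/retrieve")
def retrieve_documents(query: Query):
    docs = retrieve(query.text, query.k)   # retrieve() wraps the search function from Step 5
    return {"documents": docs}

# Run with: uvicorn app:app --reload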

Step 9: Deployment

  • Deployment: Deploy the retrieval mechanism on a suitable server or cloud environment.

  • Monitoring: Set up monitoring for performance and errors to ensure the system operates reliably.

Step 10: Set Up the Generator

  • Model Selection: Choose a language generation model. Options include GPT-2, GPT-3.5, or GPT-4 for generating answers; newer, larger models generally provide better contextual generation. Note that encoder-only models such as BERT are better suited to producing embeddings for retrieval than to generating text.

  • Fine-tuning: Optionally, fine-tune the generator on a relevant dataset to adapt it to the specifics of your task. This can improve relevance and accuracy in the generated text.

Step 11: Integration of Retriever and Generator

  • Retrieval Interface: Develop a mechanism where the generator can access retrieved documents. This usually involves querying the retriever with an input question or prompt and passing the results to the generator.

  • Context Incorporation: Design your system so that the generator can use the context from retrieved documents. This might involve concatenating the input with retrieved texts or embedding these texts into the model's input.

Step 12: Implementing the RAG System

  • Query Processing: When a query (e.g., a question) is received, use the retriever to fetch the most relevant documents.

  • Content Generation: Feed the query along with the retrieved documents into the generator. The generator synthesizes the inputs to produce a coherent and contextually enriched response.

  • Response Refinement: Implement a post-processing step to refine responses, which can include grammar corrections, relevance checking, and trimming unnecessary information.
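
Putting Steps 11 and 12 together, a minimal end-to-end sketch built on the dense retriever from Steps 4 and 5 (the generate helper stands in for whichever model or API you chose in Step 10, and the prompt template is an assumption):

def answer(query, k=5):
    # Retrieve the k most similar documents using the FAISS index from Step 4.
    query_vector = model.encode([query])
    _, indices = index.search(query_vector, k)
    retrieved = [documents[i] for i in indices[0]]
    # Incorporate the retrieved context into the prompt (Step 11).
    context = "\n\n".join(retrieved)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    # Generate and lightly post-process the response (Step 12).
    return generate(prompt).strip()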

Step 13: Evaluation and Optimization

  • Evaluation Metrics: Evaluate the performance of your RAG system using appropriate metrics such as BLEU for translation, ROUGE for summarization, or accuracy and F1 score for question answering.

  • Optimization: Based on evaluation results, fine-tune the retriever and generator. Adjustments might include changing the number of documents retrieved, tweaking the retrieval mechanism, or further training the generator.
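
As one concrete example of the metrics above, sentence-level BLEU can be computed with NLTK (the reference and candidate strings are placeholders; for question answering, exact-match or token-level F1 over answer strings is often more appropriate):

from nltk.translate.bleu_score import sentence_bleu

reference = "rag combines retrieval with generation".split()   # ground-truth answer tokens (placeholder)
candidate = "rag combines retrieval and generation".split()    # system output tokens (placeholder)

score = sentence_bleu([reference], candidate, weights=(0.5, 0.5))  # bigram BLEU for this short example
print(f"BLEU: {score:.3f}")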

Step 14: Deployment

  • Deployment Environment: Deploy your RAG system in an environment where it can handle real-time queries, if required. This might involve setting it up on a cloud server or within an enterprise environment.

  • Interface Design: Design user interfaces or API endpoints to interact with the system effectively.

Step 15: Continuous Learning and Updating

  • Feedback Loop: Establish mechanisms for capturing user feedback and system performance.

  • Model Updating: Periodically retrain both the retriever and generator with new data and feedback to keep the system current and improving over time.

By following these steps, you can effectively develop and maintain a RAG system tailored to your specific needs and challenges, leveraging the power of both retrieval and generation for advanced NLP tasks.