
Retrieval Augmented Generation: Increasing the Knowledge of Your LLM

Updated: May 25


Large Language Models (LLMs)[1] have revolutionized the field of natural language processing (NLP) by demonstrating the ability to perform a wide range of tasks with impressive accuracy and minimal additional work. These models, trained on vast corpora of text, can generate coherent and contextually relevant responses to diverse prompts, making them invaluable tools for various applications. However, there are significant challenges when it comes to utilizing LLMs with private or domain-specific data.

One of the primary challenges is that LLMs, by default, do not have access to private datasets unless they are explicitly trained on them. To leverage the power of LLMs on private data, one would typically need to fine-tune the model. Fine-tuning involves further training the pre-trained model on a specific dataset to adapt it to new information. This process, however, is resource-intensive. Fine-tuning a large model demands substantial computational power and time, which translates to high costs. Consequently, fine-tuning may not be a viable solution for many practical applications, especially for organizations with limited resources.

An alternative to fine-tuning is to provide the necessary data directly within the prompt. This approach involves including, in the prompt itself, the relevant information the model needs to generate a response. However, this method comes with its own set of challenges. A critical limitation is that LLMs can only process a limited amount of text at a time (their context window). Therefore, when injecting data into the prompt, it is crucial to select the information judiciously so that the model can generate accurate and relevant outputs without exceeding that capacity.
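Prompt injection under a size limit can be sketched as follows; the character budget, helper name, and passages are all made up for the illustration (real systems measure tokens rather than characters):

```python
# A minimal sketch of packing supporting passages into a prompt under a
# size budget. MAX_CONTEXT_CHARS is an illustrative stand-in for the
# model's context window.

MAX_CONTEXT_CHARS = 200

def build_prompt(question: str, passages: list[str]) -> str:
    """Pack as many supporting passages as fit, most relevant first."""
    header = f"Answer using the context below.\nQuestion: {question}\nContext:\n"
    budget = MAX_CONTEXT_CHARS - len(header)
    picked = []
    for passage in passages:
        if len(passage) + 1 > budget:
            break  # stop before overflowing the window
        picked.append(passage)
        budget -= len(passage) + 1
    return header + "\n".join(picked)

prompt = build_prompt(
    "What is the refund policy?",
    ["Refunds are issued within 14 days.",
     "Shipping takes 3-5 days.",
     "Our offices are closed on Sundays. " * 10],  # too long to fit
)
```

Passages that would overflow the budget are simply dropped, which is why selecting the most relevant ones first matters so much.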

The Solution: Retrieval Augmented Generation

To tackle the challenges associated with knowledge-intensive tasks and the limitations of traditional LLM usage, researchers at Meta AI introduced an innovative method known as Retrieval Augmented Generation (RAG)[2]. This approach seamlessly integrates information retrieval with text generation, providing a robust solution for enhancing the capabilities of LLMs without the need for extensive fine-tuning.

RAG operates by combining two primary components: an information retrieval system and a text generator model. Here’s how it works:

  1. Information Retrieval: When a prompt is provided, the retrieval component of RAG is activated. This system searches a designated source of information, such as your local set of documents, to find those that are relevant to and supportive of the given input. The retrieval process ensures that the most pertinent pieces of information are selected and tailored to the specific needs of the task.

  2. Contextual Integration: Once the relevant documents are retrieved, they are concatenated with the original input prompt. This step is crucial as it enriches the context available to the text generator model. By combining the input prompt with additional supportive information, the model is better equipped to generate accurate and contextually relevant outputs.

  3. Text Generation: The combined input prompt and retrieved documents are then fed into the text generator model. Leveraging the enhanced context, the model produces the final output, which is more informed and precise than what could be achieved by the LLM alone.

RAG pipeline

This method of combining retrieval and generation allows RAG to effectively utilize external knowledge sources, addressing the need for domain-specific information without the high costs associated with fine-tuning. By retrieving and incorporating relevant documents into the context, RAG enhances the LLM's ability to handle complex queries and generate high-quality responses.
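The three stages can be sketched in a few lines of Python. Here `retrieve` is a toy word-overlap scorer and `generate` merely echoes its prompt; both are stand-ins for a real retriever and a real LLM call:

```python
# A toy sketch of the three RAG stages. The documents, the overlap-based
# retriever, and the echoing generator are all illustrative assumptions.

DOCS = [
    "RAG combines retrieval with text generation.",
    "Fine-tuning adapts a pre-trained model to new data.",
    "HNSW enables fast approximate nearest neighbor search.",
]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Stage 1: pick the k documents sharing the most words with the query."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:k]

def generate(prompt: str) -> str:
    """Stage 3: placeholder for the actual LLM call."""
    return f"[LLM answer based on]: {prompt}"

def rag(query: str) -> str:
    context = retrieve(query, DOCS)             # 1. information retrieval
    prompt = query + "\n" + "\n".join(context)  # 2. contextual integration
    return generate(prompt)                     # 3. text generation

answer = rag("How does retrieval help text generation?")
```

Swapping the toy retriever for a semantic one and the echo for a real model call turns this skeleton into a working pipeline.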


The Advantages of RAG

RAG offers numerous advantages, making it a highly effective and adaptive method for handling knowledge-intensive tasks.

One of RAG's most significant benefits is its ability to adapt as facts and information change over time. Traditional LLMs rely on static parametric knowledge, which remains fixed after training. In contrast, RAG continuously retrieves the most up-to-date information from external sources, ensuring that the generated outputs reflect the latest facts and developments.

This adaptability is especially valuable in dynamic environments. Since the inherent knowledge of LLMs is static, RAG provides a dynamic approach by allowing models to access and incorporate the latest information without the need for retraining. This capability is particularly useful in fields such as news reporting, scientific research, and other domains where information is constantly evolving.

By bypassing the need for extensive retraining, RAG enables language models to generate reliable outputs based on the most recent data available. This retrieval-based approach ensures that the generated responses are grounded in current and relevant information, enhancing their accuracy and relevance.

Furthermore, RAG contributes to greater factual consistency in the outputs of language models. By retrieving and incorporating accurate, up-to-date information from credible sources, RAG minimizes the risk of generating outdated or incorrect responses, maintaining the integrity and reliability of the information presented.

The integration of retrieval mechanisms into the generation process improves the overall reliability of the responses produced by language models. By providing contextually relevant and accurate information, RAG enhances the credibility of the outputs, making them more trustworthy and useful for users.

Additionally, RAG helps mitigate the problem of hallucination[3], a common challenge with LLMs where the model generates plausible-sounding but incorrect or nonsensical information. By grounding the generation process in retrieved documents that provide a factual basis for the responses, RAG reduces the likelihood of hallucination, ensuring that the outputs are more accurate and coherent.


Finding the Right Data

A key aspect of Retrieval Augmented Generation (RAG) lies in its innovative approach to finding the specific data required for a given query. The main challenge of this technique centers around the information retrieval stage, where the model must effectively locate and utilize relevant data to generate accurate and contextually appropriate responses.

RAG operates in two primary scenarios regarding the nature of the data being retrieved: structured databases and unstructured databases.

When dealing with structured databases, such as those stored in Excel, SQL, JSON, or similar formats, the retrieval process is relatively straightforward. Structured data is organized in a predefined manner, making it easier to query and extract specific pieces of information. The challenge here is to develop efficient querying mechanisms that can quickly and accurately pull the necessary data from these well-defined sources. Structured data's inherent organization facilitates this process, allowing RAG to leverage precise information to enhance the generation process.

In contrast, retrieving information from unstructured databases presents a more complex challenge. Unstructured data can be found in formats such as Google Drive, Dropbox, or a collection of PDFs, where the information is not organized in a predefined schema. This type of data requires more sophisticated retrieval techniques, often involving natural language processing (NLP) and machine learning algorithms to parse, understand, and extract relevant content. The retrieval component in RAG must be capable of navigating these diverse and often unorganized sources to identify and pull out the necessary information for the task at hand.

In both scenarios, the effectiveness of RAG hinges on the ability to accurately retrieve the required data. For structured databases, this involves crafting efficient queries that can pinpoint the needed information with high precision. Unstructured databases require advanced retrieval algorithms capable of interpreting and extracting useful content from a myriad of formats and sources.


A Structured Data Example: A Bank Customer Service Bot

Consider implementing Retrieval Augmented Generation for a customer service bot at a bank. In this scenario, the database consists of structured data, which might include tables with customer information, transaction records, and frequently asked questions (FAQs).

For example, imagine a database structured as follows:

A structured database example about customer support data of a bank

When a customer asks, "I have lost my card," the RAG-enabled system processes this inquiry through several steps to provide an accurate and helpful response.

  1. Initial Request: The system first sends a request to the LLM, asking it to categorize the query. For example: “Which is the category and subcategory of this question: ‘I have lost my card’?”

  2. Categorization: The LLM responds with: “Category: Cards, Subcategory: Stolen.”

  3. Information Retrieval: Using this categorization, the system retrieves relevant information from the database about how to proceed with a lost or stolen card.

  4. Response Generation: The system then formulates a detailed response by combining the retrieved information with the customer’s query. It frames the request to the LLM like this: “You are part of the customer service staff of the Bank. Respond to the question delimited by ### using the information provided between &&&. ###I have lost my card### &&&[retrieved information]&&&.”

This is just one example of an implementation for a particular case of structured data. If your data is in SQL format, you can run a SQL query to get the needed information. The idea is to apply an index search to retrieve the relevant information. It is not rocket science; the interesting approach is with unstructured data!
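The four-step flow above can be sketched end to end. Here `KNOWLEDGE` is a hypothetical stand-in for the structured database, and `ask_llm` is a placeholder hard-wired to the categorization from the example:

```python
# Sketch of the bank customer-service flow. KNOWLEDGE and ask_llm are
# illustrative stand-ins; a real system would query its database and
# call an actual model.

KNOWLEDGE = {
    ("Cards", "Stolen"): "Block the card in the app, then request a replacement.",
    ("Accounts", "Opening"): "Visit a branch with a valid ID.",
}

def ask_llm(prompt: str) -> str:
    """Placeholder: a real system would send the prompt to the model."""
    return "Category: Cards, Subcategory: Stolen"

def answer_customer(question: str) -> str:
    # Steps 1-2: ask the LLM to categorize the query
    raw = ask_llm(f"Which is the category and subcategory of this question: '{question}'?")
    category = raw.split("Category: ")[1].split(",")[0]
    subcategory = raw.split("Subcategory: ")[1]
    # Step 3: index lookup into the structured data
    info = KNOWLEDGE[(category, subcategory)]
    # Step 4: final prompt with the delimiters used in the example
    return ("You are part of the customer service staff of the Bank. "
            "Respond to the question delimited by ### using the information "
            f"provided between &&&. ###{question}### &&&{info}&&&")

prompt = answer_customer("I have lost my card")
```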


Working with Unstructured Data

Handling unstructured databases is the most challenging scenario for RAG. Before the advent of LLMs, there were limited solutions for effectively managing unstructured data. Now, it is possible to interact with a diverse set of documents using advanced techniques.

The core idea involves vectorizing the corpus of documents to enable semantic search. By converting documents into vectors, we can apply semantic search based on the principle that vectors of similar sentences will be close to each other in the vector space (to dive deeper into this vector concept, read our article about “The Mathematics of Language”!). This allows the system to find and retrieve relevant information from unstructured sources, such as PDFs, Google Drive, or Dropbox, making it feasible to use RAG for a wide range of applications.
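The idea that similar sentences have nearby vectors can be illustrated with cosine similarity on tiny made-up vectors; real embeddings have hundreds or thousands of dimensions:

```python
import math

# Toy illustration of semantic closeness. These 3-d vectors are invented
# for the example: two card-related sentences point in a similar
# direction, the weather sentence does not.

def cosine(a, b):
    """Cosine similarity: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

vec_card_lost = (0.9, 0.1, 0.2)    # "I have lost my card"
vec_card_stolen = (0.8, 0.2, 0.3)  # "My card was stolen"
vec_weather = (0.1, 0.9, 0.1)      # "It is sunny today"

sim_related = cosine(vec_card_lost, vec_card_stolen)
sim_unrelated = cosine(vec_card_lost, vec_weather)
```

The related pair scores much higher, which is exactly the property semantic search exploits.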

An Unstructured Database example of a company’s internal information

The steps needed to apply this approach could be described as follows:

  1. Split Documents: Break your documents into smaller, manageable pieces to facilitate more precise retrieval.

  2. Generate Embeddings: Use the LLM to generate embeddings (vector representations) for each piece of text.

  3. Store Embeddings: Store all the embeddings in a vector database, which allows for efficient similarity searches.

  4. Embed the Question: When a query is received, embed the question to convert it into a vector representation.

  5. Find Nearest Embeddings: Search the vector database to find the N nearest embeddings to the query.

  6. Retrieve Corresponding Texts: Extract the text pieces corresponding to the nearest embeddings.

  7. Incorporate into Prompt: Include these text pieces in the prompt and ask the model to generate a response using this enhanced context.

By following these steps, you can leverage RAG to handle unstructured data efficiently, enabling the model to provide accurate and contextually relevant responses.
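The seven steps can be sketched end to end, assuming a toy bag-of-words counter in place of a real embedding model and a plain list as the "vector database":

```python
import math

# The seven-step unstructured pipeline in miniature. VOCAB, the
# bag-of-words embed(), and the list-based store are all illustrative
# assumptions; a real system uses a learned embedding model and a
# vector database.

VOCAB = ["card", "lost", "refund", "password", "reset", "account"]

def embed(text: str) -> list[float]:
    """Steps 2 and 4: toy embedding = counts of known words."""
    words = text.lower().replace(",", " ").replace(".", " ").replace("?", " ").split()
    return [float(words.count(w)) for w in VOCAB]

# Step 1: split documents into small pieces (one sentence each here)
pieces = [
    "Report a lost card immediately in the mobile app.",
    "Password reset links expire after one hour.",
    "Refunds are processed within 14 days.",
]

# Step 3: store every embedding in the "vector database"
vector_db = [(embed(p), p) for p in pieces]

def answer_with_rag(question: str, n: int = 1) -> str:
    q = embed(question)                                                # step 4
    nearest = sorted(vector_db, key=lambda e: math.dist(e[0], q))[:n]  # step 5
    context = [text for _, text in nearest]                            # step 6
    return question + "\nContext: " + " ".join(context)                # step 7

prompt = answer_with_rag("I lost my card, what should I do?")
```

The final prompt carries the retrieved piece as context, ready to be sent to the LLM for generation.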

How to Find the Nearest Neighbors 

Finding the nearest neighbors is a crucial step. Applying a classic k-nearest neighbors (KNN)[4] algorithm can be computationally expensive and inefficient, particularly with large datasets. Instead, RAG employs a more sophisticated approach known as Approximate Nearest Neighbors (ANN)[5].
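For contrast, exact brute-force KNN looks like this; every query computes a distance to every stored point, which is precisely the linear cost that approximate methods are designed to avoid:

```python
import math

# Exact k-nearest neighbors by brute force: O(N) distance computations
# per query. Fine for a handful of points, prohibitive for millions of
# embeddings.

def knn(query, points, k):
    """Return the k stored points closest to `query`."""
    return sorted(points, key=lambda p: math.dist(p, query))[:k]

points = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.1), (5.0, 5.0)]
nearest = knn((0.0, 0.1), points, k=2)
```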

The ANN method used in RAG leverages the heuristic Hierarchical Navigable Small Worlds (HNSW)[6]. This heuristic is inspired by the concept of "small worlds" from social science, which refers to networks characterized by short paths between nodes and a high degree of local clustering.

Here’s how HNSW works to efficiently find the nearest neighbors:

  1. Graph Construction: The HNSW algorithm transforms the vector dataset into a graph structure. Each vector (representing a piece of text) is a node in this graph. The edges between nodes are determined based on their similarity, creating connections between nodes that are close to each other in the vector space.

  2. Small World Properties: The resulting graph exhibits properties of a small world graph, where most nodes can be reached from any other node by a small number of steps. This is achieved by maintaining a hierarchical structure and local connectivity, which ensures that both global and local search spaces are navigable.

  3. Efficient Search: During the search process, the algorithm navigates through the graph to find approximate nearest neighbors efficiently. It uses a combination of greedy search within local neighborhoods and hierarchical traversal to quickly converge on the nearest nodes. This approach significantly reduces the computational cost compared to exhaustive KNN searches.

  4. Scalability and Performance: HNSW is highly scalable and performs well even with large datasets. Its efficiency in finding nearest neighbors makes it suitable for real-time applications where speed is critical.
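The greedy search at the heart of HNSW can be sketched as follows. This is not the full algorithm: the hierarchy of layers and the candidate lists are omitted, and `build_graph` itself uses brute force here purely for simplicity:

```python
import math

# Core HNSW ingredient only: greedy search over a proximity graph.
# Each node links to its m nearest neighbors; a query walks the graph,
# always moving to the closest neighbor, and stops at a local minimum.
# Real HNSW stacks layers of such graphs to make the walk logarithmic.

def build_graph(points, m=2):
    """Connect every point to its m nearest neighbors (brute force here)."""
    graph = {}
    for i, p in enumerate(points):
        others = sorted((j for j in range(len(points)) if j != i),
                        key=lambda j: math.dist(p, points[j]))
        graph[i] = others[:m]
    return graph

def greedy_search(query, points, graph, entry=0):
    """Walk toward the query until no neighbor is closer (approximate!)."""
    current = entry
    while True:
        best = min(graph[current], key=lambda j: math.dist(points[j], query))
        if math.dist(points[best], query) < math.dist(points[current], query):
            current = best
        else:
            return current  # local minimum: the approximate nearest neighbor

points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
graph = build_graph(points)
found = greedy_search((3.9, 0.0), points, graph)
```

Because the walk only ever inspects a node's neighbors, each query touches a small fraction of the dataset, at the cost of occasionally settling in a local minimum.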

The use of HNSW in RAG ensures that the retrieval process is both accurate and efficient, enabling the system to quickly find the most relevant pieces of information. This, in turn, enhances the quality of the generated responses by providing the model with the best possible context.


Conclusion

Retrieval Augmented Generation represents a significant advancement in the field of natural language processing, addressing many of the limitations inherent in traditional LLMs. By seamlessly integrating information retrieval with text generation, RAG enhances the ability of language models to handle complex, knowledge-intensive tasks without the need for costly and time-consuming fine-tuning.

RAG's strength lies in its innovative approach to retrieving relevant information. Whether dealing with structured databases, where data can be easily queried and retrieved, or unstructured databases, where advanced techniques like Hierarchical Navigable Small Worlds (HNSW) are employed, RAG ensures that models are provided with the most accurate and contextually relevant information. This method allows RAG to adapt to dynamic environments and evolving facts, maintaining the accuracy and reliability of generated responses.

The advantages of RAG are clear: it enables language models to bypass retraining, access the latest information, and produce responses with greater factual consistency. This is particularly important for applications such as customer service, where accurate and timely information is crucial. Additionally, RAG helps mitigate the problem of hallucination by grounding generated responses in retrieved, fact-based information.

In summary, RAG offers a robust and efficient solution for enhancing the capabilities of language models. By combining retrieval with generation, it ensures that models can adapt to changing information landscapes, maintain factual accuracy, and deliver high-quality, trustworthy outputs. As the field of NLP continues to evolve, techniques like RAG will play an increasingly important role in developing intelligent, responsive, and reliable AI systems.


[2] Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33.

[3] What are AI hallucinations?

[4] Cover, T., & Hart, P. (1967). Nearest neighbor pattern classification. IEEE transactions on information theory, 13(1), 21-27.

[5] Indyk, P., & Motwani, R. (1998, May). Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing (pp. 604-613).

[6] Malkov, Y. A., & Yashunin, D. A. (2018). Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE transactions on pattern analysis and machine intelligence, 42(4), 824-836.

