adesso Blog

Retrieval Augmented Generation

With the introduction of ChatGPT and the like, the triumphant advance of Large Language Models (LLMs) in many areas can hardly be stopped. In addition to creative applications such as text generation, the ability to ask questions of your own domain-specific data and documents is of great interest to industry and public administration. A methodological framework that enables precisely this use case is so-called Retrieval Augmented Generation (RAG, see the blog post ‘Retrieval Augmented Generation: LLM on steroids’). Despite the many possible applications of RAG, there are typical problems, and this article presents solutions to them.

Retrieval Augmented Generation solves an important problem of large language models. Although they have been trained on a broad knowledge base, they cannot make any statements about domain-specific knowledge, such as internal company data. One valid option is to subsequently fine-tune LLMs on precisely this data. However, this approach is expensive and involves a great deal of effort. With a RAG framework, the additional domain knowledge is instead passed to the language model externally as context, together with the user's query. To do this, the domain knowledge is first broken down into smaller sections (so-called chunks, Figure 1, II), then vectorised with the help of a so-called embedding model (Figure 1, III) and stored in a vector database (Figure 1, IV). The advantage of vectorisation is that the content, which can consist of thousands to many millions of sections, can be searched much more efficiently in response to user queries. If a user submits a query to the data (Figure 1, 1), this query is also vectorised (Figure 1, 2) and compared with the entries in the vector database using a similarity search (Figure 1, 3). The most similar entries are then sorted (ranked) and passed to a language model together with the query (Figure 1, 4a and 4b), which finally generates a response (Figure 1, 5).


Figure 1: Naive RAG framework and some problems that can occur in each step.
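
The following minimal sketch shows this naive pipeline with LlamaIndex, the framework also used in the code examples later in this post. It is an illustration under assumptions, not a reference implementation: it presumes an OpenAI API key is set, that the source documents lie in a local "data" folder, and it uses the LlamaIndex default models.

# Minimal sketch of the naive RAG flow from Figure 1 (assumptions: OPENAI_API_KEY is set,
# source documents are stored in a local "data" folder, LlamaIndex defaults are used)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()      # I: load source documents
index = VectorStoreIndex.from_documents(documents)         # II-IV: chunk, embed, store
query_engine = index.as_query_engine()                     # 2-4: embed query, retrieve, rank
print(query_engine.query("What did the author do growing up?"))  # 5: generate the answer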

Challenges

In recent months, RAG has become the buzzword par excellence when it comes to the possible applications of LLMs. It promises a lot, but all too often the user realises after implementing a corresponding pipeline that the results do not quite meet expectations. Like many AI-based tools, RAG falls into the ‘easy to learn but difficult to master’ category. Often, the LLM's response does not match the query, or only insufficiently.

There can be many reasons for this. Some of them are illustrated below (see also Figure 1).

  • A) Loss of information after vectorisation: It is conceivable, for example, that the most relevant sections are not found during the similarity search, known as retrieval. One reason for this is the vectorisation itself. The performance gain, i.e. the speed of the search, comes at the cost of a high degree of compression of the text sections. Each section is typically converted into a single 768- or 1536-dimensional vector, which can lead to a loss of information during the similarity search. This is particularly the case because, at the time of embedding, it is not yet clear which information will be relevant for a later query and which can be compressed away.
  • B) Chunks that are too small and too little context: Another problem occurs if the chunks found are relevant, but the source documents were broken down into sections that are too small during pre-processing. In this case, the LLM may have too little information or context available to generate a complete, relevant and correct answer.
  • C) Dilution of the context due to chunks that are too large: Conversely, too much information, for example because too many or overly long chunks are found during retrieval, can also lead to problems when the LLM generates an answer. The ability of an LLM to extract and process information from the text chunks in the given context is referred to as LLM recall. It has been shown that LLMs have particular difficulty finding information that occurs in the middle of the context. For this reason, the so-called ‘needle-in-a-haystack’ test is now part of the repertoire of LLM benchmarks.
  • D) Queries that are too complex or specific: Another cause of retrieval problems can be prompts or queries that are too complex or too specific. These can lead to problems in the similarity search if, after compression during vectorisation, rather irrelevant aspects of the prompt are compared with the contents of the source database.
  • E) Semantic distance between question and answer: Since the relevant information for answering a user query is obtained by a similarity search between the query and the often factual statements of the source documents, the fact that user queries are usually formulated as questions can also pose a problem. Regardless of the factual content, questions and statements already exhibit a certain semantic distance, which makes the comparison fundamentally more difficult (see the short sketch after this list).
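
To make problem E tangible, the following sketch compares the embedding of a question with that of a matching statement. The embedding model, the example texts and the helper function are illustrative assumptions of ours; an OpenAI API key and numpy are required.

# Illustrative sketch of problem E: even with matching factual content, a question and a
# statement embed at a noticeable semantic distance (model and texts are assumptions)
import numpy as np
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding()  # default OpenAI embedding model, 1536 dimensions

question = "How many updates did LlamaIndex receive in the last three months?"
statement = "LlamaIndex received four updates in the last three months."

def cosine(a, b):
    # cosine similarity between two embedding vectors
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q_vec = embed_model.get_text_embedding(question)
s_vec = embed_model.get_text_embedding(statement)
print(f"Question vs. statement similarity: {cosine(q_vec, s_vec):.3f}")  # typically well below 1.0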

Solution approaches

The pipeline shown in Figure 1 only represents the basic structure of a RAG framework. It can work, especially for simple use cases. But it doesn't have to! While LLMs themselves are mainly developing vertically (with the exception of concepts such as the so-called Mixture-of-Experts models) and are being trained with more and more parameters, RAG has mainly developed horizontally. Due to the great interest and the now widespread use of RAG systems, the problems described are being addressed by the development of ever new methods. Four such approaches are presented below.

Reranking

The first approach to optimising RAG deals with problem A, i.e. incorrect or suboptimal retrieval. Since it cannot be known at the time of embedding the source documents which information in a text section could be relevant in the context of a user query, highly relevant information can be lost during vectorisation. As a result, potentially relevant text sections are ranked much lower in the similarity search than they actually should be and are therefore not even passed to the LLM. One conceivable solution could be to simply pass more text sections to the LLM in order to increase the probability of including the relevant ones. However, besides problem C described above (‘dilution of the context’), the model-dependent maximum context size (the LLM's input limit) also rules out this approach. Instead, an attempt is made to maximise the retrieval recall (the proportion of relevant sections found) by retrieving as many potentially relevant documents as possible, while at the same time maximising the LLM recall by passing only a few highly relevant documents to the LLM.

This last step is made possible by so-called reranking or cross-encoding models. These generate a meaningful relevance score for each retrieved document by embedding the user query and the respective document together. In contrast to the initial embedding in the classic RAG system, the specific context of the user query can thus be taken into account when embedding the documents. The disadvantage, however, is that this embedding takes place at inference time and therefore leads to significantly higher query latencies. In a final step, the documents determined by the similarity search are sorted according to their reranking score and the k most relevant documents are passed to the LLM.
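
Independently of a specific RAG framework, the principle can be sketched with the sentence-transformers library. The model name is an illustrative assumption, and the candidate chunks are taken from the essay excerpt quoted later in this post; this is a sketch of the technique, not the exact setup used in our examples.

# Hedged sketch of reranking with a cross-encoder (model choice is an assumption;
# requires the sentence-transformers package)
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What did the author do growing up?"
candidates = [
    "Before college the two main things I worked on, outside of school, were writing and programming.",
    "My stories were awful. They had hardly any plot, just characters with strong feelings.",
    "What I Worked On February 2021",
]

# The cross-encoder scores each (query, document) pair jointly instead of comparing
# two independently compressed vectors
scores = reranker.predict([(query, doc) for doc in candidates])

# Keep only the k most relevant chunks for the LLM
top_k = sorted(zip(scores, candidates), reverse=True)[:2]
for score, doc in top_k:
    print(f"{score:.2f}  {doc}")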

Reranking coding example

We utilise the experience from our first blog post on RAG (see above) and once again rely on Python and LlamaIndex as the framework for connecting to AI models and algorithms. The basics for an executable environment can be found in the LlamaIndex documentation.

We start by creating or importing the vectorised data.

	
import os

from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

# check if storage already exists
PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # load the documents and create the index
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents)
    # store it for later
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # load the existing index
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)
	

This code is used to create a new vector store or read in an existing one. The index object provides us with a document structure that we can then use to create a query object query_engine.

	
from llama_index.core.postprocessor import LLMRerank

query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[
        LLMRerank(
            choice_batch_size=5,
            top_n=3,
        )
    ],
    response_mode="tree_summarize",
)
response = query_engine.query("What did the author do growing up?")
print("Reranked response:")
print(response)
	

The procedure is as follows:

We first search for ten relevant chunks using the similarity search. Normally, this number is significantly lower, for example three or five. With the larger number, we optimise our retrieval recall and can make better use of the reranking.

In a post-processing step, we create a post-processor that invokes the reranking and then returns the three most relevant hits.

There are various approaches to how the reranking itself can be implemented. In the LLMRerank module, for example, an additional language model is called with a corresponding prompt in order to re-evaluate the previously found hits.
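
As an alternative sketch of ours (not used in the example above), the LLM-based reranker could be swapped for a local cross-encoder via LlamaIndex's SentenceTransformerRerank post-processor; the model choice is an assumption and the sentence-transformers package is required.

# Alternative reranking post-processor using a local cross-encoder model
from llama_index.core.postprocessor import SentenceTransformerRerank

rerank = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-6-v2",  # illustrative model choice
    top_n=3,
)
query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[rerank],
)
print(query_engine.query("What did the author do growing up?"))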

Context expansion/enrichment

Problems B and C can be compensated for by so-called context expansion or context enrichment. A relatively simple approach is to limit the embedding for the similarity search to individual or only a few sentences, thereby increasing the relevance of the content found. However, when the found sections are passed to the language model, the surrounding sentences are passed along as well in order to enrich the context for answer generation (‘enrichment’). The number of surrounding sentences, i.e. the scope of the additional content, is an (optimisable) parameter.

The so-called parent-child chunking approach goes one step further. Here, chunking does not use a single fixed chunk size as a parameter, but at least one larger (‘parent’) and one smaller (‘child’) section size. The child chunks that are part of a parent chunk are linked to it in the (vector) database. The smaller and therefore more precisely matching sections are used for the similarity search. However, if several child chunks of the same parent chunk are found, the parent chunk is passed to the LLM for answer generation instead of several smaller sections.
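
One way to sketch this idea in LlamaIndex is via its hierarchical node parser and auto-merging retriever. This is our illustration rather than the approach used later in this post; the chunk sizes and the similarity_top_k value are assumptions.

# Hedged sketch of parent-child chunking with hierarchical nodes and auto-merging retrieval
from llama_index.core import SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever
from llama_index.core.query_engine import RetrieverQueryEngine

documents = SimpleDirectoryReader("data").load_data()

# parse into parent and child chunks (sizes are illustrative)
parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = parser.get_nodes_from_documents(documents)

# keep all nodes in the docstore so that parents remain retrievable
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

# only the small leaf chunks are embedded and used for the similarity search
leaf_index = VectorStoreIndex(get_leaf_nodes(nodes), storage_context=storage_context)

# if several child chunks of one parent are found, the parent chunk is returned instead
retriever = AutoMergingRetriever(leaf_index.as_retriever(similarity_top_k=6), storage_context)
query_engine = RetrieverQueryEngine.from_args(retriever)
print(query_engine.query("What did the author do growing up?"))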

Context expansion/enrichment coding example

For the code example, we remain in the familiar development environment. However, we need to adapt the structure of the vector database slightly. To do this, we use a parser before creating the vector database that not only creates the chunks, but also stores each chunk's surroundings (the sentences before and after it) as metadata, depending on the parameters passed.

	
from llama_index.core.node_parser import SentenceWindowNodeParser

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
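
The original post does not show how the index is built from these nodes. A minimal sketch, assuming the documents are loaded from the "data" folder as before, could look like this; the variable name sentence_index matches the query code below.

# build the sentence-window index used in the query example below (our sketch)
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()
nodes = node_parser.get_nodes_from_documents(documents)
sentence_index = VectorStoreIndex(nodes)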
	

Take the following sentence, for example:

“I didn't write essays.”

The corresponding sentence window is:

“What I Worked On February 2021
Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.”

As part of the search query, the chunk found is enriched with its sentence window by the post-processor in a post-processing step and passed to the language model. With this enriched text, the language model can generate a significantly more relevant answer.

	
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
response = query_engine.query("What did the author do growing up?")
print("Response:")
print(response)
	
Query transformation

Problem D, overly complex queries, can be tackled using the query transformation technique, among others. There are various approaches to this technique as well, but they all have in common that an LLM is used to modify or adapt the user query. If the complexity of a query results from the fact that it essentially consists of several sub-questions, an LLM can be used in an intermediate step to split the original query into these sub-questions (‘sub-query decomposition’). The sub-questions are then first processed individually by the similarity search and finally brought together again in a final step into a single answer from the LLM.

In step-back prompting, an LLM is again used to derive a more general question from the specific user query. This more general question is also processed in the similarity search in order to give the LLM more context for answering the actual question.
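
A minimal sketch of step-back prompting, reusing the llm and query_engine objects set up as in the other examples, might look as follows; the prompt wording and the example query are our assumptions.

# Step-back prompting sketch: derive a more general question, retrieve context for it,
# then answer the original, specific question (example query is hypothetical)
specific_query = "Which short stories did the author write before college?"

step_back_prompt = (
    "Formulate a single, more general question that provides useful background "
    f"for answering this specific question: '{specific_query}'"
)
general_question = str(llm.complete(step_back_prompt))

# similarity search on the more general question to gather broader context
background = query_engine.query(general_question)

final_answer = llm.complete(
    f"Background information: {background}\n"
    f"Using this background, answer the following question: '{specific_query}'"
)
print(final_answer)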

Query transformation coding example

For the following code example, we want to answer the following question:

‘Which project, Langchain or LlamaIndex, has had more updates in the last 3 months?’

To answer it, we first need to answer the sub-questions of how many updates Langchain and LlamaIndex each had in the last three months and then combine the individual answers. For the code example, we assume that we have stored two PDF documents, ‘Updates_LangChain’ and ‘Updates_LlamaIndex’, in a ‘data’ folder. The ‘Updates_LangChain’ document could then contain the information that seven updates were carried out in the last three months, for example. Similarly, the other document would contain the information that four updates were carried out in the same period. First, we load our modules and environment variables again:

	
import os

from dotenv import load_dotenv
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.openai import OpenAI

load_dotenv('.env')
openai_api_key = os.getenv('openai_api_key')
os.environ["OPENAI_API_KEY"] = openai_api_key

# Then we initialise our LLM and load our documents into the vector store:
llm = OpenAI()
# Create the index from the documents.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
	

We then define our initial query, formulate a prompt to split it into sub-queries and save these as a list:

	
# The complex query
complex_query = "Which project, Langchain or LlamaIndex, has had more updates in the last 3 months?"

# Break the complex query down into simpler sub-queries
sub_queries_prompt = (
    f"Break the following complex prompt down into a series of simpler, "
    f"more specific prompts that can be answered individually. Return the sub-queries as a list: "
    f"'{complex_query}'"
)

# Generate the sub-queries with the LLM
sub_queries = llm.complete(sub_queries_prompt)
sub_queries = str(sub_queries)
print(sub_queries)
sub_queries_list = [line.strip() for line in sub_queries.split('\n') if line.strip()]
	

Finally, the sub-queries are queried individually via a loop, the answers are aggregated and passed to an LLM together with the initial question for answering:

	
# Create the query engine object
query_engine = index.as_query_engine()

# Process each sub-query and store the answers
responses = []
for sub_query in sub_queries_list:
    response = query_engine.query(sub_query)
    responses.append(response)

# Merge the answers and generate the final answer
final_query = f"Here are the answers to the individual parts: {responses}. "
final_response = llm.complete(
    f"Use the following information: '{final_query}' to answer the following question, including a justification: '{complex_query}'"
)

# Output the final answer
print("Final answer:")
print(final_response)

The output then reads:

Final answer:
Langchain had more updates than LlamaIndex in the last 3 months. This follows from the information that Langchain had 7 updates, while LlamaIndex recorded only 4. Langchain is therefore the project with the most updates in this period.
	

Hypothetical Questioning/HyDe

The problem of the low semantic proximity between user questions and the factual statements in the source documents can be solved in two ways. One approach (so-called hypothetical questioning) consists of using an LLM to generate a hypothetical question for each chunk and embedding it in place of the chunk. In this way, the user questions are no longer compared with statements of fact but with questions, which should improve the semantic proximity. The reverse procedure is also conceivable with the so-called HyDe approach (‘Hypothetical Document Embedding’). Here, a hypothetical document, i.e. a hypothetical answer to the question, is generated from the user query and this in turn is compared with the source documents.
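
LlamaIndex ships a query transform for the HyDe approach. A minimal sketch, reusing the index from the examples above, might look like this; it is an illustration of the technique rather than a tested configuration.

# HyDe sketch: generate a hypothetical answer document per query and use it for retrieval
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

hyde = HyDEQueryTransform(include_original=True)  # also keep the original query
hyde_query_engine = TransformQueryEngine(index.as_query_engine(), query_transform=hyde)
print(hyde_query_engine.query("What did the author do growing up?"))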

Outlook

This blog post has shown which techniques can be used to extend a classic RAG architecture in order to address common problems in LLM-based searches of domain-specific source data. In addition to these algorithmic approaches, there are also more fundamental strategies for improving the performance of classic RAG systems. One possibility is to use graph databases instead of conventional vector databases to store the domain-specific source documents. This makes it possible to provide the LLM not only with factual content but also with the relationships between different pieces of content when generating answers. The concept of so-called GraphRAG will be presented in the next blog post.

Would you like to find out more about exciting topics from the world of adesso? Then take a look at our previous blog posts.


Author Immo Weber

Immo Weber holds a habilitation and is a Senior Consultant at adesso, specialising in AI, GenAI and data science in public administration.


Author Sascha Windisch

Sascha Windisch is Competence Centre Manager at adesso and a consultant specialising in business and software architecture, AI, GenAI and data science, requirements analysis and requirements management for complex distributed systems.
