adesso Blog

Open-source information retrieval – a data protection-compliant alternative

The insurance industry faces the challenge of efficiently processing large volumes of text-based documents such as policies, claims reports, contract amendments and customer correspondence. These often unstructured documents require time-consuming manual processing, which ties up resources. Technologies such as Large Language Models (LLMs) offer new automation possibilities here: they can categorise documents, extract information and integrate relevant data into processes. This saves costs, increases efficiency and minimises errors.

An important application of LLMs is document surveying, in which content-related questions about documents are answered using a retrieval-augmented generation (RAG) system. However, there is often a lack of objective evaluation of the accuracy of the answers. Furthermore, the question arises as to how models can be adapted to industry-specific requirements through fine-tuning, and how such adaptations can be evaluated. Many solutions are based on systems like OpenAI's GPT family, which are powerful but can pose privacy risks depending on the provider and implementation: data is often transferred to external providers, which, especially in the insurance industry with its sensitive customer data, could violate data protection laws such as the GDPR (General Data Protection Regulation).

For these reasons, we have been looking at how we can optimise a RAG system and create a basis for data protection compliance. In this blog post:

  • We explain the concept of RAG systems
  • We show optimisation and fine-tuning methods
  • We compare our chosen open-source model with OpenAI's proprietary text-embedding-ada-002 model
  • We highlight the strengths and potential of open-source RAG systems specifically for the insurance industry

RAG systems for document surveys: a solution for the insurance industry

RAG systems combine language models (LLMs) with retrieval methods: the retriever searches documents for relevant passages that the language model processes. These systems are ideal for document surveys in the insurance industry, where targeted questions about documents are asked and automatically answered. We determined the optimal retriever model for our optimised RAG system by means of comparative tests.
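
The retrieve-then-generate flow can be sketched in a few lines of Python. The toy `embed` function below is purely illustrative (a character-frequency vector); a real system would use a trained embedding model, and the final answer would come from a language model rather than a formatted prompt:

```python
from math import sqrt

def embed(text: str) -> list[float]:
    # Toy embedding: normalised character-frequency vector.
    # A real retriever would use a trained embedding model instead.
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    counts = [text.lower().count(c) for c in alphabet]
    norm = sqrt(sum(x * x for x in counts)) or 1.0
    return [x / norm for x in counts]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalised, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    # Rank all document passages by similarity to the question embedding
    # and return the top k hits.
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def answer(question: str, documents: list[str]) -> str:
    # In a full RAG system an LLM would generate the answer from the
    # retrieved context; here we only assemble the prompt to show the flow.
    context = "\n".join(retrieve(question, documents))
    return f"Question: {question}\nContext:\n{context}"
```

The division of labour is the point: the retriever narrows the search space, and the language model only ever sees the handful of passages it returns.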

For developing and testing our RAG system, we used a dataset that is very similar to insurance documents in structure and content. The documents come from the ‘Service Finder’ of the Munich city administration, where information about the services offered by the city is presented online, ranging from applying for an ID card to waste disposal. Like insurance documents, these are texts written in formal German, and they contain industry-specific knowledge that goes beyond general language model knowledge. A matching question-answer (QA) dataset contains questions and the corresponding answers for the documents, providing a suitable basis for evaluating the models.

Our focus was on optimising the retriever, as it forms the basis for the accuracy of the entire system. The results of our tests show that targeted fine-tuning can significantly improve performance. We describe how we went about this in the next section. Our RAG system is an efficient alternative to proprietary solutions with data protection risks. By using an open-source model, we were able to design a cost-efficient solution and enable the basis for a system that is compliant with data protection regulations, as companies can fully control data processing internally. This is ensured by self-hosting on company-owned servers, the absence of external APIs and the option of complete data deletion.

Optimisation of a RAG system: Our methodology

We systematically optimised the accuracy of the system in several steps. To measure the effectiveness of each individual optimisation, we used a question-answer data set for evaluation. This data set contains realistic questions that are aligned with the content of the documents used. Accuracy was used as the key metric, checking how many of the generated answers correctly matched the actual answers defined in the dataset.
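
As a minimal sketch, the accuracy metric can be expressed as follows; the exact matching criterion we used in the evaluation is more involved than the plain string comparison shown here:

```python
def retrieval_accuracy(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of questions whose generated answer matches the gold answer.

    Both arguments map a question to an answer string. Matching here is
    normalised exact equality - a simplification of the check described
    in the text.
    """
    correct = sum(
        1 for question, gold_answer in gold.items()
        if predictions.get(question, "").strip().lower()
        == gold_answer.strip().lower()
    )
    return correct / len(gold)
```

Because every optimisation step is scored with the same function on the same QA dataset, the accuracy numbers in the following sections are directly comparable.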

This evaluation procedure was used to objectively assess the results of each step and to quantify the progress made in optimising the retriever. The results of the hyperparameter optimisation, the fine-tuning and the comparison with proprietary models are based on this methodological approach.

Step 1: Hyperparameter optimisation

First, we started with hyperparameter optimisation, in which pre-set parameters of a model are adjusted to maximise its performance – without additional training. In contrast to classic training, in which the model learns from the data, hyperparameter tuning aims to modify the parameters so that the model is optimally adapted to the requirements.

For this process, we chose a pre-trained open-source embedding model as a retriever, based on RoBERTa and optimised for the German language. We experimented with:

  • Chunk size and overlap: influence how documents are broken down into smaller sections.
  • Parameter k: determines the number of hits that the retriever returns per query.
  • Splitting strategies: efficient methods for document decomposition.
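
A simplified word-based splitter illustrates how chunk size and overlap interact; production splitters typically work on characters or tokens and respect sentence boundaries, and the parameter k then simply caps how many of these chunks the retriever returns per query:

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into word-based chunks of `chunk_size` words, where
    consecutive chunks share `overlap` words.

    A deliberately reduced stand-in for the splitting strategies we tuned.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already covers the end of the text
    return chunks
```

Larger chunks carry more context per hit but dilute the embedding; larger overlap reduces the risk of splitting an answer across two chunks at the cost of more chunks to index. These are exactly the trade-offs the tuning explores.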

Result: Through targeted adjustments, we improved accuracy from 68 per cent to 86 per cent, demonstrating the effectiveness of hyperparameter optimisation. Remarkable results can thus be achieved without any additional training, simply by selecting the right open-source model and tuning its hyperparameters.

Step 2: Fine-tuning the retriever

After demonstrating how high accuracy can be achieved through hyperparameter optimisation alone, we went on to investigate how fine-tuning further improves performance. Fine-tuning refers to the process of training a pre-trained model further with additional specific data to better adapt it to the requirements of a specific task or domain. There are various techniques for doing this, such as masked language modelling (MLM) or contrastive learning, each of which takes a different approach to improving model performance.

We used a small, general-purpose BERT model with only 100M parameters, which initially achieved an accuracy of just 8 per cent. Two different approaches were used:

1. Masked Language Model (MLM) training

In MLM training, the model is trained to mask certain words in a sentence and to fill the gaps correctly. This helps the model to better understand the specific vocabulary and terminology of the insurance industry. We used 1,000 random sentences from the documents.
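
The data preparation at the heart of MLM training can be illustrated as follows. This is a deliberately reduced sketch: real implementations, such as Hugging Face's data collators, additionally replace some of the selected tokens with random words or leave them unchanged:

```python
import random

def mask_tokens(
    tokens: list[str],
    mask_prob: float = 0.15,
    mask_token: str = "[MASK]",
    seed: int = 0,
) -> tuple[list[str], dict[int, str]]:
    """Hide roughly `mask_prob` of the tokens and record the originals.

    During MLM training the model sees the masked sequence and is trained
    to predict the hidden originals, which forces it to learn the
    surrounding (here: insurance-specific) vocabulary.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, token in enumerate(tokens):
        if rng.random() < mask_prob:
            targets[i] = token       # what the model must reconstruct
            masked.append(mask_token)
        else:
            masked.append(token)
    return masked, targets
```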

Result: Accuracy increased from an initial 8 per cent to 28 per cent. This shows that MLM training teaches the model the domain vocabulary, but on its own is not sufficient for strong retrieval accuracy.

2. Contrastive Learning

Contrastive Learning trains the model to match similar pairs of questions and contexts and to separate dissimilar ones. To do this, question-context pairs from a QA data set were used with the MultipleNegativesRankingLoss function to train the model. A total of 1,700 question-answer pairs were available for this purpose.
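
The idea behind MultipleNegativesRankingLoss can be shown in a simplified, dependency-free form: within a batch, each question's own context is the positive, all other contexts in the batch act as negatives, and the loss is the cross-entropy of picking the right one:

```python
from math import exp, log

def in_batch_negatives_loss(
    question_embs: list[list[float]],
    context_embs: list[list[float]],
) -> float:
    """Simplified sketch of the loss behind MultipleNegativesRankingLoss.

    `question_embs[i]` and `context_embs[i]` form the i-th positive pair;
    every other context in the batch serves as a negative for question i.
    The loss is the mean negative log-probability of the correct context.
    """
    def dot(a: list[float], b: list[float]) -> float:
        return sum(x * y for x, y in zip(a, b))

    total = 0.0
    for i, q in enumerate(question_embs):
        scores = [dot(q, c) for c in context_embs]
        # log-softmax of the score for the matching context
        log_prob_correct = scores[i] - log(sum(exp(s) for s in scores))
        total += -log_prob_correct
    return total / len(question_embs)
```

Minimising this pushes each question embedding towards its own context and away from the others, which is exactly the geometry a retriever needs.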

Result: Accuracy increased to an impressive 70 per cent, demonstrating the effectiveness of contrastive learning for retrieval tasks.

In addition, the RoBERTa model from step 1 was fine-tuned in the same way. However, since this model is larger and, at 400M parameters, needs correspondingly more training data, it reached only 85 per cent after fine-tuning, roughly the level it had already achieved through hyperparameter optimisation alone. Overall, it has been shown that models can be fine-tuned to specific data even if they have no previous experience in the respective domain. The amount of data required for this depends mainly on the size of the model, which makes it possible to tailor the adaptation flexibly to different needs.

Step 3: Comparison with proprietary models

Finally, we compared our open-source model with OpenAI's proprietary text-embedding-ada-002 model, referred to in the following as the Ada model. Proprietary models like the Ada model are fully controlled by the provider, which makes users dependent on its infrastructure and pricing and can limit flexibility and cost structure. Open-source models such as ours, on the other hand, offer more control and, depending on the model, more customisability, as they can be hosted and modified by the user. Our model is also smaller than the Ada model: it was trained exclusively on German and has a shorter context length, so it can process fewer words at once, whereas the Ada model was trained on many languages.

Despite its smaller size, our model achieves comparable performance and can be operated cost-effectively on its own hardware. In our tests, both models were tested on the QA data set and the accuracy was calculated. The Ada model achieved an accuracy of 90 per cent, while our open-source model achieved 86 per cent. The slightly higher accuracy of the Ada model could be due to its larger training data set, support for multiple languages and larger context length. Nevertheless, our open-source approach offers decisive advantages:

  • Approach to GDPR compliance: Data remains under your own control.
  • Control over the development process: Adaptation to specific requirements.
  • Cost efficiency: Lower operating costs by hosting on your own infrastructure.

These advantages make open-source models an attractive alternative for use in the insurance industry.

Open-source RAG systems: The future of text processing in the insurance industry

Our work shows that open-source RAG systems are a powerful and future-oriented alternative to proprietary solutions, especially for data-sensitive industries such as insurance. The test results show that targeted hyperparameter optimisation alone already improves accuracy substantially, while fine-tuning delivers significant gains even for smaller models with limited data sets.

Larger models like RoBERTa, on the other hand, benefit particularly from extensive and high-quality data, which further increases their performance for specialised tasks. This approach combines maximum flexibility with a strategy for GDPR compliance, giving companies full control over their data and the development process. However, open-source solutions often require a higher level of maintenance and technical expertise to ensure smooth implementation and continuous optimisation.

Open-source RAG systems enable insurance companies to handle data in a cost-effective and flexible way and provide a solution for GDPR compliance. They offer comparable performance to proprietary systems while reducing dependencies and privacy risks.

Author Hannah Fischer

Hannah Fischer has been working as a student trainee in the ‘Data Driven Insurance’ competence centre at adesso since 2024, specialising in complex solutions in the area of large language models and machine learning.

Author Sina Barghidarian

Sina Barghidarian is an experienced data scientist and machine learning engineer with expertise in software development. He specialises in natural language processing (NLP) and has extensive knowledge in the development and deployment of AI solutions on the Azure cloud.