adesso Blog

MedPrompt is a prompting strategy developed by Microsoft. With it, Microsoft has succeeded in turning large language models (LLMs) such as GPT-4 into domain-specific experts without fine-tuning. The strategy is currently focussed exclusively on the medical field, where the methodology has been tested successfully on nine benchmarks.

Prompt engineering vs. fine-tuning of language models

Fine-tuning a language model is very time-consuming and cost-intensive, as it requires a large amount of domain-specific data and expertise before the model delivers better results for the use case. Figure 1 shows how well MedPrompt performs compared to other fine-tuned and non-fine-tuned models: MedPrompt achieves the highest test scores, while a selection of other models lags clearly behind.


Figure 1: Accuracy of MedPrompt compared to other medically specialised language models. Source: https://ar5iv.labs.arxiv.org/html/2311.16452

Med-PaLM 2 is a purely fine-tuned model that pioneered medical text understanding before MedPrompt. As Figure 2 shows, MedPrompt (the outer line) performs better than Med-PaLM 2 and GPT-4 across the different benchmarks. This suggests that fine-tuning on curated data alone does not turn a model into an expert, but can instead give the language model a kind of tunnel vision. Each benchmark represents a different medical category: the benchmark "MMLU Anatomy", for example, contains only anatomy questions, while "MMLU College Biology" contains exam questions from a university biology course. MMLU stands for Massive Multitask Language Understanding, a collection of multiple-choice questions, and each benchmark measures the accuracy of the answers to its questions.


Figure 2: MedPrompt, Med-PaLM 2 and GPT-4 compared by accuracy across different medical specialities. Source: https://synthedia.substack.com/p/gpt-4-beats-medpalm-2-for-medical

But how does MedPrompt achieve this specialised knowledge without fine-tuning?

Various prompting techniques are used to achieve the domain-specific expertise of MedPrompt. These are shown in Figure 3.


Figure 3: Sequence of MedPrompt prompting strategies. Source: https://www.microsoft.com/en-us/research/blog/the-power-of-prompting/

  • The zero-shot prompting strategy uses prompts that were not part of the training data, yet the model can still produce the desired result. The model is asked about a specific topic without any examples and answers based on its contextual or general knowledge. Typical examples are classifying spam e-mails or translating a sentence.
  • Random few-shot is a prompting strategy in which several examples of a task are given in the prompt. Providing two or more examples to a foundation model enables it to adapt quickly to a specific domain, for example medicine, and to learn the task format, so a concept can be communicated more effectively. One example is the invention of a new word that is explained to the LLM using two examples, after which the LLM can use this word correctly in other contexts.
  • Chain-of-Thought (CoT) uses statements in natural language, such as "Let's think step by step", to explicitly encourage the model to generate a series of intermediate steps.
  • The k-nearest-neighbour (kNN) algorithm is a simple machine learning algorithm based on the idea that similar data points carry similar labels. A new data point is classified by looking at its "k" nearest neighbours, where "k" specifies how many neighbours are taken into account. A small "k", for example 1, leads to a strong fit to the training data and can cause the model to adapt too strongly to outliers. A large "k", for example 5 or greater, makes the model less sensitive to individual data points, which makes it easier to generalise to new, unknown data.
  • In the ensembling prompting strategy, the results of several algorithms or runs are combined in order to achieve better prediction performance than any single one. To answer multiple-choice questions, the choice-shuffling strategy is used: the relative order of the answer options is shuffled before the individual reasoning paths are generated, and the most consistent answer, i.e. the one that is least sensitive to the reordering, is selected, which increases the robustness of the answer (a small code sketch of this idea follows the list).
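The choice-shuffling idea can be illustrated with a minimal sketch. Note that ask_model is only a hypothetical placeholder for the actual GPT-4 call; the sketch shows the shuffle-and-vote logic, not Microsoft's original implementation:

```python
import random
from collections import Counter

def ask_model(question: str, options: list[str]) -> str:
    """Placeholder for a real LLM call (e.g. a GPT-4 chain-of-thought
    completion) that returns the text of the chosen answer option."""
    raise NotImplementedError

def choice_shuffling_ensemble(question: str, options: list[str], runs: int = 5) -> str:
    """Ask the same multiple-choice question several times with shuffled
    answer options and return the most consistent answer."""
    votes = []
    for _ in range(runs):
        shuffled = options[:]      # copy, so the original order stays intact
        random.shuffle(shuffled)   # change the relative order of the options
        votes.append(ask_model(question, shuffled))  # vote on the option text, not its position
    # the answer that is least sensitive to the reordering wins
    return Counter(votes).most_common(1)[0][0]
```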

All prompts are automated by a separate AI, so that only a few human domain experts are needed for validation. Before the language model is given a task, a pre-processing step takes place: each question in the training dataset is run through a lightweight embedding model to generate an embedding vector. This is followed by what is known as inference, i.e. what happens when a test question is actually answered: the test question is embedded with the same model as in pre-processing, and kNN is used to find the most similar examples from the pre-processed pool, which then serve as few-shot examples for the prompt.
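This retrieval step can be pictured with a minimal sketch under stated assumptions: sentence-transformers stands in for the lightweight embedding model, scikit-learn provides the kNN search, and the three hard-coded questions merely represent the curated question pool of the real pipeline:

```python
from sentence_transformers import SentenceTransformer   # stand-in for the lightweight embedding model
from sklearn.neighbors import NearestNeighbors

# Pre-processing: embed every question in the training pool once.
# These three questions only stand in for the curated MedPrompt pool.
train_questions = [
    "Which nerve innervates the diaphragm?",
    "Which vitamin deficiency causes scurvy?",
    "Which chamber of the heart pumps blood into the aorta?",
]
embedder = SentenceTransformer("all-MiniLM-L6-v2")       # assumed embedding model
train_vectors = embedder.encode(train_questions, normalize_embeddings=True)
index = NearestNeighbors(metric="cosine").fit(train_vectors)

def select_few_shot_examples(test_question: str, k: int = 2) -> list[str]:
    """Inference step: embed the test question with the same model and
    return the k most similar training questions as few-shot examples."""
    vector = embedder.encode([test_question], normalize_embeddings=True)
    _, neighbour_ids = index.kneighbors(vector, n_neighbors=k)
    return [train_questions[i] for i in neighbour_ids[0]]

# The retrieved examples are placed in front of the test question to form
# the dynamic few-shot part of the final prompt.
test_question = "Which muscle separates the thorax from the abdomen?"
prompt = "\n\n".join(select_few_shot_examples(test_question)) + "\n\n" + test_question
```

In MedPrompt itself, the retrieved examples additionally carry model-generated chain-of-thought rationales, so the dynamic few-shot examples demonstrate the reasoning as well as the answers.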

Effectiveness of MedPrompt: More appearance than reality?

Of course, as always with generative AI, it is important to take a critical look at MedPrompt and scrutinise its actual effectiveness. As can be seen in Figure 1, there is a difference of around four per cent between MedPrompt and Med-PaLM 2, but it is important to note that MedPrompt was published several months after Med-PaLM 2. An older model is therefore being compared with a newer one, which makes it impossible to say exactly how big the difference really is. In addition, the two approaches are built on different foundation models: MedPrompt is based on GPT-4, while Med-PaLM 2 is based on PaLM 2, which could also explain part of the difference.

In addition, it is important to point out that zero-shot, random few-shot and chain-of-thought (CoT) are older prompting techniques, some of which have been in use since 2020, and they are only explained very superficially in the corresponding study. Figure 4 shows the effectiveness of the individual prompting strategies in detail: the older techniques account for a very high proportion of the benchmark performance. However, the newer ensemble prompting and the combination of the individual techniques also appear to be very effective.


Figure 4: Effectiveness of the individual prompting techniques. Source: https://arxiv.org/pdf/2311.16452

How will this affect the future?

In the future, this methodology will be extended to all kinds of areas. With MedPrompt+, this is already being implemented and tested on various benchmarks, including mathematics, application-oriented tasks, a benchmark consisting only of code tasks, one that tests the reading comprehension of paragraphs, and one that evaluates the advanced natural language understanding and common sense of AI models.

For adesso and its customers, this new strategy means that generative AI solutions can deliver better results without having to invest time and resources in fine-tuning measures. However, implementing this approach also requires a deeper understanding of artificial intelligence and the respective model - a specialised skill that Prompt Engineers need to acquire quickly.

Would you like to find out more about exciting topics from the world of adesso? Then take a look at our previous blog posts.

GenAI

From the idea to implementation

GenAI will change our business lives just as much as the Internet or mobile business. Today, companies of all sizes and from all sectors are laying the foundations for the effective use of this technology in their business.

A key challenge: integrating GenAI applications into their own processes and existing IT landscape. You can find out how to do this and how we can support you on our website.

To the GenAI website


Author Christian Hammer

After successfully completing his degree in business informatics at the University of Applied Sciences in Würzburg with a focus on e-commerce, Christian Hammer built a career spanning several roles and technologies in the development of data analytics solutions. Over the years, he took on increasing responsibility, first as a lead developer, later as an architect and project manager, including during the merger of E-Plus and O2. Today, he works almost exclusively on consulting assignments, either in strategy consulting or as a project or programme manager. Christian focuses on business analysis in the context of data integration, data platforms, big data and artificial intelligence.


Author Jasper Rosenbaum

Jasper Rosenbaum graduated from Maastricht University with a B.Sc. in International Business. He is currently an intern in the GenAI Solutioning Unit before he starts his Master's in Information Management and Business Intelligence back in Maastricht.
