Optimizing biomedical information retrieval with a keyword frequency-driven prompt enhancement strategy

Background
Mining the vast pool of biomedical literature to extract accurate responses and relevant references is challenging due to the domain's interdisciplinary nature, specialized jargon, and continuous evolution. Early natural language processing (NLP) approaches often led to incorrect answers because they failed to comprehend the nuances of natural language. However, transformer models have significantly advanced the field by enabling the creation of large language models (LLMs), enhancing question-answering (QA) tasks. Despite these advances, current LLM-based solutions for specialized domains like biology and biomedicine still struggle to generate up-to-date responses while avoiding "hallucination", i.e., generating plausible but factually incorrect responses.

Results
Our work focuses on enhancing prompts using a retrieval-augmented architecture to guide LLMs in generating meaningful responses for biomedical QA tasks. We evaluated two approaches: one relying on text embedding and vector similarity in a high-dimensional space, and our proposed method, which uses explicit signals in user queries to extract meaningful contexts. For a robust evaluation, we tested these methods on 50 specific and challenging questions from diverse biomedical topics, comparing their performance against a baseline model, BM25. The retrieval performance of our method was significantly better than that of the other methods, achieving a median Precision@10 of 0.95, i.e., the fraction of the top 10 retrieved chunks that are relevant. We used GPT-4, OpenAI's most advanced LLM, to maximize answer quality, and manually assessed the LLM-generated responses. Our method achieved a median answer quality score of 2.5, surpassing both the baseline model and the text embedding-based approach.
We developed a QA bot, WeiseEule (https://github.com/wasimaftab/WeiseEule-LocalHost), which utilizes these methods for comparative analysis and also offers advanced features for review writing and identifying relevant articles for citation.

Conclusions
Our findings highlight the importance of prompt enhancement methods that utilize explicit signals in user queries over traditional text embedding-based approaches to improve LLM-generated responses for specialized queries in domains such as biology and biomedicine. By giving users complete control over the information fed into the LLM, our approach addresses some of the major drawbacks of existing web-based chatbots and LLM-based QA systems, including hallucinations and the generation of irrelevant or outdated responses.

Supplementary Information
The online version contains supplementary material available at 10.1186/s12859-024-05902-7.


Setting up JavaScript (JS) environment
For macOS: Install the Yarn package manager by completing the following steps
• Install Node.js, if it is not installed, by following the instructions at this link: https://treehouse.github.io/installation-guides/mac/node-mac.html
• Install Yarn by following the instructions at this link: https://tecadmin.net/install-yarn-macos/
• Get an OpenAI API key by following these steps: https://maisieai.com/help/how-to-getan-openai-api-key-for-chatgpt

Setting up API keys
Open the .bashrc file located in your home directory and create environment variables to hold the API keys as shown below:

export OPENAI_API_KEY='your-openai-key-here'
export ENTREZ_API_KEY='your-entrez-key-here'
export PINECONE_API_KEY='your-pinecone-key-here'

Note: keep the names of the environment variables for the API keys as given in this document. If you would like to change the names, feel free to edit the WeiseEule source code wherever they appear.

Setting up PINECONE environment
• Log in to your account after registering at https://www.pinecone.io/ (see Fig. S2 for more details).
• Open the .bashrc file located in your home directory and set the Pinecone index and region variables as shown below:

export PINECONE_INDEX="your-pinecone-index-name"
export PINECONE_REGION="your-pinecone-region-name" (optional)

The index name and region are the ones you set during registration and index creation at https://pinecone.io
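As a quick sanity check, the variables exported in this and the previous section can be verified from Python. The helper below is a hypothetical snippet for illustration, not part of the WeiseEule source:

```python
import os

# Names follow the two sections above; PINECONE_REGION is optional in
# WeiseEule but is checked here anyway.
REQUIRED_VARS = [
    "OPENAI_API_KEY",
    "ENTREZ_API_KEY",
    "PINECONE_API_KEY",
    "PINECONE_INDEX",
    "PINECONE_REGION",
]

def missing_env_vars(names):
    """Return the subset of `names` that has no value in the environment."""
    return [n for n in names if not os.environ.get(n)]

if __name__ == "__main__":
    missing = missing_env_vars(REQUIRED_VARS)
    if missing:
        print("Still unset:", ", ".join(missing))
    else:
        print("All API key variables are set.")
```

Remember that variables added to .bashrc only take effect in new terminals (or after running `source ~/.bashrc`).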

Launching the app
First, make sure to activate the new Python environment in a terminal as described in section 2, Setting up Python environment. Then, using the same terminal, go to the app directory and install the required JS packages locally by running the following command:

yarn install

After the successful execution of the above command, a folder named node_modules will be created inside the app directory. Then you can launch the app by running its launch command; this will start the app on http://localhost:9000/

How to use WeiseEule app?
The following three steps are required to get answers to your queries as demonstrated in Fig. S3.
a. Select an LLM from the sidebar under the `Set Parameters` navigation bar (Fig. S3, red circle 1).
b. Select an appropriate namespace under the `Set Parameters` navigation bar (Fig. S3, red circle 2).
c. Type your question in the chat box on the main window and click the paper-plane icon or hit Enter (Fig. S3, red circle 3). If you are unsure about certain parameters, hover your cursor over a ? symbol; these serve as help points and pop up an explanatory window.

Chat response
By default, the app streams the answers to queries in PES1 mode using the MedCPT retriever; to use the app in PES2 mode, users need to set the `Rerank` flag to TRUE (see Fig. S4). By default, the construction of prompts for GPT-3.5 and GPT-4 involves the top five and top ten rows/chunks, respectively. You can change this and set advanced params to configure the app for more sophisticated searches by clicking on the `Set Params` button, which launches a modal with advanced params (see Fig. S4). Hover the mouse over the question-mark symbols (help points) to obtain information about each param and set it accordingly. Note that this step is optional and that, most of the time, there is no need to change params on this modal. In PES2 mode, which ranks the relevant chunks based on keyword frequencies, the keywords are by default extracted from the query automatically. However, sometimes the automatic approach may not select all the desired keywords, and it may be preferable to extract them manually. If that is the case, start the query with a '#' symbol and mark keywords using ** as shown in Fig. S3. The idea is to use the marked keywords from the user's query and rank paragraphs/chunks (from research articles) based on keyword frequency. The rationale is that chunks with a higher frequency of the user's keywords are more likely to contain contexts useful for producing a good-quality answer. The idea is depicted in Figure 5 of the main text. In the app, you can view the keyword frequency table by clicking on the `Show Table` button (see Fig. S6).
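The PES2 keyword-frequency idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the WeiseEule implementation; the function names and the simple substring counting are assumptions:

```python
import re

# Sketch of PES2 manual keyword extraction and chunk re-ranking:
# a query starting with '#' carries keywords marked with **, and
# chunks are ranked by how often those keywords occur.

def extract_marked_keywords(query):
    """Return keywords wrapped in ** when the query starts with '#'."""
    if not query.lstrip().startswith("#"):
        return []
    return [kw.strip().lower() for kw in re.findall(r"\*\*(.+?)\*\*", query)]

def rank_chunks_by_keyword_frequency(chunks, keywords):
    """Sort chunks by total keyword occurrences, highest first."""
    def score(chunk):
        text = chunk.lower()
        return sum(text.count(kw) for kw in keywords)
    return sorted(chunks, key=score, reverse=True)

chunks = [
    "The MSL complex binds the X chromosome.",
    "Dosage compensation equalizes X-linked expression; dosage compensation is conserved.",
]
query = "# What is **dosage compensation**?"
kws = extract_marked_keywords(query)
ranked = rank_chunks_by_keyword_frequency(chunks, kws)
```

Here the second chunk mentions the marked keyword twice and is therefore ranked first, mirroring the rationale stated above.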

Accessible LLMs
Currently we provide access to the LLMs below, but advanced users can add or remove LLMs by modifying the source code:
1. gpt-4: This is the GPT-4 model; it often provides better responses than the GPT-3.5 models but also costs more.
2. gpt-4-1106-preview: This is a variant of the GPT-4 family; it is comparable to `gpt-4` performance-wise but slightly cheaper.
3. gpt-4o: This is also a GPT-4 series model, but it is the fastest and most affordable flagship model from OpenAI. This model is selected by default.
We recommend using `gpt-4o` for complex queries seeking specific information.
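The per-model defaults mentioned earlier (top five chunks for GPT-3.5 prompts, top ten for GPT-4 models) can be captured in a small configuration table. This is an illustrative sketch; the dictionary name and the gpt-3.5-turbo entry are assumptions, not taken from the WeiseEule source:

```python
# Hypothetical config: default number of top-ranked chunks used when
# building a prompt for each accessible model, per the description above.
DEFAULT_TOP_CHUNKS = {
    "gpt-3.5-turbo": 5,       # assumed GPT-3.5 entry
    "gpt-4": 10,
    "gpt-4-1106-preview": 10,
    "gpt-4o": 10,
}
DEFAULT_MODEL = "gpt-4o"      # selected by default in the app

def top_chunks_for(model):
    """Return the default chunk count for a model (fall back to 5)."""
    return DEFAULT_TOP_CHUNKS.get(model, 5)
```

A table like this makes it easy for advanced users to register additional models without touching the prompt-building logic.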

Look-up publications in namespace
It is also possible to search whether a publication is present in a selected namespace, as shown in Fig. S7. The search result not only reveals the publication's presence or absence but also indicates whether the full text or just the abstract is accessible for the specified PubMed ID in that namespace. This is useful when judging the quality of a response or investigating why a specific publication was not included in the references. Error messages come with an explanation. For example, when you see a message as shown in Fig. S8, it means that the keywords could not be found in any of the chunks in your namespace. There could be several reasons:
a. You have selected a namespace that is not relevant to your query.
b. There are typos in your keywords.
c. You expect a specific publication to show up, but it may not be part of your namespace. In that case, it is recommended to confirm that the paper is indeed part of the selected namespace by using the app's 'search namespace' feature (see Fig. S8). Sometimes only the abstract is available, and it is possible that the keywords are not mentioned there.
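The presence/absence check described above can be sketched as follows. This is a hedged illustration that uses a plain dict as the namespace index; the real app queries the vector store, and all names here are assumptions:

```python
# Toy namespace index: PubMed ID -> metadata about what was ingested.
demo_namespace = {
    "12345678": {"full_text": True},    # full text was fetched
    "87654321": {"full_text": False},   # only the abstract was fetched
}

def lookup_pmid(namespace_index, pmid):
    """Report presence and coverage (full text vs abstract) for a PMID."""
    entry = namespace_index.get(pmid)
    if entry is None:
        return f"PMID {pmid} is not in this namespace."
    coverage = "full text" if entry.get("full_text") else "abstract only"
    return f"PMID {pmid} is present ({coverage})."
```

The "abstract only" case explains reason (c) above: a paper can be present in the namespace yet still fail a keyword search if the keywords appear only in its full text.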

Namespace creation using GUI
Using the WeiseEule GUI, articles can be obtained from PubMed, as illustrated in Fig. S9. The namespace name is determined from the entered keyword(s) to make the naming relevant.
For example, if the entered keyword is "dosage compensation" and the text embedder model is "MedCPT", then the namespace is named "dosage_compensation_MedCPT", where the "MedCPT" part at the end indicates that the article chunks in this namespace are embedded using the MedCPT article encoder. In the future, we will provide separate fields on the GUI to change the chunk size and to name a namespace. Currently, you can achieve this by modifying the chunk_size and namespace variables in the endpoint @app.websocket("/ws/fetch_articles") located in pycodes/fastapi_app.py inside the app directory (see Fig. S1). Note that this approach creates a namespace by fetching the full text of articles that are available (in XML format) on the PMC server; for the others, it only fetches the abstract from PubMed.
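The naming convention just described can be expressed as a one-line helper. This is an illustrative sketch (the function name is an assumption, not the WeiseEule source):

```python
def build_namespace_name(keyword, embedder="MedCPT"):
    """Join the keyword and embedder model into a namespace name,
    e.g. 'dosage compensation' + 'MedCPT' -> 'dosage_compensation_MedCPT'."""
    return f"{keyword.strip().replace(' ', '_')}_{embedder}"
```

The embedder suffix matters because chunks embedded with different encoders are not comparable, so each keyword/encoder pair gets its own namespace.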
However, for many articles in PMC the full text is not available in XML format. In those cases, we have a script called PDF2NAMEPSACE_with_PMIDs.py to download those articles as PDFs and then extract their text to augment the existing namespace. It may be run either in a terminal or in an IDE such as Spyder. Note that the PDFs are saved in a local folder named pdf_downloaded inside the app directory.

Namespace creation for custom PDFs
In some cases, a user may already have a collection of PDFs in a local folder that can serve as a knowledge base. In that case, namespace creation is only possible by running the PDF2NAMEPSACE_NO_PMIDs.py file either in a terminal or in an IDE such as Spyder.

Fig. S1 :
Fig. S1: App directory structure - The WeiseEule app directory contains the source code of the

For Linux:
• Update the package list by running the following command to ensure the most recent versions of packages and their dependencies are downloaded: sudo apt update
• Install Node.js. If Node.js is not installed, then install it by running: sudo apt install nodejs npm
• Install Yarn using npm (Node Package Manager) by running: npm install --global yarn
• Verify the installation by checking the version of Yarn installed.

Fig. S3 :
Fig. S3: Three essential steps to get answers - Parameters can be set by interacting with the

Fig. S4 :
Fig. S4: Advanced parameters - Use the `Set Params` button to launch a modal with advanced

Fig. S6 :
Fig. S6: Chunks are re-ranked by keyword frequencies - Each row refers to a chunk. The

Fig. S7 :
Fig. S7: Illustrates the search namespace feature - Input a valid PubMed ID and hit the GO button.

Fig. S9 :
Fig. S9: Article accumulation - Only publications that contain keywords either in the title or

• Install Conda by following the instructions in the official Anaconda documentation: https://docs.anaconda.com/free/anaconda/install/linux/ (for macOS: https://docs.anaconda.com/anaconda/install/mac-os/)
• Go to the app directory as shown in Fig. S1 and create the Python environment by running: conda env create -f qa_env.yml
• After creating the new Python environment, activate it.