Prompt Engineering for Biomedical NLI: An Exploratory Study.

Alkhawaf, Hasan Fadhil Qasim; Faili, Heshaam

doi:10.22060/miscj.2025.24382.5421

Document Type : Research Article

Authors

Hasan Fadhil Qasim Alkhawaf ¹

Heshaam Faili ²

¹ Alborz Campus, University of Tehran, Tehran, Iran

² College of Engineering, School of Electrical and Computer Engineering, University of Tehran, Tehran, Iran

Abstract

Biomedical Natural Language Inference (BioNLI) is a core task in biomedical NLP: deciding whether a biomedical premise entails a given hypothesis. Prompt-based methods are gaining traction as one of the simplest and fastest ways to apply large language models (LLMs) to such inference tasks without complex, time-consuming fine-tuning. However, biomedical NLI is inherently difficult, and its heavy use of domain-specific terminology poses a significant challenge for prompt engineering. Prompts that rely on fixed instructions or predefined examples, as in zero-shot and static few-shot prompting, often lack the contextual information needed for entailment decisions and generalize poorly across heterogeneous biomedical texts. In this work, we present a comprehensive evaluation of six prompting methods (zero-shot, static few-shot, dynamic few-shot, Chain-of-Thought (CoT), self-consistency, and Tree-of-Thought (ToT)) with two LLMs, DeepSeek-R1-Distill-Qwen-14B and LLaMA-3.1-8B-Instruct. We applied these methods to the BioNLI dataset and report key evaluation metrics for every method. Our results show that dynamically selected in-context examples, combined with structured reasoning, produce high-quality inference in this setting. Across all models and configurations, few-shot ToT prompting with the DeepSeek model performed best, reaching a macro-F1 score of 71.05 and outperforming retrieval-augmented models reported in prior studies. These findings indicate that prompt engineering alone can handle complex biomedical reasoning effectively, without retrieval or full fine-tuning.
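
Of the prompting methods listed above, self-consistency is the easiest to illustrate in isolation: the model is sampled several times on the same premise-hypothesis pair, and the most frequent answer is kept. The following is a minimal sketch of that aggregation step only, not the authors' implementation. The function name `self_consistency_vote`, the label strings, and the sampled outputs are all hypothetical; the abstract does not specify the label set, the number of samples, or how ties are broken.

```python
from collections import Counter


def self_consistency_vote(sampled_labels):
    """Return the most frequent label among independently sampled answers.

    Assumption: each reasoning path has already been parsed down to a
    single label string. Parsing model output is outside this sketch.
    """
    if not sampled_labels:
        raise ValueError("need at least one sampled label")
    # Normalize case and whitespace so "Entailment " and "entailment" match.
    normalized = [label.strip().lower() for label in sampled_labels]
    # most_common(1) returns the top (label, count) pair; among equal
    # counts, the label seen first wins (an arbitrary tie-break choice).
    label, _count = Counter(normalized).most_common(1)[0]
    return label


# Hypothetical labels parsed from five sampled reasoning paths.
samples = ["entailment", "Entailment", "neutral", "entailment ", "contradiction"]
final_label = self_consistency_vote(samples)
```

In this toy input, three of the five sampled paths agree after normalization, so the vote settles on that label. A real pipeline would also need a rule for unparseable outputs, which this sketch leaves out.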

Keywords

Chain-of-Thought

Tree-of-Thought

self-consistency prompting

few-shot reasoning

DeepSeek-R1-Distill-Qwen-14B

Subjects