OBJECTIVE: Large Language Models (LLMs) hold potential for clinical task-shifting by processing unstructured clinical text, enabling tasks such as clinical concept extraction and medical question answering from electronic health records. If implemented reliably, such approaches may benefit over-burdened healthcare systems, particularly in resource-limited settings and for traditionally overlooked populations, provided that local fine-tuning is supported by appropriate clinical and technical expertise. However, this powerful technology remains largely understudied in real-world contexts, particularly in the Global South. This study aims to assess whether openly available LLMs can reliably process medical notes in real-world settings in South Asia. METHODS: We used publicly available LLMs to parse de-identified clinical notes from a large electronic health record (EHR) database in Pakistan containing hospital records for 8.2 million patients. We evaluated ChatGPT (GPT-3.5) as a general-purpose LLM, and GatorTron (base), BioMegatron, BioBERT, and ClinicalBERT as medical LLMs, after fine-tuning them with (a) publicly available clinical datasets, namely Informatics for Integrating Biology & the Bedside (I2B2) and National NLP Clinical Challenges (N2C2) for medical concept extraction (MCE) and emrQA for medical question answering (MQA), and (b) the local Pakistani de-identified EHR dataset, which includes inpatient Discharge Summaries (DS) and Subjective, Objective, Assessment, and Plan (SOAP) notes, as detailed in this paper. MCE models were applied to these clinical notes using both 3-label and 9-label formats, while MQA models were applied to medical questions. Internal and external validation performance was measured for (a) and (b) using F1 score, precision, recall, and accuracy for MCE, and BLEU and ROUGE-L, which measure lexical and sequence similarity, for MQA. 
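The abstract does not specify the exact evaluation implementation, tokenizer, or n-gram settings; as an illustration only, a minimal sketch of the MQA similarity metrics named above, computed over whitespace tokens, might look like the following (ROUGE-L as an LCS-based F-measure, and a clipped unigram precision as the 1-gram core of BLEU):

```python
# Illustrative sketch of the MQA text-similarity metrics (ROUGE-L and a
# unigram BLEU-style precision). The paper's actual implementation is
# not given in the abstract; this is an approximation on whitespace tokens.
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [0] * (len(b) + 1)
    for x in a:
        prev = 0
        for j, y in enumerate(b, 1):
            cur = dp[j]
            dp[j] = prev + 1 if x == y else max(dp[j], dp[j - 1])
            prev = cur
    return dp[-1]


def rouge_l(reference, candidate):
    """ROUGE-L F-measure over whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall, precision = lcs / len(ref), lcs / len(cand)
    return 2 * precision * recall / (precision + recall)


def unigram_precision(reference, candidate):
    """Clipped unigram precision, the 1-gram component of BLEU."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    overlap = sum(min(c, ref_counts[t]) for t, c in cand_counts.items())
    return overlap / max(sum(cand_counts.values()), 1)
```

For example, `rouge_l("metformin 500 mg twice daily", "metformin 500 mg daily")` yields about 0.89, reflecting a near-complete but not exact answer match; full BLEU additionally combines higher-order n-gram precisions with a brevity penalty.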
RESULTS: When the medical LLMs were not fine-tuned on the local EHR dataset, their external validation performance on local data was notably poorer than their internal validation performance on the fine-tuning dataset, with reductions of at least 15% in F1 scores for MCE and 35% in ROUGE-L and BLEU scores for MQA. This suggests potential bias and highlights the inability of the medical LLMs to reliably handle the data distribution of the local population without further fine-tuning and adaptation. The trend persisted across two distinct natural language processing tasks, concept extraction and question answering, spanning a spectrum of task complexities. However, fine-tuning the LLMs with local EHR data significantly improved model performance on both tasks, yielding a 7.5% to 15% increase in F1 score for MCE and a 27% to 53% increase in ROUGE-L and BLEU scores for MQA. ChatGPT, as a general-purpose LLM, was the exception: even without fine-tuning on local data, it performed better on the local EHR dataset than on the publicly available datasets across all measured metrics, with improvements ranging from 3% to 17%. CONCLUSIONS: Publicly available LLMs, predominantly trained on data from high-income regions, were found to be unreliable when applied in a real-world clinical setting in Pakistan. Fine-tuning them with local EHR data and regional clinical contexts improved their reliability, demonstrating a feasible adaptation strategy that is substantially less resource-intensive than training large language models from scratch. Close collaboration between local clinical and technical experts to curate and leverage more representative, inclusive, and unbiased medical datasets can play a crucial role in further ensuring the reliability of LLMs in resource-limited, overburdened settings, so that they are used in ways that are safe, fair, and beneficial for all. 
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12911-026-03366-8.
Journal article
2026-02-25T00:00:00+00:00
26
Clinical note processing, Electronic health records, External validation, Fine-tuning, Global health, Health equity, Large language models, Medical concept extraction, Medical question answering