Performance Evaluation of Popular Open-Source Large Language Models in Healthcare

International Conference in Greece on Informatics, Management and Technology in Healthcare 2025, 2025

Authors

Saif Khairat, Tianyi Niu, John Geracitano, Kaushalya Mendia

Abstract

This paper evaluated user preferences and performance metrics for two widely used open-source large language models (LLMs), Llama 3.1 8B and Mistral 3 Small 24B (AWQ), compared to the proprietary model GPT-4o, in the context of serving as a user-oriented healthcare assistant. The study highlighted the advantages of open-source LLMs, including transparency, cost-effectiveness, and customization potential for specific applications. A dual approach was used: first, ten participants ranked model-generated responses to various healthcare questions; second, computational performance metrics, namely response time, throughput, and time-to-first-token, were benchmarked under different user loads. Results indicated that the majority of participants preferred GPT-4o responses; however, both open-source LLMs received relatively similar ratings. Furthermore, the benchmarking results underscored the efficiency and reliability of the models under load, showcasing their capabilities for real-world applications. This research contributed to understanding how open-source LLMs could meet the needs of diverse users across different domains, encouraging further exploration and adoption in various industries.
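
The three benchmarked metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' benchmarking code: the `RequestTrace` structure, the function names, and the choice to measure throughput as aggregate generated tokens per second over the whole test window are all assumptions made here for illustration.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class RequestTrace:
    """Timing record for one streamed request (hypothetical structure)."""
    sent_at: float            # seconds: when the request was sent
    token_times: List[float]  # seconds: arrival time of each generated token


def time_to_first_token(trace: RequestTrace) -> float:
    """Delay between sending the request and receiving the first token."""
    return trace.token_times[0] - trace.sent_at


def response_time(trace: RequestTrace) -> float:
    """Delay between sending the request and receiving the last token."""
    return trace.token_times[-1] - trace.sent_at


def throughput(traces: List[RequestTrace]) -> float:
    """Aggregate tokens per second over the window spanned by all requests.

    Assumed definition: total generated tokens divided by the time from the
    earliest request to the latest final token.
    """
    total_tokens = sum(len(t.token_times) for t in traces)
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return total_tokens / (end - start)


def summarize(traces: List[RequestTrace]) -> Dict[str, float]:
    """Summarize one load level (for example, N concurrent simulated users)."""
    return {
        "mean_ttft_s": mean(time_to_first_token(t) for t in traces),
        "mean_response_time_s": mean(response_time(t) for t in traces),
        "throughput_tokens_per_s": throughput(traces),
    }


if __name__ == "__main__":
    # Two simulated concurrent users with made-up timestamps.
    demo = [
        RequestTrace(sent_at=0.0, token_times=[0.5, 1.0, 1.5, 2.0]),
        RequestTrace(sent_at=0.0, token_times=[1.0, 2.0, 3.0, 4.0]),
    ]
    print(summarize(demo))
```

Running `summarize` once per user-load level and comparing the results across levels is one way such a benchmark might show how latency and throughput change as concurrency grows.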

Paper: https://pubmed.ncbi.nlm.nih.gov/40588913/