Large Language Models in Medicine: Capabilities, Limitations, and the Path Forward
A comprehensive review establishing what LLMs can and cannot do in healthcare, providing the evidence base for informed clinical adoption.
What They Studied
Thirunavukarasu and colleagues reviewed the state of large language models in medicine, examining their demonstrated capabilities across clinical tasks, their known limitations, and the regulatory and ethical considerations surrounding their deployment. The review synthesized evidence from across healthcare disciplines to map where LLMs show genuine promise and where caution is warranted.
What They Found
- LLMs demonstrated strong performance on medical knowledge assessments, with models like GPT-4 passing USMLE-style examinations and performing at or above average physician level on many knowledge-based tasks.
- Clinical reasoning capabilities were inconsistent: models could generate plausible-sounding explanations but sometimes arrived at conclusions through pattern matching rather than genuine clinical logic.
- Significant bias risks were identified, including underrepresentation of minority populations in training data and the potential to perpetuate existing healthcare disparities.
- LLMs showed particular promise in administrative and documentation tasks, where the stakes of minor errors are lower than in direct diagnostic decision-making.
- The review emphasized that regulatory frameworks for AI in medicine lag significantly behind the pace of development and clinical adoption.
Methodology
This was a narrative review synthesizing published literature, preprints, and technical reports on LLMs in healthcare through 2023. The authors drew from clinical validation studies, benchmark evaluations, and policy analyses. As a review rather than an original study, it reflects the authors’ interpretation of a rapidly evolving evidence base.
What This Means for SLPs
- This paper provides the foundational knowledge for understanding what these tools realistically can and cannot do, making it essential reading before adopting any AI tool.
- The finding that LLMs excel at documentation and administrative tasks more than clinical reasoning aligns with the recommended “copilot” approach for SLP practice.
- Bias risks are real and relevant. SLPs working with linguistically diverse populations need to be especially cautious about AI-generated recommendations that may reflect training data biases.
- The gap between AI capability and regulatory oversight means that individual clinicians currently bear significant responsibility for vetting AI outputs.
- Understanding LLM limitations helps SLPs set appropriate expectations when institutions push for AI adoption.
Limitations to Keep in Mind
- Because this is a 2023 review, some findings may not reflect the capabilities of more recent models, which continue to evolve rapidly.
- The review focused broadly on medicine. Speech-language pathology was not specifically addressed, and SLP-specific evidence remains limited.
- Much of the reviewed literature came from benchmark tests rather than real-world clinical deployment studies.
The Bottom Line
LLMs show genuine promise for healthcare documentation and knowledge tasks, but their clinical reasoning is inconsistent and their bias risks are real, making informed, cautious adoption essential.