Generative AI and its associated models, including large language models (LLMs), hold huge promise to revolutionize healthcare and other industries, but they come with substantial risks as well. A variety of paradigms exist for classifying the risks of AI in general, including the National Institute of Standards and Technology’s (NIST’s) Artificial Intelligence Risk Management Framework (AI RMF) and the aligned Blueprint for Trustworthy AI from the Coalition for Health AI.
NIST’s framework for trustworthiness asks whether a model is Valid and Reliable (Safe; Secure and Resilient; Explainable and Interpretable; Privacy-Enhanced; and Fair with Harmful Bias Managed) and whether it is Accountable and Transparent. Privacy and security are particularly important in healthcare, so the next article in this series is dedicated to security and privacy considerations in the context of LLMs in healthcare.
Validity and reliability are crucial for generative AI and related models in healthcare, since failures in healthcare can be catastrophic. Generative AI is prone to “hallucinations” (generated content that is factually incorrect). These errors occur because generative AI, including LLMs, is designed to predict the most likely completion of given content but does not truly understand that content. Notably, some authors recommend alternative terms to “hallucinations,” such as “confabulations” or “AI misinformation,” to reduce perceptions of model sentience and to avoid stigmatizing people who experience hallucinations.
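The dynamic can be illustrated with a toy sketch in Python. This is not a real language model; the probability table is invented purely for illustration. It shows how always choosing the statistically most likely next word can produce output that is fluent, confident, and factually wrong:

```python
# Toy illustration (not a real LLM): a table of invented next-word
# probabilities, keyed by the previous two words. Note that for the
# context ("australia", "is"), the most probable continuation is a
# plausible-sounding falsehood -- the capital of Australia is Canberra.
next_word_probs = {
    ("the", "capital"): {"of": 0.9, "city": 0.1},
    ("capital", "of"): {"australia": 0.6, "france": 0.4},
    ("of", "australia"): {"is": 0.95, "was": 0.05},
    ("australia", "is"): {"sydney": 0.7, "canberra": 0.3},
}

def greedy_complete(prefix: str, steps: int) -> str:
    """Repeatedly append the single most probable next word."""
    words = prefix.split()
    for _ in range(steps):
        candidates = next_word_probs.get((words[-2], words[-1]))
        if not candidates:
            break
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)

print(greedy_complete("the capital", 4))
# -> "the capital of australia is sydney" (fluent, most likely, and wrong)
```

The model never “lies” in any deliberate sense; it simply has no notion of truth, only of likelihood, which is exactly why hallucinations arise.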
Regardless of what they are called, these errors can take many forms. Examples in healthcare specific to LLMs, the text-generating portion of generative AI, include:
These are all actual examples of hallucinations and illustrate one of the risks of relying on LLMs. Non-existent references are especially pernicious: they can make model output appear more authoritative and reliable, and many people may not understand the need to vet them carefully.
A cautionary tale from another industry: a law firm was recently fined after one of its lawyers submitted a court brief citing non-existent cases that had been suggested by an LLM, a classic example of a hallucination.
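One basic vetting step can be sketched in a few lines of Python. The “trusted index” and DOIs below are invented placeholders; in practice the lookup would be against a real bibliographic database, but the principle is the same: any citation that cannot be located in an authoritative source gets flagged for human review.

```python
# Placeholder index of known-good identifiers; in a real pipeline this
# would be a query against a bibliographic database, not a local set.
trusted_index = {"10.1000/real-paper-1", "10.1000/real-paper-2"}

def flag_unverified(cited_dois: list[str]) -> list[str]:
    """Return the cited identifiers that are NOT in the trusted index,
    i.e., the ones a human must vet before the citation is trusted."""
    return [doi for doi in cited_dois if doi not in trusted_index]

print(flag_unverified(["10.1000/real-paper-1", "10.9999/hallucinated"]))
# -> ['10.9999/hallucinated']
```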
A number of strategies have been described to reduce the risk of hallucinations (e.g., asking an LLM to review its own output for inaccuracies), but whatever strategy is adopted, hallucinations are clearly a risk to be aware of and monitored closely.
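The self-review strategy mentioned above can be sketched as a two-pass pattern. This is a minimal sketch, not a vendor API: `llm_call` is a stand-in for whatever model interface an implementation actually uses, and the prompt wording is illustrative only.

```python
# Sketch of a "self-review" mitigation: generate a draft, then make a
# second model call asking for unverifiable claims to be flagged before
# the draft reaches a clinician. No real LLM API is assumed here.
REVIEW_TEMPLATE = (
    "Review the following draft for factual claims and citations that "
    "cannot be verified. List each questionable item.\n\nDRAFT:\n{draft}"
)

def build_review_prompt(draft: str) -> str:
    """Wrap a generated draft in a second-pass verification prompt."""
    return REVIEW_TEMPLATE.format(draft=draft)

def self_review(draft: str, llm_call) -> str:
    """Run the draft through a second pass; llm_call is any callable that
    takes a prompt string and returns the model's text response."""
    return llm_call(build_review_prompt(draft))

# Example with a placeholder model function standing in for a real LLM:
fake_llm = lambda prompt: "Questionable: cited reference not found."
print(self_review("Aspirin cures Alzheimer's (Smith et al., 2021).", fake_llm))
```

Note the limitation, consistent with the caution above: the reviewer is the same class of system that produced the error, so this pattern reduces but does not eliminate the need for human verification.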
In addition to hallucinations, the specifics of how a model is implemented may introduce other risks. For example, an LLM given the history of present illness (HPI) from a series of 35-40 Emergency Department charts generated a differential diagnosis that included the correct diagnosis only half the time (Fast Company, 2023). It missed several life-threatening conditions, such as ectopic pregnancy, aortic rupture, and brain tumor. The author of that report also expressed concern that an LLM used to generate a differential diagnosis would reinforce the physician’s cognitive errors: if the HPI did not include questions probing a particular diagnosis, the model would be less likely to suggest that diagnosis in the differential. In other words, using a model to develop a differential diagnosis from an HPI could encourage or even cause premature closure errors.
This is just one specific example; the underlying concept is that whenever a generative AI model such as an LLM is implemented, close consideration must be given to how it could increase or decrease safety. Some potential use cases are much higher risk than others.
Bias is another significant risk with LLMs, and it can be overt or subtle. Training data is often scraped from the internet and as such can contain biased content. Even with aggressive attempts to label that training data and/or filter out biased responses, bias can come through, sometimes egregiously. More subtly, models may associate certain characteristics with a given race, gender, religion, etc. Importantly for healthcare, many LLMs have also been shown to perpetuate race-based medicine.
There are other potential risks and pitfalls of generative AI and related models, including:
Discussing these in detail is beyond the scope of this article.
As described elsewhere in this series, LLMs, and generative AI as a whole, have immense promise in healthcare as in other industries. However, there are significant risks and pitfalls to be aware of, especially in the high-stakes arena of healthcare. When implementing generative AI models such as LLMs, it is critical to carefully consider these risks and how to mitigate them.
In the next section of this series, section five, we will discuss the security and privacy considerations that need to be top of mind in relation to LLMs. To view previous sections in this series, please see the links below.
Quanta Magazine. (2022, December 8). What Causes Alzheimer’s? Scientists Are Rethinking the Answer. Quanta Magazine. https://www.quantamagazine.org/what-causes-alzheimers-scientists-are-rethinking-the-answer-20221208/
Fast Company. (2023, June 6). ChatGPT, the AI language model, struggled in medical diagnosis, report says. Fast Company. https://www.fastcompany.com/90863983/chatgpt-medical-diagnosis-emergency-room
The views and opinions expressed in this content or by commenters are those of the author and do not necessarily reflect the official policy or position of HIMSS or its affiliates.