Virtually everyone has experimented with large language models (LLMs) in some fashion. From generating “Choose Your Own Adventure” children’s stories, to hunting for recipes that use up a strange assortment of leftover ingredients, to composing eloquent poetry, the list of non-serious uses is vast. Putting the LLM to work, however, requires a substantial step up from these applications.
Interest in LLMs for tasks beyond the chatbot realm stems from recognition that the underlying machine learning (ML) structure enables their use as generalizable prediction engines.1 Other, non-transformer-based ML methods can be used on clean, well-structured data sets, but LLMs are frequently capable of feats of unanticipated competence right out of the box. Owing to general availability, most of these studies evaluate the use of ChatGPT (created by OpenAI) as the testbed for processing text extracted from the electronic health record, although experiments from non-English-speaking countries utilize other LLMs.
Recently published experiments with LLMs in the emergency department fall into a handful of primary areas of interest. The first is the use of LLMs as agents for triage and risk stratification. One of these studies evaluated the use of LLMs to return a Canadian Triage and Acuity Scale (CTAS) score.2 In it, the authors developed six different prompts based on 61 previously validated clinical vignettes with corresponding CTAS scores. The goal of varying the prompts was to determine to what extent prompt format could improve or diminish the accuracy and reproducibility of the results.
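As a rough illustration of what a zero-shot query of this kind can look like, the sketch below assumes the OpenAI Python client, a made-up vignette, and prompt wording that is not taken from the study.

```python
# Hypothetical sketch of a zero-shot triage prompt; not the study's actual prompt wording.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

vignette = (
    "72-year-old male, sudden-onset crushing chest pain radiating to the left arm, "
    "diaphoretic, blood pressure 88/54, heart rate 118."
)

response = client.chat.completions.create(
    model="gpt-4",   # any general-purpose chat model could be substituted
    temperature=0,   # reduces, but does not eliminate, run-to-run variation
    messages=[
        {
            "role": "system",
            "content": (
                "You are assisting with emergency department triage. Assign a "
                "Canadian Triage and Acuity Scale (CTAS) level from 1 (resuscitation) "
                "to 5 (non-urgent). Reply with the number only."
            ),
        },
        {"role": "user", "content": vignette},
    ],
)

print(response.choices[0].message.content)  # e.g., "2" -- neither accuracy nor stability is guaranteed
```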
The experiment on CTAS scores was grossly unsuccessful, with an accuracy of less than 50 percent and results that varied by 20 percent on repeat testing. This sort of application illustrates where general models are limited by the nature of their basic design, and it shows that calculations are particularly challenging. In this case, the LLM lacks specific domain training on the CTAS score, so its predictions do not rest on the foundational training needed to operate accurately. These data do not indicate that an LLM cannot be a useful tool at triage, but rather that general-purpose LLMs alone will not be sufficient.
In a similar vein, another study evaluated the ability of an LLM to assess “clinical acuity” in the emergency department.3 Taking a different approach, this study did not utilize information available at triage, but instead used full-note text from completed documentation of history, examination, and assessment. Rather than being directly applicable to triage, then, this study aimed to test whether an LLM can grossly estimate clinical acuity in general, using the Emergency Severity Index (ESI) as the “gold standard.”
Here, the results tilted more favorably towards LLMs. Given pairs of patients with differing acuity as classified by ESI, the LLM was able to discern the patient of higher acuity approximately 90 percent of the time. This performance was roughly comparable with that of a physician reviewer in a subset of clinical notes undergoing manual review. These findings are less applicable to triage, and identifying the “sicker” patient is still a leap away from clinical use. However, this nominal success suggests the potential for future LLM deployment in this space.
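For concreteness, the pairwise evaluation can be imagined as the following sketch; `ask_llm` is a hypothetical helper standing in for any chat-model call, and the prompt wording is illustrative rather than the study's.

```python
# Sketch of tallying pairwise accuracy against ESI-derived labels.
# `ask_llm` is a hypothetical helper that sends a prompt to a chat model and returns its reply.
from typing import Callable

def pairwise_accuracy(pairs, ask_llm: Callable[[str], str]) -> float:
    """pairs: iterable of (note_a, note_b, truth) tuples, where truth is "A" or "B" per ESI."""
    pairs = list(pairs)
    correct = 0
    for note_a, note_b, truth in pairs:
        prompt = (
            "Two de-identified emergency department notes follow. "
            "Reply with 'A' or 'B' to indicate the patient with higher clinical acuity.\n\n"
            f"Note A:\n{note_a}\n\nNote B:\n{note_b}"
        )
        answer = ask_llm(prompt).strip().upper()
        correct += int(answer.startswith(truth))
    return correct / len(pairs)
```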
Last, with respect to triage, a rather interesting article from Korea described passive augmentation of the triage process.4 Using fine-tuned versions of Korean language-specific LLMs, this study demonstrated their use as passive listeners to medical conversations. Working from text generated by listening to clinical interviews, including both human and automated speech-to-text transcriptions, these authors evaluated the ability of the LLM to identify the top three most clinically important utterances.
Following a fine-tuning process for these models, the authors identified the best-performing model, KLUE-RoBERTa, and found moderate similarity between human and LLM rankings of clinical importance. Interestingly, as a next step, both human reviewers and the LLM were prompted to provide an explanation for their selections of clinical importance. These explanations were used for a sort of Turing test, in which additional reviewers rated the quality of the explanations. Although humans provided more appropriate explanations, the differences were not particularly profound. As a mechanism for augmenting clinical operations, this approach may show value in passively collecting and feeding information into other systems.
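The ranking step itself is conceptually simple. A minimal sketch follows, with a hypothetical `importance_score` function standing in for the fine-tuned model's scoring of each utterance.

```python
# Sketch of ranking transcribed utterances by clinical importance and keeping the top three.
# `importance_score` is a hypothetical stand-in for the fine-tuned model's scoring step.
from typing import Callable

def top_three_utterances(utterances: list[str],
                         importance_score: Callable[[str], float]) -> list[str]:
    ranked = sorted(utterances, key=importance_score, reverse=True)
    return ranked[:3]
```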
The next area of attempted augmentation lies in summarization tools intended to assist the over-burdened clinician. These include tools for synthesizing free-text data to assist in risk stratification, such as a study evaluating the ability to automate HEART (history, ECG, age, risk factors, and troponin) score determination.5 In this study, the authors developed a framework within which to iteratively refine LLM prompts to automate determination of the HEART score. These prompts were tested on batches of synthetic clinical notes for four hypothetical patients. The goal of having several notes available was to test the ability of the LLM to combine multiple sources of data into its calculations.
In contrast to the triage calculations, this demonstration was a bit more successful. Most prominently with GPT-4, the iterative prompt design process improved effectiveness, resulting in a mean error of only 0.10 in calculating any individual subscore. Ultimately, in their test cases, each hypothetical patient was placed in the correct HEART score risk bin. Although this appears impressive, these results are not robust enough to justify using GPT-4 in real-world applications; rather, they serve as a better example of the necessity of tuning the prompts used to digest clinical data for specific use cases.
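For context, the final step of such a pipeline is simple arithmetic: each of the five HEART components is scored 0 to 2, the total of 0 to 10 is summed, and the total maps to conventional low- (0–3), moderate- (4–6), or high-risk (7–10) bins. A minimal sketch follows, with made-up subscores standing in for the LLM-extracted values.

```python
# Sketch of combining extracted HEART subscores into a total and a risk bin.
# The values passed in are illustrative; in the study they were produced by
# iteratively refined prompts run against synthetic clinical notes.

HEART_COMPONENTS = ("history", "ecg", "age", "risk_factors", "troponin")

def heart_risk_bin(subscores: dict[str, int]) -> tuple[int, str]:
    """Each component is scored 0-2; the 0-10 total maps to a conventional risk bin."""
    if set(subscores) != set(HEART_COMPONENTS):
        raise ValueError("expected exactly the five HEART components")
    if any(not 0 <= value <= 2 for value in subscores.values()):
        raise ValueError("each subscore must be 0, 1, or 2")
    total = sum(subscores.values())
    if total <= 3:
        label = "low risk"
    elif total <= 6:
        label = "moderate risk"
    else:
        label = "high risk"
    return total, label

# Example with made-up subscores:
print(heart_risk_bin({"history": 1, "ecg": 0, "age": 2, "risk_factors": 1, "troponin": 0}))
# -> (4, 'moderate risk')
```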
Another summarization tool was recently described in a preprint on the use of LLMs to create emergency department discharge summaries.6 Although many electronic medical records can generate generic discharge paperwork, a concise textual summary of a visit represents a time-consuming burden for clinicians. In this instance, GPT-3.5 and GPT-4 were used, and only a single generic “write me a discharge summary” prompt preceded the text of the clinical note. Human reviewers rated each summary on accuracy, presence of AI hallucinations, and omissions of relevant information.
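In code, that single-prompt approach might be approximated as below; the sketch assumes the OpenAI Python client, and the prompt wording is illustrative rather than the study's exact text.

```python
# Sketch of a single generic discharge-summary prompt; wording is illustrative,
# not the study's exact prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def draft_discharge_summary(note_text: str, model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": "Write me a discharge summary for the following emergency "
                       "department note:\n\n" + note_text,
        }],
    )
    return response.choices[0].message.content
```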
In this simplistic experiment, the authors reported that only 10 percent of 100 sample GPT-4 summaries contained inaccuracies, defined as reported facts incongruent with the original clinician note. These results represented the performance high point, however, with 42 percent of summaries exhibiting hallucinations and 47 percent containing clinically important omissions. The examples provided by the authors include all manner of misreported physical findings, confabulated follow-up recommendations, and elements mixed up between sections of the clinical note.
As usual, this sample of published literature represents just a snapshot of work in the field. The evolving capabilities of LLMs far outpace the ability of researchers to test and report their performance. Most importantly, these articles demonstrate the need to cut through the hype and perform objective measurement prior to considering real-world use. Individual clinicians ought to exercise an abundance of caution when experimenting with publicly available LLMs in their own practice.
Dr. Radecki (@emlitofnote) is an emergency physician and informatician with Christchurch Hospital in Christchurch, New Zealand. He is the Annals of Emergency Medicine podcast co-host and Journal Club editor.
References
1. Jiang LY, Liu XC, Nejatian NP, et al. Health system–scale language models are all-purpose prediction engines. Nature. 2023;619(7969):357-362. doi:10.1038/s41586-023-06160-y.
2. Franc JM, Cheng L, Hart A, et al. Repeatability, reproducibility, and diagnostic accuracy of a commercial large language model (ChatGPT) to perform emergency department triage using the Canadian Triage and Acuity Scale. CJEM. 2024;26(1):40-46.
3. Williams CYK, Zack T, Miao BY, et al. Use of a large language model to assess clinical acuity of adults in the emergency department. JAMA Netw Open. 2024;7(5):e248895.
4. Lee S, Lee J, Park J, et al. Deep learning-based natural language processing for detecting medical symptoms and histories in emergency patient triage. Am J Emerg Med. 2024;77:29-38.
5. Safranek CW, Huang T, Wright DS, et al. Automated HEART score determination via ChatGPT: honing a framework for iterative prompt development. J Am Coll Emerg Physicians Open. 2024;5(2):e13133.
6. Williams CYK, Bains J, Tang T, et al. Evaluating large language models for drafting emergency department discharge summaries. Preprint. 2024.