If you’re wondering about the hype with chatbots in medicine, perhaps it’s because they’re nothing new: the first medical chatbot, after all, was developed back in 1964. Using a simple pattern-matching and reflection script entitled DOCTOR, the ELIZA program simulated a Rogerian psychotherapist. Even this basic initial experiment evoked unique responses from those interacting with the software, and a new field of human-computer interaction was born. These natural-language capabilities have evolved over many years, from the digital assistants now ubiquitous on websites to the current state of the art, ChatGPT, currently based on the generative pre-trained transformer 4 architecture, better known as GPT-4.
The GPT-4 used in ChatGPT and Bing and its cousins Language Model for Dialogue Applications, or LaMDA, at Google and Large Language Model Meta AI, or LLaMA, at Meta are examples of large language models, or LLMs. These are neural-network models tuned and trained on vast amounts of data (on the order of hundreds of billions of words) split into smaller components called tokens. These models take prompts (again in tokens), usually composed of words, numbers, and text annotations, and generate an output based on statistical predictions of the next token in sequence. Because all the tokens in sequence are words and parts of words, this output takes the form of coherent sentences. This is similar to the “auto-complete” sometimes seen in word-processing applications, or in text messaging on mobile phones, but dramatically more sophisticated.
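For readers curious what “predicting the next token” looks like in practice, the sketch below uses the small, openly available GPT-2 model via the Hugging Face transformers library. GPT-4 itself is not publicly downloadable, so the specific model, library, and prompt here are stand-ins chosen purely for illustration.

```python
# A minimal sketch of next-token prediction, the core operation described above.
# GPT-2 is used as an openly available stand-in for illustration; GPT-4 itself
# is not publicly downloadable.
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The patient presented with chest pain and"
input_ids = tokenizer.encode(prompt, return_tensors="pt")  # words -> tokens

with torch.no_grad():
    logits = model(input_ids).logits  # a score for every vocabulary token, at each position

# The model's "answer" is simply the statistically most likely next token.
next_token_id = logits[0, -1].argmax().item()
print(tokenizer.decode([next_token_id]))  # output will vary with the prompt
```

Chat systems like ChatGPT repeat this single step over and over, feeding each predicted token back into the prompt, which is how coherent sentences and paragraphs emerge.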
The power of being able to prompt using natural language and generate output based on, effectively, near-encyclopedic knowledge of everything is immediately obvious and easy to demonstrate. One of the most-publicized demonstrations by the team responsible for developing GPT-4 involves its performance on medical licensing examinations.1 The team obtained a set of questions from the United States Medical Licensing Examination (USMLE) for Steps 1, 2, and 3, and prompted the GPT-4 LLM with the questions as text-only prompts, precisely as they would appear in a live examination. The model was simply asked to indicate the correct answer from the multiple-choice options presented. Whereas previous versions of GPT were unable to pass these examinations, GPT-4 scored approximately 85 percent correct on each Step of the USMLE, well above the passing threshold.
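As a rough illustration of what that kind of text-only prompting looks like, the sketch below poses a multiple-choice item to GPT-4 through the current OpenAI Python SDK. This is not the research team’s actual test harness, and the question shown is a generic placeholder rather than real USMLE content.

```python
# Illustrative only: one way to pose a multiple-choice exam item to GPT-4 via the
# OpenAI Python SDK. The question is a generic placeholder, not USMLE material.
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY is set in the environment

question = (
    "A 45-year-old man presents with acute chest pain. "
    "Which of the following is the most appropriate next step?\n"
    "A) Option one\nB) Option two\nC) Option three\nD) Option four"
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system",
         "content": "Answer with the single letter of the best choice."},
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)  # e.g. "B"
```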
If the response to such an achievement is “that’s all well and good, but basic science and general medical knowledge ain’t brain surgery,” that line of investigation has already been covered as well.2 A team from Brown University obtained the Self-Assessment Neurosurgery (SANS) from the American Board of Neurological Surgery and administered its content to GPT-4. The average performance of a neurosurgical trainee on this examination is 73 percent, with a passing threshold of 69 percent. The GPT-4 model scored 83.4 percent, obtaining a passing score and comfortably exceeding the average human performance. Given these examples, it seems likely that LLMs can guess the correct answers with sufficient accuracy to pass virtually any multiple-choice medical examination.