
A recent study compared the performance of trainee doctors and several chatbots in diagnosing pediatric respiratory cases. The data showed that ChatGPT scored highest, surpassing the solutions provided by the doctors.
Lead author Manjith Narayanan, MD, PhD, presented the abstract, “Clinical Scenarios in Paediatric Pulmonology: Can Large Language Models Fare Better Than Trainee Doctors?” at the European Respiratory Society (ERS) Congress in Vienna, Austria. Dr. Narayanan is a consultant in pediatric pulmonology at the Royal Hospital for Children and Young People, Edinburgh, and an honorary senior clinical lecturer at the University of Edinburgh, UK.
“Large language models [LLMs], like ChatGPT, have come into prominence in the last year and a half with their ability to seemingly understand natural language and provide responses that can adequately simulate a human-like conversation,” said Dr. Narayanan in a press release. “These tools have several potential applications in medicine. My motivation to carry out this research was to assess how well LLMs are able to assist clinicians in real life.”
To investigate the capability of LLMs in a medical setting, Dr. Narayanan and his colleagues created clinical scenarios that commonly occur within pediatric practice. Topics included respiratory conditions such as asthma, cystic fibrosis, breathlessness, chest infections and sleep-disordered breathing. The scenarios did not have straightforward diagnoses, and there were no published guidelines, evidence or expert consensus that would direct the doctors to a specific outcome or plan.
In the study, 10 trainee doctors, each with less than four months of pediatric clinical experience, were given one hour to evaluate each scenario. The doctors were allowed to use the internet, but not chatbots, and they were asked to write a 200- to 400-word solution for each scenario.
Three LLMs — ChatGPT (version 3.5), Google’s Bard and Microsoft’s Bing — were given the same scenarios. Pediatric respiratory experts graded the human- and chatbot-generated responses, evaluating them for correctness, comprehensiveness, usefulness, plausibility, coherence and “humanness,” and assigned each response an overall score out of nine.
The LLMs matched or exceeded the trainee doctors' scores. According to the abstract, the trainee doctors earned a median overall score of 4 (IQR 3-6). Bing also scored 4 (IQR 3-5), Bard scored 6 (IQR 5-7) and ChatGPT scored highest at 7 out of 9 (IQR 6-8.25). Additionally, the group of experts consistently judged ChatGPT's responses, but not Bing's or Bard's, to be human-like.
Dr. Narayanan said this study is the first to compare LLM performance to trainee doctors in scenarios based on real-life clinical practice. The results suggest that artificial intelligence (AI) tools could be beneficial in clinical applications and might even help ease the burden on health care workers.
“We have not directly tested how LLMs would work in patient-facing roles. However, they could be used by triage nurses, trainee doctors and primary care physicians, who are often the first to review a patient,” he said.
Next steps include assessing the chatbots against senior doctors as well as exploring the potential of newer and more advanced LLMs.
“This is a fascinating study. It is encouraging, but maybe also a bit scary, to see how a widely available AI tool like ChatGPT can provide solutions to complex cases of respiratory illness in children. It certainly points the way to a brave new world of AI-supported care,” said Hilary Pinnock, ERS Council Chair and professor of Primary Care Respiratory Medicine at the University of Edinburgh, who was not involved with the study.
“However, as the researchers point out, before we start to use AI in routine clinical practice, we need to be confident that it will not create errors either through ‘hallucinating’ fake information or because it has been trained on data that does not equitably represent the population we serve,” she said. “As the researchers have demonstrated, AI holds out the promise of a new way of working, but we need extensive testing of clinical accuracy and safety, pragmatic assessment of organizational efficiency and exploration of the societal implications before we embed this technology in routine care.”