Can AI answer medical questions better than your doctor?
A second look at a study rating quality and empathy when answering patient questions.
Last year, headlines describing studies of artificial intelligence (AI) were eye-catching, to say the least.
At first glance, the idea that a chatbot using AI might be able to generate good answers to patient questions isn't surprising. After all, ChatGPT boasts an impressive and ever-expanding list of capabilities.
But showing more empathy than your doctor? Ouch. Before assigning final honors on quality and empathy to either side, let's take a second look.
What tasks is AI taking on in health care?
Already, a rapidly growing list of AI applications in health care includes drafting doctor's notes, suggesting diagnoses, helping to read x-rays and MRI scans, and monitoring real-time health data such as heart rate or oxygen level.
But the idea that AI-generated answers might be more empathetic than actual physicians struck me as amazing — and sad. How could even the most advanced machine outperform a physician in demonstrating this important and particularly human virtue?
Can AI deliver good answers to patient questions?
It's an intriguing question.
Imagine you've called your doctor's office with a question about one of your medications. Later in the day, a clinician on your health team calls you back to discuss it.
Now, imagine a different scenario: you ask your question by email or text, and within minutes receive an answer generated by a computer using AI. How would the medical answers in these two situations compare in terms of quality? And how might they compare in terms of empathy?
To answer these questions, researchers collected 195 questions posed by anonymous users on an online social media forum, along with the answers written by physicians who volunteer there. The same questions were then submitted to ChatGPT, and the chatbot's answers were collected.
A panel of three physicians or nurses then rated both sets of answers for quality and empathy. Panelists were asked "which answer was better?" on a five-point scale. The rating options for quality were: very poor, poor, acceptable, good, or very good. The rating options for empathy were: not empathetic, slightly empathetic, moderately empathetic, empathetic, and very empathetic.
What did the study find?
The results weren't even close. For nearly 80% of answers, ChatGPT was considered better than the physicians.
- Good or very good quality answers: ChatGPT received these ratings for 78% of responses, while physicians received them for only 22% of responses.
- Empathetic or very empathetic answers: ChatGPT scored 45% and physicians 4.6%.
Notably, the length of the answers was much shorter for physicians (average of 52 words) than for ChatGPT (average of 211 words).
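To make the kind of comparison described above concrete, here is a minimal sketch of how panel ratings could be aggregated into "good or very good" percentages. The data below is entirely hypothetical and illustrative; it is not the study's data or code, and the 4-out-of-5 threshold is an assumption for the example.

```python
# Hypothetical sketch of aggregating panel ratings -- NOT the study's data.
from statistics import mean

# Each answer receives three panelist ratings on a 1-5 quality scale
# (1 = very poor ... 4 = good, 5 = very good). Toy values for illustration.
ratings = {
    "chatgpt": [[5, 4, 4], [3, 4, 5], [4, 5, 5]],
    "physician": [[2, 3, 3], [4, 3, 2], [3, 3, 4]],
}

def share_good_or_better(answer_ratings):
    """Fraction of answers whose mean panel rating is at least 4 ('good')."""
    scores = [mean(panel) for panel in answer_ratings]
    return sum(score >= 4 for score in scores) / len(scores)

for source, panels in ratings.items():
    print(f"{source}: {share_good_or_better(panels):.0%} rated good or very good")
```

With these toy numbers, all three ChatGPT answers and none of the physician answers clear the threshold; the study reported 78% versus 22% across its 195 real exchanges.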
Like I said, not even close. So, were all those breathless headlines appropriate after all?
Not so fast: Important limitations of this AI research
The study wasn't designed to answer two key questions:
- Do AI responses offer accurate medical information and improve patient health while avoiding confusion or harm?
- Will patients accept the idea that questions they pose to their doctor might be answered by a bot?
And it had some serious limitations:
- Evaluating and comparing answers: The evaluators applied untested, subjective criteria for quality and empathy. Importantly, they did not assess the actual accuracy of the answers. Nor were answers assessed for fabrication (sometimes called "hallucination"), a problem that has been noted with ChatGPT.
- The difference in length of answers: More detailed answers might seem to reflect patience or concern. So, higher ratings for empathy might be related more to the number of words than true empathy.
- Incomplete blinding: To minimize bias, the evaluators weren't supposed to know whether an answer came from a physician or ChatGPT. This is a common research technique called "blinding." But AI-generated communication does not always sound exactly like a human, and the AI answers were significantly longer. So, it's likely that for at least some answers, the evaluators were not blinded.
The bottom line
Could physicians learn something about expressions of empathy from AI-generated answers? Possibly. Might AI work well as a collaborative tool, generating responses that a physician reviews and revises? Quite possibly.
But it seems premature to rely on AI answers to patient questions without solid proof of their accuracy and actual supervision by healthcare professionals. This study wasn't designed to provide either.
And by the way, ChatGPT agrees: I asked it if it could answer medical questions better than a doctor. Its answer was no.
We'll need more research to know when it's time to set the AI genie free to answer patients' questions. We may not be there yet — but we're getting closer.
Want more information about the research? Read the study itself, which includes sample exchanges, such as answers to a concern about the consequences of swallowing a toothpick.
About the Author

Robert H. Shmerling, MD, Senior Faculty Editor, Harvard Health Publishing; Editorial Advisory Board Member, Harvard Health Publishing
Disclaimer:
As a service to our readers, Harvard Health Publishing provides access to our library of archived content. Please note the date of last review or update on all articles.
No content on this site, regardless of date, should ever be used as a substitute for direct medical advice from your doctor or other qualified clinician.