A new study finds that up to half of AI-generated medical answers are flawed, raising concerns that confident, polished responses may conceal inaccuracies and unreliable sourcing.
AI chatbots are increasingly being used for medical advice, offering fast, confident responses to complex health questions.
But new research suggests those answers may be far less reliable than they appear.
Flawed but fluent
A study published in BMJ Open found that many popular AI chatbots frequently produce misleading or inaccurate medical information. According to the research, nearly half of all responses were classified as problematic, even when they appeared polished and authoritative.
The study tested five widely used tools—ChatGPT, Gemini, Grok, Meta AI and DeepSeek—by asking 50 health-related questions covering topics such as cancer, vaccines, nutrition and athletic performance.
Two medical experts reviewed each response. They found that around 20% were highly problematic, while only a minority avoided significant issues altogether.
Accuracy varies widely
Performance differed depending on the subject matter.
Chatbots handled structured, well-researched areas like vaccines and cancer relatively well, though they still produced incorrect or misleading answers roughly a quarter of the time.
They struggled most with topics like nutrition and fitness, where online information is often inconsistent or poorly evidenced.
Open-ended questions proved especially challenging. About 32% of those responses were rated highly problematic, compared to just 7% for more straightforward, closed questions.
The illusion of credibility
One of the study’s most concerning findings involved references.
When asked to provide scientific sources, chatbots frequently produced incomplete, incorrect or entirely fabricated citations. The median accuracy score for references was just 40%, and none of the systems consistently generated fully reliable lists.
This creates a false sense of authority, as neatly formatted citations can make inaccurate information appear trustworthy to users.
Why errors happen
Researchers say the issue lies in how these systems work.
Large language models do not verify facts. Instead, they generate responses based on patterns in their training data, which can include both reliable research and low-quality online content.
The study deliberately used “red teaming” techniques—questions designed to expose weaknesses—so the results show how chatbots perform under pressure. But they also mirror real-world use, where people often ask vague or leading questions.
A broader pattern
Other recent studies reinforce these concerns.
Research published in Nature Medicine found that while chatbots can sometimes arrive at correct answers, users often misinterpret them, resulting in accuracy rates below 35% in practice.
Another study in JAMA Network Open showed AI systems struggled to suggest correct diagnoses when given limited information, failing more than 80% of the time.
Meanwhile, findings in Nature Communications Medicine revealed that chatbots can repeat and expand on entirely fabricated medical concepts.
Use with caution
Despite these limitations, experts say AI tools can still be useful when used appropriately.
They can help summarise complex topics or assist users in preparing questions for healthcare professionals. However, the study warns against treating chatbots as standalone medical authorities.
Users are advised to verify claims, treat references with skepticism and be cautious of answers that sound confident but lack nuance or disclaimers.
Sources: BMJ Open, Nature Medicine, JAMA Network Open, Nature Communications Medicine