How often have you asked ChatGPT for health advice? Maybe about a mysterious rash or that tightening in your right calf after a long run. I have, on both counts. ChatGPT even correctly diagnosed the mysterious rash from my first Boston winter as cold urticaria, a week before my doctor confirmed it.
More than 230 million people ask ChatGPT health-related questions every week, according to OpenAI. While people have been plugging their health anxieties into the internet since its earliest days, what’s changed is the interface: instead of scrolling through endless search results, you can now have what feels like a personal conversation.
In the past week, two of the biggest AI companies went all-in on that reality. OpenAI launched ChatGPT Health, a dedicated space within its larger chat interface where users can connect their medical records, Apple Health data, and stats from other fitness apps to get personalized responses. (It’s currently available to a small group of users, but the company says it will eventually be open to all users.) Just days later, Anthropic announced a similar consumer-facing tool for Claude, alongside a host of others geared toward health care professionals and researchers.
Both consumer-facing AI tools come with disclaimers — not intended for diagnosis, consult a professional — that are likely crafted for liability reasons. But those warnings won’t stop the hundreds of millions already using chatbots to understand their symptoms.
If anything, these companies have it backward: diagnosis is where AI excels, and several studies suggest it’s one of the best use cases for the technology. At the same time, there are real trade-offs — around data privacy and AI’s tendency to people-please — that are worth understanding before you connect your medical records to a chatbot.
The good news (sort of)
Let’s start with what AI is actually good at: diagnosis.
Diagnosis is largely pattern-matching, which is essentially what AI models are trained to do. All an AI model has to do is take in symptoms or data, match them to known conditions, and arrive at an answer. These are patterns doctors have validated over decades — these symptoms mean this disease, this kind of image shows that condition. AI has been trained on millions of these labeled cases, and it shows.
In a 2024 study, GPT-4 — OpenAI’s leading model at the time — achieved diagnostic accuracy above 90 percent on complex clinical cases, such as patients presenting with atypical lacy rashes. Meanwhile, human physicians using conventional resources scored around 74 percent. In a separate study published this year, top models outperformed doctors at identifying rare conditions from images — including aggressive skin cancers, birth defects, and internal bleeding — sometimes by margins of 20 percent or more.
Treatment is where things get murky. Clinicians have to consider the right drug, but also try to figure out whether the patient will actually take it. The twice-daily pill might work better, but will they remember to take both doses? Can they afford it? Do they have transportation to the infusion center? Will they follow up?
These are human questions, dependent on context that doesn’t live in training data. And of course, a large language model can’t actually prescribe you anything, nor does it have the reliable memory you’d need in longer-term case management.
“Management often has no right answers,” said Adam Rodman, a physician at Beth Israel Deaconess Medical Center in Boston and a professor at Harvard Medical School. “It’s harder to train a model to do that.”
But OpenAI and Anthropic aren’t marketing diagnostic tools. They’re marketing something more vague: AI as a personal health analyst. Both ChatGPT Health and Claude now let you connect Apple Health, Peloton, and other fitness trackers. The promise is that AI can analyze your sleep, movement, and heart rate over time — and surface meaningful trends out of all that disparate data.
One problem: there’s no published independent research showing it can. The AI might observe that your resting heart rate is climbing or that you sleep worse on Sundays. But observing a trend isn’t the same as knowing what it means — and no one has validated which trends, if any, predict real health outcomes. “It’s going on vibes,” Rodman said.
Both companies have tested their products on internal benchmarks — OpenAI developed HealthBench, built with hundreds of physicians, which tests how models explain lab results, prepare users for appointments, and interpret wearable data.
But HealthBench relies on synthetic conversations, not real patient interactions. And it’s text-only, meaning it doesn’t test what happens when you actually upload your Apple Health data. Also, the average conversation is just 2.6 exchanges, far from the anxious back-and-forth a worried user might have over days.
This doesn’t mean ChatGPT or Claude’s new health features are useless. They might help you notice trends in your habits, the way a migraine diary helps people spot triggers. But it’s not validated science at this point, and it’s worth knowing the difference.
The real risks
The more important question is what AI can actually do with your health data, and what you’re risking when you use these tools.
The health conversations are stored separately, OpenAI says, and, unlike most other chatbot interactions, their contents are not used to train models. But neither ChatGPT Health nor Claude’s consumer-facing health features are covered by HIPAA, the law that protects information you share with doctors and insurers. (OpenAI and Anthropic do offer enterprise software to hospitals and insurers that is HIPAA-compliant.)
In the case of a lawsuit or criminal investigation, the companies would have to comply with a court order. Sara Geoghegan, senior counsel at the Electronic Privacy Information Center, told The Record that sharing medical records with ChatGPT could effectively strip those records of HIPAA protection.
At a time when reproductive care and gender-affirming care are under legal threat in multiple states, that’s not an abstract worry. If you’re asking a chatbot questions about either — and connecting your medical records — you’re creating a data trail that could be subpoenaed.
Additionally, AI models aren’t neutral stores of information. They have a documented tendency to tell you what you want to hear. If you’re anxious about a symptom — or fishing for reassurance that it’s nothing serious — the model can pick up on your tone and possibly adjust its response in a way a human doctor is trained not to do.
Both companies say they have trained their health models to explain information and flag when something warrants a doctor’s visit, rather than simply agreeing with users. Newer models are more likely to ask follow-up questions when uncertain. But it remains to be seen how they perform in real-world situations.
And sometimes the stakes are higher than a missed diagnosis.
A preprint published in December tested 31 leading AI models, including those from OpenAI and Anthropic, on real-world medical cases and found that the worst-performing model made recommendations with a potential for life-threatening harm in about 1 out of every 5 scenarios. A separate study of an OpenAI-powered clinical decision support tool used in Kenyan primary care clinics found that when the AI made a rare harmful suggestion (in about 8 percent of cases), clinicians adopted the bad advice nearly 60 percent of the time.
These aren’t theoretical concerns. Two years ago, a California teenager named Sam Nelson died after asking ChatGPT to help him use recreational drugs safely. Cases like his are rare, and mistakes by human physicians are real — tens of thousands of people die each year because of medical errors. But these stories show what can happen when people trust AI with high-stakes decisions.
So should you use it?
It would be easy to read all this and conclude that you should never ask a chatbot a health question. But that ignores why millions of people already do.
The average wait for a primary care appointment in the US is now 31 days — and in some cities, like Boston, it’s over two months. When you do get in, the visit lasts about 18 minutes. According to OpenAI, 7 in 10 health-related ChatGPT conversations happen outside clinic hours.
Chatbots, by comparison, are available 24/7, and “they’re infinitely patient,” said Rodman. They’ll answer the same question five different ways. For a lot of people, that’s more than they get from the health care system.
So should you use these tools? There’s no single answer. But here’s a framework: AI is good at explaining things like lab results, medical terminology, or what questions to ask your doctor. It’s unproven at finding meaningful trends in your wellness data. And it’s not a substitute for a diagnosis from someone who can actually examine you.