Originally published by Forbes on May 16, 2023
Munjal Shah envisions a future where everyone has access to a nutritionist, a genetics counselor and a health insurance billing specialist at the touch of a button. None of them, however, will be human – they will all be voice or text chatbots. These bots, he says, will answer patients' questions and provide guidance with one major caveat: they won't diagnose medical conditions (at least not yet).
“We're forecasted to have a 3 million person gap of total healthcare workers in the next few years,” says Shah. “We believe one of the biggest risks to the quality of care in America is actually staffing and having enough staffing. We just need to fill that and we need to use technology to help us.”
Shah and seven cofounders have raised a $50 million seed round from General Catalyst and Andreessen Horowitz to develop the large language model that will power all of these different healthcare bots. They’re calling the Palo Alto-based startup Hippocratic AI in a nod to the code of ethics doctors take. That code, based on writings attributed to the ancient Greek physician Hippocrates, is often summarized as “do no harm.”
But generative AI models can't swear to uphold ethical codes, and, as the viral chatbot ChatGPT has demonstrated, can also produce false information in response to questions. Regulators have vowed to take a closer look at their use in healthcare, with FDA Commissioner Robert Califf saying he sees the "regulation of large language models as critical to our future" at a conference earlier this month.
While the future regulatory landscape is unclear, Shah says Hippocratic AI is taking a three-pronged approach for testing its large language model in healthcare settings, which involves passing certifications, training with human feedback and testing for what the company calls “bedside manner.” Rather than give health system customers access to the entire model, Shah says Hippocratic AI is planning to provide access to different healthcare “roles,” which will be released when a given role has achieved a certain level of “performance and safety.” One key measuring stick will be the licensing exams and certifications that a human would have to pass in order to operate in that role.
That approach is one of the reasons Julie Yoo, general partner at Andreessen Horowitz, decided to invest. “[It] takes a lot more initial rigor and heft on the build side to get right, versus just building a prototype and throwing it over the fence as you would with a typical enterprise software company,” says Yoo, whose firm invested in Shah’s previous company Health IQ. That company used AI to match seniors with Medicare plans based on their health records.
Future doctors spend years painstakingly preparing for a series of national medical licensing exams that test their knowledge garnered from books, lectures and hands-on experience. In April, Google said its medical large language model Med-PaLM 2 reached 85.4% accuracy on the U.S. Medical Licensing Exam, while Microsoft and OpenAI said GPT-4, which is trained on public internet data, achieved an 86.65%. Shah says each company is running a subset of the exam (and the models may not be answering the same questions), so it’s hard to compare, but Hippocratic AI’s model beat GPT-4 by 0.43% on text-based questions when they tried to approximate the same subset.
Shah says Hippocratic AI tested its model against GPT-4 on 114 different benchmarks, including exams and certifications used for doctors, nurses, dentists, pharmacists, audiologists and medical coders, among others. Hippocratic beat GPT-4 on 105, tied on six and lost on three.
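Those head-to-head numbers are internally consistent, as a quick tally shows. The figures below are just the totals reported above; this is an illustrative check, not anything from Hippocratic AI's own evaluation code.

```python
# Tally of the reported head-to-head results against GPT-4:
# 105 wins, 6 ties and 3 losses across 114 benchmarks.
results = {"wins": 105, "ties": 6, "losses": 3}

total = sum(results.values())
win_rate = results["wins"] / total

print(f"Total benchmarks: {total}")  # 114, matching the reported count
print(f"Win rate: {win_rate:.1%}")   # roughly 92% of benchmarks won
```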
But this gets to the bigger question of what exactly is captured when a machine takes a test and what test-taking suggests about human equivalence. Shah acknowledged that test-taking was “necessary but not sufficient” when it comes to deploying these models in real-world settings. He declined to name any of the specific healthcare datasets Hippocratic is trained on.
“When humans take these kinds of exams, we're making all kinds of assumptions,” says Curt Langlotz, a professor of radiology and medical informatics and director of the Center for Artificial Intelligence in Medicine and Imaging at Stanford, who is not affiliated with Hippocratic AI. The assumptions are that the human has gone to college and medical school and has clinical training and experience. “These language models are a different kind of intelligence. They are both a lot smarter than we are and a lot dumber than we are,” he says. They are trained on enormous troves of data but also have the potential to “hallucinate,” generating false answers and making simple math errors.
One of the other guardrails that Hippocratic AI plans to implement is using real humans to refine the model's answers, a technique known as reinforcement learning from human feedback. This means that for a given role, say dietician, Hippocratic AI will have human dieticians rank its answers and adjust accordingly. The company will also continue to develop a set of benchmarks it's calling "bedside manner," which involves scoring the AI model on performance metrics like empathy and compassion.
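The ranking step described above is typically converted into pairwise preference data before it is used to train a reward model. The sketch below is a generic illustration of that conversion, under the assumption that each human ranker returns an ordered list of candidate answers; the function name and answer labels are hypothetical, not Hippocratic AI's actual pipeline.

```python
from itertools import combinations

def rankings_to_preference_pairs(ranked_answers):
    """Convert one ranker's ordered list of candidate answers (best
    first) into (preferred, rejected) pairs - the usual training input
    for a reward model in reinforcement learning from human feedback."""
    return [(better, worse) for better, worse in combinations(ranked_answers, 2)]

# A hypothetical dietician ranks three model answers, best first:
ranked = ["answer_a", "answer_b", "answer_c"]
pairs = rankings_to_preference_pairs(ranked)
# Every earlier answer is preferred over every later one:
# [("answer_a", "answer_b"), ("answer_a", "answer_c"), ("answer_b", "answer_c")]
```

In practice the reward model trained on such pairs then scores new model outputs, steering the language model toward answers the human experts ranked highly.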
“The same techniques that are helpful for improving the communication of information … are useful for recognizing when a model doesn't know or when a model shouldn't answer,” says David Sontag, a professor of electrical engineering and computer science at MIT, who is not affiliated with Hippocratic AI and is working on his own stealth startup. He gives the example of a scenario where the right answer should be to tell the patient to call 911. Training the model not to answer is an important part of the reinforcement learning process, he says.
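Sontag's 911 example describes a behavior that, per the article, would be learned through reinforcement learning rather than hard-coded. Still, the interface such a guardrail presents can be sketched simply: check the patient's message and, if it looks like an emergency, return an escalation message instead of letting the model answer. The keyword list and function name below are hypothetical placeholders for illustration only.

```python
# Hypothetical emergency phrases; a real system would use a learned
# classifier or the trained model's own refusal behavior, not keywords.
EMERGENCY_KEYWORDS = {"chest pain", "can't breathe", "overdose", "suicidal"}

def guardrail_check(patient_message: str):
    """Return an escalation message when the input suggests an
    emergency; return None to let the chatbot respond normally."""
    text = patient_message.lower()
    if any(keyword in text for keyword in EMERGENCY_KEYWORDS):
        return "This sounds like an emergency. Please call 911 right away."
    return None

print(guardrail_check("I have crushing chest pain"))   # escalation message
print(guardrail_check("What foods are high in iron?")) # None: model answers
```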
Hippocratic AI will use healthcare workers to train its models and plans to work closely with health system customers during development, since their patients will be the end users. While the company is not announcing any customers yet, Hemant Taneja, CEO and managing director at General Catalyst, said there's a "ton of interest" across the different health systems his firm works with. "To solve the workforce shortage problem, and by unleashing that human potential at greater scale, you can make it more affordable for more and more people," he says. "I think it's a huge health equity play."