New Study Puts Claude3 and GPT-4 up Against a Medical Knowledge Pressure Test

Using a dataset of objective, evidence-based core medical knowledge questions drawn from Kahun’s proprietary Knowledge Graph, the world’s largest map of medical knowledge, Claude3 surpassed GPT-4 in accuracy, but human medical experts outperformed both AI models

Kahun, the evidence-based clinical AI engine for healthcare providers, shares the findings of a new study on the medical capabilities of readily available large language models (LLMs). The study compared the medical accuracy of OpenAI’s GPT-4 and Anthropic’s Claude3-Opus against each other and against human medical experts, using questions grounded in objective medical knowledge drawn from Kahun’s Knowledge Graph. The study revealed that Claude3 edged out GPT-4 on accuracy, but both fell well short of human medical experts and of the objective medical knowledge benchmark. Both LLMs answered about a third of the questions incorrectly, and GPT-4 answered almost half of the numerical questions incorrectly.

According to a recent survey, 91 percent of physicians expressed concern about how to choose the right generative AI model and said they need to know that a model’s source materials were created by doctors or medical experts before using it. Physicians and healthcare organizations are already using AI for administrative tasks, but ensuring the accuracy and safety of these models for clinical tasks requires addressing the limitations of generative AI.

By leveraging its proprietary Knowledge Graph, a structured representation of scientific facts from peer-reviewed sources, Kahun was uniquely positioned to lead a collaborative study on the current capabilities of two popular LLMs: GPT-4 and Claude3. Drawing on data from more than 15,000 peer-reviewed articles, Kahun generated 105,000 evidence-based medical QAs (questions and answers), classified as numerical or semantic and spanning multiple health disciplines, which were input directly into each LLM.

Numerical QAs correlate findings from a single source for a specific query (e.g., the prevalence of dysuria in female patients with urinary tract infections), while semantic QAs involve differentiating entities in specific medical queries (e.g., selecting the most common subtypes of dementia). Critically, Kahun led the research team by providing the basis for evidence-based QAs that resemble the short, single-line questions a physician might ask themselves in everyday medical decision-making.
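As a rough illustration of what posing such single-line QAs directly to each model could look like, here is a minimal Python sketch using the publicly documented OpenAI and Anthropic SDKs; the example questions mirror the ones above, but the model IDs, prompt handling, and helper functions are assumptions for illustration, not the study’s actual pipeline.

```python
# Hypothetical sketch: posing one numerical and one semantic QA to GPT-4 and Claude3.
# Question wording, model IDs, and answer handling are illustrative assumptions only.
from openai import OpenAI
from anthropic import Anthropic

openai_client = OpenAI()        # reads OPENAI_API_KEY from the environment
anthropic_client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

QAS = [
    # Numerical QA: a statistic correlating a finding with a condition
    "What is the prevalence of dysuria in female patients with urinary tract infections?",
    # Semantic QA: differentiating entities within a medical query
    "Which is the most common subtype of dementia?",
]

def ask_gpt4(question: str) -> str:
    resp = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def ask_claude3(question: str) -> str:
    resp = anthropic_client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=256,
        messages=[{"role": "user", "content": question}],
    )
    return resp.content[0].text

for q in QAS:
    print(q)
    print("  GPT-4:  ", ask_gpt4(q))
    print("  Claude3:", ask_claude3(q))
```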

Analyzing more than 24,500 QA responses, the research team discovered these key findings:

  1. Claude3 and GPT-4 both performed better on semantic QAs (68.7 and 68.4 percent, respectively) than on numerical QAs (63.7 and 56.7 percent, respectively), with Claude3 outperforming on numerical accuracy.
  2. Each LLM generated different outputs on a prompt-by-prompt basis, underscoring that the same QA prompt can produce markedly different results from one model to the other.
  3. For validation purposes, six medical professionals answered 100 numerical QAs and surpassed both LLMs with 82.3 percent accuracy, compared to Claude3’s 64.3 percent accuracy and GPT-4’s 55.8 percent on the same questions.
  4. Kahun’s research shows that both Claude3 and GPT-4 perform well on semantic questioning, but it ultimately supports the case that general-use LLMs are not yet well enough equipped to serve as reliable information assistants for physicians in a clinical setting.
  5. The study included an “I do not know” option to reflect situations where a physician has to admit uncertainty, and it found different answer rates for each LLM (numerical: Claude3 63.66 percent, GPT-4 96.4 percent; semantic: Claude3 94.62 percent, GPT-4 98.31 percent). However, there was no significant correlation between accuracy and answer rate for either LLM, suggesting that their ability to admit a lack of knowledge is questionable (the sketch below illustrates how these two metrics interact). This indicates that, without prior knowledge of both the medical field and the model, the trustworthiness of LLM answers is doubtful.
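To make the answer-rate finding concrete, the following toy Python sketch shows one common way to compute answer rate and accuracy when an “I do not know” option is available; the invented responses and the convention of scoring accuracy only over answered questions are assumptions, not figures or definitions taken from the study.

```python
# Toy illustration of answer rate vs. accuracy with an "I do not know" option.
# The responses below are invented; only the metric definitions are shown.
responses = [
    {"answered": True,  "correct": True},
    {"answered": True,  "correct": False},
    {"answered": False, "correct": False},  # model said "I do not know"
    {"answered": True,  "correct": True},
]

answered = [r for r in responses if r["answered"]]
answer_rate = len(answered) / len(responses)                    # share of questions attempted
accuracy = sum(r["correct"] for r in answered) / len(answered)  # correctness among attempts

print(f"answer rate: {answer_rate:.1%}, accuracy: {accuracy:.1%}")
# A model can keep a high answer rate while accuracy stays flat, which is why a weak
# correlation between the two raises doubts about its ability to admit uncertainty.
```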

The QAs were extracted from Kahun’s proprietary Knowledge Graph, which comprises over 30 million evidence-based medical insights from peer-reviewed medical publications and sources and encompasses the complex statistical and clinical connections in medicine. Kahun’s AI Agent solution allows medical professionals to ask case-specific questions and receive clinically grounded answers referenced to the medical literature. By grounding its answers in evidence-based knowledge and protocols, the AI Agent strengthens physicians’ trust, improving overall efficiency and quality of care. The company’s solution addresses the limitations of current generative AI models by providing factual insights grounded in medical evidence, ensuring the consistency and clarity essential to medical knowledge dissemination.

“While it was interesting to note that Claude3 was superior to GPT-4, our research showcases that general-use LLMs still don’t measure up to medical professionals in interpreting and analyzing medical questions that a physician encounters daily. However, these results don’t mean that LLMs can’t be used for clinical questions. In order for generative AI to be able to live up to its potential in performing such tasks, these models must incorporate verified and domain-specific sources in their data,” says Michal Tzuchman Katz, MD, CEO and Co-Founder of Kahun. “We’re excited to continue contributing to the advancement of AI in healthcare with our research and through offering a solution that provides the transparency and evidence essential to support physicians in making medical decisions.”

The full preprint draft of the study can be found here: https://arxiv.org/abs/2406.03855.
