New Study Reveals High Rates of Fabricated and Inaccurate Citations in LLM-Generated Mental Health Research

On November 17, 2025

(Toronto, November 17, 2025) A new study published in the peer-reviewed journal JMIR Mental Health by JMIR Publications highlights a critical risk in researchers' growing use of large language models (LLMs) such as GPT-4o: the frequent fabrication and inaccuracy of bibliographic citations. The findings underscore an urgent need for rigorous human verification and institutional safeguards to protect research integrity, particularly in specialized and less well-known fields within mental health.

Nearly 1 in 5 Citations Fabricated by GPT-4o in Literature Reviews

The article, titled “Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models: Experimental Study,” found that 19.9% of all citations generated by GPT-4o across six simulated literature reviews were entirely fabricated, meaning they could not be traced to any real publication. Furthermore, among the seemingly real citations, 45.4% contained bibliographic errors, most commonly incorrect or invalid Digital Object Identifiers (DOIs).

The research is timely: academic journals have recently encountered seemingly AI-hallucinated references in submissions. These bibliographic hallucinations and errors are not mere formatting issues; they break the chain of verifiability, mislead readers, and compromise the integrity and trustworthiness of scientific results and the cumulative knowledge base. Careful scrutiny and verification are therefore paramount to safeguarding academic rigor.

Reliability Varies by Topic Familiarity and Specificity

The research, conducted by Jake Linardon, PhD, of Deakin University, and colleagues, systematically tested the reliability of GPT-4o's output across mental health topics with varying levels of public awareness and scientific maturity: major depressive disorder (high familiarity), binge eating disorder (moderate), and body dysmorphic disorder (low). They also tested general versus specialized review prompts (e.g., focusing on digital interventions).

  • Fabrication Risk is Highest for Less Familiar Topics: Fabrication rates were significantly higher for topics with lower public familiarity and research coverage, such as binge eating disorder (28%) and body dysmorphic disorder (29%), compared to major depressive disorder (6%).
  • Specialized Topics Pose a Higher Risk: While not universally true, stratified analysis showed that fabrication rates were significantly higher for specialized reviews (e.g., evidence for digital interventions) compared to general overviews for certain disorders, such as binge eating disorder.
  • Overall Inaccuracy is Pervasive: In total, nearly two-thirds of all citations generated by GPT-4o were either fabricated or contained errors, indicating a major reliability issue.

Urgent Call for Human Oversight and New Safeguards

The study’s conclusions issue a strong warning to the academic community: Citation fabrication and errors remain common in GPT-4o outputs. The authors stress that the reliability of LLM-generated citations is not fixed but is contingent on the topic and the way the prompt is designed.

Key Implications Highlighted in the Study:

  • Rigorous Verification is Mandatory: Researchers and students must subject all LLM-generated references to careful human verification to validate their accuracy and authenticity.
  • Journal and Institutional Role: Journal editors and publishers must implement stronger safeguards, potentially using detection software that flags citations that do not match existing sources, signaling a potential hallucination.
  • Policy and Training: Academic institutions must develop clear policies and training to equip users with the skills to critically assess LLM outputs and to design strategic prompts, especially when exploring less visible or highly specialized research topics.
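The verification workflow the authors call for can be partly automated. The following is a minimal illustrative sketch, not the study's own tooling: it screens a DOI for well-formedness offline, then checks whether the DOI is actually registered by querying the public Crossref REST API (`api.crossref.org/works/{doi}`). The function names are the author's own inventions for this example; a fabricated citation will typically fail the second check even when its DOI looks plausible.

```python
import re
import json
import urllib.request

# Rough DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

def doi_is_wellformed(doi: str) -> bool:
    """Cheap offline first pass: reject strings that cannot be DOIs at all."""
    return bool(DOI_PATTERN.match(doi.strip()))

def doi_resolves(doi: str, timeout: float = 10.0) -> bool:
    """Online second pass: ask Crossref whether the DOI is actually registered.

    Returns False on HTTP 404 (unregistered DOI), network errors, or a
    malformed response -- all of which warrant manual review of the citation.
    """
    url = f"https://api.crossref.org/works/{doi}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            record = json.load(resp)
        return record.get("status") == "ok"
    except Exception:
        return False

if __name__ == "__main__":
    # The DOI of the study itself, taken from the citation below.
    print(doi_is_wellformed("10.2196/80371"))  # True: structurally valid
    print(doi_is_wellformed("not-a-doi"))      # False: fails the pattern
```

A registered DOI is necessary but not sufficient: a hallucinated citation can also attach a real DOI to the wrong paper, so a fuller check would compare the returned Crossref metadata (title, authors, year) against the citation text.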

Original article:

Linardon J, Jarman H, McClure Z, Anderson C, Liu C, Messer M. Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models: Experimental Study. JMIR Ment Health 2025;12:e80371. URL: https://mental.jmir.org/2025/1/e80371

DOI: 10.2196/80371

About JMIR Publications

JMIR Publications is a leading open access publisher of digital health research and a champion of open science. With a focus on author advocacy and research amplification, JMIR Publications partners with researchers to advance their careers and maximize the impact of their work. As a technology organization with publishing at its core, we provide innovative tools and resources that go beyond traditional publishing, supporting researchers at every step of the dissemination process. Our portfolio features a range of peer-reviewed journals, including the renowned Journal of Medical Internet Research.

To learn more about JMIR Publications, please visit jmirpublications.com or connect with us via X, LinkedIn, YouTube, Facebook, and Instagram.

Head office: 130 Queens Quay East, Unit 1100, Toronto, ON, M5A 0P6 Canada


Media Contact:

Dennis O’Brien, Vice President, Communications & Partnerships. JMIR Publications. communications@jmir.org

The content of this communication is licensed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, published by JMIR Publications, is properly cited.
