OpenAI Study: Models Give Wrong Answers 75% of the Time When Trained to Guess Rather Than Express Uncertainty

GigaNectar Team


Language models like ChatGPT often confidently state incorrect facts – a problem known as “hallucination.” This issue frustrates users who rely on AI for accurate information, but new research from OpenAI sheds light on why these errors persist and how they might be fixed. 

The False Birthday Problem 

When researchers asked a popular AI about Adam Kalai’s birthday (one of the paper’s authors), it confidently gave three different incorrect dates on separate occasions. Similarly, when asked about his PhD dissertation title, multiple AI systems produced completely fabricated – yet convincing – answers. 

These aren’t random glitches. According to OpenAI’s new research paper, hallucinations stem from two key factors: statistical limitations in training and misaligned evaluation methods. 

The Test-Taking Problem 

The core issue resembles how students approach exams. When uncertain about an answer on a multiple-choice test, students often guess rather than leave a blank, because guessing offers a chance at points while blanks guarantee zero. 

Similarly, AI systems are evaluated using benchmarks that reward accuracy (percentage of correct answers) but don’t penalize wrong answers more than abstentions. This creates a powerful incentive for models to guess rather than admit uncertainty. 

As OpenAI explains: “If a language model is asked for someone’s birthday but doesn’t know, and it guesses ‘September 10,’ it has a 1-in-365 chance of being right. Saying ‘I don’t know’ guarantees zero points.” 
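A quick back-of-the-envelope calculation (a hypothetical sketch, not taken from the paper) makes that incentive concrete:

```python
# Expected score under accuracy-only grading: correct = 1, wrong = 0, abstain = 0.
# A blind guess at a birthday has roughly a 1-in-365 chance of being right.

p_correct_guess = 1 / 365

expected_score_guess = p_correct_guess * 1 + (1 - p_correct_guess) * 0   # ~0.0027
expected_score_abstain = 0.0                                             # "I don't know" scores nothing

print(f"Guessing:   {expected_score_guess:.4f}")
print(f"Abstaining: {expected_score_abstain:.4f}")
# Under accuracy-only scoring, guessing always has the higher expected score,
# no matter how unlikely the guess is to be right.
```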

The Numbers Tell the Story 

This incentive problem appears clearly in evaluation data. OpenAI compared two models on the SimpleQA test:

  • The newer GPT-5 model abstained 52% of the time (saying “I don’t know”), gave correct answers 22% of the time, and wrong answers 26% of the time
  • An older model rarely abstained (1%), had slightly better accuracy (24%), but produced wrong answers 75% of the time 

While they scored similarly on accuracy, the hallucination rates differed dramatically. Yet most leaderboards focus solely on accuracy metrics, encouraging development of models that guess rather than express uncertainty. 
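To see why accuracy alone hides the difference, here is a minimal sketch that recomputes both metrics from the figures quoted above (the dictionary keys are just shorthand labels for the two systems described, not official model names):

```python
# SimpleQA figures quoted above, as fractions of all questions.
models = {
    "GPT-5 (newer)": {"correct": 0.22, "wrong": 0.26, "abstain": 0.52},
    "older model":   {"correct": 0.24, "wrong": 0.75, "abstain": 0.01},
}

for name, r in models.items():
    attempted = r["correct"] + r["wrong"]
    accuracy_when_answering = r["correct"] / attempted if attempted else 0.0
    print(f"{name}: overall accuracy={r['correct']:.0%}, "
          f"hallucination rate={r['wrong']:.0%}, "
          f"accuracy when it answers={accuracy_when_answering:.0%}")

# Overall accuracy differs by only 2 points, but the wrong-answer rate
# differs by roughly a factor of three.
```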

The Statistical Origins 

The paper also explains why hallucinations begin during pretraining. Unlike spelling patterns that follow consistent rules, arbitrary facts like birthdays cannot be predicted from patterns alone. There’s a mathematical connection between hallucinations and “missing mass” – the unavoidable statistical limitation that some rare facts will be poorly represented in training data. 

Because base models are trained to fit the distribution of their training data, even a well-calibrated model will confidently produce plausible but wrong completions for facts that appear rarely, or only once, in that data.
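The paper's missing-mass argument is statistical, but a toy Good-Turing-style calculation gives the flavor: the share of facts seen exactly once in training is a rough floor on how often a calibrated model will err on facts of that kind. The data below is invented purely for illustration:

```python
from collections import Counter

# Toy "training corpus": how often each person's birthday is mentioned.
# Facts that appear only once (singletons) are where a calibrated model
# is most likely to hallucinate, per the missing-mass argument.
birthday_mentions = ["alice", "alice", "alice", "bob", "bob", "carol", "dave", "erin"]

counts = Counter(birthday_mentions)
singletons = sum(1 for c in counts.values() if c == 1)

# Good-Turing-style estimate: probability mass of unseen or barely seen facts
# is roughly (number of singletons) / (total observations).
singleton_rate = singletons / len(birthday_mentions)
print(f"Singleton rate: {singleton_rate:.0%}")  # rough floor on errors for such facts
```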

The Solution: Change the Scoring 

OpenAI proposes a straightforward fix: modify how AI systems are evaluated. Rather than just counting correct answers, evaluations should penalize wrong answers more than abstentions. 

This mirrors how some standardized tests use “negative marking” – deducting points for incorrect answers to discourage guessing. The researchers recommend including explicit confidence thresholds in instructions, such as:

“Answer only if you are >75% confident, since mistakes are penalized 3 points, while correct answers receive 1 point, and ‘I don’t know’ receives 0 points.” 
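Under that rubric, the break-even confidence falls directly out of the arithmetic. A short sketch (assuming the quoted values of +1 for correct, -3 for wrong, and 0 for abstaining) shows why 75% is the threshold:

```python
def expected_score(p_correct: float, reward: float = 1.0, penalty: float = 3.0) -> float:
    """Expected score for answering with confidence p_correct under the quoted rubric."""
    return p_correct * reward - (1 - p_correct) * penalty

# "I don't know" always scores 0, so answering only pays off above the
# break-even confidence: reward*p - penalty*(1-p) = 0  =>  p = penalty / (reward + penalty).
break_even = 3.0 / (1.0 + 3.0)   # = 0.75, matching the ">75% confident" instruction
print(f"Break-even confidence: {break_even:.0%}")

for p in (0.5, 0.75, 0.9):
    print(f"confidence={p:.0%}: answering scores {expected_score(p):+.2f}, abstaining scores 0")
```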

Breaking Common Misconceptions 

The research challenges several widespread beliefs:

  • Hallucinations are NOT inevitable – models can abstain when uncertain
  • Avoiding hallucinations does NOT require advanced intelligence – a smaller model can be better at knowing its limits
  • A perfect 100% accurate model is NOT possible – some questions are inherently unanswerable 

The field needs not just better hallucination evaluations but wholesale reform of how primary evaluations are scored. As OpenAI concludes: “If the main scoreboards keep rewarding lucky guesses, models will keep learning to guess.”
