When San Francisco startup OpenAI unveiled its ChatGPT online chatbot late last year, millions were wowed by the humanlike way it answered questions, wrote poetry, and discussed almost any topic. But most people were slow to realize that this new kind of chatbot often makes things up.
When Google introduced a similar chatbot several weeks later, it spewed nonsense about the James Webb Space Telescope. The next day, Microsoft’s new Bing chatbot offered up all sorts of bogus information about the Gap, Mexican nightlife, and singer Billie Eilish. Then, in March, ChatGPT cited a half dozen fake court cases while writing a 10-page legal brief that a lawyer submitted to a federal judge in Manhattan.
Now, a new startup called Vectara, founded by former Google employees, is trying to figure out how often chatbots veer from the truth. The company’s research estimates that even in situations designed to prevent it from happening, chatbots invent information at least 3% of the time – and as high as 27%. Experts call this chatbot behavior “hallucination.”
It may not be a problem for people tinkering with chatbots on their personal computers, but it is a serious issue for anyone using this technology with court documents, medical information, or sensitive business data. Because these chatbots can respond to almost any request in an unlimited number of ways, there is no way of definitively determining how often they hallucinate. “You would have to look at all of the world’s information,” said Simon Hughes, the Vectara researcher who led the project.
Hughes and his team asked these systems to perform a single, straightforward task that is readily verified: Summarize news articles. Even then, the chatbots persistently invented information. “We gave the system 10 to 20 facts and asked for a summary of those facts,” said Amr Awadallah, the CEO of Vectara and a former Google executive. “That the system can still introduce errors is a fundamental problem.”
The researchers also argue that when these chatbots perform other tasks – beyond mere summarization – hallucination rates may be higher. Their research showed that hallucination rates vary widely among the leading AI companies. OpenAI’s technologies had the lowest rate, around 3%. Systems from Meta hovered around 5%. A Google system, PaLM chat, had the highest rate at 27%. Google declined to comment, and OpenAI and Meta did not respond to requests for comment.