SAN FRANCISCO, California: Whisper, the artificial intelligence-powered transcription tool from tech behemoth OpenAI, has been called out by experts for making up chunks of text, or even entire sentences.
OpenAI's claim that Whisper approaches "human level robustness and accuracy" has been challenged by more than a dozen software engineers, developers, and academic researchers, who said some of the invented text - known in the industry as hallucinations - can include racial commentary, violent rhetoric, and even imagined medical treatments.
Experts warn that Whisper's inaccuracies pose risks due to its extensive use across industries for tasks such as transcribing interviews, generating text in consumer technology, and subtitling videos.
Of particular concern is the rush by medical centers to adopt Whisper-based tools to transcribe doctor-patient consultations, despite OpenAI's warning against using it in "high-risk domains."
Although the extent of the issue is unclear, many researchers report encountering Whisper's "hallucinations" frequently. A University of Michigan researcher found hallucinations in eight out of ten audio transcriptions for a study of public meetings.
A machine learning engineer discovered hallucinations in roughly half of the 100 hours of transcripts he reviewed, while another developer observed errors in nearly all of the 26,000 transcripts he generated using Whisper.
Even short, clear audio clips aren't immune; in one recent study, computer scientists found 187 hallucinations among 13,000 straightforward audio snippets - a rate that would translate into tens of thousands of faulty transcriptions across millions of recordings.
Alondra Nelson, former head of the White House Office of Science and Technology Policy, said such errors are particularly concerning in healthcare settings, where they could have severe consequences.
Whisper also serves as a closed-captioning tool for Deaf and hard-of-hearing users, a population especially vulnerable to transcription errors since they cannot identify inaccuracies "hidden amongst all this other text," noted Christian Vogler, director of Gallaudet University's Technology Access Program.
The prevalence of these hallucinations has led experts, advocates, and former OpenAI employees to call for federal AI regulations and push OpenAI to address the issue. An OpenAI spokesperson stated that the company is actively studying ways to reduce hallucinations and values feedback from researchers, incorporating it into model updates.
Whisper's hallucinations have been surprisingly frequent compared with other transcription tools. The tool is now integrated into OpenAI's flagship chatbot, ChatGPT, and into the cloud platforms of Oracle and Microsoft.
In the past month, one recent version of Whisper was downloaded over 4.2 million times on the open-source platform Hugging Face. Whisper is commonly used in call centers and voice assistants. Sanchit Gandhi, a machine-learning engineer at Hugging Face, noted that Whisper is currently the most popular open-source speech recognition model.
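For developers, invoking the open-source model takes only a few lines. The sketch below, which uses the Hugging Face transformers library, is illustrative rather than drawn from any of the deployments described here; the model checkpoint and the audio file name are placeholder choices.

```python
# Illustrative sketch: transcribing an audio file with the open-source Whisper
# model via the Hugging Face transformers pipeline. The checkpoint
# ("openai/whisper-small") and the file name are placeholders.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# The pipeline returns a dict whose "text" field holds the transcription;
# any hallucinated passages would appear here alongside, and
# indistinguishable from, genuinely spoken words.
result = asr("meeting_audio.wav")
print(result["text"])
```

Because the output is plain text with no confidence markers, a reader of the transcript has no built-in way to tell fabricated passages from accurate ones.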
Professors Allison Koenecke of Cornell University and Mona Sloane of the University of Virginia examined thousands of short snippets they obtained from TalkBank, a research repository hosted at Carnegie Mellon University. They determined that nearly 40 percent of the hallucinations were harmful or concerning because the speaker could be misinterpreted or misrepresented.
In an example they uncovered, a speaker said, "He, the boy, was going to, I'm not sure exactly, take the umbrella."
But the transcription software added: "He took a big piece of a cross, a teeny, small piece ... I'm sure he didn't have a terror knife, so he killed a number of people."
In another recording, a speaker described "two other girls and one lady." Whisper invented extra commentary on race, adding "two other girls and one lady, um, which were Black."
In a third transcription, Whisper invented a non-existent medication called "hyperactivated antibiotics."
Developers believe that hallucinations tend to occur during pauses or when background sounds or music are present, though the exact cause of these fabrications remains unclear.
In its online disclosures, OpenAI recommended against using Whisper in "decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes."