AI-powered voice and audio tools are becoming increasingly embedded in daily life, from digital assistants to smart speakers and customer service bots.
Advances in large audio-language models (LALMs), which can both analyze and generate audio, now make it possible to control devices using voice commands, transcribe meetings automatically, or identify a song playing in the background. These models are also increasingly equipped with the ability to communicate with external services and operate other applications and tools.
But these tools can be “hijacked” through imperceptible sounds embedded in audio, forcing them to execute unauthorized commands without a user’s knowledge. New research due to be presented at the IEEE Symposium on Security and Privacy in San Francisco next week shows that a modified audio clip undetectable by human ears can manipulate a model’s behavior with an average success rate of 79 to 96 percent. The clips are designed to work regardless of what instructions the user provides alongside the audio, meaning they can be reused to attack the same model multiple times.
The authors tested the approach against 13 leading open models, as well as commercial AI voice services from Microsoft and Mistral, and showed they could coax models into conducting sensitive web searches, downloading files from attacker-controlled sources, and sending emails containing user data.
“It takes just half an hour to train this signal and then, because this signal is context-agnostic, you can use it to attack the target model whenever you want, no matter what the user says,” says lead author Meng Chen, a Ph.D. student at Zhejiang University in China.
How adversarial audio injection works
The research builds on years of work into “adversarial audio examples”—audio manipulated to deceive machine learning models. Previous work focused primarily on how these files could induce incorrect predictions in models that perform one-way tasks like speech recognition or audio classification.
What singles out this new work, says Chen, is that it targets generative models capable of producing responses and taking actions. Their technique, dubbed AudioHijack, exploits a critical security flaw in LALM design: Because these models can receive instructions in audio format, malicious instructions can be hidden in manipulated clips to elicit a wide range of undesirable behaviors.
Many previous attacks on generative models required the attacker to have complete control over both the final audio input and the original instructions given to the model, essentially acting as the user. Here, the attacker manipulates only the audio data being processed by the model, which makes it possible to attack a model while it’s being used by someone else.
Real-world examples include hiding malicious instructions in online videos, music clips, or voice notes that users query an AI about, or broadcasting malicious audio on a Zoom call that is then uploaded to AI transcription services. Chen says the team’s more recent, unpublished studies have also demonstrated the ability to inject their malicious audio into a live voice chat with an AI in real time.
The researchers used a tried-and-tested approach to creating adversarial examples. This involves adjusting the numerical values that represent the waveform in the digital audio file in ways that don’t significantly alter how it sounds, but elicit unintended behaviors in the model when it processes the data. The technique relies on an optimization algorithm that repeatedly tweaks an audio clip, measures the impact on the model’s response, and then uses this signal to further adjust the audio until the model does what the attacker wants.
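To make that loop concrete, here is a minimal sketch of the generic adversarial-audio recipe described above, written in PyTorch with a tiny stand-in model and an arbitrary target. It illustrates the general approach the researchers build on, not their AudioHijack code; the model, target tokens, and perturbation bound are all placeholders.

```python
# Generic adversarial-audio loop: nudge the waveform, score the model's
# output against the attacker's desired output, and use the gradient to
# refine the perturbation. Toy model and target are placeholders.
import torch

torch.manual_seed(0)

waveform = torch.randn(16000)            # 1 s of audio at 16 kHz (placeholder)
target = torch.tensor([3, 1, 4, 1, 5])   # attacker-chosen output token IDs (placeholder)

# Stand-in "model": maps audio to per-step token logits. A real
# audio-language model would sit here instead.
toy_model = torch.nn.Sequential(
    torch.nn.Linear(16000, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 5 * 10),         # 5 output steps x 10-token vocabulary
)

delta = torch.zeros_like(waveform, requires_grad=True)  # the hidden perturbation
opt = torch.optim.Adam([delta], lr=1e-3)
epsilon = 0.01                            # keep the change quiet

for step in range(200):
    logits = toy_model(waveform + delta).view(5, 10)
    loss = torch.nn.functional.cross_entropy(logits, target)  # distance from attacker's goal
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():
        delta.clamp_(-epsilon, epsilon)   # bound the perturbation's amplitude
```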
Applying this to generative models poses a major challenge. Older, task-specific models provide fine-grained feedback on how tiny changes to the raw audio affect their outputs. Generative models, however, break audio into chunks and assign them to numerical representations called “tokens,” snapping each snippet to the closest match.
This coarser process makes it harder to tell whether a manipulation has moved the model closer to the desired behavior, confounding the optimization algorithm. So, Chen and colleagues devised a way to approximate the fine-grained feedback required for the optimization algorithm to adjust the manipulation.
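One common way to restore a usable signal through that discrete lookup, shown here purely as an illustration (the article does not detail the paper’s exact approximation), is to relax the hard nearest-match step into a temperature-controlled softmax over codebook distances, so small waveform changes still produce a smooth, differentiable effect.

```python
# The hard "snap to the closest token" step blocks gradients. A standard
# workaround (illustrative only; not necessarily the paper's method) is a
# softmax over codebook distances, which lets gradients flow back toward
# the waveform.
import torch

codebook = torch.randn(512, 64)               # 512 audio tokens, 64-dim each (placeholder)
frame = torch.randn(64, requires_grad=True)   # one encoded audio chunk

dists = torch.cdist(frame.unsqueeze(0), codebook).squeeze(0)  # distance to every token

hard_id = dists.argmin()                       # what the model actually uses: no gradient
weights = torch.softmax(-dists / 0.1, dim=0)   # soft, differentiable proxy
soft_embedding = weights @ codebook            # gradient flows back to the audio side

soft_embedding.sum().backward()                # frame.grad is now usable by the optimizer
```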
This required full access to the model, restricting the researchers to open models with publicly available weights. They found, however, that attacks developed for open models transferred to commercial models from Microsoft and Mistral that share the same underlying architecture.
In response to a request for comment, a Microsoft spokesperson said “We appreciate the researchers’ work to advance understanding of this type of technique. This study evaluates model resilience through controlled, direct interactions with the model itself, which helps inform our approach to building model resiliency. In practice, AI models are often integrated into user applications, and we offer developers tools and guidance they can use to implement additional layers of protection that help safeguard users.”
Mistral did not reply to a request for comment by the time of publication.
Making AudioHijack more effective
Attacking proprietary closed models from companies like OpenAI and Anthropic is much harder, says Chen, given limited public information about their architectures. But these models often use open-source components—such as pre-trained audio encoders—that can be targeted similarly, something the team is currently investigating.
To ensure the attack works regardless of what instructions the user provides alongside the malicious audio clip, the researchers paired the audio clip with different user instructions on each round of the optimization process.
They also found a way to commandeer the model’s attention mechanism, the component that helps the model identify the parts of the audio that are relevant to the task it’s been set to perform. The researchers introduced a measure of how much attention the model pays to the adversarial audio versus the user’s own instructions at each step, feeding this into the optimization process to produce samples that draw more attention from the model.
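A rough sketch of how those two modifications could fold into the optimization objective is below; the prompt list, the toy_attention() placeholder, and the weighting are assumptions for illustration, not the paper’s actual formulation.

```python
# Each round samples a different user instruction (keeping the clip
# context-agnostic) and adds an attention-steering term that rewards
# drawing the model's attention toward the adversarial audio.
import random
import torch

user_prompts = [
    "Summarize this recording.",
    "What song is playing here?",
    "Transcribe this voice note.",
]

def toy_attention(prompt_len: int, audio_len: int) -> torch.Tensor:
    # Placeholder: a real audio-language model exposes attention weights
    # over the concatenated prompt and audio tokens.
    return torch.softmax(torch.randn(prompt_len + audio_len), dim=0)

def round_loss(task_loss: torch.Tensor, prompt_len: int, audio_len: int) -> torch.Tensor:
    attn = toy_attention(prompt_len, audio_len)
    audio_share = attn[prompt_len:].sum()        # attention paid to the audio
    prompt_share = attn[:prompt_len].sum()       # attention paid to the user's text
    attention_term = prompt_share - audio_share  # lower when the audio dominates
    return task_loss + 0.1 * attention_term      # 0.1 is an arbitrary example weight

prompt = random.choice(user_prompts)             # new surrounding context each round
loss = round_loss(task_loss=torch.tensor(1.0), prompt_len=len(prompt.split()), audio_len=50)
```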
To make the manipulations harder to detect by a human listener, the researchers used a technique they had developed previously, which makes changes to the audio sound like natural reverberation. This is harder for humans to detect than earlier approaches that added noise to the original signal.
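The idea can be sketched as optimizing a short filter that is convolved with the original audio, so the perturbation behaves like a room’s impulse response rather than added hiss. The kernel length and setup below are illustrative assumptions, not the researchers’ published technique.

```python
# Instead of adding raw noise, shape the perturbation as a learned filter
# convolved with the original audio, so changes sound like reverberation.
import torch
import torch.nn.functional as F

waveform = torch.randn(1, 1, 16000)                # (batch, channel, samples) placeholder
kernel = torch.zeros(1, 1, 256, requires_grad=True)
with torch.no_grad():
    kernel[0, 0, 0] = 1.0                          # start from an identity "impulse response"

# Perturbed audio = original convolved with the learned impulse response;
# optimizing `kernel` (as in the earlier loop) keeps the changes echo-like.
perturbed = F.conv1d(waveform, kernel, padding=255)[..., :16000]
```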
Testing on today’s AI audio models
The team demonstrated six categories of attack: making the model claim it cannot process audio, refusing user requests, responding with false information, inserting malicious links, altering the model’s persona, and triggering unauthorized tool use.
And worryingly, the approach proved resistant to common defenses. Providing models with examples of malicious instructions to watch out for reduced attack success by just 7 percent, while asking the model to reflect on whether its response matched the user’s instructions caught only 28 percent of attacks.
“These single-point defenses struggle to resist our attack because we found it’s very hard for these models to distinguish the normal user intent and our adversary attack,” says Chen.
The only effective tactic was monitoring the models’ internal attention mechanisms to detect AudioHijack’s attempts to steer attention toward the malicious audio. However, the researchers showed that an attacker aware of this defense can dial back the attention manipulation at the expense of a small reduction in attack success.
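A minimal sketch of such an attention-based check is below; the threshold and the way the attention weights are obtained are assumptions for illustration.

```python
# Flag a response when an unusually large share of the model's attention
# lands on the audio tokens rather than the user's text.
import torch

def looks_hijacked(attn: torch.Tensor, prompt_len: int, threshold: float = 0.8) -> bool:
    """attn: attention weights over [prompt tokens ... audio tokens], summing to 1."""
    audio_share = attn[prompt_len:].sum().item()
    return audio_share > threshold

attn = torch.softmax(torch.randn(60), dim=0)       # placeholder attention vector
print(looks_hijacked(attn, prompt_len=10))
```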
In the real world, this kind of audio attack will face additional challenges, such as compression and various post-processing mechanisms that could degrade the signal, says Eugene Bagdasarian, an assistant professor of computer science at the University of Massachusetts Amherst. But he says that multimodal attacks on AI models remain an essentially unsolved problem.
“With text data we can understand that something is wrong (special characters, suspicious sentences, etc), audio modality is really challenging to comprehend because of how limited our hearing is,” he writes in an email.