Microsoft Unveils AI Jailbreak: Execute Malicious Instructions!

Microsoft identified a novel generative AI jailbreak technique named Skeleton Key which bypasses AI models’ safety guards by exploiting user prompts to manipulate the AI’s behavior, potentially causing policy violations, biased decisions, or even malicious actions. 

They collaborated with other AI vendors and addressed the issue through multiple approaches: Prompt Shields in Azure to identify and block such attacks, software updates for LLMs behind Microsoft’s AI offerings to mitigate the impact. 

Skeleton Key exploits generative AI models by using a multi-step approach to bypass their safety guardrails, allowing attackers to manipulate the model into generating malicious content or overriding its decision-making rules. 

To mitigate this risk, Microsoft recommends developers consider Skeleton Key in their threat models, utilize tools like PyRIT for red teaming their AI systems, and implement several security measures in their Azure AI systems to address Skeleton Key attacks. 

Skeleton Key jailbreak technique cause harm in AI systems

The Skeleton Key attack exploits a generative AI model’s safety guardrails through multi-turn prompts, where the attacker first feeds a forbidden request, like creating harmful content, and then subsequent prompts manipulate the model by claiming the request is for a safe research purpose and the user is an ethical researcher. 

It convinces the model to loosen its restrictions and respond to the request with a disclaimer, effectively bypassing the safety measures. If successful, the model acknowledges the relaxed guidelines and fulfills any following request regardless of its original limitations. 

Example text used in a Skeleton Key jailbreak attack

It bypasses safety measures in large language models, and by crafting specific prompts, users could instruct models to generate content on sensitive topics including explosives and bioweapons, bypassing previous restrictions. 

The jailbreak worked on various models, including Meta Llama, Google Gemini Pro, and OpenAI’s GPT-3.5 and GPT-4 (with some limitations). Unlike other methods, Skeleton Key lets users make direct requests and seems to show the full power of the underlying model, since the researchers shared their findings with AI vendors so that they could come up with ways to protect against them. 

To prevent AI system “jailbreaks” that bypass safety measures, Microsoft recommends a multi-layered approach. First, implement input filtering to block malicious prompts, and second, engineer clear prompts instructing the AI on desired behavior and to prevent attempts to subvert safeguards. 

Third, use output filtering to find and stop model outputs that are not safe, and set up a separate AI for abuse monitoring to find and stop malicious content or behaviors that happen over and over again, which can be done by using adversarial example detection, content classification, and abuse pattern capture. 

Also Read:

Kaaviya
Kaaviyahttps://cyberpress.org/
Kaaviya is a Security Editor and fellow reporter with Cyber Press. She is covering various cyber security incidents happening in the Cyber Space.

Recent Articles

Related Stories

LEAVE A REPLY

Please enter your comment!
Please enter your name here