Microsoft identified a novel generative AI jailbreak technique named Skeleton Key which bypasses AI models’ safety guards by exploiting user prompts to manipulate the AI’s behavior, potentially causing policy violations, biased decisions, or even malicious actions.
They collaborated with other AI vendors and addressed the issue through multiple approaches: Prompt Shields in Azure to identify and block such attacks, software updates for LLMs behind Microsoft’s AI offerings to mitigate the impact.
Skeleton Key exploits generative AI models by using a multi-step approach to bypass their safety guardrails, allowing attackers to manipulate the model into generating malicious content or overriding its decision-making rules.
To mitigate this risk, Microsoft recommends developers consider Skeleton Key in their threat models, utilize tools like PyRIT for red teaming their AI systems, and implement several security measures in their Azure AI systems to address Skeleton Key attacks.
The Skeleton Key attack exploits a generative AI model’s safety guardrails through multi-turn prompts, where the attacker first feeds a forbidden request, like creating harmful content, and then subsequent prompts manipulate the model by claiming the request is for a safe research purpose and the user is an ethical researcher.
It convinces the model to loosen its restrictions and respond to the request with a disclaimer, effectively bypassing the safety measures. If successful, the model acknowledges the relaxed guidelines and fulfills any following request regardless of its original limitations.
It bypasses safety measures in large language models, and by crafting specific prompts, users could instruct models to generate content on sensitive topics including explosives and bioweapons, bypassing previous restrictions.
The jailbreak worked on various models, including Meta Llama, Google Gemini Pro, and OpenAI’s GPT-3.5 and GPT-4 (with some limitations). Unlike other methods, Skeleton Key lets users make direct requests and seems to show the full power of the underlying model, since the researchers shared their findings with AI vendors so that they could come up with ways to protect against them.
To prevent AI system “jailbreaks” that bypass safety measures, Microsoft recommends a multi-layered approach. First, implement input filtering to block malicious prompts, and second, engineer clear prompts instructing the AI on desired behavior and to prevent attempts to subvert safeguards.
Third, use output filtering to find and stop model outputs that are not safe, and set up a separate AI for abuse monitoring to find and stop malicious content or behaviors that happen over and over again, which can be done by using adversarial example detection, content classification, and abuse pattern capture.