In a recent study, researchers from Palo Alto Networks’ Unit 42 successfully bypassed the safety measures of 17 popular generative AI (GenAI) web products, exposing vulnerabilities in their large language models (LLMs).
The investigation aimed to assess the effectiveness of jailbreaking techniques, which involve crafting specific prompts to manipulate the model’s output and bypass its safety alignments.
These alignments are designed to prevent the generation of harmful content or the leakage of sensitive information, such as model training data or system prompts.
Jailbreaking Techniques and Goals
The researchers employed both single-turn and multi-turn strategies to jailbreak the LLMs.
Single-turn strategies involve using a single prompt to achieve the desired outcome, while multi-turn strategies involve a series of interactions to gradually manipulate the model.
The goals of these jailbreak attempts included AI safety violations, such as generating hateful content or instructions for illegal activities, as well as extracting sensitive information like model training data or personally identifiable information (PII).
The study found that all tested platforms were susceptible to jailbreaking, with most apps vulnerable to multiple strategies.
The researchers observed that multi-turn strategies were generally more effective for achieving AI safety violations, with success rates significantly higher than those of single-turn approaches.
For instance, multi-turn strategies achieved success rates ranging from 39.5% to 54.6% for goals like malware generation and criminal activity, compared to single-turn strategies which had success rates between 20.7% and 28.3%.
However, single-turn strategies, particularly storytelling and role-play, remained effective across various jailbreak categories.
The instruction override technique was notably successful in leaking system prompts, achieving a 9.9% success rate.
Implications and Recommendations
Despite the vulnerabilities identified, the study noted that current AI models have robust protections against data leakage attacks, with minimal success in extracting training data or PII.
However, one app was found to be still vulnerable to the repeated token attack, a technique previously used to leak training data.
This suggests that older or privately developed LLMs may still pose risks.
To enhance protection against jailbreak attacks, the researchers recommend implementing comprehensive content filtering, using multiple filter types, and applying maximum content filtering settings.
While these measures can significantly reduce attack success rates, they are not foolproof, and determined adversaries may still develop new bypass techniques.