The Bad Likert Judge technique is a novel method for bypassing safety measures in text-generation LLMs, which manipulates the model into generating harmful responses by instructing it to assess the harmfulness of different outputs using a Likert scale and then generate examples that align with those scales.
Experiments across six state-of-the-art LLMs demonstrated a significant increase in attack success rate, exceeding 60% on average, compared to standard attack prompts.
LLM jailbreaks are techniques used to bypass safety measures in large language models that exploit vulnerabilities like limited computational resources, long context windows, and attention mechanisms.
For instance, single-turn attacks can overwhelm the model with complex tasks, while multi-turn attacks manipulate the conversation context. Strategies like persona persuasion, role-playing, and token smuggling deceive the model into generating harmful content.
The effectiveness of many-shot attacks, where repeated benign prompts precede a malicious request and deceptive attacks that distract the model’s attention.
By first instructing the LLM to act as a judge, specifically to assess the harmfulness of content using a Likert scale, the attacker establishes the LLM’s understanding of harmful concepts.
Subsequently, the attacker indirectly prompts the LLM to generate responses that align with different levels of harmfulness on the Likert scale, which often requiring a few rounds of refinement, can ultimately lead the LLM to produce harmful content, potentially bypassing its internal safety mechanisms.
To evaluate Bad Likert Judge, researchers focused on common jailbreak categories, including AI safety violations (hate, harassment, self-harm, etc.) and information leakage (system prompts, training data).
Then they used an LLM evaluator to assess the success of jailbreak attempts generated by Bad Likert Judge by employing the chat completion approach, where another LLM determined if the responses from Bad Likert Judge constituted successful jailbreaks.
By measuring attack effectiveness, they used the Attack Success Rate (ASR), calculated by dividing the number of successful jailbreaks by the total number of attack attempts, which provides a quantitative measure of Bad Likert Judge’s ability to elicit harmful or unsafe responses from the target LLM.
Unit 42 researchers evaluated the effectiveness of the Bad Likert Judge technique in bypassing LLM safety guardrails.
By analyzing 1,440 attack prompts across six models, researchers found that the technique significantly increased the Attack Success Rate (ASR) across most jailbreak categories, often by over 75 percentage points. The “system prompt leakage” was an exception, with minimal impact on most models.
While the Bad Likert Judge technique proved highly effective in many cases, it’s crucial to note that no current LLM’s internal safety measures are completely impenetrable.
Across multiple models and jailbreak categories, BLJ significantly increased the Attack Success Rate (ASR). While content filtering significantly reduced ASR by an average of 89.2 percentage points, it’s not foolproof.
Determined adversaries may still circumvent these filters, and false positives/negatives are possible, which emphasizes the importance of robust content filtering alongside LLMs for mitigating jailbreak risks in real-world deployments.