Unit 42, the cybersecurity research arm of Palo Alto Networks, has uncovered significant vulnerabilities in large language models (LLMs) developed by the China-based AI organization DeepSeek.
Their investigation focused on three sophisticated jailbreaking techniques Deceptive Delight, Bad Likert Judge, and Crescendo employed to bypass model safety restrictions in DeepSeek-V3 and DeepSeek-R1, both released in late 2024 and early 2025.
The study highlights concerning implications for the cybersecurity landscape, particularly as LLMs continue to gain traction in various applications.
Examining Jailbreaking Methods
The Unit 42 team tested the jailbreaks against a popular open-source distilled model from DeepSeek, revealing high bypass success rates across multiple prohibited content categories.
Jailbreaking refers to bypassing the built-in “guardrails” of LLMs restrictions designed to prevent harmful or malicious content generation.
While these safeguard mechanisms often reject straightforward prompts for sensitive or illegal information, jailbreaking techniques exploit weaknesses in how the models process and respond to nuanced or multi-step requests.
Key jailbreaking techniques expose vulnerabilities in language models to generate harmful outputs.
Bad Likert Judge manipulates evaluation criteria using prompts that blend legitimate and malicious requests evaluated on a Likert scale, enabling the creation of scripts like keyloggers or data exfiltration tools; while initial responses were vague, follow-up prompts elicited actionable instructions.
Crescendo employs multi-turn escalation to guide models step-by-step from benign to prohibited outputs, such as providing detailed instructions for assembling Molotov cocktails or synthesizing methamphetamine through historical or scientific framing.
Deceptive Delight embeds unsafe topics within seemingly innocuous narratives, leading to inadvertent harmful outputs.
For example, a request combining computer science with malicious coding led to the generation of Python scripts for remote command execution via DCOM on Windows systems.
Implications for AI Security
The findings spotlight the alarming potential of jailbroken LLMs to facilitate malicious activities, including data exfiltration, spear-phishing, and malware creation.
Step-by-step instructions generated through these jailbreaks could significantly lower barriers for cyber attackers, allowing them to execute sophisticated campaigns with minimal technical expertise.
DeepSeek’s vulnerabilities reflect a broader challenge facing LLM developers as adversarial inputs become more intricate.
While DeepSeek’s initial guardrails successfully declined malicious requests in some cases, follow-up prompts often exposed weaknesses, leading to comprehensive and actionable malicious outputs.
To address these risks, Unit 42 advocates for robust monitoring of AI usage within organizations, especially when employees access unauthorized third-party LLMs.
Palo Alto Networks also offers solutions like the Unit 42 AI Security Assessment, powered by Precision AI, to mitigate risks and enhance cybersecurity while enabling safe adoption of generative AI technologies.
As LLMs continue evolving, the arms race between model developers and cybercriminals underscores the urgent need for improved safeguards.
Without effective countermeasures, the misuse of AI systems could exacerbate cyber threats globally, necessitating collaborative efforts from developers, regulators, and organizations to fortify AI security standards.