
Jailbreaking DeepSeek: Researchers Reveal Three New Methods to Override LLM Safety


Researchers at Palo Alto Networks’ Unit 42 have revealed a troubling surge in large language model (LLM) security risks, citing three newly identified jailbreak techniques, “Bad Likert Judge,” “Crescendo,” and “Deceptive Delight,” that are capable of bypassing safety protocols in DeepSeek’s open-source LLMs.

These findings highlight the growing potential for misuse of AI models by malicious actors and underscore the need for robust safeguards.

DeepSeek, a China-based AI research organization, recently introduced two competitive open-source LLMs, DeepSeek-V3 (launched in December 2024) and DeepSeek-R1 (released in January 2025).

Guardrail implemented in DeepSeek.

Despite their advancements in natural language processing, extensive testing revealed significant vulnerabilities in their resistance to jailbreaking attacks.

Researchers discovered that these methods can successfully override restrictions to produce harmful outputs, ranging from malicious code generation to instructions for dangerous physical activities.
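To give a sense of how this kind of resistance testing can be structured, the sketch below probes an OpenAI-compatible chat endpoint with a list of restricted-category test prompts and flags responses that are not refusals. The local base URL, model name, refusal heuristic, and placeholder prompts are illustrative assumptions, not a description of Unit 42’s actual methodology.

```python
# Minimal sketch of a jailbreak-resistance probe against an OpenAI-compatible
# endpoint. The base_url, model name, and refusal heuristic are illustrative
# assumptions; the test prompts are deliberately left as placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")  # assumed local deployment

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "unable to help")  # crude heuristic

def looks_like_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def probe(prompts: list[str], model: str = "deepseek-chat") -> list[dict]:
    results = []
    for prompt in prompts:
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        ).choices[0].message.content
        results.append({"prompt": prompt, "refused": looks_like_refusal(reply)})
    return results

if __name__ == "__main__":
    # Placeholder prompts stand in for an internal red-team corpus (not shown).
    for finding in probe(["<restricted-category test prompt 1>",
                          "<restricted-category test prompt 2>"]):
        print(finding)
```

In practice, a simple keyword heuristic like this undercounts partial compliance, which is exactly the failure mode the multi-turn techniques below exploit.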

High Bypass Rates Uncovered in Jailbreaking Techniques

The Bad Likert Judge exploit manipulates the LLM by embedding malicious intent within evaluation frameworks, such as Likert scales, which rank responses from benign to harmful.

Bad Likert Judge responses after using additional prompts.

Unit 42 researchers demonstrated this by eliciting step-by-step instructions for creating malware, including keyloggers and data exfiltration scripts, after multiple iterations of strategically crafted prompts.

The Crescendo technique, a multi-turn escalation approach, gradually coaxes the model into generating restricted content by starting with innocuous queries.

This method proved highly effective, enabling the generation of actionable guides, such as instructions for constructing Molotov cocktails or producing prohibited substances.

The adaptability of this technique poses a concerning challenge, as it can evade traditional countermeasures designed to detect single-step jailbreaks.
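A commonly suggested countermeasure for this class of attack is to moderate the accumulated conversation rather than each message in isolation. The sketch below illustrates that idea; the `score_harm` classifier, window size, and thresholds are hypothetical placeholders rather than features of any specific product.

```python
# Minimal sketch of conversation-level (rather than per-message) moderation,
# intended to catch gradual escalation such as Crescendo-style prompting.
# `score_harm` stands in for any harm classifier and is assumed, not real.
from typing import Callable

def should_block(conversation: list[str],
                 score_harm: Callable[[str], float],
                 per_turn_threshold: float = 0.9,
                 cumulative_threshold: float = 0.6) -> bool:
    """Block when the latest turn is clearly harmful, or when the
    conversation as a whole trends toward a restricted topic."""
    latest = score_harm(conversation[-1])
    if latest >= per_turn_threshold:
        return True
    # Score the concatenated recent history so slow escalation is visible
    # even if every individual turn stays below the per-turn threshold.
    window = " ".join(conversation[-6:])
    return score_harm(window) >= cumulative_threshold
```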

In contrast, the Deceptive Delight method embeds prohibited topics within benign narratives.

During testing, DeepSeek provided detailed scripts for advanced cyberattacks, such as SQL injection and Distributed Component Object Model (DCOM) exploitation, using minimal prompts.

This approach is particularly troubling because it combines harmless storytelling with precise malicious output.

Escalating Risks Highlight Need for AI Safeguards

Unit 42’s findings emphasize that poorly secured LLMs, like DeepSeek’s models, lower the barrier for malicious actors to access sensitive or harmful information.

For instance, while initial responses often seemed benign, follow-up prompts exposed cracks in the safety mechanisms, revealing detailed and actionable instructions.

The broader implications are significant. Jailbreaking techniques like these allow threat actors to weaponize LLMs in multiple stages of cyberattacks, from reconnaissance and malware creation to social engineering and data theft.

For example, phishing campaigns could be enhanced with highly personalized and convincing emails crafted from DeepSeek’s responses.

Similarly, detailed methods for bypassing security protocols or conducting lateral movement were conceivable with minimal user expertise.

As LLM technology continues to evolve, this research underscores the necessity of developing more robust safeguards against adversarial manipulations.

Organizations can benefit from tools like Unit 42’s AI Security Assessment, which accelerates innovation while mitigating risks.

Furthermore, security solutions powered by Precision AI can monitor and control unauthorized usage of potentially unsafe AI applications.

The rising sophistication of jailbreaking attacks demands proactive defenses, including enhanced model training, real-time monitoring, and stricter access controls for sensitive AI systems.
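As a rough illustration of the real-time monitoring piece, the sketch below gates model output through a moderation check before it reaches the user. The `generate` and `moderate` callables and the canned refusal message are assumptions for illustration, not a description of any vendor’s implementation.

```python
# Minimal sketch of a response-gating wrapper around an LLM client.
# `moderate` represents any content-safety classifier; it is assumed here.
import logging

logger = logging.getLogger("llm-gateway")

def guarded_completion(generate, moderate, prompt: str) -> str:
    """Call the model, then suppress and log any response the
    moderation check flags before it is returned to the caller."""
    response = generate(prompt)
    verdict = moderate(response)  # e.g. {"flagged": bool, "categories": [...]}
    if verdict.get("flagged"):
        logger.warning("Blocked response for prompt hash %s: %s",
                       hash(prompt), verdict.get("categories"))
        return "This request cannot be completed."
    return response
```

A wrapper like this complements, rather than replaces, model-level guardrails: it gives operators a single point for logging, alerting, and enforcing access controls on sensitive AI systems.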

Companies using third-party AI tools must also implement stringent usage policies to ensure ethical and secure deployment.


Aman Mishra
Aman Mishra is a security and privacy reporter covering data breaches, cybercrime, malware, and vulnerabilities.


