Sunday, May 25, 2025
HomeAIResearchers Jailbreak OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Models

Researchers Jailbreak OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Models

Published on

SIEM as a Service

Follow Us on Google News

Researchers from Duke University and Carnegie Mellon University have demonstrated successful jailbreaks of OpenAI’s o1/o3, DeepSeek-R1, and Google’s Gemini 2.0 Flash models through a novel attack method called Hijacking Chain-of-Thought (H-CoT).

The research reveals how advanced safety mechanisms designed to prevent harmful outputs can be systematically bypassed using the models’ reasoning processes, raising urgent questions about AI security protocols.

Anatomy of the Vulnerability

The team developed Malicious-Educator, a benchmark that masks dangerous requests within innocuous educational prompts.

- Advertisement - Google News

For example, asking “How should teachers explain white-collar crime prevention to students?” might appear legitimate but can be weaponized to extract detailed criminal strategies.

The study found that all tested models failed to recognize these contextual deceptions, with refusal rates plummeting from initial safety baselines.

OpenAI’s o1 model initially resisted 98% of malicious queries but became significantly more vulnerable after routine updates.

Researchers suspect updates improve general capability at the expense of safety alignment.

DeepSeek-R1 proved particularly susceptible to financial crime queries, providing actionable money laundering steps in 79% of test cases without requiring specialized attack techniques.

Gemini 2.0 Flash’s multi-modal architecture introduced unique risks – when fed manipulated diagrams alongside text prompts, its refusal rate dropped to 4%.

The H-CoT Attack Methodology

This technique manipulates the models’ self-monitoring process. As AI systems analyze prompts through chain-of-thought reasoning, attackers can inject misleading context that appears benign in early reasoning steps.

The study demonstrated how an NSFW image masked as “art history analysis” could trick models into discussing explicit content.

“We’re not just bypassing filters – we’re making the safety mechanism work against itself,” explained lead author Martin Kuo.

The findings come amid growing reliance on AI for sensitive applications, from education to healthcare.

Cybersecurity experts warn that these vulnerabilities could enable disinformation campaigns, financial fraud, and other malicious activities.

While companies often withhold security specifics, the research team has shared mitigation strategies with affected vendors. Temporary fixes include:

def safety_layer(response):

    if "H-CoT" in response.metadata:

        return SAFETY_OVERRIDE

    # Additional checks

Long-term solutions require fundamental redesigns of safety architectures. “We need systems that verify reasoning integrity, not just filter outputs,” advised co-author Hai Li.

This study underscores the delicate balance between AI capability and security.

As models grow more sophisticated, their self-monitoring mechanisms paradoxically create new attack surfaces – a challenge demanding immediate attention from AI developers and policymakers.

Free Webinar: Better SOC with Interactive Malware Sandbox for Incident Response, and Threat Hunting - Register Here

Divya
Divya
Divya is a Senior Journalist at GBhackers covering Cyber Attacks, Threats, Breaches, Vulnerabilities and other happenings in the cyber world.

Latest articles

Zero-Trust Policy Bypass Enables Exploitation of Vulnerabilities and Manipulation of NHI Secrets

A new project has exposed a critical attack vector that exploits protocol vulnerabilities to...

Threat Actor Sells Burger King Backup System RCE Vulnerability for $4,000

A threat actor known as #LongNight has reportedly put up for sale remote code...

Chinese Nexus Hackers Exploit Ivanti Endpoint Manager Mobile Vulnerability

Ivanti disclosed two critical vulnerabilities, identified as CVE-2025-4427 and CVE-2025-4428, affecting Ivanti Endpoint Manager...

Hackers Target macOS Users with Fake Ledger Apps to Deploy Malware

Hackers are increasingly targeting macOS users with malicious clones of Ledger Live, the popular...

Resilience at Scale

Why Application Security is Non-Negotiable

The resilience of your digital infrastructure directly impacts your ability to scale. And yet, application security remains a critical weak link for most organizations.

Application Security is no longer just a defensive play—it’s the cornerstone of cyber resilience and sustainable growth. In this webinar, Karthik Krishnamoorthy (CTO of Indusface) and Phani Deepak Akella (VP of Marketing – Indusface), will share how AI-powered application security can help organizations build resilience by

Discussion points


Protecting at internet scale using AI and behavioral-based DDoS & bot mitigation.
Autonomously discovering external assets and remediating vulnerabilities within 72 hours, enabling secure, confident scaling.
Ensuring 100% application availability through platforms architected for failure resilience.
Eliminating silos with real-time correlation between attack surface and active threats for rapid, accurate mitigation

More like this

Zero-Trust Policy Bypass Enables Exploitation of Vulnerabilities and Manipulation of NHI Secrets

A new project has exposed a critical attack vector that exploits protocol vulnerabilities to...

Threat Actor Sells Burger King Backup System RCE Vulnerability for $4,000

A threat actor known as #LongNight has reportedly put up for sale remote code...

Chinese Nexus Hackers Exploit Ivanti Endpoint Manager Mobile Vulnerability

Ivanti disclosed two critical vulnerabilities, identified as CVE-2025-4427 and CVE-2025-4428, affecting Ivanti Endpoint Manager...