Tuesday, July 23, 2024

BEAST AI Jailbreaks Language Models Within 1 Minute With High Accuracy

Malicious hackers sometimes jailbreak language models (LMs) to exploit bugs in these systems and carry out a range of illicit activities.

Such attacks are also driven by the desire to extract confidential information, inject malicious material, and tamper with a model's authenticity.

Cybersecurity researchers from the University of Maryland, College Park, USA, found that BEAST can jailbreak language models within one minute with high accuracy. The research team comprises:

  • Vinu Sankar Sadasivan
  • Shoumik Saha
  • Gaurang Sriramanan
  • Priyatham Kattakinda
  • Atoosa Chegini
  • Soheil Feizi

Language models (LMs) have recently gained massive popularity for tasks like question answering and code generation. Alignment techniques aim to make them conform to human values for safety, but they can still be manipulated.

The recent findings reveal flaws in aligned LMs that allow harmful content generation, a practice termed “jailbreaking.”

BEAST AI Jailbreak

Manually crafted prompts can jailbreak LMs (Perez & Ribeiro, 2022). Zou et al. (2023) use gradient-based attacks that yield gibberish suffixes. Zhu et al. (2023) opt for a readable, gradient-based, greedy attack with a high success rate.

Liu et al. (2023b) and Chao et al. (2023) propose gradient-free attacks that require GPT-4 access. Jailbreaks not only induce unsafe LM behavior but also aid privacy attacks (Liu et al., 2023c), which Zhu et al. (2023) automate.

BEAST is a fast, gradient-free, beam search-based adversarial attack that demonstrates LM vulnerabilities within one GPU minute.

Beam Search-based Adversarial Attack (BEAST) (Source – arXiv)

It offers tunable parameters to trade off speed, attack success rate, and readability. BEAST excels at jailbreaking, achieving an 89% success rate on Vicuna-7B-v1.5 within one minute.
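The beam search at the heart of the attack can be sketched in a few lines. The snippet below is a minimal toy illustration, not the authors' implementation: the `score` function stands in for the target LM's likelihood of producing the attacker's target string, candidate tokens are sampled uniformly rather than from the LM's next-token distribution as BEAST does, and the parameter names `k1`/`k2` here simply denote the beam width and per-beam sampling width.

```python
import heapq
import random

def beast_sketch(prompt_tokens, vocab, score, k1=3, k2=5, length=8, seed=0):
    """Toy sketch of a beam-search adversarial suffix attack.

    k1: beam width (candidate sequences kept per step)
    k2: candidate next tokens sampled per beam element
    score: higher value = more likely to elicit the target behavior
    """
    rng = random.Random(seed)
    beams = [list(prompt_tokens)]       # start from the bare prompt
    for _ in range(length):             # grow the suffix one token at a time
        candidates = []
        for seq in beams:
            # sample k2 candidate next tokens (uniformly here, for the demo)
            for tok in rng.sample(vocab, min(k2, len(vocab))):
                candidates.append(seq + [tok])
        # keep only the k1 highest-scoring extended sequences
        beams = heapq.nlargest(k1, candidates, key=score)
    return max(beams, key=score)

# Toy usage: the "attack" succeeds when the suffix contains target tokens.
vocab = list(range(50))
target_tokens = {7, 13}
score = lambda seq: sum(t in target_tokens for t in seq)
adv = beast_sketch([1, 2, 3], vocab, score)
```

Because the search is purely sampling- and scoring-based, no gradients through the model are needed, which is what makes this family of attacks cheap enough to run in about a GPU minute.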

Human studies show 15% more incorrect outputs and 22% more irrelevant content, demonstrating that efficient hallucination attacks make LM chatbots less useful.

Compared with other approaches, BEAST is designed primarily for fast adversarial attacks, and it excels at jailbreaking aligned LMs in compute-constrained settings.

However, the researchers found that it struggles against the more carefully fine-tuned LLaMA-2-7B-Chat, which remains a limitation.

Cybersecurity analysts used Amazon Mechanical Turk to run manual surveys on LM jailbreaking and hallucination. Workers assessed prompts carrying BEAST-generated suffixes.

Responses from Vicuna-7B-v1.5 were shown to five workers per prompt. For hallucination, the workers evaluated LM responses to both clean and adversarial prompts.
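With five workers judging each response, the per-prompt labels have to be aggregated somehow. A majority vote is one common way to do this; the helper below is a generic sketch of that idea, not the authors' documented procedure, and the "harmful"/"safe" labels are illustrative placeholders.

```python
from collections import Counter

def majority_label(votes):
    """Return the label chosen by the most annotators for one response."""
    (label, _count), = Counter(votes).most_common(1)
    return label

# Example: five workers judging one LM response
decision = majority_label(["harmful", "harmful", "safe", "harmful", "safe"])
```

An odd number of annotators (five here) conveniently avoids exact ties on binary labels.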

This report contributes to machine learning research by identifying security flaws in LMs and revealing problems inherent in current models.

At the same time, the findings expose dangerous new attack surfaces, motivating future research on more reliable and secure language models.

Tushar Subhra Dutta

Tushar is a cybersecurity content editor with a passion for creating captivating and informative content. With years of experience in cybersecurity, he covers cybersecurity news, technology, and other topics.
