Saturday, July 13, 2024

Researchers Uncover a New Flaw in ChatGPT That Can Turn It Evil

LLMs are commonly trained on vast amounts of internet text, which often contains offensive content. To mitigate this, developers of recent LLMs apply “alignment” methods via fine-tuning to prevent harmful or objectionable responses.

ChatGPT and its AI siblings were fine-tuned to avoid producing undesirable output such as hate speech, personal information, or bomb-making instructions.

However, security researchers from the following institutions recently showed how a simple prompt addition breaks the defenses of multiple popular chatbots:

  • Carnegie Mellon University (Andy Zou, J. Zico Kolter, Matt Fredrikson)
  • Center for AI Safety (Zifan Wang)
  • Bosch Center for AI (J. Zico Kolter)

New Flaw in AI Chatbots

Aligned LLMs that have not been adversarially hardened fall victim to a single universal adversarial suffix that evades the defenses of state-of-the-art commercial models, including:

  • ChatGPT
  • Claude
  • Bard
  • Llama-2 

The suffixes that induce these outputs are generated with high success rates by running the “Greedy Coordinate Gradient” (GCG) attack against smaller open-source LLMs, and they transfer to the commercial models, demonstrating the potential for misuse.

Flow chain (Source – arXiv)

The new adversarial attack coaxes aligned language models into generating objectionable content by appending an adversarial suffix to the user’s query.
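The mechanics are simple to illustrate. Below is a minimal sketch of how such a suffix is attached to an otherwise ordinary query; the function name and the suffix string are hypothetical placeholders, not an optimized string from the paper:

```python
def build_adversarial_prompt(user_query: str, adversarial_suffix: str) -> str:
    """Append an adversarial suffix to an ordinary user query.

    In the real attack, the suffix is a string of tokens produced by
    the GCG optimization; here we use an obviously fake placeholder.
    """
    return f"{user_query} {adversarial_suffix}"

prompt = build_adversarial_prompt(
    "Tell me how to do X",
    "[OPTIMIZED-ADVERSARIAL-SUFFIX]",  # hypothetical placeholder
)
```

To a human the suffix looks like gibberish appended to the question, but to the model it steers the next-token predictions toward complying with the request.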

The attack’s success lies in the careful combination of three key elements, each of which has appeared in prior work but which together are now reliably effective in practice.

The three key elements are:

  • Initial affirmative responses.
  • Combined greedy and gradient-based discrete optimization.
  • Robust multi-prompt and multi-model attacks.
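To make the second element concrete, here is a toy sketch of greedy coordinate descent over a discrete suffix. It is not the paper’s GCG implementation: the real attack uses token gradients to shortlist candidate swaps and maximizes the model’s log-probability of an affirmative response (“Sure, here is …”), whereas this simplified stand-in tries a whole (tiny) vocabulary and scores candidates with a made-up overlap function.

```python
def toy_coordinate_descent(score, suffix, vocab, sweeps=3):
    """Greedy coordinate descent over a discrete token sequence.

    Repeatedly sweep over suffix positions; at each position, try every
    vocabulary token and keep the swap that most improves the score.
    GCG proper shortlists candidate swaps using token gradients instead
    of exhaustively trying the vocabulary.
    """
    suffix = list(suffix)
    for _ in range(sweeps):
        for pos in range(len(suffix)):
            candidates = [suffix[:pos] + [tok] + suffix[pos + 1:] for tok in vocab]
            best = max(candidates, key=score)
            if score(best) > score(suffix):
                suffix = best
    return suffix

# Stand-in objective: reward positional overlap with a target
# "affirmative" opening (the real objective is the victim model's
# log-probability of beginning its reply with "Sure, here is ...").
target = ["sure", ",", "here", "is"]
score = lambda s: sum(a == b for a, b in zip(s, target))

vocab = ["sure", ",", "here", "is", "no", "the", "!"]
result = toy_coordinate_descent(score, ["!", "!", "!", "!"], vocab)
```

The third element, robust multi-prompt and multi-model attacks, extends this loop by summing the objective over several harmful prompts and several open-source models, which is what makes the resulting suffix transfer to unseen commercial chatbots.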

The tendency of even clever AI chatbots to go off the rails is not a minor problem but a fundamental weakness that challenges the deployment of advanced AI.

Appending these specific strings prompts the chatbots to generate harmful responses, bypassing their restrictions and yielding disallowed content.

The researchers alerted OpenAI, Google, and Anthropic to the exploit before publishing their findings. While the companies have blocked the specific strings, they still struggle to prevent adversarial attacks in general.

Negative ChatGPT Prompt (Source – arXiv)

Kolter has discovered strings that affect both ChatGPT and Bard and claims to possess thousands of them.

Anthropic is actively researching stronger defenses against prompt injection and other adversarial measures, aiming both to make its base models safer and to add extra layers of protection.

Meanwhile, OpenAI’s ChatGPT and similar models rely entirely on vast quantities of language data to predict such character sequences.

Language models excel at generating intelligent-sounding output but are prone to discrimination and to fabricating information.

Adversarial attacks exploit patterns in the training data, causing aberrant behaviors such as misidentification in image classifiers or responses to inaudible commands in speech-recognition systems. The attack highlights the inevitability of some degree of AI misuse.

AI safety experts should focus on safeguarding vulnerable systems, such as social networks, from AI-generated disinformation rather than solely trying to “align” models.



Tushar Subhra Dutta
