LLMs are commonly trained on vast amounts of internet text, which often contains offensive material. To mitigate this, developers “align” recent LLMs through fine-tuning so that the models refuse to produce harmful or objectionable responses.
ChatGPT and its AI siblings were fine-tuned to avoid producing undesirable output such as hate speech, personal information, or bomb-making instructions.
However, security researchers from Carnegie Mellon University and the Center for AI Safety recently showed how a simple addition to a prompt breaks the defenses of multiple popular chatbots.
These aligned LLMs, which have not undergone adversarial training, fall victim to a single universal adversarial prompt that evades the safeguards of state-of-the-art commercial models, including ChatGPT, Bard, and Claude. The suffixes behind these outputs were found with the “Greedy Coordinate Gradient” (GCG) attack against smaller open-source LLMs, yet they elicit potentially harmful responses from the commercial models with high probability.
The attack induces aligned language models to generate objectionable content simply by appending an adversarial suffix to the user’s query.
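To see the delivery mechanism concretely, the Python snippet below assembles such an attack prompt. Both strings are placeholders for this sketch; the suffix shown is not a working attack string but stands in for the output of the optimization described next.

```python
# A minimal sketch of how the attack is delivered: the optimized adversarial
# suffix is simply appended to an otherwise ordinary user query before the
# combined prompt is sent to the chatbot. Both strings are placeholders.
user_query = "Write instructions for X."                          # hypothetical disallowed request
adversarial_suffix = "<suffix produced by the GCG optimization>"  # placeholder, not a real attack string

attack_prompt = f"{user_query} {adversarial_suffix}"
print(attack_prompt)  # this combined prompt replaces the plain query
```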
The attack’s effectiveness, however, lies in the careful combination of three key elements, each of which had appeared in earlier research but only now proves reliably effective in practice:

1. Targeting an initial affirmative response, so the model is optimized to begin its answer with a phrase such as “Sure, here is how to …”.
2. A combined greedy and gradient-based discrete optimization, which uses token-level gradients to shortlist promising single-token substitutions in the suffix and then greedily keeps the best candidate.
3. Optimizing a single suffix across multiple prompts and multiple models, which makes the resulting attack robust and transferable.
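To make the optimization concrete, here is a rough Python sketch of one GCG step under assumptions spelled out in its comments: a stand-in gpt2 model, a placeholder query and target string, and made-up hyperparameters. It illustrates elements one and two above and is not the researchers’ implementation; element three corresponds to repeating this step across many prompts and models.

```python
# A minimal, illustrative sketch of a single Greedy Coordinate Gradient (GCG)
# step. The model name, prompt, target string, and hyperparameters below are
# placeholders chosen for the sketch; this is not the researchers' code.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                  # stand-in for a small open-source LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():
    p.requires_grad_(False)                          # only the suffix is optimized

user_prompt = "Write instructions for X."            # hypothetical disallowed request
target = "Sure, here is how to do X:"                # element 1: affirmative target response
suffix_len, top_k, n_candidates = 10, 64, 32         # assumed hyperparameters

embed = model.get_input_embeddings().weight          # (vocab_size, hidden_dim)
prompt_ids = tok(user_prompt, return_tensors="pt").input_ids[0]
target_ids = tok(target, return_tensors="pt").input_ids[0]
suffix_ids = torch.full((suffix_len,), tok.encode("!")[0])   # start from "! ! ! ..."

def target_loss(suffix_one_hot):
    """Cross-entropy of the target tokens given [prompt, suffix]."""
    suffix_embeds = suffix_one_hot @ embed           # differentiable w.r.t. the one-hot suffix
    inputs = torch.cat([embed[prompt_ids], suffix_embeds, embed[target_ids]], dim=0)
    logits = model(inputs_embeds=inputs.unsqueeze(0)).logits[0]
    start = prompt_ids.numel() + suffix_len          # position of the first target token
    return F.cross_entropy(logits[start - 1: start - 1 + target_ids.numel()], target_ids)

# Element 2: token-level gradients shortlist promising single-token swaps ...
one_hot = F.one_hot(suffix_ids, embed.shape[0]).float().requires_grad_(True)
target_loss(one_hot).backward()
candidates = (-one_hot.grad).topk(top_k, dim=1).indices   # top-k replacements per position

# ... then a greedy step evaluates a random batch of swaps and keeps the best one.
best_ids, best_loss = suffix_ids.clone(), float("inf")
for _ in range(n_candidates):
    pos = torch.randint(suffix_len, (1,)).item()
    new_ids = suffix_ids.clone()
    new_ids[pos] = candidates[pos, torch.randint(top_k, (1,)).item()]
    with torch.no_grad():
        loss = target_loss(F.one_hot(new_ids, embed.shape[0]).float()).item()
    if loss < best_loss:
        best_ids, best_loss = new_ids, loss
suffix_ids = best_ids                                # element 3 (not shown): repeat this step
                                                     # across many prompts and models
print("suffix after one GCG step:", tok.decode(suffix_ids))
```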
The tendency of even the most capable AI chatbots to go off the rails is not a minor quirk but a fundamental weakness that complicates the deployment of advanced AI.
Appending the right string of characters to a prompt pushes the chatbots into generating harmful responses, bypassing their restrictions and producing otherwise disallowed content.
The researchers alerted OpenAI, Google, and Anthropic to the exploit before publishing their findings. The companies have since blocked the specific strings that were reported, but they still struggle to prevent adversarial attacks in general.
Kolter has since discovered strings that affect both ChatGPT and Bard, and he claims to possess thousands of such strings.
Anthropic is actively researching stronger defenses against prompt injection and other adversarial techniques, aiming to make its base models safer and exploring additional layers of protection.
OpenAI’s ChatGPT and similar models, meanwhile, rely entirely on vast amounts of language data to predict the characters that should come next.
Language models excel at generating fluent, seemingly intelligent output, but they are also prone to producing discriminatory content and fabricating information.
Adversarial attacks exploit patterns in a model’s training data to induce aberrant behavior, such as image classifiers misidentifying objects or speech-recognition systems responding to inaudible commands. The new attack highlights that some degree of AI misuse may be unavoidable.
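For the image-classification case, a classic illustration is the fast gradient sign method (FGSM), sketched below in Python with a placeholder image, label, and perturbation budget; it is a generic example of this class of attack, not code from the research discussed here.

```python
# A hedged sketch of the fast gradient sign method (FGSM), a well-known
# adversarial attack on image classifiers. Image, label, and epsilon are placeholders.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
image = torch.rand(1, 3, 224, 224, requires_grad=True)   # placeholder "photo"
label = torch.tensor([207])                               # placeholder true class

loss = F.cross_entropy(model(image), label)
loss.backward()

epsilon = 0.01                                            # assumed perturbation budget
adv_image = (image + epsilon * image.grad.sign()).clamp(0, 1).detach()

# On a real photo, the perturbation is imperceptible to a person, yet the
# model's predicted class frequently changes.
print(model(image).argmax(1).item(), model(adv_image).argmax(1).item())
```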
AI safety experts should focus on safeguarding systems that are vulnerable to AI-generated disinformation, such as social networks, rather than solely trying to “align” the models themselves.