A new wave of cyber threats targeting large language models (LLMs) has emerged, exploiting their inherent inability to differentiate between informational content and actionable instructions.
Termed “indirect prompt injection attacks,” these exploits embed malicious directives within external data sources-such as documents, websites, or emails-that LLMs process during operation.
Unlike direct prompt injections, where attackers manipulate the model through crafted user inputs, indirect attacks hide payloads in seemingly benign content, making them stealthier and harder to detect.
As LLMs increasingly integrate into enterprise systems and software development workflows, the potential for data leaks, misinformation, and even malicious code injection poses a critical risk to digital ecosystems.
Stealthy Attacks Target LLMs
The potency of indirect prompt injection lies in its exploitation of LLMs’ trust in external data sources.
According to the Report, who introduced the first benchmark for these attacks, dubbed BIPIA, and published on arXivLabs, most LLMs are universally vulnerable due to their lack of boundary awareness when processing uncurated content.
This vulnerability allows attackers to manipulate a model’s behavior by poisoning datasets like public forums or repositories with malicious instructions.
For instance, attackers could seed GitHub repositories with vulnerable code or embed harmful prompts in package metadata on platforms like PyPI or npm, tricking AI-assisted tools into recommending or incorporating unsafe elements.
Chris Acevedo, a principal consultant at Optiv, likens this to a “poisoned well disguised as clean water,” emphasizing how attackers can bypass traditional security controls by leveraging trusted content channels.
In software supply chains, where LLMs are used for code generation and review, such attacks could cascade downstream, embedding vulnerabilities into countless systems, as warned by experts like Erich Kron from KnowBe4 and Jason Dion of Akylade.
Undermining Trust in AI-Driven Systems
Moreover, enterprise environments using LLMs trained on internal data, such as emails, face unique risks.
Christopher Cullen from Carnegie Mellon University’s CERT division notes that attackers can alter an LLM’s expected behavior by injecting malicious content into these datasets, even if blue teams believe they’ve blocked threats at the surface level.
Stephen Kowski of SlashNext adds that detection remains challenging without specialized AI security tools, as the attack activates only when the LLM processes the tainted content.
A notable, albeit benign, example involves Reddit users manipulating LLMs to avoid recommending popular restaurants, highlighting the technique’s potential for more sinister outcomes like promoting malicious code, as pointed out by Greg Anderson of DefectDojo.
To combat this, researchers propose novel defenses like boundary awareness and explicit reminders, with experiments showing significant mitigation, including near-zero attack success rates in white-box scenarios.
Acevedo urges immediate action, recommending organizations sanitize inputs, clearly delineate context from commands, tag untrusted sources, limit LLM actions like code execution, and regularly monitor outputs through red-teaming.
As LLMs continue to digest vast, unverified datasets, the question looms-whose words are they truly following? With indirect prompt injection already a reality, securing AI systems demands a proactive shift in how we handle the data they consume.
Setting Up SOC Team? – Download Free Ultimate SIEM Pricing Guide (PDF) For Your SOC Team -> Free Download