
DeepSeek Data Leak Exposes 12,000 Hardcoded API Keys and Passwords


A sweeping analysis of the Common Crawl dataset—a cornerstone of training data for large language models (LLMs) like DeepSeek—has uncovered 11,908 live API keys, passwords, and credentials embedded in publicly accessible web pages. 

The leaked secrets, which authenticate successfully with services ranging from AWS to Slack and Mailchimp, highlight systemic risks in AI development pipelines as models inadvertently learn insecure coding practices from exposed data.

Researchers at Truffle Security traced the root cause to widespread credential hardcoding across 2.76 million web pages archived in the December 2024 Common Crawl snapshot, raising urgent questions about safeguards for AI-generated code.


The Anatomy of the DeepSeek Training Data Exposure

The Common Crawl dataset, a 400-terabyte repository of web content scraped from 2.67 billion pages, serves as foundational training material for DeepSeek and other leading LLMs. 

When Truffle Security scanned this corpus using its open-source TruffleHog tool, it discovered not only thousands of valid credentials but also troubling reuse patterns.
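To make the detection step concrete, the sketch below shows a minimal regex-based scan over a page's raw content. The patterns are simplified assumptions for illustration only; TruffleHog's real detectors are far more extensive and pair every match with a live-verification step.

```python
import re

# Illustrative patterns only -- not TruffleHog's actual detector set.
PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "slack_webhook": re.compile(r"https://hooks\.slack\.com/services/T[A-Za-z0-9/_-]+"),
    "mailchimp_key": re.compile(r"[0-9a-f]{32}-us[0-9]{1,2}"),
}

def scan_page(html: str) -> dict[str, list[str]]:
    """Return candidate secrets found in a single page's HTML/JS."""
    hits = {}
    for name, pattern in PATTERNS.items():
        found = pattern.findall(html)
        if found:
            hits[name] = found
    return hits
```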

Root AWS key exposed

For instance, a single WalkScore API key appeared 57,029 times across 1,871 subdomains, while one webpage contained 17 unique Slack webhooks hardcoded into front-end JavaScript. 

Mailchimp API keys dominated the leak, with 1,500 unique keys enabling potential phishing campaigns and data theft.
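Reuse statistics like these can be pictured as a simple aggregation over (page URL, secret) findings. The helper below is a hypothetical sketch, not Truffle Security's code; it shows how a single key surfacing across 1,871 subdomains would be counted.

```python
from collections import defaultdict
from urllib.parse import urlparse

def reuse_report(findings):
    """findings: an iterable of (page_url, secret_value) pairs.

    Groups identical secret values and reports total occurrences plus the
    number of distinct hosts each one appears on, sorted by occurrence count.
    """
    hosts = defaultdict(set)
    counts = defaultdict(int)
    for url, secret in findings:
        hosts[secret].add(urlparse(url).hostname)
        counts[secret] += 1
    return sorted(
        ((secret, counts[secret], len(domains)) for secret, domains in hosts.items()),
        key=lambda row: row[1],
        reverse=True,
    )
```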

Infrastructure at Scale: Scanning 90,000 Web Archives

To process Common Crawl’s 90,000 WARC (Web ARChive) files, Truffle Security deployed a distributed system across 20 high-performance servers. 

Each node downloaded 4GB compressed files, split them into individual web records, and ran TruffleHog to detect and verify live secrets.
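A rough sketch of that per-file step is shown below, assuming the open-source warcio package for WARC parsing and any detector callable (such as the scan_page sketch above) standing in for TruffleHog; the actual distributed pipeline is more involved.

```python
from warcio.archiveiterator import ArchiveIterator  # assumes the warcio package is installed

def scan_warc(path: str, detect):
    """Iterate the page records in one Common Crawl WARC file and run a detector.

    `detect` is any callable that takes decoded page text and returns candidate
    secrets; the real pipeline runs TruffleHog, which also verifies each hit.
    """
    results = []
    with open(path, "rb") as stream:
        for record in ArchiveIterator(stream):   # handles the gzip framing transparently
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            text = record.content_stream().read().decode("utf-8", errors="replace")
            hits = detect(text)
            if hits:
                results.append((url, hits))
    return results
```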

To quantify real-world risks, the team prioritized verified credentials—keys that actively authenticated with their respective services.
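"Verified" here means the service itself accepts the credential. As a hedged illustration (and something to run only against credentials you are authorized to test), one plausible check per service looks like the following; endpoints are real but error handling is simplified.

```python
import boto3                      # AWS SDK for Python; assumed installed
import requests
from botocore.exceptions import ClientError

def verify_aws(access_key: str, secret_key: str) -> bool:
    """STS GetCallerIdentity succeeds for any valid AWS key pair."""
    try:
        boto3.client(
            "sts",
            aws_access_key_id=access_key,
            aws_secret_access_key=secret_key,
        ).get_caller_identity()
        return True
    except ClientError:
        return False

def verify_mailchimp(api_key: str) -> bool:
    """Mailchimp keys embed a data-center suffix (e.g. '-us21');
    the /3.0/ping endpoint answers 200 for a valid key."""
    dc = api_key.rsplit("-", 1)[-1]
    resp = requests.get(
        f"https://{dc}.api.mailchimp.com/3.0/ping",
        auth=("anystring", api_key),
        timeout=10,
    )
    return resp.status_code == 200
```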

Server Response

Notably, 63% of secrets were reused across multiple sites, amplifying breach potential.

This technical feat revealed startling cases like an AWS root key embedded in front-end HTML for S3 Basic Authentication—a practice with no functional benefit but grave security implications. 

Researchers also identified software firms recycling API keys across client sites, inadvertently exposing customer lists.
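For contrast, the conventional way to give a browser access to S3 objects without ever shipping a key is a short-lived presigned URL generated server-side. This is a generic sketch with placeholder bucket and object names, not code from the report.

```python
import os
import boto3

# Credentials stay server-side, read from the environment, never emitted to HTML.
s3 = boto3.client(
    "s3",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)

def browser_safe_download_url(bucket: str, key: str, expires: int = 300) -> str:
    """Return a short-lived presigned URL the front end can use directly,
    so no AWS key ever needs to appear in client-side HTML or JavaScript."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires,
    )
```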

Why LLMs Like DeepSeek Amplify the Threat

While Common Crawl’s data reflects broader internet security failures, integrating these examples into LLM training sets creates a feedback loop.

Models cannot distinguish between live keys and placeholder examples during training, normalizing insecure patterns like credential hardcoding. 

Invalid Secret

This issue gained attention last month when researchers observed LLMs repeatedly instructing developers to embed secrets directly into code—a practice traceable to flawed training examples.

The Verification Gap in AI-Generated Code

Truffle Security’s findings underscore a critical blind spot: even if 99% of the detected secrets were invalid, their sheer volume in the training data skews LLM outputs toward insecure recommendations. 

For instance, a model exposed to thousands of front-end Mailchimp API keys may learn to prioritize convenience over security, suggesting hardcoded client-side keys instead of backend environment variables.
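The secure pattern such a model should instead be reproducing keeps the key in a backend environment variable and exposes only a narrow proxy endpoint to the browser. A minimal sketch, assuming Flask and placeholder environment variable names:

```python
import os
import requests
from flask import Flask, request, jsonify   # minimal backend proxy sketch

app = Flask(__name__)
MAILCHIMP_KEY = os.environ["MAILCHIMP_API_KEY"]   # never shipped to the browser
DC = MAILCHIMP_KEY.rsplit("-", 1)[-1]             # data-center suffix, e.g. 'us21'
LIST_ID = os.environ["MAILCHIMP_LIST_ID"]         # placeholder audience id

@app.post("/api/subscribe")
def subscribe():
    """The browser posts only an email address; the key stays server-side."""
    email = request.json["email"]
    resp = requests.post(
        f"https://{DC}.api.mailchimp.com/3.0/lists/{LIST_ID}/members",
        auth=("anystring", MAILCHIMP_KEY),
        json={"email_address": email, "status": "subscribed"},
        timeout=10,
    )
    return jsonify(ok=resp.ok), resp.status_code
```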

Example of a root AWS key exposed in front-end HTML.

This problem persists across all major LLM training datasets derived from public code repositories and web content.

Industry Responses and Mitigation Strategies

In response, Truffle Security advocates for multilayered safeguards. Developers using AI coding assistants can implement Copilot Instructions or Cursor Rules to inject security guardrails into LLM prompts. 

For example, a rule specifying “Never suggest hardcoded credentials” steers models toward secure alternatives.
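As a hedged illustration, such a project-level rules file (for example Cursor's .cursorrules or GitHub Copilot's .github/copilot-instructions.md; the filename and wording below are illustrative, not a prescription from the researchers) might read:

```
# Security guardrails for AI-assisted code suggestions (illustrative)
- Never suggest hardcoded API keys, passwords, or tokens in any language.
- Read secrets from environment variables or a secrets manager instead.
- Never place credentials in front-end HTML, JavaScript, or client-side config.
- If an example requires a credential, use an obvious placeholder such as YOUR_API_KEY.
```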

On an industry level, researchers propose techniques like Constitutional AI to embed ethical constraints directly into model behavior, reducing harmful outputs. 

However, this requires collaboration between AI developers and cybersecurity experts to audit training data and implement robust redaction pipelines.

This incident underscores the need for proactive measures:

  1. Expand secret scanning to public datasets like Common Crawl and GitHub.
  2. Reevaluate AI training pipelines to filter or anonymize sensitive data (a redaction sketch follows this list).
  3. Enhance developer education on secure credential management.
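A minimal sketch of the filtering and anonymization idea from point 2, reusing the illustrative patterns from the earlier scanning sketch; the placeholder values and function names are assumptions, not a production redaction pipeline.

```python
import re

# Obvious, non-functional stand-ins for anything that matches a secret pattern.
PLACEHOLDER = {
    "aws_access_key": "AKIAXXXXXXXXEXAMPLE",
    "slack_webhook": "https://hooks.slack.com/services/T000/B000/EXAMPLE",
    "mailchimp_key": "00000000000000000000000000000000-us0",
}

def redact_document(text: str, patterns: dict[str, re.Pattern]) -> str:
    """Replace anything that looks like a live secret with an obvious
    placeholder before the document enters a training corpus."""
    for name, pattern in patterns.items():
        text = pattern.sub(PLACEHOLDER[name], text)
    return text
```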

As LLMs like DeepSeek become integral to software development, securing their training ecosystems isn’t optional—it’s existential.

The 12,000 leaked keys are merely a symptom of a deeper ailment: our collective failure to sanitize the data shaping tomorrow’s AI.


Divya
Divya is a Senior Journalist at GBhackers covering Cyber Attacks, Threats, Breaches, Vulnerabilities and other happenings in the cyber world.
