Hackers News Hackers News
  • CyberSecurity News
  • Threats
  • Attacks
  • Vulnerabilities
  • Breaches
  • Comparisons

Social Media

Hackers News Hackers News
  • CyberSecurity News
  • Threats
  • Attacks
  • Vulnerabilities
  • Breaches
  • Comparisons
Search the Site
Popular Searches:
technology Amazon AI
Recent Posts
India Halts WhatsApp Usernames Rollout Due to Fraud Concerns
July 1, 2026
Critical Cursor IDE RCE Vulnerabilities Allow Zero-Click Prompt Injection
July 1, 2026
Automated Password Spray Attacks Target Microsoft Azure CLI
July 1, 2026
Home/CyberSecurity News/Critical Flaw in 11 AI Models, Including ChatGPT, Claude, Gemini
CyberSecurity News

Critical Flaw in 11 AI Models, Including ChatGPT, Claude, Gemini

Key Takeaways A novel jailbreak technique, dubbed “sockpuppeting,” can bypass safety mechanisms in 11 prominent large language models (LLMs). The attack leverages a single line of code by...

Marcus Rodriguez
Marcus Rodriguez
April 10, 2026 3 Min Read
42 0

Key Takeaways

  • A novel jailbreak technique, dubbed “sockpuppeting,” can bypass safety mechanisms in 11 prominent large language models (LLMs).
  • The attack leverages a single line of code by exploiting legitimate API features designed for “assistant prefill.”
  • Affected models include those from Google (Gemini), Anthropic (Claude), and OpenAI (ChatGPT), with varying susceptibility.
  • Defenses are available, primarily through API-level message validation or robust internal model resistance.

Cybersecurity researchers have uncovered a critical vulnerability, dubbed “sockpuppeting,” that allows attackers to circumvent the built-in safety protocols of eleven leading large language models (LLMs), including offerings from Google, Anthropic, and OpenAI. This technique, requiring only a single line of malicious code, exploits a common API functionality to trick LLMs into generating prohibited content.

Table Of Content

  • Key Takeaways
  • Model Vulnerability Testing
  • API Provider Defenses
  • What You Should Do

Unlike more intricate attack methodologies, sockpuppeting capitalizes on API features intended for “assistant prefill.” This legitimate function permits developers to pre-populate an AI assistant’s response, guiding its output format. Attackers weaponize this by injecting a seemingly compliant prefix, such as “Sure, here is how to do it,” directly into the assistant’s role within the API call.

The core of the exploit lies in the LLM’s inherent drive for self-consistency. Once a malicious prefill is accepted, the model is compelled to continue generating content that aligns with the fabricated agreement, effectively bypassing its standard safety filters and producing harmful or restricted information.

Comparison of normal and sockpuppet flows(source : trendmicro )
Comparison of normal and sockpuppet flows(source : trendmicro )

Model Vulnerability Testing

Researchers at Trend Micro, who detailed this vulnerability, characterize sockpuppeting as a black-box attack. It requires no complex optimization or access to the model’s internal weights, making it relatively straightforward to execute.

Testing revealed varying degrees of susceptibility across different LLMs. Google’s Gemini 2.5 Flash exhibited the highest vulnerability, with an attack success rate (ASR) of 15.7%. In contrast, OpenAI’s GPT-4o-mini demonstrated the strongest resistance, achieving an ASR of just 0.5%.

When successful, these attacks led to concerning outcomes, including the generation of functional malicious exploit code and the leakage of highly confidential system prompts. The most effective strategy for deploying the sockpuppeting exploit involved multi-turn persona setups. In these scenarios, the LLM is initially instructed to operate as an unrestricted assistant before the attacker injects the deceptive agreement.

ASR by model, ranked highest to lowest, with blocked models shown at 0%(source : trendmicro)
ASR by model, ranked highest to lowest, with blocked models shown at 0%(source : trendmicro)

Furthermore, researchers found that “task-reframing” variants of the attack could successfully bypass robust safety training. These variants cleverly disguised harmful requests as benign data formatting tasks, further illustrating the adaptability of the sockpuppeting technique.

API Provider Defenses

The way different major API providers handle assistant prefills significantly impacts their models’ exposure to this vulnerability. Some providers have implemented strong preventative measures at the API layer.

For instance, OpenAI and AWS Bedrock entirely block assistant prefills, thereby eliminating the attack surface and providing the most robust defense. Conversely, platforms such as Google Vertex AI accept prefill requests for certain models, compelling the AI to rely solely on its internal safety training for protection.

The three defense layers: API Block, Model Resistance, and Broadly Vulnerable(source : trendmicro)
The three defense layers: API Block, Model Resistance, and Broadly Vulnerable(source : trendmicro)

What You Should Do

  • Implement message-ordering validation at the API layer to block assistant-role messages that could be used for prefill attacks.
  • For organizations utilizing self-hosted inference servers like Ollama or vLLM, manual enforcement of message validation is crucial, as these platforms may not ensure proper message ordering by default, as highlighted by Trend Micro.
  • Proactively incorporate assistant prefill attack scenarios into standard AI red-teaming exercises to identify and mitigate potential vulnerabilities.
  • Stay informed about updates and patches from your LLM providers regarding API security and model safety.

Disclaimer: HackersRadar reports on cybersecurity threats and incidents for informational and awareness purposes only. We do not engage in hacking activities, data exfiltration, or the hosting or distribution of stolen or leaked information. All content is based on publicly available sources.

Tags:

AttackExploitSecurityVulnerability

Share Article

Marcus Rodriguez

Marcus Rodriguez

Marcus is a security researcher and investigative journalist with expertise in vulnerability research, bug bounties, and cloud security. Since 2017, Marcus has been breaking stories on critical vulnerabilities affecting major platforms. His investigative work has led to the disclosure of numerous security flaws and improved defenses across the industry. Marcus is an active participant in bug bounty programs and has been recognized for responsible disclosure practices. He holds multiple security certifications and regularly speaks at industry events.

Previous Post

Magecart Skimmer Exploits SVG Vulnerability on Magento Checkout Pages

Next Post

DesckVB RAT Evades Detection With Obfuscated JavaScript and Fileless .NET Loader

No Comment! Be the first one.

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Popular Posts
Critical Fluentd Vulnerabilities Allow Remote Code Execution
July 1, 2026
Weaponized Google Ads Install Malicious Claude Code to Hijack macOS
July 1, 2026
Critical Adobe ColdFusion Vulnerabilities Let Attackers Run Code
July 1, 2026
Top Authors
Marcus Rodriguez
Marcus Rodriguez
Jennifer sherman
Jennifer sherman
Emy Elsamnoudy
Emy Elsamnoudy
Let's Connect
156k
2.25m
285k

Related Posts

Jennifer sherman
By Jennifer sherman
Threats

GlassWorm Attacks macOS via Malicious VS Code…

January 1, 2026
Emy Elsamnoudy
By Emy Elsamnoudy
Attacks

ClickFix Attack Hides Malicious Code via Stegan Security

January 1, 2026
Sarah simpson
By Sarah simpson
Vulnerabilities

MongoBleed Detector Tool Released to Detect MongoDB Vulnerability(CVE-2025-14847)

January 1, 2026
Emy Elsamnoudy
By Emy Elsamnoudy
Breaches

Conti Ransomware Gang Leaders & Infrastructure Exposed

January 1, 2026
Hackers News Hackers News
  • [email protected]

Quick Links

  • Contact Us
  • Privacy Policy
  • Terms of service

Categories

Attacks
Breaches
Comparisons
CyberSecurity News
Threats
Vulnerabilities

Let's keep in touch

receive fresh updates and breaking cyber news every day and week!

All Rights Reserved by HackersRadar ©2026

Follow Us