Critical Flaw in 11 AI Models, Including ChatGPT, Claude, Gemini

Key Takeaways

A novel jailbreak technique, dubbed “sockpuppeting,” can bypass safety mechanisms in 11 prominent large language models (LLMs).
The attack leverages a single line of code by exploiting legitimate API features designed for “assistant prefill.”
Affected models include those from Google (Gemini), Anthropic (Claude), and OpenAI (ChatGPT), with varying susceptibility.
Defenses are available, primarily through API-level message validation or robust internal model resistance.

Cybersecurity researchers have uncovered a critical vulnerability, dubbed “sockpuppeting,” that allows attackers to circumvent the built-in safety protocols of eleven leading large language models (LLMs), including offerings from Google, Anthropic, and OpenAI. This technique, requiring only a single line of malicious code, exploits a common API functionality to trick LLMs into generating prohibited content.

Unlike more intricate attack methodologies, sockpuppeting capitalizes on API features intended for “assistant prefill.” This legitimate function permits developers to pre-populate an AI assistant’s response, guiding its output format. Attackers weaponize this by injecting a seemingly compliant prefix, such as “Sure, here is how to do it,” directly into the assistant’s role within the API call.

The core of the exploit lies in the LLM’s inherent drive for self-consistency. Once a malicious prefill is accepted, the model is compelled to continue generating content that aligns with the fabricated agreement, effectively bypassing its standard safety filters and producing harmful or restricted information.

Comparison of normal and sockpuppet flows(source : trendmicro )

Model Vulnerability Testing

Researchers at Trend Micro, who detailed this vulnerability, characterize sockpuppeting as a black-box attack. It requires no complex optimization or access to the model’s internal weights, making it relatively straightforward to execute.

Testing revealed varying degrees of susceptibility across different LLMs. Google’s Gemini 2.5 Flash exhibited the highest vulnerability, with an attack success rate (ASR) of 15.7%. In contrast, OpenAI’s GPT-4o-mini demonstrated the strongest resistance, achieving an ASR of just 0.5%.

When successful, these attacks led to concerning outcomes, including the generation of functional malicious exploit code and the leakage of highly confidential system prompts. The most effective strategy for deploying the sockpuppeting exploit involved multi-turn persona setups. In these scenarios, the LLM is initially instructed to operate as an unrestricted assistant before the attacker injects the deceptive agreement.

ASR by model, ranked highest to lowest, with blocked models shown at 0%(source : trendmicro)

Furthermore, researchers found that “task-reframing” variants of the attack could successfully bypass robust safety training. These variants cleverly disguised harmful requests as benign data formatting tasks, further illustrating the adaptability of the sockpuppeting technique.

API Provider Defenses

The way different major API providers handle assistant prefills significantly impacts their models’ exposure to this vulnerability. Some providers have implemented strong preventative measures at the API layer.

For instance, OpenAI and AWS Bedrock entirely block assistant prefills, thereby eliminating the attack surface and providing the most robust defense. Conversely, platforms such as Google Vertex AI accept prefill requests for certain models, compelling the AI to rely solely on its internal safety training for protection.

The three defense layers: API Block, Model Resistance, and Broadly Vulnerable(source : trendmicro)

What You Should Do

Implement message-ordering validation at the API layer to block assistant-role messages that could be used for prefill attacks.
For organizations utilizing self-hosted inference servers like Ollama or vLLM, manual enforcement of message validation is crucial, as these platforms may not ensure proper message ordering by default, as highlighted by Trend Micro.
Proactively incorporate assistant prefill attack scenarios into standard AI red-teaming exercises to identify and mitigate potential vulnerabilities.
Stay informed about updates and patches from your LLM providers regarding API security and model safety.

Disclaimer: HackersRadar reports on cybersecurity threats and incidents for informational and awareness purposes only. We do not engage in hacking activities, data exfiltration, or the hosting or distribution of stolen or leaked information. All content is based on publicly available sources.

Tags:

Social Media

Critical Flaw in 11 AI Models, Including ChatGPT, Claude, Gemini

Key Takeaways

Table Of Content

Model Vulnerability Testing

API Provider Defenses

What You Should Do

Tags:

Marcus Rodriguez

Magecart Skimmer Exploits SVG Vulnerability on Magento Checkout Pages

DesckVB RAT Evades Detection With Obfuscated JavaScript and Fileless .NET Loader

No Comment! Be the first one.

Leave a Reply Cancel reply

Popular Posts

Critical Fluentd Vulnerabilities Allow Remote Code Execution

Weaponized Google Ads Install Malicious Claude Code to Hijack macOS

Critical Adobe ColdFusion Vulnerabilities Let Attackers Run Code

Top Authors

Let's Connect

Related Posts

GlassWorm Attacks macOS via Malicious VS Code…

ClickFix Attack Hides Malicious Code via Stegan Security

MongoBleed Detector Tool Released to Detect MongoDB Vulnerability(CVE-2025-14847)

Conti Ransomware Gang Leaders & Infrastructure Exposed

Quick Links

Categories

Let's keep in touch

Follow Us

Social Media

Search the Site

Recent Posts

Critical Flaw in 11 AI Models, Including ChatGPT, Claude, Gemini

Key Takeaways

Table Of Content

Model Vulnerability Testing

API Provider Defenses

What You Should Do

Tags:

Share Article

Magecart Skimmer Exploits SVG Vulnerability on Magento Checkout Pages

DesckVB RAT Evades Detection With Obfuscated JavaScript and Fileless .NET Loader

No Comment! Be the first one.

Leave a Reply Cancel reply

Popular Posts

Top Authors

Let's Connect

Related Posts

Quick Links

Categories

Let's keep in touch

Follow Us