Critical Flaw in 11 AI Models, Including ChatGPT, Claude, Gemini
Key Takeaways A novel jailbreak technique, dubbed “sockpuppeting,” can bypass safety mechanisms in 11 prominent large language models (LLMs). The attack leverages a single line of code by...
Key Takeaways
- A novel jailbreak technique, dubbed “sockpuppeting,” can bypass safety mechanisms in 11 prominent large language models (LLMs).
- The attack leverages a single line of code by exploiting legitimate API features designed for “assistant prefill.”
- Affected models include those from Google (Gemini), Anthropic (Claude), and OpenAI (ChatGPT), with varying susceptibility.
- Defenses are available, primarily through API-level message validation or robust internal model resistance.
Cybersecurity researchers have uncovered a critical vulnerability, dubbed “sockpuppeting,” that allows attackers to circumvent the built-in safety protocols of eleven leading large language models (LLMs), including offerings from Google, Anthropic, and OpenAI. This technique, requiring only a single line of malicious code, exploits a common API functionality to trick LLMs into generating prohibited content.
Table Of Content
Unlike more intricate attack methodologies, sockpuppeting capitalizes on API features intended for “assistant prefill.” This legitimate function permits developers to pre-populate an AI assistant’s response, guiding its output format. Attackers weaponize this by injecting a seemingly compliant prefix, such as “Sure, here is how to do it,” directly into the assistant’s role within the API call.
The core of the exploit lies in the LLM’s inherent drive for self-consistency. Once a malicious prefill is accepted, the model is compelled to continue generating content that aligns with the fabricated agreement, effectively bypassing its standard safety filters and producing harmful or restricted information.

Model Vulnerability Testing
Researchers at Trend Micro, who detailed this vulnerability, characterize sockpuppeting as a black-box attack. It requires no complex optimization or access to the model’s internal weights, making it relatively straightforward to execute.
Testing revealed varying degrees of susceptibility across different LLMs. Google’s Gemini 2.5 Flash exhibited the highest vulnerability, with an attack success rate (ASR) of 15.7%. In contrast, OpenAI’s GPT-4o-mini demonstrated the strongest resistance, achieving an ASR of just 0.5%.
When successful, these attacks led to concerning outcomes, including the generation of functional malicious exploit code and the leakage of highly confidential system prompts. The most effective strategy for deploying the sockpuppeting exploit involved multi-turn persona setups. In these scenarios, the LLM is initially instructed to operate as an unrestricted assistant before the attacker injects the deceptive agreement.

Furthermore, researchers found that “task-reframing” variants of the attack could successfully bypass robust safety training. These variants cleverly disguised harmful requests as benign data formatting tasks, further illustrating the adaptability of the sockpuppeting technique.
API Provider Defenses
The way different major API providers handle assistant prefills significantly impacts their models’ exposure to this vulnerability. Some providers have implemented strong preventative measures at the API layer.
For instance, OpenAI and AWS Bedrock entirely block assistant prefills, thereby eliminating the attack surface and providing the most robust defense. Conversely, platforms such as Google Vertex AI accept prefill requests for certain models, compelling the AI to rely solely on its internal safety training for protection.

What You Should Do
- Implement message-ordering validation at the API layer to block assistant-role messages that could be used for prefill attacks.
- For organizations utilizing self-hosted inference servers like Ollama or vLLM, manual enforcement of message validation is crucial, as these platforms may not ensure proper message ordering by default, as highlighted by Trend Micro.
- Proactively incorporate assistant prefill attack scenarios into standard AI red-teaming exercises to identify and mitigate potential vulnerabilities.
- Stay informed about updates and patches from your LLM providers regarding API security and model safety.
Disclaimer: HackersRadar reports on cybersecurity threats and incidents for informational and awareness purposes only. We do not engage in hacking activities, data exfiltration, or the hosting or distribution of stolen or leaked information. All content is based on publicly available sources.



No Comment! Be the first one.