Anthropic Details Claude 3.5 Sonnet Safeguards and Jailbreak Framework
Key Takeaways Anthropic has released extensive technical documentation outlining the security measures implemented in its Claude 3.5 Sonnet AI model. The disclosure details a new safety classifier...
Key Takeaways
- Anthropic has released extensive technical documentation outlining the security measures implemented in its Claude 3.5 Sonnet AI model.
- The disclosure details a new safety classifier system designed to categorize cybersecurity-related queries and a proposed framework for evaluating the severity of “jailbreaks.”
- The AI’s safety mechanisms classify cybersecurity requests into four distinct categories: Prohibited, High-risk Dual Use, Low-risk Dual Use, and Benign, to manage the dual-use nature of cyber capabilities.
- A new Cyber Jailbreak Severity (CJS) Framework, developed with Glasswing, rates jailbreaks from CJS-0 (Informational) to CJS-4 (Critical) based on factors like capability gain, breadth, ease of weaponization, and discoverability.
Anthropic Details Claude 3.5 Sonnet Safeguards and Jailbreak Framework
Anthropic has released comprehensive technical documentation detailing the cybersecurity safeguards integrated into its Claude 3.5 Sonnet AI model, following its recent global redeployment. This disclosure provides insight into the AI’s internal safety classifier system and introduces a preliminary framework for assessing the severity of jailbreak attempts, developed in collaboration with Glasswing.
Table Of Content
Claude 3.5 Sonnet’s Safety Classifier System
Claude 3.5 Sonnet’s safety classifiers are designed to intelligently categorize cybersecurity-related requests, moving beyond a blanket blocking approach. This nuanced system acknowledges the dual-use nature inherent in many cyber capabilities, sorting requests into four distinct categories:
- Prohibited Use: Activities such as ransomware development, wiper creation, cyber-physical sabotage, malware development, command-and-control (C2) infrastructure setup, and defense evasion techniques are consistently blocked. This reflects their high potential for harm and minimal legitimate defensive utility.
- High-Risk Dual Use: Categories like penetration testing, exploit development, privilege escalation, and advanced vulnerability discovery are currently blocked. This is a temporary measure, pending the implementation of more robust authorization controls.
- Low-Risk Dual Use: Activities including OSINT gathering, identifying already-known vulnerabilities, and cryptographic protocol testing are generally permitted. However, these are subject to a “safety margin” that intercepts and blocks borderline cases to prevent misuse.
- Benign Use: Tasks such as secure coding, patch management, log analysis, malware reverse engineering, and security education are allowed with minimal monitoring, recognizing their clear positive impact on cybersecurity.
Notably, Anthropic distinguishes between vulnerability discovery that can be performed by existing models (which is allowed) and novel, high-impact findings that are beyond the capabilities of competing tools (which are blocked). This approach aligns with National Security Agency (NSA) guidance, which suggests that responsible disclosure typically benefits defenders more than attackers.
Cyber Jailbreak Severity (CJS) Framework
Anthropic has proposed a Cyber Jailbreak Severity (CJS) framework, designed to standardize the assessment of jailbreak risks. This logarithmic scale ranges from CJS-0 (Informational) to CJS-4 (Critical), with each successive tier representing a significantly higher level of risk.
The rating is determined by evaluating four key scoring axes:
- Capability Gain: Assesses how significantly the jailbreak enhances attacker capabilities beyond existing tools (0–4 points).
- Breadth: Measures the extent to which the technique can be generalized across various attack types or targets (0–2 points).
- Ease of Weaponization: Evaluates the level of LLM expertise required to operationalize the exploit (0–2 points).
- Discoverability: Determines how easily threat actors could independently uncover the technique (0–2 points).
The summed scores from these axes map to specific severity bands: CJS-1 (Low, 1–3.5 points), CJS-2 (Medium, 4–6.5 points), CJS-3 (High, 7–8.5 points), and CJS-4 (Critical, 9–10 points). Anthropic emphasizes that while the final rating can be escalated based on discretionary factors—such as unpatched fundamental vulnerabilities or compounding risks from linked findings—it can never be reduced.
Anthropic is actively seeking feedback on this framework, inviting comments to [email protected]. Additionally, the company has launched a dedicated HackerOne bug bounty program, encouraging researchers to report potential jailbreaks in Claude 3.5 Sonnet.
The company frames this initiative as an early-stage effort to foster a common language between AI developers and government bodies, facilitating consistent discussions regarding jailbreak risks. It is important to note that this framework specifically excludes non-cybersecurity jailbreaks, such as system prompt extraction, as Anthropic already publishes these voluntarily.
Disclaimer: HackersRadar reports on cybersecurity threats and incidents for informational and awareness purposes only. We do not engage in hacking activities, data exfiltration, or the hosting or distribution of stolen or leaked information. All content is based on publicly available sources.



No Comment! Be the first one.