TokenBreak Attack Bypasses AI Moderation using Character Manipulation

Kristin Sestito@hiddenlayer.com //

TokenBreak Attack Bypasses AI Moderation using Character Manipulation

Cybersecurity researchers have recently unveiled a novel attack, dubbed TokenBreak, that exploits vulnerabilities in the tokenization process of large language models (LLMs). This technique allows malicious actors to bypass safety and content moderation guardrails with minimal alterations to text input. By manipulating individual characters, attackers can induce false negatives in text classification models, effectively evading detection mechanisms designed to prevent harmful activities like prompt injection, spam, and the dissemination of toxic content. The TokenBreak attack highlights a critical flaw in AI security, emphasizing the need for more robust defenses against such exploitation.

The TokenBreak attack specifically targets the way models tokenize text, the process of breaking down raw text into smaller units or tokens. HiddenLayer researchers discovered that models using Byte Pair Encoding (BPE) or WordPiece tokenization strategies are particularly vulnerable. By adding subtle alterations, such as adding an extra letter to a word like changing "instructions" to "finstructions", the meaning of the text is still understood. This manipulation causes different tokenizers to split the text in unexpected ways, effectively fooling the AI's detection mechanisms. The fact that the altered text remains understandable underscores the potential for attackers to inject malicious prompts and bypass intended safeguards.

To mitigate the risks associated with the TokenBreak attack, experts recommend several strategies. Selecting models that use Unigram tokenizers, which have demonstrated greater resilience to this type of manipulation, is crucial. Additionally, organizations should ensure tokenization and model logic alignment and implement misclassification logging to better detect and respond to potential attacks. Understanding the underlying protection model's family and its tokenization strategy is also critical. The TokenBreak attack serves as a reminder of the ever-evolving landscape of AI security and the importance of proactive measures to protect against emerging threats.

Original img attribution: https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEio0Wo64B_Iyi8iwpo_QIVHt-fLjo17610Hdz4n4H8JSYAOqiB-N6WHnc-4L_dFZ5wTDqZyFt63MdZ1aN00BG6Sh_UBymmJjlcwO1QkoD68wwFsQzfR3bZeUxxZF7Y0dpu33MIL_7ybdQVuvxHo4i1BfgKtwZ1112jOP88bNpmP9iuy__gcmXIneONP2xoi/s728-rw-e365/prompt.jpg

ImgSrc: blogger.googleu

References :

Security Risk Advisors: TokenBreak attack bypasses AI text filters by manipulating tokens. BERT/RoBERTa vulnerable, DeBERTa resistant. #AISecuority #LLM #PromptInjection The post appeared first on .
The Hacker News: Cybersecurity researchers have discovered a novel attack technique called TokenBreak that can be used to bypass a large language model's (LLM) safety and content moderation guardrails with just a single character change.
www.scworld.com: Researchers detail how malicious actors could exploit the novel TokenBreak attack technique to compromise large language models' tokenization strategy and evade implemented safety and content moderation protections
hiddenlayer.com: New TokenBreak Attack Bypasses AI Moderation with Single-Character Text Changes

Classification:

HashTags: #AISecurity #LLM #PromptInjection
Company: HiddenLayer
Target: AI Moderation Systems
Attacker: HiddenLayer
Product: LLM
Feature: Token Manipulation
Malware: TokenBreak
Type: AI
Severity: Medium

News from the AI & ML world

DeeperML

TokenBreak Attack Bypasses AI Moderation using Character Manipulation

Classification: