News from the AI & ML world

DeeperML

Latent Adversarial Training (LAT) has emerged as a promising method for enhancing the safety of Large Language Models (LLMs). A recent study compared LAT to standard Supervised Safety Fine-Tuning (SSFT) and Embedding Space Adversarial Training (AT) and found that LAT encodes refusal behavior in a more distributed way across the model's latent space. This means that instead of relying on a few specific elements, refusal is woven into the model's overall structure, potentially making it more resilient. The study investigated this by generating refusal vectors using each method.

The results indicated that refusal vectors computed from the LAT model were more effective in refusal-ablation attacks, lowering refusal rates across multiple models more than vectors derived from the other approaches. Paradoxically, however, the LAT-trained models themselves maintained the highest refusal rates and were the most robust overall against these attacks. This is likely because LAT's hidden-layer perturbations let the models explore a wider range of responses, producing a more comprehensive representation of refusal. The researchers also highlight a potential downside: because LAT yields a cleaner, more robust encoding of refusal behavior, the refusal vectors extracted from it could be exploited by malicious actors to mount more effective refusal attacks.
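To make the mechanics concrete, here is a minimal sketch of the standard refusal-vector recipe referenced in such studies: estimate a refusal direction as the difference of mean hidden activations between harmful and harmless prompts, then "ablate" it by projecting that direction out of the activations. All names and the toy data below are illustrative assumptions, not the study's actual code.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Estimate a refusal vector via difference of means.

    harmful_acts, harmless_acts: (n_prompts, d_model) arrays of
    hidden-layer activations (toy stand-ins here for real LLM states).
    """
    direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)  # unit-normalize

def ablate(activations, direction):
    """Refusal ablation: remove the component along the refusal vector,
    x' = x - (x . r) r, leaving only the subspace orthogonal to r."""
    coeffs = activations @ direction              # projection coefficients
    return activations - np.outer(coeffs, direction)

# Toy demo: harmful activations carry an artificial offset.
rng = np.random.default_rng(0)
harmful = rng.normal(size=(16, 8)) + 2.0
harmless = rng.normal(size=(16, 8))

r = refusal_direction(harmful, harmless)
ablated = ablate(harmful, r)

# After ablation the activations have (numerically) no component along r.
print(np.abs(ablated @ r).max() < 1e-9)
```

A more distributed encoding, as the study attributes to LAT, means no single direction like `r` fully captures refusal, which is consistent with LAT models resisting this attack better even though their extracted vectors transfer well to other models.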



References:
  • github.com: Latent Adversarial Training (LAT) Improves the Representation of Refusal
  • LessWrong: Latent Adversarial Training (LAT) Improves the Representation of Refusal
Classification:
  • HashTags: #LLMSafety #AdversarialTraining #AIResearch
  • Target: LLM Security
  • Product: LLMs
  • Feature: Safety fine-tuning
  • Type: Research
  • Severity: Medium