arXiv.org / 2025

Noise Injection Systemically Degrades Large Language Model Safety Guardrails

Prithviraj Singh Shahani, Matthias Scheutz

AI Safety, Foundation Models, Large Language Models, Popular and Landmark Papers

Safety guardrails in large language models (LLMs) are a critical component in preventing harmful outputs. Yet their resilience under perturbation remains poorly understood. In this paper, we investigate the robustness of safety fine-tuning in LLMs by systematically injecting Gaussian noise into model activations. We show across multiple open-weight models that (1) Gaussian noise raises harmful-output rates by up to 27% (p<0.001), (2) deeper safety fine-tuning affords no extra protection, and (3) chain-of-thought reasoning remains largely intact. The findings reveal critical vulnerabilities in current safety alignment techniques and highlight reasoning-based and reinforcement learning approaches as promising directions for developing more robust AI safety systems. These results have important implications for real-world deployment of LLMs in safety-critical applications, since they imply that widely deployed safety tuning methods can fail even without adversarial prompts.
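The perturbation named in the abstract, adding Gaussian noise to activations, can be illustrated on a toy network. The sketch below is not the authors' setup: the two-layer network, the helper names (`make_params`, `forward`, `mean_output_shift`), the noise scales, and the choice to perturb only the hidden layer are all assumptions made for illustration, since the abstract does not specify which layers or noise levels the paper uses. It only shows how zero-mean noise of standard deviation `sigma` added to intermediate activations shifts a model's outputs.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def make_params(n_in, n_hidden, n_out, rng):
    """Random weights for a toy two-layer network (hypothetical stand-in for an LLM)."""
    def matrix(rows, cols):
        return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]
    return {
        "w1": matrix(n_hidden, n_in), "b1": [0.0] * n_hidden,
        "w2": matrix(n_out, n_hidden), "b2": [0.0] * n_out,
    }


def forward(x, params, sigma=0.0, rng=None):
    """Forward pass; if sigma > 0, add i.i.d. N(0, sigma^2) noise to the hidden activations."""
    hidden = [math.tanh(h) for h in linear(x, params["w1"], params["b1"])]
    if sigma > 0.0:
        rng = rng or random.Random()
        hidden = [h + rng.gauss(0.0, sigma) for h in hidden]
    return linear(hidden, params["w2"], params["b2"])


def mean_output_shift(x, params, sigma, trials=200, seed=0):
    """Average Euclidean distance between noisy and clean outputs over several noise draws."""
    rng = random.Random(seed)
    clean = forward(x, params)
    total = 0.0
    for _ in range(trials):
        noisy = forward(x, params, sigma=sigma, rng=rng)
        total += math.sqrt(sum((n - c) ** 2 for n, c in zip(noisy, clean)))
    return total / trials


params = make_params(n_in=4, n_hidden=8, n_out=3, rng=random.Random(42))
x = [0.5, -1.0, 0.25, 2.0]
for sigma in (0.0, 0.1, 0.5):
    shift = mean_output_shift(x, params, sigma)
    print(f"sigma={sigma}: mean output shift = {shift:.4f}")
```

With `sigma=0.0` the forward pass is unperturbed, so the measured shift is zero; larger `sigma` moves the outputs further from the clean ones. In a real LLM the analogous measurement would be the change in harmful-output rate rather than a raw output distance, which this toy does not attempt to model.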

2 citations, 0 influential

Paper URL: https://arxiv.org/abs/2505.13500v2
Open PDF: https://arxiv.org/pdf/2505.13500v2

Learning resources:
- PDF: arXiv PDF (https://arxiv.org/pdf/2505.13500v2)
- arXiv: arXiv abstract page (https://arxiv.org/abs/2505.13500v2)
- Google Scholar: Google Scholar references (https://scholar.google.com/scholar?q=Noise%20Injection%20Systemically%20Degrades%20Large%20Language%20Model%20Safety%20Guardrails)
- Papers with Code: Papers with Code search (https://paperswithcode.com/search?q=Noise%20Injection%20Systemically%20Degrades%20Large%20Language%20Model%20Safety%20Guardrails)
- Semantic Scholar: Semantic Scholar paper page (https://www.semanticscholar.org/paper/6b4a03ad48aa6f0084743a1378358526f6ce573d)
- YouTube: YouTube explanations (https://www.youtube.com/results?search_query=Noise%20Injection%20Systemically%20Degrades%20Large%20Language%20Model%20Safety%20Guardrails+paper+explained)