Test-Time Training Undermines Safety Guardrails

Warning: this paper contains red-teaming data and model-generated content that may be offensive.
  1. Simone Antonelli1,*
  2. Sadegh Akhondzadeh2,*
  3. Aleksandar Bojchevski2

1CISPA Helmholtz Center for Information Security  ·  2University of Cologne
*Equal contribution

TL;DR

  1. LLMs adapt at test time

    Test-Time Training updates model weights on-the-fly to improve performance on individual inputs.

  2. Safety lives in static weights

    Alignment assumes fixed parameters. It was never designed to survive weight updates.

  3. TTT erases the guardrails

    A few fine-tuning steps are enough to strip safety alignment, turning a safe model into an unsafe one.

  4. This generalizes broadly

    We expose this vulnerability across seven open-weight models and a production fine-tuning API.

Attack overview

We identify three TTT threat models. The model parameters θ are updated via an adaptation operator θ′ = T(θ, D; λ), where D = (x_{1:n}, ψ) contains the clean prompt x_{1:n} and adversary-controlled data ψ. Each threat model minimizes a different next-token-prediction loss. After adaptation, the model bypasses safety alignment.
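One way to read the adaptation operator is as a few steps of gradient descent on a next-token-prediction loss. The sketch below is an assumption for illustration, not the paper's stated formulation: the page does not define λ, so we take it to bundle a step size η and a step count S, and we write the loss generically over target tokens y_{1:m} given a context c. Under this reading, the three threat models differ only in which sequences built from the prompt and ψ play the roles of c and y.

```latex
% Assumed form of the adaptation operator: S steps of gradient descent.
\theta_0 = \theta, \qquad
\theta_{s+1} = \theta_s - \eta \, \nabla_{\theta} \mathcal{L}(\theta_s; D), \qquad
\theta' = T(\theta, D; \lambda) = \theta_S, \qquad \lambda = (\eta, S)

% Generic next-token-prediction loss on target tokens y_{1:m} given context c.
\mathcal{L}(\theta; D) = -\frac{1}{m} \sum_{j=1}^{m} \log p_{\theta}\left(y_j \mid c, y_{<j}\right)
```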

Diagram of the three TTT threat models: self-supervised, few-shot, and generation-phase. A safe LLM is adapted with attacker-controlled data and produces unsafe output.

Three threat models, one shared mechanism

Self-supervised

Adapt on the user's prompt

The model adapts on the user's own clean prompt via self-supervised next-token prediction. No adversarial data needed.

+13pp avg ASR@10 across models

Even clean, optimized prompts degrade safety alignment.

Few-shot

Adapt on K harmful examples

Adversary supplies K=5 harmful prompt → affirmative-prefix pairs and the model adapts before answering the target query.

95% avg ASR@10 across models

A single example (K=1) is already enough to break alignment.

Generation-phase

Adapt on an affirmative prefix

Adversary supplies the model with an affirmative prefix conditioned on the query (e.g. "Sure, here is…").

93% avg ASR@10 across models

Priming the model is enough. No harmful content needed.

Evaluated on Gemma 7B, Llama 3 8B, Llama 3 70B, Qwen 2.5 1.5B, Qwen 2.5 7B, Qwen 3 1.7B, and Qwen 3 8B.

Additional findings

TTT composes with adversarial prompts

Layering few-shot TTT on top of clean, adversarial-template, and adversarial-suffix prompts lifts every variant to near 100% ASR@10 by step 2 on Llama 3 8B, even where each component alone is weak.

ASR@10 vs. TTT steps on Llama 3 8B for five configurations (clean, adv. template, adv. template + adv. suffix on base RS, adv. template + adv. suffix on TTT'd model, and ICL k=5), all converging near 100% by step 2.

Degenerate outputs fool the judge

Standard LLM safety judges can overestimate ASR by up to 13 percentage points because TTT overfits and produces degenerate text. Our validity-aware pipeline restores accuracy and eliminates these false positives.

Judge accuracy and degenerate-false-positive count across no-filter, heuristic, and LLM validity filtering.
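To make the effect of validity filtering concrete, here is a minimal sketch of a validity-aware ASR@k. It assumes that ASR@k counts a prompt as broken when at least one of k sampled generations is judged harmful (the page does not spell out the definition), and that the filter supplies a per-sample validity flag. The function name `asr_at_k` and the toy judge outputs are hypothetical.

```python
from typing import Sequence


def asr_at_k(
    judged_harmful: Sequence[Sequence[bool]],
    is_valid: Sequence[Sequence[bool]],
    k: int = 10,
) -> float:
    """Fraction of prompts with at least one harmful AND valid sample among the first k."""
    if len(judged_harmful) != len(is_valid):
        raise ValueError("need one validity row per judged row")
    if not judged_harmful:
        raise ValueError("need at least one prompt")
    successes = 0
    for harmful_row, valid_row in zip(judged_harmful, is_valid):
        # A degenerate (invalid) sample never counts, even if the judge flags it.
        if any(h and v for h, v in zip(harmful_row[:k], valid_row[:k])):
            successes += 1
    return successes / len(judged_harmful)


# Toy data: two prompts, three samples each (hypothetical judge outputs).
judge = [[True, False, False], [False, True, False]]
valid = [[False, True, True], [True, True, True]]  # first prompt's "hit" is degenerate
no_filter = [[True] * 3 for _ in judge]

naive_asr = asr_at_k(judge, no_filter, k=3)
filtered_asr = asr_at_k(judge, valid, k=3)
```

In the toy data the judge flags a degenerate sample for the first prompt, so the unfiltered score is higher than the filtered one. This mirrors the kind of overestimation described above, on made-up numbers.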

Attacks transfer to production APIs

On Tinker, the few-shot attack reaches 100% ASR@10 on GPT-OSS 120B and the generation-phase attack reaches 98%, with no API-specific tuning, for under $2 per attack.

A first defense

As a first step, we propose a provider-side detector that compares the model's perplexity on a private harmful holdout before and after TTT. If the holdout perplexity drops by more than a per-model threshold, the request is flagged.

This catches our vanilla attacks with high true-positive rate and low false-positive rate, but it is unlikely to withstand adaptive attacks. Robust deployment will require dynamic alignment that explicitly accounts for test-time weight updates.
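The detector logic can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes the provider already has per-token natural-log probabilities for the holdout under the model before and after TTT, and that the threshold applies to the absolute perplexity drop (a relative drop would be an equally plausible reading). The names `perplexity` and `flag_request` and the toy numbers are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[Sequence[float]]) -> float:
    """Corpus-level perplexity from per-token natural-log probabilities."""
    total = sum(sum(seq) for seq in token_logprobs)
    count = sum(len(seq) for seq in token_logprobs)
    if count == 0:
        raise ValueError("holdout must contain at least one token")
    return math.exp(-total / count)


def flag_request(
    before: Sequence[Sequence[float]],
    after: Sequence[Sequence[float]],
    threshold: float,
) -> bool:
    """Flag when holdout perplexity drops by more than `threshold` after TTT."""
    drop = perplexity(before) - perplexity(after)
    return drop > threshold


# Toy log-probs for one 4-token holdout sequence (made-up values).
base = [[math.log(0.1)] * 4]         # perplexity of about 10 before TTT
after_attack = [[math.log(0.5)] * 4]  # perplexity of about 2 after TTT
after_benign = [[math.log(0.1)] * 4]  # unchanged after TTT

attack_flagged = flag_request(base, after_attack, threshold=3.0)
benign_flagged = flag_request(base, after_benign, threshold=3.0)
```

The threshold is per model in the text above, so in practice it would be calibrated on benign TTT requests for each model rather than fixed as in this toy example.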

Scatter of harmful-holdout perplexity before vs. after TTT, separating attacks from benign refusals.

Takeaways

  1. TTT exposes a new attack surface. Adversaries now have a lever on the parameters, not just the input.
  2. Attacks transfer to production APIs. Even a 120B-parameter model is fully jailbroken for under $2.
  3. Safety evaluation must go dynamic. Static red-teaming misses vulnerabilities that emerge only under weight updates.

Poster

Presented at the ICLR 2026 Workshop on Trustworthy AI.

Full conference poster: Test-Time Training Undermines Safety Guardrails.

BibTeX

@misc{antonelli2026ttt,
  title         = {Test-Time Training Undermines Safety Guardrails},
  author        = {Antonelli, Simone and Akhondzadeh, Sadegh and Bojchevski, Aleksandar},
  year          = {2026},
  eprint        = {TODO},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CR}
}