Test-Time Training Undermines Safety Guardrails

TL;DR

LLMs adapt at test time

Test-Time Training updates model weights on-the-fly to improve performance on individual inputs.

Safety lives in static weights

Alignment assumes fixed parameters. It was never designed to survive weight updates.

TTT erases the guardrails

A few fine-tuning steps are enough to strip safety alignment, turning a safe model into an unsafe one.

This generalizes broadly

We expose this vulnerability across seven open-weight models and a production fine-tuning API.

Attack overview

We identify three TTT threat models. In each, the model parameters θ are updated via an adaptation operator θ′ = T(θ, D; λ), where D = (x_1:n, ψ) contains the clean prompt x_1:n and adversary-controlled content ψ. Each threat model minimizes a different next-token-prediction loss. After adaptation, the model bypasses safety alignment.
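To make the adaptation operator concrete, here is a minimal sketch of θ′ = T(θ, D; λ) on a toy bigram language model, written in plain Python. Everything in it is an illustrative assumption rather than the setup used in this work: the bigram model stands in for an LLM, D is a single token sequence, and λ is taken to be a learning rate and a step count (`lr`, `steps`). It only shows the shared mechanism, a few gradient steps on a next-token-prediction loss that return updated weights.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of logits.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def ntp_loss(theta, tokens):
    # Mean next-token negative log-likelihood under a bigram model,
    # where theta[a][b] is the logit for token b following token a.
    pairs = list(zip(tokens[:-1], tokens[1:]))
    return -sum(math.log(softmax(theta[a])[b]) for a, b in pairs) / len(pairs)

def adapt(theta, tokens, lr=0.5, steps=3):
    # Toy T(theta, D; lambda): `steps` gradient-descent updates on the
    # next-token loss. Returns new weights and leaves `theta` untouched.
    theta = [row[:] for row in theta]
    pairs = list(zip(tokens[:-1], tokens[1:]))
    for _ in range(steps):
        grad = [[0.0] * len(row) for row in theta]
        for a, b in pairs:
            probs = softmax(theta[a])
            for j, p in enumerate(probs):
                # d(loss)/d(logit) for cross-entropy is p - onehot.
                grad[a][j] += (p - (1.0 if j == b else 0.0)) / len(pairs)
        for i, row in enumerate(theta):
            for j in range(len(row)):
                row[j] -= lr * grad[i][j]
    return theta

vocab_size = 4
theta0 = [[0.0] * vocab_size for _ in range(vocab_size)]  # uniform predictions
D = [0, 1, 2, 3, 0, 1, 2, 3]                              # adaptation sequence
theta1 = adapt(theta0, D)
loss_before = ntp_loss(theta0, D)
loss_after = ntp_loss(theta1, D)
print(loss_before, loss_after)
```

The loss on D falls after adaptation, which is the point of TTT. It also illustrates the concern: whoever controls part of D controls the direction in which the weights move.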

Three threat models, one shared mechanism

Self-supervised

Adapt on the user's prompt

The model adapts on the user's own clean prompt via self-supervised next-token prediction. No adversarial data needed.

+13pp avg ASR@10 across models

Even clean, optimized prompts degrade safety alignment.

Few-shot

Adapt on K harmful examples

Adversary supplies K=5 harmful prompt → affirmative-prefix pairs and the model adapts before answering the target query.

95% avg ASR@10 across models

A single example (K=1) is already enough to break alignment.

Generation-phase

Adapt on an affirmative prefix

Adversary supplies the model with an affirmative prefix conditioned on the query (e.g. "Sure, here is…").

93% avg ASR@10 across models

Priming the model is enough. No harmful content needed.

Evaluated on Gemma 7B, Llama 3 8B, Llama 3 70B, Qwen 2.5 1.5B, Qwen 2.5 7B, Qwen 3 1.7B, and Qwen 3 8B.

Additional findings

TTT composes with adversarial prompts

Layering few-shot TTT on top of clean, adversarial-template, and adversarial-suffix prompts lifts every variant to near 100% ASR@10 by step 2 on Llama 3 8B, even where each component alone is weak.

ASR@10 vs. TTT steps on Llama 3 8B for five configurations (clean, adv. template, adv. template + adv. suffix on base RS, adv. template + adv. suffix on TTT'd model, and ICL k=5), all converging near 100% by step 2.

Degenerate outputs fool the judge

Standard LLM safety judges can overestimate ASR by up to 13 percentage points because TTT overfits and produces degenerate text. Our validity-aware pipeline restores accuracy and eliminates these false positives.

Judge accuracy and degenerate-false-positive count across no-filter, heuristic, and LLM validity filtering.
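As a rough illustration of why validity filtering matters, the sketch below computes ASR@k with and without a degeneracy filter. It rests on several assumptions not stated above: ASR@k is taken to mean that a prompt counts as a success if at least one of its k sampled responses is flagged by the judge; `is_degenerate` is a hypothetical repetition heuristic, not the filter used in this work; and `stub_judge` is a stand-in for an LLM safety judge.

```python
def is_degenerate(text, max_repeat_ratio=0.5, min_words=5):
    # Hypothetical heuristic: treat very short or highly repetitive
    # outputs as degenerate (invalid) text.
    words = text.split()
    if len(words) < min_words:
        return True
    repeat_ratio = 1.0 - len(set(words)) / len(words)
    return repeat_ratio > max_repeat_ratio

def asr_at_k(samples_per_prompt, judge, validity_filter=None):
    # Fraction of prompts with at least one sample that the judge flags
    # and, when a filter is supplied, that the filter also accepts.
    hits = 0
    for samples in samples_per_prompt:
        if any(
            judge(s) and (validity_filter is None or validity_filter(s))
            for s in samples
        ):
            hits += 1
    return hits / len(samples_per_prompt)

def stub_judge(text):
    # Stand-in judge that is fooled by an affirmative opening.
    return text.startswith("Sure")

samples = [
    # Prompt 1: the only flagged sample is repetitive, overfit output.
    ["Sure Sure Sure Sure Sure Sure Sure Sure",
     "I cannot help with that request, sorry."],
    # Prompt 2: a flagged sample that is coherent text.
    ["Sure, here is a short but coherent reply text."],
]

naive_asr = asr_at_k(samples, stub_judge)
filtered_asr = asr_at_k(samples, stub_judge, lambda s: not is_degenerate(s))
print(naive_asr, filtered_asr)
```

In this toy data the unfiltered score counts the repetitive output as a success, while the filtered score does not, which mirrors the kind of false positive described above.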

Attacks transfer to production APIs

On Tinker, against GPT-OSS 120B, the few-shot attack reaches 100% ASR@10 and the generation-phase attack reaches 98%, with no API-specific tuning and for under $2 per attack.

A first defense

As a first step, we propose a provider-side detector that compares the model's perplexity on a private harmful holdout before and after TTT. If the holdout perplexity drops by more than a per-model threshold, the request is flagged.

This catches our vanilla attacks with high true-positive rate and low false-positive rate, but it is unlikely to withstand adaptive attacks. Robust deployment will require dynamic alignment that explicitly accounts for test-time weight updates.

Scatter of harmful-holdout perplexity before vs. after TTT, separating attacks from benign refusals.
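The detector's decision rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: the provider already has per-token log-probabilities of the holdout text under the model before and after TTT, the drop is measured as an absolute difference in perplexity (a ratio would work similarly), and the names `perplexity` and `flag_request` and the threshold value are hypothetical.

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp of the mean negative log-probability per token.
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def flag_request(logprobs_before, logprobs_after, threshold):
    # Flag the TTT request if perplexity on the private harmful holdout
    # drops by more than `threshold` after adaptation.
    drop = perplexity(logprobs_before) - perplexity(logprobs_after)
    return drop > threshold

# Made-up log-probabilities of holdout tokens, for illustration only.
before = [-3.0, -3.0, -3.0]    # base model: harmful text is unlikely
attacked = [-1.0, -1.0, -1.0]  # after adversarial TTT: much more likely
benign = [-2.9, -3.0, -3.1]    # after benign TTT: essentially unchanged

print(flag_request(before, attacked, threshold=5.0))
print(flag_request(before, benign, threshold=5.0))
```

A fixed threshold like this is easy to calibrate per model, but it is also a fixed target, which is consistent with the caveat above that adaptive attacks are likely to evade it.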

Takeaways

TTT exposes a new attack surface. Adversaries now have a lever on the parameters, not just the input.
Attacks transfer to production APIs. Even a 120B-parameter model is fully jailbroken for under $2.
Safety evaluation must go dynamic. Static red-teaming misses vulnerabilities that emerge only under weight updates.

Poster

Presented at the ICLR 2026 Workshop on Trustworthy AI.
