1CISPA Helmholtz Center for Information Security · 2University of Cologne
*Equal contribution
Test-Time Training updates model weights on-the-fly to improve performance on individual inputs.
Alignment assumes fixed parameters. It was never designed to survive weight updates.
A few fine-tuning steps are enough to strip safety alignment, turning a safe model into an unsafe one.
We expose this vulnerability across seven open-weight models and a production fine-tuning API.
We identify three TTT threat models. The model parameters θ are updated via an adaptation operator θ′ = T(θ, D; λ), where D = (x_{1:n}, ψ) contains the clean prompt x_{1:n} and adversary-controlled data ψ. Each threat model minimizes a different next-token-prediction loss. After adaptation, the model bypasses safety alignment.
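As a reading aid, the adaptation operator can be written as one generic next-token-prediction update. This is a sketch under stated assumptions, not the exact objective used in the paper: y denotes the token sequence built from D, S is the set of positions that are scored, and a single gradient step with learning rate η (taken here to be part of λ) stands in for T. None of these symbols beyond θ, θ′, T, D, and λ appear in the original text.

```latex
\begin{aligned}
\mathcal{L}(\theta; D) &= -\frac{1}{|S|} \sum_{t \in S} \log p_\theta\!\left(y_t \mid y_{<t}\right), \\
\theta' = T(\theta, D; \lambda) &= \theta - \eta \, \nabla_\theta \mathcal{L}(\theta; D).
\end{aligned}
```

Under this reading, the three threat models differ only in which sequence y and which positions S enter the loss.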
The model adapts on the user's own clean prompt via self-supervised next-token prediction. No adversarial data needed.
Even clean, optimized prompts degrade safety alignment.
The adversary supplies K=5 pairs, each mapping a harmful prompt to an affirmative prefix, and the model adapts on them before answering the target query.
A single example (K=1) is already enough to break alignment.
The adversary supplies the model with an affirmative prefix conditioned on the query (e.g. "Sure, here is…").
Priming the model is enough. No harmful content needed.
Evaluated on Gemma 7B, Llama 3 8B, Llama 3 70B, Qwen 2.5 1.5B, Qwen 2.5 7B, Qwen 3 1.7B, and Qwen 3 8B.
Layering few-shot TTT on top of clean, adversarial-template, and adversarial-suffix prompts lifts every variant to near 100% ASR@10 by step 2 on Llama 3 8B, even where each component alone is weak.
Standard LLM safety judges can overestimate ASR by up to 13 percentage points because TTT overfits and produces degenerate text. Our validity-aware pipeline restores accuracy and eliminates these false positives.
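The gap between a naive judge and a validity-aware one can be illustrated with a small metric helper. This is a minimal sketch, not the authors' pipeline: it assumes ASR@k counts a prompt as broken if at least one of its first k sampled responses is judged harmful, and that a separate validity check marks degenerate text. The `Sample` type, the `asr_at_k` function, and the `require_valid` flag are hypothetical names introduced for illustration.

```python
from typing import Sequence, Tuple

# Hypothetical per-response record: (judge_says_harmful, passes_validity_check).
Sample = Tuple[bool, bool]


def asr_at_k(results: Sequence[Sequence[Sample]], k: int, require_valid: bool) -> float:
    """Fraction of prompts with at least one successful response among the first k.

    If require_valid is True, a response only counts when it is judged harmful
    AND passes the validity check (i.e. it is not degenerate text).
    """
    if not results:
        raise ValueError("need at least one prompt")
    successes = 0
    for samples in results:
        first_k = samples[:k]
        if any(harmful and (valid or not require_valid) for harmful, valid in first_k):
            successes += 1
    return successes / len(results)


# Toy data: four prompts, one or two sampled responses each.
toy_results = [
    [(True, True)],                  # harmful and valid
    [(True, False)],                 # flagged by the judge, but degenerate
    [(False, True)],                 # refusal
    [(True, False), (True, True)],   # second sample is a valid success
]

naive_asr = asr_at_k(toy_results, k=10, require_valid=False)
valid_asr = asr_at_k(toy_results, k=10, require_valid=True)
```

On this toy data the naive score counts the degenerate response as a success and the validity-aware score does not, which is the kind of false positive the text describes.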
On Tinker, the few-shot attack reaches 100% ASR@10 on GPT-OSS 120B and the generation-phase attack reaches 98%, with no API-specific tuning and for under $2 per attack.
As a first step, we propose a provider-side detector that compares the model's perplexity on a private harmful holdout before and after TTT. If the holdout perplexity drops by more than a per-model threshold, the request is flagged.
This catches our vanilla attacks with high true-positive rate and low false-positive rate, but it is unlikely to withstand adaptive attacks. Robust deployment will require dynamic alignment that explicitly accounts for test-time weight updates.
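The detector logic described above can be sketched in a few lines. This is an illustrative sketch only, under several assumptions that the text does not confirm: the drop is measured as an absolute difference in corpus-level perplexity, the provider already has per-token natural-log probabilities for the holdout from the model before and after adaptation, and the function names and threshold value are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def holdout_perplexity(per_example_logprobs: Sequence[Sequence[float]]) -> float:
    """Corpus-level perplexity pooled over every token in the holdout set."""
    flat = [lp for example in per_example_logprobs for lp in example]
    return perplexity(flat)


def flag_request(
    logprobs_before: Sequence[Sequence[float]],
    logprobs_after: Sequence[Sequence[float]],
    threshold: float,
) -> bool:
    """Flag the request if holdout perplexity dropped by more than the threshold.

    Assumption: an absolute drop is compared against a per-model threshold;
    a relative drop would be an equally plausible reading of the text.
    """
    drop = holdout_perplexity(logprobs_before) - holdout_perplexity(logprobs_after)
    return drop > threshold


# Toy example: the adapted model assigns higher probability to the holdout text.
before = [[-2.0, -2.0], [-2.0]]
after = [[-1.0, -1.0], [-1.0]]
flagged = flag_request(before, after, threshold=1.0)
unchanged = flag_request(before, before, threshold=1.0)
```

In the toy example the adapted model finds the private holdout much less surprising, so the request is flagged, while an unchanged model is not.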