TR2026-020
Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities

V. Khattar, M. Choudhury, M. R. U. Rashid, J. Liu, T. Koike-Akino, M. Jin, Y. Wang, "Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities", AAAI Workshop on Trust and Control in Agentic AI, January 2026.
@inproceedings{Khattar2026jan,
  author = {Khattar, Vanshaj and Choudhury, Moumita and Rashid, Md Rafi Ur and Liu, Jing and Koike-Akino, Toshiaki and Jin, Ming and Wang, Ye},
  title = {{Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities}},
  booktitle = {AAAI Workshop on Trust and Control in Agentic AI},
  year = 2026,
  month = jan,
  url = {https://www.merl.com/publications/TR2026-020}
}
- , "Amplification Effects in Test-Time Reinforcement Learning: Safety and Reasoning Vulnerabilities", AAAI Workshop on Trust and Control in Agentic AI, January 2026.
-
Abstract:
Test-time training (TTT) has recently emerged as a promising way to improve the reasoning abilities of large language models (LLMs), in which the model learns directly from test data without access to labels. However, this reliance on test data also makes TTT methods vulnerable to harmful prompt injections. In this paper, we investigate safety vulnerabilities of TTT methods, focusing on test-time reinforcement learning (TTRL) (Zuo et al. 2025), a recent TTT method that improves LLM reasoning by rewarding self-consistency, using majority vote as the reward signal. We show that harmful prompt injection during TTRL amplifies the model’s existing behaviors: safety amplification when the base model is relatively safe, and harmfulness amplification when it is vulnerable to the injected data. In both cases, reasoning ability declines, a cost we refer to as the reasoning tax. We also show that TTRL can be exploited adversarially using specially designed “HarmInject” prompts that force the model to answer jailbreak and reasoning queries together, resulting in stronger harmfulness amplification. Overall, our results show that TTT methods that enhance LLM reasoning by promoting self-consistency can lead to amplification behaviors and reasoning degradation, underscoring the need for safer TTT methods.
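To make the self-consistency reward the abstract describes concrete, the sketch below shows a minimal Python version of a TTRL-style majority-vote reward: the most common answer among N samples for a test prompt serves as a pseudo-label, and each sample is rewarded for agreeing with it. The function name is illustrative, not from the TTRL codebase, and the policy-gradient update that would consume these rewards is omitted.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """TTRL-style pseudo-rewards for N answers sampled for one test prompt.

    The majority answer acts as a pseudo-label (no ground-truth label is
    used); each sampled answer gets reward 1.0 if it matches the majority,
    0.0 otherwise. A policy-gradient step on these rewards (not shown)
    would then reinforce the majority behavior.
    """
    pseudo_label, _ = Counter(answers).most_common(1)[0]
    rewards = [1.0 if a == pseudo_label else 0.0 for a in answers]
    return pseudo_label, rewards

# Example: 5 answers sampled from the model for a single test query.
answers = ["42", "42", "17", "42", "17"]
label, rewards = majority_vote_rewards(answers)
print(label)    # "42"
print(rewards)  # [1.0, 1.0, 0.0, 1.0, 0.0]
```

This construction also makes the amplification effect visible: if most sampled responses to an injected prompt are harmful completions, the pseudo-label itself is harmful, and the update pushes the model further in that direction; if most are refusals, refusal is reinforced instead.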


