Light Alignment Improves LLM Safety via Model Self-Reflection with a Single Neuron
Published in arXiv, 2026
Summary:
This work studies lightweight safety alignment for large language models. Instead of relying on expensive post-training pipelines, it introduces a safety-aware decoding strategy in which a single-neuron gate blends the model’s own generation with external expert guidance, improving safety while preserving utility at low training cost.
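The gating idea in the summary might look roughly like the sketch below. Note this is only an illustrative guess: the sigmoid single-neuron gate, the convex logit-mixing rule, and all variable names are my assumptions, not the paper’s actual formulation.

```python
import numpy as np

def gated_decode_step(base_logits, expert_logits, gate_weight, hidden):
    """One decoding step with a hypothetical single-neuron safety gate.

    A single neuron (weight vector `gate_weight`) reads the hidden state
    and outputs a scalar g in [0, 1]; g then blends the base model's
    logits with an external safety expert's logits. All of this is an
    assumed illustration, not the paper's exact mechanism.
    """
    # Sigmoid of one neuron's pre-activation -> gate value in (0, 1)
    g = 1.0 / (1.0 + np.exp(-float(np.dot(gate_weight, hidden))))
    # Convex combination of the two logit distributions
    mixed = (1.0 - g) * base_logits + g * expert_logits
    # Greedy next-token choice for simplicity
    return int(np.argmax(mixed)), g

# Toy usage with random vectors standing in for real model outputs
rng = np.random.default_rng(0)
vocab_size, hidden_dim = 8, 4
tok, g = gated_decode_step(
    rng.normal(size=vocab_size),   # base model logits
    rng.normal(size=vocab_size),   # safety-expert logits
    rng.normal(size=hidden_dim),   # the single gate neuron's weights
    rng.normal(size=hidden_dim),   # current hidden state
)
```

The appeal of such a design, as the summary suggests, is that only the gate (one neuron) needs training, so the cost is negligible compared with full post-training pipelines.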
Code: here
Bibtex:
@misc{shen2026lightalignment,
  title={Light Alignment Improves LLM Safety via Model Self-Reflection with a Single Neuron},
  author={Sicheng Shen and Mingyang Lv and Han Shen and Jialin Wu and Binghao Wang and Zhou Yang and Guobin Shen and Dongcheng Zhao and Feifei Zhao and Yi Zeng},
  year={2026},
  eprint={2602.02027},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.02027}
}
Recommended citation: Shen, S., Lv, M., Shen, H., Wu, J., Wang, B., Yang, Z., Shen, G., Zhao, D., Zhao, F., & Zeng, Y. (2026). Light Alignment Improves LLM Safety via Model Self-Reflection with a Single Neuron. arXiv:2602.02027.
Download Paper
