Light Alignment Improves LLM Safety via Model Self-Reflection with a Single Neuron

Published in arXiv, 2026

Summary:

This work studies lightweight safety alignment for large language models. Instead of relying on expensive post-training pipelines, it introduces a safety-aware decoding strategy in which a single-neuron gate blends the model's own generation with external expert guidance, improving safety while preserving utility at low training cost.
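The paper's exact gating rule is not reproduced here; the following is a minimal PyTorch sketch of the mechanism the summary describes: a single learned neuron that decides, per decoding step, how much weight to give the base model's next-token logits versus an external safety expert's. The class name `SingleNeuronGate`, the use of the base model's hidden state as the gate input, and the logit-space blending are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn


class SingleNeuronGate(nn.Module):
    """Hypothetical sketch of a single-neuron safety gate.

    One linear unit maps the base model's last hidden state to a scalar
    g in (0, 1), which interpolates between the base model's logits and
    an external safety expert's logits at each decoding step.
    """

    def __init__(self, hidden_size: int):
        super().__init__()
        # The "single neuron": one linear unit producing a scalar gate value.
        self.gate = nn.Linear(hidden_size, 1)

    def forward(
        self,
        hidden_state: torch.Tensor,   # (batch, hidden_size)
        base_logits: torch.Tensor,    # (batch, vocab_size)
        expert_logits: torch.Tensor,  # (batch, vocab_size)
    ) -> torch.Tensor:
        # g close to 1 => defer to the safety expert; close to 0 => trust the base model.
        g = torch.sigmoid(self.gate(hidden_state))  # (batch, 1), broadcasts over vocab
        return (1.0 - g) * base_logits + g * expert_logits


if __name__ == "__main__":
    batch, hidden, vocab = 2, 16, 50
    gate = SingleNeuronGate(hidden)
    mixed = gate(
        torch.randn(batch, hidden),
        torch.randn(batch, vocab),
        torch.randn(batch, vocab),
    )
    print(mixed.shape)  # torch.Size([2, 50])
```

Because the gate is a single linear unit, the trainable overhead is only `hidden_size + 1` parameters, which is consistent with the "light alignment" framing.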

Code: here

Bibtex:

@misc{shen2026lightalignment,
  title={Light Alignment Improves LLM Safety via Model Self-Reflection with a Single Neuron},
  author={Sicheng Shen and Mingyang Lv and Han Shen and Jialin Wu and Binghao Wang and Zhou Yang and Guobin Shen and Dongcheng Zhao and Feifei Zhao and Yi Zeng},
  year={2026},
  eprint={2602.02027},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.02027}
}

Recommended citation: Shen, S., Lv, M., Shen, H., Wu, J., Wang, B., Yang, Z., Shen, G., Zhao, D., Zhao, F., & Zeng, Y. (2026). Light Alignment Improves LLM Safety via Model Self-Reflection with a Single Neuron. arXiv preprint arXiv:2602.02027.
Download Paper