Teaching AI safety guardrails to adapt on the job

Minbeom Kim, Lesly Miculicich, Bhavana Dalvi Mishra, Mihir Parmar, Phillip Wallis, Bharath Chandrasekhar, Kyomin Jung, Tomas Pfister, Long T. Le

As AI systems gain access to private data and execute real workflows, safety guardrails must handle context-dependent failures that cannot be anticipated at training time, such as violations of privacy norms or organizational policies. LiSA addresses this by inducing reusable safety policies from occasional deployment failures, without full retraining. The approach converts sparse user reports into generalizable rules, prevents overgeneralization through conflict detection, and uses evidence-based confidence gates so that memory reuse scales with the amount of accumulated evidence rather than with observed accuracy alone. On privacy, configuration, and multi-step agent tasks, LiSA outperforms memory-based baselines under sparse feedback and remains robust when 20% of feedback labels are flipped.
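To make the idea of an evidence-based confidence gate concrete, here is a minimal sketch. It is not LiSA's implementation. It assumes, hypothetically, that each stored policy keeps a tally of correct and incorrect reuses, and it gates reuse on the lower bound of a Wilson score interval. That bound rises only as evidence accumulates, so a rule that is 3-for-3 is still held back, while a rule that is 45-for-50 passes even though its raw accuracy is lower. The names `PolicyStats`, `wilson_lower_bound`, `should_reuse`, and the 0.7 threshold are illustrative assumptions.

```python
import math
from dataclasses import dataclass


@dataclass
class PolicyStats:
    """Hypothetical per-policy tally of how often reusing the policy was judged correct."""
    successes: int = 0  # reuses judged correct
    failures: int = 0   # reuses judged wrong (possibly noisy, e.g. flipped labels)

    def record(self, correct: bool) -> None:
        # Update the tally with one new piece of deployment feedback.
        if correct:
            self.successes += 1
        else:
            self.failures += 1


def wilson_lower_bound(successes: int, total: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    if total == 0:
        return 0.0  # no evidence yet: never trust the rule
    p = successes / total
    denom = 1 + z * z / total
    centre = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (centre - margin) / denom


def should_reuse(stats: PolicyStats, threshold: float = 0.7) -> bool:
    """Hypothetical gate: reuse a stored policy only if the evidence-adjusted bound clears the threshold."""
    total = stats.successes + stats.failures
    return wilson_lower_bound(stats.successes, total) >= threshold


if __name__ == "__main__":
    few = PolicyStats(successes=3, failures=0)    # perfect accuracy, little evidence
    many = PolicyStats(successes=45, failures=5)  # 90% accuracy, much more evidence
    print(should_reuse(few), should_reuse(many))  # False True
```

With three perfect outcomes the lower bound is only about 0.44, whereas 45 of 50 gives roughly 0.79, which is the "scales with accumulated data rather than accuracy alone" behavior in miniature. A bound of this kind also tolerates a minority of flipped feedback labels once enough reports have accumulated.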