arXiv preprint arXiv:2505.23556

Understanding refusal in language models with sparse autoencoders · arXiv 2505.23556

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Do Activation Monitors Survive Model Updates? Benchmarking, Predicting, and Repairing Activation-Monitor Staleness

cs.LG · 2026-06-14 · unverdicted · novelty 8.0

Fine-tuning updates frequently render activation monitors for language model safety stale, while quantization does not; the degradation is predictable and repairable via label-free realignment.

OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets

cs.CL · 2026-07-02 · unverdicted · novelty 7.0

The OpenSafeIntent benchmark shows that models fail to calibrate safety across intent shifts in matched dual-use prompts, indicating that current evaluations are insufficient.

Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement

cs.CL · 2026-05-08 · unverdicted · novelty 5.0

LANCE applies variational inference for label enhancement across multiple rejection categories, supplying gradients to a refinement model that produces safe, non-rigid responses from LLMs.

citing papers explorer

Do Activation Monitors Survive Model Updates? Benchmarking, Predicting, and Repairing Activation-Monitor Staleness · cs.LG · 2026-06-14 · unverdicted · none · ref 45
OpenSafeIntent: Evaluating Intent-Calibrated Safe Completion Across Dual-Use Prompt Sets · cs.CL · 2026-07-02 · unverdicted · none · ref 3
Beyond "I cannot fulfill this request": Alleviating Rigid Rejection in LLMs via Label Enhancement · cs.CL · 2026-05-08 · unverdicted · none · ref 11
