“I know that I don’t know... and I explain why” Robust abstention via counterfactual explanations
Valerio Bonsignori; Clara Punzi; Roberto Pellungrini; Fosca Giannotti
2026
Abstract
Ensuring reliability in human-AI collaboration is crucial for fostering appropriate trust in hybrid decision-making systems, which depends not only on predictive performance but also on transparency and awareness of model limitations. Selective classification addresses this need by allowing models to reject uncertain instances and provide predictions only on confident cases. However, existing approaches typically provide little insight into the rationale behind the abstention decisions. In this work, we introduce a novel selective classification method that leverages the distance between an instance and its counterfactuals as a proxy for prediction uncertainty. This formulation naturally enables human-interpretable explanations of the rejection policy, clarifying whether a black-box predictor is sufficiently stable to issue a decision or should refrain from doing so. The resulting abstention policy is locally interpretable, post-hoc and model-agnostic with respect to the black-box predictor, and can be flexibly combined with different counterfactual generation methods and distance functions. Extensive experiments on diverse tabular datasets demonstrate that our selective classifier matches or exceeds the performance of state-of-the-art baselines while inherently providing local contrastive explanations for abstention decisions as a byproduct of its local counterfactual analysis.
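The core idea of the abstract, using the distance to a counterfactual as an uncertainty proxy, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy linear classifier, for which the nearest Euclidean counterfactual has a closed form (projection just across the decision boundary), whereas the paper targets arbitrary black-box predictors and pluggable counterfactual generators and distance functions. The names `nearest_counterfactual`, `selective_predict`, the threshold `tau`, and the example weights are all hypothetical.

```python
import math

def predict(x, w, b):
    """Toy 'black box': a linear classifier returning class 1 or 0."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else 0

def nearest_counterfactual(x, w, b, eps=1e-6):
    """Closest point to x (Euclidean) that lies just across the linear boundary.

    Assumption: this closed form only holds for a linear model; a real
    black box would need a dedicated counterfactual generator.
    """
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    norm_sq = sum(wi * wi for wi in w)
    # Project onto the boundary, then overshoot by a factor eps to flip the label.
    step = (score / norm_sq) * (1 + eps)
    return [xi - step * wi for xi, wi in zip(x, w)]

def selective_predict(x, w, b, tau):
    """Return (label or None, distance, counterfactual).

    The model abstains (label None) when the counterfactual is closer
    than tau, i.e. when a small change to x would flip the prediction.
    """
    cf = nearest_counterfactual(x, w, b)
    dist = math.dist(x, cf)
    if dist < tau:
        return None, dist, cf  # abstain: prediction is locally unstable
    return predict(x, w, b), dist, cf

# Hypothetical model: class 1 when feature 0 is at least 1.0.
w, b, tau = [1.0, 0.0], -1.0, 0.5

confident = selective_predict([3.0, 5.0], w, b, tau)
rejected = selective_predict([1.1, 5.0], w, b, tau)
print("far from boundary:", confident)
print("near boundary:", rejected)
```

In this sketch the returned counterfactual doubles as the contrastive explanation of a rejection: comparing it with the input shows which feature change would have flipped the prediction, and the distance shows how small that change is.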



