“I know that I don’t know... and I explain why”: Robust abstention via counterfactual explanations

Valerio Bonsignori; Clara Punzi; Roberto Pellungrini; Fosca Giannotti
2026

Abstract

Ensuring reliability in human-AI collaboration is crucial for fostering appropriate trust in hybrid decision-making systems. Such trust depends not only on predictive performance but also on transparency and awareness of model limitations. Selective classification addresses this need by allowing models to reject uncertain instances and provide predictions only on confident cases. However, existing approaches typically provide little insight into the rationale behind the abstention decisions. In this work, we introduce a novel selective classification method that leverages the distance between an instance and its counterfactuals as a proxy for prediction uncertainty. This formulation naturally enables human-interpretable explanations of the rejection policy, clarifying whether a black-box predictor is sufficiently stable to issue a decision or should refrain from doing so. The resulting abstention policy is locally interpretable, post-hoc and model-agnostic with respect to the black-box predictor, and can be flexibly combined with different counterfactual generation methods and distance functions. Extensive experiments on diverse tabular datasets demonstrate that our selective classifier matches or exceeds the performance of state-of-the-art baselines while inherently providing local contrastive explanations for abstention decisions as a byproduct of its local counterfactual analysis.
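The core idea of the abstract, using the distance to a counterfactual as a proxy for uncertainty, can be illustrated with a minimal sketch. This is not the authors' implementation: the toy `black_box` rule, the nearest-unlike-neighbour search used as a stand-in counterfactual generator, the Euclidean distance, and the threshold `tau` are all assumptions made for illustration only.

```python
import numpy as np

def black_box(X):
    """Hypothetical black-box classifier: a fixed linear rule on two features."""
    X = np.atleast_2d(X)
    return (X[:, 0] + X[:, 1] > 1.0).astype(int)

def nearest_counterfactual(x, pool, predict):
    """Return the pool point closest to x that the model labels differently.

    Stand-in for a real counterfactual generator (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    label = predict(x)[0]
    candidates = pool[predict(pool) != label]
    if len(candidates) == 0:
        return None, np.inf
    dists = np.linalg.norm(candidates - x, axis=1)  # Euclidean distance
    i = int(np.argmin(dists))
    return candidates[i], float(dists[i])

def selective_predict(x, pool, predict, tau):
    """Predict only if the nearest counterfactual is at least tau away.

    Returns (label or None, counterfactual, distance); None means abstain.
    """
    cf, dist = nearest_counterfactual(x, pool, predict)
    if dist < tau:
        return None, cf, dist  # small change flips the prediction: abstain
    return int(predict(x)[0]), cf, dist

# Reference pool of candidate points (synthetic, for illustration only).
rng = np.random.default_rng(0)
pool = rng.uniform(0.0, 1.0, size=(500, 2))

far = selective_predict([0.1, 0.1], pool, black_box, tau=0.2)    # far from the boundary
near = selective_predict([0.5, 0.49], pool, black_box, tau=0.2)  # close to the boundary
```

In this sketch the returned counterfactual doubles as a contrastive explanation of the abstention: it shows which nearby input would have received a different label. Any other counterfactual generator or distance function could be substituted for the ones assumed here.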
Sector INFO-01/A - Computer Science
Sector IINF-05/A - Information Processing Systems
Artificial intelligence; Collaborative intelligence; Machine ethics
It takes two to tango: a synergistic approach to human-machine decision making (TANGO), European Commission, Grant Agreement n. 101120763
Use this identifier to cite or link to this document: https://hdl.handle.net/11384/168187