“I know that I don’t know... and I explain why” Robust abstention via counterfactual explanations

Ensuring reliability in human-AI collaboration is crucial for fostering appropriate trust in hybrid decision-making systems, which depends not only on predictive performance but also on transparency and awareness of model limitations. Selective classification addresses this need by allowing models to reject uncertain instances and provide predictions only on confident cases. However, existing approaches typically provide little insight into the rationale behind the abstention decisions. In this work, we introduce a novel selective classification method that leverages the distance between an instance and its counterfactuals as a proxy for prediction uncertainty. This formulation naturally enables human-interpretable explanations of the rejection policy, clarifying whether a black-box predictor is sufficiently stable to issue a decision or should refrain from doing so. The resulting abstention policy is locally interpretable, post-hoc and model-agnostic with respect to the black-box predictor, and can be flexibly combined with different counterfactual generation methods and distance functions. Extensive experiments on diverse tabular datasets demonstrate that our selective classifier matches or exceeds the performance of state-of-the-art baselines while inherently providing local contrastive explanations for abstention decisions as a byproduct of its local counterfactual analysis.
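The core idea of the abstract, using the distance to a counterfactual as an uncertainty proxy, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' implementation: the toy predictor `black_box`, the naive counterfactual search over a small reference pool, the Euclidean distance, the threshold `tau`, and the rule "abstain when the nearest counterfactual is closer than `tau`".

```python
import math

def black_box(x):
    # Hypothetical stand-in predictor: class 1 if the feature sum exceeds 1.0.
    return int(x[0] + x[1] > 1.0)

def euclidean(a, b):
    # One possible distance function; the abstract leaves the choice open.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def nearest_counterfactual(x, pool, predict, dist):
    # Naive stand-in for a counterfactual generator (assumption):
    # the closest pool point that the predictor labels differently from x.
    label = predict(x)
    candidates = [p for p in pool if predict(p) != label]
    if not candidates:
        return None
    return min(candidates, key=lambda p: dist(x, p))

def selective_predict(x, pool, predict, dist, tau):
    # Returns (prediction or None, counterfactual, distance).
    # Assumed rule: a nearby counterfactual means the prediction is
    # unstable, so the classifier abstains (prediction is None).
    cf = nearest_counterfactual(x, pool, predict, dist)
    if cf is None:
        return predict(x), None, math.inf
    d = dist(x, cf)
    if d < tau:
        return None, cf, d
    return predict(x), cf, d

def explain(x, cf):
    # Contrastive explanation: per-feature change that would flip the label.
    return [round(c - v, 6) for v, c in zip(x, cf)]

# Hypothetical reference pool and query points.
pool = [(0.0, 0.0), (0.4, 0.5), (0.6, 0.5), (1.0, 1.0), (0.2, 0.1), (0.9, 0.8)]
stable = selective_predict((0.0, 0.1), pool, black_box, euclidean, tau=0.3)
borderline = selective_predict((0.45, 0.5), pool, black_box, euclidean, tau=0.3)
borderline_why = explain((0.45, 0.5), borderline[1])
```

In this toy setup the first query lies far from any differently labelled point and receives a prediction, while the second lies close to one and is rejected; the returned counterfactual and the per-feature differences from `explain` serve as the local contrastive explanation for the abstention. In practice `tau` would need to be calibrated, for example against a target coverage, which the sketch does not attempt.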

Valerio Bonsignori;Clara Punzi;Roberto Pellungrini;Fosca Giannotti

2026

Year of publication: 2026
Scientific Disciplinary Sector (valid from 9 May 2024): Sector INFO-01/A - Computer Science; Sector IINF-05/A - Information Processing Systems
Journal title: IEEE ACCESS
DOI: https://dx.doi.org/10.1109/ACCESS.2026.3705102
Keywords: Artificial intelligence; Collaborative intelligence; Machine ethics
Projects funding the research:
Project title: It takes two to tango: a synergistic approach to human-machine decision making
Acronym: TANGO
Funder name: European Commission
Contract no.: Grant Agreement n. 101120763

Use this identifier to cite or link to this document: https://hdl.handle.net/11384/168187
