Explainable machine learning identifies a polygenic risk score as a key predictor of pancreatic cancer risk in the UK Biobank

Peduzzi, G.; Felici, A.; Pellungrini, R.; Campa, D.

doi:10.1016/j.dld.2024.11.010

Background: Predicting the risk of developing pancreatic ductal adenocarcinoma (PDAC) is of paramount importance, given its high mortality rate. Current PDAC risk prediction models rely on a limited number of variables, do not include genetics, and achieve only modest accuracy. Aim: This study aimed to develop an interpretable PDAC risk prediction model based on machine learning (ML). Methods: Five ML models (Adaptive Boosting, eXtreme Gradient Boosting, CatBoost, Deep Forest and Random Forest) built on 56 exposome variables and a polygenic risk score (PRS) were tested in 654 PDAC cases and 1,308 controls from the UK Biobank. Additionally, SHapley Additive exPlanations (SHAP) and Global model Interpretation via Recursive Partitioning (Girp) were employed to explain the models. Results: All models performed similarly, but CatBoost achieved the best recall (77.10%). SHAP highlighted age and the PRS as the primary contributors across all models. Girp derived rules to distinguish cases from controls, and most of these rules involved age, the PRS, and pancreatitis. Conclusion: The tested predictive models performed well, indicating their potential for clinical application in the near future, with the PRS playing a key role in identifying high-risk individuals, as demonstrated by the explainers.
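The abstract ranks models by recall and attributes predictions to features with Shapley-based explanations. The following is a minimal Python sketch of both ideas, not the study's code: the toy risk function, its coefficients, the labels, and the example individual are all hypothetical, and the Shapley values are computed by exact enumeration over feature orderings (feasible only for a handful of features), which is not the approximation scheme used by the SHAP library.

```python
from itertools import permutations


def recall(y_true, y_pred):
    """Fraction of true cases (label 1) that are predicted as cases."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def shapley_values(model, x, baseline):
    """Exact Shapley values relative to a baseline input.

    For every ordering of the features, switch each feature from its
    baseline value to its observed value and record the change in the
    model output; the Shapley value is the average change per feature.
    """
    names = list(x)
    phi = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous_output = model(current)
        for name in order:
            current[name] = x[name]
            new_output = model(current)
            phi[name] += new_output - previous_output
            previous_output = new_output
    return {name: total / len(orderings) for name, total in phi.items()}


def toy_risk(features):
    """Hypothetical risk score with made-up coefficients (illustration only)."""
    return (
        0.02 * features["age"]
        + 0.5 * features["prs"]
        + 0.3 * features["pancreatitis"]
        + 0.1 * features["prs"] * features["pancreatitis"]
    )


# Hypothetical labels: 1 = case, 0 = control.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]
toy_recall = recall(y_true, y_pred)  # 3 of 4 cases detected -> 0.75

# Hypothetical individual and reference (baseline) profile.
person = {"age": 65, "prs": 1.2, "pancreatitis": 1}
reference = {"age": 55, "prs": 0.0, "pancreatitis": 0}
phi = shapley_values(toy_risk, person, reference)

for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
```

By construction, the per-feature contributions sum to the difference between the model output for the individual and for the reference profile, which is the additive property that lets SHAP rank features such as age and the PRS by their contribution.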



Year of publication: 2025
Scientific Disciplinary Sector (valid from 09/05/2024): MEDS-10/A - Gastroenterology
Journal title: DIGESTIVE AND LIVER DISEASE
DOI: https://dx.doi.org/10.1016/j.dld.2024.11.010
Keywords: Explainable artificial intelligence; Pancreatic cancer; Polygenic Risk Score; Risk prediction
Appears in types: 1.1 Journal article

Files in this record:

File	Size	Format
PIIS1590865824011009.pdf (Published version; closed access; all rights reserved)	1.39 MB	Adobe PDF