Towards Building a Trustworthy RAG-Based Chatbot for the Italian Public Administration

Building a trustworthy Retrieval-Augmented Generation (RAG) chatbot for Italy’s public sector presents challenges that go beyond selecting an appropriate Large Language Model. A major issue is the retrieval phase, where Italian text embedders often underperform their English and multilingual counterparts, hindering the precise identification and contextualization of critical information. Regulatory constraints further complicate matters by disallowing closed-source or cloud-based models, forcing reliance on on-premise or fully open-source solutions that may not fully address the linguistic complexities of Italian documents. In our study, we evaluate three embedding approaches on a publicly available Italian dataset: a monolingual Italian approach, a translation-based method that uses English-only embedders with backward reference mapping, and a multilingual framework applied to both original and translated texts. Our methodology involves chunking documents into coherent segments, embedding them in a high-dimensional semantic space, and measuring retrieval accuracy via top-k similarity searches. Our results indicate that the translation-based approach significantly improves retrieval performance over Italian-specific models, suggesting that bilingual mapping can effectively address both domain-specific challenges and regulatory constraints in developing RAG pipelines for public administration.
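The abstract does not give implementation details, so the following is only a minimal sketch of how a translation-based retrieval step with backward reference mapping might be wired together. Everything in it is an assumption made for illustration: the sentence-level `chunk` function, the bag-of-words `embed` function (a toy stand-in for a real English-only embedder), the word-level `GLOSSARY` (a toy stand-in for a machine translation model), and the sample document and queries. It shows the plumbing only and says nothing about the retrieval quality reported in the paper.

```python
import math
import re
from collections import Counter

# Hypothetical word-level glossary standing in for an Italian-to-English
# machine translation model (assumption, not the authors' translator).
GLOSSARY = {
    "carta": "card", "identità": "identity", "rinnova": "renewed",
    "comune": "municipality", "tassa": "tax", "rifiuti": "waste",
    "paga": "paid", "giugno": "june",
}

def chunk(text):
    """Split a document into sentence-level chunks (one simple notion of a coherent segment)."""
    return [s.strip() for s in text.split(".") if s.strip()]

def translate(text):
    """Toy translation: replace known words, leave all others unchanged."""
    return " ".join(GLOSSARY.get(w, w) for w in re.findall(r"\w+", text.lower()))

def embed(text):
    """Toy embedder: a sparse bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def build_index(italian_chunks):
    """Embed the English translation of each chunk, keeping the index of the
    original Italian chunk as the backward reference."""
    return [(embed(translate(c)), i) for i, c in enumerate(italian_chunks)]

def retrieve(query_it, index, italian_chunks, k=1):
    """Translate the query, rank chunks by similarity in the English space,
    then map the top-k hits back to the original Italian text."""
    q = embed(translate(query_it))
    ranked = sorted(index, key=lambda entry: cosine(q, entry[0]), reverse=True)
    return [italian_chunks[i] for _, i in ranked[:k]]

def hit_rate_at_k(queries, index, italian_chunks, k):
    """Fraction of queries whose gold chunk appears among the top-k results."""
    hits = sum(gold in retrieve(q, index, italian_chunks, k) for q, gold in queries)
    return hits / len(queries)

# Illustrative data only; not taken from the dataset used in the paper.
DOCUMENT = ("La carta di identità si rinnova presso il comune. "
            "La tassa sui rifiuti si paga entro giugno.")
CHUNKS = chunk(DOCUMENT)
INDEX = build_index(CHUNKS)
QUERIES = [
    ("dove si rinnova la carta di identità", CHUNKS[0]),
    ("quando si paga la tassa sui rifiuti", CHUNKS[1]),
]
```

In this sketch, similarity is computed entirely on the translated text, while the stored chunk index ensures that the chatbot would still be handed the original Italian passage. Swapping `embed` and `translate` for real models would leave the rest of the structure unchanged.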

Mala, Chandana Sree; di Maio, Christian; Proietti, Mattia; Gezici, Gizem; Giannotti, Fosca; Melacci, Stefano; Lenci, Alessandro; Gori, Marco

2025

Year of publication: 2025
Scientific Disciplinary Sector (valid from 9 May 2024): INFO-01/A - Informatica
Volume title: Proceedings of the 4th International Conference on Hybrid Human-Artificial Intelligence
ISBN: 978-1-64368-611-0
DOI: https://dx.doi.org/10.3233/FAIA250637
Datasets related to the publication (DOI): https://dx.doi.org/10.3233/FAIA250637

Use this identifier to cite or link to this document: https://hdl.handle.net/11384/163504
