Building a Trustworthy Retrieval-Augmented Generation (RAG) chatbot for Italy’s public sector presents challenges that go beyond selecting an appropriate Large Language Model. A major issue is the retrieval phase, where Italian text embedders often underperform compared to English and multilingual counterparts, hindering precise identification and contextualization of critical information. Regulatory constraints further complicate matters by disallowing closed source or cloud based models, forcing reliance on on-premise or fully open source solutions that may not fully address the linguistic complexities of Italian documents. In our study, we evaluate three embedding approaches using a publicly available Italian dataset: a monolingual Italian approach, a translation based method leveraging English only embedders with backward reference mapping, and a multilingual framework applied to both original and translated texts. Our methodology involves chunking documents into coherent segments, embedding them in a high dimensional semantic space, and measuring retrieval accuracy via top-k similarity searches. Our results indicate that the translation based approach significantly improves retrieval performance over Italian specific models, suggesting that bilingual mapping can effectively address both domain specific challenges and regulatory constraints in developing RAG pipelines for public administration.

Towards Building a Trustworthy RAG-Based Chatbot for the Italian Public Administration

Mala, Chandana Sree
;
Gezici, Gizem
;
Giannotti, Fosca;Lenci, Alessandro;Gori, Marco
2025

Abstract

Building a Trustworthy Retrieval-Augmented Generation (RAG) chatbot for Italy’s public sector presents challenges that go beyond selecting an appropriate Large Language Model. A major issue is the retrieval phase, where Italian text embedders often underperform compared to English and multilingual counterparts, hindering precise identification and contextualization of critical information. Regulatory constraints further complicate matters by disallowing closed source or cloud based models, forcing reliance on on-premise or fully open source solutions that may not fully address the linguistic complexities of Italian documents. In our study, we evaluate three embedding approaches using a publicly available Italian dataset: a monolingual Italian approach, a translation based method leveraging English only embedders with backward reference mapping, and a multilingual framework applied to both original and translated texts. Our methodology involves chunking documents into coherent segments, embedding them in a high dimensional semantic space, and measuring retrieval accuracy via top-k similarity searches. Our results indicate that the translation based approach significantly improves retrieval performance over Italian specific models, suggesting that bilingual mapping can effectively address both domain specific challenges and regulatory constraints in developing RAG pipelines for public administration.
2025
Settore INFO-01/A - Informatica
Proceedings of the 4th International Conference on Hybrid Human-Artificial Intelligence
File in questo prodotto:
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11384/163504
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
  • OpenAlex ND
social impact