The discipline of Authorship Analysis studies the linguistic style of written documents to determine information about their authorship. Unlike traditional methodologies, it leverages statistical methods and focuses on quantifiable linguistic events rather than the literary content of the text. In recent years, this field has experienced significant growth due to advances in information technology, which enable the use of Machine Learning and Natural Language Processing tools, and it has been applied in various domains, from cybersecurity to forensics. This Ph.D. thesis investigates the application of Computational Authorship Analysis methodologies in the cultural heritage domain. Building on the experience gathered through research on a case study (the debated Dantean authorship of the historic document Epistle to Cangrande), we address what we believe are the four main issues in applying these methodologies to this domain: i) the identification of features that allow for accurate classification while being topic-agnostic; ii) the limited size of the datasets usually available in these studies; iii) the challenges that arise when the document under scrutiny may be a forgery; and iv) the necessity of providing scholars in cultural heritage with proper explanations of the computational system's findings. Each of these issues is covered by a dedicated chapter in this dissertation, in which we examine the background of the problem in depth, describe our proposed solutions, and present the related results of our research.
In particular, we: i) introduce the use of rhythmic features; ii) evaluate an alternative vectorial representation of the documents, based on the concept of document pairs; iii) propose the augmentation of the classifier training data with automatically generated samples that mimic the work of a forger; and iv) assess the suitability of some modern explainability methods for the cultural heritage public. With this work, we aim to offer a comprehensive overview of the Authorship Analysis field and to provide guidance on the best practices for its application in cultural heritage.

Computational Authorship Analysis: Applications and Issues in the Cultural Heritage Field / Corbara, Silvia; external supervisor: Monreale, Anna; Scuola Normale Superiore, cycle 35, 25-Nov-2024.

Computational Authorship Analysis: Applications and Issues in the Cultural Heritage Field

CORBARA, Silvia
2024

25-Nov-2024
Sector INF/01 - Computer Science
Mathematics and Computer Science
Cycle 35
machine learning; artificial intelligence; authorship analysis; authorship identification; text classification; natural language processing; author profiling
Monreale, Anna
Sebastiani, Fabrizio
Moreo, Alejandro
Scuola Normale Superiore
Files in this item:
Tesi.pdf (open access)
Description: PhD thesis
Type: Published version
License: Creative Commons
Size: 4.33 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11384/157723