HANSEN: Human and AI Spoken Text Benchmark for Authorship Analysis

Tripto, Nafis; Uchendu, Adaku; Le, Thai; Setzu, Mattia; Giannotti, Fosca; Lee, Dongwon

doi:10.18653/v1/2023.findings-emnlp.916

Authorship Analysis, also known as stylometry, has long been an essential part of Natural Language Processing (NLP). The recent advancement of Large Language Models (LLMs) has made authorship analysis increasingly crucial for distinguishing between human-written and AI-generated texts. However, authorship analysis tasks have focused primarily on written texts and have not considered spoken texts. We therefore introduce HANSEN (Human ANd ai Spoken tExt beNchmark), the largest benchmark for spoken texts. HANSEN combines meticulously curated existing speech datasets that come with transcripts and newly created AI-generated spoken text datasets. Together, it comprises 17 human datasets and AI-generated spoken texts created using 3 prominent LLMs: ChatGPT, PaLM2, and Vicuna13B. To evaluate and demonstrate the utility of HANSEN, we perform Authorship Attribution (AA) and Author Verification (AV) on human-spoken datasets and conduct human vs. AI spoken text detection using state-of-the-art (SOTA) models. While SOTA methods, such as character n-gram or Transformer-based models, exhibit similar AA and AV performance on human-spoken datasets compared to written ones, there is much room for improvement in AI-generated spoken text detection. The HANSEN benchmark is available at: https://huggingface.co/datasets/HANSEN-REPO/HANSEN.
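As an illustration of the character n-gram approach named in the abstract, the following is a minimal sketch of authorship attribution by nearest n-gram profile. It is not the authors' implementation: the function names (`char_ngrams`, `cosine`, `attribute`), the choice of trigrams with cosine similarity, and the two toy transcripts are all assumptions made for this example, and no HANSEN data is used.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(query, profiles, n=3):
    """Return the candidate whose n-gram profile is most similar to the query."""
    q = char_ngrams(query, n)
    return max(profiles, key=lambda speaker: cosine(q, profiles[speaker]))


# Hypothetical toy transcripts (invented for illustration only).
samples = {
    "speaker_a": "well, you know, I mean, you know how it goes, right?",
    "speaker_b": "indeed, the committee shall therefore consider the proposal.",
}
profiles = {name: char_ngrams(text) for name, text in samples.items()}

predicted = attribute("you know, I mean, it goes like that, right?", profiles)
print(predicted)
```

Each candidate speaker is represented by a bag of character trigrams, and an unseen utterance is assigned to the speaker with the most similar profile. A realistic setup would use far more text per speaker and typically a trained classifier over weighted n-gram features.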

Publication year: 2023
Scientific disciplinary sector (valid until 24/06/2024): INF/01 - Informatica (Computer Science)
Conference title: Empirical Methods in Natural Language Processing
Conference location: Singapore
Conference dates: 6-10 December 2023
Volume title: Findings of the Association for Computational Linguistics: EMNLP 2023
Publisher: Association for Computational Linguistics
ISBN: 9798891760615
DOI: https://dx.doi.org/10.18653/v1/2023.findings-emnlp.916
Keywords: Computer Science; Computation and Language; Authorship analysis; Authorship attribution; Curation; Human dataset; Language model; Language processing; Natural languages; Stylometry; Text detection; Written texts; Computational linguistics; Natural language processing systems

Projects funding the research:
Project title: SoBigData++: European Integrated Infrastructure for Social Mining and Big Data Analytics
Acronym: SoBigData-PlusPlus
Funder: European Commission
Funding programme: Horizon 2020 Framework Programme
Contract no.: 871042

Appears in categories: 4.1 Contribution in conference proceedings

Files in this record:

File	Size	Format
HANSEN- Human and AI Spoken Text Benchmark for Authorship Analysis.pdf (open access; type: Published version; license: Creative Commons)	654.89 kB	Adobe PDF