#Turki$hTweets: A Benchmark Dataset for Turkish Text Correction

Koksal, Asiye Tuba; Bozal, Ozge; Yürekli, Emre; Gezici, Gizem

doi:10.18653/v1/2020.findings-emnlp.374

Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4190–4198, November 16–20, 2020. ©2020 Association for Computational Linguistics

Asiye Tuba Koksal (1), Ozge Bozal (1,2), Emre Yurekli (1), Gizem Gezici (1,3)
(1) Huawei R&D Center, Istanbul, Turkey
(2) Bogazici University, Istanbul, Turkey
(3) Sabanci University, Istanbul, Turkey
{asiye.tuba.koksal, ozge.bozal, emre.yurekli, gizem.gezici}@huawei.com

Abstract

#Turki$hTweets is a benchmark dataset for the task of correcting user misspellings, introduced as the first public Turkish dataset in this area. #Turki$hTweets provides correct/incorrect word annotations with a detailed misspelling category formulation based on real user data. We evaluated four state-of-the-art approaches on our dataset to present a preliminary analysis for the sake of reproducibility. The annotated dataset is publicly available at https://github.com/atubakoksal/annotated_tweets.

Year of publication: 2020
Scientific Disciplinary Sector (valid until 24/06/2024): ING-INF/05 - Information Processing Systems (Sistemi di Elaborazione delle Informazioni)
Conference title: Findings of the Association for Computational Linguistics, ACL 2020: EMNLP 2020
Conference period: 2020
Volume title: Findings of the Association for Computational Linguistics: EMNLP 2020
Publisher: Association for Computational Linguistics (ACL)
ISBN: 9781952148903
DOI: https://dx.doi.org/10.18653/v1/2020.findings-emnlp.374
Appears in categories: 4.1 Contribution in conference proceedings

Files in this item:

File	Size	Format
2020.findings-emnlp.374.pdf (open access; type: Published version; license: Creative Commons)	205.59 kB	Adobe PDF