Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4190–4198, November 16-20, 2020. ©2020 Association for Computational Linguistics

Asiye Tuba Koksal (1), Ozge Bozal (1,2), Emre Yurekli (1), Gizem Gezici (1,3)
(1) Huawei R&D Center, Istanbul, Turkey
(2) Bogazici University, Istanbul, Turkey
(3) Sabanci University, Istanbul, Turkey
{asiye.tuba.koksal, ozge.bozal, emre.yurekli, gizem.gezici}@huawei.com

#Turki$hTweets: A Benchmark Dataset for Turkish Text Correction

Gezici, Gizem
2020

Abstract

#Turki$hTweets is a benchmark dataset for the task of correcting user misspellings, introduced as the first public Turkish dataset in this area. #Turki$hTweets provides correct/incorrect word annotations with a detailed misspelling category formulation based on real user data. We evaluated four state-of-the-art approaches on our dataset to present a preliminary analysis for the sake of reproducibility. The annotated dataset is publicly available at https://github.com/atubakoksal/annotated_tweets.
Academic sector ING-INF/05 - Information Processing Systems
Findings of the Association for Computational Linguistics: EMNLP 2020
Files in this record:
2020.findings-emnlp.374.pdf

Access: open access
License: Creative Commons
Size: 205.59 kB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/11384/139205