Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4190–4198, November 16–20, 2020. ©2020 Association for Computational Linguistics
Asiye Tuba Koksal (1), Ozge Bozal (1, 2), Emre Yurekli (1), Gizem Gezici (1, 3)
(1) Huawei R&D Center, Istanbul, Turkey; (2) Bogazici University, Istanbul, Turkey; (3) Sabanci University, Istanbul, Turkey
{asiye.tuba.koksal, ozge.bozal, emre.yurekli, gizem.gezici}@huawei.com
#Turki$hTweets: A Benchmark Dataset for Turkish Text Correction
Gezici, Gizem
2020
Abstract
#Turki$hTweets is a benchmark dataset for the task of correcting user misspellings, introduced as the first public Turkish dataset in this area. #Turki$hTweets provides correct/incorrect word annotations with a detailed misspelling category formulation based on real user data. We evaluated four state-of-the-art approaches on our dataset to present a preliminary analysis for the sake of reproducibility. The annotated dataset is publicly available at https://github.com/atubakoksal/annotated_tweets.

File | Size | Format |
---|---|---|
2020.findings-emnlp.374.pdf (open access; license: Creative Commons) | 205.59 kB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.