
Help from the Neighbors: Estonian Dialect Normalization Using a Finnish Dialect Generator

  • Tuuli Tuisk
  • Hämäläinen Mika
  • Alnajjar Khalid
  • Non-LU

Research output: Chapter in book/encyclopedia/conference proceedings › Conference paper › Research › Peer-reviewed

3 Citations (Scopus)

Abstract

While standard Estonian is not a low-resourced language, its dialects are under-resourced from an NLP point of view: there are no large hand-normalized resources available for training a machine learning model to normalize dialectal Estonian to standard Estonian. In this paper, we crawl a small parallel corpus of dialectal Estonian and standard Estonian sentences. In addition, we generate more synthetic training data for the normalization task by using an existing dialect generator model built for Finnish to "dialectalize" standard Estonian sentences from the Universal Dependencies treebanks. Our BERT-based normalization model achieves a word error rate that is 26.49 points lower when trained on both the synthetic data and the Estonian data than when trained on only the available Estonian data. Our results suggest that synthetic data generated by a model trained on a better-resourced related language can indeed boost results for a less-resourced language.
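
The abstract reports its result as a difference in word error rate (WER). As a minimal sketch of how such a figure is commonly computed, the Python function below takes the word-level edit distance between a reference sentence and a model output and divides it by the reference length. The function name `word_error_rate` and the toy token strings are hypothetical illustrations, and this record does not state how the authors computed or aggregated their WER, so this should not be read as their evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return word-level edit distance / reference length, as a percentage.

    Assumption: whitespace tokenization and a single sentence pair; the
    paper's own tokenization and corpus-level aggregation are not given here.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Toy placeholder tokens, not data from the paper.
print(word_error_rate("a b c d", "a x c d"))
```

Under this definition, a WER that is "26.49 points lower" means the percentage itself dropped by 26.49, not that it fell by 26.49 percent of its previous value.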

Original language: English
Title of host publication: DeepLo 2022 - 3rd Workshop on Deep Learning Approaches for Low-Resource NLP, Proceedings of the DeepLo Workshop
Editors: Colin Cherry, Angela Fan, George Foster, Gholamreza Haffari, Shahram Khadivi, Nanyun Peng, Xiang Ren, Ehsan Shareghi, Swabha Swayamdipta
Place of publication: Seattle
Publisher: Association for Computational Linguistics
Pages: 61-66
Number of pages: 6
ISBN (Electronic): 9781955917971
DOIs
Publication status: Published - 2022

Publication series

Name: DeepLo 2022 - 3rd Workshop on Deep Learning Approaches for Low-Resource NLP, Proceedings of the DeepLo Workshop

OECD field of science

  • 6.2 Languages and literature
