
Subword segmentation for machine translation based on grouping words by potential roots

  • University of Latvia

Research output: Contribution to journal, scientific article (in a journal), peer-reviewed

2 Citations (Scopus)

Abstract

This paper proposes a new subword segmentation method for machine translation. The algorithm, which we call GenSeg, is generic in the sense that it can be applied to any language, but is designed with an emphasis on inflectional splitting, i.e. it attempts to split words on boundaries corresponding to inflectional suffixes. The main principle of the method is grouping together words that share a common middle substring, and then separating the best such substring from the rest of the word. GenSeg is a cross-language method extended with some language-specific morphological analysis rules (currently for the Latvian language). To verify its effectiveness, we performed machine translation experiments in two directions: Latvian-English and English-Latvian, obtaining minor improvements in translation quality when using our pre-processing method.
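The grouping principle described in the abstract can be illustrated with a small sketch. This is not the GenSeg algorithm itself: the abstract does not specify how candidate substrings are scored or how the Latvian-specific rules are applied. The function names (`candidate_roots`, `group_by_root`, `segment`), the minimum root length of 3, and the scoring rule (prefer the substring shared by the most words, then the longest one) are all assumptions made for illustration.

```python
from collections import defaultdict


def candidate_roots(word, min_len=3):
    """Yield every substring of `word` that has at least `min_len` characters."""
    for start in range(len(word)):
        for end in range(start + min_len, len(word) + 1):
            yield word[start:end]


def group_by_root(words, min_len=3):
    """Map each candidate substring to the set of words that contain it."""
    groups = defaultdict(set)
    for word in words:
        for root in candidate_roots(word, min_len):
            groups[root].add(word)
    return groups


def segment(word, groups, min_len=3):
    """Split `word` around its best shared substring (hypothetical scoring).

    Assumed score: number of words sharing the substring, then its length.
    A word with no substring shared by at least two words is left unsplit.
    """
    best = max(
        candidate_roots(word, min_len),
        key=lambda root: (len(groups.get(root, ())), len(root)),
        default=None,
    )
    if best is None or len(groups.get(best, ())) < 2:
        return [word]
    start = word.find(best)
    parts = [word[:start], best, word[start + len(best):]]
    return [part for part in parts if part]  # drop empty prefix/suffix


# Toy vocabulary: inflected forms of Latvian "galds" (table) plus "suns" (dog).
words = ["galds", "galda", "galdam", "galdu", "suns"]
groups = group_by_root(words)
segmented = {word: segment(word, groups) for word in words}
```

On this toy vocabulary the four forms of "galds" are grouped under the shared substring "gald", so "galdam" is split into "gald" and "am", while "suns" shares no substring with any other word and stays whole. A real system would need a more careful scoring function, since the most frequently shared substring is not always a linguistically meaningful root.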

Original language: English
Pages (from-to): 500-509
Number of pages: 10
Journal: Baltic Journal of Modern Computing
Volume: 7
Issue number: 4
Publication status: Published - 2019
