
Semi-automatic quasi-morphological word segmentation for neural machine translation

  • Jānis Zuters*
  • Gus Strazds
  • Kārlis Immers
  • *Corresponding author for this work
  • University of Latvia

Research output: Chapter in book/encyclopedia/conference proceedings, conference paper, research, peer-reviewed

8 Citations (Scopus)

Abstract

This paper proposes the Prefix-Root-Postfix-Encoding (PRPE) algorithm, which performs close-to-morphological segmentation of words as part of text pre-processing in machine translation. PRPE is a cross-language algorithm requiring only minor tweaking to adapt it for any particular language, a property which makes it potentially useful for morphologically rich languages with no morphological analysers available. As a key part of the proposed algorithm we introduce the ‘Root alignment’ principle to extract potential sub-words from a corpus, as well as a special technique for constructing words from potential sub-words. We conducted experiments with two different neural machine translation systems, training them on parallel corpora for English-Latvian and Latvian-English translation. Evaluation of translation quality showed improvements in BLEU scores when the data were pre-processed using the proposed algorithm, compared to a couple of baseline word segmentation algorithms. Although we were able to demonstrate improvements in both translation directions and for both NMT systems, they were relatively minor, and our experiments show that machine translation with inflected languages remains challenging, especially with translation direction towards a highly inflected language.
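To make the prefix-root-postfix idea concrete, here is a minimal toy sketch in Python. It is not the PRPE algorithm from the paper: the abstract does not specify how 'Root alignment' or the sub-word extraction works, so the affix lists, the `MIN_ROOT` threshold, and the `segment` function are all hypothetical, hand-written stand-ins for what PRPE would derive from a corpus.

```python
# Toy illustration only: NOT the authors' PRPE implementation.
# The affix inventories and the minimum root length are assumptions;
# a real system would extract candidate sub-words from a corpus.
PREFIXES = ("un", "re")
POSTFIXES = ("ing", "ed", "s")
MIN_ROOT = 3  # assumed lower bound on root length, to avoid over-splitting


def segment(word):
    """Split a word into [prefix, root, postfix], omitting empty parts."""
    # Try the longest prefix first, and only accept it if enough
    # characters remain to form a root.
    prefix = next(
        (p for p in sorted(PREFIXES, key=len, reverse=True)
         if word.startswith(p) and len(word) - len(p) >= MIN_ROOT),
        "",
    )
    rest = word[len(prefix):]
    # Same greedy rule for the postfix, applied to what is left.
    postfix = next(
        (s for s in sorted(POSTFIXES, key=len, reverse=True)
         if rest.endswith(s) and len(rest) - len(s) >= MIN_ROOT),
        "",
    )
    root = rest[:len(rest) - len(postfix)]
    return [part for part in (prefix, root, postfix) if part]


print(segment("replayed"))  # ['re', 'play', 'ed']
print(segment("red"))       # ['red']
```

The length check keeps short words such as "red" intact, which hints at why a purely greedy affix split needs extra constraints before it approaches morphological segmentation.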

Original language: English
Title of host publication: Databases and Information Systems - 13th International Baltic Conference, DB and IS 2018, Proceedings
Editors: Olegas Vasilecas, Gintautas Dzemyda, Audrone Lupeikiene
Publisher: Springer Verlag
Pages: 289-301
Number of pages: 13
ISBN (Print): 9783319975702
DOIs
Publication status: Published - 2018
Event: 13th International Baltic Conference on Databases and Information Systems, DB and IS 2018 - Trakai, Lithuania
Duration: 1 Jul 2018 to 4 Jul 2018

Publication series

Name: Communications in Computer and Information Science
Volume: 838
ISSN (Print): 1865-0929

Conference

Conference: 13th International Baltic Conference on Databases and Information Systems, DB and IS 2018
Country/Territory: Lithuania
City: Trakai
Period: 1/07/18 to 4/07/18
