TY - GEN
T1 - Morphology-inspired word segmentation for neural machine translation
AU - Zuters, Jānis
AU - Strazds, Gus
AU - Ļeonova, Viktorija
N1 - Publisher Copyright: © 2019 The authors and IOS Press.
PY - 2019
Y1 - 2019
AB - This paper proposes the Prefix-Root-Postfix-Encoding (PRPE) algorithm, which performs close-to-morphological segmentation of words as part of text pre-processing in machine translation. PRPE is a cross-language algorithm that requires only minor adjustment to adapt it to a particular language, a property that makes it potentially useful for morphologically rich languages for which no morphological analysers are available. As a key part of the proposed algorithm, we introduce the ‘Root alignment’ principle to extract potential sub-words from a corpus, as well as a special technique for constructing words from potential sub-words. In addition, we supplemented the algorithm with specific processing for named entities based on transliteration. We conducted experiments with two different neural machine translation (NMT) systems, training them on parallel corpora for English-Latvian and Latvian-English translation. Evaluation of translation quality showed improvements in BLEU scores when the data were pre-processed using the proposed algorithm, compared to a couple of baseline word segmentation algorithms. Although we were able to demonstrate improvements in both translation directions and for both NMT systems, the gains were relatively minor, and our experiments show that machine translation involving inflected languages remains challenging, especially when translating into a highly inflected language.
KW - Named-entity processing
KW - Neural machine translation
KW - Word segmentation
UR - https://www.scopus.com/pages/publications/85063356974
U2 - 10.3233/978-1-61499-941-6-225
DO - 10.3233/978-1-61499-941-6-225
M3 - Conference paper
AN - SCOPUS:85063356974
T3 - Frontiers in Artificial Intelligence and Applications
SP - 225
EP - 239
BT - Databases and Information Systems X - Selected Papers from the 13th International Baltic Conference, DB and IS 2018
A2 - Lupeikiene, Audrone
A2 - Vasilecas, Olegas
A2 - Dzemyda, Gintautas
PB - IOS Press BV
T2 - 13th International Baltic Conference on Databases and Information Systems, DB and IS 2018
Y2 - 1 July 2018 through 4 July 2018
ER -