Morphology-inspired word segmentation for neural machine translation

  • Jānis Zuters*
  • Gus Strazds
  • Viktorija Ļeonova
  • *Corresponding author for this work
  • University of Latvia

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › Research › peer-review

2 Citations (Scopus)

Abstract

This paper proposes the Prefix-Root-Postfix-Encoding (PRPE) algorithm, which performs close-to-morphological segmentation of words as part of text pre-processing in machine translation. PRPE is a cross-language algorithm requiring only minor tweaking to adapt it for any particular language, a property which makes it potentially useful for morphologically rich languages with no morphological analysers available. As a key part of the proposed algorithm we introduce the ‘Root alignment’ principle to extract potential sub-words from a corpus, as well as a special technique for constructing words from potential sub-words. In addition, we supplemented the algorithm with specific processing for named-entities based on transliteration. We conducted experiments with two different neural machine translation systems, training them on parallel corpora for English-Latvian and Latvian-English translation. Evaluation of translation quality showed improvements in BLEU scores when the data were pre-processed using the proposed algorithm, compared to a couple of baseline word segmentation algorithms. Although we were able to demonstrate improvements in both translation directions and for both NMT systems, they were relatively minor, and our experiments show that machine translation with inflected languages remains challenging, especially with translation direction towards a highly inflected language.

Original language: English
Title of host publication: Databases and Information Systems X - Selected Papers from the 13th International Baltic Conference, DB and IS 2018
Editors: Audrone Lupeikiene, Olegas Vasilecas, Gintautas Dzemyda
Publisher: IOS Press BV
Pages: 225-239
Number of pages: 15
ISBN (Electronic): 9781614999409
Publication status: Published - 2019
Event: 13th International Baltic Conference on Databases and Information Systems, DB and IS 2018 - Trakai, Lithuania
Duration: 1 Jul 2018 → 4 Jul 2018

Publication series

Name: Frontiers in Artificial Intelligence and Applications
Volume: 315
ISSN (Print): 0922-6389
ISSN (Electronic): 1879-8314

Conference

Conference: 13th International Baltic Conference on Databases and Information Systems, DB and IS 2018
Country/Territory: Lithuania
City: Trakai
Period: 1/07/18 → 4/07/18

Keywords

  • Named-entity processing
  • Neural machine translation
  • Word segmentation
