
Using sub-word n-gram models for dealing with OOV in large vocabulary speech recognition for Latvian

  • Tilde Company

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › Research › peer-review

6 Citations (Scopus)

Abstract

In Latvian, a single word can have tens or even hundreds of surface forms. This is a serious problem for large vocabulary speech recognition: including every form in the vocabulary would make it intractable, yet even with a vocabulary of 400K the out-of-vocabulary (OOV) rate remains very high. In this paper, we investigate the use of sub-word vocabularies in which words are split into frequent, common parts. Our experimental results show that this approach significantly reduces the OOV rate.
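The effect described in the abstract can be illustrated with a small sketch. It is not the segmentation method used in the paper, which the abstract does not specify. The stem and ending lists, the `+` marker on endings, and the helper names `oov_rate` and `split_word` are all assumptions made for this toy example.

```python
def oov_rate(tokens, vocabulary):
    """Return the fraction of running tokens that are missing from the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)


def split_word(word, stems, endings):
    """Split a word into [stem, '+ending'] when both parts are known units.

    Hypothetical splitter for illustration only: it tries the longest stem
    first and returns the word unsplit when no known stem/ending pair fits.
    """
    for i in range(len(word) - 1, 0, -1):
        stem, ending = word[:i], word[i:]
        if stem in stems and ending in endings:
            return [stem, "+" + ending]
    return [word]


# Toy data (assumed): a few inflected forms of "māja" (house) and "grāmata" (book).
word_vocabulary = {"māja", "mājas", "grāmata"}
stems = {"māj", "grāmat"}
endings = {"a", "as", "ai", "u"}
subword_vocabulary = stems | {"+" + ending for ending in endings}

test_words = ["mājai", "grāmatu", "māja"]

# Word-level OOV: unseen inflections count as OOV even though the stem is known.
word_oov = oov_rate(test_words, word_vocabulary)

# Sub-word OOV: split each word first, then look up the resulting units.
test_units = [unit for word in test_words for unit in split_word(word, stems, endings)]
subword_oov = oov_rate(test_units, subword_vocabulary)

print(f"word-level OOV rate: {word_oov:.2f}")
print(f"sub-word OOV rate:   {subword_oov:.2f}")
```

In this toy setting, the two unseen inflected forms are OOV at the word level, while every unit produced by the split is covered by a sub-word vocabulary of only six entries.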

Original language: English
Title of host publication: Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015
Publisher: Association for Computational Linguistics (ACL)
Pages: 281-285
Number of pages: 5
ISBN (Electronic): 9789175190983
ISBN (Print): 9789175190983
Publication status: Published - 2015
Externally published: Yes
Event: 20th Nordic Conference of Computational Linguistics, NODALIDA 2015 - Vilnius, Lithuania
Duration: 11 May 2015 → 13 May 2015

Publication series

Name: Proceedings of the 20th Nordic Conference of Computational Linguistics, NODALIDA 2015

Conference

Conference: 20th Nordic Conference of Computational Linguistics, NODALIDA 2015
Country/Territory: Lithuania
City: Vilnius
Period: 11/05/15 → 13/05/15

OECD Field of Science

  • 1.2 Computer and Information Sciences
