
Looking for a needle in a haystack: Semi-automatic creation of a Latvian multi-word dictionary from small monolingual corpora

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › Research › peer-review

1 Citation (Scopus)

Abstract

Multiword expressions (MWEs) are an indispensable part of almost any dictionary. However, identifying missing MWEs that have recently appeared in a language is not a simple task. In this paper we describe automated methods for MWE identification in rather small Latvian text corpora. We propose first applying statistical measures to identify a wide range of MWEs and then applying linguistically motivated filters to clean the list of initially extracted MWE candidates. We show that for morphologically rich languages such as Latvian, when only a small amount of language data is available, better results can be achieved with lemmatized data. We also demonstrate that in the case of a small general-domain (balanced) corpus, automatic methods can be used to find good MWE candidates: terminological units, named entities and some lexicalized phrases. However, finding idiomatic expressions in small, general-domain corpora is looking for a needle in a haystack: only a larger or more expressive corpus can help in the identification process.
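The two-stage approach described in the abstract (statistical scoring, then linguistically motivated filtering, both on lemmatized data) can be illustrated with a minimal sketch. The abstract does not name the statistical measures or the filters used, so everything below is an assumption for illustration only: pointwise mutual information (PMI) stands in for the statistical measure, a part-of-speech pattern check stands in for the linguistic filter, and the function names `pmi_bigrams` and `filter_by_pos`, the `min_count` threshold, the allowed patterns and the toy English data are all hypothetical.

```python
import math
from collections import Counter


def pmi_bigrams(lemmas, min_count=2):
    """Score adjacent lemma pairs by pointwise mutual information (PMI).

    PMI is an assumed stand-in for the paper's unspecified statistical measures.
    Pairs seen fewer than `min_count` times are dropped, since PMI is
    unreliable for rare events, especially in a small corpus.
    """
    unigrams = Counter(lemmas)
    bigrams = Counter(zip(lemmas, lemmas[1:]))
    n_uni = len(lemmas)
    n_bi = max(len(lemmas) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


def filter_by_pos(scores, pos_of, allowed=(("ADJ", "NOUN"), ("NOUN", "NOUN"))):
    """Keep only candidates whose POS pattern is in `allowed`.

    The allowed patterns are a hypothetical example of a linguistic filter.
    """
    return {
        pair: score
        for pair, score in scores.items()
        if (pos_of.get(pair[0]), pos_of.get(pair[1])) in allowed
    }


# Invented toy data: already lemmatized, so inflected surface forms
# of the same expression are counted together.
toy_lemmas = ["drink", "red", "wine", "and", "eat", "red", "wine",
              "sauce", "and", "drink", "water"]
toy_pos = {"drink": "VERB", "red": "ADJ", "wine": "NOUN", "and": "CCONJ",
           "eat": "VERB", "sauce": "NOUN", "water": "NOUN"}

candidates = pmi_bigrams(toy_lemmas)
filtered = filter_by_pos(candidates, toy_pos)
for pair, score in sorted(filtered.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

Counting over lemmas rather than surface forms is what lets a small corpus accumulate enough evidence for each candidate: in a morphologically rich language, the inflected variants of one expression would otherwise be split across many low-frequency pairs.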

Original language: English
Title of host publication: 18th Euralex International Congress, 2018
Editors: Vojko Gorjanc, Simon Krek, Jaka Cibej, Iztok Kosem
Place of Publication: Ljubljana
Publisher: European Association for Lexicography
Pages: 255-265
Number of pages: 11
ISBN (Electronic): 9789610600961
ISBN (Print): 9789610600978
Publication status: Published - 2018
Event: 18th Euralex International Congress, 2018 - Ljubljana, Slovenia
Duration: 17 Jul 2018 to 21 Jul 2018

Publication series

Name: EURALEX Proceedings
ISSN (Electronic): 2521-7100

Conference

Conference: 18th Euralex International Congress, 2018
Country/Territory: Slovenia
City: Ljubljana
Period: 17/07/18 to 21/07/18

Keywords

  • Collocations
  • Low resourced languages
  • Multi-word expressions
  • Named entities
  • Terminology

OECD Field of Science

  • 1.2 Computer and Information Sciences
