Latviešu valodas morfēmu un vārddarināšanas modeļu datubāzes lemmu atlase

Translated title of the contribution: Selection of lemmas for the database of Latvian morphemes and derivational models

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › Research › peer-review

1 Citation (Scopus)

Abstract

The article gives an overview of the first working stage of the project “Database of Latvian Morphemes and Derivational Models (DLMDM)” (No. LZP-2022/1-0013), during which the database's set of lemmas was created. The lemma list was drawn from The Balanced Corpus of Modern Latvian, dated 2018. Initially, 165 090 lemmas were obtained from the corpus texts; by the end of data revision, 77 124 lemmas had been declared valid. The lemmas were analysed in three steps: (1) automated selection, (2) manual processing, and (3) a further automated check. Automated selection (steps 1 and 3) invalidated a total of 30 009 lemmas: words containing characters or symbols that are not letters of the Latvian alphabet, as well as various duplicate forms. During manual processing, 78 518 lemmas were selected and checked for spelling and context of use. At this step, 57 957 lemmas were declared invalid (abbreviations, words that do not exist in Latvian, etc.). The remaining 20 561 selected lemmas were divided into three groups: (1) lemmas that were corrected, (2) lemmas that were left with parallel forms, and (3) lemmas that were not corrected. These lemmas were included in the database. The final count is 77 124 lemmas, but this number may change because data revision continues in the subsequent steps of the project.
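The automated selection and the lemma counts reported in the abstract can be illustrated with a minimal Python sketch. It is not the project's actual tooling: the function names (`is_latvian_word`, `automated_filter`), the treatment of duplicates as exact string matches, and the case-insensitive alphabet check are all assumptions made for illustration. Only the four figures in the reconciliation come from the abstract.

```python
# Hypothetical sketch of the automated filtering step; not the project's code.

# The 33 letters of the Latvian alphabet (lower case).
LATVIAN_LETTERS = set("aābcčdeēfgģhiījkķlļmnņoprsštuūvzž")


def is_latvian_word(lemma: str) -> bool:
    """Return True if the lemma is non-empty and uses only Latvian letters."""
    return bool(lemma) and all(ch in LATVIAN_LETTERS for ch in lemma.lower())


def automated_filter(lemmas):
    """Drop lemmas with non-Latvian characters and exact duplicates.

    Assumption: a "duplicate" is an identical string; input order is kept.
    """
    seen = set()
    kept = []
    for lemma in lemmas:
        if is_latvian_word(lemma) and lemma not in seen:
            seen.add(lemma)
            kept.append(lemma)
    return kept


# Reconciliation of the figures reported in the abstract.
INITIAL = 165_090         # lemmas obtained from the corpus
AUTO_INVALID = 30_009     # invalidated automatically (steps 1 and 3)
MANUAL_CHECKED = 78_518   # lemmas selected for manual processing (step 2)
MANUAL_INVALID = 57_957   # invalidated during manual processing

manual_kept = MANUAL_CHECKED - MANUAL_INVALID          # 20 561 in the abstract
final_total = INITIAL - AUTO_INVALID - MANUAL_INVALID  # 77 124 in the abstract
```

For example, under these assumptions `automated_filter(["māja", "māja", "x-ray", "suns"])` keeps only `māja` and `suns`: the repeated entry is dropped as a duplicate, and `x-ray` is rejected because `x`, `y` and the hyphen are not letters of the Latvian alphabet. The two subtractions reproduce the abstract's figures of 20 561 and 77 124.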

Translated title of the contribution: Selection of lemmas for the database of Latvian morphemes and derivational models
Original language: Latvian
Title of host publication: Valoda: nozīme un forma
Subtitle of host publication: Gramatika un valodas elektroniskie resursi
Editors: Andra Kalnača, Ilze Lokmane, Daiki Horiguchi
Place of Publication: Rīga
Pages: 225-237
Volume: 16
ISBN (Electronic): 978-9934-36-494-5
DOIs
Publication status: Published - 2025

Publication series

Name: Valoda: Nozime un Forma
Volume: 16
ISSN (Print): 2255-9256
ISSN (Electronic): 2256-0602

OECD Field of Science

  • 6.2 Languages and Literature
  • 1.2 Computer and Information Sciences
