
Lessons learned from creating a balanced corpus from online data

  • Roberts Daragis
  • Kristīne Levāne-Petrova
  • Ilmārs Poikāns

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › Research › peer-review

6 Citations (Scopus)

Abstract

This paper describes lessons learned from developing the most recent Balanced Corpus of Modern Latvian (LVK2018) from various online sources. Most new corpora are created from data obtained from text holders, which requires a cooperation agreement with each of them. Reaching these agreements is difficult and time-consuming, and may not be necessary if the resource to be created does not need to reach hundreds of millions in size. Although many resources are available on the Internet today for a particular language, finding viable online resources for a balanced corpus is still a challenging task. Developing a balanced corpus from online sources does not require agreements with text holders, but it presents many more technical challenges, including text extraction, cleaning and validation.

Original language: English
Title of host publication: Human Language Technologies - The Baltic Perspective - Proceedings of the 9th International Conference Baltic HLT 2020
Editors: Andrius Utka, Jurgita Vaicenoniene, Jolanta Kovalevskaite, Danguole Kalinauskaite
Place of Publication: Amsterdam
Publisher: IOS Press
Pages: 127-134
Volume: 328
ISBN (Print): 9781643681160
Publication status: Published - 15 Sept 2020

Publication series

Name: Frontiers in Artificial Intelligence and Applications
Volume: 328
ISSN (Print): 0922-6389
ISSN (Electronic): 1879-8314

Keywords

  • Balanced corpus
  • Corpus development
  • General corpus
  • Metadata

OECD Field of Science

  • 1.2 Computer and Information Sciences
