Abstract
The Latvian National Corpora Collection (LNCC), accessible through Korpuss.lv, is an extensive and diverse collection of about 40 text and spoken corpora, totalling 2.8 billion tokens. These corpora represent a wide range of text types, such as news articles, blogs, scientific texts, parliamentary debates, and essays. Importantly, almost all the corpora in the LNCC have been re-annotated with a uniform morpho-syntactic annotation scheme, enabling federated search and consistent linguistic analysis across different text types and genres. This feature is especially valuable for computational linguistics and language technology development, offering objective data for studies in lexicography, terminology, grammar, semantics, and language learning. Thus, Korpuss.lv emerges as a critical tool in the digital humanities, helping to develop and refine language technologies and research methodologies.
| Original language | English |
|---|---|
| Pages (from-to) | 636-645 |
| Number of pages | 10 |
| Journal | Baltic Journal of Modern Computing |
| Volume | 12 |
| Issue number | 4 |
| Publication status | Published - 2024 |
Keywords
- corpora collection
- corpus linguistics
- federated search
- noSketch Engine
- timeline
Publication title: Korpuss.lv - a Versatile Platform for Digital Humanities