TY - GEN
T1 - Lessons learned from creating a balanced corpus from online data
AU - Darģis, Roberts
AU - Levāne-Petrova, Kristīne
AU - Poikāns, Ilmārs
N1 - Publisher Copyright: © 2020 The authors and IOS Press.
PY - 2020/9/15
Y1 - 2020/9/15
AB - This paper describes lessons learned from developing the most recent Balanced Corpus of Modern Latvian (LVK2018) from various online sources. Most new corpora are created from data obtained from text holders, which requires a cooperation agreement with each of them. Reaching these agreements is difficult and time-consuming, and may not be necessary if the resource to be created is not on the scale of hundreds of millions of words. Although many different resources are available on the Internet today for a particular language, finding viable online resources from which to create a balanced corpus is still a challenging task. Developing a balanced corpus from online sources does not require agreements with text holders, but it presents many more technical challenges, including text extraction, cleaning and validation.
KW - Balanced corpus
KW - Corpus development
KW - General corpus
KW - Metadata
UR - http://ebooks.iospress.nl/volumearticle/55535
UR - https://www.scopus.com/pages/publications/85093363245
U2 - 10.3233/FAIA200614
DO - 10.3233/FAIA200614
M3 - Conference paper
SN - 9781643681160
VL - 328
T3 - Frontiers in Artificial Intelligence and Applications
SP - 127
EP - 134
BT - Human Language Technologies - The Baltic Perspective - Proceedings of the 9th International Conference Baltic HLT 2020
A2 - Utka, Andrius
A2 - Vaicenoniene, Jurgita
A2 - Kovalevskaite, Jolanta
A2 - Kalinauskaite, Danguole
PB - IOS Press
CY - Amsterdam
ER -