TY - GEN
T1 - Multilingual culture-independent word analogy datasets
AU - Ular, Matej
AU - Vaik, Kristiina
AU - Lindström, Jessica
AU - Kurpniece, Milda
AU - Robnik-Šikonja, Marko
N1 - Publisher Copyright:
© European Language Resources Association (ELRA), licensed under CC-BY-NC
PY - 2020
Y1 - 2020
N2 - In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different text embeddings, typically, we use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English, Estonian, Finnish, Latvian, Lithuanian, Russian, Slovenian, and Swedish. We designed the monolingual analogy task to be much more culturally independent and also constructed cross-lingual analogy datasets for the involved languages. We present basic statistics of the created datasets and their initial evaluation using fastText embeddings.
AB - In text processing, deep neural networks mostly use word embeddings as an input. Embeddings have to ensure that relations between words are reflected through distances in a high-dimensional numeric space. To compare the quality of different text embeddings, typically, we use benchmark datasets. We present a collection of such datasets for the word analogy task in nine languages: Croatian, English, Estonian, Finnish, Latvian, Lithuanian, Russian, Slovenian, and Swedish. We designed the monolingual analogy task to be much more culturally independent and also constructed cross-lingual analogy datasets for the involved languages. We present basic statistics of the created datasets and their initial evaluation using fastText embeddings.
KW - Analogy task
KW - Evaluation
KW - Less-resourced languages
KW - Word embeddings
UR - http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.501.pdf
UR - https://www.scopus.com/pages/publications/85096622856
M3 - Conference paper
SN - 979-109554634-4
SN - 9791095546344
T3 - LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
SP - 4074
EP - 4080
BT - LREC 2020 - 12th International Conference on Language Resources and Evaluation, Conference Proceedings
A2 - Calzolari, Nicoletta
A2 - Bechet, Frederic
A2 - Blache, Philippe
A2 - Choukri, Khalid
A2 - Cieri, Christopher
A2 - Declerck, Thierry
A2 - Goggi, Sara
A2 - Isahara, Hitoshi
A2 - Maegaard, Bente
A2 - Mariani, Joseph
A2 - Mazo, Helene
A2 - Moreno, Asuncion
A2 - Odijk, Jan
A2 - Piperidis, Stelios
PB - The European Language Resources Association
CY - Paris
ER -