Abstract
This paper describes corpora collection activities for building large machine translation (MT) systems for the Latvian e-Government platform. We describe the requirements for corpora, the selection and assessment of data sources, the collection of public corpora, and the creation of new corpora from miscellaneous sources. The methodology, tools and assessment methods are presented along with the results achieved, the challenges faced and the conclusions drawn. Several approaches to addressing data scarcity are discussed. We summarize the volume of the corpora obtained and provide quality metrics for the MT systems trained on this data. The resulting MT systems for English-Latvian, Latvian-English and Latvian-Russian are integrated into the Latvian e-service portal and are freely available on the website HUGO.LV. This paper can serve as guidance for similar activities initiated in other countries, particularly in the context of the European Language Resource Coordination action.
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 |
| Editors | Nicoletta Calzolari, Khalid Choukri, Helene Mazo, Asuncion Moreno, Thierry Declerck, Sara Goggi, Marko Grobelnik, Jan Odijk, Stelios Piperidis, Bente Maegaard, Joseph Mariani |
| Publisher | European Language Resources Association (ELRA) |
| Pages | 1270-1276 |
| Number of pages | 7 |
| ISBN (Electronic) | 9782951740891 |
| Publication status | Published - 2016 |
| Externally published | Yes |
| Event | 10th International Conference on Language Resources and Evaluation, LREC 2016 - Portoroz, Slovenia. Duration: 23 May 2016 → 28 May 2016 |
Publication series
| Name | Proceedings of the 10th International Conference on Language Resources and Evaluation, LREC 2016 |
|---|---|
Conference
| Conference | 10th International Conference on Language Resources and Evaluation, LREC 2016 |
|---|---|
| Country/Territory | Slovenia |
| City | Portoroz |
| Period | 23/05/16 → 28/05/16 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 16 Peace, Justice and Strong Institutions
Keywords
- Corpus
- E-Government
- Machine translation
- Parallel texts
- Public sector information
- Web crawling
Fingerprint
Dive into the research topics of 'Collecting language resources for the Latvian e-Government machine translation platform'. Together they form a unique fingerprint.