MultiLeg: Dataset for Text Sanitisation in Less-resourced Languages

  • Rinalds Vīksna
  • Inguna Skadiņa
  • Roberts Rozis

Research output: Chapter in Book/Report/Conference proceeding › Conference paper › Research › peer-review

1 Citation (Scopus)

Abstract

Text sanitization is the task of detecting and removing personal information from the text. While it has been well-studied in monolingual settings, today, there is also a need for multilingual text sanitization. In this paper, we introduce MultiLeg: a parallel, multilingual named entity (NE) dataset consisting of documents from the Court of Justice of the European Union annotated with semantic categories suitable for text sanitization. The dataset is available in 8 languages, and it contains 3082 parallel text segments for each language. We also show that the pseudonymized dataset remains useful for downstream tasks.

Original language: English
Title of host publication: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Pages: 11776-11782
Number of pages: 7
ISBN (Electronic): 9782493814104
Publication status: Published - 2024

Publication series

Name: 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings

Keywords

  • legal domain
  • multilingual
  • named entities
  • text sanitization

OECD Field of Science

  • 1.2 Computer and Information Sciences
