A multilabel dataset for distinguishing Bosnian, Croatian, Montenegrin, and Serbian

Miletić, Aleksandra; Miletić, Filip

Date

2024-04-19

Creator/contributor

Miletić, Aleksandra

Miletić, Filip

Publication Type

dataset

Repositories

Zenodo

Access rights

Open

Description

This dataset contains files used in the VarDial 2024 Shared Task on Distinguishing Between Similar Languages - Multiple Labels for the Bosnian - Croatian - Montenegrin - Serbian (BCMS) subtask. The starting point for this dataset is the one published by Rupnik et al. (2023). It contains geolocated data from the BCMS linguistic area collected from Twitter (rebranded as X). Each instance contains the full tweet production of a single user, which was manually annotated for the user's country. The original annotation was single-label and was produced by a single annotator. In the version of the data produced here, the test and dev sets were reannotated by multiple annotators in a multi-label setting. For details on the reannotation process, please see Miletić and Miletić (2024). We have also excluded retweets from the original data, as these reproduce content from a different user account and may not be representative of the user's own language use. For details on the shared task, we refer you to Chifu et al. (2024).

Link to original dataset

https://doi.org/10.5281/zenodo.10998042

University of Helsinki Data catalogue