A multilabel dataset for distinguishing Bosnian, Croatian, Montenegrin, and Serbian

Miletić, Aleksandra; Miletić, Filip

dc.contributor.affiliation	University of Helsinki - Miletić, Aleksandra
dc.contributor.affiliation	University of Stuttgart - Miletić, Filip
dc.contributor.author	Miletić, Aleksandra
dc.contributor.author	Miletić, Filip
dc.date.accessioned	2025-03-24T15:17:15Z
dc.date.issued	2024-04-19
dc.description	This dataset contains files used in the VarDial 2024 Shared Task on Distinguishing Between Similar Languages - Multiple Labels for the Bosnian - Croatian - Montenegrin - Serbian (BCMS) subtask. The starting point for this dataset is the one published by Rupnik et al. (2023). It contains geolocated data from the BCMS linguistic area collected from Twitter (rebranded as X). Each instance contains the full tweet production of a single user, which was manually annotated for the user's country. The original annotation was single-label and was produced by a single annotator. In the version of the data produced here, the test and dev sets were reannotated by multiple annotators in a multi-label setting. For details on the reannotation process, please see Miletić and Miletić (2024). We have also excluded retweets from the original data, as these reproduce content from a different user account and may not be representative of the user's own language use. For details on the shared task, we refer you to Chifu et al. (2024).
dc.identifier	https://doi.org/10.5281/zenodo.10998042
dc.identifier.uri	https://hydatakatalogi-test-24.it.helsinki.fi/handle/123456789/10044
dc.rights	Open
dc.rights.license	cc-by-sa-4.0
dc.title	A multilabel dataset for distinguishing Bosnian, Croatian, Montenegrin, and Serbian
dc.type	dataset

Zenodo