A multilabel dataset for distinguishing Bosnian, Croatian, Montenegrin, and Serbian

dc.contributor.affiliationUniversity of Helsinki - Miletić, Aleksandra
dc.contributor.affiliationUniversity of Stuttgart - Miletić, Filip
dc.contributor.authorMiletić, Aleksandra
dc.contributor.authorMiletić, Filip
dc.date.accessioned2025-03-24T15:17:15Z
dc.date.issued2024-04-19
dc.descriptionThis dataset contains files used in the VarDial 2024 Shared Task on Distinguishing Between Similar Languages - Multiple Labels for the Bosnian - Croatian - Montenegrin - Serbian (BCMS) subtask. The starting point for this dataset is the one published by Rupnik et al. (2023). It contains geolocated data from the BCMS linguistic area collected from Twitter (rebranded as X). Each instance contains the full tweet production of a single user, which was manually annotated for the user's country. The original annotation was single-label and was produced by a single annotator. In the version of the data produced here, the test and dev sets were reannotated by multiple annotators in a multi-label setting. For details on the reannotation process, please see Miletić and Miletić (2024). We have also excluded retweets from the original data, as these reproduce content from a different user account and may not be representative of the user's own language use. For details on the shared task, we refer you to Chifu et al. (2024).
dc.identifierhttps://doi.org/10.5281/zenodo.10998042
dc.identifier.urihttps://hydatakatalogi-test-24.it.helsinki.fi/handle/123456789/10044
dc.rightsOpen
dc.rights.licensecc-by-sa-4.0
dc.titleA multilabel dataset for distinguishing Bosnian, Croatian, Montenegrin, and Serbian
dc.typedataset