A multilabel dataset for distinguishing Bosnian, Croatian, Montenegrin, and Serbian
Restricted Availability
Date
2024-04-19
Publication Type
dataset
Access rights
Open
Description
This dataset contains files used in the VarDial 2024 Shared Task on Distinguishing Between Similar Languages - Multiple Labels for the Bosnian - Croatian - Montenegrin - Serbian (BCMS) subtask.
The starting point for this dataset is the one published by Rupnik et al. (2023). It contains geolocated data from the BCMS linguistic area collected from Twitter (rebranded as X). Each instance contains the full tweet production of a single user and was manually annotated for the user's country. The original annotation was single-label and was produced by a single annotator. In the version of the data released here, the test and dev sets were reannotated by multiple annotators in a multi-label setting. For details on the reannotation process, please see Miletić and Miletić (2024). We have also excluded retweets from the original data, as these reproduce content from a different user account and may not be representative of the user's own language use.
For details on the shared task, we refer you to Chifu et al. (2024).