A multilabel dataset for distinguishing Bosnian, Croatian, Montenegrin, and Serbian

Restricted Availability

Date

2024-04-19

Persistent identifier of the Data Catalogue metadata

Editor

Journal title

Journal volume

Publisher

Publication Type

dataset

Peer Review Status

Repositories

Access rights

Open

ISBN

ISSN

Description

This dataset contains the files used in the Bosnian-Croatian-Montenegrin-Serbian (BCMS) subtask of the VarDial 2024 Shared Task on Distinguishing Between Similar Languages - Multiple Labels. It builds on the dataset published by Rupnik et al. (2023), which contains geolocated data from the BCMS linguistic area collected from Twitter (rebranded as X). Each instance contains the full tweet production of a single user, manually annotated for the user's country. The original annotation was single-label and was produced by a single annotator. In the version released here, the test and dev sets were reannotated by multiple annotators in a multi-label setting. For details on the reannotation process, please see Miletić and Miletić (2024). We have also excluded retweets from the original data, as these reproduce content from a different user account and may not be representative of the user's own language use. For details on the shared task, we refer you to Chifu et al. (2024).

Keyword (yso)

Keyword

Publication Series

Location of the original dataset