Murreviikko: an Annotated and Normalized Corpus of Dialectal Finnish Tweets
dc.contributor.affiliation | University of Helsinki - Kuparinen Olli | |
dc.contributor.author | Kuparinen Olli | |
dc.date.accessioned | 2025-03-24T15:15:11Z | |
dc.date.issued | 2023-05-08 | |
dc.description | Murreviikko (literally 'Dialect week') is a campaign founded at the University of Eastern Finland to promote the use of Finnish dialects on social media. It started in 2020 and takes place in mid-October. The original data was collected from Twitter with the search word murreviikko ('dialect week') and the hashtag #murreviikko, separately for 2020, 2021 and 2022. The current dataset combines all the original collections. The tweets are dialectologically annotated on two levels: following the East-West division of Finnish dialects, and following a seven-way division of Finnish dialects (South-West, Häme, Southern Ostrobothnia, Central and Northern Ostrobothnia, Far North, Savo, and South-East), supplemented with Helsinki slang. There is also a class (NA) for dialectal tweets whose dialect cannot be determined because their dialectal features are conflicting or scarce. The original tweets are normalized to a phonetic standard, but word order is not altered, nor are the grammar rules of standard Finnish otherwise enforced. For instance, standard Finnish possessive suffixes (minun kirja-ni 'my book-my') are not added if they are absent from the original tweet (minun kirja). Likewise, dialect words are not replaced with a standard alternative even when one exists (pruukata > pruukata instead of standard tavata). Following the rules of the Twitter API, this repository only includes the tweet IDs, dialect annotations and normalizations. The original tweets are available for scientific use on request, as permitted by the European Union’s Digital Single Market directive (2019/790). | |
dc.identifier | https://doi.org/10.5281/zenodo.7907535 | |
dc.identifier.uri | https://hydatakatalogi-test-24.it.helsinki.fi/handle/123456789/9593 | |
dc.rights | Open | |
dc.rights.license | cc-by-4.0 | |
dc.subject | dialect | |
dc.subject | tweet | |
dc.subject | normalization | |
dc.title | Murreviikko: an Annotated and Normalized Corpus of Dialectal Finnish Tweets | |
dc.type | dataset | |