OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan

dc.contributor.affiliation: University of Helsinki - Miletić, Aleksandra
dc.contributor.affiliation: University of Helsinki - Scherrer, Yves
dc.contributor.author: Miletić, Aleksandra
dc.contributor.author: Scherrer, Yves
dc.date.accessioned: 2025-03-24T15:14:54Z
dc.date.issued: 2022-09-14
dc.description: OcWikiDisc is a freely available corpus in Occitan, extracted from the talk pages associated with the Occitan Wikipedia. The corpus contains messages posted by users in direct user-to-user interactions as part of the discussions about content and editing policies on Wikipedia. Each message carries metadata such as the username, the date and time of posting, and the discussion title. The corpus has also been annotated with automatic language identification tools, which makes it possible to filter out content in languages other than Occitan. Four versions of the corpus are published, each produced with a different filtering strategy (see the documentation for details). The version with the most restrictive filtering contains 8,000 messages for a total of 618,000 tokens, produced by 520 different users.
dc.identifier: https://doi.org/10.5281/zenodo.7079580
dc.identifier.uri: https://hydatakatalogi-test-24.it.helsinki.fi/handle/123456789/9410
dc.rights: Open
dc.rights.license: cc-by-sa-3.0
dc.subject: Occitan, Wikipedia talk pages, corpus, low-resourced language, language identification
dc.title: OcWikiDisc: a Corpus of Wikipedia Talk Pages in Occitan
dc.type: dataset

Files

Repositories