Morphology data for 22 languages

Hämäläinen, Mika; Partanen, Niko; Rueter, Jack; Alnajjar, Khalid

dc.contributor.affiliation	University of Helsinki - Hämäläinen, Mika
dc.contributor.affiliation	University of Helsinki - Partanen, Niko
dc.contributor.affiliation	University of Helsinki - Rueter, Jack
dc.contributor.affiliation	University of Helsinki - Alnajjar, Khalid
dc.contributor.author	Hämäläinen, Mika
dc.contributor.author	Partanen, Niko
dc.contributor.author	Rueter, Jack
dc.contributor.author	Alnajjar, Khalid
dc.date.accessioned	2025-03-24T15:11:11Z
dc.date.issued	2020-07-02
dc.description	Most users will only need the train_data.zip file. It contains one folder per language, named by ISO code. Each language folder has separate subfolders for the lemmatization, analysis and generation tasks, and each of those holds source and target files for the train, val and test splits, plus a pred.txt file with the predictions of the baseline system. Read on if you need more control over how these datasets are created. To create the dataset from scratch, run data_formatter.py. The morphological data is stored in fst.zip; to skip some of the from-scratch steps, download it and extract it into the same folder as data_formatter.py. Cite: Hämäläinen, M., Partanen, N., Rueter, J., & Alnajjar, K. (2021). Neural Morphology Dataset and Models for Multiple Languages, from the Large to the Endangered. In Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa 2021).
dc.identifier	https://doi.org/10.5281/zenodo.3928628
dc.identifier.uri	https://hydatakatalogi-test-24.it.helsinki.fi/handle/123456789/8994
dc.rights	Open
dc.rights.license	cc-by-4.0
dc.title	Morphology data for 22 languages
dc.type	dataset
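
The folder layout given in the description (language folder, then task folder, then source and target files) could be explored with a short Python sketch like the one below. It is an illustration only, not part of the dataset's tooling: the helper names list_task_dirs and read_parallel are hypothetical, the record does not state the exact file names, so paths are passed in explicitly, and the sketch assumes that source and target files are line-aligned UTF-8 text.

```python
from pathlib import Path


def list_task_dirs(root):
    """Map each language folder under `root` to the task folders it contains.

    `root` is assumed to be the folder obtained by extracting train_data.zip.
    """
    root = Path(root)
    layout = {}
    for lang_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        layout[lang_dir.name] = sorted(
            t.name for t in lang_dir.iterdir() if t.is_dir()
        )
    return layout


def read_parallel(source_path, target_path):
    """Pair line i of the source file with line i of the target file.

    Assumption: the two files are line-aligned, one example per line.
    """
    with open(source_path, encoding="utf-8") as src:
        src_lines = [line.rstrip("\n") for line in src]
    with open(target_path, encoding="utf-8") as tgt:
        tgt_lines = [line.rstrip("\n") for line in tgt]
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"line count mismatch: {len(src_lines)} source vs "
            f"{len(tgt_lines)} target"
        )
    return list(zip(src_lines, tgt_lines))
```

Calling list_task_dirs on the extracted folder shows which tasks are present for each language; read_parallel can then be pointed at whichever source and target files a task folder actually contains.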

Zenodo