TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages

Scherrer, Yves

dc.contributor.affiliation	University of Helsinki - Scherrer, Yves
dc.contributor.author	Scherrer, Yves
dc.date.accessioned	2025-03-24T15:17:25Z
dc.date.issued	2020-03-12
dc.description	This paper presents TaPaCo, a freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links between sentences "meaning the same thing". This graph is then traversed to extract sets of paraphrases. Several language-independent filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows that between half and three quarters of inferred paraphrases are correct and that most remaining ones are either correct but trivial, or near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with between 200 and 250,000 sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
dc.identifier	https://doi.org/10.5281/zenodo.3707949
dc.identifier.uri	https://hydatakatalogi-test-24.it.helsinki.fi/handle/123456789/10139
dc.rights	Open
dc.rights.license	cc-by-2.0
dc.subject	Multilingual corpus, Paraphrases, Crowdsourcing
dc.title	TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages
dc.type	dataset
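
The graph construction and traversal mentioned in the description can be illustrated with a minimal sketch. It assumes that "traversal" means collecting connected components of the equivalence graph and then grouping each component's sentences by language; the corpus's actual filters and pruning steps are not reproduced here. The function name `paraphrase_sets`, the input layout, and the sample sentences are hypothetical and chosen only for illustration.

```python
from collections import defaultdict, deque


def paraphrase_sets(sentences, links):
    """Group same-language sentences that are connected by equivalence links.

    sentences: dict mapping sentence id -> (language code, text)   [assumed layout]
    links:     iterable of (id_a, id_b) pairs meaning "same meaning" [assumed layout]
    """
    # Build an undirected adjacency list from the equivalence links.
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    result = []
    for start in sentences:
        if start in seen:
            continue
        # Breadth-first traversal collects one connected component.
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour in sentences and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        # Within a component, distinct sentences in the same language
        # are treated as candidate paraphrases of each other.
        by_lang = defaultdict(set)
        for node in component:
            lang, text = sentences[node]
            by_lang[lang].add(text)
        for lang, texts in by_lang.items():
            if len(texts) > 1:
                result.append((lang, sorted(texts)))
    return result


# Hypothetical toy data: two English sentences linked through a French one.
sentences = {
    1: ("en", "I am hungry."),
    2: ("fr", "J'ai faim."),
    3: ("en", "I'm hungry."),
    4: ("en", "Hello."),
}
links = [(1, 2), (2, 3)]
print(paraphrase_sets(sentences, links))
```

In this toy input the two English sentences end up in one set because both are linked to the same French sentence, while the unlinked sentence yields no paraphrase set.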

Repositories

Zenodo

University of Helsinki

University of Helsinki Data catalogue