TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages

Scherrer, Yves

Date

2020-03-12

Creator/contributor

Scherrer, Yves

Publication Type

dataset

Repositories

Zenodo

Access rights

Open

Description

This paper presents TaPaCo, a freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and with equivalence links between sentences "meaning the same thing". This graph is then traversed to extract sets of paraphrases. Several language-independent filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows that between half and three quarters of the inferred paraphrases are correct, and that most of the remaining ones are either correct but trivial, or near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with between 200 and 250,000 sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists.
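
The graph-based extraction described above can be illustrated with a minimal sketch. It is not the author's implementation: the function name, the input layout (a dictionary of sentence ids and a list of id pairs) and the choice of breadth-first search over connected components are all assumptions made for illustration, and the filtering and pruning steps are left out entirely. Sentences are nodes, equivalence links are undirected edges, and any two sentences of the same language that end up in the same connected component are treated as paraphrases of each other.

```python
from collections import defaultdict, deque


def extract_paraphrase_sets(sentences, links):
    """Group same-language sentences that are connected through links.

    sentences: dict mapping a sentence id to a (language, text) pair.
    links: iterable of (id_a, id_b) pairs marking two sentences as
           equivalent in meaning (hypothetical input layout).
    Returns a list of (language, sorted list of texts) tuples.
    """
    # Build an undirected graph over the known sentence ids.
    adjacency = defaultdict(set)
    for a, b in links:
        if a in sentences and b in sentences:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    paraphrase_sets = []
    for start in sentences:
        if start in seen:
            continue
        # Breadth-first traversal collects one connected component.
        component = []
        queue = deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)

        # Within a component, sentences sharing a language are paraphrases.
        by_language = defaultdict(set)
        for node in component:
            language, text = sentences[node]
            by_language[language].add(text)
        for language, texts in by_language.items():
            if len(texts) >= 2:
                paraphrase_sets.append((language, sorted(texts)))
    return paraphrase_sets


# Toy data, invented for this example (not taken from the corpus).
toy_sentences = {
    1: ("en", "I am hungry."),
    2: ("fr", "J'ai faim."),
    3: ("en", "I'm hungry."),
    4: ("de", "Ich habe Hunger."),
    5: ("en", "Hello."),
}
toy_links = [(1, 2), (2, 3), (2, 4)]
toy_result = extract_paraphrase_sets(toy_sentences, toy_links)
```

In the toy data, the two English sentences are never linked directly; they end up in the same set only because both are linked to the same French sentence. Sentence 5 has no links and therefore yields no paraphrase set.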

Link to original dataset

https://doi.org/10.5281/zenodo.3707949

Keyword

Multilingual corpus, Paraphrases, Crowdsourcing

University of Helsinki

University of Helsinki Data catalogue
