SemEval-2024 Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes

dc.contributor.affiliation: University of Helsinki - Mickus, Timothee
dc.contributor.affiliation: University of Helsinki - Zosa, Elaine
dc.contributor.affiliation: University of Helsinki - Vázquez, Raúl
dc.contributor.affiliation: University of Helsinki - Vahtola, Teemu
dc.contributor.affiliation: University of Helsinki - Tiedemann, Jörg
dc.contributor.affiliation: Université Bretagne Sud - Segonne, Vincent
dc.contributor.affiliation: University of Milano-Bicocca - Raganato, Alessandro
dc.contributor.affiliation: University of Pennsylvania - Apidianaki, Marianna
dc.contributor.author: Mickus, Timothee
dc.contributor.author: Zosa, Elaine
dc.contributor.author: Vázquez, Raúl
dc.contributor.author: Vahtola, Teemu
dc.contributor.author: Tiedemann, Jörg
dc.contributor.author: Segonne, Vincent
dc.contributor.author: Raganato, Alessandro
dc.contributor.author: Apidianaki, Marianna
dc.date.accessioned: 2025-03-24T15:11:23Z
dc.date.issued: 2024-05-16
dc.description: Task description: SHROOM participants will need to detect grammatically sound output that contains incorrect semantic information (i.e., unsupported by or inconsistent with the source input), with or without access to the model that produced the output.

Overview of the task: The modern NLG landscape is plagued by two interlinked problems: on the one hand, current neural models have a propensity to produce inaccurate but fluent outputs; on the other hand, our metrics are better suited to measuring fluency than correctness. As a result, neural networks “hallucinate”, i.e., produce fluent but incorrect outputs that we currently struggle to detect automatically. For many NLG applications, however, the correctness of an output is mission-critical. For instance, a plausible-sounding translation that is inconsistent with the source text jeopardizes the usefulness of a machine translation pipeline. With this shared task, we hope to foster the community's growing interest in this topic.

SHROOM adopts a post hoc setting, where models have already been trained and outputs already produced. Participants are asked to perform binary classification to identify cases of fluent overgeneration hallucinations in two setups: a model-aware track and a model-agnostic track. That is, participants must detect grammatically sound outputs that contain incorrect or unsupported semantic information, inconsistent with the source input, with or without access to the model that produced the output. To that end, we provide participants with a collection of checkpoints, inputs, references, and outputs of systems covering three different NLG tasks: definition modeling (DM), machine translation (MT), and paraphrase generation (PG), trained with varying degrees of accuracy. The development set provides binary annotations from at least five different annotators and a majority-vote gold label.
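The majority-vote gold label mentioned above can be derived from the per-annotator judgements in a few lines. The sketch below is illustrative only: the field names (`labels`, `gold`) and the tie-break rule are assumptions, not the actual SHROOM data schema.

```python
from collections import Counter

def majority_vote(labels):
    """Return the majority label from a list of per-annotator
    binary judgements ("Hallucination" / "Not Hallucination")."""
    counts = Counter(labels)
    # With an odd number of annotators (at least five, per the task
    # description) a binary vote cannot tie; most_common(1) suffices.
    return counts.most_common(1)[0][0]

# Hypothetical annotation record (schema is illustrative):
record = {
    "task": "MT",
    "hyp": "The cat sat on the mat.",
    "labels": ["Hallucination", "Not Hallucination", "Hallucination",
               "Hallucination", "Not Hallucination"],
}
record["gold"] = majority_vote(record["labels"])
```

The raw label list could also be kept alongside the gold label to compute a soft probability of hallucination, which some participants may prefer over the hard binary target.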
dc.identifier: https://doi.org/10.5281/zenodo.11202503
dc.identifier.uri: https://hydatakatalogi-test-24.it.helsinki.fi/handle/123456789/9151
dc.rights: Open
dc.rights.license: cc-by-4.0
dc.subject: Semeval-2024
dc.subject: Hallucinations
dc.title: SemEval-2024 Task 6: SHROOM, a Shared-task on Hallucinations and Related Observable Overgeneration Mistakes
dc.type: dataset
