ReproGen - Announcement and Call for Reproduction Papers

Shared Task Announcement and Call for Human Evaluations to be Reproduced

Background

Across Natural Language Processing (NLP), a growing body of work is looking at the issue of reproducibility. However, the replicability of human evaluation experiments and the reproducibility of their results are currently under-addressed, which is of particular concern for Natural Language Generation (NLG), where human evaluations are the norm. With the ReproGen shared task on reproducibility of human evaluations in NLG, we aim (i) to shed light on the extent to which past NLG evaluations have been replicable and reproducible, and (ii) to draw conclusions about how evaluations can be designed and reported to increase reproducibility. If the task is run over several years, we hope to be able to document an overall increase in levels of reproducibility over time.

Solicitation of Human Evaluations to be Reproduced

With this call we invite authors of papers that describe a human evaluation experiment and report results from it to put their paper forward for reproduction in the main ReproGen Shared Task track (see below). We also invite NLG researchers to nominate a human evaluation by other authors whose results they would be interested in reproducing themselves. (Self-)nominated human evaluations should fulfill the following criteria (intended to ensure a low barrier to participation):

  • The original evaluation evaluates automatically generated textual outputs (of any length)
  • Authors are able to make the complete set of system outputs from the evaluation freely available to ReproGen participants
  • Evaluators were not paid for participation
  • There were between 10 and 50 evaluators
  • The original evaluation involves easily accessible evaluators such as university colleagues or students

From the (self-)nominations received, we will make a selection aimed at achieving balance across NLG tasks and publication years.

Please submit your (self-)nominated evaluation papers by filling in the ReproGen evaluation proposal form.

About ReproGen

Following discussion of the ReproGen proposal at INLG’20 GenChal, we are organising ReproGen with two tracks: a shared task in which teams try to reproduce the same prior human evaluation results, and an ‘unshared task’ in which teams attempt to reproduce their own prior human evaluation results:

  • Main Reproducibility Track: For a shared set of selected human evaluations, participants attempt to reproduce their results, using published information plus extra details provided by the authors, and making common-sense assumptions where information is still incomplete.

  • RYO Track (unshared task): Reproduce Your Own previous human evaluation results and report what happened.

Full details will be made available when registration opens on 5 February 2021.

Important Dates

28 Jan 2021: Announcement and Call for Human Evaluations to be Reproduced
5 Feb 2021: First Call for Participation and registration opens
15 Feb 2021: Submission deadline for proposals of human evaluations
15 Aug 2021: Submission deadline for reproduction reports
End September 2021: Results presented at INLG (conference dates to be confirmed)

Bibliography

ReproGen proposal at INLG’20 GenChal

Contact

reprogen.task@gmail.com