The goal of this challenge is to offer fertile ground for designing vision-language pre-training techniques that benefit downstream vision-language tasks (this year, video captioning). To further motivate and challenge the multimedia community, we provide a large-scale video-language pre-training dataset, named "Auto-captions on GIF", for contestants to tackle this challenging and emerging task.

Contestants are asked to develop a video captioning system based on the Auto-captions on GIF dataset (as pre-training data) and the public MSR-VTT benchmark (as training data for the downstream task). For evaluation, a contesting system must produce at least one sentence for each test video. Accuracy will be evaluated against the pre-generated human reference sentence(s).
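The evaluation requirement can be illustrated with a minimal sketch. It assumes predictions and references are stored as dictionaries mapping a video id to a list of sentences; the ids, the sentences, and the helper names `missing_videos`, `unigram_f1`, and `best_reference_score` are all hypothetical. The word-overlap score is only a toy stand-in for comparing a caption against human references and is not the challenge's official metric.

```python
from collections import Counter


def missing_videos(predictions, test_video_ids):
    """Return the test video ids that have no non-empty predicted sentence."""
    return [
        vid for vid in test_video_ids
        if not any(s.strip() for s in predictions.get(vid, []))
    ]


def unigram_f1(candidate, reference):
    """Toy word-overlap F1 between two sentences (not an official metric)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared word at most as often
    # as it appears in both sentences.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_reference_score(candidate, references):
    """Score a candidate against its closest human reference sentence."""
    return max((unigram_f1(candidate, ref) for ref in references), default=0.0)


# Hypothetical ids and sentences, for illustration only.
predictions = {"video0": ["a monkey rides a horse"], "video1": []}
references = {"video0": ["a monkey is riding on a horse", "a dog runs"]}

print(missing_videos(predictions, ["video0", "video1"]))
print(best_reference_score(predictions["video0"][0], references["video0"]))
```

Running a coverage check like `missing_videos` before submitting helps catch test videos that were left without a caption.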


This monkey on the back of horse

Disney made the best cake of all time using projection

The dry driver returns to his car and presents his mate with kebab

Tiny squid flopping around on the rocky bottom of fish tank

Important Dates

· March 10, 2020: Web Site and Call for Participation Ready
· March 31, 2020: Dataset available for download (pre-training, training, and validation set)
· June 15, 2020 (originally June 1, 2020): Test set available for download
· July 6, 2020 (originally June 20, 2020): Evaluation server closes; one-page report submission due
· July 8, 2020 (originally June 24, 2020): Evaluation results announced
· July 13, 2020 (originally June 29, 2020): ACM Multimedia 2020 Grand Challenge paper submission