Task Description
This year we focus on visual-language pre-training for the downstream task of video captioning. Given automatically collected GIF videos and their corresponding captions, the goal of visual-language pre-training is to learn a generic representation or structure that better reflects the cross-modal interaction between visual content and textual sentences. The learned representation or structure is then adapted to facilitate the downstream task of video captioning, i.e., describing video content with a complete and natural sentence.
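The Challenge does not prescribe a particular pre-training objective. As one illustration of learning such a cross-modal representation, the sketch below pairs pooled GIF frame features with caption encodings under a contrastive (CLIP-style) alignment loss in PyTorch; all module choices, dimensions, and names are assumptions for illustration, not an official baseline.

    # Minimal sketch of one possible pre-training objective (contrastive
    # video-text alignment). Not mandated by the Challenge; dimensions,
    # modules, and names are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossModalEncoder(nn.Module):
        def __init__(self, frame_dim=2048, vocab_size=30522, embed_dim=512):
            super().__init__()
            self.video_proj = nn.Linear(frame_dim, embed_dim)  # frame features -> joint space
            self.text_embed = nn.Embedding(vocab_size, embed_dim)
            self.text_rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)

        def forward(self, frame_feats, token_ids):
            # frame_feats: (B, T, frame_dim) pre-extracted GIF frame features
            # token_ids:   (B, L) caption token ids
            v = self.video_proj(frame_feats).mean(dim=1)       # (B, D) temporal mean pooling
            _, h = self.text_rnn(self.text_embed(token_ids))
            t = h[-1]                                          # (B, D) final hidden state
            return F.normalize(v, dim=-1), F.normalize(t, dim=-1)

    def alignment_loss(v, t, temperature=0.07):
        # Symmetric InfoNCE: matched GIF-caption pairs in a batch are positives.
        logits = v @ t.t() / temperature
        labels = torch.arange(v.size(0), device=v.device)
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

    # Toy usage with random tensors standing in for a real batch.
    model = CrossModalEncoder()
    v, t = model(torch.randn(4, 8, 2048), torch.randint(0, 30522, (4, 12)))
    loss = alignment_loss(v, t)
    loss.backward()

After pre-training, the encoder (or the structure it induces) would be fine-tuned on the downstream captioning data; that adaptation step is left out of the sketch.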
The contestants are asked to develop a video captioning system based on the Auto-captions on GIF dataset provided by the Challenge (as pre-training data) and the public MSR-VTT benchmark (as training data for the downstream task). For evaluation purposes, each contesting system must produce at least one sentence for each test video. Accuracy will be evaluated against human-generated reference sentence(s) during the evaluation stage.
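The official scorer is not reproduced here; video captioning is commonly evaluated with reference-based metrics such as BLEU, METEOR, and CIDEr. As a hedged illustration of scoring system captions against human references, the snippet below computes corpus-level BLEU-4 with NLTK on invented tokenized sentences; it stands in for, and is not, the Challenge's evaluation.

    # Hedged sketch of scoring generated captions against human references.
    # Only BLEU-4 via NLTK is shown; the data below is hypothetical.
    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    # One entry per test video: a list of tokenized human references and
    # a single tokenized system caption (both invented for illustration).
    references = [
        [["a", "dog", "runs", "on", "the", "beach"],
         ["a", "dog", "is", "running", "along", "a", "beach"]],
    ]
    hypotheses = [
        ["a", "dog", "runs", "along", "the", "beach"],
    ]

    smooth = SmoothingFunction().method1  # avoids zero scores on short corpora
    bleu4 = corpus_bleu(references, hypotheses, smoothing_function=smooth)
    print(f"BLEU-4: {bleu4:.4f}")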