Pre-training for Video Understanding Challenge

Details

In the first track, pre-training for video captioning, we have released the ACTION dataset for vision-language pre-training. The table below shows the statistics of the ACTION dataset:

Dataset	Context	Source	#Video	#Sentence	#Word	Vocabulary
ACTION	multi-category	Automatic crawling from web	213,078	224,989	2,291,565	50,039
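
As a quick sanity check on the table above, the following minimal sketch derives two summary ratios from the listed counts. The variable names are illustrative assumptions and are not part of any official toolkit; only the four numbers are taken from the table.

```python
# Counts copied from the ACTION row of the table above.
num_videos = 213_078
num_sentences = 224_989
num_words = 2_291_565
vocabulary_size = 50_039

# Derived ratios (illustrative only, not official statistics).
avg_words_per_sentence = num_words / num_sentences
avg_sentences_per_video = num_sentences / num_videos

print(f"Average words per sentence: {avg_words_per_sentence:.2f}")
print(f"Average sentences per video: {avg_sentences_per_video:.2f}")
```

Under these counts, captions average roughly ten words each, and there is slightly more than one sentence per video.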

To formalize the task of pre-training for video captioning, we provide three datasets to the participants:
• A pre-training dataset of ~220K GIF videos in ACTION. Each GIF video is equipped with one caption.
• A training dataset of ~9.5K videos in MSR-VTT. Each video is annotated with 20 captions.
• A validation dataset of ~0.5K videos in MSR-VTT. Each video is annotated with 20 captions.
In addition to the datasets above, we will include a testing set.
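
To make the relative scale of the three captioning splits concrete, here is a rough sketch that multiplies the approximate video counts by the stated captions per video. The split names and the dictionary layout are hypothetical; the figures are the rounded values from the list above, so the results are estimates rather than exact counts.

```python
# Approximate (videos, captions per video) for each split, as stated above.
# Split names are hypothetical labels chosen for this sketch.
splits = {
    "pretrain_action": (220_000, 1),
    "train_msrvtt": (9_500, 20),
    "val_msrvtt": (500, 20),
}

# Estimated number of video-caption pairs per split.
caption_counts = {
    name: videos * captions_per_video
    for name, (videos, captions_per_video) in splits.items()
}

for name, count in caption_counts.items():
    print(f"{name}: ~{count:,} video-caption pairs")
```

By this estimate, the pre-training split and the MSR-VTT training split contain a comparable number of video-caption pairs, even though the former covers far more videos.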

In the second track, pre-training for video categorization, we are finalizing the pre-training video dataset (the Weakly-Supervised dataset). The table below shows the statistics of the Weakly-Supervised dataset:

Dataset	Facet	Source	Video Title	#Category	#Video
the Weakly-Supervised dataset	multi-faceted	Automatic crawling from web	√	2,015	2,958,092

To formalize the task of pre-training for video categorization, we provide four datasets to the participants:
• A pre-training dataset of ~3M videos in the Weakly-Supervised dataset.
• A training dataset of ~50K videos in Downstream.
• A validation dataset of ~25K videos in Downstream.
• A testing dataset of ~30K videos in Downstream.