Details

In the first track, pre-training for video captioning, we have released the ACTION dataset for vision-language pre-training. The table below shows the statistics of the ACTION dataset:

Dataset | Context | Source | #Video | #Sentence | #Word | Vocabulary
ACTION | multi-category | Automatic crawling from web | 213,078 | 224,989 | 2,291,565 | 50,039
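
As a quick sanity check on the table above, a few ratios can be derived from the raw counts. The snippet below is only an illustrative sketch: the constant names and the `derived_stats` helper are hypothetical and not part of any released toolkit.

```python
# Counts copied from the ACTION statistics table above.
NUM_VIDEOS = 213_078
NUM_SENTENCES = 224_989
NUM_WORDS = 2_291_565
VOCAB_SIZE = 50_039


def derived_stats(videos, sentences, words, vocab):
    """Return simple ratios computed from the raw dataset counts."""
    return {
        "sentences_per_video": sentences / videos,
        "words_per_sentence": words / sentences,
        "words_per_vocab_entry": words / vocab,
    }


if __name__ == "__main__":
    stats = derived_stats(NUM_VIDEOS, NUM_SENTENCES, NUM_WORDS, VOCAB_SIZE)
    for name, value in stats.items():
        print(f"{name}: {value:.2f}")
```

With these counts, the averages come out to roughly 1.06 sentences per video and about 10.2 words per sentence.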

To formalize the task of pre-training for video captioning, we provide three datasets to the participants:
   • A pre-training dataset of ~220K GIF videos in ACTION. Each GIF video is paired with one caption.
   • A training dataset of ~9.5K videos in MSR-VTT. Each video is annotated with 20 captions.
   • A validation dataset of ~0.5K videos in MSR-VTT. Each video is annotated with 20 captions.
In addition to the datasets above, we will include a testing set.
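
To make the relative scale of the Track 1 splits concrete, the sketch below records the approximate sizes stated above and multiplies them out into video-caption pairs. The `Split` class and `TRACK1_SPLITS` list are hypothetical names used only for illustration, the video counts are the rounded figures from the text, and the testing set is omitted because its size is not given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    name: str
    source: str
    num_videos: int          # approximate, as stated in the text
    captions_per_video: int

    @property
    def num_pairs(self) -> int:
        """Approximate number of (video, caption) pairs in this split."""
        return self.num_videos * self.captions_per_video


# Hypothetical summary of the three Track 1 datasets described above.
TRACK1_SPLITS = [
    Split("pretrain", "ACTION", 220_000, 1),
    Split("train", "MSR-VTT", 9_500, 20),
    Split("val", "MSR-VTT", 500, 20),
]

if __name__ == "__main__":
    for split in TRACK1_SPLITS:
        print(f"{split.name:<9} {split.source:<8} ~{split.num_pairs:,} pairs")
```

Under these rounded figures, the pre-training set and the MSR-VTT training set contain a comparable number of video-caption pairs, even though ACTION has far more distinct videos.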

In the second track, pre-training for video categorization, we are finalizing the pre-training video dataset (the Weakly-Supervised dataset). The statistics of the Weakly-Supervised dataset are as follows:


Dataset | Facet | Source | Video Title | #Category | #Video
the Weakly-Supervised dataset | multi-faceted | Automatic crawling from web | | 2,015 | 2,958,092

To formalize the task of pre-training for video categorization, we provide four datasets to the participants:
   • A pre-training dataset of ~3M videos in the Weakly-Supervised dataset.
   • A training dataset of ~50K videos in Downstream.
   • A validation dataset of ~25K videos in Downstream.
   • A testing dataset of ~30K videos in Downstream.
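
The four Track 2 datasets listed above differ widely in size. As a rough illustration, the sketch below computes how the Downstream videos divide among training, validation, and testing. The dictionary and function names are hypothetical, and the sizes are the rounded figures from the list.

```python
# Approximate split sizes taken from the list above.
TRACK2_SPLITS = {
    "pretrain": 3_000_000,  # Weakly-Supervised dataset
    "train": 50_000,        # Downstream
    "val": 25_000,          # Downstream
    "test": 30_000,         # Downstream
}


def downstream_fractions(splits):
    """Return each Downstream split's share of all Downstream videos."""
    downstream = {k: v for k, v in splits.items() if k != "pretrain"}
    total = sum(downstream.values())
    return {k: v / total for k, v in downstream.items()}


if __name__ == "__main__":
    for name, frac in downstream_fractions(TRACK2_SPLITS).items():
        print(f"{name}: {frac:.1%}")
    ratio = TRACK2_SPLITS["pretrain"] / TRACK2_SPLITS["train"]
    print(f"pre-training videos per Downstream training video: {ratio:.0f}")
```

With these rounded sizes, the pre-training set is about 60 times larger than the Downstream training set.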

Downloads

Track 1: The video-sentence pairs in ACTION are here.

Track 2: The 3M videos in the Weakly-Supervised dataset are here, and the training and validation videos in Downstream are here.