Details
In the first track, pre-training for video captioning, we have released the ACTION dataset for vision-language pre-training. The table below shows the statistics of ACTION:
| Dataset | Context | Source | #Video | #Sentence | #Word | Vocabulary |
|---------|---------|--------|--------|-----------|-------|------------|
| ACTION | multi-category | Automatic crawling from web | 213,078 | 224,989 | 2,291,565 | 50,039 |
To formalize the task of pre-training for video captioning, we provide three datasets to the participants:
• A pre-training dataset of ~220K GIF videos in ACTION. Each GIF video is paired with one caption.
• A training dataset of ~9.5K videos in MSR-VTT. Each video is annotated with 20 captions.
• A validation dataset of ~0.5K videos in MSR-VTT. Each video is annotated with 20 captions.
In addition to the datasets above, we will also release a testing set.
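For reference, caption pairs of this form can be loaded with a few lines of Python. The sketch below assumes the annotations are distributed as a tab-separated file of (video_id, caption) rows; the file layout and all names here are illustrative, not the official release format.

```python
import csv
from dataclasses import dataclass
from typing import List

@dataclass
class CaptionPair:
    video_id: str  # identifier of a GIF video in the pre-training set
    caption: str   # the single caption paired with that video

def load_caption_pairs(tsv_path: str) -> List[CaptionPair]:
    """Read (video_id, caption) rows from a tab-separated annotation file.

    The TSV layout is an assumption for illustration; consult the
    released annotation files for the actual format.
    """
    pairs = []
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            if len(row) >= 2:  # skip malformed rows
                pairs.append(CaptionPair(video_id=row[0], caption=row[1]))
    return pairs
```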
In the second track, pre-training for video categorization, we are finalizing the pre-training video dataset (the Weakly-Supervised dataset) and will make it publicly available in March 2021. The table below shows the statistics of the Weakly-Supervised dataset:
| Dataset | Facet | Source | Video Title | #Category | #Video |
|---------|-------|--------|-------------|-----------|--------|
| Weakly-Supervised dataset | multi-faceted | Automatic crawling from web | ✓ | 2,015 | 2,958,092 |
To formalize the task of pre-training for video categorization, we provide four datasets to the participants:
• A pre-training dataset of ~3M videos in the Weakly-Supervised dataset.
• A training dataset of ~50K videos in Downstream.
• A validation dataset of ~25K videos in Downstream.
• A testing dataset of ~25K videos in Downstream.
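As a starting point, the sketch below groups the weakly labeled videos by category. It assumes an illustrative CSV index with video_id, title, and category columns; the actual release format may differ.

```python
import csv
from collections import defaultdict
from typing import Dict, List

def group_videos_by_category(csv_path: str) -> Dict[str, List[str]]:
    """Group video ids by their weak category label.

    Assumes an illustrative CSV with columns: video_id, title, category.
    The labels come from automatic web crawling, so some noise is expected.
    """
    by_category: Dict[str, List[str]] = defaultdict(list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            by_category[row["category"]].append(row["video_id"])
    return dict(by_category)

# Example: sanity-check an index against the statistics above.
# index = group_videos_by_category("weakly_supervised_index.csv")
# assert len(index) <= 2015  # at most 2,015 distinct categories
```

Grouping by label in this way also makes it easy to inspect per-category video counts before training.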