Abstract


Pioneering efforts have been dedicated to content-oriented video captioning, which generates relevant sentences to describe the visual content of a given video from the producer's perspective. By contrast, this work targets search-oriented captioning, which summarizes a given video by generating query-like sentences from the consumer's angle. Beyond relevance, diversity is vital for characterizing consumers' search intentions from different aspects. Towards this end, we devise a large-scale multimodal pre-training network regularized by five tasks to strengthen the downstream video representation, trained over our collected 11M micro-videos. Thereafter, we present a flow-based diverse captioning model that generates different captions reflecting consumers' search demands. This model is optimized via a reconstruction loss and a KL divergence between the prior and the posterior. We evaluate our model on our constructed golden dataset comprising 690k <query, micro-video> pairs, and experimental results demonstrate its superiority.
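
The abstract mentions that the diverse captioning model is optimized with a reconstruction loss plus a KL divergence between a prior and a posterior over a latent code. The snippet below is only a minimal sketch of that objective shape for a latent-variable captioner with diagonal Gaussian prior and posterior; the function and tensor names are assumptions, and it does not reproduce the flow-based components of the actual model.

```python
import torch
import torch.nn.functional as F

def latent_caption_loss(logits, target_tokens,
                        post_mu, post_logvar,
                        prior_mu, prior_logvar,
                        kl_weight=1.0, pad_id=0):
    """Reconstruction loss + KL(posterior || prior) for a latent-variable captioner.

    Illustrative shapes: logits [B, T, V], target_tokens [B, T],
    Gaussian parameters [B, D]. All names here are assumptions.
    """
    # Token-level reconstruction: cross-entropy against the target caption.
    recon = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_tokens.reshape(-1),
        ignore_index=pad_id,
    )
    # Closed-form KL between two diagonal Gaussians: KL(q(z | video, caption) || p(z | video)).
    kl = 0.5 * (
        prior_logvar - post_logvar
        + (post_logvar.exp() + (post_mu - prior_mu).pow(2)) / prior_logvar.exp()
        - 1.0
    ).sum(dim=-1).mean()
    return recon + kl_weight * kl
```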

Datasets


KwaiSVC-222k

KwaiSVC-222k is a golden dataset for search-oriented micro-video captioning. It is built from users' video search behavior on the Kuaishou micro-video platform. Specifically, we filter search logs of query-click behavior to obtain high-quality <query, micro-video> pairs. The filter rules are based on video view count, click-through rate, and play completion rate.
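
As a rough illustration of the filtering step described above, the sketch below keeps a search-log entry only if it passes thresholds on the three signals named in this section. The field names and threshold values are made-up assumptions, not the actual rules used to build KwaiSVC-222k.

```python
# Illustrative thresholds only; the real filter rules are not disclosed here.
MIN_VIEW_COUNT = 1_000
MIN_CLICK_THROUGH_RATE = 0.05
MIN_PLAY_COMPLETION_RATE = 0.50

def keep_pair(entry: dict) -> bool:
    """Keep a <query, micro-video> pair if its engagement signals pass all thresholds."""
    return (
        entry["view_count"] >= MIN_VIEW_COUNT
        and entry["click_through_rate"] >= MIN_CLICK_THROUGH_RATE
        and entry["play_completion_rate"] >= MIN_PLAY_COMPLETION_RATE
    )

# Toy search-log entries (hypothetical schema).
search_logs = [
    {"query": "cat playing piano", "video_id": "v1", "view_count": 5_200,
     "click_through_rate": 0.12, "play_completion_rate": 0.81},
    {"query": "cat playing piano", "video_id": "v2", "view_count": 90,
     "click_through_rate": 0.02, "play_completion_rate": 0.35},
]
pairs = [(e["query"], e["video_id"]) for e in search_logs if keep_pair(e)]
print(pairs)  # [('cat playing piano', 'v1')]
```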

Baidu Cloud (password: ihc2)
KwaiSVC-222k dataset.

KwaiSVC-11M

KwaiSVC-11M is a large multimodal pre-training dataset collected to address the multimodal representation learning challenge. Based on this dataset, we devise a large-scale Multimodal prE-training nEtwork (MEEK), which improves captioning performance. The dataset is constructed in the same way as KwaiSVC-222k, except that the filter rules are relaxed to obtain more data.
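
MEEK is described as being regularized by five pre-training tasks, but the tasks themselves are not listed on this page. Purely as a sketch of how such multi-task pre-training objectives are commonly combined, the snippet below sums five placeholder task losses with tunable weights; the task names and weights are assumptions, not MEEK's actual tasks.

```python
import torch

def multitask_pretraining_loss(task_losses: dict, task_weights: dict = None) -> torch.Tensor:
    """Weighted sum of per-task losses; defaults to equal weights."""
    task_weights = task_weights or {name: 1.0 for name in task_losses}
    return sum(task_weights[name] * loss for name, loss in task_losses.items())

# Five placeholder tasks (names are hypothetical).
losses = {f"pretrain_task_{i}": torch.rand(()) for i in range(1, 6)}
total_loss = multitask_pretraining_loss(losses)
```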

Due to copyright and privacy issues, this dataset is not available to the public.

Dataset statistics

Comparison among existing video captioning datasets:

Dataset        Clips  Captions  Pairs  Purpose           Categories
MSVD           2k     80k       80k    content-oriented  -
MSR-VTT        10k    200k      200k   content-oriented  20
KwaiSVC-222k   222k   144k      690k   search-oriented   32
KwaiSVC-11M    11M    4M        35M    search-oriented   35


Code


Performance Comparison for Search-oriented Micro-video Captioning
Diversity metrics: mB4, U; Relevance metrics: B1-B4, R, C; plus the R/D score.

Model        mB4    U       B1     B2     B3     B4     R      C      R/D
CVAE         0.917  3.83%   0.861  0.822  0.781  0.747  0.811  2.950  0.815
AG-CVAE      0.845  8.70%   0.860  0.822  0.781  0.745  0.818  2.950  0.882
DCM          0.437  73.50%  0.666  0.555  0.457  0.378  0.606  1.710  0.865
POS          0.953  2.33%   0.855  0.816  0.773  0.738  0.804  2.940  0.774
Seq-CVAE     0.780  16.40%  0.845  0.803  0.757  0.719  0.786  2.880  0.922
FLIP (Ours)  0.692  23.20%  0.854  0.813  0.770  0.733  0.800  2.890  1.059
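
For reference, the Diversity columns can be computed along these lines, assuming mB4 denotes mBLEU-4 (the average BLEU-4 of each generated caption against the other captions produced for the same video; lower is more diverse) and U the fraction of distinct captions in the generated set. This interpretation and the helper below are assumptions, not the exact evaluation script.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def diversity_metrics(captions):
    """Return (mBLEU-4, uniqueness ratio) for the captions generated for one video."""
    smooth = SmoothingFunction().method1
    tokenized = [c.split() for c in captions]
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = [t for j, t in enumerate(tokenized) if j != i]
        scores.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
    mb4 = sum(scores) / len(scores)
    uniq = len(set(captions)) / len(captions)  # report as a percentage for the U column
    return mb4, uniq

print(diversity_metrics([
    "cute cat playing the piano",
    "a cat plays piano music",
    "cute cat playing the piano",
]))
```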

Baidu Cloud (password: v5r4)
Code & checkpoints: pretraining (MEEK), diverse captioning (FLIP), and baseline models.

Paper


Liqiang Nie, Leigang Qu, Dai Meng, Min Zhang, Qi Tian, Alberto Del Bimbo. Search-oriented Micro-video Captioning. ACM MM 2022 (Best Paper). PDF