Abstract


Pioneering efforts have been dedicated to content-oriented video captioning, which generates relevant sentences to describe the visual content of a given video from the producer's perspective. By contrast, this work targets search-oriented captioning, which summarizes a given video by generating query-like sentences from the consumer's angle. Beyond relevance, diversity is vital for characterizing consumers' search intentions from different aspects. Towards this end, we devise a large-scale multimodal pre-training network regularized by five tasks to strengthen the downstream video representation, trained over our collected 11M micro-videos. Thereafter, we present a flow-based diverse captioning model that generates different captions reflecting consumers' search demands. This model is optimized via a reconstruction loss and a KL divergence between the prior and the posterior. We evaluate our model on our constructed golden dataset comprising 690k <query, micro-video> pairs, and experimental results demonstrate its superiority.
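
The abstract mentions that the diverse captioning model is optimized with a reconstruction loss plus a KL divergence between a prior and a posterior over a latent code. The snippet below is only a minimal sketch of that objective shape for a latent-variable captioner with diagonal Gaussian prior and posterior; the function and tensor names are assumptions, and it does not reproduce the flow-based components of the actual model.

```python
import torch
import torch.nn.functional as F

def latent_caption_loss(logits, target_tokens,
                        post_mu, post_logvar,
                        prior_mu, prior_logvar,
                        kl_weight=1.0, pad_id=0):
    """Reconstruction loss + KL(posterior || prior) for a latent-variable captioner.

    Illustrative shapes: logits [B, T, V], target_tokens [B, T],
    Gaussian parameters [B, D]. All names here are assumptions.
    """
    # Token-level reconstruction: cross-entropy against the target caption.
    recon = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_tokens.reshape(-1),
        ignore_index=pad_id,
    )
    # Closed-form KL between two diagonal Gaussians: KL(q(z | video, caption) || p(z | video)).
    kl = 0.5 * (
        prior_logvar - post_logvar
        + (post_logvar.exp() + (post_mu - prior_mu).pow(2)) / prior_logvar.exp()
        - 1.0
    ).sum(dim=-1).mean()
    return recon + kl_weight * kl
```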

Datasets


KwaiSVC-222k

KwaiSVC-222k is a golden dataset for search-oriented micro-video captioning. It is built from users' video search behavior on the Kuaishou micro-video platform. Specifically, we filter search logs of query-click behavior to obtain high-quality <query, micro-video> pairs. The filter rules are based on video view count, click-through rate, and play completion rate.
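
As a rough illustration of the filtering step described above, the sketch below keeps a search-log entry only if it passes thresholds on the three signals named in this section. The field names and threshold values are made-up assumptions, not the actual rules used to build KwaiSVC-222k.

```python
# Illustrative thresholds only; the real filter rules are not disclosed here.
MIN_VIEW_COUNT = 1_000
MIN_CLICK_THROUGH_RATE = 0.05
MIN_PLAY_COMPLETION_RATE = 0.50

def keep_pair(entry: dict) -> bool:
    """Keep a <query, micro-video> pair if its engagement signals pass all thresholds."""
    return (
        entry["view_count"] >= MIN_VIEW_COUNT
        and entry["click_through_rate"] >= MIN_CLICK_THROUGH_RATE
        and entry["play_completion_rate"] >= MIN_PLAY_COMPLETION_RATE
    )

# Toy search-log entries (hypothetical schema).
search_logs = [
    {"query": "cat playing piano", "video_id": "v1", "view_count": 5_200,
     "click_through_rate": 0.12, "play_completion_rate": 0.81},
    {"query": "cat playing piano", "video_id": "v2", "view_count": 90,
     "click_through_rate": 0.02, "play_completion_rate": 0.35},
]
pairs = [(e["query"], e["video_id"]) for e in search_logs if keep_pair(e)]
print(pairs)  # [('cat playing piano', 'v1')]
```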

Baidu Cloud (password: ihc2)
KwaiSVC-222k dataset.

KwaiSVC-11M

KwaiSVC-11M is a large multimodal pre-training dataset collected to address the multimodal representation learning challenge. Based on this dataset, we devise a large-scale Multimodal prE-training nEtwork (MEEK), which improves captioning performance. The dataset is constructed in the same way as KwaiSVC-222k, except that the filter rules are relaxed to obtain more data.
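
MEEK is described as being regularized by five pre-training tasks, but the tasks themselves are not listed on this page. Purely as a sketch of how such multi-task pre-training objectives are commonly combined, the snippet below sums five placeholder task losses with tunable weights; the task names and weights are assumptions, not MEEK's actual tasks.

```python
import torch

def multitask_pretraining_loss(task_losses: dict, task_weights: dict = None) -> torch.Tensor:
    """Weighted sum of per-task losses; defaults to equal weights."""
    task_weights = task_weights or {name: 1.0 for name in task_losses}
    return sum(task_weights[name] * loss for name, loss in task_losses.items())

# Five placeholder tasks (names are hypothetical).
losses = {f"pretrain_task_{i}": torch.rand(()) for i in range(1, 6)}
total_loss = multitask_pretraining_loss(losses)
```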

Due to copyright and privacy issues, this dataset is not available to the public.

Dataset statistics

Comparison among existing video captioning datasets:

Dataset        Clips  Captions  Pairs  Purpose           Categories
MSVD           2k     80k       80k    content-oriented  -
MSR-VTT        10k    200k      200k   content-oriented  20
KwaiSVC-222k   222k   144k      690k   search-oriented   32
KwaiSVC-11M    11M    4M        35M    search-oriented   35


Code


Performance Comparison for Search-oriented Micro-video Captioning
Diversity metrics: mB4, U; Relevance metrics: B1-B4, R, C; plus the R/D score.

Model        mB4    U       B1     B2     B3     B4     R      C      R/D
CVAE         0.917  3.83%   0.861  0.822  0.781  0.747  0.811  2.950  0.815
AG-CVAE      0.845  8.70%   0.860  0.822  0.781  0.745  0.818  2.950  0.882
DCM          0.437  73.50%  0.666  0.555  0.457  0.378  0.606  1.710  0.865
POS          0.953  2.33%   0.855  0.816  0.773  0.738  0.804  2.940  0.774
Seq-CVAE     0.780  16.40%  0.845  0.803  0.757  0.719  0.786  2.880  0.922
FLIP (Ours)  0.692  23.20%  0.854  0.813  0.770  0.733  0.800  2.890  1.059
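
For reference, the Diversity columns can be computed along these lines, assuming mB4 denotes mBLEU-4 (the average BLEU-4 of each generated caption against the other captions produced for the same video; lower is more diverse) and U the fraction of distinct captions in the generated set. This interpretation and the helper below are assumptions, not the exact evaluation script.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def diversity_metrics(captions):
    """Return (mBLEU-4, uniqueness ratio) for the captions generated for one video."""
    smooth = SmoothingFunction().method1
    tokenized = [c.split() for c in captions]
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = [t for j, t in enumerate(tokenized) if j != i]
        scores.append(sentence_bleu(refs, hyp, smoothing_function=smooth))
    mb4 = sum(scores) / len(scores)
    uniq = len(set(captions)) / len(captions)  # report as a percentage for the U column
    return mb4, uniq

print(diversity_metrics([
    "cute cat playing the piano",
    "a cat plays piano music",
    "cute cat playing the piano",
]))
```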

Baidu Cloud (password: v5r4)
Code & checkpoints: pretraining (MEEK), diverse captioning (FLIP), and baseline models.

Paper


Liqiang Nie, Leigang Qu, Dai Meng, Min Zhang, Qi Tian, Alberto Del Bimbo. Search-oriented Micro-video Captioning. ACM MM 2022 (Best Paper). PDF