Abstract
Pioneering efforts have been dedicated to content-oriented video
captioning, which generates relevant sentences to describe the visual
content of a given video from the producer's perspective. By contrast,
this work targets search-oriented captioning, which summarizes
a given video by generating query-like sentences from the consumer's
angle. Beyond relevance, diversity is vital for characterizing
consumers' search intentions from different aspects. Towards this
end, we devise a large-scale multimodal pre-training network regularized
by five tasks to strengthen the downstream video representation,
trained over our collected 11M micro-videos.
Thereafter, we present a flow-based diverse captioning model that
generates different captions according to consumers' search demands. This
model is optimized via a reconstruction loss and a KL divergence
between the prior and the posterior. We justify our model over our
constructed golden dataset comprising 690k <query, micro-video>
pairs, and the experimental results demonstrate its superiority.
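To make the training objective concrete, below is a minimal, hypothetical PyTorch sketch of a reconstruction-plus-KL loss of the kind described above. All names are ours, not the released code, and for simplicity the prior and posterior are taken as diagonal Gaussians with a closed-form KL; in the actual flow-based model the densities would instead be computed through the flow's change of variables.

```python
# Hypothetical sketch (names are ours, not the released code) of a
# reconstruction + KL training objective for latent-variable captioning.
import torch.nn.functional as F

def caption_loss(logits, target_ids, post_mu, post_logvar,
                 prior_mu, prior_logvar, kl_weight=1.0):
    """logits: (B, T, V) decoder outputs; target_ids: (B, T) gold caption
    tokens; the remaining tensors parameterize the diagonal-Gaussian
    posterior q(z | video, caption) and prior p(z | video)."""
    # Reconstruction term: token-level cross-entropy against the gold caption.
    recon = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1)
    )
    # Closed-form KL(q || p) between two diagonal Gaussians.
    kl = 0.5 * (
        prior_logvar - post_logvar
        + (post_logvar.exp() + (post_mu - prior_mu) ** 2) / prior_logvar.exp()
        - 1.0
    ).sum(dim=-1).mean()
    return recon + kl_weight * kl
```

In practice, the KL weight is often annealed from 0 toward 1 during training to avoid posterior collapse; whether the released code does so is not stated here.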
Datasets

KwaiSVC-222k
KwaiSVC-222k is a golden dataset for search-oriented micro-video captioning.
It is built from users' video search behavior on the Kuaishou micro-video platform.
Specifically, we filter query-click search logs to
obtain high-quality <query, micro-video> pairs. The filtering rules
are based on video view count, click-through rate, and play completion rate.
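As an illustration only, such rule-based filtering might look like the following sketch; the field names and thresholds are hypothetical, since the exact values used to build KwaiSVC-222k are not published here.

```python
# Illustrative sketch of the log-filtering step. Field names and thresholds
# are hypothetical; the exact rules used for KwaiSVC-222k are not published here.
MIN_VIEWS = 1_000       # assumed minimum video view count
MIN_CTR = 0.05          # assumed minimum click-through rate
MIN_COMPLETION = 0.50   # assumed minimum play completion rate

def filter_pairs(search_logs):
    """search_logs: iterable of dicts, one per query-click record."""
    return [
        (r["query"], r["video_id"])
        for r in search_logs
        if r["view_count"] >= MIN_VIEWS
        and r["click_through_rate"] >= MIN_CTR
        and r["play_completion_rate"] >= MIN_COMPLETION
    ]
```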
Baidu Cloud (password: ihc2)
KwaiSVC-222k dataset.

KwaiSVC-11M
KwaiSVC-11M is a large multimodal pre-training dataset collected to address the multimodal representation learning challenge.
Based on this dataset, we devise a large-scale Multimodal prE-training nEtwork (MEEK), which improves captioning performance (a generic sketch of the multi-task pre-training loss is given below).
This dataset is constructed in the same way as KwaiSVC-222k; the only difference is that we relax the filtering rules to obtain more data.
Due to copyright and privacy issues, this dataset is not available to the public.
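MEEK's five regularization tasks are defined in the paper; as a generic, hypothetical sketch of how multi-task pre-training combines them, the total loss can be expressed as a weighted sum of per-task losses:

```python
# Generic sketch of multi-task pre-training: a shared encoder is regularized
# by several task heads, and the total loss is a weighted sum of per-task
# losses. Task names and weights are placeholders, not MEEK's actual five tasks.
def pretrain_loss(task_losses, weights=None):
    """task_losses: dict mapping task name -> scalar loss tensor."""
    if weights is None:
        weights = {name: 1.0 for name in task_losses}
    return sum(weights[name] * loss for name, loss in task_losses.items())
```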
Dataset statistics
Code
Baidu Cloud (password: v5r4)
Code & checkpoints: pretraining (MEEK), diverse captioning (FLIP), and baseline models.

Paper
Liqiang Nie, Leigang Qu, Dai Meng, Min Zhang, Qi Tian, Alberto Del Bimbo.
Search-oriented Micro-video Captioning. ACM MM 2022 (Best Paper Award). PDF