Abstract
Pioneering efforts have been dedicated to content-oriented video captioning, which generates relevant sentences describing the visual content of a given video from the producer's perspective. By contrast, this work targets search-oriented captioning, which summarizes the given video by generating query-like sentences from the consumer's angle. Beyond relevance, diversity is vital for characterizing consumers' seeking intentions from different aspects. Towards this end, we devise a large-scale multimodal pre-training network regularized by five tasks to strengthen the downstream video representation, and train it over our collected 11M micro-videos. Thereafter, we present a flow-based diverse captioning model that generates different captions reflecting consumers' search demands. This model is optimized via a reconstruction loss and a KL divergence between the prior and the posterior. We justify our model on our constructed golden dataset comprising 690k <query, micro-video> pairs, and the experimental results demonstrate its superiority.
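To make the stated objective concrete, below is a minimal PyTorch sketch of a loss with this shape: a token-level reconstruction term for the generated caption plus a KL divergence between a prior p(z | video) and a posterior q(z | video, query). All names, shapes, and distribution choices here are illustrative assumptions; the paper's flow-based parameterization and loss weighting may differ.

```python
import torch
import torch.nn.functional as F

def diverse_captioning_loss(logits, target_ids, posterior, prior, kl_weight=1.0):
    """Sketch of a reconstruction + KL objective (names are assumptions).

    logits:     (batch, seq_len, vocab) decoder outputs for the caption
    target_ids: (batch, seq_len) ground-truth query tokens
    posterior:  q(z | video, query) as a torch.distributions.Distribution
    prior:      p(z | video) as a torch.distributions.Distribution
    """
    # Reconstruction: cross-entropy between predicted and target tokens.
    recon = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=0,  # assumed padding token id
    )
    # Regularizer: KL(posterior || prior) over the latent code z.
    kl = torch.distributions.kl_divergence(posterior, prior).mean()
    return recon + kl_weight * kl
```

At inference time, sampling several latent codes from the prior and decoding each one is the standard way such a model produces multiple diverse captions for the same video.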
Datasets

KwaiSVC-222k
KwaiSVC-222k is a golden dataset for search-oriented micro-video captioning. It is based on users' video search behavior on the Kuaishou micro-video platform. Specifically, we filter search logs of query-click behavior to obtain high-quality <query, micro-video> pairs. The filtering rules are based on video view count, click-through rate, and play completion rate.
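For illustration, a filter of this kind could look like the sketch below; the field names and thresholds are assumptions, not the actual values used to build KwaiSVC-222k.

```python
# Assumed thresholds; the real values behind KwaiSVC-222k are not disclosed.
MIN_VIEW_COUNT = 1_000
MIN_CLICK_THROUGH_RATE = 0.05
MIN_PLAY_COMPLETION_RATE = 0.30

def is_high_quality(entry: dict) -> bool:
    """Check one query-click log entry against the three filter rules."""
    return (
        entry["view_count"] >= MIN_VIEW_COUNT
        and entry["click_through_rate"] >= MIN_CLICK_THROUGH_RATE
        and entry["play_completion_rate"] >= MIN_PLAY_COMPLETION_RATE
    )

def extract_pairs(search_logs):
    """Keep only <query, micro-video> pairs that pass all rules."""
    return [(e["query"], e["video_id"]) for e in search_logs if is_high_quality(e)]
```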
Baidu Cloud (password: ihc2)
KwaiSVC-222k dataset.

KwaiSVC-11M
KwaiSVC-11M is a large multimodal pre-training dataset collected to address the multimodal representation learning challenge. Based on this dataset, we devise a large-scale Multimodal prE-training nEtwork (MEEK), which improves captioning performance. This dataset is constructed similarly to KwaiSVC-222k; the only difference is that we relax the filtering rules to obtain more data.
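As a rough sketch of how a network regularized by several pre-training tasks is typically optimized, the snippet below sums weighted per-task losses into one objective. The five concrete tasks and their weights are defined in the paper; everything named here is a placeholder.

```python
import torch

def meek_pretraining_loss(task_losses: dict, weights: dict = None) -> torch.Tensor:
    """Combine per-task losses into a single multi-task objective.

    task_losses maps a task name to its scalar loss tensor, e.g. five
    entries for the five pre-training tasks (names are placeholders).
    """
    weights = weights or {name: 1.0 for name in task_losses}
    return sum(weights[name] * loss for name, loss in task_losses.items())
```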
Due to copyright and privacy issues, this dataset is not available to the public.
Dataset statistics
Code
Baidu Cloud (password: v5r4)
Code & checkpoints: pre-training (MEEK), diverse captioning (FLIP), and baseline models.

Paper
Liqiang Nie, Leigang Qu, Dai Meng, Min Zhang, Qi Tian, Alberto Del Bimbo. Search-oriented Micro-video Captioning. ACM MM 2022 (Best Paper Award). PDF