APES: Audiovisual Person Search in Untrimmed Video

Authors: Juan Leon Alcazar, Long Mai, Federico Perazzi, Joon-Young Lee, Pablo Arbeláez, Bernard Ghanem, Fabian Caba Heilbron
Venue: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Humans are arguably among the most important subjects in video streams; many real-world applications, such as video summarization and video editing workflows, require the automatic search and retrieval of a person of interest. Despite tremendous efforts in the person re-identification and retrieval domains, few works have developed audiovisual search strategies. In this paper, we present the Audiovisual Person Search dataset (APES), a new dataset composed of untrimmed videos whose audio…



Person Search in Videos with One Portrait Through Visual and Temporal Links
A novel framework is proposed that takes into account identity invariance along a tracklet, allowing person identities to be propagated via both visual and temporal links; it remarkably outperforms mainstream person re-identification methods.
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.
AVA Active Speaker: An Audio-Visual Dataset for Active Speaker Detection
This paper presents the AVA Active Speaker detection dataset (AVA-ActiveSpeaker), which has been publicly released to facilitate algorithm development and comparison, and introduces a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection, comparing several variants.
iQIYI-VID: A Large Dataset for Multi-modal Person Identification
This paper introduces iQIYI-VID, the largest video dataset for multi-modal person identification, and proposes a multi-modal attention module to fuse multi-modal features that improves person identification considerably.
VoxCeleb: A Large-Scale Speaker Identification Dataset
This paper proposes a fully automated pipeline based on computer vision techniques to create a large-scale text-independent speaker identification dataset collected 'in the wild', and shows that a CNN-based architecture obtains the best performance for both identification and verification.
Self-Supervised Learning of Audio-Visual Objects from Video
This work introduces a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time; it significantly outperforms other self-supervised approaches and obtains performance competitive with methods that use supervised face detection.
MARS: A Video Benchmark for Large-Scale Person Re-Identification
It is shown that a CNN in classification mode can be trained from scratch using the consecutive bounding boxes of each identity, and the learned CNN embedding outperforms other competing methods considerably and has good generalization ability on other video re-identification datasets upon fine-tuning.
Moments in Time Dataset: One Million Videos for Event Understanding
The Moments in Time dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds, can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis.
Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding
This work proposes a novel Hollywood in Homes approach to data collection, producing a new dataset, Charades, with hundreds of people recording videos in their own homes while acting out casual everyday activities, and evaluates and provides baseline results for several tasks, including action recognition and automatic description generation.
Scalable Person Re-identification: A Benchmark
As a minor contribution, inspired by recent advances in large-scale image search, an unsupervised Bag-of-Words descriptor is proposed that yields competitive accuracy on the VIPeR, CUHK03, and Market-1501 datasets, and is scalable on the large-scale 500k dataset.