APES: Audiovisual Person Search in Untrimmed Video

@article{Alcazar2021APESAP,
  title={APES: Audiovisual Person Search in Untrimmed Video},
  author={Juan Leon Alcazar and Long Mai and Federico Perazzi and Joon-Young Lee and Pablo Arbel{\'a}ez and Bernard Ghanem and Fabian Caba Heilbron},
  journal={2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)},
  year={2021},
  pages={1720--1729}
}
Humans are arguably one of the most important subjects in video streams, and many real-world applications, such as video summarization or video editing workflows, require the automatic search and retrieval of a person of interest. Despite tremendous efforts in the person re-identification and retrieval domains, few works have developed audiovisual search strategies. In this paper, we present the Audiovisual Person Search dataset (APES), a new dataset composed of untrimmed videos whose audio…

AVA-AVD: Audio-visual Speaker Diarization in the Wild

The AVA Audio-Visual Relation Network (AVR-Net) is designed, introducing a simple yet effective modality mask to capture discriminative information based on face visibility; the method not only outperforms state-of-the-art methods but is also more robust as the ratio of off-screen speakers varies.

An Efficient Person Search Method Using Spatio-Temporal Features for Surveillance Videos

An efficient person search method for surveillance videos that employs spatio-temporal features: it considers not only the spatial features of persons in each frame but also the temporal relationship of the same person between adjacent frames.

An Unsupervised Person Search Method for Video Surveillance

This method considers both the spatial features of persons within each frame and the temporal relationship of the same person among different frames and improves the search accuracy by utilizing the spatio-temporal features.

Emphasizing Complementary Samples for Non-literal Cross-modal Retrieval

This paper proposes a novel approach to prioritize loosely-aligned samples in cross-modal retrieval, which relies on estimating to what extent semantic similarity is preserved in the separate channels (images/text) in the learned multimodal space.

UniCon+: ICTCAS-UCAS Submission to the AVA-ActiveSpeaker Task at ActivityNet Challenge 2022

This report presents a brief description of the winning solution to the AVA Active Speaker Detection (ASD) task at ActivityNet Challenge 2022, which continues to rank first on this year’s challenge leaderboard and pushes the state-of-the-art.

References

Person Search in Videos with One Portrait Through Visual and Temporal Links

A novel framework is proposed that takes into account the identity invariance along a tracklet, allowing person identities to be propagated via both the visual and the temporal links; it remarkably outperforms mainstream person re-id methods.

Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

It is argued that the visual and audio components of a video signal should be modeled jointly using a fused multisensory representation, and it is proposed to learn such a representation in a self-supervised way, by training a neural network to predict whether video frames and audio are temporally aligned.

AVA Active Speaker: An Audio-Visual Dataset for Active Speaker Detection

This paper presents the AVA Active Speaker detection dataset (AVA-ActiveSpeaker), which has been publicly released to facilitate algorithm development and comparison, and introduces a state-of-the-art, jointly trained audio-visual model for real-time active speaker detection and compares several variants.

iQIYI-VID: A Large Dataset for Multi-modal Person Identification

This paper introduces iQIYI-VID, the largest video dataset for multi-modal person identification, and proposes a Multi-modal Attention module to fuse multi-modal features that can improve person identification considerably.

VoxCeleb: A Large-Scale Speaker Identification Dataset

This paper proposes a fully automated pipeline based on computer vision techniques to create a large scale text-independent speaker identification dataset collected 'in the wild', and shows that a CNN based architecture obtains the best performance for both identification and verification.

Self-Supervised Learning of Audio-Visual Objects from Video

This work introduces a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time, and significantly outperforms other self-supervised approaches, and obtains performance competitive with methods that use supervised face detection.

MARS: A Video Benchmark for Large-Scale Person Re-Identification

It is shown that a CNN in classification mode can be trained from scratch using the consecutive bounding boxes of each identity, and that the learned CNN embedding outperforms other competing methods considerably and generalizes well to other video re-id datasets upon fine-tuning.

Moments in Time Dataset: One Million Videos for Event Understanding

The Moments in Time dataset, a large-scale human-annotated collection of one million short videos corresponding to dynamic events unfolding within three seconds, can serve as a new challenge to develop models that scale to the level of complexity and abstract reasoning that a human processes on a daily basis.

Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding

This work proposes a novel Hollywood in Homes approach to data collection, yielding a new dataset, Charades, in which hundreds of people record videos in their own homes, acting out casual everyday activities; it evaluates and provides baseline results for several tasks, including action recognition and automatic description generation.

Scalable Person Re-identification: A Benchmark

As a minor contribution, inspired by recent advances in large-scale image search, an unsupervised Bag-of-Words descriptor is proposed that yields competitive accuracy on the VIPeR, CUHK03, and Market-1501 datasets, and is scalable on the large-scale 500k dataset.