VoxCeleb: A Large-Scale Speaker Identification Dataset
TLDR
This paper proposes a fully automated pipeline based on computer vision techniques to create a large-scale, text-independent speaker identification dataset collected 'in the wild', and shows that a CNN-based architecture obtains the best performance for both identification and verification.
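A minimal sketch of such a CNN-based speaker-identification model over log-mel spectrograms, assuming PyTorch; the layer sizes, embedding dimension, and class names are illustrative rather than the paper's exact VGG-style configuration (1,251 is the number of speakers in VoxCeleb1).

```python
# Sketch of a CNN speaker-identification model on log-mel spectrograms.
# Layer sizes are illustrative only, not the exact architecture from the paper.
import torch
import torch.nn as nn

class SpeakerCNN(nn.Module):
    def __init__(self, num_speakers: int, n_mels: int = 40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        # Pool over both frequency and time so variable-length clips work.
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.embedding = nn.Linear(64, 256)              # utterance-level embedding
        self.classifier = nn.Linear(256, num_speakers)   # identification head

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        # spectrogram: (batch, 1, n_mels, frames)
        x = self.pool(self.features(spectrogram)).flatten(1)
        emb = self.embedding(x)      # reusable for verification via cosine scoring
        return self.classifier(emb)  # logits over enrolled speakers

logits = SpeakerCNN(num_speakers=1251)(torch.randn(2, 1, 40, 300))
print(logits.shape)  # torch.Size([2, 1251])
```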
VoxCeleb2: Deep Speaker Recognition
TLDR
A very large-scale audio-visual speaker recognition dataset collected from open-source media is introduced, and Convolutional Neural Network models and training strategies that can effectively recognise identities from voice under various conditions are developed and compared.
Utterance-level Aggregation for Speaker Recognition in the Wild
TLDR
This paper proposes a powerful speaker recognition deep network that can be trained end-to-end, using a ‘thin-ResNet’ trunk architecture and a dictionary-based NetVLAD or GhostVLAD layer to aggregate features across time.
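A minimal sketch of the aggregation idea, assuming PyTorch: frame-level features from a trunk network are softly assigned to a learned dictionary of clusters and the residuals are summed over time, giving a fixed-length utterance descriptor. Cluster count and feature dimension are illustrative, not the paper's configuration.

```python
# NetVLAD-style aggregation of frame-level features into one utterance vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD(nn.Module):
    def __init__(self, num_clusters: int = 8, dim: int = 512):
        super().__init__()
        self.assign = nn.Linear(dim, num_clusters)            # soft-assignment scores
        self.centroids = nn.Parameter(torch.randn(num_clusters, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame-level features from the trunk
        a = F.softmax(self.assign(x), dim=-1)                 # (B, T, K)
        residuals = x.unsqueeze(2) - self.centroids           # (B, T, K, D)
        vlad = (a.unsqueeze(-1) * residuals).sum(dim=1)       # (B, K, D)
        vlad = F.normalize(vlad, dim=-1)                      # intra-normalisation
        return F.normalize(vlad.flatten(1), dim=-1)           # (B, K*D)

utterance = NetVLAD()(torch.randn(4, 120, 512))
print(utterance.shape)  # torch.Size([4, 4096])
```

GhostVLAD additionally learns extra "ghost" clusters whose aggregated residuals are discarded, so uninformative frames can be absorbed without polluting the descriptor.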
Use What You Have: Video retrieval using representations from collaborative experts
TLDR
This paper proposes a collaborative experts model to aggregate information from multiple pre-trained experts, and assesses the approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet.
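A minimal sketch, assuming PyTorch, of fusing features from several pre-trained "experts" (e.g. object, scene, and audio embeddings) into one video embedding for text-video retrieval; the gating, projection, and dimensions are illustrative, not the exact collaborative-experts formulation.

```python
# Gated fusion of pre-extracted expert features into a joint video embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertFusion(nn.Module):
    def __init__(self, expert_dims: dict, shared_dim: int = 256):
        super().__init__()
        self.project = nn.ModuleDict(
            {name: nn.Linear(d, shared_dim) for name, d in expert_dims.items()})
        self.gate = nn.ModuleDict(
            {name: nn.Linear(d, 1) for name, d in expert_dims.items()})

    def forward(self, experts: dict) -> torch.Tensor:
        # experts: {name: (batch, dim)} pre-computed features per modality
        weights = torch.cat([self.gate[n](x) for n, x in experts.items()], dim=-1)
        weights = F.softmax(weights, dim=-1)                          # (B, E)
        projected = torch.stack(
            [self.project[n](x) for n, x in experts.items()], dim=1)  # (B, E, D)
        video = (weights.unsqueeze(-1) * projected).sum(dim=1)
        return F.normalize(video, dim=-1)   # compared to a text embedding via cosine

fused = ExpertFusion({"object": 2048, "scene": 2208, "audio": 128})(
    {"object": torch.randn(3, 2048),
     "scene": torch.randn(3, 2208),
     "audio": torch.randn(3, 128)})
print(fused.shape)  # torch.Size([3, 256])
```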
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
TLDR
This work proposes a novel architecture for multi-modal temporal binding, i.e. the combination of modalities within a range of temporal offsets, and demonstrates the importance of audio in egocentric vision, on a per-class basis, for identifying actions as well as interacting objects.
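A minimal sketch of the binding idea, assuming PyTorch: per-modality snippets sampled at (possibly offset) times inside the same temporal window are fused before aggregation over the window. Names, dimensions, and the offset-sampling helper are illustrative, not the exact EPIC-Fusion architecture.

```python
# Mid-level fusion of RGB, flow and audio snippets inside a temporal window.
import random
import torch
import torch.nn as nn

class TemporalBinding(nn.Module):
    def __init__(self, feat_dim: int = 256, num_classes: int = 10):
        super().__init__()
        self.fuse = nn.Linear(3 * feat_dim, feat_dim)     # concat RGB + flow + audio
        self.classify = nn.Linear(feat_dim, num_classes)

    def forward(self, rgb, flow, audio):
        # each input: (batch, segments, feat_dim); snippets within a segment may
        # come from different timestamps (the "temporal offset").
        fused = torch.relu(self.fuse(torch.cat([rgb, flow, audio], dim=-1)))
        return self.classify(fused.mean(dim=1))           # average over segments

def sample_offsets(segment_start, segment_len, max_offset):
    """Pick per-modality timestamps within one segment, up to max_offset apart."""
    base = segment_start + random.random() * segment_len
    return {m: base + random.uniform(-max_offset, max_offset)
            for m in ("rgb", "flow", "audio")}

out = TemporalBinding()(torch.randn(2, 3, 256), torch.randn(2, 3, 256),
                        torch.randn(2, 3, 256))
print(out.shape)  # torch.Size([2, 10])
```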
Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching
TLDR
This paper introduces a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images belongs to the speaker. It shows that a CNN can indeed be trained to solve this task in both the static and dynamic scenarios, and that performance remains well above chance even on 10-way classification of the face given the voice.
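A minimal sketch of the forced-choice formulation, assuming PyTorch: embed one voice clip and two candidate faces, then predict which face belongs to the speaker. The linear encoders below are stand-ins; the paper uses CNNs over spectrograms and face images.

```python
# Two-way voice-to-face matching: which of faces A and B is the speaker?
import torch
import torch.nn as nn

class VoiceToFaceMatcher(nn.Module):
    def __init__(self, voice_dim=512, face_dim=512, hidden=256):
        super().__init__()
        self.voice_net = nn.Linear(voice_dim, hidden)   # stand-in voice encoder
        self.face_net = nn.Linear(face_dim, hidden)     # stand-in face encoder (shared)
        self.decide = nn.Linear(3 * hidden, 2)          # logits over {face A, face B}

    def forward(self, voice, face_a, face_b):
        v = torch.relu(self.voice_net(voice))
        fa = torch.relu(self.face_net(face_a))
        fb = torch.relu(self.face_net(face_b))
        return self.decide(torch.cat([v, fa, fb], dim=-1))

logits = VoiceToFaceMatcher()(torch.randn(4, 512),
                              torch.randn(4, 512),
                              torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 2])
```

The N-way variant simply scores N candidate faces against the same voice embedding instead of two.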
Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval
TLDR
An end-to-end trainable model is presented that is designed to take advantage of both large-scale image and video captioning datasets, and yields state-of-the-art results on standard downstream video-retrieval benchmarks including MSR-VTT, MSVD, DiDeMo and LSMDC.
Learnable PINs: Cross-Modal Embeddings for Person Identity
TLDR
A curriculum learning schedule for hard negative mining targeted to this task, which is essential for learning to proceed successfully, is developed, and an application of the joint embedding to automatically retrieving and labelling characters in TV dramas is shown.
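A minimal sketch of curriculum-scheduled hard negative mining for a cross-modal embedding, assuming PyTorch: early in training negatives are sampled broadly, and as training progresses the candidate pool shrinks toward the most similar (hardest) non-matching examples. The schedule and pool sizes are illustrative.

```python
# Curriculum hard negative mining: pool of candidate negatives shrinks over time.
import torch
import torch.nn.functional as F

def mine_negatives(anchors, candidates, progress: float, top_k: int = 5):
    """anchors: (B, D) embeddings from one modality; candidates: (N, D) embeddings
    of non-matching identities; progress: 0.0 (start) -> 1.0 (end of curriculum)."""
    sims = F.normalize(anchors, dim=-1) @ F.normalize(candidates, dim=-1).T  # (B, N)
    # Shrink the candidate pool toward the hardest examples as training progresses.
    pool = max(top_k, int(candidates.size(0) * (1.0 - progress)))
    hardest = sims.topk(pool, dim=-1).indices          # (B, pool), most similar first
    pick = torch.randint(0, pool, (anchors.size(0), 1))
    return hardest.gather(1, pick).squeeze(1)          # one negative index per anchor

neg_idx = mine_negatives(torch.randn(8, 128), torch.randn(100, 128), progress=0.5)
print(neg_idx.shape)  # torch.Size([8])
```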
Spot the conversation: speaker diarisation in the wild
TLDR
This work proposes an automatic audio-visual diarisation method for YouTube videos, consisting of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models, and integrates this method into a semi-automatic dataset creation pipeline.
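A minimal sketch of the second stage under simple assumptions: segments that active speaker detection has attributed to an on-screen face are greedily assigned to self-enrolled speaker models by embedding similarity, starting a new speaker when no enrolled model is close enough. All helpers and thresholds here are hypothetical stand-ins, not the paper's exact procedure.

```python
# Greedy self-enrolment of speaker models for diarising one video.
import math
from dataclasses import dataclass

@dataclass
class Segment:
    start: float
    end: float
    face_track: int        # which on-screen face was judged to be speaking
    embedding: tuple       # speaker embedding of the audio in this segment

def diarise(segments, similarity, threshold=0.7):
    """Assign each segment to the closest enrolled speaker, or enrol a new one."""
    speakers, labels = [], []          # speakers: list of (label, embeddings)
    for seg in segments:
        best, best_sim = None, threshold
        for label, embs in speakers:
            sim = max(similarity(seg.embedding, e) for e in embs)
            if sim > best_sim:
                best, best_sim = label, sim
        if best is None:
            best = len(speakers)
            speakers.append((best, [seg.embedding]))
        else:
            speakers[best][1].append(seg.embedding)    # grow that speaker's model
        labels.append(best)
    return labels

def cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

segs = [Segment(0, 2, 0, (1.0, 0.1)),
        Segment(2, 4, 1, (0.1, 1.0)),
        Segment(4, 6, 0, (0.9, 0.2))]
print(diarise(segs, cos))  # [0, 1, 0]
```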