Publications
Squeeze-and-Excitation Networks
TLDR
This work proposes a novel architectural unit, termed the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked together to form SENet architectures that generalise extremely effectively across different datasets.
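As a rough illustration of the channel recalibration described in this TLDR, the sketch below shows a minimal PyTorch-style SE block: global average pooling squeezes each channel to a descriptor, a small bottleneck MLP produces per-channel gates, and the input is rescaled by those gates. The module name `SEBlock`, the `reduction` ratio of 16, and the toy tensors are illustrative assumptions rather than the authors' reference implementation.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: recalibrate channels using global context."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Squeeze: collapse each channel's spatial map to a single descriptor.
        self.squeeze = nn.AdaptiveAvgPool2d(1)
        # Excitation: a small bottleneck MLP that produces per-channel gates.
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = self.squeeze(x).view(b, c)          # (B, C) channel descriptors
        w = self.excite(s).view(b, c, 1, 1)     # (B, C, 1, 1) gates in [0, 1]
        return x * w                            # rescale each channel


# Example: recalibrate a feature map from some backbone stage.
feats = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(feats).shape)  # torch.Size([2, 64, 32, 32])
```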
Use What You Have: Video retrieval using representations from collaborative experts
TLDR
This paper proposes a collaborative experts model to aggregate information from different pre-trained experts, and assesses the approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet.
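The sketch below is only a generic stand-in for the idea of combining several pre-trained experts into a single video embedding for retrieval; it uses simple mean pooling rather than the paper's collaborative gating, and the expert names, feature dimensions, and `ExpertAggregator` module are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertAggregator(nn.Module):
    """Project each pre-trained expert's features to a shared space and pool them."""

    def __init__(self, expert_dims: dict, shared_dim: int = 256):
        super().__init__()
        self.projections = nn.ModuleDict(
            {name: nn.Linear(dim, shared_dim) for name, dim in expert_dims.items()}
        )

    def forward(self, expert_feats: dict) -> torch.Tensor:
        projected = [self.projections[name](f) for name, f in expert_feats.items()]
        video_emb = torch.stack(projected, dim=0).mean(dim=0)  # simple mean pooling
        return F.normalize(video_emb, dim=-1)


# Example: features from hypothetical "rgb", "audio", and "ocr" experts for 4 clips.
experts = {"rgb": torch.randn(4, 2048), "audio": torch.randn(4, 128), "ocr": torch.randn(4, 300)}
agg = ExpertAggregator({"rgb": 2048, "audio": 128, "ocr": 300})
video_emb = agg(experts)                               # (4, 256) joint video embedding
query_emb = F.normalize(torch.randn(4, 256), dim=-1)   # stand-in for a text encoder output
similarity = video_emb @ query_emb.t()                 # retrieval scores between clips and queries
print(similarity.shape)  # torch.Size([4, 4])
```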
Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks
TLDR
This work proposes a simple, lightweight solution to the issue of limited context propagation in ConvNets, which propagates context across a group of neurons by aggregating responses over their extent and redistributing the aggregates back through the group.
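The following is a minimal sketch of the parameter-free gather-excite idea described above: responses are averaged over an extent (or over the whole map), then redistributed back over the group of neurons as multiplicative gates. The class name, the nearest-neighbour redistribution, and the `extent` handling are simplifying assumptions, not the paper's exact operators.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatherExcite(nn.Module):
    """Parameter-free gather-excite: pool context over an extent, redistribute it as gates."""

    def __init__(self, extent: int = 0):
        # extent = 0 gathers over the whole feature map (global extent);
        # otherwise responses are pooled over extent x extent neighbourhoods.
        super().__init__()
        self.extent = extent

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.extent == 0:
            gathered = F.adaptive_avg_pool2d(x, 1)  # global context per channel
        else:
            gathered = F.avg_pool2d(x, kernel_size=self.extent,
                                    stride=self.extent, ceil_mode=True)
        # Redistribute the aggregates back over the group of neurons they summarise.
        excite = F.interpolate(gathered, size=x.shape[-2:], mode="nearest")
        return x * torch.sigmoid(excite)


feats = torch.randn(2, 64, 32, 32)
print(GatherExcite(extent=8)(feats).shape)  # torch.Size([2, 64, 32, 32])
```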
Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching
TLDR
This paper introduces a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images shows the speaker. It demonstrates that a CNN can indeed be trained to solve this task in both the static and dynamic scenarios, and that it performs well above chance even on 10-way classification of the face given the voice.
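A minimal sketch of the forced-choice formulation follows, assuming precomputed voice and face feature vectors; in the paper the features come from dedicated convolutional subnetworks trained end to end, so the dimensions and the `ForcedChoiceMatcher` head below are purely illustrative.

```python
import torch
import torch.nn as nn

class ForcedChoiceMatcher(nn.Module):
    """Given a voice feature and two face features, predict which face is the speaker."""

    def __init__(self, voice_dim: int = 512, face_dim: int = 512, hidden: int = 256):
        super().__init__()
        # The voice/face vectors are assumed to be precomputed here; in the paper they
        # come from convolutional voice and face subnetworks.
        self.classifier = nn.Sequential(
            nn.Linear(voice_dim + 2 * face_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),  # logits over {face A, face B}
        )

    def forward(self, voice, face_a, face_b):
        return self.classifier(torch.cat([voice, face_a, face_b], dim=-1))


voice, face_a, face_b = (torch.randn(4, 512) for _ in range(3))
logits = ForcedChoiceMatcher()(voice, face_a, face_b)
print(logits.argmax(dim=-1))  # predicted speaker per example
```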
Learnable PINs: Cross-Modal Embeddings for Person Identity
TLDR
A curriculum learning schedule for hard negative mining, targeted to this task and essential for learning to proceed successfully, is developed, and an application of the joint embedding to automatically retrieving and labelling characters in TV dramas is shown.
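The sketch below illustrates one plausible form of such a curriculum, assuming in-batch negatives for a cross-modal triplet loss: early in training a random negative is used, and the weight on the hardest in-batch negative grows as training proceeds. The schedule, margin, and function name are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def curriculum_triplet_loss(face_emb, voice_emb, step, total_steps, margin=0.6):
    """Cross-modal triplet loss whose negatives get progressively harder:
    early in training a random in-batch negative is used, later the hardest one."""
    f = F.normalize(face_emb, dim=-1)
    v = F.normalize(voice_emb, dim=-1)
    sim = f @ v.t()                                           # (B, B) face-voice similarities
    b = sim.size(0)
    pos = sim.diag()                                          # matching face-voice pairs
    off_diag = sim.masked_fill(torch.eye(b, dtype=torch.bool), float("-inf"))
    hard_neg = off_diag.max(dim=1).values                     # most confusing voice per face
    rand_cols = (torch.arange(b) + torch.randint(1, b, (b,))) % b
    rand_neg = sim[torch.arange(b), rand_cols]                # a random non-matching voice
    # Curriculum schedule: interpolate from random to hard negatives as training proceeds.
    hardness = min(1.0, step / total_steps)
    neg = (1 - hardness) * rand_neg + hardness * hard_neg
    return F.relu(margin + neg - pos).mean()


face_emb, voice_emb = torch.randn(8, 256), torch.randn(8, 256)
print(curriculum_triplet_loss(face_emb, voice_emb, step=500, total_steps=1000).item())
```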
Semi-convolutional Operators for Instance Segmentation
TLDR
It is shown theoretically and empirically that constructing dense pixel embeddings that can separate object instances cannot be easily achieved using convolutional operators, and that simple modifications, which are called semi-convolutional, have a much better chance of succeeding at this task.
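As a sketch of the semi-convolutional idea, the module below mixes a convolutional embedding with normalised pixel coordinates, so that visually identical instances at different locations still receive distinct embeddings; the coordinate encoding and dimensions are illustrative simplifications of the operators studied in the paper.

```python
import torch
import torch.nn as nn

class SemiConvEmbedding(nn.Module):
    """Mix a convolutional embedding with pixel coordinates so that identical-looking
    instances at different locations receive distinct embeddings."""

    def __init__(self, in_channels: int, embed_dim: int = 8):
        super().__init__()
        # embed_dim >= 2 so the first two dimensions can carry the (x, y) offsets.
        self.conv = nn.Conv2d(in_channels, embed_dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        emb = self.conv(x)
        # Normalised pixel coordinate grid, shaped (1, 2, H, W).
        ys = torch.linspace(0, 1, h, device=x.device)
        xs = torch.linspace(0, 1, w, device=x.device)
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        coords = torch.stack([grid_x, grid_y]).unsqueeze(0)
        # Add the coordinates to the first two embedding channels only.
        return torch.cat([emb[:, :2] + coords, emb[:, 2:]], dim=1)


feats = torch.randn(2, 64, 32, 32)
print(SemiConvEmbedding(64)(feats).shape)  # torch.Size([2, 8, 32, 32])
```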
Emotion Recognition in Speech using Cross-Modal Transfer in the Wild
TLDR
A strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark is developed, and it is shown that the speech emotion embedding obtained by cross-modal transfer from this teacher can be used for speech emotion recognition on external benchmark datasets.
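A minimal sketch of the cross-modal transfer step, assuming a standard distillation objective: the speech (student) model is trained to match the soft emotion predictions of the frozen facial (teacher) model on temporally aligned data. The temperature, class count, and function name are assumptions; the paper's exact transfer losses may differ.

```python
import torch
import torch.nn.functional as F

def cross_modal_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Train a speech (student) model to match soft emotion predictions produced by a
    face (teacher) model on temporally aligned video."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between teacher and student distributions, scaled as in standard distillation.
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2


# Example: 8 utterances, 8 emotion classes (hypothetical dimensions).
teacher_logits = torch.randn(8, 8)                        # from a frozen facial-emotion teacher
student_logits = torch.randn(8, 8, requires_grad=True)    # from the speech student
loss = cross_modal_distillation_loss(student_logits, teacher_logits)
loss.backward()
print(loss.item())
```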
Unsupervised Learning of Landmarks by Descriptor Vector Exchange
TLDR
A new perspective on the equivariance approach is developed by noting that dense landmark detectors can be interpreted as local image descriptors equipped with invariance to intra-category variations, and proposing a direct method to enforce such an invariance in the standard equivariant loss.
BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues
TLDR
A new scalable approach to data collection for sign recognition in continuous videos is introduced, and it is shown that BSL-1K can be used to train strong sign recognition models for co-articulated signs in BSL and that these models additionally form excellent pretraining for other sign languages and benchmarks.
Disentangled Speech Embeddings Using Cross-Modal Self-Supervision
TLDR
A self-supervised learning objective is developed that exploits the natural cross-modal synchrony between faces and audio in video to tease apart the representations of linguistic content and speaker identity, without access to manually annotated data.
...