University of Cambridge
- Jie Hu, Li Shen, Samuel Albanie, Gang Sun, E. Wu
- Computer Science, IEEE Transactions on Pattern Analysis and Machine…
- 5 September 2017
This work proposes a novel architectural unit, termed the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels, and shows that these blocks can be stacked to form SENet architectures that generalise extremely effectively across different datasets.
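The squeeze-and-excitation mechanism can be sketched in a few lines of NumPy: a global average pool collapses each channel to a scalar, a small bottleneck MLP produces per-channel sigmoid gates, and the gates rescale the original channels. The weight shapes and reduction ratio here are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def se_block(x, w1, b1, w2, b2):
    """Squeeze-and-Excitation sketch for a feature map x of shape (C, H, W).

    Squeeze: global average pooling collapses each channel to one scalar.
    Excitation: a two-layer bottleneck (ReLU, then sigmoid) yields
    per-channel gates in (0, 1) that recalibrate the input channels.
    Weights w1 (C/r, C) and w2 (C, C/r) are hypothetical placeholders.
    """
    z = x.mean(axis=(1, 2))                     # squeeze: (C,)
    h = np.maximum(0.0, w1 @ z + b1)            # bottleneck reduce: (C/r,)
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))    # sigmoid gates: (C,)
    return x * s[:, None, None]                 # channel-wise recalibration
```

With all-zero weights every gate is sigmoid(0) = 0.5, so the block halves each channel; trained weights instead learn which channels to emphasise or suppress.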
Use What You Have: Video retrieval using representations from collaborative experts
This paper proposes a collaborative experts model to aggregate information from different pre-trained experts and assesses the approach empirically on five retrieval benchmarks: MSR-VTT, LSMDC, MSVD, DiDeMo, and ActivityNet.
Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks
This work proposes a simple, lightweight solution to the issue of limited context propagation in ConvNets, which propagates context across a group of neurons by aggregating responses over their extent and redistributing the aggregates back through the group.
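The gather-excite idea described above can be illustrated with a parameter-free NumPy sketch: "gather" aggregates responses over local neighbourhoods via average pooling, and "excite" redistributes the aggregates by upsampling them back to full resolution and using them as sigmoid gates. The fixed extent and non-overlapping pooling are simplifying assumptions for illustration.

```python
import numpy as np

def gather_excite(x, extent=2):
    """Parameter-free gather-excite sketch for x of shape (C, H, W).

    Gather: average responses over non-overlapping extent x extent blocks
    (assumes extent divides H and W evenly).
    Excite: nearest-neighbour upsample the aggregates to (H, W) and use
    them as sigmoid gates on the original feature map.
    """
    c, h, w = x.shape
    e = extent
    g = x.reshape(c, h // e, e, w // e, e).mean(axis=(2, 4))  # gather
    up = np.repeat(np.repeat(g, e, axis=1), e, axis=2)        # redistribute
    return x * (1.0 / (1.0 + np.exp(-up)))                    # gate
```

Each neuron is thus modulated by context pooled from its neighbourhood rather than by its own response alone.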
Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching
- Arsha Nagrani, Samuel Albanie, Andrew Zisserman
- Computer Science, IEEE/CVF Conference on Computer Vision and…
- 1 April 2018
This paper introduces a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images is that of the speaker. It shows that a CNN can indeed be trained to solve this task in both the static and dynamic scenarios, and is even well above chance on 10-way classification of the face given the voice.
Learnable PINs: Cross-Modal Embeddings for Person Identity
A curriculum learning schedule for hard negative mining targeted to this task, which is essential for learning to proceed successfully, is developed, and an application of the joint embedding to automatically retrieving and labelling characters in TV dramas is shown.
Semi-convolutional Operators for Instance Segmentation
It is shown theoretically and empirically that constructing dense pixel embeddings that can separate object instances cannot be easily achieved using convolutional operators, and that simple modifications, which are called semi-convolutional, have a much better chance of succeeding at this task.
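The semi-convolutional modification described above can be sketched simply: because a purely convolutional embedding assigns identical-looking instances identical codes regardless of position, a minimal fix is to mix each pixel's coordinates into its embedding. The function name and the choice of adding raw (row, col) offsets to the first two dimensions are illustrative assumptions.

```python
import numpy as np

def semi_conv_embedding(f):
    """Turn a convolutional pixel embedding f of shape (D, H, W), D >= 2,
    into a semi-convolutional one by adding each pixel's (row, col)
    coordinates to the first two embedding dimensions.

    Two identical-looking instances at different image locations then
    receive different embeddings, so dense pixel embeddings can
    separate them into distinct instances.
    """
    d, h, w = f.shape
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    out = f.astype(float).copy()
    out[0] += rows   # position enters the embedding explicitly
    out[1] += cols
    return out
```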
Emotion Recognition in Speech using Cross-Modal Transfer in the Wild
- Samuel Albanie, Arsha Nagrani, A. Vedaldi, Andrew Zisserman
- Computer Science, ACM Multimedia
- 16 August 2018
A strong teacher network for facial emotion recognition that achieves the state of the art on a standard benchmark is developed and it is shown that the speech emotion embedding can be used for speech emotion recognition on external benchmark datasets.
Unsupervised Learning of Landmarks by Descriptor Vector Exchange
- James Thewlis, Samuel Albanie, Hakan Bilen, A. Vedaldi
- Computer Science, IEEE/CVF International Conference on Computer…
- 18 August 2019
A new perspective on the equivariance approach is developed by noting that dense landmark detectors can be interpreted as local image descriptors equipped with invariance to intra-category variations, and proposing a direct method to enforce such an invariance in the standard equivariant loss.
BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues
A new scalable approach to data collection for sign recognition in continuous videos is introduced, and it is shown that BSL-1K can be used to train strong sign recognition models for co-articulated signs in BSL and that these models additionally form excellent pretraining for other sign languages and benchmarks.
Disentangled Speech Embeddings Using Cross-Modal Self-Supervision
- Arsha Nagrani, Joon Son Chung, Samuel Albanie, Andrew Zisserman
- Computer Science, ICASSP - IEEE International Conference on…
- 20 February 2020
A self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video to tease apart the representations of linguistic content and speaker identity without access to manually annotated data is developed.