Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering
TLDR
A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of this approach to VQA.
MS-Celeb-1M: A Dataset and Benchmark for Large-Scale Face Recognition
TLDR
A benchmark task to recognize one million celebrities from their face images, using all face images of each individual that can be collected from the web as training data, which could lead to one of the largest classification problems in computer vision.
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
TLDR
This paper proposes a new learning method, Oscar (Object-Semantics Aligned Pre-training), which uses object tags detected in images as anchor points to significantly ease the learning of alignments.
Bottom-Up and Top-Down Attention for Image Captioning and VQA
TLDR
A combined bottom-up and top-down attention mechanism that enables attention to be calculated at the level of objects and other salient image regions is proposed, demonstrating the broad applicability of the method to VQA.
Unified Vision-Language Pre-Training for Image Captioning and VQA
TLDR
VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions and VQA 2.0.
Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation
TLDR
A novel Reinforced Cross-Modal Matching (RCM) approach that enforces cross-modal grounding both locally and globally via reinforcement learning (RL) is introduced, along with a Self-Supervised Imitation Learning (SIL) method that explores unseen environments by imitating its own past good decisions.
CleanNet: Transfer Learning for Scalable Image Classifier Training with Label Noise
TLDR
CleanNet, a joint neural embedding network, is introduced; it requires only a fraction of the classes to be manually verified to provide knowledge of label noise that can be transferred to other classes, and it can reduce the label noise detection error rate on held-out classes where no human supervision is available.
HigherHRNet: Scale-Aware Representation Learning for Bottom-Up Human Pose Estimation
TLDR
HigherHRNet, a novel bottom-up human pose estimation method that learns scale-aware representations using high-resolution feature pyramids, is presented; it surpasses all top-down methods on CrowdPose test and achieves a new state-of-the-art result on COCO test-dev, suggesting its robustness in crowded scenes.
Multilinear Discriminant Analysis for Face Recognition
TLDR
This paper presents a novel approach to solve the supervised dimensionality reduction problem by encoding an image object as a general tensor of second or even higher order, and proposes a discriminant tensor criterion, whereby multiple interrelated lower dimensional discriminative subspaces are derived for feature extraction.
Object-Driven Text-To-Image Synthesis via Adversarial Training
TLDR
A thorough comparison between the classic grid attention and the new object-driven attention is provided by analyzing their mechanisms and visualizing their attention layers, offering insight into how the proposed model generates high-quality complex scenes.