Publications
Labeled Faces in the Wild: A Database for Studying Face Recognition in Unconstrained Environments
TLDR
The database contains labeled face photographs spanning the range of conditions typically encountered in everyday life, and exhibits “natural” variability in factors such as pose, lighting, race, accessories, occlusions, and background.
ReferItGame: Referring to Objects in Photographs of Natural Scenes
TLDR
This work introduces a two-player game for crowd-sourcing natural language referring expressions that can both collect and verify the expressions directly within the game, and provides an in-depth analysis of the resulting dataset.
Modeling Context in Referring Expressions
TLDR
This work focuses on incorporating better measures of visual context into referring expression models and finds that visual comparison to other objects within an image helps improve performance significantly.
MAttNet: Modular Attention Network for Referring Expression Comprehension
TLDR
This work proposes to decompose expressions into three modular components related to subject appearance, location, and relationship to other objects, which allows the model to flexibly adapt to expressions containing different types of information in an end-to-end framework.
Two-person interaction detection using body-pose features and multiple instance learning
TLDR
A complex human activity dataset depicting two-person interactions, including synchronized video, depth, and motion capture data, is created. Techniques related to Multiple Instance Learning (MIL) are explored, and the MIL-based classifier is found to outperform SVMs when the sequences extend temporally around the interaction of interest.
Im2Text: Describing Images Using 1 Million Captioned Photographs
TLDR
A new objective performance measure for image captioning is introduced, and methods incorporating many state-of-the-art, but fairly noisy, estimates of image content are developed to produce even more pleasing results.
TVQA: Localized, Compositional Video Question Answering
TLDR
This paper presents TVQA, a large-scale video QA dataset based on 6 popular TV shows, and provides analyses of this new dataset as well as several baselines and a multi-stream, end-to-end trainable neural network framework for the TVQA task.
Parsing clothing in fashion photographs
TLDR
This work demonstrates an effective method for parsing clothing in fashion photographs, an extremely challenging problem due to the large number of possible garment items and variations in configuration, garment appearance, layering, and occlusion.
Shape matching and object recognition using low distortion correspondences
TLDR
This work approaches recognition in the framework of deformable shape matching, relying on a new algorithm for finding correspondences between feature points, and shows results for localizing frontal and profile faces that are comparable to special-purpose approaches tuned to faces.
Where to Buy It: Matching Street Clothing Photos in Online Shops
TLDR
Three different methods for Exact Street to Shop retrieval are developed, including two deep learning baseline methods, and a method to learn a similarity measure between the street and shop domains.