• Publications
No-Frills Human-Object Interaction Detection: Factorization, Layout Encodings, and Training Techniques
We show that for human-object interaction detection a relatively simple factorized model with appearance and layout encodings constructed from pre-trained object detectors outperforms more …
Contrastive Learning for Weakly Supervised Phrase Grounding
It is shown that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words.
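The InfoNCE-style bound this entry refers to can be illustrated with a minimal sketch. Everything below (function name, batch size, how scores are produced) is an assumption for illustration, not the paper's implementation:

```python
import numpy as np

def infonce_bound(scores):
    """InfoNCE lower bound on mutual information, computed from an
    N x N image-caption score matrix whose diagonal holds matched pairs."""
    n = scores.shape[0]
    # Row-wise log-softmax: each matched pair against in-batch negatives.
    log_softmax = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # I(image; caption) >= log N + mean log-softmax of the positives.
    return np.log(n) + np.mean(np.diag(log_softmax))

rng = np.random.default_rng(0)
n = 8
scores = rng.normal(size=(n, n))
scores[np.arange(n), np.arange(n)] += 5.0  # matched pairs score highest
bound = infonce_bound(scores)
```

Maximizing this bound with respect to the score function (here a random stand-in; in the paper, word-region attention) is what drives the grounding to emerge.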
Completing 3D object shape from one depth image
This work takes an exemplar-based approach: similar objects are retrieved from a database of 3D models using view-based matching, and their symmetries and surfaces are transferred to fully automatically reconstruct a 3D model of an object from any category.
No-Frills Human-Object Interaction Detection: Factorization, Appearance and Layout Encodings, and Training Techniques
We show that with an appropriate factorization, and encodings of layout and appearance constructed from outputs of pretrained object detectors, a relatively simple model outperforms more …
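A minimal sketch of this kind of factorized scoring, where pretrained-detector confidences gate a per-interaction term. All names and numbers below are illustrative assumptions, not the paper's actual model:

```python
import numpy as np

def hoi_score(human_conf, object_conf, interaction_logits):
    """Factorized human-object interaction score: detector confidences
    for the human and object boxes multiply per-interaction probabilities
    produced by a simple head over appearance/layout features."""
    interaction_probs = 1.0 / (1.0 + np.exp(-interaction_logits))  # sigmoid
    return human_conf * object_conf * interaction_probs

# Illustrative inputs: a confident human/object pair and logits for
# three candidate interactions (e.g. ride, carry, hold).
scores = hoi_score(0.9, 0.8, np.array([2.0, -1.0, 0.5]))
```

The appeal of the factorization is that the detector terms come for free from a pretrained model, so only the small interaction head needs training.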
ViCo: Word Embeddings From Visual Co-Occurrences
This work extracts four types of visual co-occurrences between object and attribute words from large-scale, textually annotated visual databases such as VisualGenome and ImageNet, and trains a multi-task log-bilinear model that compactly encodes the word "meanings" represented by each co-occurrence type into a single visual word vector.
Learning Curves for Analysis of Deep Networks
A method is proposed to robustly estimate learning curves, abstract their parameters into error and data-reliance, and evaluate the effectiveness of different parameterizations for a variety of image classification models.
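A generic power-law parameterization is one way to sketch this kind of learning-curve fitting, separating an asymptotic error floor from a term that decays with dataset size. The functional form and names below are assumptions for illustration, not the paper's exact parameterization:

```python
import numpy as np

def fit_learning_curve(sizes, errors, gammas=np.linspace(0.05, 1.0, 96)):
    """Fit err(n) = err_inf + beta * n**(-gamma) by scanning the exponent
    gamma and solving the remaining linear problem in closed form."""
    best = None
    for gamma in gammas:
        # For fixed gamma the model is linear in (err_inf, beta).
        A = np.stack([np.ones_like(sizes), sizes ** (-gamma)], axis=1)
        coef, *_ = np.linalg.lstsq(A, errors, rcond=None)
        resid = float(np.sum((A @ coef - errors) ** 2))
        if best is None or resid < best[0]:
            best = (resid, coef[0], coef[1], gamma)
    return best[1:]  # (err_inf, beta, gamma)

sizes = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4, 3.2e4])
errors = 0.08 + 5.0 * sizes ** (-0.45)  # synthetic, noise-free curve
err_inf, beta, gamma = fit_learning_curve(sizes, errors)
```

Fitting on synthetic data recovers the generating parameters; on real curves, the fitted floor and decay terms play the role of the "error" and "data-reliance" summaries the abstract mentions.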
Towards General Purpose Vision Systems
This work proposes a task-agnostic vision-language system that accepts an image and a natural language task description and outputs bounding boxes, confidences, and text, and evaluates the system's ability to learn multiple skills simultaneously, to perform tasks with novel skill-concept combinations, and to learn new skills efficiently and without forgetting.
Imagine This! Scripts to Compositions to Videos
This work presents the Composition, Retrieval, and Fusion Network (CRAFT), a model capable of learning knowledge from video-caption data and applying it while generating videos from novel captions, and evaluates CRAFT on semantic fidelity to caption, composition consistency, and visual quality.
Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks
This paper investigates a vision-language embedding as a core representation and shows that it leads to better cross-task transfer than standard multitask learning and improves visual recognition, especially for categories that have relatively few recognition training labels but appear often in the VQA setting.
3DFS: Deformable Dense Depth Fusion and Segmentation for Object Reconstruction from a Handheld Camera
The approach performs 3D reconstruction and segmentation of a single object placed on a flat surface from an input video, by estimating dense depth maps for multiple views with a proposed detail-preserving objective function.