• Publications
No-Frills Human-Object Interaction Detection: Factorization, Layout Encodings, and Training Techniques
TLDR
We show that for human-object interaction detection a relatively simple factorized model with appearance and layout encodings constructed from pre-trained object detectors outperforms more sophisticated approaches.
  • 42
  • 10
Completing 3D object shape from one depth image
TLDR
We take an exemplar-based approach: retrieve similar objects in a database of 3D models using view-based matching and transfer the symmetries from retrieved models.
  • 123
  • 4
Contrastive Learning for Weakly Supervised Phrase Grounding
TLDR
We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words, contrasting corresponding image-caption pairs against non-corresponding ones.
  • 13
  • 4
No-Frills Human-Object Interaction Detection: Factorization, Appearance and Layout Encodings, and Training Techniques
TLDR
We show that with an appropriate factorization, and encodings of layout and appearance constructed from outputs of pretrained object detectors, a relatively simple model outperforms more sophisticated approaches on human-object interaction detection.
  • 15
  • 3
ViCo: Word Embeddings From Visual Co-Occurrences
TLDR
We propose to learn word embeddings from visual co-occurrences in large-scale, textually annotated visual databases like VisualGenome and ImageNet.
  • 8
  • 1
Learning Curves for Analysis of Deep Networks
TLDR
We propose a method to robustly estimate learning curves, abstract their parameters into error and data-reliance, and evaluate the effectiveness of different parameterizations.
  • 2
  • 1
Imagine This! Scripts to Compositions to Videos
TLDR
We present the Composition, Retrieval, and Fusion Network (CRAFT), a model capable of learning spatial, visual, and semantic world knowledge from video-caption data and applying it when generating videos from novel captions.
  • 21
Aligned Image-Word Representations Improve Inductive Transfer Across Vision-Language Tasks
TLDR
In this paper, we investigate a vision-language embedding as a core representation and show that it leads to better cross-task transfer than standard multitask learning.
  • 16
3DFS: Deformable Dense Depth Fusion and Segmentation for Object Reconstruction from a Handheld Camera
TLDR
We propose an approach for 3D reconstruction and segmentation of a single object placed on a flat surface from an input video.
  • 3
Towards General Purpose Vision Systems
TLDR
In this work, we propose a task-agnostic vision-language system that accepts an image and a natural language task description and outputs bounding boxes, confidences, and text.