No-Frills Human-Object Interaction Detection: Factorization, Layout Encodings, and Training Techniques
- Tanmay Gupta, A. Schwing, Derek Hoiem
- Computer Science
- IEEE International Conference on Computer Vision
- 14 November 2018
We show that for human-object interaction detection a relatively simple factorized model with appearance and layout encodings constructed from pre-trained object detectors outperforms more…
Contrastive Learning for Weakly Supervised Phrase Grounding
- Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, J. Kautz, Derek Hoiem
- Computer Science
- European Conference on Computer Vision
- 17 June 2020
It is shown that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words.
Towards General Purpose Vision Systems
- Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, Derek Hoiem
- Computer Science
- ArXiv
- 1 April 2021
GPV-1 is proposed, a task-agnostic vision-language architecture that can learn and perform tasks that involve receiving an image and producing text and/or bounding boxes, including classification, localization, visual question answering, captioning, and more.
Completing 3D object shape from one depth image
- Jason Rock, Tanmay Gupta, J. Thorsen, JunYoung Gwak, Daeyun Shin, Derek Hoiem
- Computer Science, Mathematics
- Computer Vision and Pattern Recognition
- 7 June 2015
This work takes an exemplar-based approach: retrieve similar objects in a database of 3D models using view-based matching and transfer the symmetries and surfaces from retrieved models to fully automatically reconstruct a 3D model from any category.
Visual Semantic Role Labeling for Video Understanding
- Arka Sadhu, Tanmay Gupta, Mark Yatskar, R. Nevatia, Aniruddha Kembhavi
- Computer Science
- Computer Vision and Pattern Recognition
- 2 April 2021
This work introduces the VidSitu benchmark, a large-scale video understanding data source with 29K 10-second movie clips richly annotated with a verb and semantic roles every 2 seconds. It provides a comprehensive analysis of the dataset in comparison to other publicly available video understanding benchmarks, presents several illustrative baselines, and evaluates a range of standard video recognition models.
Imagine This! Scripts to Compositions to Videos
- Tanmay Gupta, Dustin Schwenk, Ali Farhadi, Derek Hoiem, Aniruddha Kembhavi
- Computer Science
- European Conference on Computer Vision
- 10 April 2018
This work presents the Composition, Retrieval, and Fusion Network (CRAFT), a model capable of learning knowledge from video-caption data and applying it while generating videos from novel captions, and evaluates CRAFT on semantic fidelity to caption, composition consistency, and visual quality.
No-Frills Human-Object Interaction Detection: Factorization, Appearance and Layout Encodings, and Training Techniques
- Tanmay Gupta, A. Schwing, Derek Hoiem
- Computer Science
- ArXiv
- 14 November 2018
We show that with an appropriate factorization, and encodings of layout and appearance constructed from outputs of pretrained object detectors, a relatively simple model outperforms more…
ViCo: Word Embeddings From Visual Co-Occurrences
- Tanmay Gupta, A. Schwing, Derek Hoiem
- Computer Science
- IEEE International Conference on Computer Vision
- 22 August 2019
This work extracts four types of visual co-occurrences between object and attribute words from large-scale, textually annotated visual databases like VisualGenome and ImageNet and trains a multi-task log-bilinear model that compactly encodes the word "meanings" represented by each co-occurrence type into a single visual word-vector.
Webly Supervised Concept Expansion for General Purpose Vision Models
- Amita Kamath, Christopher Clark, Tanmay Gupta, Eric Kolve, Derek Hoiem, Aniruddha Kembhavi
- Computer Science
- European Conference on Computer Vision
- 4 February 2022
This work uses a dataset of 1M+ images spanning 10k+ visual concepts to demonstrate webly-supervised concept expansion for two existing GPVs and proposes a new architecture, GPV-2, that supports a variety of tasks — from vision tasks like classification and localization to vision+language tasks like QA and captioning, to more niche ones like human-object interaction detection.
Learning Curves for Analysis of Deep Networks
- Derek Hoiem, Tanmay Gupta, Zhizhong Li, Michal Shlapentokh-Rothman
- Computer Science
- International Conference on Machine Learning
- 21 October 2020
A method is proposed to robustly estimate learning curves, abstract their parameters into error and data-reliance, and evaluate the effectiveness of different parameterizations for a variety of image classification models.