Vicente Ordonez

Learn More
We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-WeightNetworks, the filters are approximated with binary values resulting in 32× memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions(More)
We develop and demonstrate automatic image description methods using a large captioned photo collection. One contribution is our technique for the automatic collection of this new dataset – performing a huge number of Flickr queries and then filtering the noisy results down to 1 million images with associated visually relevant captions. Such a collection(More)
With the rise in popularity of digital cameras, the amount of visual data available on the web is growing exponentially. Some of these pictures are extremely beautiful and aesthetically pleasing, but the vast majority are uninteresting or of low quality. This paper demonstrates a simple, yet powerful method to automatically select high aesthetic quality(More)
In this paper we introduce a new game to crowd-source natural language referring expressions. By designing a two player game, we can both collect and verify referring expressions directly within the game. To date, the game has produced a dataset containing 130,525 expressions, referring to 96,654 distinct objects, in 19,894 photographs of natural scenes.(More)
We present a holistic data-driven approach to image description generation, exploiting the vast amount of (noisy) parallel image data and associated natural language descriptions available on the web. More specifically, given a query image, we retrieve existing human-composed phrases used to describe visually similar images, then selectively combine those(More)
We present a system to automatically generate natural language descriptions from images. This system consists of two parts. The first part, content planning, smooths the output of computer vision-based detection and recognition algorithms with statistics mined from large pools of visually descriptive text to determine the best content words to use to(More)
We present a new tree based approach to composing expressive image descriptions that makes use of naturally occuring web images with captions. We investigate two related tasks: image caption generalization and generation, where the former is an optional subtask of the latter. The high-level idea of our approach is to harvest expressive phrases (as tree(More)
Entry level categories - the labels people will use to name an object - were originally defined and studied by psychologists in the 1980s. In this paper we study entry-level categories at a large scale and learn the first models for predicting entry-level categories for images. Our models combine visual recognition predictions with proxies for word(More)
What is the story of an image? What is the relationship between pictures, language, and information we can extract using state of the art computational recognition systems? In an attempt to address both of these questions, we explore methods for retrieving and generating natural language descriptions for images. Ideally, we would like our generated textual(More)