Learn More
We describe a very simple bag-of-words baseline for visual question answering. This baseline concatenates the word features from the question and CNN features from the image to predict the answer. When evaluated on the challenging VQA dataset [2], it shows comparable performance to many recent approaches using recurrent neural networks. To explore the(More)
Human pose estimation requires a versatile yet well-constrained spatial model for grouping locally ambiguous parts together to produce a globally consistent hypothesis. Previous works either use local deformable models deviating from a certain template, or use a global mixture representation in the pose space. In this paper, we propose a new hierarchical(More)
A video sequence of an underwater scene taken from above the water surface suffers from severe distortions due to water fluctuations. In this paper, we simultaneously estimate the shape of the water surface and recover the planar underwater scene without using any calibration patterns, image priors, multiple viewpoints or active illumination. The key idea(More)
Digital photo management is becoming indispensable for the explosively growing family photo albums due to the rapid popularization of digital cameras and mobile phone cameras. In an effective photo management system photo annotation is the most challenging task. In this paper, we develop several innovative interaction techniques for semi-automatic photo(More)
Competing with top human players in the ancient game of Go has been a long-term goal of artificial intelligence. Go's high branching factor makes traditional search techniques ineffective, even on leading-edge hardware, and Go's evaluation function could change drastically with one stone change. Recent works [Maddi-son et al. (2015); Clark & Storkey (2015)](More)
Understanding 3D object structure from a single image is an important but difficult task in computer vision, mostly due to the lack of 3D object annotations in real images. Previous work tackles this problem by either solving an optimization task given 2D keypoint positions, or training on synthetic data with ground truth 3D information. In this work, we(More)
The conventional classification schemes — notably multinomial logistic regression — used in conjunction with convolutional networks (convnets) are classical in statistics, designed without consideration for the usual coupling with convnets, stochastic gradient descent, and backpropagation. In the specific application to supervised learning for convnets, a(More)
In this paper, we propose a new framework for training vision-based agent for First-Person Shooter (FPS) Game, in particular Doom. Our framework combines the state-of-the-art reinforcement learning approach (Asynchronous Advantage Actor-Critic (A3C) model [Mnih et al. (2016)]) with curriculum learning. Our model is simple in design and only uses game states(More)