Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently …
Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent are effective for tasks involving sequences, visual and otherwise. We describe a class of recurrent convolutional architectures which is end-to-end trainable and suitable for large-scale visual understanding …
A major challenge in scaling object detection is the difficulty of obtaining labeled images for large numbers of categories. Recently, deep convolutional neural networks (CNNs) have emerged as clear winners on object classification benchmarks, in part due to training with 1.2M+ labeled classification images. Unfortunately, only a small fraction of those …
Goals: Given a short YouTube video, output a natural-language sentence that describes the main activity in the video. When the model is not confident enough, it should produce a less specific answer, but not overgeneralize.
We present a system which can recognize the contents of your meal from a single image, and then predict its nutritional contents, such as calories. The simplest version assumes that the user is eating at a restaurant for which we know the menu. In this case, we can collect images offline to train a multi-label classifier. At run time, we apply the …
We present a holistic data-driven technique that generates natural-language descriptions for videos. We combine the output of state-of-the-art object and activity detectors with "real-world" knowledge to select the most probable subject-verb-object triplet for describing a video. We show that this knowledge, automatically mined from web-scale text …
We propose a system for human-robot interaction that learns both models for spatial prepositions and for object recognition. Our system grounds the meaning of an input sentence in terms of visual percepts coming from the robot's sensors in order to send an appropriate command to the PR2 or respond to spatial queries. To perform this grounding, the system …
Since the concept of a fuzzy set was defined in 1965 by Zadeh, numerous papers on fuzzy topics have been published and many of his seminal ideas have evolved in different directions. However, there is a lack of natural and intuitive procedures to capture fuzzy perceptions from non-expert users. In this paper, we propose an innovative approach to solve this …