Learn More
Automatically describing video content with natural language is a fundamental challenge of computer vision. Re-current Neural Networks (RNNs), which models sequence dynamics, has attracted increasing attention on visual interpretation. However, most existing approaches generate a word locally with the given previous words and the visual content, while the(More)
While there has been increasing interest in the task of describing video with natural language, current computer vision algorithms are still severely limited in terms of the variability and complexity of the videos and their associated language that they can recognize. This is in part due to the simplicity of current benchmarks, which mostly focus on(More)
The problem of tagging is mostly considered from the perspectives of machine learning and data-driven philosophy. A fundamental issue that underlies the success of these approaches is the visual similarity, ranging from the nearest neighbor search to manifold learning, to identify similar instances of an example for tag completion. The need to searching for(More)
Search reranking is regarded as a common way to boost retrieval precision. The problem nevertheless is not trivial especially when there are multiple features or modalities to be considered for search, which often happens in image and video retrieval. This paper proposes a new reranking algorithm, named circular reranking, that reinforces the mutual(More)
One of the fundamental problems in image search is to learn the ranking functions, i.e., similarity between the query and image. The research on this topic has evolved through two paradigms: feature-based vector model and image ranker learning. The former relies on the image surrounding texts, while the latter learns a ranker based on human labeled(More)
Recognizing actions in videos is a challenging task as video is an information-intensive media with complex variations. Most existing methods have treated video as a flat data sequence while ignoring the intrinsic hierarchical structure of the video content. In particular, an action may span different granularities in this hierarchy including, from small to(More)
One of the fundamental problems in image search is to rank image documents according to a given textual query. We address two limitations of the existing image search engines in this paper. First, there is no straightforward way of comparing textual keywords with visual image content. Image search engines therefore highly depend on the surrounding texts,(More)
Automatically describing an image with a natural language has been an emerging challenge in both fields of computer vision and natural language processing. In this paper, we present Long Short-Term Memory with Attributes (LSTM-A)-a novel architecture that integrates attributes into the successful Convolutional Neural Networks (CNNs) plus Recurrent Neural(More)
Person re-identification aims at identifying a certain person across non-overlapping multi-camera networks. It is a fundamental and challenging task in automated video surveillance. Most existing researches mainly rely on hand-crafted features, resulting in unsatisfactory performance. In this paper, we propose a multi-scale triplet convolutional neural(More)