Xinhang Song

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In contrast to previous image description methods that focus on describing the whole image, this paper presents a method of generating rich image descriptions from image regions. …
This paper describes the participation of our team, MIAR ICT, in the ImageCLEF 2013 Robot Vision Challenge. The Challenge asked participants to classify imaged indoor scenes and recognize predefined objects appearing in those scenes. Our approach is based on the recently proposed Kernel Descriptors framework, which is an effective …
Food-related photos have become increasingly popular, driven by social networks, food recommendation, and dietary assessment systems. Reliable annotation is essential in those systems, but unconstrained automatic food recognition is still not accurate enough. Most works focus on exploiting only the visual content while ignoring the context. To address this …
With the explosive growth of image data on the Internet, efficiently utilizing it in cross-media scenarios has become an urgent problem. Images are usually accompanied by contextual textual information. These two heterogeneous modalities are mutually reinforcing and make Internet content more informative. In most cases, visual …
In this paper, we describe the details of our methods for participating in the Natural Language Caption Generation subtask of the ImageCLEF 2016 Scalable Image Annotation task. The model we used combines an encoding procedure and a decoding procedure, built from a convolutional neural network (CNN) and a Long Short-Term Memory (LSTM) …
Multiple image features and multiple semantic concepts from images have intrinsic and complex relations. These relations influence the effectiveness of image semantic analysis methods, especially on large-scale problems. In this paper, a framework for generating polysemous image representations through three levels of feature aggregation is proposed. …
Recently, automatic generation of image captions has attracted great interest, not only because of its extensive applications but also because it connects computer vision and natural language processing. By combining convolutional neural networks (CNNs), which learn visual representations from images, and recurrent neural networks (RNNs), which translate the …
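The CNN-encoder / RNN-decoder pattern mentioned in the captioning abstracts above can be sketched in a few lines. This is a minimal toy illustration, not the authors' actual model: the "CNN" is stubbed out as a fixed feature vector, the decoder is a tiny vanilla RNN with random untrained weights, and all names, sizes, and the toy vocabulary are illustrative assumptions.

```python
import numpy as np

# Toy sketch of encoder-decoder captioning: an image feature vector
# initializes the recurrent state, and the RNN then emits words greedily.
rng = np.random.default_rng(0)

VOCAB = ["<start>", "<end>", "a", "dog", "runs"]  # assumed toy vocabulary
V, H, F = len(VOCAB), 8, 16                       # vocab, hidden, feature dims

# Random parameters stand in for trained weights.
W_xh = rng.normal(size=(V, H)) * 0.1   # word one-hot -> hidden contribution
W_hh = rng.normal(size=(H, H)) * 0.1   # hidden -> hidden (recurrence)
W_fh = rng.normal(size=(F, H)) * 0.1   # image feature -> initial hidden state
W_hy = rng.normal(size=(H, V)) * 0.1   # hidden -> vocabulary logits

def caption(image_feature, max_len=10):
    """Greedily decode a word sequence conditioned on an image feature."""
    h = np.tanh(image_feature @ W_fh)      # "encode": image sets initial state
    word = VOCAB.index("<start>")
    out = []
    for _ in range(max_len):
        x = np.eye(V)[word]                # one-hot embedding of last word
        h = np.tanh(x @ W_xh + h @ W_hh)   # recurrent state update
        word = int(np.argmax(h @ W_hy))    # greedy choice of next word
        if VOCAB[word] == "<end>":
            break
        out.append(VOCAB[word])
    return out

fake_cnn_feature = rng.normal(size=F)      # stand-in for real CNN output
print(caption(fake_cnn_feature))
```

A real system replaces the random weights with parameters trained end-to-end on image–caption pairs, the one-hot lookup with learned embeddings, and the vanilla RNN with an LSTM, but the control flow is the same.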