Discovering Connotations as Labels for Weakly Supervised Image-Sentence Data

  title={Discovering Connotations as Labels for Weakly Supervised Image-Sentence Data},
  author={Aditya Mogadala and Bhargav Kanuparthi and Achim Rettinger and York Sure-Vetter},
  journal={Companion Proceedings of the The Web Conference 2018},
Growth of multimodal content on the web and social media has generated abundant weakly aligned image-sentence pairs. However, it is hard to interpret them directly due to intrinsic intension. In this paper, we aim to annotate such image-sentence pairs with connotations as labels to capture the intrinsic intension. We achieve it with a connotation multimodal embedding model (CMEM) using a novel loss function. It's unique characteristics over previous models include: (i) the exploitation of… 

Figures and Tables from this paper

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

This survey focuses on ten prominent tasks that integrate language and vision by discussing their problem formulations, methods, existing datasets, evaluation measures, and compare the results obtained with corresponding state-of-the-art methods.



Large scale image annotation: learning to rank with joint word-image embeddings

This work proposes a strongly performing method that scales to image annotation datasets by simultaneously learning to optimize precision at k of the ranked list of annotations for a given image and learning a low-dimensional joint embedding space for both images and annotations.

Exploiting Web Images for Dataset Construction: A Domain Robust Approach

This work proposes to solve the employed problems by the cutting-plane and concave-convex procedure algorithm and builds an image dataset with 20 categories that can be generalized well to unseen target domains.

DeViSE: A Deep Visual-Semantic Embedding Model

This paper presents a new deep visual-semantic embedding model trained to identify visual objects using both labeled image data as well as semantic information gleaned from unannotated text and shows that the semantic information can be exploited to make predictions about tens of thousands of image labels not observed during training.

User Conditional Hashtag Prediction for Images

It is shown how user metadata (age, gender, etc.) combined with image features derived from a convolutional neural network can be used to perform hashtag prediction and it is demonstrated that modeling the user can significantly improve the tag prediction quality over current state-of-the-art methods.

Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels

This paper proposes an algorithm to decouple the human reporting bias from the correct visually grounded labels, and shows significant improvements over traditional algorithms for both image classification and image captioning, doubling the performance of existing methods in some cases.

Attend to You: Personalized Image Captioning with Context Sequence Memory Networks

This work proposes a novel captioning model named Context Sequence Memory Network (CSMN), and shows the effectiveness of the three novel features of CSMN and its performance enhancement for personalized image captioning over state-of-the-art captioning models.

ImageNet: A large-scale hierarchical image database

A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.

Improving Pairwise Ranking for Multi-label Image Classification

  • Y. LiYale SongJiebo Luo
  • Computer Science
    2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2017
A novel loss function for pairwise ranking is proposed, which is smooth everywhere, and a label decision module is incorporated into the model, estimating the optimal confidence thresholds for each visual concept.

Learning from Noisy Large-Scale Datasets with Minimal Supervision

An approach to effectively use millions of images with noisy annotations in conjunction with a small subset of cleanly-annotated images to learn powerful image representations and is particularly effective for a large number of classes with wide range of noise in annotations.

Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations

The Visual Genome dataset is presented, which contains over 108K images where each image has an average of $$35$$35 objects, $$26$$26 attributes, and $$21$$21 pairwise relationships between objects, and represents the densest and largest dataset of image descriptions, objects, attributes, relationships, and question answer pairs.