MMFeat: A Toolkit for Extracting Multi-Modal Features

@inproceedings{Kiela2016MMFeatAT,
  title={MMFeat: A Toolkit for Extracting Multi-Modal Features},
  author={Douwe Kiela},
  booktitle={ACL},
  year={2016}
}
  • Douwe Kiela
  • Published in ACL, 1 August 2016
  • Computer Science
Research at the intersection of language and other modalities, most notably vision, is becoming increasingly important in natural language processing. We introduce a toolkit that can be used to obtain feature representations for visual and auditory information. MMFEAT is an easy-to-use Python toolkit, which has been developed with the purpose of making non-linguistic modalities more accessible to natural language processing researchers. 
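
The abstract summarizes what the toolkit is for; concretely, MMFEAT mines image or sound files associated with a word and turns them into feature vectors such as bags of visual or audio words. The snippet below is a minimal sketch of that generic bag-of-features pipeline, not MMFEAT's actual API: every name in it is hypothetical, scikit-learn's k-means stands in for the codebook step, and local descriptors (e.g. dense SIFT for images, MFCCs for audio) are assumed to be precomputed.

# Illustrative bag-of-features pipeline of the kind MMFEAT automates
# (placeholder names; not the toolkit's API).
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(descriptors_per_file, k=100, seed=0):
    """Cluster all local descriptors into a k-entry codebook with k-means."""
    all_descriptors = np.vstack(list(descriptors_per_file.values()))
    return KMeans(n_clusters=k, random_state=seed, n_init=10).fit(all_descriptors)

def bag_of_features(descriptors_per_file, codebook):
    """Quantize each file into a normalized k-bin histogram."""
    k = codebook.n_clusters
    lookup = {}
    for fname, descriptors in descriptors_per_file.items():
        counts = np.bincount(codebook.predict(descriptors), minlength=k)
        lookup[fname] = counts / max(counts.sum(), 1)
    return lookup

# Random arrays stand in for per-file descriptor matrices (e.g. SIFT, MFCC).
rng = np.random.default_rng(0)
files = {f"dog_{i}.jpg": rng.normal(size=(200, 128)) for i in range(5)}
codebook = build_codebook(files, k=10)
features = bag_of_features(files, codebook)  # filename -> 10-dim vector

The per-file vectors are typically aggregated (e.g. averaged) over all files retrieved for a word to obtain a single perceptual representation for that word.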
Citations

Modelling Visual Properties and Visual Context in Multimodal Semantics
TLDR
This work constructs multimodal models that differentiate between internal visual properties of the objects and their external visual context, and evaluates the models on the task of decoding brain activity associated with the meanings of nouns, demonstrating their advantage over those based on complete images.
Comparing Data Sources and Architectures for Deep Visual Representation Learning in Semantics
TLDR
This study systematically compares deep visual representation learning techniques, experimenting with three well-known network architectures, and explores the optimal number of images and the multi-lingual applicability of multi-modal semantics.
Deconstructing multimodality: visual properties and visual context in human semantic processing
TLDR
This work constructs multimodal models that differentiate between internal visual properties of the objects and their external visual context, and evaluates the models on the task of decoding brain activity associated with the meanings of nouns, demonstrating their advantage over those based on complete images.
Developing a Comprehensive Framework for Multimodal Feature Extraction
TLDR
Pliers is an open-source Python package that supports standardized annotation of diverse data types; it is expressly implemented with both ease of use and extensibility in mind, and can significantly reduce the time and effort required to construct simple feature extraction workflows while increasing code clarity and maintainability.
Multi-Modal Embeddings for Common Sense Reasoning
TLDR
This work builds on the idea that certain words are better described in a particular modality, depending on the “concreteness” of the concept the word denotes, and allows the model to use word embeddings derived from multiple modalities via attention mechanisms.
If Sentences Could See: Investigating Visual Information for Semantic Textual Similarity
TLDR
The effects of incorporating visual signal from images into unsupervised Semantic Textual Similarity measures are investigated and it is shown that selective inclusion of visual information may further boost performance in the multi-modal setup.
Representing Verbs with Visual Argument Vectors
TLDR
It is demonstrated that, using visual distributional models, it is possible to extract meaningful information and to effectively capture the semantic similarity between verbs.
Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description
TLDR
The results from the second shared task on multimodal machine translation and multilingual image description show that multi-modal systems have improved, but text-only systems remain competitive.
Visual Denotations for Recognizing Textual Entailment
TLDR
This work proposes to map phrases to their visual denotations and compare their meaning in terms of their images to show that this approach is effective in the task of Recognizing Textual Entailment when combined with specific linguistic and logic features.

References

SHOWING 1-10 OF 45 REFERENCES
Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics
We construct multi-modal concept representations by concatenating a skip-gram linguistic representation vector with a visual concept representation vector computed using the feature extraction layers of a deep convolutional neural network trained on a large labeled object recognition dataset.
Learning Grounded Meaning Representations with Autoencoders
TLDR
A new model is introduced which uses stacked autoencoders to learn higher-level embeddings from textual and visual input and which outperforms baselines and related models on similarity judgments and concept categorization.
Vision and Feature Norms: Improving automatic feature norm learning through cross-modal maps
TLDR
This paper examines property norm prediction from visual, rather than textual, data, using cross-modal maps learnt between property norm and visual spaces, and investigates the importance of having a complete feature norm dataset, for both training and testing.
Improving Multi-Modal Representations Using Image Dispersion: Why Less is Sometimes More
TLDR
An unsupervised method to determine whether to include perceptual input for a concept is proposed, and it is shown to significantly improve the ability of multi-modal models to learn and represent word meanings.
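
The image dispersion measure this entry refers to is, in the cited work, the average pairwise cosine distance among the image representations retrieved for a concept: low dispersion suggests a visually concrete concept for which perceptual input is worth including. A minimal sketch of that computation follows, assuming image vectors are already extracted; the threshold value and variable names are illustrative, not taken from the paper.

# Hedged sketch: image dispersion = mean pairwise cosine distance
# among a concept's image vectors (values below are made up).
import numpy as np
from itertools import combinations

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def image_dispersion(image_vectors):
    """Average pairwise cosine distance over a concept's image vectors."""
    pairs = list(combinations(image_vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

rng = np.random.default_rng(1)
elephant_images = rng.normal(size=(10, 4096))  # stand-in CNN features
dispersion = image_dispersion(elephant_images)
include_visual = dispersion < 0.6  # illustrative threshold, not from the paper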
Visual Information in Semantic Representation
TLDR
A model of multimodal meaning representation based on linguistic and visual context is developed, and experimental results show that a closer correspondence to human data can be obtained by taking the visual modality into account.
Visual Bilingual Lexicon Induction with Transferred ConvNet Features
TLDR
By applying features from a convolutional neural network to the task of image-based bilingual lexicon induction, state-of-the-art performance is obtained on a standard dataset, a 79% relative improvement over previous work that uses bags of visual words based on SIFT features.
Multi- and Cross-Modal Semantics Beyond Vision: Grounding in Auditory Perception
TLDR
This work examines grounding semantic representations in raw auditory data, using standard evaluations for multi-modal semantics, including measures of conceptual similarity and relatedness, as well as a zero-shot learning task that maps between the linguistic and auditory modalities.
Grounding Distributional Semantics in the Visual World
TLDR
This article reviews how methods from computer vision are exploited to tackle the fundamental problem of grounding distributional semantic models, bringing them closer to providing a full-fledged computational account of meaning.
Distributional Semantics in Technicolor
TLDR
While visual models built with state-of-the-art computer vision techniques perform worse than textual models on general tasks, they are as good as or better than textual models at capturing the meaning of words with visual correlates such as color terms, even in a nontrivial task that involves nonliteral uses of such words.
Multimodal Distributional Semantics
TLDR
This work proposes a flexible architecture to integrate text- and image-based distributional information, and shows in a set of empirical tests that the integrated model is superior to the purely text-based approach, and it provides somewhat complementary semantic information with respect to the latter.
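
Several of the references above (the first entry and the last two) build multi-modal representations by combining a linguistic vector with a visual one. A minimal sketch of the simplest such fusion, L2-normalizing each modality and concatenating with a tunable weight, is given below; the dimensions, weighting, and names are assumptions for illustration rather than the exact recipe of any single cited paper.

# Hedged sketch of simple multi-modal fusion by weighted concatenation.
import numpy as np

def l2_normalize(v):
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def fuse(linguistic_vec, visual_vec, alpha=0.5):
    """Concatenate the two modalities; alpha scales the linguistic part."""
    return np.concatenate([alpha * l2_normalize(linguistic_vec),
                           (1.0 - alpha) * l2_normalize(visual_vec)])

rng = np.random.default_rng(2)
dog_text = rng.normal(size=300)    # stand-in skip-gram vector
dog_image = rng.normal(size=4096)  # stand-in aggregated visual vector
dog_multimodal = fuse(dog_text, dog_image)  # 4396-dimensional vector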