Cross-media Structured Common Space for Multimedia Event Extraction

@article{Li2020CrossmediaSC,
  title={Cross-media Structured Common Space for Multimedia Event Extraction},
  author={Manling Li and Alireza Zareian and Qi Zeng and Spencer Whitehead and Di Lu and Heng Ji and Shih-Fu Chang},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.02472}
}
We introduce a new task, MultiMedia Event Extraction, which aims to extract events and their arguments from multimedia documents. We develop the first benchmark and collect a dataset of 245 multimedia news articles with extensively annotated events and arguments. We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information from textual and visual data into a common embedding space. The structures are aligned across…
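
The abstract only names the approach, so the following is a minimal, illustrative PyTorch sketch of the general idea of a weakly aligned structured common space: node embeddings from a text graph and an image graph are projected into one space, and, because no node-level alignment is annotated, a (text, image) pair is scored by soft attention between the two node sets and trained with a triplet loss against mismatched pairs. This is not the authors' implementation; all names and dimensions are placeholders.

import torch
import torch.nn as nn
import torch.nn.functional as F

class WeaklyAlignedCommonSpace(nn.Module):
    """Illustrative sketch: project text-graph and image-graph node embeddings
    into one space and score a (text, image) pair without node-level labels."""

    def __init__(self, text_dim, img_dim, common_dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, common_dim)
        self.img_proj = nn.Linear(img_dim, common_dim)

    def pair_score(self, text_nodes, img_nodes):
        # text_nodes: (n_t, text_dim), img_nodes: (n_i, img_dim)
        t = F.normalize(self.text_proj(text_nodes), dim=-1)
        v = F.normalize(self.img_proj(img_nodes), dim=-1)
        sim = t @ v.t()                          # (n_t, n_i) cosine similarities
        # Weak alignment: each text node softly attends to its best image node.
        attn = sim.softmax(dim=-1)
        return (attn * sim).sum(dim=-1).mean()   # scalar document-image score

    def triplet_loss(self, text_nodes, pos_img, neg_img, margin=0.2):
        # Pull the paired image above a mismatched one by a margin.
        pos = self.pair_score(text_nodes, pos_img)
        neg = self.pair_score(text_nodes, neg_img)
        return F.relu(margin - pos + neg)
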
Joint Multimedia Event Extraction from Video and Article
  • Brian Chen, Xudong Lin, +5 authors Shih-Fu Chang
  • Computer Science
  • EMNLP
  • 2021
TLDR: This work introduces the new task of Video MultiMedia Event Extraction (VME) and proposes the first self-supervised multimodal event coreference model, which determines coreference between video events and text events without any manually annotated pairs. It also introduces the first multimodal transformer that extracts structured event information jointly from both videos and text documents.
MERL: Multimodal Event Representation Learning in Heterogeneous Embedding Spaces
TLDR: A Multimodal Event Representation Learning framework (MERL) is proposed to learn event representations from text and image modalities simultaneously, and it outperforms a number of unimodal and multimodal baselines.
UAMNer: uncertainty-aware multimodal named entity recognition in social media posts
  • Luping Liu, Meiling Wang, Mozhi Zhang, L. Qing, Xiaohai He
  • Computer Science
  • Applied Intelligence
  • 2021
TLDR: A novel uncertainty-aware framework for multimodal NER (UAMNer) on social media is put forward, which combines visual features with text when the text information is insufficient, thus suppressing noisy information from irrelevant images.
Visual Semantic Role Labeling for Video Understanding
TLDR: This work introduces the VidSitu benchmark, a large-scale video understanding data source with 29K 10-second movie clips richly annotated with a verb and semantic roles every 2 seconds; it provides a comprehensive analysis of the dataset in comparison to other publicly available video understanding benchmarks, presents several illustrative baselines, and evaluates a range of standard video recognition models.
Coreference by Appearance: Visually Grounded Event Coreference Resolution
  • Liming Wang, Shengyu Feng, Xudong Lin, Manling Li, Heng Ji, Shih-Fu Chang
  • CRAC
  • 2021
Event coreference resolution is critical to understanding events in the growing number of online news articles with multiple modalities, including text, video, speech, etc. However, the events and entities…
Abstract Meaning Representation Guided Graph Encoding and Decoding for Joint Information Extraction
TLDR: This work proposes a novel AMR-guided framework for joint information extraction that discovers entities, relations, and events with the help of a pre-trained AMR parser, which converts natural language texts into structured semantic representations.
Cross-Media Keyphrase Prediction: A Unified Framework with Multi-Modality Multi-Head Attention and Image Wordings
TLDR: This work explores the joint effects of texts and images in predicting keyphrases for a multimedia post and proposes a novel Multi-Modality Multi-Head Attention (M3H-Att) mechanism to capture the intricate cross-media interactions.
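
The TLDR describes cross-media interactions captured by multi-modality multi-head attention. The snippet below is a generic sketch of that idea using PyTorch's built-in nn.MultiheadAttention, with text tokens as queries and image regions as keys/values; the dimensions and module names are assumptions for illustration, not the paper's actual M3H-Att module.

import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Generic cross-modal attention: text tokens attend over image regions."""

    def __init__(self, text_dim=768, img_dim=2048, hidden=512, heads=8):
        super().__init__()
        self.text_in = nn.Linear(text_dim, hidden)
        self.img_in = nn.Linear(img_dim, hidden)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)

    def forward(self, text_feats, img_feats):
        # text_feats: (B, T, text_dim), img_feats: (B, R, img_dim)
        q = self.text_in(text_feats)
        kv = self.img_in(img_feats)
        fused, weights = self.attn(q, kv, kv)   # text queries, image keys/values
        return fused, weights                   # (B, T, hidden), (B, T, R)

# Example with random features: 4 posts, 20 tokens, 36 image regions.
mod = CrossModalAttention()
out, w = mod(torch.randn(4, 20, 768), torch.randn(4, 36, 2048))
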
Utilizing Text Structure for Information Extraction
Information Extraction (IE) is one of the important fields of natural language processing (NLP), with the primary goal of creating structured knowledge from unstructured text. In more than two…
Deep Learning Schema-based Event Extraction: Literature Review and Current Trends
TLDR: This paper summarizes the task definition, paradigm, and models of schema-based event extraction and then discusses each of these in detail, focusing on deep learning-based models.
A Comprehensive Survey on Schema-based Event Extraction with Deep Learning

References

Showing 1-10 of 74 references
Improving Event Extraction via Multimodal Integration
TLDR: This paper first discovers visual patterns from large-scale text-image pairs in a weakly-supervised manner and then proposes a multimodal event extraction algorithm where the event extractor is jointly trained with textual features and visual patterns.
Bi-Level Semantic Representation Analysis for Multimedia Event Detection
TLDR: This work proposes a bi-level semantic representation analyzing method that learns weights of semantic representations obtained from different multimedia archives and restrains the negative influence of noisy or irrelevant concepts at the overall concept level.
Event Extraction via Dynamic Multi-Pooling Convolutional Neural Networks
TLDR: This work introduces a word-representation model that captures meaningful semantic regularities for words, and a framework based on a convolutional neural network that captures sentence-level clues.
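
The "dynamic multi-pooling" in the title replaces a single max-pool over the convolution output with separate max-pools over the segments on either side of the candidate trigger position, so positional information survives pooling. A simplified, hypothetical version of that pooling step (trigger classification only) might look like this:

import torch

def dynamic_multi_pooling(conv_feats, trigger_pos):
    """Simplified dynamic multi-pooling for trigger classification.

    conv_feats:  (seq_len, n_filters) convolution output for one sentence
    trigger_pos: index of the candidate trigger token
    Returns a (2 * n_filters,) vector: max over the left and right segments.
    """
    left = conv_feats[: trigger_pos + 1].max(dim=0).values
    right = conv_feats[trigger_pos:].max(dim=0).values
    return torch.cat([left, right], dim=-1)

# Example: 30 tokens, 150 convolution filters, candidate trigger at position 12.
feats = torch.randn(30, 150)
pooled = dynamic_multi_pooling(feats, trigger_pos=12)   # shape: (300,)
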
Exploring Pre-trained Language Models for Event Extraction and Generation
TLDR: This work proposes an event extraction model that overcomes the role-overlap problem by separating argument prediction by role, and a method that automatically generates labeled data by editing prototypes and screens out generated samples by ranking their quality.
Joint Attributes and Event Analysis for Multimedia Event Detection
TLDR: To harness video attributes, an algorithm built on a correlation vector that relates attributes to a target event is proposed; it incorporates video attributes latently as extra information into an event detector learned from multimedia event videos in a joint framework.
Acquiring Topic Features to improve Event Extraction: in Pre-selected and Balanced Collections
TLDR: This paper investigates the use of unsupervised topic models to extract topic features that improve event extraction, both on test data similar to the training data and on more balanced collections, and shows that unsupervised topic modeling yields better results for both collections, especially for the more balanced collection.
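
To make the "topic features" idea concrete, here is a small, generic sketch using scikit-learn (which this paper does not necessarily use): an unsupervised LDA model is fit on a toy corpus and each document's topic distribution becomes a feature vector that an event-extraction classifier could consume alongside its usual lexical and syntactic features.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for an event-extraction training collection.
docs = [
    "troops attacked the village at dawn",
    "the company announced a merger with its rival",
    "protesters marched through the capital on Sunday",
]

# Fit an unsupervised topic model; 10 topics is an arbitrary choice here.
vectorizer = CountVectorizer(stop_words="english")
bow = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=10, random_state=0)
topic_features = lda.fit_transform(bow)   # (n_docs, 10) topic proportions

# topic_features[i] can now be concatenated with the per-sentence features
# of document i before training the event classifier.
print(topic_features.shape)
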
EventNet: A Large Scale Structured Concept Library for Complex Event Detection in Video
TLDR: Extensive experiments on the zero-shot event retrieval task, where no training samples are available, show that the proposed EventNet concept library consistently and significantly outperforms the state of the art by a large margin, up to 207%.
Jointly Multiple Events Extraction via Attention-based Graph Information Aggregation
TLDR: This paper proposes a novel Jointly Multiple Events Extraction framework that jointly extracts multiple event triggers and arguments by introducing syntactic shortcut arcs to enhance information flow and attention-based graph convolution networks to model the graph information.
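
As a rough illustration of the graph part of that idea (not the paper's exact architecture), a single graph-convolution layer that aggregates each token's neighbours along syntactic dependency arcs can be written as follows; the adjacency matrix would come from a dependency parse with self-loops added, and the toy parse below is purely hypothetical.

import torch
import torch.nn as nn

class SyntacticGCNLayer(nn.Module):
    """One GCN layer over a dependency graph: h' = ReLU(A_norm @ h @ W)."""

    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, h, adj):
        # h: (n_tokens, dim); adj: (n_tokens, n_tokens) dependency arcs + self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return torch.relu((adj / deg) @ self.linear(h))

# Toy example: 5 tokens with a chain of dependency arcs.
adj = torch.eye(5)
for head, dep in [(1, 0), (1, 2), (3, 2), (3, 4)]:   # hypothetical parse
    adj[head, dep] = adj[dep, head] = 1.0
layer = SyntacticGCNLayer(dim=64)
out = layer(torch.randn(5, 64), adj)                 # (5, 64)
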
Joint Event Extraction via Structured Prediction with Global Features
TLDR: This work proposes a joint framework based on structured prediction that extracts triggers and arguments together so that the local predictions can be mutually improved, and it incorporates global features that explicitly capture the dependencies of multiple triggers and arguments.
Joint Event Extraction via Recurrent Neural Networks
TLDR: This work proposes to perform event extraction in a joint framework with bidirectional recurrent neural networks, thereby benefiting from the advantages of the two models as well as addressing issues inherent in the existing approaches.
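
For context on what a recurrent joint model's encoder looks like, here is a minimal bidirectional-LSTM tagger for trigger detection (argument prediction would sit on top of the same hidden states). The vocabulary size, dimensions, and label count are placeholders, not the paper's configuration.

import torch
import torch.nn as nn

class BiLSTMTriggerTagger(nn.Module):
    """Minimal BiLSTM sequence tagger for event-trigger detection."""

    def __init__(self, vocab_size=30000, emb_dim=100, hidden=200, n_event_types=34):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.trigger_head = nn.Linear(2 * hidden, n_event_types + 1)  # +1 for "no trigger"

    def forward(self, token_ids):
        # token_ids: (B, T) -> per-token trigger-type logits (B, T, n_event_types + 1)
        states, _ = self.lstm(self.embed(token_ids))
        return self.trigger_head(states)

# Example: batch of 2 sentences, 25 tokens each.
tagger = BiLSTMTriggerTagger()
logits = tagger(torch.randint(0, 30000, (2, 25)))    # (2, 25, 35)
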