Semantic Pooling for Complex Event Analysis in Untrimmed Videos

@article{Chang2017SemanticPF,
  title={Semantic Pooling for Complex Event Analysis in Untrimmed Videos},
  author={Xiaojun Chang and Yaoliang Yu and Yi Yang and Eric P. Xing},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2017},
  volume={39},
  pages={1617--1632}
}
Pooling plays an important role in generating a discriminative video representation. In this paper, we propose a new semantic pooling approach for challenging event analysis tasks (e.g., event detection, recognition, and recounting) in long untrimmed Internet videos, especially when only a few shots/segments are relevant to the event of interest while many other shots are irrelevant or even misleading. The commonly adopted pooling strategies aggregate the shots indifferently in one way or… 
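The core idea — aggregating shot features with weights driven by each shot's semantic relevance, rather than pooling all shots indifferently — can be sketched as below. This is a minimal illustration under assumptions, not the paper's actual method (the function name, the top-k truncation, and the simple weight normalization are simplifications; the paper's nearly-isotonic formulation is omitted):

```python
import numpy as np

def semantic_pool(shot_features, saliency, top_k=None):
    """Aggregate per-shot features into one video-level vector,
    weighting each shot by a semantic saliency score.

    shot_features: (num_shots, dim) array of per-shot descriptors
    saliency:      (num_shots,) relevance of each shot to the event
    top_k:         if set, keep only the top_k most salient shots
    """
    # Rank shots from most to least salient.
    order = np.argsort(saliency)[::-1]
    if top_k is not None:
        order = order[:top_k]
    # Normalize the retained saliency scores into pooling weights.
    w = saliency[order]
    w = w / w.sum()
    # Saliency-weighted average, so relevant shots dominate the
    # video-level representation and irrelevant shots are down-weighted.
    return (w[:, None] * shot_features[order]).sum(axis=0)
```

Contrast this with plain average pooling, where a few relevant shots would be drowned out by the many irrelevant ones in a long untrimmed video.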
Revealing Event Saliency in Unconstrained Video Collection
TLDR: This paper proposes an unsupervised event saliency revealing framework that first extracts features from multiple modalities to represent each shot in the given video collection, and systematically compares the method to a number of baseline methods on the TRECVID benchmarks.
Grounding Visual Concepts for Zero-Shot Event Detection and Event Captioning
TLDR: This work is the first to define and solve the MEC task, a further step towards understanding video events, and achieves state-of-the-art performance on the TRECVID MEDTest dataset as well as the newly proposed TRECVID-MEC dataset.
Complex Event Detection by Identifying Reliable Shots from Untrimmed Videos
TLDR: A new MIL method is proposed, which simultaneously learns a linear SVM classifier and infers a binary indicator for each instance in order to select reliable training instances from each positive or negative bag.
Complex event detection via attention-based video representation and classification
TLDR: Experimental results show that the proposed single model outperforms state-of-the-art approaches on all three real-world video datasets, demonstrating its effectiveness.
Single-shot Semantic Matching Network for Moment Localization in Videos
TLDR: A lightweight single-shot semantic matching network (SSMN) is presented to avoid the complex computations required to match the query and the segment candidates, and the proposed SSMN can theoretically locate moments of any length.
One-Shot SADI-EPE: A Visual Framework of Event Progress Estimation
TLDR: A visual human action analysis-based framework, one-shot simultaneous action detection and identification (SADI)-EPE, is presented, along with an evaluation criterion for the estimation problem; experiments demonstrated the efficacy of the proposed framework.
Towards More Explainability: Concept Knowledge Mining Network for Event Recognition
TLDR: A concept knowledge mining network (CKMN) for event recognition is proposed that obtains a complete concept representation by mining the existing pattern of each concept at different time granularities with dilated temporal pyramid convolution and temporal self-attention.
The Many Shades of Negativity
TLDR: State-of-the-art deep convolutional neural network features are leveraged in the approach to event detection, and a constraint is introduced to further boost performance.
ZSTAD: Zero-Shot Temporal Activity Detection
TLDR: This work designs an end-to-end deep network based on R-C3D that is optimized with an innovative loss function that considers the embeddings of activity labels and their super-classes while learning the common semantics of seen and unseen activities.

References

Showing 1-10 of 76 references
Complex Event Detection using Semantic Saliency and Nearly-Isotonic SVM
TLDR: A novel notion of semantic saliency is defined that assesses the relevance of each shot to the event of interest and prioritizes the shots according to their saliency scores, since shots that are semantically more salient are expected to contribute more to the final event detector.
Searching Persuasively: Joint Event Detection and Evidence Recounting with Limited Supervision
TLDR: A joint framework is proposed that simultaneously detects high-level events and localizes the indicative concepts of the events; recounting improves detection by pruning irrelevant noisy concepts, while detection directs recounting to the most discriminative evidence.
Video2vec Embeddings Recognize Events When Examples Are Scarce
TLDR: By its ability to improve predictability of present-day audio-visual video features, while at the same time maximizing their semantic descriptiveness, Video2vec leads to state-of-the-art accuracy for both few- and zero-example recognition of events in video.
Multimedia Event Detection Using A Classifier-Specific Intermediate Representation
TLDR: This paper presents a discriminative semantic analysis framework based on a tightly coupled intermediate representation that integrates classifier inference and the latent intermediate representation into a joint framework.
Video event recognition using concept attributes
TLDR: This work proposes using action, scene, and object concepts as semantic attributes for classification of video events in in-the-wild content, such as YouTube videos, and shows how the proposed enhanced event model can further improve zero-shot learning.
A discriminative CNN video representation for event detection
TLDR: This paper proposes using a set of latent concept descriptors as the frame descriptor, which enriches visual information while keeping it computationally affordable, resulting in new state-of-the-art performance in event detection on the largest video datasets.
Enhancing Video Event Recognition Using Automatically Constructed Semantic-Visual Knowledge Base
TLDR: This paper proposes constructing a semantic-visual knowledge base that encodes rich event-centric concepts and their relationships from well-established lexical databases, including FrameNet, as well as concept-specific visual knowledge from ImageNet, and designs an effective system for video event recognition.
Learning latent temporal structure for complex event detection
TLDR: A conditional model trained in a max-margin framework is utilized that automatically discovers discriminative and interesting segments of video while simultaneously achieving competitive accuracy on difficult detection and recognition tasks.
Bag-of-Fragments: Selecting and Encoding Video Fragments for Event Detection and Recounting
TLDR: The bag-of-fragments forms an effective encoding for event detection and provides precise temporally localized event recounting; the authors conclude that fragments matter for video event detection and recounting.
Dynamic Pooling for Complex Event Recognition
TLDR: The problem of adaptively selecting pooling regions for the classification of complex video events is considered, and it is shown that a globally optimal solution to the inference problem can be obtained efficiently through the solution of a series of linear programs.