Video Crowd Localization With Multifocus Gaussian Neighborhood Attention and a Large-Scale Benchmark

Haopeng Li, Lingbo Liu, Kunlin Yang, Shinan Liu, Junyu Gao, Bin Zhao, Rui Zhang, and Jun Hou. IEEE Transactions on Image Processing.
Video crowd localization is a crucial yet challenging task that aims to estimate the exact locations of human heads in crowded videos. To model the spatio-temporal dependencies of human mobility, we propose a multi-focus Gaussian neighborhood attention (GNA), which can effectively exploit long-range correspondences while maintaining the spatial topological structure of the input videos. In particular, our GNA can also capture the scale variation of human heads well using the equipped multi…
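The summary above describes attention whose scores are modulated by a Gaussian over spatial distance, so each query focuses on a spatial neighborhood. A minimal numpy sketch of that idea follows; the function name, the `sigma` bandwidth, and the toy grid are illustrative assumptions, not the paper's actual multi-focus formulation.

```python
import numpy as np

def gaussian_neighborhood_attention(q, k, v, coords, sigma=2.0):
    """Attention with logits biased by a Gaussian over spatial distance.

    q, k, v: (N, d) token features; coords: (N, 2) spatial positions.
    sigma is a hypothetical bandwidth; the paper uses multiple foci/scales.
    """
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                       # content scores (N, N)
    dist2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
    logits = logits - dist2 / (2 * sigma ** 2)          # Gaussian spatial bias
    w = np.exp(logits - logits.max(-1, keepdims=True))  # stable softmax rows
    w /= w.sum(-1, keepdims=True)
    return w @ v

# toy example: 4 tokens on a 2x2 grid
rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((4, 8))
coords = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
out = gaussian_neighborhood_attention(q, k, v, coords, sigma=1.0)
print(out.shape)  # (4, 8)
```

A multi-focus variant would run this with several `sigma` values in parallel and merge the heads, which is one way such a mechanism could handle head-scale variation.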

DR.VIC: Decomposition and Reasoning for Video Individual Counting

This work proposes to conduct pedestrian counting from a new perspective, Video Individual Counting (VIC), which counts the total number of individual pedestrians in a given video (each person is counted only once).

Rethinking Counting and Localization in Crowds: A Purely Point-Based Framework

This paper proposes a purely point-based framework for joint crowd counting and individual localization, called Point to Point Network (P2PNet), together with a new metric, density-normalized Average Precision (nAP), for more comprehensive and more precise performance evaluation.
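Evaluating point-based localization requires matching predicted head points to ground-truth points one-to-one. A simplified stand-in for that matching step is sketched below with a greedy nearest-neighbor rule and a fixed pixel threshold; the actual nAP metric normalizes the threshold by local crowd density, which is omitted here for brevity.

```python
import numpy as np

def match_points(pred, gt, thresh):
    """Greedy one-to-one matching of predicted head points to ground truth.

    Each prediction may claim at most one unclaimed GT point within
    `thresh` pixels. Returns (true positives, false positives, false negatives).
    """
    gt_free = np.ones(len(gt), dtype=bool)
    tp = 0
    for p in pred:
        if not gt_free.any():
            break
        d = np.linalg.norm(gt - p, axis=1)
        d[~gt_free] = np.inf          # already-matched GT points are unavailable
        j = d.argmin()
        if d[j] <= thresh:
            gt_free[j] = False
            tp += 1
    fp = len(pred) - tp
    fn = len(gt) - tp
    return tp, fp, fn

pred = np.array([[10.0, 10.0], [30.0, 30.0], [90.0, 90.0]])
gt = np.array([[11.0, 9.0], [31.0, 29.0]])
print(match_points(pred, gt, thresh=5.0))  # (2, 1, 0)
```

Precision and recall follow directly from the returned counts, and sweeping a confidence threshold over ranked predictions yields an average-precision curve.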

Language Models are Few-Shot Learners

GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.

Composition Loss for Counting, Density Map Estimation and Localization in Dense Crowds

A novel approach is proposed that simultaneously solves the problems of counting, density map estimation, and localization of people in a given dense crowd image; it significantly outperforms the state of the art on the new dataset, which is the most challenging to date, with the largest number of crowd annotations across the most diverse set of scenes.

Single-Image Crowd Counting via Multi-Column Convolutional Neural Network

The proposed simple MCNN model outperforms all existing methods, and experiments show that, once trained on one dataset, it can be readily transferred to a new dataset.
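Density-map methods such as MCNN are trained against maps rendered from head annotations, where each head contributes one unit of mass spread by a Gaussian. A minimal sketch of that rendering step follows; it uses a fixed bandwidth `sigma`, whereas the MCNN paper adapts the bandwidth to local head spacing (geometry-adaptive kernels).

```python
import numpy as np

def density_map(points, shape, sigma=4.0):
    """Render head annotations (x, y) as a density map.

    Each head is a Gaussian normalized to unit mass, so the map's
    integral equals the crowd count. Fixed sigma is a simplification.
    """
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    dm = np.zeros(shape)
    for x, y in points:
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        dm += g / g.sum()  # normalize: one unit of mass per annotated head
    return dm

pts = [(12, 20), (40, 35)]
dm = density_map(pts, (64, 64), sigma=3.0)
print(round(dm.sum(), 3))  # 2.0 — the map integrates to the head count
```

Summing the predicted map then recovers the estimated count, which is what makes the density representation transferable across counting datasets.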

A Generalized Loss Function for Crowd Counting and Localization

This paper investigates learning the density map representation through an unbalanced optimal transport problem, and proposes a generalized loss function to learn density maps for crowd counting and localization, proving that pixel-wise L2 loss and Bayesian loss are special cases and suboptimal solutions to the proposed loss function.
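The optimal-transport view above treats the loss as the cost of moving predicted density mass onto annotated head locations. The sketch below implements plain entropic-regularized (balanced) Sinkhorn iterations as an illustration; the paper's generalized loss relaxes the marginal constraints (unbalanced OT), which this simplified version does not do, and `eps` and the 1D toy layout are assumptions.

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, iters=200):
    """Entropic-regularized OT cost between histograms a and b with cost C.

    Balanced Sinkhorn shown for brevity; the generalized loss additionally
    penalizes (rather than enforces) the marginal constraints.
    """
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]   # transport plan
    return (P * C).sum()              # total transport cost

# toy 1D example: predicted density mass vs. two GT head locations
pred_pos = np.array([[0.0], [1.0], [2.0]])
gt_pos = np.array([[0.0], [2.0]])
C = np.abs(pred_pos - gt_pos.T)       # |x_i - y_j| ground cost
a = np.array([0.5, 0.0, 0.5])         # predicted mass per position
b = np.array([0.5, 0.5])              # normalized GT mass per head
print(sinkhorn(a, b, C, eps=0.05))    # ~0: mass already sits on the heads
```

Under this view, pixel-wise L2 corresponds to forbidding any transport at all, which is one intuition for why the paper can recover it as a special (and suboptimal) case.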

Drone-based Joint Density Map Estimation, Localization and Tracking with Space-Time Multi-Scale Attention Network

This paper proposes a space-time multi-scale attention network (STANet) to solve density map estimation, localization, and tracking in dense crowds of video clips captured by drones with arbitrary…

Perspective-Guided Convolution Networks for Crowd Counting

A novel perspective-guided convolution (PGC) for convolutional neural network (CNN) based crowd counting (i.e. PGCNet) is proposed, which aims to overcome the dramatic intra-scene scale variations of people due to the perspective effect.

Locate, Size, and Count: Accurately Resolving People in Dense Crowds via Detection

This work introduces a detection framework for dense crowd counting that eliminates the need for the prevalent density regression paradigm, and shows that LSC-CNN not only achieves better localization than existing density regressors but also outperforms them in counting.

Stand-Alone Self-Attention in Vision Models

The results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox and is especially impactful when used in later layers.
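Stand-alone self-attention replaces a convolution's fixed weights with content-dependent weights computed over a local window. A 1D numpy sketch of that windowed attention follows; the function name, projection matrices, and `window` size are illustrative assumptions rather than the paper's 2D formulation.

```python
import numpy as np

def local_self_attention_1d(x, wq, wk, wv, window=3):
    """Self-attention restricted to a local window (1D sketch).

    Each position attends only to neighbors within `window`, so the
    receptive field matches a convolution's but the mixing weights
    depend on content rather than being fixed filters.
    """
    n, d = x.shape
    q, k, v = x @ wq, x @ wk, x @ wv
    half = window // 2
    out = np.zeros_like(v)
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        logits = q[i] @ k[lo:hi].T / np.sqrt(d)
        w = np.exp(logits - logits.max())  # stable softmax over the window
        w /= w.sum()
        out[i] = w @ v[lo:hi]
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((6, 4))
wq, wk, wv = (rng.standard_normal((4, 4)) for _ in range(3))
y = local_self_attention_1d(x, wq, wk, wv, window=3)
print(y.shape)  # (6, 4)
```

The 2D version used in vision models applies the same idea over k×k pixel windows, which is why it can serve as a drop-in replacement for spatial convolutions in later layers.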

Video Joint Modelling Based on Hierarchical Transformer for Co-summarization

Transformer-based video representation reconstruction is introduced to maximize the high-level similarity between the summary and the original video, and experiments demonstrate the superiority of VJMHT in terms of F-measure and rank-based evaluation.