End-to-End People Detection in Crowded Scenes

  • Russell Stewart, Mykhaylo Andriluka, A. Ng
  • 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
Current people detectors operate either by scanning an image in a sliding-window fashion or by classifying a discrete set of proposals. […] Because we generate predictions jointly, common post-processing steps such as non-maximum suppression are unnecessary. We use a recurrent LSTM layer for sequence generation and train our model end-to-end with a new loss function that operates on sets of detections. We demonstrate the effectiveness of our approach on the challenging task of detecting people in…

People detection in crowded scenes via regional-based convolutional network

This work proposes an end-to-end framework that uses a convolutional network for feature representation, generates candidate proposals online while simultaneously optimizing their locations, and greatly outperforms traditional methods.

End-to-end crowd counting via joint learning local and global count

  • C. Shang, H. Ai, Bo Bai
  • Computer Science
    2016 IEEE International Conference on Image Processing (ICIP)
  • 2016
An end-to-end convolutional neural network architecture is proposed that takes a whole image as its input and directly outputs the counting result, taking advantage of contextual information when predicting both the local and the global count.

Social Scene Understanding: End-to-End Multi-person Action Localization and Collective Activity Recognition

A single architecture is proposed that does not rely on external detection algorithms but rather is trained end-to-end to generate dense proposal maps that are refined via a novel inference scheme.

Multi-Layer Proposal Network for People Counting in Crowded Scene

  • Chao Wang, Yong Zhao
  • Computer Science
    2017 10th International Conference on Intelligent Computation Technology and Automation (ICICTA)
  • 2017
By separately detecting the heads of people, this system avoids the problems caused by occlusion and achieves robust detection results regardless of whether the head faces towards or away from the camera.

Detection in Crowded Scenes: One Proposal, Multiple Predictions

The key to the approach is to let each proposal predict a set of correlated instances rather than a single one, as in previous proposal-based frameworks, which effectively handles the difficulty of detecting highly overlapped objects.

Learning to Filter Object Detections

A filtering network (FNet) is proposed that replaces NMS with a differentiable neural network, allowing joint reasoning over and re-scoring of the generated set of hypotheses per image. The work demonstrates that FNet, a feed-forward network architecture, is able to mimic NMS decisions despite the sequential nature of NMS.
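To make "the sequential nature of NMS" concrete, here is a minimal sketch of classic greedy non-maximum suppression in plain Python. It is not taken from any of the papers listed here; the box format `(x1, y1, x2, y2)`, the helper names `iou` and `greedy_nms`, and the 0.5 threshold are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Whether box i survives depends on every box kept before it,
        # which is what makes the procedure inherently sequential.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical detections: the first two overlap heavily, the third is separate.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

In crowded scenes a fixed threshold like this can suppress a genuine second person standing close to the first, which is one motivation for learned or set-based alternatives.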

People detection in crowded scenes using hierarchical features

We propose a new architecture (based on the Faster R-CNN framework) for people detection. Our model extracts the first, third, and fifth stages of the VGG16 network to form a robust feature map which consists

Detective: An Attentive Recurrent Model for Sparse Object Detection

This work proposes a training mechanism based on the Hungarian algorithm and a loss that balances the localization and classification tasks, which together allow Detective to achieve promising results on the PASCAL VOC dataset for object detection.
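The matching step behind such set-based training can be sketched as a one-to-one assignment between predicted and ground-truth boxes. The snippet below is only an illustration under stated assumptions: it uses an L1 box distance as the cost (the losses used in the listed papers are not given here) and finds the optimum by brute force over permutations. The Hungarian algorithm reaches the same optimum in polynomial time. The names `l1_distance` and `best_assignment` are hypothetical.

```python
from itertools import permutations


def l1_distance(a, b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(x - y) for x, y in zip(a, b))


def best_assignment(predictions, targets):
    """Minimum-cost one-to-one matching of targets to predictions.

    Returns (match, cost) where match[t] is the index of the prediction
    assigned to target t. Assumes len(predictions) >= len(targets).
    Brute force: only practical for a handful of boxes.
    """
    best_cost, best_match = float("inf"), None
    for perm in permutations(range(len(predictions)), len(targets)):
        cost = sum(l1_distance(predictions[p], targets[t])
                   for t, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_match = cost, perm
    return best_match, best_cost


# Hypothetical boxes in (x1, y1, x2, y2) form.
predictions = [(20, 20, 30, 30), (0, 0, 10, 10), (50, 50, 60, 60)]
targets = [(1, 1, 11, 11), (21, 19, 29, 31)]
match, cost = best_assignment(predictions, targets)
```

Because the cost is computed on the matched set as a whole, the order in which a model emits its predictions does not matter, and unmatched predictions (index 2 here) can be penalized as false positives. For realistic numbers of boxes, a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` would replace the permutation loop.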

End-to-End Object Detection with Transformers

This work presents a new method that views object detection as a direct set prediction problem, and demonstrates accuracy and run-time performance on par with the well-established and highly-optimized Faster RCNN baseline on the challenging COCO object detection dataset.

Progressive End-to-End Object Detection in Crowded Scenes

A progressive prediction method that first selects accepted queries likely to generate true positive predictions, then refines the remaining noisy queries according to the previously accepted predictions, and can significantly boost the performance of query-based detectors in crowded scenes.



Learning People Detectors for Tracking in Crowded Scenes

This paper proposes a novel joint people detector that combines a state-of-the-art single person detector with a detector for pairs of people, which explicitly exploits common patterns of person-person occlusions across multiple viewpoints that are a frequent failure case for tracking in crowded scenes.

Detection and Tracking of Occluded People

This work observes that typical occlusions are due to overlaps between people and proposes a people detector tailored to various occlusion levels, leveraging the fact that person/person occlusions result in very characteristic appearance patterns that can help to improve detection results.

OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

This integrated framework for using Convolutional Networks for classification, localization and detection is the winner of the localization task of the ImageNet Large Scale Visual Recognition Challenge 2013 and obtained very competitive results for the detection and classification tasks.

Pedestrian detection in crowded scenes

Qualitative and quantitative results on a large data set confirm that the core part of the method is the combination of local and global cues via probabilistic top-down segmentation that allows examining and comparing object hypotheses with high precision down to the pixel level.

End-to-end integration of a Convolutional Network, Deformable Parts Model and non-maximum suppression

  • Li Wan, D. Eigen, R. Fergus
  • Computer Science
    2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2015
This work trains a new model using a new structured loss function that considers all bounding boxes within an image, rather than isolated object instances, and enables the non-maximal suppression operation, previously treated as a separate post-processing stage, to be integrated into the model.

People-tracking-by-detection and people-detection-by-tracking

This paper combines the advantages of both detection and tracking in a single framework using a hierarchical Gaussian process latent variable model (hGPLVM) and presents experimental results that demonstrate how this makes it possible to detect and track multiple people in cluttered scenes with reoccurring occlusions.

Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

This paper proposes a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012 -- achieving a mAP of 53.3%.

Multi-cue onboard pedestrian detection

Evaluating different features and classifiers in a sliding-window framework indicates that incorporating motion information improves detection performance significantly and the combination of multiple and complementary feature types can also help improve performance.

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

This work introduces a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals, and further merges RPN and Fast R-CNN into a single network by sharing their convolutional features.

Filtered channel features for pedestrian detection

A unifying framework is proposed in which multiple top-performing pedestrian detectors can be modelled as an intermediate layer filtering low-level features in combination with a boosted decision forest, and different filter families are explored experimentally.