CounTR: Transformer-based Generalised Visual Counting

Chang Liu, Yujie Zhong, Andrew Zisserman, Weidi Xie
In this paper, we consider the problem of generalised visual object counting, with the goal of developing a computational model for counting the number of objects from arbitrary semantic categories, using an arbitrary number of “exemplars”, i.e. zero-shot or few-shot counting. To this end, we make the following four contributions: (1) We introduce a novel transformer-based architecture for generalised visual object counting, termed Counting TRansformer (CounTR), which explicitly captures the…
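Generalised counters of this kind typically regress a per-pixel density map whose integral over the image gives the object count. A minimal sketch of that counting step (the function name is illustrative, not the authors' API):

```python
# Density-map counting paradigm: the model predicts a per-pixel density
# map whose total mass (sum over all pixels) is the object count.

def count_from_density_map(density):
    """Return the predicted count: the sum over all density values."""
    return sum(sum(row) for row in density)

# A toy 4x4 density map with mass concentrated around three objects.
density = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4],
]
print(round(count_from_density_map(density)))  # 3
```

In practice the density map is a real-valued network output and the sum is rounded (or thresholded) to obtain the final count.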


A Low-Shot Object Counting Network With Iterative Prototype Adaptation

A Low-shot Object Counting network with iterative prototype Adaptation (LOCA) is proposed, which iteratively fuses the exemplar shape and appearance queries with image features; it achieves state-of-the-art results in zero-shot scenarios while demonstrating better generalization capabilities.

Masked Autoencoders Are Scalable Vision Learners

This paper develops an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens.
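The key mechanism in that asymmetric design is random masking: only a small visible subset of patch indices is passed to the encoder, and the decoder later fills in mask tokens for the rest. A minimal, illustrative sketch of the masking step (not the official implementation):

```python
import random

# MAE-style random masking: split the patch indices of an image into a
# visible subset (fed to the encoder) and a masked subset (replaced by
# mask tokens at the decoder).

def random_masking(num_patches, mask_ratio, rng):
    """Split patch indices into (visible, masked) sorted index lists."""
    ids = list(range(num_patches))
    rng.shuffle(ids)
    num_keep = int(num_patches * (1 - mask_ratio))
    return sorted(ids[:num_keep]), sorted(ids[num_keep:])

rng = random.Random(0)
# 196 patches is a 14x14 grid, i.e. a 224x224 image with 16x16 patches.
visible, masked = random_masking(num_patches=196, mask_ratio=0.75, rng=rng)
print(len(visible), len(masked))  # 49 147
```

With a 75% mask ratio the encoder sees only a quarter of the tokens, which is what makes MAE pre-training cheap relative to processing the full sequence.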

Drone-Based Object Counting by Spatially Regularized Regional Proposal Network

This work presents CARPK, a new large-scale car parking lot dataset containing nearly 90,000 cars captured from different parking lots; it is the first and largest drone-view dataset that supports object counting, and it provides bounding box annotations.

Learning To Count Everything

This paper presents a novel method that takes a query image together with a few exemplar objects from that image and predicts a density map for the presence of all objects of interest in the query image, and introduces a dataset of 147 object categories containing over 6000 images suitable for the few-shot counting task.

Class-Agnostic Counting

The model achieves competitive performance on cell and crowd counting datasets, and surpasses the state of the art on the car dataset using only three training images; when trained on the entire dataset, it outperforms all previous methods by a large margin.

Exemplar Free Class Agnostic Counting

This work proposes a visual counter that operates in a fully automated setting and does not require any test-time adaptation; it first identifies exemplars from repeating objects in an image, and then counts those repeating objects.

Learning to Count Anything: Reference-less Class-agnostic Counting with Weak Supervision

This work identifies that counting is, at its core, a repetition-recognition task and shows that a general feature space, with global context, is sufficient to enumerate instances in an image without a prior on the object type present.

Represent, Compare, and Learn: A Similarity-Aware Framework for Class-Agnostic Counting

A similarity-aware class-agnostic counting (CAC) framework that jointly learns the representation and the similarity metric is proposed, and its base model BMNet is extended to BMNet+, which models similarity from three aspects; both significantly outperform state-of-the-art CAC approaches.
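Matching-based counters like these score each spatial location of the image feature map against an exemplar feature, producing a similarity map that downstream layers turn into a density map. A toy sketch of that correlation step using cosine similarity (illustrative names, not a specific paper's implementation):

```python
import math

# Correlate one exemplar feature vector against every location of an
# H x W feature map, yielding an H x W similarity map.

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def similarity_map(feature_map, exemplar):
    """feature_map: H x W grid of feature vectors -> H x W similarity grid."""
    return [[cosine(f, exemplar) for f in row] for row in feature_map]

# Toy 2x2 feature map with 2-dimensional features.
exemplar = [1.0, 0.0]
feature_map = [[[1.0, 0.0], [0.0, 1.0]],
               [[0.7, 0.7], [1.0, 0.0]]]
sim = similarity_map(feature_map, exemplar)
```

Locations whose features align with the exemplar score near 1.0, orthogonal features score near 0.0; a learned metric (as in BMNet+) replaces the fixed cosine with a trainable similarity function.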

Few-shot Object Counting with Similarity-Aware Feature Enhancement

This work studies the problem of few-shot object counting, which counts the number of exemplar objects occurring in the query image, and proposes a novel learning block, equipped with a similarity comparison module and a feature enhancement module, to tackle this problem.

Emerging Properties in Self-Supervised Vision Transformers

This paper questions whether self-supervised learning provides Vision Transformers (ViT) with new properties that stand out compared to convolutional networks (convnets), and introduces DINO, a form of self-distillation with no labels, underlining the synergy between DINO and ViTs.

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks while requiring substantially fewer computational resources to train.
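The title's "16x16 words" refers to ViT's input construction: an H x W image is split into non-overlapping P x P patches, each flattened into a token, giving a sequence of (H/P) * (W/P) tokens. A minimal, illustrative sketch of that patchify step:

```python
# Split an H x W image (here a grid of scalars standing in for pixels)
# into non-overlapping P x P patches, each flattened row-by-row into a
# token; a linear projection would then embed each token.

def patchify(image, patch):
    """image: H x W grid -> list of flattened P x P patches."""
    h, w = len(image), len(image[0])
    patches = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            patches.append([image[i + di][j + dj]
                            for di in range(patch)
                            for dj in range(patch)])
    return patches

image = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 image
tokens = patchify(image, patch=2)
print(len(tokens), len(tokens[0]))  # 4 4
```

For a 224x224 image with P=16 this yields 14 * 14 = 196 tokens, the sequence length the transformer encoder operates on.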