Corpus ID: 235694595

PoliTO-IIT Submission to the EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition

Authors: Chiara Plizzari, Mirco Planamente, Emanuele Alberti, Barbara Caputo
In this report, we describe the technical details of our submission to the EPIC-Kitchens-100 Unsupervised Domain Adaptation (UDA) Challenge in Action Recognition. To tackle the domain shift that exists under the UDA setting, we first exploited a recent Domain Generalization (DG) technique called Relative Norm Alignment (RNA). It consists of designing a model able to generalize well to any unseen domain, regardless of whether target data are accessible at training time. Then, in a second… 
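Although the excerpt above is truncated, the core RNA idea it names, balancing the feature-norm contributions of two streams, can be sketched in a few lines. This is a minimal NumPy sketch under stated assumptions: the function name, the ratio-to-one penalty, and the batch layout are illustrative, not the authors' exact formulation.

```python
import numpy as np

def rna_loss(feat_a, feat_b):
    """Relative Norm Alignment (sketch): penalize the squared deviation
    from 1 of the ratio between the mean L2 feature norms of two
    modalities/streams. Details (normalization, weighting) may differ
    from the paper's actual loss."""
    mean_na = np.linalg.norm(feat_a, axis=1).mean()
    mean_nb = np.linalg.norm(feat_b, axis=1).mean()
    return (mean_na / mean_nb - 1.0) ** 2

# Balanced norms give zero loss; imbalanced norms are penalized.
balanced = rna_loss(np.ones((4, 8)), np.ones((4, 8)))
imbalanced = rna_loss(np.ones((4, 8)), 2 * np.ones((4, 8)))
```

Minimizing such a term pushes the two streams toward equal average feature magnitudes, which is the "norm alignment" the abstract refers to.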

References

Rescaling Egocentric Vision: Collection, Pipeline and Challenges for EPIC-KITCHENS-100
This paper presents the pipeline used to extend EPIC-KITCHENS, the largest dataset in egocentric vision, enabling denser and more complete annotations of fine-grained actions as well as new challenges such as action detection and evaluating the "test of time", i.e. whether models trained on data collected in 2018 can generalise to new footage collected two years later.
Rescaling Egocentric Vision
This paper introduces EPIC-KITCHENS-100, the largest annotated egocentric dataset - 100 hrs, 20M frames, 90K actions - of wearable videos capturing long-term unscripted activities in 45 environments, using a novel annotation pipeline that allows denser and more complete annotations of fine-grained actions.


Multi-Modal Domain Adaptation for Fine-Grained Action Recognition
  • Jonathan Munro, D. Damen
  • Computer Science
    2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)
  • 2019
This work proposes a multi-modal approach for adapting action recognition models to novel environments, employing late fusion of the two modalities commonly used in action recognition (RGB and Flow), with multiple domain discriminators, so alignment of modalities is jointly optimised with recognition.
Unsupervised Domain Adaptation Using Feature-Whitening and Consensus Loss
This work proposes domain alignment layers which implement feature whitening for the purpose of matching source and target feature distributions, and leverage the unlabeled target data by proposing the Min-Entropy Consensus loss, which regularizes training while avoiding the adoption of many user-defined hyper-parameters.
Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset
I3D models considerably improve upon the state of the art in action classification, reaching 80.2% on HMDB-51 and 97.9% on UCF-101 after pre-training on Kinetics; the paper introduces a new Two-Stream Inflated 3D ConvNet (I3D) based on 2D ConvNet inflation.
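The "2D ConvNet inflation" this summary mentions can be illustrated with a short sketch. I3D bootstraps 3D filters from pretrained 2D ones by repeating each 2D kernel along a new temporal axis and rescaling by 1/T, so that a temporally constant video reproduces the 2D activations; the helper name below is hypothetical.

```python
import numpy as np

def inflate_2d_to_3d(w2d, t):
    """Inflate a 2D conv filter into a 3D one: repeat it t times along a
    new leading temporal axis and rescale by 1/t, so applying the 3D
    filter to t identical frames reproduces the original 2D response."""
    return np.repeat(w2d[np.newaxis, ...], t, axis=0) / t

# A constant "video" of t identical frames yields the same response
# as the original 2D filter on a single frame.
rng = np.random.default_rng(0)
w2d = rng.random((3, 3))
w3d = inflate_2d_to_3d(w2d, t=4)
frame = rng.random((3, 3))
resp_2d = (w2d * frame).sum()
resp_3d = (w3d * frame[np.newaxis, ...]).sum()  # broadcast frame over time
```

The 1/t rescaling is what keeps the pretrained ImageNet statistics valid after inflation.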
Temporal Attentive Alignment for Large-Scale Video Domain Adaptation
This work proposes Temporal Attentive Adversarial Adaptation Network (TA3N), which explicitly attends to the temporal dynamics using domain discrepancy for more effective domain alignment, achieving state-of-the-art performance on four video DA datasets.
Cross-Domain First Person Audio-Visual Action Recognition through Relative Norm Alignment
This work introduces an audio-visual loss that aligns the contributions from the two modalities by acting on the magnitude of their feature norm representations, which leads to strong results in cross-domain first-person action recognition.
Temporal Segment Networks: Towards Good Practices for Deep Action Recognition
Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident.
Unsupervised Domain Adaptation by Backpropagation
The method performs very well in a series of image classification experiments, achieving adaptation effect in the presence of big domain shifts and outperforming previous state-of-the-art on Office datasets.
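The backpropagation-based adaptation this summary refers to hinges on a gradient reversal layer: identity in the forward pass, negated (and scaled) gradients in the backward pass, so the feature extractor learns to confuse a domain classifier. A conceptual NumPy sketch follows; the class name and the `lam` scaling are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class GradReverse:
    """Gradient Reversal Layer (conceptual sketch): the forward pass is
    the identity; the backward pass multiplies incoming gradients by
    -lam, reversing the domain-classification signal reaching the
    feature extractor."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # activations pass through unchanged

    def backward(self, grad):
        return -self.lam * grad  # reversed, scaled gradient

grl = GradReverse(lam=0.5)
x = np.array([1.0, -2.0, 3.0])
y = grl.forward(x)            # identical to x
g = grl.backward(np.ones(3))  # each gradient entry becomes -0.5
```

In a real training loop this layer sits between the shared features and the domain discriminator, turning the discriminator's loss into an adversarial signal for the features.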
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
This work proposes a novel architecture for multi-modal temporal-binding, i.e. the combination of modalities within a range of temporal offsets, and demonstrates the importance of audio in egocentric vision, on per-class basis, for identifying actions as well as interacting objects.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Deep Residual Learning for Image Recognition
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.