Training data-efficient image transformers & distillation through attention
TLDR
This work produces a competitive convolution-free transformer by training on ImageNet only, and introduces a teacher-student strategy specific to transformers that relies on a distillation token ensuring that the student learns from the teacher through attention.
ADVENT: Adversarial Entropy Minimization for Domain Adaptation in Semantic Segmentation
TLDR
This work proposes two novel, complementary methods for unsupervised domain adaptation in semantic segmentation, using (i) an entropy loss and (ii) an adversarial loss, both based on the entropy of the pixel-wise predictions.
MUTAN: Multimodal Tucker Fusion for Visual Question Answering
TLDR
MUTAN is introduced, a multimodal tensor-based Tucker decomposition to efficiently parametrize bilinear interactions between visual and textual representations, and a low-rank matrix-based decomposition is designed to explicitly constrain the interaction rank.
RUBi: Reducing Unimodal Biases in Visual Question Answering
TLDR
RUBi, a new learning strategy to reduce biases in any VQA model, is proposed, which reduces the importance of the most biased examples, i.e. examples that can be correctly classified without looking at the image.
WILDCAT: Weakly Supervised Learning of Deep ConvNets for Image Classification, Pointwise Localization and Segmentation
TLDR
This paper introduces WILDCAT, a deep learning method which jointly aims at aligning image regions for gaining spatial invariance and learning strongly localized features, and significantly outperforms state-of-the-art methods.
Going deeper with Image Transformers
TLDR
This work builds and optimizes deeper transformer networks for image classification and investigates the interplay of architecture and optimization of such dedicated transformers, improving the accuracy of deep transformers.
Cross-Modal Retrieval in the Cooking Context: Learning Semantic Text-Image Embeddings
TLDR
This paper proposes a cross-modal retrieval model aligning visual and textual data (like pictures of dishes and their recipes) in a shared representation space, and describes an effective learning scheme, capable of tackling large-scale problems.
MUREL: Multimodal Relational Reasoning for Visual Question Answering
TLDR
This paper proposes MuRel, a multimodal relational network learned end-to-end to reason over real images, and introduces the MuRel cell, an atomic reasoning primitive that represents interactions between question and image regions with a rich vectorial representation and models region relations with pairwise combinations.
Addressing Failure Prediction by Learning Model Confidence
TLDR
This paper proposes a new target criterion for model confidence, corresponding to the True Class Probability (TCP), and shows how using the TCP is more suited than relying on the classic maximum class probability (MCP) in the context of failure prediction.
ResMLP: Feedforward networks for image classification with data-efficient training
TLDR
ResMLP is a simple residual network that alternates a linear layer in which image patches interact, independently and identically across channels, and a two-layer feed-forward network in which channels interact independently per patch.