VinVL: Revisiting Visual Representations in Vision-Language Models
- Pengchuan Zhang, Xiujun Li, Jianfeng Gao
- Computer Science · IEEE/CVF Conference on Computer Vision and…
- 1 June 2021
This paper develops an improved object detection model to provide object-centric representations of images, feeds the resulting visual features into OSCAR, a Transformer-based vision-language (VL) fusion model, and uses an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
PIQA: Reasoning about Physical Commonsense in Natural Language
- Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi
- Computer Science · AAAI
- 26 November 2019
The task of physical commonsense reasoning and a corresponding benchmark dataset, Physical Interaction: Question Answering (PIQA), are introduced, and an analysis of the dimensions of knowledge that existing models lack is provided, which offers significant opportunities for future research.
Neural Approaches to Conversational AI
This tutorial surveys neural approaches to conversational AI developed in the last few years, presenting a review of state-of-the-art methods and drawing connections between neural approaches and traditional symbolic approaches.
Evaluation of Text Generation: A Survey
This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.
Understanding the Difficulty of Training Transformers
It is revealed that, for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations and results in significant disturbances in the model output, while a light dependency limits the potential of model training and can lead to an inferior trained model.
Efficient Self-supervised Vision Transformers for Representation Learning
This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning and proposes a new pre-training task of region matching, which allows the model to capture fine-grained region dependencies and, as a result, improves the quality of the learned vision representations.
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
- Pengchuan Zhang, Xiyang Dai, Jianfeng Gao
- Computer Science · IEEE/CVF International Conference on Computer…
- 29 March 2021
A comprehensive empirical study shows that the new ViT significantly outperforms several strong baselines, including the existing ViT models and their ResNet counterparts, and the Pyramid Vision Transformer from a concurrent work, on a range of vision tasks, including image classification, object detection, and segmentation.
Focal Self-attention for Local-Global Interactions in Vision Transformers
A new variant of Vision Transformer models, called Focal Transformer, is proposed, which achieves superior performance over the state-of-the-art (SoTA) vision Transformers on a range of public image classification and object detection benchmarks.
Deep Learning Based Text Classification: A Comprehensive Review
A comprehensive review of more than 150 deep learning based models for text classification developed in recent years is provided, and their technical contributions, similarities, and strengths are discussed.
DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing
This paper presents a new pre-trained language model, DeBERTaV3, which improves the original DeBERTa model by replacing mask language modeling (MLM) with replaced token detection (RTD), a more…