VinVL: Revisiting Visual Representations in Vision-Language Models
- Pengchuan Zhang, Xiujun Li, Jianfeng Gao
- Computer Science, IEEE/CVF Conference on Computer Vision and…
- 1 June 2021
This paper develops an improved object detection model that provides object-centric representations of images, feeds the generated visual features into the Transformer-based vision-language fusion model OSCAR, and uses an improved approach, OSCAR+, to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks.
PIQA: Reasoning about Physical Commonsense in Natural Language
- Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi
- Computer Science, AAAI
- 26 November 2019
The task of physical commonsense reasoning and a corresponding benchmark dataset, Physical Interaction: Question Answering (PIQA), are introduced, and an analysis of the dimensions of knowledge that existing models lack is provided, which offers significant opportunities for future research.
Neural Approaches to Conversational AI
This tutorial surveys neural approaches to conversational AI developed in the last few years, presenting a review of state-of-the-art neural approaches and drawing connections between them and traditional symbolic approaches.
Evaluation of Text Generation: A Survey
This paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models.
Understanding the Difficulty of Training Transformers
It is revealed that, for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations and results in significant disturbances in the model output, yet a light dependency limits the potential of model training and can lead to an inferior trained model.
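A minimal numeric sketch of the instability mechanism described above, under toy assumptions (an 8-dimensional linear residual branch with a scalar dependency factor `alpha`; this is illustrative, not the paper's actual analysis): the output disturbance caused by a small weight perturbation grows with how heavily the layer depends on its residual branch.

```python
import random

random.seed(0)
n = 8
x = [random.gauss(0, 1) for _ in range(n)]              # layer input
W = [[random.gauss(0, 0.1) for _ in range(n)] for _ in range(n)]   # toy sub-layer weights
dW = [[random.gauss(0, 1e-3) for _ in range(n)] for _ in range(n)] # small perturbation

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def layer(x, W, alpha):
    # alpha scales how heavily the output depends on the residual
    # (sub-layer) branch relative to the identity skip connection
    h = matvec(W, x)
    return [xi + alpha * hi for xi, hi in zip(x, h)]

def disturbance(alpha):
    # output change when the branch weights are perturbed by dW
    W_pert = [[W[i][j] + dW[i][j] for j in range(n)] for i in range(n)]
    base, pert = layer(x, W, alpha), layer(x, W_pert, alpha)
    return sum((b - p) ** 2 for b, p in zip(base, pert)) ** 0.5

for a in (0.1, 1.0, 10.0):
    print(f"alpha={a}: output disturbance {disturbance(a):.2e}")
```

Since the perturbed output differs from the base output by `alpha * (dW @ x)`, the disturbance scales linearly with the branch dependency, matching the intuition that heavy residual-branch dependency amplifies small parameter changes.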
Efficient Self-supervised Vision Transformers for Representation Learning
This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning and proposes a new pre-training task of region matching, which allows the model to capture fine-grained region dependencies and, as a result, improves the quality of the learned vision representations.
Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding
- Pengchuan Zhang, Xiyang Dai, Jianfeng Gao
- Computer Science, IEEE/CVF International Conference on Computer…
- 29 March 2021
A comprehensive empirical study shows that the new ViT significantly outperforms several strong baselines, including the existing ViT models and their ResNet counterparts, and the Pyramid Vision Transformer from a concurrent work, on a range of vision tasks, including image classification, object detection, and segmentation.
Focal Self-attention for Local-Global Interactions in Vision Transformers
A new variant of Vision Transformer models, called Focal Transformer, is proposed, which achieves superior performance over the state-of-the-art (SoTA) vision Transformers on a range of public image classification and object detection benchmarks.
Deep Learning Based Text Classification: A Comprehensive Review
A comprehensive review of more than 150 deep learning based models for text classification developed in recent years is provided, and their technical contributions, similarities, and strengths are discussed.
Soloist: Building Task Bots at Scale with Transfer Learning and Machine Teaching
- Baolin Peng, Chunyuan Li, Jinchao Li, Shahin Shayandeh, Lars Lidén, Jianfeng Gao
- Computer Science, Transactions of the Association for Computational…
- 1 August 2021
A new method that uses transfer learning and machine teaching to build task bots at scale, Soloist, is presented; it parameterizes classical modular task-oriented dialog systems using a Transformer-based auto-regressive language model, subsuming the different dialog modules into a single neural model.
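A hedged sketch of the idea of subsuming dialog modules into one auto-regressive model: the modules' intermediate outputs (belief state, database result, response) are flattened into a single text sequence the LM learns to generate. The field names and separators below are illustrative assumptions, not Soloist's actual serialization format.

```python
def to_training_sequence(history, belief_state, db_result, response):
    """Flatten dialog history, belief state, DB result, and response into
    one text sequence so a single auto-regressive LM covers all modules."""
    parts = [
        "history: " + " | ".join(history),
        "belief: " + "; ".join(f"{k}={v}" for k, v in belief_state.items()),
        "db: " + db_result,
        "response: " + response,
    ]
    return " <sep> ".join(parts)

seq = to_training_sequence(
    ["user: book a table for two tonight"],
    {"restaurant-people": "2", "restaurant-time": "tonight"},
    "3 matches",
    "Sure, which restaurant would you like?",
)
print(seq)
```

At inference time such a model generates the belief state, consults the database, and generates the response as successive segments of one sequence, which is what lets a single neural model replace the classical modular pipeline.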