Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation
This work extends DeepLabv3 by adding a simple yet effective decoder module that refines the segmentation results, especially along object boundaries, and applies depthwise separable convolution to both the Atrous Spatial Pyramid Pooling and decoder modules, resulting in a faster and stronger encoder-decoder network.
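As a rough illustration of the atrous separable convolution named in the title, the sketch below applies a dilated depthwise filter per channel and then a 1x1 pointwise convolution to mix channels. It is a minimal NumPy sketch and not the paper's implementation: the function name `atrous_separable_conv2d`, the zero "same" padding, the odd kernel size, and the omission of bias, normalization, and activation are all assumptions made for brevity.

```python
import numpy as np

def atrous_separable_conv2d(x, depthwise, pointwise, rate=2):
    """Hypothetical helper: dilated depthwise conv followed by a 1x1 conv.

    x:         (H, W, C) input feature map
    depthwise: (k, k, C) one k x k filter per input channel (k assumed odd)
    pointwise: (C, C_out) weights of the 1x1 channel-mixing convolution
    rate:      atrous (dilation) rate, i.e. spacing between filter taps
    """
    h, w, c = x.shape
    k = depthwise.shape[0]
    pad = rate * (k - 1) // 2  # zero padding so the output keeps size H x W
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((h, w, c))
    for i in range(k):
        for j in range(k):
            # Taps are `rate` pixels apart; each channel uses its own weight.
            window = xp[i * rate:i * rate + h, j * rate:j * rate + w, :]
            out += window * depthwise[i, j, :]
    # The pointwise step combines channels independently at every pixel.
    return out @ pointwise
```

With a 3x3 filter and `rate=2`, each output pixel sees a 5x5 neighbourhood while using only nine weights per channel, which is the source of the efficiency that the summary refers to.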
Skip-Thought Vectors
We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the
Searching for MobileNetV3
This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art of MobileNets.
Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books
To align movies and books, a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book are proposed.
MovieQA: Understanding Stories in Movies through Question-Answering
The MovieQA dataset, which aims to evaluate automatic story comprehension from both video and text, is introduced and existing QA techniques are extended to show that question-answering with such open-ended semantics is hard.
3D Object Proposals for Accurate Object Class Detection
This method exploits stereo imagery to place proposals in the form of 3D bounding boxes in the context of autonomous driving and outperforms all existing results on all three KITTI object classes.
Panoptic-DeepLab: A Simple, Strong, and Fast Baseline for Bottom-Up Panoptic Segmentation
For the first time, a bottom-up approach delivers state-of-the-art results on panoptic segmentation and performs on par with several top-down approaches on the challenging COCO dataset.
Spatially Adaptive Computation Time for Residual Networks
Experimental results show that this model improves the computational efficiency of Residual Networks on the challenging ImageNet classification and COCO object detection datasets, and that its computation time maps on the visual saliency dataset cat2000 correlate surprisingly well with human eye fixation positions.
Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation
This paper factorizes 2D self-attention into two 1D self-attentions, a novel building block that one could stack to form axial-attention models for image classification and dense prediction, and achieves state-of-the-art results on Mapillary Vistas and Cityscapes.
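The factorization described above can be sketched as one 1D self-attention along the height axis followed by one along the width axis. This is a simplified, single-head NumPy sketch under stated assumptions: the names `attention_1d` and `axial_attention` are hypothetical, the projection matrices are plain `C x C` arrays, and positional terms, multiple heads, and residual connections are left out.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_1d(x, wq, wk, wv):
    """Self-attention over the second-to-last axis of x, shape (..., L, C)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def axial_attention(x, params_h, params_w):
    """x: (H, W, C). params_h and params_w are (wq, wk, wv) tuples."""
    # Height axis: every column becomes a sequence of length H.
    cols = np.swapaxes(x, 0, 1)                # (W, H, C)
    cols = attention_1d(cols, *params_h)
    x = np.swapaxes(cols, 0, 1)                # back to (H, W, C)
    # Width axis: every row is already a sequence of length W.
    return attention_1d(x, *params_w)
```

In this sketch each position attends to H + W positions in total rather than all H x W, yet after both passes information from any pixel can reach any other.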
Searching for Efficient Multi-Scale Architectures for Dense Image Prediction
This work constructs a recursive search space for meta-learning techniques for dense image prediction focused on the tasks of scene parsing, person-part segmentation, and semantic image segmentation and demonstrates that even with efficient random search, this architecture can outperform human-invented architectures.