Publications
Aggregated Residual Transformations for Deep Neural Networks
On the ImageNet-1K dataset, it is empirically shown that, even under the restricted condition of maintaining complexity, increasing cardinality improves classification accuracy, and that it is more effective than going deeper or wider when capacity is increased.
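The complexity bookkeeping behind cardinality can be sketched with a small parameter count: a grouped convolution with C groups splits the channels into C independent paths, dividing the parameter count by C at fixed width, which is what lets cardinality grow without increasing overall complexity. The function name and sizes below are illustrative, not from the paper's code.

```python
def conv_params(c_in, c_out, k=3, groups=1):
    """Parameter count of a k x k convolution with the given number of groups."""
    assert c_in % groups == 0, "channels must divide evenly into groups"
    return (c_in // groups) * c_out * k * k

# A plain 3x3 conv on 128 channels vs. the same conv split into 32 groups:
dense = conv_params(128, 128)                # 147456 parameters
grouped = conv_params(128, 128, groups=32)   # 1/32 of that at the same width
```

The saved parameters can then be spent on widening each path, keeping total complexity fixed while raising cardinality.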
Momentum Contrast for Unsupervised Visual Representation Learning
We present Momentum Contrast (MoCo) for unsupervised visual representation learning. From a perspective on contrastive learning as dictionary look-up, we build a dynamic dictionary with a queue and a moving-averaged encoder.
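The two mechanisms named in the summary can be sketched minimally: a FIFO queue of encoded keys serving as the dictionary, and a momentum (moving-average) update that keeps the key encoder consistent with the slowly evolving query encoder. This is an illustrative toy, not the paper's code; the class name, the elementwise "encoder", and all sizes are assumptions.

```python
import random
from collections import deque

class MoCoSketch:
    def __init__(self, dim=4, queue_size=3, momentum=0.999):
        rng = random.Random(0)
        self.w_q = [rng.gauss(0, 1) for _ in range(dim)]  # query-encoder "weights"
        self.w_k = list(self.w_q)                         # key encoder starts as a copy
        self.m = momentum
        self.queue = deque(maxlen=queue_size)             # dictionary of keys (FIFO)

    def momentum_update(self):
        # w_k <- m * w_k + (1 - m) * w_q : the key encoder trails the query encoder
        self.w_k = [self.m * k + (1 - self.m) * q for k, q in zip(self.w_k, self.w_q)]

    def encode_and_enqueue(self, x):
        key = [w * xi for w, xi in zip(self.w_k, x)]      # toy elementwise "encoder"
        self.queue.append(key)                            # oldest key is evicted when full
        return key
```

The bounded queue decouples dictionary size from batch size, and the momentum update keeps the enqueued keys consistent with one another.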
Holistically-Nested Edge Detection
HED performs image-to-image prediction with a deep model that leverages fully convolutional neural networks and deeply-supervised nets, automatically learning the rich hierarchical representations needed to resolve the challenging ambiguity in edge and object boundary detection.
Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification
It is shown that many of the 3D convolutions can be replaced by low-cost 2D convolutions, suggesting that temporal representation learning on high-level "semantic" features is more useful.
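The cost argument behind this replacement can be made concrete with a quick parameter count: at equal channel widths, a k×k×k 3D kernel carries k times the parameters (and roughly k times the FLOPs) of a k×k 2D kernel, so swapping 3D convolutions for 2D ones at selected layers saves compute. Function names below are illustrative.

```python
def conv2d_params(c_in, c_out, k=3):
    """Parameters in a k x k 2D convolution kernel bank."""
    return c_in * c_out * k * k

def conv3d_params(c_in, c_out, k=3):
    """Parameters in a k x k x k 3D convolution kernel bank."""
    return c_in * c_out * k * k * k

# With k = 3, every replaced 3D layer costs a third of its former parameters.
ratio = conv3d_params(64, 64) / conv2d_params(64, 64)   # 3.0
```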
Deeply-Supervised Nets
The proposed deeply-supervised nets (DSN) method simultaneously minimizes classification error while making the learning process of hidden layers direct and transparent, and extends techniques from stochastic gradient methods to analyze the algorithm.
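The deeply-supervised objective can be sketched as a total loss that adds a weighted "companion" loss on each hidden layer's own prediction to the usual loss on the final output, which is what makes the hidden layers' learning direct. The function names, the MSE choice, and the weight are assumptions for illustration, not the paper's formulation.

```python
def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def deeply_supervised_loss(final_pred, hidden_preds, target, aux_weight=0.3):
    """Final-output loss plus a weighted companion loss per hidden-layer prediction."""
    total = mse(final_pred, target)
    for h in hidden_preds:                  # one companion objective per hidden layer
        total += aux_weight * mse(h, target)
    return total
```

Each hidden layer thus receives a gradient signal directly from the target rather than only through the layers above it.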
Decoupling Representation and Classifier for Long-Tailed Recognition
It is shown that carefully designed losses, sampling strategies, and even complex modules with memory can be outperformed by a straightforward approach that decouples representation learning from classification.
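One variant of the decoupled recipe's second stage (classifier re-training) freezes the learned representation and re-fits only the classifier on a class-balanced resample of the long-tailed data. The sketch below shows just that balanced-sampling step; the function name and sampling details are illustrative assumptions, not the paper's code.

```python
import random

def class_balanced_indices(labels, per_class, seed=0):
    """Sample an equal number of example indices from every class (with replacement)."""
    rng = random.Random(seed)
    by_class = {}
    for i, c in enumerate(labels):
        by_class.setdefault(c, []).append(i)
    picked = []
    for c in sorted(by_class):
        pool = by_class[c]
        picked.extend(rng.choice(pool) for _ in range(per_class))
    return picked
```

Training the classifier on these indices gives head and tail classes equal influence, while the frozen representation still benefits from all of the original data.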
Exploring Randomly Wired Neural Networks for Image Recognition
The results suggest that new efforts focused on designing better network generators may lead to new breakthroughs by exploring less constrained search spaces with more room for novel design.
Rethinking Spatiotemporal Feature Learning For Video Understanding
Interestingly, it was found that 3D convolutions at the top layers of the network contribute more than 3D convolutions at the bottom layers, while also being computationally more efficient, indicating that I3D is better at capturing high-level temporal patterns than low-level motion signals.
PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding
This work aims to facilitate research on 3D representation learning by selecting a suite of diverse datasets and tasks to measure the effect of unsupervised pre-training on a large source set of 3D scenes; the approach improves over recent best results in segmentation and detection across 6 different benchmarks.
An Empirical Study of Training Self-Supervised Vision Transformers
It is revealed that these ViT results are indeed a partial failure, and that they can be improved when training is made more stable; the currently positive evidence, as well as remaining challenges and open questions, are discussed.