Understanding the Covariance Structure of Convolutional Filters

Asher Trockman, Devin Willmott, J. Zico Kolter
Neural network weights are typically initialized at random from univariate distributions, controlling just the variance of individual weights even in highly structured operations like convolutions. Recent ViT-inspired convolutional networks such as ConvMixer and ConvNeXt use large-kernel depthwise convolutions whose learned filters have notable structure; this presents an opportunity to study their empirical covariances. In this work, we first observe that such learned filters have highly…
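The summary above contrasts univariate weight initialization with the structured empirical covariances of learned depthwise filters. As an illustrative sketch (not the paper's exact recipe), a covariance-aware initialization can sample each flattened k×k filter from a multivariate Gaussian whose covariance correlates nearby kernel taps; the helper name and the distance-decay covariance below are our assumptions:

```python
import numpy as np

def sample_filters(cov, num_filters, rng=None):
    """Sample flattened k*k filters from N(0, cov) and reshape to (n, k, k).

    `cov` is a (k*k, k*k) covariance over kernel entries. Illustrative
    sketch of covariance-based filter initialization, not an exact method.
    """
    rng = np.random.default_rng(rng)
    k2 = cov.shape[0]
    k = int(round(k2 ** 0.5))
    flat = rng.multivariate_normal(np.zeros(k2), cov, size=num_filters)
    return flat.reshape(num_filters, k, k)

# Example covariance: entries decay with Manhattan distance between taps,
# so neighboring kernel positions are positively correlated.
k = 3
idx = np.arange(k * k)
yy, xx = np.divmod(idx, k)
dist = np.abs(yy[:, None] - yy[None, :]) + np.abs(xx[:, None] - xx[None, :])
cov = 0.5 ** dist  # Kronecker product of two 1D AR(1) kernels, hence PSD
filters = sample_filters(cov, num_filters=64, rng=0)
```

This factorized covariance is positive semidefinite by construction, so the Gaussian sampling is well defined.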

FlexConv: Continuous Kernel Convolutions with Differentiable Kernel Sizes

FlexNets model long-term dependencies without the use of pooling, achieve state-of-the-art performance on several sequential datasets, outperform recent works with learned kernel sizes, and are competitive with much deeper ResNets on image benchmark datasets.

Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

This work demonstrates that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme, and presents an algorithm for generating such random initial orthogonal convolution kernels.
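The summary mentions an algorithm for generating random initial orthogonal convolution kernels. One well-known construction in this vein is a delta-orthogonal initialization: a random orthogonal matrix at the kernel's center tap and zeros elsewhere. A minimal sketch, assuming equal input and output channel counts (the helper name is ours):

```python
import numpy as np

def delta_orthogonal(out_ch, in_ch, k, rng=None):
    """Delta-orthogonal conv init sketch: an orthogonal channel-mixing
    matrix at the kernel center, zeros at all other spatial taps."""
    assert out_ch == in_ch, "orthogonality requires square channel dims here"
    rng = np.random.default_rng(rng)
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((out_ch, in_ch)))
    q *= np.sign(np.diag(r))  # sign fix for a uniform (Haar) distribution
    w = np.zeros((out_ch, in_ch, k, k))
    w[:, :, k // 2, k // 2] = q
    return w

w = delta_orthogonal(8, 8, 3, rng=0)
```

Because only the center tap is nonzero, the layer initially acts as a pure orthogonal channel mixing, which preserves the norm of the signal as depth grows.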

Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping

Deep Kernel Shaping (DKS) enables SGD training of residual networks without normalization layers on ImageNet and CIFAR-10 classification tasks at speeds comparable to standard ResNetV2 and Wide-ResNet models, with only a small decrease in generalization performance.

Understanding the difficulty of training deep feedforward neural networks

The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.

Patches Are All You Need?

The ConvMixer is proposed, an extremely simple model that is similar in spirit to the ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network.
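The ConvMixer summary hinges on separating the mixing of spatial and channel dimensions: spatial mixing via large-kernel depthwise convolution, channel mixing via 1×1 pointwise convolution. A minimal NumPy sketch of one such block, with the residual connections, activations, and normalization of the actual architecture omitted:

```python
import numpy as np

def depthwise_conv(x, w):
    """Per-channel spatial mixing: 2D conv, stride 1, 'same' padding.
    x: (C, H, W); w: (C, k, k). Illustrative sketch, not the paper's code."""
    C, H, W = x.shape
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += w[:, i, j][:, None, None] * xp[:, i:i + H, j:j + W]
    return out

def pointwise_conv(x, w):
    """1x1 channel mixing. x: (C, H, W); w: (C_out, C)."""
    return np.einsum('oc,chw->ohw', w, x)

# One ConvMixer-style block on a 16-channel, 8x8 feature map:
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8))
y = pointwise_conv(depthwise_conv(x, rng.standard_normal((16, 9, 9))),
                   rng.standard_normal((16, 16)))
```

Note that the depthwise step never mixes channels and the pointwise step never mixes spatial positions, which is exactly the separation the summary describes; 'same' padding keeps size and resolution constant throughout.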

More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity

This study arrives at a recipe for applying extremely large kernels from the perspective of sparsity, which can smoothly scale kernels up to 61x61 with better performance. It proposes Sparse Large Kernel Network (SLaK), a pure CNN architecture equipped with sparse, factorized 51x51 kernels that performs on par with or better than state-of-the-art hierarchical Transformers and modern ConvNet architectures.

ImageNet classification with deep convolutional neural networks

A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.

Scaling Up Your Kernels to 31×31: Revisiting Large Kernel Design in CNNs

It is demonstrated that using a few large convolutional kernels instead of a stack of small kernels could be a more powerful paradigm, and proposed RepLKNet, a pure CNN architecture whose kernel size is as large as 31×31, in contrast to commonly used 3×3.

On the Connection between Local Attention and Dynamic Depth-wise Convolution

It is observed that the depth-wise convolution-based DWNet and its dynamic variants, with lower computational complexity, perform on par with or slightly better than Swin Transformer, an instance of a local vision Transformer, for ImageNet classification, COCO object detection, and ADE20K semantic segmentation.

Can CNNs Be More Robust Than Transformers?

This paper examines the design of Transformers to build pure CNN architectures, without any attention-like operations, that are as robust as, or even more robust than, Transformers; this leads to three highly effective architecture designs for boosting robustness that are simple enough to be implemented in several lines of code.