We present a novel per-dimension learning rate method for gradient descent called ADADELTA. The method dynamically adapts over time using only first-order information and has minimal computational overhead beyond vanilla stochastic gradient descent. The method requires no manual tuning of a learning rate and appears robust to noisy gradient information …
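A minimal NumPy sketch of the ADADELTA update rule the abstract describes, using the paper's suggested decay rate rho = 0.95 and epsilon = 1e-6; the function name adadelta_update and the in-place accumulator convention are illustrative choices, not from the paper.

```python
import numpy as np

def adadelta_update(grad, acc_grad, acc_delta, rho=0.95, eps=1e-6):
    """One ADADELTA step for a single parameter tensor.

    acc_grad and acc_delta hold decaying averages of squared gradients
    and squared updates; both are modified in place.
    """
    # Decaying average of squared gradients (only first-order information is used).
    acc_grad[:] = rho * acc_grad + (1 - rho) * grad ** 2
    # Per-dimension step: RMS of past updates over RMS of past gradients,
    # so no global learning rate has to be tuned by hand.
    delta = -np.sqrt(acc_delta + eps) / np.sqrt(acc_grad + eps) * grad
    # Decaying average of squared updates.
    acc_delta[:] = rho * acc_delta + (1 - rho) * delta ** 2
    return delta  # add this to the parameters
```

Both accumulators start at zero (e.g. `acc_grad = np.zeros_like(params)`), and the returned delta is added directly to the parameters.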
We introduce DropConnect, a generalization of Dropout (Hinton et al., 2012), for regularizing large fully connected layers within neural networks. When training with Dropout, a randomly selected subset of activations is set to zero within each layer. DropConnect instead sets a randomly selected subset of weights within the network to zero. Each unit …
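A minimal NumPy sketch of that distinction, zeroing weights rather than activations. For brevity the mask here is drawn once per batch, whereas the method as described draws a fresh mask per example; the function name dropconnect_forward and keep probability p are illustrative.

```python
import numpy as np

def dropconnect_forward(x, W, b, p=0.5, rng=np.random):
    """Fully connected layer with DropConnect (training time).

    Dropout would zero entries of the output activations; DropConnect
    zeroes a random subset of the weights themselves.
    """
    mask = rng.random(W.shape) < p   # keep each weight with probability p
    return x @ (mask * W) + b        # linear projection through the masked weights
```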
We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution given by the activities within the pooling region. The approach is …
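A minimal NumPy sketch of that sampling step for a single pooling region, assuming non-negative activations (e.g. after a ReLU) so they normalize into a multinomial distribution; the function name and the uniform fallback for an all-zero region are illustrative choices.

```python
import numpy as np

def stochastic_pool(region, rng=np.random):
    """Sample one activation from a pooling region (training time)."""
    a = region.ravel()
    total = a.sum()
    # Normalize the activities into a multinomial over the region;
    # fall back to uniform if the region is all zeros.
    probs = a / total if total > 0 else np.full(a.size, 1.0 / a.size)
    return a[rng.choice(a.size, p=probs)]
```

At test time the paper replaces sampling with the probability-weighted average of the region's activations.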
We present a hierarchical model that learns image decompositions via alternating layers of convolutional sparse coding and max pooling. When trained on natural images, the layers of our model capture image information in a variety of forms: low-level edges, mid-level edge junctions, high-level object parts and complete objects. To build our model we rely …
Deep neural networks have recently become the gold standard for acoustic modeling in speech recognition systems. The key computational unit of a deep network is a linear projection followed by a point-wise non-linearity, which is typically a logistic function. In this work, we show that we can improve generalization and make training of deep networks faster …
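For reference, a minimal NumPy sketch of the computational unit the abstract describes, a linear projection followed by a point-wise logistic non-linearity; the function name logistic_unit is illustrative.

```python
import numpy as np

def logistic_unit(x, W, b):
    """Linear projection followed by a point-wise logistic non-linearity."""
    z = W @ x + b                     # linear projection
    return 1.0 / (1.0 + np.exp(-z))   # logistic function, applied element-wise
```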
Building robust low-level image representations, beyond edge primitives, is a long-standing goal in vision. In its most basic form, an image is a matrix of intensities. How we should progress from this matrix to stable mid-level representations, useful for high-level vision tasks, remains unclear. Popular feature representations such as SIFT or …
We present a type of Temporal Restricted Boltzmann Machine that defines a probability distribution over an output sequence conditional on an input sequence. It shares the desirable properties of RBMs: efficient exact inference, an exponentially more expressive latent state than HMMs, and the ability to model nonlinear structure and dynamics. We apply our …