Corpus ID: 225070268

Provable Memorization via Deep Neural Networks using Sub-linear Parameters

@inproceedings{Park2021ProvableMV,
  title={Provable Memorization via Deep Neural Networks using Sub-linear Parameters},
  author={Sejun Park and Jaeho Lee and Chulhee Yun and Jinwoo Shin},
  booktitle={COLT},
  year={2021}
}
It is known that $\Theta(N)$ parameters are sufficient for neural networks to memorize arbitrary $N$ input-label pairs. By exploiting depth, we show that $\Theta(N^{2/3})$ parameters suffice to memorize $N$ pairs, under a mild condition on the separation of input points. In particular, deeper networks (even with width $3$) are shown to memorize more pairs than shallow networks, which also agrees with the recent line of works on the benefits of depth for function approximation. We also provide… 
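For a rough sense of the parameter budgets involved, the sketch below counts the weights and biases of plain fully-connected ReLU networks and compares a naive one-hidden-unit-per-example shallow memorizer with a width-3 network whose depth grows like $N^{2/3}$. This is only back-of-the-envelope arithmetic under assumptions of ours (dense layers, a toy choice of $N$ and $d$); it is not the paper's construction, and count_mlp_params is a hypothetical helper.

def count_mlp_params(widths):
    # Weights plus biases of a dense feedforward network whose layer
    # sizes (input, hidden layers, output) are given by `widths`.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(widths, widths[1:]))

N, d = 10**6, 10                       # toy dataset size and input dimension (our choice)
shallow = count_mlp_params([d, N, 1])  # one hidden unit per example: Theta(N) parameters
depth = round(N ** (2 / 3))            # depth ~ N^(2/3) with width fixed at 3
deep = count_mlp_params([d] + [3] * depth + [1])
print(f"{shallow:,} vs {deep:,}")      # 12,000,001 vs 120,025 in this toy count

The roughly hundredfold gap in this toy count matches the $N^{1/3}$ factor separating $\Theta(N)$ from $\Theta(N^{2/3})$ at $N = 10^6$; only the asymptotic scaling, not the constants, is meaningful here.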

Citations

On the Optimal Memorization Power of ReLU Neural Networks
TLDR
A generalized construction is given for networks with depth bounded by 1 ≤ L ≤ √N that memorize N samples using Õ(N/L) parameters (see the worked substitution after this list), and it is proved that a correspondingly large bit complexity of the weights is both necessary and sufficient for memorization with a sub-linear number of parameters.
An Exponential Improvement on the Memorization Capacity of Deep Threshold Networks
TLDR
This work improves the dependence on δ from exponential to almost linear, proving that Õ(1/δ + √n) neurons and Õ(d/δ + n) weights are sufficient, and proves new lower bounds by connecting memorization in neural networks to the purely geometric problem of separating n points on a sphere using hyperplanes.
Width is Less Important than Depth in ReLU Neural Networks
TLDR
It is shown that depth plays a more significant role than width in the expressive power of neural networks, and an exact representation of wide, shallow networks by deep, narrow networks is constructed which, in certain cases, does not increase the number of parameters over the target network.
When Expressivity Meets Trainability: Fewer than $n$ Neurons Can Work
TLDR
It is proved that as long as the width m ≥ 2n/d the network's expressivity is strong, and projected gradient methods are expected to converge to KKT points under mild technical conditions, though the rigorous convergence analysis is left to future work.
Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers
TLDR
This work proposes a formal definition of statistically meaningful approximation which requires the approximating network to exhibit good statistical learnability, and shows that overparameterized feedforward neural nets can SM approximate boolean circuits with sample complexity depending only polynomially on the circuit size, not the size of the network.
Metric Hypertransformers are Universal Adapted Maps
TLDR
The MHT models introduced here are able to approximate a broad range of stochastic processes’ kernels, including solutions to SDEs, many processes with arbitrarily long memory, and functions mapping sequential data to sequences of forward rate curves.
A Label Management Mechanism for Retinal Fundus Image Classification of Diabetic Retinopathy
TLDR
This work proposes a novel label management mechanism (LMM) that helps DNNs overcome overfitting on noisy data, and demonstrates that LMM boosts model performance and is superior to three state-of-the-art methods.
Expressiveness of Neural Networks Having Width Equal or Below the Input Dimension
TLDR
It is concluded from a maximum principle that for all continuous and monotonic activation functions, universal approximation of arbitrary continuous functions is impossible on sets that coincide with the boundary of an open set plus an inner point of that set.
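To read the first citation above against the abstract, one can substitute specific depths into its stated Õ(N/L) parameter count; the lines below are our own arithmetic on that expression, not statements quoted from either paper.

L = 1:        Õ(N/L) = Õ(N) parameters (the classical linear-in-N regime)
L = N^{1/3}:  Õ(N/L) = Õ(N^{2/3}) parameters (matching the $\Theta(N^{2/3})$ bound in the abstract above)
L = √N:       Õ(N/L) = Õ(√N) parameters (the deepest case in the stated range 1 ≤ L ≤ √N)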

References

SHOWING 1-10 OF 38 REFERENCES
Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity
TLDR
By exploiting depth, it is shown that 3-layer ReLU networks with $\Omega(\sqrt{N})$ hidden nodes can perfectly memorize most datasets with $N$ points, and it is proved that width $\Theta(\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, proving tight bounds on memorization capacity (a short note relating this to the abstract's bound appears after this reference list).
Depth-Width Trade-offs for ReLU Networks via Sharkovsky's Theorem
TLDR
A new connection between DNN expressivity and Sharkovsky's Theorem from dynamical systems is pointed to, which enables characterizing the depth-width trade-offs of ReLU networks for representing functions based on the presence of a generalized notion of fixed points, called periodic points.
Understanding Deep Neural Networks with Rectified Linear Units
TLDR
The gap theorems hold for smoothly parametrized families of "hard" functions, contrary to countable, discrete families known in the literature, and a new lower bound on the number of affine pieces is shown, larger than previous constructions in certain regimes of the network architecture.
Benefits of Depth in Neural Networks
TLDR
This result is proved here for a class of nodes termed "semi-algebraic gates" which includes the common choices of ReLU, maximum, indicator, and piecewise polynomial functions, therefore establishing benefits of depth against not just standard networks with ReLU gates, but also convolutional networks with ReLU and maximization gates, sum-product networks, and boosted decision trees.
The Expressive Power of Neural Networks: A View from the Width
TLDR
It is shown that there exist classes of wide networks which cannot be realized by any narrow network whose depth is no more than a polynomial bound, and that narrow networks whose size exceeds the polynomial bound by a constant factor can approximate wide and shallow networks with high accuracy.
Universal Approximation with Deep Narrow Networks
TLDR
The classical Universal Approximation Theorem holds for neural networks of arbitrary width and bounded depth; here the dual scenario of bounded width and arbitrary depth is studied, covering nowhere differentiable activation functions, density in noncompact domains with respect to the $L^p$-norm, and how the width may be reduced to just $n + m + 1$ for 'most' activation functions.
Deep, Skinny Neural Networks are not Universal Approximators
TLDR
The topological constraints that the architecture of a neural network imposes on the level sets of all the functions that it is able to approximate are examined.
Understanding deep learning requires rethinking generalization
TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.
The Power of Depth for Feedforward Neural Networks
TLDR
It is shown that there is a simple (approximately radial) function on $\mathbb{R}^d$, expressible by a small 3-layer feedforward neural network, which cannot be approximated by any 2-layer network unless its width is exponential in the dimension.
Why Does Deep and Cheap Learning Work So Well?
TLDR
It is argued that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine learning, a deep neural network can be more efficient than a shallow one.
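As a closing note on the first reference above ("Small ReLU networks are powerful memorizers"): a network with two hidden layers of width on the order of $\sqrt{N}$ already spends about $\sqrt{N}\cdot\sqrt{N} = N$ weights on the connection between them, which is why sub-linear parameter counts push constructions toward depth rather than width. The counts below are our own arithmetic over an assumed dense architecture, not statements from either paper.

width $\Theta(\sqrt{N})$, depth 3: parameters $\approx d\sqrt{N} + \sqrt{N}\cdot\sqrt{N} = \Theta(N)$
width 3, depth $\Theta(N^{2/3})$: parameters $\approx (3\cdot 3 + 3)\,N^{2/3} = \Theta(N^{2/3})$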