# Provable Memorization via Deep Neural Networks using Sub-linear Parameters

@inproceedings{Park2021ProvableMV, title={Provable Memorization via Deep Neural Networks using Sub-linear Parameters}, author={Sejun Park and Jaeho Lee and Chulhee Yun and Jinwoo Shin}, booktitle={COLT}, year={2021} }

It is known that $\Theta(N)$ parameters are sufficient for neural networks to memorize arbitrary $N$ input-label pairs. By exploiting depth, we show that $\Theta(N^{2/3})$ parameters suffice to memorize $N$ pairs, under a mild condition on the separation of input points. In particular, deeper networks (even with width $3$) are shown to memorize more pairs than shallow networks, which also agrees with the recent line of works on the benefits of depth for function approximation. We also provide…
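To make the memorization claim concrete, here is a minimal sketch of the classical $\Theta(N)$ baseline that the paper improves on: a two-layer ReLU network with one hidden unit per sample exactly interpolates $N$ one-dimensional input-label pairs. This is *not* the paper's $\Theta(N^{2/3})$ deep construction, just the textbook starting point; the function and variable names are illustrative.

```python
# Classical Theta(N) memorizer for 1-D inputs: f(x) = bias + sum_i c_i * relu(x - x_i)
# with one ReLU "kink" per sample. Each kink adjusts the slope so the piecewise-linear
# function passes through every (x_i, y_i).

def relu(t):
    return max(t, 0.0)

def fit_memorizer(xs, ys):
    """Return (bias, coeffs) so that f interpolates all pairs.
    Assumes xs are sorted and distinct."""
    coeffs = []
    prev_slope = 0.0
    for i in range(len(xs) - 1):
        slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
        coeffs.append(slope - prev_slope)  # kink at x_i changes slope to `slope`
        prev_slope = slope
    return ys[0], coeffs

def predict(bias, coeffs, xs, x):
    # zip truncates to the N-1 kink locations x_0 .. x_{N-2}
    return bias + sum(c * relu(x - xi) for c, xi in zip(coeffs, xs))

xs = [0.0, 1.0, 2.5, 4.0]   # N = 4 arbitrary sorted inputs
ys = [1.0, -2.0, 0.5, 3.0]  # arbitrary labels
bias, coeffs = fit_memorizer(xs, ys)
assert all(abs(predict(bias, coeffs, xs, x) - y) < 1e-9 for x, y in zip(xs, ys))
```

The parameter count here grows linearly in $N$ (one coefficient and one kink per sample); the paper's contribution is showing that with depth, sub-linearly many parameters suffice under a separation condition.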

## 4 Citations

On the Optimal Memorization Power of ReLU Neural Networks

- Computer Science, Mathematics · ArXiv
- 2021

A generalized construction for networks with depth bounded by $1 \le L \le \sqrt{N}$, memorizing $N$ samples using $\tilde{O}(N/L)$ parameters; it is proved that having such a large bit complexity is both necessary and sufficient for memorization with a sub-linear number of parameters.

An Exponential Improvement on the Memorization Capacity of Deep Threshold Networks

- Computer Science, Mathematics · ArXiv
- 2021

This work improves the dependence on $\delta$ from exponential to almost linear, proving that $\tilde{O}(1/\delta + \sqrt{n})$ neurons are sufficient, and proves new lower bounds by connecting memorization in neural networks to the purely geometric problem of separating $n$ points on a sphere using hyperplanes.

Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers

- Computer Science, Mathematics · ArXiv
- 2021

This work proposes a formal definition of statistically meaningful approximation which requires the approximating network to exhibit good statistical learnability, and shows that overparameterized feedforward neural nets can SM approximate boolean circuits with sample complexity depending only polynomially on the circuit size, not the size of the network.

A Label Management Mechanism for Retinal Fundus Image Classification of Diabetic Retinopathy

- Computer Science · ArXiv
- 2021

This work proposes a novel label management mechanism (LMM) for the DNN to overcome overfitting on the noisy data and demonstrates that LMM could boost performance of models and is superior to three state-of-the-art methods.

## References

Showing 1–10 of 38 references.

Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

- Computer Science, Mathematics · NeurIPS
- 2019

By exploiting depth, it is shown that 3-layer ReLU networks with $\Omega(\sqrt{N})$ hidden nodes can perfectly memorize most datasets with $N$ points, and it is proved that width $\Theta(\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, giving tight bounds on memorization capacity.

Depth-Width Trade-offs for ReLU Networks via Sharkovsky's Theorem

- Computer Science, Mathematics · ICLR
- 2020

This work establishes a new connection between DNN expressivity and Sharkovsky's Theorem from dynamical systems, which enables characterizing the depth-width trade-offs of ReLU networks for representing functions based on the presence of a generalized notion of fixed points, called periodic points.

Understanding Deep Neural Networks with Rectified Linear Units

- Mathematics, Computer Science · Electron. Colloquium Comput. Complex.
- 2017

The gap theorems hold for smoothly parametrized families of "hard" functions, in contrast to the countable, discrete families known in the literature, and a new lower bound on the number of affine pieces is shown, larger than previous constructions in certain regimes of the network architecture.

Benefits of Depth in Neural Networks

- Mathematics, Computer Science · COLT
- 2016

This result is proved here for a class of nodes termed "semi-algebraic gates", which includes the common choices of ReLU, maximum, indicator, and piecewise polynomial functions, thereby establishing benefits of depth not just for standard networks with ReLU gates, but also for convolutional networks with ReLU and maximization gates, sum-product networks, and boosted decision trees.

The Expressive Power of Neural Networks: A View from the Width

- Computer Science, Mathematics · NIPS
- 2017

It is shown that there exist classes of wide networks which cannot be realized by any narrow network whose depth is no more than a polynomial bound, and that narrow networks whose size exceeds the polynomial bound by a constant factor can approximate wide and shallow networks with high accuracy.

Universal Approximation with Deep Narrow Networks

- Computer Science, Mathematics · COLT
- 2019

The dual of the classical Universal Approximation Theorem is established for neural networks of bounded width and arbitrary depth, covering nowhere differentiable activation functions and density on noncompact domains with respect to the $L^p$-norm, and showing how the width may be reduced to just $n + m + 1$ for "most" activation functions.

Deep, Skinny Neural Networks are not Universal Approximators

- Computer Science, Mathematics · ICLR
- 2019

The topological constraints that the architecture of a neural network imposes on the level sets of all the functions that it is able to approximate are examined.

Understanding deep learning requires rethinking generalization

- Computer Science · ICLR
- 2017

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth-two neural networks already have perfect finite-sample expressivity.

The Power of Depth for Feedforward Neural Networks

- Computer Science, Mathematics · COLT
- 2016

It is shown that there is a simple (approximately radial) function on $\mathbb{R}^d$, expressible by a small 3-layer feedforward neural network, which cannot be approximated by any 2-layer network unless its width is exponential in the dimension.

Why Does Deep and Cheap Learning Work So Well?

- Mathematics, Physics · ArXiv
- 2016

It is argued that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine learning, a deep neural network can be more efficient than a shallow one.