# Provable Memorization via Deep Neural Networks using Sub-linear Parameters

@inproceedings{Park2021ProvableMV, title={Provable Memorization via Deep Neural Networks using Sub-linear Parameters}, author={Sejun Park and Jaeho Lee and Chulhee Yun and Jinwoo Shin}, booktitle={COLT}, year={2021} }

It is known that $\Theta(N)$ parameters are sufficient for neural networks to memorize arbitrary $N$ input-label pairs. By exploiting depth, we show that $\Theta(N^{2/3})$ parameters suffice to memorize $N$ pairs, under a mild condition on the separation of input points. In particular, deeper networks (even of width $3$) are shown to memorize more pairs than shallow networks, which also agrees with the recent line of work on the benefits of depth for function approximation. We also provide…
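As a rough numerical illustration of the gap between the classical $\Theta(N)$ bound and the paper's $\Theta(N^{2/3})$ bound, the sketch below compares the two parameter counts with all constant factors hypothetically set to $1$ (the actual constructions carry different constants and log factors):

```python
# Illustrative parameter-count comparison; constant factors are
# assumed to be 1, which is NOT what the actual constructions give.
def classical_params(n: int) -> int:
    # Classical memorization constructions use Theta(n) parameters.
    return n

def sublinear_params(n: int) -> int:
    # The depth-based construction uses Theta(n^(2/3)) parameters.
    return round(n ** (2 / 3))

for n in (10**3, 10**6, 10**9):
    print(n, classical_params(n), sublinear_params(n))
```

For $N = 10^9$ points the sub-linear construction needs on the order of $10^6$ parameters rather than $10^9$, which is the sense in which depth buys a polynomial saving.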

## 9 Citations

On the Optimal Memorization Power of ReLU Neural Networks

- Computer Science, Mathematics, ArXiv
- 2021

A generalized construction for networks with depth bounded by $1 \le L \le \sqrt{N}$, for memorizing $N$ samples using $\tilde{O}(N/L)$ parameters, and it is proved that having such a large bit complexity is both necessary and sufficient for memorization with a sub-linear number of parameters.

An Exponential Improvement on the Memorization Capacity of Deep Threshold Networks

- Computer Science, NeurIPS
- 2021

This work improves the dependence on $\delta$ from exponential to almost linear, proving that $\tilde{O}(1/\delta + \sqrt{n})$ neurons and weights are sufficient, and proves new lower bounds by connecting memorization in neural networks to the purely geometric problem of separating $n$ points on a sphere using hyperplanes.

Width is Less Important than Depth in ReLU Neural Networks

- Computer Science, ArXiv
- 2022

It is shown that depth plays a more significant role than width in the expressive power of neural networks, and an exact representation of wide and shallow networks using deep and narrow networks which, in certain cases, does not increase the number of parameters over the target network.

When Expressivity Meets Trainability: Fewer than $n$ Neurons Can Work

- Computer Science, NeurIPS
- 2021

It is proved that as long as the width satisfies $m \ge 2n/d$, the network's expressivity is strong; projected gradient methods are expected to converge to KKT points under mild technical conditions, though the rigorous convergence analysis is left to future work.

Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers

- Computer Science, ArXiv
- 2021

This work proposes a formal definition of statistically meaningful (SM) approximation, which requires the approximating network to exhibit good statistical learnability, and shows that overparameterized feedforward neural nets can SM-approximate Boolean circuits with sample complexity depending only polynomially on the circuit size, not the size of the network.

Metric Hypertransformers are Universal Adapted Maps

- Mathematics, Computer Science, ArXiv
- 2022

The MHT models introduced here are able to approximate a broad range of stochastic processes’ kernels, including solutions to SDEs, many processes with arbitrarily long memory, and functions mapping sequential data to sequences of forward rate curves.

A Label Management Mechanism for Retinal Fundus Image Classification of Diabetic Retinopathy

- Computer Science, ArXiv
- 2021

This work proposes a novel label management mechanism (LMM) for DNNs to overcome overfitting on noisy data, and demonstrates that LMM can boost the performance of models and is superior to three state-of-the-art methods.

LiDAR-based Localization using Universal Encoding and Memory-aware Regression

- Computer Science, Pattern Recognition
- 2022

Expressiveness of Neural Networks Having Width Equal or Below the Input Dimension

- Computer Science, Mathematics, ArXiv
- 2020

It is concluded from a maximum principle that for all continuous and monotonic activation functions, universal approximation of arbitrary continuous functions is impossible on sets that coincide with the boundary of an open set plus an inner point of that set.

## References

Showing 1–10 of 38 references

Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

- Computer Science, NeurIPS
- 2019

By exploiting depth, it is shown that 3-layer ReLU networks with $\Omega(\sqrt{N})$ hidden nodes can perfectly memorize most datasets with $N$ points, and it is proved that width $\Theta(\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, giving tight bounds on memorization capacity.

Depth-Width Trade-offs for ReLU Networks via Sharkovsky's Theorem

- Computer Science, ICLR
- 2020

A new connection between DNN expressivity and Sharkovsky's Theorem from dynamical systems is pointed out, which enables characterizing the depth-width trade-offs of ReLU networks for representing functions based on the presence of a generalized notion of fixed points, called periodic points.

Understanding Deep Neural Networks with Rectified Linear Units

- Computer Science, Mathematics, Electron. Colloquium Comput. Complex.
- 2017

The gap theorems hold for smoothly parametrized families of "hard" functions, in contrast to the countable, discrete families known in the literature, and a new lower bound on the number of affine pieces is shown, larger than previous constructions in certain regimes of the network architecture.

Benefits of Depth in Neural Networks

- Computer Science, COLT
- 2016

This result is proved here for a class of nodes termed "semi-algebraic gates", which includes the common choices of ReLU, maximum, indicator, and piecewise polynomial functions, therefore establishing benefits of depth against not just standard networks with ReLU gates, but also convolutional networks with ReLU and maximization gates, sum-product networks, and boosted decision trees.

The Expressive Power of Neural Networks: A View from the Width

- Computer Science, NIPS
- 2017

It is shown that there exist classes of wide networks which cannot be realized by any narrow network whose depth is no more than a polynomial bound, and that narrow networks whose size exceeds the polynomial bound by a constant factor can approximate wide and shallow networks with high accuracy.

Universal Approximation with Deep Narrow Networks

- Computer Science, Mathematics, COLT
- 2019

It is shown that the classical Universal Approximation Theorem holds for neural networks of bounded width and arbitrary depth, for nowhere-differentiable activation functions, and for density on noncompact domains with respect to the $L^p$-norm, and that the width may be reduced to just $n + m + 1$ for "most" activation functions.

Deep, Skinny Neural Networks are not Universal Approximators

- Computer Science, ICLR
- 2019

The topological constraints that the architecture of a neural network imposes on the level sets of all the functions that it is able to approximate are examined.

Understanding deep learning requires rethinking generalization

- Computer Science, ICLR
- 2017

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.

The Power of Depth for Feedforward Neural Networks

- Computer Science, COLT
- 2016

It is shown that there is a simple (approximately radial) function on $\mathbb{R}^d$, expressible by a small 3-layer feedforward neural network, which cannot be approximated by any 2-layer network unless its width is exponential in the dimension.

Why Does Deep and Cheap Learning Work So Well?

- Computer Science, ArXiv
- 2016

It is argued that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine learning, a deep neural network can be more efficient than a shallow one.