• Corpus ID: 225070268

# Provable Memorization via Deep Neural Networks using Sub-linear Parameters

@inproceedings{Park2021ProvableMV,
  title={Provable Memorization via Deep Neural Networks using Sub-linear Parameters},
  author={Sejun Park and Jaeho Lee and Chulhee Yun and Jinwoo Shin},
  booktitle={COLT},
  year={2021}
}
• Published in COLT 26 October 2020
• Computer Science
It is known that $\Theta(N)$ parameters are sufficient for neural networks to memorize arbitrary $N$ input-label pairs. By exploiting depth, we show that $\Theta(N^{2/3})$ parameters suffice to memorize $N$ pairs, under a mild condition on the separation of input points. In particular, deeper networks (even with width $3$) are shown to memorize more pairs than shallow networks, which also agrees with the recent line of work on the benefits of depth for function approximation. We also provide…
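As a point of reference, the classical $\Theta(N)$ baseline mentioned above can be realized by an explicit shallow construction. The sketch below is only an illustration of that baseline (not the paper's deep, sub-linear construction): it memorizes $N$ one-dimensional input-label pairs exactly with a one-hidden-layer ReLU network of $N-1$ hidden units, i.e. $O(N)$ parameters, by matching the slope of a piecewise-linear interpolant on each segment.

```python
def relu(z):
    return max(0.0, z)

def memorize_1d(points):
    """Build a one-hidden-layer ReLU network fitting all (x, y) pairs exactly.

    Illustrative O(N)-parameter baseline (N-1 hidden units), assuming at
    least two points with distinct x values. This is NOT the paper's
    O(N^{2/3}) deep construction, just the classical shallow one.
    """
    pts = sorted(points)                      # sort by input coordinate
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]
    # Slope of the interpolant on each segment [x_i, x_{i+1}].
    slopes = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
              for i in range(len(pts) - 1)]
    # Hinge coefficients: each ReLU unit contributes the *change* in slope.
    coeffs = [slopes[0]] + [slopes[i] - slopes[i - 1]
                            for i in range(1, len(slopes))]

    def f(x):
        # f(x_1) = y_1, and each segment's slope telescopes correctly.
        return ys[0] + sum(c * relu(x - xi) for c, xi in zip(coeffs, xs))

    return f
```

For example, `memorize_1d([(0, 1), (1, 3), (2, 0), (5, 2)])` returns a network that reproduces all four labels exactly; between the data points it linearly interpolates, and outside their range it extrapolates linearly.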

## Figures from this paper

## Citations

On the Optimal Memorization Power of ReLU Neural Networks
• Computer Science, Mathematics
ArXiv
• 2021
A generalized construction is given for networks with depth bounded by $1 \le L \le \sqrt{N}$ that memorize $N$ samples using $\tilde{O}(N/L)$ parameters, and it is proved that having such a large bit complexity is both necessary and sufficient for memorization with a sub-linear number of parameters.
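Read next to the main paper's result, the $\tilde{O}(N/L)$ bound in the blurb above interpolates between the classical linear-parameter regime and the sub-linear regime as depth grows; in particular, choosing $L = N^{1/3}$ recovers an $N^{2/3}$-type rate (this is an arithmetic observation on the stated bound, not a quote from either paper):

```latex
\underbrace{\tilde{O}(N)}_{L=\Theta(1)}
\;\longrightarrow\;
\tilde{O}\!\left(\frac{N}{L}\right)
\;\longrightarrow\;
\underbrace{\tilde{O}\!\left(\sqrt{N}\right)}_{L=\sqrt{N}},
\qquad
L = N^{1/3} \;\Rightarrow\; \tilde{O}\!\left(N^{2/3}\right).
```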
An Exponential Improvement on the Memorization Capacity of Deep Threshold Networks
• Computer Science
NeurIPS
• 2021
This work improves the dependence on $\delta$ from exponential to almost linear, proving that $\tilde{O}(1/\delta + \sqrt{n})$ neurons (with a correspondingly small number of weights) are sufficient, and proves new lower bounds by connecting memorization in neural networks to the purely geometric problem of separating $n$ points on a sphere using hyperplanes.
Width is Less Important than Depth in ReLU Neural Networks
• Computer Science
ArXiv
• 2022
It is shown that depth plays a more significant role than width in the expressive power of neural networks, and an exact representation of wide and shallow networks using deep and narrow networks which, in certain cases, does not increase the number of parameters over the target network.
When Expressivity Meets Trainability: Fewer than $n$ Neurons Can Work
• Computer Science
NeurIPS
• 2021
It is proved that as long as the width satisfies $m \ge 2n/d$, the network's expressivity is strong; projected gradient methods are expected to converge to KKT points under mild technical conditions, though the rigorous convergence analysis is left to future work.
Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers
• Computer Science
ArXiv
• 2021
This work proposes a formal definition of statistically meaningful approximation which requires the approximating network to exhibit good statistical learnability, and shows that overparameterized feedforward neural nets can SM approximate boolean circuits with sample complexity depending only polynomially on the circuit size, not the size of the network.
Metric Hypertransformers are Universal Adapted Maps
• Mathematics, Computer Science
ArXiv
• 2022
The MHT models introduced here are able to approximate a broad range of stochastic processes’ kernels, including solutions to SDEs, many processes with arbitrarily long memory, and functions mapping sequential data to sequences of forward rate curves.
A Label Management Mechanism for Retinal Fundus Image Classification of Diabetic Retinopathy
• Computer Science
ArXiv
• 2021
This work proposes a novel label management mechanism (LMM) for the DNN to overcome overfitting on the noisy data and demonstrates that LMM could boost performance of models and is superior to three state-of-the-art methods.
Expressiveness of Neural Networks Having Width Equal or Below the Input Dimension
• Computer Science, Mathematics
ArXiv
• 2020
It is concluded from a maximum principle that for all continuous and monotonic activation functions, universal approximation of arbitrary continuous functions is impossible on sets that coincide with the boundary of an open set plus an inner point of that set.

## References

Showing 1–10 of 38 references
Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity
• Computer Science
NeurIPS
• 2019
By exploiting depth, it is shown that 3-layer ReLU networks with $\Omega(\sqrt{N})$ hidden nodes can perfectly memorize most datasets with $N$ points, and it is proved that width $\Theta(\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, proving tight bounds on memorization capacity.
Depth-Width Trade-offs for ReLU Networks via Sharkovsky's Theorem
• Computer Science
ICLR
• 2020
A new connection between DNN expressivity and Sharkovsky's Theorem from dynamical systems is pointed to, which enables a characterization of the depth-width trade-offs of ReLU networks for representing functions, based on the presence of a generalized notion of fixed points, called periodic points.
Understanding Deep Neural Networks with Rectified Linear Units
• Computer Science, Mathematics
Electron. Colloquium Comput. Complex.
• 2017
The gap theorems hold for smoothly parametrized families of "hard" functions, contrary to the countable, discrete families known in the literature, and a new lower bound on the number of affine pieces is shown, larger than previous constructions in certain regimes of the network architecture.
Benefits of Depth in Neural Networks
This result is proved here for a class of nodes termed "semi-algebraic gates", which includes the common choices of ReLU, maximum, indicator, and piecewise polynomial functions, therefore establishing benefits of depth against not just standard networks with ReLU gates, but also convolutional networks with ReLU and maximization gates, sum-product networks, and boosted decision trees.
The Expressive Power of Neural Networks: A View from the Width
• Computer Science
NIPS
• 2017
It is shown that there exist classes of wide networks which cannot be realized by any narrow network whose depth is no more than a polynomial bound, and that narrow networks whose size exceeds the polynomial bound by a constant factor can approximate wide and shallow networks with high accuracy.
Universal Approximation with Deep Narrow Networks
• Computer Science, Mathematics
COLT
• 2019
The classical Universal Approximation Theorem holds for neural networks of arbitrary width and bounded depth; here it is extended to deep narrow networks, covering nowhere-differentiable activation functions, density in noncompact domains with respect to the $L^p$-norm, and how the width may be reduced to just $n + m + 1$ for 'most' activation functions.
Deep, Skinny Neural Networks are not Universal Approximators
The topological constraints that the architecture of a neural network imposes on the level sets of all the functions that it is able to approximate are examined.
Understanding deep learning requires rethinking generalization
• Computer Science
ICLR
• 2017
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth-two neural networks already have perfect finite-sample expressivity.
The Power of Depth for Feedforward Neural Networks
• Computer Science
COLT
• 2016
It is shown that there is a simple (approximately radial) function on $\mathbb{R}^d$, expressible by a small 3-layer feedforward neural network, which cannot be approximated by any 2-layer network unless its width is exponential in the dimension.
Why Does Deep and Cheap Learning Work So Well?
• Computer Science
ArXiv
• 2016
It is argued that when the statistical process generating the data is of a certain hierarchical form prevalent in physics and machine learning, a deep neural network can be more efficient than a shallow one.