• Corpus ID: 238419267

# Universal Approximation Under Constraints is Possible with Transformers

@article{Kratsios2021UniversalAU,
title={Universal Approximation Under Constraints is Possible with Transformers},
author={Anastasis Kratsios and Behnoosh Zamanlooy and Tianlin Liu and Ivan Dokmani{\'c}},
journal={ArXiv},
year={2021},
volume={abs/2110.03303}
}
• Published 7 October 2021
• Computer Science, Mathematics
• ArXiv
Many practical problems need the output of a machine learning model to satisfy a set of constraints, K. There are, however, no known guarantees that classical neural networks can exactly encode constraints while simultaneously achieving universality. We provide a quantitative constrained universal approximation theorem which guarantees that for any convex or non-convex compact set K and any continuous function f : ℝⁿ → K, there is a probabilistic transformer F̂ whose randomized outputs all lie…
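To make the abstract's claim concrete, here is a minimal illustration of exact constraint satisfaction — not the paper's probabilistic-transformer construction, but the simplest related idea: composing a model with the Euclidean projection onto a convex set K guarantees every output lies in K. All names here are hypothetical.

```python
import numpy as np

def project_onto_unit_ball(y):
    """Euclidean projection of y onto K = {x : ||x|| <= 1}."""
    norm = np.linalg.norm(y)
    return y if norm <= 1.0 else y / norm

def constrained_model(x, weights):
    """A toy linear 'model' whose output is forced into K by projection."""
    return project_onto_unit_ball(weights @ x)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))
x = rng.normal(size=5)
y = constrained_model(x, W)
print(np.linalg.norm(y) <= 1.0 + 1e-12)  # prints True: output lies in K
```

Note that this projection trick only works cleanly for convex K; handling arbitrary non-convex compact constraint sets while retaining a universal approximation guarantee is precisely what makes the paper's result non-trivial.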
## Citations (2)
Piecewise-Linear Activations or Analytic Activation Functions: Which Produce More Expressive Neural Networks?
• Computer Science
ArXiv
• 2022
The main result demonstrates that deep networks with piecewise-linear activations (e.g. ReLU or PReLU) are fundamentally more expressive than deep feedforward networks with analytic activation functions, quantified by a "separation phenomenon" between the two network classes.
Metric Hypertransformers are Universal Adapted Maps
• Mathematics, Computer Science
ArXiv
• 2022
The MHT models introduced here are able to approximate a broad range of stochastic processes’ kernels, including solutions to SDEs, many processes with arbitrarily long memory, and functions mapping sequential data to sequences of forward rate curves.

## References

SHOWING 1-10 OF 102 REFERENCES
Universal Approximation with Deep Narrow Networks
• Computer Science, Mathematics
COLT 2019
• 2019
The classical Universal Approximation Theorem concerns networks of arbitrary width and bounded depth; this work establishes the dual result for deep narrow networks, covering nowhere-differentiable activation functions and density on noncompact domains with respect to the $L^p$-norm, and shows the width may be reduced to just $n + m + 1$ for `most' activation functions.
Error bounds for approximations with deep ReLU neural networks in $W^{s, p}$ norms
• Computer Science, Mathematics
Analysis and Applications
• 2019
This work constructs, based on a calculus of ReLU networks, artificial neural networks with ReLU activation functions that achieve certain approximation rates and establishes lower bounds for the approximation by ReLU neural networks for classes of Sobolev-regular functions.
Are Transformers universal approximators of sequence-to-sequence functions?
• Computer Science
ICLR
• 2020
It is established that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models.
Minimum Width for Universal Approximation
• Computer Science
ICLR
• 2021
This work provides the first definitive result in this direction for networks using the ReLU activation function: the minimum width required for universal approximation of $L^p$ functions is exactly $\max\{d_x+1,d_y\}$.
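The stated minimum-width formula can be evaluated directly; the snippet below is illustrative arithmetic only, with a made-up helper name.

```python
def min_width(d_x, d_y):
    """Minimum width for universal approximation of L^p functions
    by ReLU networks, per the result above: max{d_x + 1, d_y}."""
    return max(d_x + 1, d_y)

print(min_width(2, 1))   # 3: scalar-valued function of two variables
print(min_width(3, 10))  # 10: output dimension dominates
```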
What is Local Optimality in Nonconvex-Nonconcave Minimax Optimization?
• Computer Science
ICML
• 2020
A proper mathematical definition of local optimality for this sequential setting, local minimax, is proposed, and its properties and existence results are presented.
Extending Lipschitz functions via random metric partitions
• Mathematics
• 2005
Many classical problems in geometry and analysis involve the gluing together of local information to produce a coherent global picture. Inevitably, the difficulty of such a procedure lies at the
Linear extension operators between spaces of Lipschitz maps and optimal transport
• Mathematics
• 2016
Motivated by the notion of $K$-gentle partition of unity introduced in [J. R. Lee and A. Naor, Extending Lipschitz functions via random metric partitions, Invent. Math.
Equivalence of approximation by convolutional neural networks and fully-connected networks
• Computer Science, Mathematics
ArXiv
• 2018
This paper establishes a connection between both network architectures and shows that all upper and lower bounds concerning approximation rates of fully-connected neural networks for functions from an arbitrary function class $\mathcal{C}$ translate to essentially the same bounds for convolutional neural networks.
Optimal Transport: Fast Probabilistic Approximation with Exact Solvers
• Computer Science
J. Mach. Learn. Res.
• 2019
A simple subsampling scheme for fast randomized approximate computation of optimal transport distances based on averaging the exact distances between empirical measures generated from independent samples from the original measures and can be tuned towards higher accuracy or shorter computation times is proposed.
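The subsampling scheme described above can be sketched in a few lines; this is an illustrative reimplementation of the idea, not the paper's code, and the function names are invented. In 1-D with equal sample sizes, the exact empirical Wasserstein-1 distance reduces to the mean absolute difference of sorted values, so no general OT solver is needed.

```python
import numpy as np

def exact_w1_1d(a, b):
    """Exact W1 between two equal-size 1-D empirical measures."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

def subsampled_w1(a, b, m, repeats, rng):
    """Average exact W1 distances between small independent subsamples,
    trading accuracy for computation time via m and repeats."""
    estimates = []
    for _ in range(repeats):
        sa = rng.choice(a, size=m, replace=True)
        sb = rng.choice(b, size=m, replace=True)
        estimates.append(exact_w1_1d(sa, sb))
    return float(np.mean(estimates))

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=10_000)
b = rng.normal(0.5, 1.0, size=10_000)
est = subsampled_w1(a, b, m=200, repeats=20, rng=rng)
print(est)  # roughly 0.5, the true W1 between these two Gaussians
```

Larger `m` reduces the upward bias of the estimate at higher cost per subsample, which is the accuracy/time trade-off the summary refers to.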
Complexity Lower Bounds for Nonconvex-Strongly-Concave Min-Max Optimization
• Computer Science
NeurIPS
• 2021
We provide a first-order oracle complexity lower bound for finding stationary points of min-max optimization problems where the objective function is smooth, nonconvex in the minimization variable, and strongly concave in the maximization variable.