You learnt the universal approximation theorem in Calculus 101
Call it the universal approximation theorem of piecewise constant functions or piecewise linear functions
What makes artificial neural networks of deep learning fame so great? One of the good things about neural networks, they tell you, is their ability to approximate arbitrarily complicated functions. This is called the “universal approximation theorem of neural networks.”
Sounds fancy and mysterious. Do you remember learning it in your Calculus 101 class?
In Calculus 101, when you first learn integration, they introduce the Riemann sum to find the integral of a function f(x). The integral is the area between the curve and the x-axis, and you can approximate that area by breaking it up into a bunch of tall rectangles that go from the x-axis up to the curve. Making the rectangles thinner and thinner makes the sum of their areas approach the integral, provided f(x) is well-behaved.
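To see this numerically, here is a minimal sketch in Python with NumPy (my choice of tool here; the function sin(x) and the interval [0, 2] are just illustrative picks) of a left Riemann sum converging to the integral as the rectangles get thinner:

```python
import numpy as np

def left_riemann_sum(f, a, b, n):
    """Approximate the integral of f on [a, b] with n equal-width rectangles."""
    x = np.linspace(a, b, n, endpoint=False)  # left edge of each rectangle
    width = (b - a) / n
    return np.sum(f(x)) * width

exact = 1.0 - np.cos(2.0)  # integral of sin(x) on [0, 2]
for n in [10, 100, 1000]:
    approx = left_riemann_sum(np.sin, 0.0, 2.0, n)
    print(n, approx, abs(approx - exact))
# The error shrinks (roughly like 1/n) as the rectangles get thinner.
```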
This idea of summing rectangle areas to compute the total area is pretty intuitive, and it is closely related to the idea that you can approximate any f(x) as closely as you want by a piecewise constant function. You follow the function more and more accurately by making the pieces of the piecewise constant function smaller and smaller.
This is the universal approximation theorem of piecewise constant functions.
With a little bit of thought, it should be clear that piecewise constant functions can approximate arbitrary functions not just in 1D, but also when x is 2-dimensional, 3-dimensional, or more generally n-dimensional, n>1. You just break up the input space of x into lots of little n-dimensional cubes and assign each cube a constant value, say the average of f(x) over that cube. As you make the cubes smaller, this piecewise constant approximation, a different constant in every cube, gets close to f(x) everywhere.
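Here is the same kind of sketch for a piecewise constant approximation, in 1D for readability (in n dimensions you would replace the intervals with little cubes). The test function and the cell counts below are arbitrary choices for illustration:

```python
import numpy as np

def piecewise_constant(f, a, b, n_pieces, x):
    """Evaluate at x the piecewise constant approximation of f on [a, b]:
    split [a, b] into n_pieces equal cells, use f at each cell's midpoint."""
    width = (b - a) / n_pieces
    cell = np.clip(np.floor((x - a) / width).astype(int), 0, n_pieces - 1)
    midpoints = a + (cell + 0.5) * width
    return f(midpoints)

f = lambda x: np.sin(3 * x) + 0.5 * x   # an arbitrary test function
x = np.linspace(0.0, 2.0, 10_000)

for n_pieces in [4, 16, 64, 256]:
    err = np.max(np.abs(f(x) - piecewise_constant(f, 0.0, 2.0, n_pieces, x)))
    print(n_pieces, err)
# The worst-case error keeps dropping as the cells shrink -- the 1-D version
# of "a different constant in every little cube".
```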
So far so good.
What about piecewise linear functions?
It should be pretty clear that piecewise linear functions can also approximate any function f(x), usually even better than piecewise constant functions. Instead of little rectangles, picture little trapezoids.
That’s the universal approximation theorem of piecewise linear functions.
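And a sketch of the piecewise linear version, using NumPy's np.interp to connect sampled values of f with straight-line pieces (same illustrative test function as above):

```python
import numpy as np

f = lambda x: np.sin(3 * x) + 0.5 * x
x = np.linspace(0.0, 2.0, 10_000)

for n_pieces in [4, 16, 64, 256]:
    knots = np.linspace(0.0, 2.0, n_pieces + 1)   # breakpoints of the pieces
    approx = np.interp(x, knots, f(knots))        # straight lines between them
    err = np.max(np.abs(f(x) - approx))
    print(n_pieces, err)
# For the same number of pieces, the error is much smaller than with the
# piecewise constant version, and it shrinks faster as the pieces get smaller.
```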
What about neural networks?
A standard feedforward neural network with ReLU activation functions is exactly a piecewise linear function with lots of pieces.
Instead of representing each linear piece of the piecewise linear function separately, the neural network mixes them all together and represents the whole function with parameters called ‘weights and biases’. With the network, you can divide the space into more and smaller pieces by having more ‘neurons’ in width (or in depth, but that’s a bit more subtle). So you get better approximations with more neurons.
So saying “piecewise linear functions can approximate any function” is mathematically the same as saying “neural networks with ReLU activation functions can approximate any function”.
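To make the equivalence concrete, here is a sketch in plain NumPy (no deep learning library, and just one standard construction, not the only one) that picks the weights and biases of a one-hidden-layer ReLU network so that its output is exactly the piecewise linear interpolant of f at some chosen knots, one hidden neuron per linear piece:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_net_from_knots(knots, values):
    """Weights and biases of a one-hidden-layer ReLU network that exactly
    reproduces, on [knots[0], knots[-1]], the piecewise linear function
    interpolating (knots, values)."""
    slopes = np.diff(values) / np.diff(knots)            # slope of each piece
    w1 = np.ones(len(slopes))                            # input -> hidden weights
    b1 = -knots[:-1]                                     # hidden biases: kinks at the knots
    w2 = np.concatenate([[slopes[0]], np.diff(slopes)])  # output weights: slope changes
    b2 = values[0]                                       # output bias: value at the left end
    return w1, b1, w2, b2

def relu_net(x, w1, b1, w2, b2):
    return relu(np.outer(x, w1) + b1) @ w2 + b2

f = lambda x: np.sin(3 * x) + 0.5 * x
knots = np.linspace(0.0, 2.0, 17)                        # 16 linear pieces, 16 hidden neurons
w1, b1, w2, b2 = relu_net_from_knots(knots, f(knots))

x = np.linspace(0.0, 2.0, 10_000)
print(np.max(np.abs(relu_net(x, w1, b1, w2, b2) - np.interp(x, knots, f(knots)))))
# Prints a number at floating-point round-off level: the ReLU network and the
# piecewise linear interpolant are the same function on [0, 2].
```

Notice how the output weights are just the changes in slope from one piece to the next: the network really does mix the pieces together rather than storing each one separately, and adding hidden neurons adds kinks, that is, more and smaller pieces.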
You could have learnt it in Math 101.
So are neural networks special?
Neural networks are not particularly special because of their universal approximation property. Lots of other things have this property, or close variants of it: polynomials, Fourier series, piecewise polynomials, and so on. But when you google “universal approximation theorem”, you only get stuff about artificial neural networks. That’s bad. We should use the same “universal approximation theorem” terminology for the other things students learn: piecewise constant functions, piecewise linear functions, polynomials, splines (of which piecewise linear functions are a special case), Fourier series, etc. Then neural networks wouldn’t seem so mysterious.
Function approximation with neural-network-like architectures may be special for other reasons: for how the piecewise constant or piecewise linear functions are parameterized in a compositional fashion, especially with deep networks, and for how that sometimes makes them more trainable, in ways not yet fully understood.
A piecewise constant function corresponds, in the same way, to a network with binary or threshold (step) activation functions instead of ReLUs.
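In the same spirit, here is a sketch where the hidden units use a step (threshold) activation instead of ReLU, so the network output is exactly a piecewise constant staircase, one hidden unit per jump (again just an illustrative NumPy construction, with the same arbitrary test function):

```python
import numpy as np

def step(z):
    return (z >= 0.0).astype(float)            # binary / threshold activation

f = lambda x: np.sin(3 * x) + 0.5 * x
knots = np.linspace(0.0, 2.0, 17)              # cell edges: 16 cells
levels = f((knots[:-1] + knots[1:]) / 2)       # one constant per cell (midpoint value)

# One hidden unit per interior cell edge; output weights are the jump sizes.
b1 = -knots[1:-1]
w2 = np.diff(levels)
b2 = levels[0]

x = np.linspace(0.0, 2.0, 10_000)
staircase = step(np.outer(x, np.ones(len(b1))) + b1) @ w2 + b2
print(np.max(np.abs(staircase - f(x))))        # the piecewise constant error;
                                               # it shrinks as you add more units
```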