What?

A study of the extrapolation behaviour of MLPs, and of GNNs that use those MLPs as building blocks.

Why?

MLPs are bad at extrapolation. However, GNNs that use those MLPs as building blocks handle complex algorithmic tasks well, often extrapolating out of distribution. How is that possible? 🤯

How?

Source: original paper.

As the paper states, it has two main contributions:

First, the authors prove a theorem showing that, outside the range of the training data, a two-layer ReLU MLP quickly converges to a linear function along any direction $t\mathbf{v}$ from the origin, with convergence rate $O(1/t)$.
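
To make the theorem concrete, here is a minimal sketch (my own toy example, not code from the paper): a two-layer ReLU MLP is fit to a non-linear target on $[-1, 1]$, and probing it far outside that range shows the finite-difference slope settling to a constant, i.e. the network becomes linear along that direction.

```python
# Toy sketch of the theorem (my own example, not code from the paper): train a
# two-layer ReLU MLP on a non-linear target inside [-1, 1], then probe it far
# outside the training range, where its predictions become (near) linear.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two-layer ReLU MLP: Linear -> ReLU -> Linear.
mlp = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))

# Training data: x in [-1, 1] with the (arbitrary) non-linear target y = x^2.
x_train = torch.linspace(-1, 1, 256).unsqueeze(1)
y_train = x_train ** 2

opt = torch.optim.Adam(mlp.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(mlp(x_train), y_train)
    loss.backward()
    opt.step()

# Far from the training range the network follows a fixed linear piece:
# the finite-difference slope stops changing as t grows.
with torch.no_grad():
    for t in [2.0, 5.0, 10.0, 50.0]:
        x = torch.tensor([[t], [t + 1.0]])
        y = mlp(x)
        print(f"t = {t:5.1f}  f(t) = {y[0].item():9.2f}  "
              f"local slope = {(y[1] - y[0]).item():.4f}")
```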

As the figure on the right demonstrates, ReLU MLPs can extrapolate well in the linear setting. MAPE stands for Mean Absolute Percentage Error and is used to account for the different ranges of the targets: $\mathrm{MAPE}=\frac{1}{n}\sum_{i=1}^{n}\left\lvert\frac{A_i-F_i}{A_i}\right\rvert$, where $A_i$ and $F_i$ are the ground-truth and predicted values respectively.
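
For completeness, MAPE in code (a small sketch of my own, not from the paper):

```python
# Small helper for the MAPE metric used in the figure (my own sketch, not code
# from the paper): `actual` holds the ground-truth values A_i and `predicted`
# the model outputs F_i.
import numpy as np

def mape(actual: np.ndarray, predicted: np.ndarray) -> float:
    """Mean Absolute Percentage Error, averaged over all samples."""
    return float(np.mean(np.abs((actual - predicted) / actual)))

# Example: a 5% relative error on every sample gives MAPE = 0.05.
print(mape(np.array([1.0, 2.0, 4.0]), np.array([1.05, 2.1, 4.2])))  # ~0.05
```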

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/47075408-6b41-4898-bfdf-a6207b712b4c/Untitled.png

Okay, this does not mean we are doomed. If we choose an appropriate non-linearity, we might extrapolate well!
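
One way to see this in code, reusing the toy setup from the sketch above (again my own example, not the paper's): with an activation that matches the quadratic target, the fitted model is itself a quadratic in $x$, so it keeps tracking the target far outside the training range.

```python
# Follow-up toy sketch (mine, not from the paper), reusing the setup above: if
# the hidden non-linearity matches the quadratic target, the fitted model is
# itself a quadratic in x and keeps tracking the target outside the range.
import torch
import torch.nn as nn

torch.manual_seed(0)

class Square(nn.Module):
    """Quadratic activation, chosen to match the target y = x^2."""
    def forward(self, x):
        return x ** 2

mlp_sq = nn.Sequential(nn.Linear(1, 64), Square(), nn.Linear(64, 1))

x_train = torch.linspace(-1, 1, 256).unsqueeze(1)
y_train = x_train ** 2

opt = torch.optim.Adam(mlp_sq.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(mlp_sq(x_train), y_train)
    loss.backward()
    opt.step()

# Unlike the ReLU MLP above, predictions stay close to t^2 well outside [-1, 1].
with torch.no_grad():
    for t in [2.0, 5.0, 10.0]:
        pred = mlp_sq(torch.tensor([[t]])).item()
        print(f"t = {t:4.1f}  prediction = {pred:8.2f}  target = {t**2:8.2f}")
```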

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/1bb5f046-e3de-4723-9878-7f8388def911/Untitled.png

This ^^^ figure is a bridge between extrapolation in MLPs and extrapolation in GNNs. It hints at how graph nets might extrapolate so well on some complicated tasks.

Second, the authors introduce the linear algorithmic alignment hypothesis, an extrapolation version of their earlier algorithmic alignment framework. That framework suggests that GNNs can interpolate well if their MLP modules are "aligned" with easy-to-learn (possibly non-linear) functions; for extrapolation, the MLP modules need to align with linear functions. The intuition behind algorithmic alignment comes from comparing a dynamic-programming algorithm (e.g. Bellman-Ford) with a GNN update step:

$d[k][u] = \min_{v\in\mathcal{N}(u)}\big(d[k-1][v]+w(v,u)\big)$

Bellman-Ford step

$h_u^{(k)} = \min_{v \in \mathcal{N}(u)} \text{MLP}^{(k)}\big(h_u^{(k-1)}, h_v^{(k-1)}, w_{(v,u)}\big)$

GNN forward step

When the aggregator is a min, the MLP does not need to learn the min itself → the MLP only has to learn the linear relaxation step $d[k-1][v]+w(v,u)$ → it extrapolates well.
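
Here is a schematic sketch of that correspondence (my own toy code, not the authors' implementation; the names `bellman_ford_step` and `MinAggGNNLayer` are made up): one Bellman-Ford relaxation step next to one GNN layer with a min aggregator, where the MLP only has to model the linear part $d[k-1][v] + w(v,u)$.

```python
# Toy sketch of the alignment (my own code, not the authors'): a Bellman-Ford
# relaxation step and a min-aggregation GNN layer share the same structure.
import torch
import torch.nn as nn

def bellman_ford_step(d, w, neighbors):
    """One relaxation step (assumes every node has at least one incoming edge).
    d: dict node -> current distance estimate,
    w: dict (v, u) -> edge weight,
    neighbors: dict u -> list of nodes v with an edge (v, u)."""
    return {u: min(d[v] + w[(v, u)] for v in neighbors[u]) for u in d}

class MinAggGNNLayer(nn.Module):
    """One GNN layer with a min aggregator; the MLP plays the role of the
    per-edge relaxation d[v] + w(v, u), which is linear and hence easy to
    extrapolate."""
    def __init__(self, hidden_dim):
        super().__init__()
        # Input: [h_u, h_v, w_(v,u)] -> updated representation for node u.
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim + 1, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, h, edges, edge_weight):
        """h: [num_nodes, hidden_dim]; edges: list of (v, u) pairs;
        edge_weight: dict (v, u) -> float."""
        num_nodes = h.shape[0]
        incoming = {u: [] for u in range(num_nodes)}
        for v, u in edges:
            w = torch.tensor([edge_weight[(v, u)]])
            incoming[u].append(self.mlp(torch.cat([h[u], h[v], w])))
        # Element-wise min over incoming messages; keep the old state if a node
        # has no incoming edge.
        return torch.stack([
            torch.stack(incoming[u]).min(dim=0).values if incoming[u] else h[u]
            for u in range(num_nodes)
        ])
```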