Feature normalisation for graph-structured data.
Normalisation seems to help the optimisation of deep neural networks. What is effective for GNNs?
Source: original paper.
Normalisations can be represented in a general form as doing the following operation:
$$ \hat{x}_i = \gamma \frac{x_i-\mu}{\sigma}+\beta, $$
where the only thing that differs is how the empirical mean $\mu$ and standard deviation $\sigma$ are computed. For instance, in BatchNorm you normalise a feature across all points in a batch. In InstanceNorm you do that per data point, e.g. per image. In LayerNorm you normalise across the features of a layer. With graphs, the situation is as in the picture below:
Main idea behind NN normalisations. G is a graph, v is a node, d is a feature dimension of a node.
The GroupNorm paper has a magnificent figure summarising normalisation techniques. However, I find my version above more intuitive for the case of GNNs.
Different normalisation techniques. Source: Group Normalisation paper.
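To make the axes concrete, here is a minimal sketch (assuming PyTorch, a dense tensor of shape (graphs, nodes, features), and omitting the learnable $\gamma$ and $\beta$) showing that the three classic schemes differ only in the dimensions the statistics are reduced over:

```python
import torch

def normalise(x, dims, eps=1e-5):
    """Generic normalisation: subtract the mean and divide by the std over `dims`."""
    mu = x.mean(dim=dims, keepdim=True)
    var = ((x - mu) ** 2).mean(dim=dims, keepdim=True)
    return (x - mu) / torch.sqrt(var + eps)

x = torch.randn(8, 50, 64)                  # 8 graphs, 50 nodes each, 64 features

batch_norm    = normalise(x, dims=(0, 1))   # per feature d, over all nodes in the batch
instance_norm = normalise(x, dims=(1,))     # per feature d, per graph G (over its nodes)
layer_norm    = normalise(x, dims=(2,))     # per node v, over its features
```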
The paper places the normalisation before the node updater nonlinearity:
$$ H^{(k)} = F^{(k)}\big( \text{Norm}(W^{(k)}H^{(k-1)}Q)\big), $$
where $H^{(k)}$ are the node features at layer $k$, $F^{(k)}$ is the node updater nonlinearity, $W^{(k)}$ are the node updater weights and $Q$ is the neighbourhood aggregation matrix, e.g. $Q_{\text{GIN}} = A+I_n+\xi^{(k)}I_n$ for GIN.
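A hedged sketch of where Norm sits in such a layer, assuming a single dense graph, node-major features $H \in \mathbb{R}^{n\times d}$ (so the update reads $Q H W^\top$, the transpose of $W H Q$), and ReLU as $F^{(k)}$; the class and argument names are mine, not the paper's:

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    def __init__(self, in_dim, out_dim, norm, xi_init=0.0):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)            # W^{(k)}
        self.norm = norm                                 # Norm(.), e.g. InstanceNorm or GraphNorm
        self.act = nn.ReLU()                             # F^{(k)}
        self.xi = nn.Parameter(torch.tensor(xi_init))    # xi^{(k)} in Q_GIN

    def forward(self, h, adj):
        # Q_GIN = A + I_n + xi * I_n
        n = adj.size(0)
        q = adj + (1.0 + self.xi) * torch.eye(n, device=adj.device)
        h = q @ self.lin(h)            # transform features with W, then aggregate with Q
        return self.act(self.norm(h))  # Norm is applied before the nonlinearity
```

Any `norm` that maps an $(n, d)$ tensor to a tensor of the same shape can be plugged in here, including the GraphNorm sketched below.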
InstanceNorm works well on GNNs too; however, it sometimes fails, notably on regular graphs, where the per-graph mean itself carries structural information and subtracting it completely throws that information away.
To fix the above, they propose GraphNorm:
$$ \text{GraphNorm}(\hat{h}_{i,j}) = \gamma_j\cdot\frac{\hat{h}_{i,j}-\alpha_j\cdot\mu_j}{\hat{\sigma}_j}+\beta_j, $$
where $\hat{h}_{i,j}$ is the $j$-th feature of node $i$, $\mu_j = \frac{1}{n}\sum_{i=1}^{n}\hat{h}_{i,j}$ and $\hat{\sigma}_j^2 = \frac{1}{n}\sum_{i=1}^{n}\big(\hat{h}_{i,j}-\alpha_j\mu_j\big)^2$ are computed per graph, and $\alpha_j$ is a learnable parameter controlling how much of the mean is subtracted for feature dimension $j$.
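A minimal sketch of that formula, assuming a single graph with node features of shape (num_nodes, feat_dim); for batched graphs you would compute the statistics per graph, e.g. with segment means:

```python
import torch
import torch.nn as nn

class GraphNorm(nn.Module):
    def __init__(self, feat_dim, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(feat_dim))    # gamma_j
        self.beta  = nn.Parameter(torch.zeros(feat_dim))   # beta_j
        self.alpha = nn.Parameter(torch.ones(feat_dim))    # alpha_j, learnable mean scale
        self.eps = eps

    def forward(self, h):                                   # h: (num_nodes, feat_dim)
        mu = h.mean(dim=0, keepdim=True)                    # mu_j over the nodes of this graph
        shifted = h - self.alpha * mu                       # h_{i,j} - alpha_j * mu_j
        sigma = shifted.pow(2).mean(dim=0, keepdim=True).sqrt()  # sigma_j of the shifted features
        return self.gamma * shifted / (sigma + self.eps) + self.beta
```

With $\alpha_j$ fixed to 1 this reduces to per-graph InstanceNorm; making it learnable lets each feature dimension keep part of the mean when it is informative, which is exactly the fix for the regular-graph failure mode above.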