Feature normalisation for graph-structured data.
Normalisation seems to help the optimisation of deep neural networks. What is effective for GNNs?
Source: original paper.
Normalisations can be represented in a general form as doing the following operation:
$$ \hat{x}_i = \gamma \frac{x_i-\mu}{\sigma}+\beta, $$
where the only thing that differs is how the empirical mean $\mu$ and standard deviation $\sigma$ are computed. For instance, in BatchNorm you normalise a feature across all points in a batch. In InstanceNorm you do that per data point, e.g. per image. In LayerNorm you normalise across the features of a layer. With graphs, the situation is as in the picture below:
Main idea behind NN normalisations. G is a graph, v is a node, d is a feature dimension of a node.
The GroupNorm paper has a magnificent figure summarising normalisation techniques. However, I find my version above more intuitive for the case of GNNs.
Different normalisation techniques. Source: Group Normalisation paper.
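To make the axes concrete, here is a minimal sketch (assuming PyTorch, a dense tensor of shape (graphs, nodes, features), and omitting the learnable $\gamma$ and $\beta$) showing that the three classic schemes differ only in the dimensions the statistics are reduced over:

```python
import torch

def normalise(x, dims, eps=1e-5):
    """Generic normalisation: subtract the mean and divide by the std over `dims`."""
    mu = x.mean(dim=dims, keepdim=True)
    var = ((x - mu) ** 2).mean(dim=dims, keepdim=True)
    return (x - mu) / torch.sqrt(var + eps)

x = torch.randn(8, 50, 64)                  # 8 graphs, 50 nodes each, 64 features

batch_norm    = normalise(x, dims=(0, 1))   # per feature d, over all nodes in the batch
instance_norm = normalise(x, dims=(1,))     # per feature d, per graph G (over its nodes)
layer_norm    = normalise(x, dims=(2,))     # per node v, over its features
```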
The paper places the normalisation before the node updater nonlinearity:
$$ H^{(k)} = F^{(k)}\big( \text{Norm}(W^{(k)}H^{(k-1)}Q)\big), $$
where $H^{(k)}$ are the node features at layer $k$, $F^{(k)}$ is the node updater nonlinearity, $W^{(k)}$ are the node updater weights and $Q$ is the neighbourhood aggregation matrix, e.g. $Q_{\text{GIN}} = A+I_n+\xi^{(k)}I_n$ for GIN.
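A hedged sketch of where Norm sits in such a layer, assuming a single dense graph, node-major features $H \in \mathbb{R}^{n\times d}$ (so the update reads $Q H W^\top$, the transpose of $W H Q$), and ReLU as $F^{(k)}$; the class and argument names are mine, not the paper's:

```python
import torch
import torch.nn as nn

class GINLayer(nn.Module):
    def __init__(self, in_dim, out_dim, norm, xi_init=0.0):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)            # W^{(k)}
        self.norm = norm                                 # Norm(.), e.g. InstanceNorm or GraphNorm
        self.act = nn.ReLU()                             # F^{(k)}
        self.xi = nn.Parameter(torch.tensor(xi_init))    # xi^{(k)} in Q_GIN

    def forward(self, h, adj):
        # Q_GIN = A + I_n + xi * I_n
        n = adj.size(0)
        q = adj + (1.0 + self.xi) * torch.eye(n, device=adj.device)
        h = q @ self.lin(h)            # transform features with W, then aggregate with Q
        return self.act(self.norm(h))  # Norm is applied before the nonlinearity
```

Any `norm` that maps an $(n, d)$ tensor to a tensor of the same shape can be plugged in here, including the GraphNorm sketched below.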
InstanceNorm works well on GNNs too; however, it sometimes fails, notably on regular graphs, where the per-graph mean itself carries structural information and subtracting it completely throws that information away.
To fix the above, they propose GraphNorm:
$$ \text{GraphNorm}(\hat{h}_{i,j}) = \gamma_j\cdot\frac{\hat{h}_{i,j}-\alpha_j\cdot\mu_j}{\hat{\sigma}_j}+\beta_j, $$
where $\hat{h}_{i,j}$ is the $j$-th feature of node $i$, $\mu_j = \frac{1}{n}\sum_{i=1}^{n}\hat{h}_{i,j}$ and $\hat{\sigma}_j^2 = \frac{1}{n}\sum_{i=1}^{n}\big(\hat{h}_{i,j}-\alpha_j\mu_j\big)^2$ are computed per graph, and $\alpha_j$ is a learnable parameter controlling how much of the mean is subtracted for feature dimension $j$.
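A minimal sketch of that formula, assuming a single graph with node features of shape (num_nodes, feat_dim); for batched graphs you would compute the statistics per graph, e.g. with segment means:

```python
import torch
import torch.nn as nn

class GraphNorm(nn.Module):
    def __init__(self, feat_dim, eps=1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(feat_dim))    # gamma_j
        self.beta  = nn.Parameter(torch.zeros(feat_dim))   # beta_j
        self.alpha = nn.Parameter(torch.ones(feat_dim))    # alpha_j, learnable mean scale
        self.eps = eps

    def forward(self, h):                                   # h: (num_nodes, feat_dim)
        mu = h.mean(dim=0, keepdim=True)                    # mu_j over the nodes of this graph
        shifted = h - self.alpha * mu                       # h_{i,j} - alpha_j * mu_j
        sigma = shifted.pow(2).mean(dim=0, keepdim=True).sqrt()  # sigma_j of the shifted features
        return self.gamma * shifted / (sigma + self.eps) + self.beta
```

With $\alpha_j$ fixed to 1 this reduces to per-graph InstanceNorm; making it learnable lets each feature dimension keep part of the mean when it is informative, which is exactly the fix for the regular-graph failure mode above.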