What?
Graph Transformers (GT) that attend over the local neighbourhood, use Laplacian eigenvectors for positional information, and incorporate edge features into the computation.
Why?
The original Transformer does not leverage any information about graph topology. Moreover, all-to-all attention is costly and infeasible for huge real-life graphs. On top of that, some graphs have features on the edges, and we need to account for those.
How?

[Figure: the Graph Transformer layer. Source: original paper.]
GT leverage two main ideas:
- Sparsity
    - Do the attention only across a node's neighbourhood.
- Positional encodings
    - Take the eigendecomposition of the normalised Laplacian and use the first $k$ non-trivial rows of $U$ (the eigenvectors with the smallest eigenvalues), so that each node $i$ gets a $k$-dimensional positional encoding $\lambda_i$.
    - $\Delta = I - D^{-1/2}AD^{-1/2} = U^T\Lambda U$, where $A$ is the adjacency matrix, $D$ is the degree matrix and $U^T\Lambda U$ is the eigendecomposition.
    - Positional encodings can be precomputed and added to the input before the transformer (see the sketch after this list).
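A minimal NumPy sketch of how such encodings could be precomputed, assuming a dense adjacency matrix; the function name and the example graph are mine, not from the paper:

```python
import numpy as np

def laplacian_pe(A: np.ndarray, k: int) -> np.ndarray:
    """Eigenvectors of the symmetric normalised Laplacian as positional
    encodings: each node gets a k-dimensional vector built from the k
    eigenvectors with the smallest non-trivial eigenvalues."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, eigvec = np.linalg.eigh(L)        # eigenvalues come out in ascending order
    # eigenvector signs are arbitrary, so some sign handling is needed in practice
    return eigvec[:, 1:k + 1]            # drop the trivial eigenvector; shape (n, k)

# e.g. a 4-node cycle graph with 2-dimensional encodings
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
print(laplacian_pe(A, k=2).shape)        # (4, 2)
```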
I will not write down all the equations since they are very similar to the original transformer and are clearly illustrated in the figure above. The general flow is the following:
- Project node and edge features to a hidden dimension.
- Project the positional embedding (so that dimensions match).
- Add the projected positional embedding to the projected node features: $h^0_i = \hat{h}^0_i+\lambda^0_i$ (see the sketch after this list).
- Then a standard transformer encoder block follows, with two main changes:
    - A node attends only to its neighbours.
    - LayerNorm becomes BatchNorm.
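As a concrete illustration of the input step, here is a PyTorch sketch of the projections and the PE addition; module and argument names are my assumptions, not the paper's:

```python
import torch
import torch.nn as nn

class GTInput(nn.Module):
    """Sketch of the Graph Transformer input block: project node features,
    edge features and the precomputed Laplacian PE to the hidden dimension,
    then add the projected PE to the projected node features."""
    def __init__(self, node_dim, edge_dim, pe_dim, hidden_dim):
        super().__init__()
        self.node_proj = nn.Linear(node_dim, hidden_dim)
        self.edge_proj = nn.Linear(edge_dim, hidden_dim)
        self.pe_proj = nn.Linear(pe_dim, hidden_dim)

    def forward(self, x, e, pe):
        h0 = self.node_proj(x) + self.pe_proj(pe)   # h^0_i = \hat h^0_i + \lambda^0_i
        e0 = self.edge_proj(e)                      # projected edge features
        return h0, e0
```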
An interesting bit here is how edge information is fused into the computations above. The authors choose the following approach:
- They multiply in the projected edge features after computing the attention score, i.e. they modulate the attention weights by the edge features, so that the attention weights become $w_{ij}^{k,l} = \text{softmax}_j(\hat{w}_{ij}^{k,l})$, where $\hat{w}_{ij}^{k,l} = \left(\frac{Q^{k,l}h_i^l\cdot K^{k,l}h^l_j}{\sqrt{d_k}}\right)\cdot E^{k,l}e^l_{ij}$, $k$ is the attention head index, and $i,j$ are two node indices connected by edge $(i,j)$. The rest is the same as in a typical transformer (see the sketch after this list).
- Besides the node pathway, the edge representations are updated as well: the pre-softmax scores $\hat{w}_{ij}^{k,l}$ from the different heads are concatenated, projected, and fed into another MLP with residual connections. So the edge update is essentially an MLP driven by the (edge-modulated) node similarities.
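Below is a sketch of a single attention head with the edge modulation and the neighbourhood-restricted softmax described above. I keep $\hat{w}$ per-dimension (summing it only to form the softmax score) so that it can also feed the edge update; the helper `segment_softmax`, the `edge_index` convention and all names are my assumptions, not the paper's API:

```python
import torch
import torch.nn as nn

def segment_softmax(scores, index, num_nodes):
    # softmax over the incoming edges of each node (helper, not from the paper)
    scores = scores - scores.max()                 # numerical stability
    exp = scores.exp()
    denom = torch.zeros(num_nodes, 1, device=scores.device).index_add_(0, index, exp)
    return exp / (denom[index] + 1e-9)

class GTAttentionHead(nn.Module):
    """One Graph Transformer attention head (sketch): attention is restricted
    to existing edges, and the scores are modulated by projected edge features."""
    def __init__(self, hidden_dim, d_k):
        super().__init__()
        self.Q = nn.Linear(hidden_dim, d_k)
        self.K = nn.Linear(hidden_dim, d_k)
        self.V = nn.Linear(hidden_dim, d_k)
        self.E = nn.Linear(hidden_dim, d_k)
        self.d_k = d_k

    def forward(self, h, e, edge_index):
        i, j = edge_index                          # edge (i, j): node i attends to neighbour j
        num_nodes = h.size(0)
        q, k, v = self.Q(h), self.K(h), self.V(h)
        # \hat w_{ij} = (Q h_i * K h_j / sqrt(d_k)) * E e_{ij}, kept per-dimension
        w_hat = (q[i] * k[j]) / self.d_k ** 0.5 * self.E(e)
        # scalar attention weight per edge, softmax over each node's neighbours
        alpha = segment_softmax(w_hat.sum(-1, keepdim=True), i, num_nodes)
        # weighted sum of neighbour values for every node i
        h_out = torch.zeros(num_nodes, self.d_k, device=h.device).index_add_(0, i, alpha * v[j])
        return h_out, w_hat    # w_hat (concatenated over heads) also drives the edge update
```

The per-head outputs would then be concatenated, projected, and sent through the residual, BatchNorm and FFN blocks for the node and edge pathways respectively.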
And?
- We get a transformer that can process edge features as well as positional information.
- Using Laplacian eigenvectors for positional encoding is a pretty cool idea.
- It is quite surprising that we do not see a huge performance boost compared with GatedGCN. The results are not even SOTA, which really surprises me. Does the performance of Transformers come from those really long-range dependencies that are hard to capture when we attend over the neighbourhood only?
- Interestingly, BatchNorm works better here, which is supported by this paper as well. It would be interesting to see whether Instance/GraphNorm would further improve the performance.