What?

Graph Transformers (GT): transformers that attend over each node's local neighborhood, use Laplacian eigenvectors as positional encodings, and incorporate edge features into the computation.

Why?

Original transformers do not leverage information about graph topology. Moreover, all-to-all attention is quadratic in the number of nodes and infeasible for huge real-life graphs. On top of that, some graphs carry features on their edges, and we need to account for them.

How?

Source: original paper.

GT leverages two main ideas:

- attention is restricted to each node's local neighborhood, which exploits graph sparsity and avoids the quadratic cost of all-to-all attention;
- Laplacian eigenvectors of the graph serve as positional encodings, replacing the sinusoidal position encodings of the original transformer.
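The Laplacian positional encodings can be computed directly from the adjacency matrix. A minimal numpy sketch (the function name `laplacian_pe` and the dense-matrix formulation are my own; real implementations use sparse eigensolvers):

```python
import numpy as np

def laplacian_pe(adj: np.ndarray, k: int) -> np.ndarray:
    """Return the k smallest non-trivial eigenvectors of the
    symmetric normalized Laplacian as positional encodings."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    # L = I - D^{-1/2} A D^{-1/2}
    lap = np.eye(adj.shape[0]) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(lap)  # eigenvalues in ascending order
    # Skip the trivial first eigenvector (constant on a connected graph)
    return eigvecs[:, 1 : k + 1]

# Toy example: a 4-cycle graph
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
pe = laplacian_pe(A, 2)
print(pe.shape)  # (4, 2)
```

Note that eigenvectors are only defined up to sign, so in practice the sign is typically randomized during training so the model does not overfit to one arbitrary choice.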

I will not write down all the equations, since they closely mirror the original transformer and are clearly illustrated in the figure above. The general flow is the following:

1. project node features into queries, keys, and values;
2. for each node, compute scaled dot-product attention scores only over its neighbors and normalize them with a softmax;
3. aggregate the neighbors' values with these weights (multi-head outputs are concatenated and projected);
4. apply the usual residual connection, normalization, and feed-forward block.
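The neighborhood-restricted attention step can be sketched in a few lines of numpy. This is a simplified single-head version under my own naming (`gt_attention`); residual connections, normalization, and the feed-forward block would follow exactly as in the original transformer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gt_attention(h, adj, Wq, Wk, Wv):
    """Single-head attention where each node attends only to its
    graph neighbors, instead of all-to-all attention."""
    n, _ = h.shape
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    out = np.zeros_like(v)
    for i in range(n):
        nbrs = np.nonzero(adj[i])[0]               # local neighborhood of node i
        scores = (q[i] @ k[nbrs].T) / np.sqrt(k.shape[1])
        out[i] = softmax(scores) @ v[nbrs]          # weighted sum of neighbor values
    return out

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
h = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = gt_attention(h, A, Wq, Wk, Wv)
print(out.shape)  # (4, 8)
```

The loop over nodes is for clarity only; real implementations batch this with sparse operations.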

An interesting bit here is how edge information is fused into the computations above. The authors project each edge's features linearly and multiply them into the attention scores before the softmax; the modulated scores also serve as the updated edge representations, so edge features propagate from layer to layer alongside node features.
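A hedged sketch of this fusion, extending the attention function above (a simplified reading of the paper: edge projections gate the per-dimension scores, which are then summed into scalar attention weights; function and variable names are mine):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gt_edge_attention(h, e, adj, Wq, Wk, Wv, We):
    """Neighborhood attention with edge features fused into the scores.
    h: (n, d) node features; e: (n, n, de) dense edge features."""
    n, _ = h.shape
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    e_proj = e @ We                      # project edge features to score dimension
    dk = k.shape[1]
    h_out = np.zeros_like(v)
    e_out = np.zeros_like(e_proj)
    for i in range(n):
        nbrs = np.nonzero(adj[i])[0]
        # per-dimension scores, modulated elementwise by projected edge features
        s = (q[i][None, :] * k[nbrs]) / np.sqrt(dk) * e_proj[i, nbrs]
        e_out[i, nbrs] = s               # modulated scores become the new edge states
        att = softmax(s.sum(axis=1))     # collapse to scalar weights over neighbors
        h_out[i] = att @ v[nbrs]
    return h_out, e_out

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
h = rng.normal(size=(4, 8))
e = rng.normal(size=(4, 4, 5))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
We = rng.normal(size=(5, 8))
h_out, e_out = gt_edge_attention(h, e, A, Wq, Wk, Wv, We)
print(h_out.shape, e_out.shape)  # (4, 8) (4, 4, 8)
```

The dense (n, n, de) edge tensor is for illustration; in practice edges are stored sparsely, since only entries on actual edges are ever read.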

And?