What?
Simultaneously learning GNN parameters and the graph structure, with the help of self-supervision.
Why?
i) We often don't know what the optimal graph structure for learning is;
ii) supervision starvation (see the next section for an explanation).
How?
(Figure from the original paper.)
We are in a setting where we want to do node classification on a graph, but only a tiny fraction of the nodes have labels, i.e. can be used for supervised learning to update the parameters.
The proposed method (SLAPS) consists of four components:
- Generator
- Take node features and produce the adjacency matrix.
- Two types (both sketched in code after the component list):
- Full Parameterisation
- Ignore node features and generate a graph using a Bernoulli distribution.
- Adds $n^2$ parameters, where $n$ is the number of nodes.
- MLP-kNN
- Project the node features to the hidden dimension with an MLP.
- Compute adjacency using kNN and cosine similarity
- 0 if not in top-k
- $\text{cosine}(i,j)$ if in top-k
- Adjacency processor
- Nothing fancy here, just symmetrising and normalising the adjacency matrix (a code sketch follows the component list):
- $A = D^{-1/2}(\frac{P(\tilde{A}) + P(\tilde{A})^T}{2})D^{-1/2}$
- Classifier
- The authors use a two-layer GCN and train everything with the cross-entropy loss (a generic GCN sketch follows the component list).
- Self-supervision
- One could have stopped here, but the authors talk about supervision starvation, which motivates the need for an auxiliary self-supervised task.
- Example of supervision starvation:
- There's an edge between $v_i$ and $v_j$, and neither node is directly connected to any of the labeled nodes.
- Now the classification loss will not depend on the $(v_i, v_j)$ edge, since two layers of a GNN are not enough to propagate their features to the labeled nodes, so this edge receives no supervision signal.
- The authors use denoising as an auxiliary task (a sketch follows the component list):
- Corrupt the node features and let the GNN recover the original data.
- Use MSE as the loss.
- The final loss becomes $\mathcal{L} = \mathcal{L}_C + \lambda \mathcal{L}_{DAE}$, where $C$ stands for classification and DAE stands for Denoising Autoencoder.
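A minimal PyTorch sketch of the two generator variants, assuming dense $n \times n$ adjacency matrices; the class names, layer sizes, and $k$ are my own illustrative choices rather than the authors' implementation, and gradients only flow through the similarity values that survive the hard top-$k$ selection here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FullParamGenerator(nn.Module):
    """Fully parameterised generator: one free parameter per potential edge
    (n^2 in total); the node features are ignored."""

    def __init__(self, n: int):
        super().__init__()
        self.raw_adj = nn.Parameter(torch.randn(n, n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.raw_adj  # x is unused by design


class MLPkNNGenerator(nn.Module):
    """MLP-kNN generator: embed the node features with an MLP, then keep only
    the top-k cosine similarities per node as (weighted) edges."""

    def __init__(self, in_dim: int, hidden_dim: int, k: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.normalize(self.mlp(x), dim=-1)   # (n, hidden_dim), unit-norm rows
        sim = h @ h.t()                        # pairwise cosine similarities (n, n)
        vals, idx = sim.topk(self.k, dim=-1)   # k most similar neighbours per node
        # Keep the top-k similarities, zero everything else.
        return torch.zeros_like(sim).scatter_(-1, idx, vals)


# Usage with random features: 100 nodes, 16-dimensional features.
x = torch.randn(100, 16)
raw_adj = MLPkNNGenerator(in_dim=16, hidden_dim=32, k=10)(x)
```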
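The adjacency processor is just the formula above in code; I take the non-negativity function $P$ to be an elementwise ReLU, which is an assumption on my part rather than the paper's exact choice.

```python
import torch
import torch.nn.functional as F


def process_adjacency(raw_adj: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Adjacency processor: make the generator output non-negative, symmetrise it,
    and apply symmetric degree normalisation D^{-1/2} A D^{-1/2}."""
    a = F.relu(raw_adj)               # P(.): elementwise non-negativity (one possible choice)
    a = (a + a.t()) / 2               # symmetrise: (P(A) + P(A)^T) / 2
    deg = a.sum(dim=-1)               # node degrees, i.e. the diagonal of D
    d_inv_sqrt = (deg + eps).pow(-0.5)
    return d_inv_sqrt.unsqueeze(-1) * a * d_inv_sqrt.unsqueeze(0)


# Usage on a random raw adjacency (e.g. the generator output from the sketch above).
adj = process_adjacency(torch.rand(100, 100))
```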
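The classifier is a standard two-layer GCN; the sketch below is a generic dense-adjacency version, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseGCN(nn.Module):
    """Two-layer GCN operating on a dense, already-normalised adjacency matrix."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden_dim)
        self.lin2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.lin1(adj @ x))  # aggregate neighbours, transform, non-linearity
        return self.lin2(adj @ x)       # second aggregation produces the class logits


# Usage: 100 nodes, 16-dim features, 7 classes; identity adjacency as a stand-in.
logits = DenseGCN(16, 32, 7)(torch.randn(100, 16), torch.eye(100))
```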
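Finally, a sketch of the denoising objective and the combined loss $\mathcal{L} = \mathcal{L}_C + \lambda \mathcal{L}_{DAE}$; the corruption scheme (zeroing out a random subset of nodes), the stand-in models, and all shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def denoising_loss(gnn_dae, features, adj, mask_ratio=0.2):
    """Corrupt a random subset of nodes, let a denoising GNN reconstruct their
    features, and penalise the reconstruction with MSE on the corrupted nodes."""
    mask = torch.rand(features.size(0), device=features.device) < mask_ratio
    corrupted = features.clone()
    corrupted[mask] = 0.0                    # simple corruption: zero out masked nodes
    reconstructed = gnn_dae(corrupted, adj)  # reconstruct from the corrupted input
    return F.mse_loss(reconstructed[mask], features[mask])


# Combined objective L = L_C + lambda * L_DAE with stand-in models and random data.
n, d, c = 100, 16, 7
features, adj = torch.randn(n, d), torch.eye(n)
labels, train_idx = torch.randint(0, c, (n,)), torch.arange(10)  # only 10 labelled nodes
classifier = lambda x, a: a @ x @ torch.randn(d, c)  # stand-in for the two-layer GCN
gnn_dae = lambda x, a: a @ x                         # stand-in for the denoising GNN
lambda_ = 1.0

loss = (F.cross_entropy(classifier(features, adj)[train_idx], labels[train_idx])
        + lambda_ * denoising_loss(gnn_dae, features, adj))
```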
And?
- The authors show that node classification benefits from the self-supervision.
- I really like the way the notation is defined at the beginning of the background section. The only thing I found confusing was the parameter vectors: it is stated that lowercase letters denote scalars, but $\theta_G$, which parameterises the network, is definitely not a scalar.
- I had never thought of this setting, but it looks like the graph structure is inherent to the dataset here rather than to each individual data point. I wonder whether the conclusions of the paper hold for the node regression task, where there might be long-range dependencies between the nodes in the graph (e.g. what we had in My Body is a Cage).