What?
Simultaneously learning GNN parameters and the graph structure, with the help of self-supervision.
Why?
i) We often don't know what the optimal graph structure for learning is;
ii) supervision starvation (see the next section for an explanation).
How?
(Figure from the original paper.)
We are in a setting where we want to do node classification on a graph, but only a tiny fraction of the nodes have labels, i.e. can be used for supervised learning to update the parameters.
The proposed method (SLAPS) consists of four components:
- Generator
- Take node features and produce the adjacency matrix.
- Two types (both sketched in code after the component list):
- Full Parameterisation
- Ignore node features and generate a graph using a Bernoulli distribution.
- Adds $n^2$ parameters, where $n$ is the number of nodes.
- MLP-kNN
- Project the node features to the hidden dimension with an MLP.
- Compute adjacency using kNN and cosine similarity
- 0 if not in top-k
- $\text{cosine}(i,j)$ if in top-k
- Adjacency processor
- Nothing fancy here, just symmetrising and normalising the adjacency matrix (a code sketch follows the component list):
- $A = D^{-1/2}(\frac{P(\tilde{A}) + P(\tilde{A})^T}{2})D^{-1/2}$
- Classifier
- The authors use a two-layer GCN and train everything with the cross-entropy loss (a generic GCN sketch follows the component list).
- Self-supervision
- One could have stopped here, but the authors talk about supervision starvation, which motivates the need for an auxiliary self-supervised task.
- Example of supervision starvation:
- There's an edge between $v_i$ and $v_j$, and neither node is directly connected to any of the labeled nodes.
- Now the classification loss will not depend on the $(v_i, v_j)$ edge, since two layers of a GNN are not enough to propagate their features to the labeled nodes, so this edge receives no supervision signal.
- The authors use denoising as an auxiliary task (a sketch follows the component list):
- Corrupt the node features and let the GNN recover the original data.
- Use MSE as the loss.
- The final loss becomes $\mathcal{L} = \mathcal{L}_C + \lambda \mathcal{L}_{DAE}$, where $C$ stands for classification and DAE stands for Denoising Autoencoder.
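A minimal PyTorch sketch of the two generator variants, assuming dense $n \times n$ adjacency matrices; the class names, layer sizes, and $k$ are my own illustrative choices rather than the authors' implementation, and gradients only flow through the similarity values that survive the hard top-$k$ selection here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FullParamGenerator(nn.Module):
    """Fully parameterised generator: one free parameter per potential edge
    (n^2 in total); the node features are ignored."""

    def __init__(self, n: int):
        super().__init__()
        self.raw_adj = nn.Parameter(torch.randn(n, n))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.raw_adj  # x is unused by design


class MLPkNNGenerator(nn.Module):
    """MLP-kNN generator: embed the node features with an MLP, then keep only
    the top-k cosine similarities per node as (weighted) edges."""

    def __init__(self, in_dim: int, hidden_dim: int, k: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.normalize(self.mlp(x), dim=-1)   # (n, hidden_dim), unit-norm rows
        sim = h @ h.t()                        # pairwise cosine similarities (n, n)
        vals, idx = sim.topk(self.k, dim=-1)   # k most similar neighbours per node
        # Keep the top-k similarities, zero everything else.
        return torch.zeros_like(sim).scatter_(-1, idx, vals)


# Usage with random features: 100 nodes, 16-dimensional features.
x = torch.randn(100, 16)
raw_adj = MLPkNNGenerator(in_dim=16, hidden_dim=32, k=10)(x)
```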
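The adjacency processor is just the formula above in code; I take the non-negativity function $P$ to be an elementwise ReLU, which is an assumption on my part rather than the paper's exact choice.

```python
import torch
import torch.nn.functional as F


def process_adjacency(raw_adj: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Adjacency processor: make the generator output non-negative, symmetrise it,
    and apply symmetric degree normalisation D^{-1/2} A D^{-1/2}."""
    a = F.relu(raw_adj)               # P(.): elementwise non-negativity (one possible choice)
    a = (a + a.t()) / 2               # symmetrise: (P(A) + P(A)^T) / 2
    deg = a.sum(dim=-1)               # node degrees, i.e. the diagonal of D
    d_inv_sqrt = (deg + eps).pow(-0.5)
    return d_inv_sqrt.unsqueeze(-1) * a * d_inv_sqrt.unsqueeze(0)


# Usage on a random raw adjacency (e.g. the generator output from the sketch above).
adj = process_adjacency(torch.rand(100, 100))
```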
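The classifier is a standard two-layer GCN; the sketch below is a generic dense-adjacency version, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DenseGCN(nn.Module):
    """Two-layer GCN operating on a dense, already-normalised adjacency matrix."""

    def __init__(self, in_dim: int, hidden_dim: int, out_dim: int):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden_dim)
        self.lin2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        x = F.relu(self.lin1(adj @ x))  # aggregate neighbours, transform, non-linearity
        return self.lin2(adj @ x)       # second aggregation produces the class logits


# Usage: 100 nodes, 16-dim features, 7 classes; identity adjacency as a stand-in.
logits = DenseGCN(16, 32, 7)(torch.randn(100, 16), torch.eye(100))
```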
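Finally, a sketch of the denoising objective and the combined loss $\mathcal{L} = \mathcal{L}_C + \lambda \mathcal{L}_{DAE}$; the corruption scheme (zeroing out a random subset of nodes), the stand-in models, and all shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def denoising_loss(gnn_dae, features, adj, mask_ratio=0.2):
    """Corrupt a random subset of nodes, let a denoising GNN reconstruct their
    features, and penalise the reconstruction with MSE on the corrupted nodes."""
    mask = torch.rand(features.size(0), device=features.device) < mask_ratio
    corrupted = features.clone()
    corrupted[mask] = 0.0                    # simple corruption: zero out masked nodes
    reconstructed = gnn_dae(corrupted, adj)  # reconstruct from the corrupted input
    return F.mse_loss(reconstructed[mask], features[mask])


# Combined objective L = L_C + lambda * L_DAE with stand-in models and random data.
n, d, c = 100, 16, 7
features, adj = torch.randn(n, d), torch.eye(n)
labels, train_idx = torch.randint(0, c, (n,)), torch.arange(10)  # only 10 labelled nodes
classifier = lambda x, a: a @ x @ torch.randn(d, c)  # stand-in for the two-layer GCN
gnn_dae = lambda x, a: a @ x                         # stand-in for the denoising GNN
lambda_ = 1.0

loss = (F.cross_entropy(classifier(features, adj)[train_idx], labels[train_idx])
        + lambda_ * denoising_loss(gnn_dae, features, adj))
```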
And?
- The authors show that node classification benefits from the self-supervision.
- I really like the way the notation is defined at the beginning of the background section. The only thing I found confusing was the parameter vectors: it is stated that lowercase letters denote scalars, but $\theta_G$, which parameterises the network, is definitely not a scalar.
- I had never thought of this setting, but it looks like the graph structure is inherent to the dataset here rather than to each individual data point. I wonder whether the conclusions of the paper hold for the node regression task, where there might be long-range dependencies between the nodes in the graph (e.g. what we had in My Body is a Cage).