What?
An autoregressive way of generating graphs using graph nets (leveraging both the structure & the attributes)
Why?
Generating graphs is an important problem, and generating graphs with given properties is a hard task.
How?
source: original paper
TL;DR: iteratively build a graph by adding nodes and edges connecting the newly added node to the rest of the graph. MPNNs do all the heavy lifting.
This is the set of functions we need in order to generate a graph (a minimal code sketch of these modules follows the list):
- MPNN to get node/edge features $h_V^{(T)} = \text{prop}^T(h_V, G)$
- Readout (a gated sum is used): $h_G = R(h_V^{(T)}, G)$, where $h_G = \sum_{v\in V} g_v^G \odot h_v^G$
- The authors do a smart thing to increase the model capacity. Before they do the readout, they project node vectors to a higher dimension: $h_v^G = f_m(h_v)$ for each node $v \in V.$
- When doing message aggregation, the authors compute messages in both directions and then sum the two sums: $\sum_{u:(u,v)\in E}m_{u\rightarrow v} + \sum_{u:(v,u)\in E}m'_{u\rightarrow v}$
- $f_{\text{add node}}(G) = \text{softmax}(f_{an}(h_G))$
- $f_{an}$ is an MLP to map the graph representation to an action output space (add a new node or terminate)
- $f_{\text{add edge}}(G,v) = \sigma(f_{ae}(h_G, h_v^{(T)}))$
- Similar to the above, but here we decide whether to add an edge to the newly added node or go back to the node-addition step.
- If adding an edge, where to add?
- $f_{\text{nodes}}(G,v) = \text{softmax}(\textbf{s})$, where
- $s_u = f_s(h_u^{(T)}, h_v^{(T)}), \forall u \in V$
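To make these pieces concrete, here is a minimal PyTorch sketch of the modules above: bidirectional message aggregation with a GRU node updater, the gated readout with the $f_m$ projection, and the three decision heads. All names and dimensions (`GraphModules`, `node_dim`, ...) are my own shorthand, not the authors' code, so treat it as an illustration of the equations rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn


class GraphModules(nn.Module):
    """Sketch of the building blocks above; all names and dimensions are mine."""

    def __init__(self, node_dim=16, graph_dim=32, num_node_types=3, num_rounds=2):
        super().__init__()
        self.num_rounds = num_rounds
        # message functions for the two edge directions (m and m')
        self.msg_fwd = nn.Linear(2 * node_dim, node_dim)
        self.msg_bwd = nn.Linear(2 * node_dim, node_dim)
        # GRU cell as the node updater; keeping its state is what makes the
        # model autoregressive across propagation and decision steps
        self.node_update = nn.GRUCell(node_dim, node_dim)
        # gated readout: f_m projects nodes to a higher dim, `gate` produces g_v
        self.f_m = nn.Linear(node_dim, graph_dim)
        self.gate = nn.Linear(node_dim, graph_dim)
        # decision heads
        self.f_an = nn.Linear(graph_dim, num_node_types + 1)  # node type or STOP
        self.f_ae = nn.Linear(graph_dim + node_dim, 1)        # add an edge to v?
        self.f_s = nn.Linear(2 * node_dim, 1)                 # score each endpoint u

    def propagate(self, h, edges):
        """T rounds of message passing; `edges` is a list of (u, v) index pairs."""
        for _ in range(self.num_rounds):
            agg = torch.zeros_like(h)
            for u, v in edges:
                # messages in both directions, summed into the receiving node
                agg[v] = agg[v] + self.msg_fwd(torch.cat([h[u], h[v]]))
                agg[u] = agg[u] + self.msg_bwd(torch.cat([h[v], h[u]]))
            h = self.node_update(agg, h)
        return h

    def readout(self, h):
        """h_G = sum_v g_v * f_m(h_v): a gated sum over projected node vectors."""
        return (torch.sigmoid(self.gate(h)) * self.f_m(h)).sum(dim=0)

    def add_node_logits(self, h_G):
        return self.f_an(h_G)  # softmax over {node types} + {stop}

    def add_edge_prob(self, h_G, h_v):
        return torch.sigmoid(self.f_ae(torch.cat([h_G, h_v])))

    def node_scores(self, h, v):
        """s_u = f_s(h_u, h_v) for every node u; a softmax over these picks the endpoint."""
        h_v = h[v].expand(h.size(0), -1)
        return self.f_s(torch.cat([h, h_v], dim=-1)).squeeze(-1)
```

The softmax/sigmoid outputs are turned into actual decisions by sampling; that part is sketched after the node-initialisation step below.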
Using GRUs as node updaters and keeping their states makes the model autoregressive both across the propagation steps and across different decision steps.
Now, it is interesting how to initialise the features of newly added nodes. The authors use the graph readout as well as node type embedding:
$$
h_v = f_{\text{init}}(R_\text{init}(h_V, G), x_v),
$$
where $x_v$ is a node type embedding.
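Assuming the `GraphModules` sketch above plus a hypothetical `f_init` MLP and node-type embedding table (and reusing the same readout in place of a separate $R_\text{init}$, for brevity), the whole sampling loop could look roughly like this:

```python
import torch
import torch.nn as nn


@torch.no_grad()
def generate(m, f_init, type_emb, graph_dim=32, node_dim=16, max_nodes=20):
    """Sampling-loop sketch built on the GraphModules sketch above.
    f_init maps [graph readout, node-type embedding] -> initial node state h_v."""
    h = torch.zeros(0, node_dim)   # node states h_V; start from the empty graph
    edges, types = [], []
    for _ in range(max_nodes):
        h = m.propagate(h, edges) if edges else h
        h_G = m.readout(h) if h.size(0) else torch.zeros(graph_dim)
        # f_addnode: sample a node type or STOP (last index is the stop symbol here)
        choice = torch.distributions.Categorical(logits=m.add_node_logits(h_G)).sample().item()
        if choice == type_emb.num_embeddings:
            break
        # initialise the new node from the current graph readout and its type embedding
        h_v = f_init(torch.cat([h_G, type_emb(torch.tensor(choice))]))
        h = torch.cat([h, h_v.unsqueeze(0)], dim=0)
        types.append(choice)
        v = h.size(0) - 1
        # f_addedge / f_nodes: keep connecting v to older nodes while the model says yes
        for _ in range(v):                                  # cap at v edges (a sketch choice)
            h = m.propagate(h, edges) if edges else h
            h_G = m.readout(h)
            if torch.rand(()).item() > m.add_edge_prob(h_G, h[v]).item():
                break
            u = torch.distributions.Categorical(logits=m.node_scores(h, v)[:v]).sample().item()
            edges.append((u, v))
    return types, edges


# hypothetical wiring, matching the dimensions of the GraphModules sketch
m = GraphModules(node_dim=16, graph_dim=32, num_node_types=3)
type_emb = nn.Embedding(3, 8)       # x_v: node-type embeddings
f_init = nn.Linear(32 + 8, 16)      # f_init(R_init(h_V, G), x_v)
print(generate(m, f_init, type_emb))
```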
How does the generative model work?
- The model defines a joint distribution over graphs $G$ and edge orderings $\pi$ (sorry, RL folks).
- We want the marginal $p(G) = \sum_{\pi \in \mathcal{P}(G)}p(G,\pi)$;
- Every time you hear the word 'marginal', expect the word 'intractable' right after it.
- Use importance sampling $p(G) = \sum_\pi p(G, \pi) = \mathbb{E}_{q(\pi \mid G)}\Big[\frac{p(G,\pi)}{q(\pi \mid G)}\Big]$, where $q(\pi \mid G)$ is a proposal distribution over permutations.
- With a fixed canonical ordering, one can use the delta function for $q(\pi\mid G)$ and put all the probability on the canonical ordering.
- To learn this, we maximise the expected joint log-likelihood:
- $\mathbb{E}_{p_\text{data}(G,\pi)}[\log p(G,\pi)] = \mathbb{E}_{p_\text{data}(G)}\mathbb{E}_{p_\text{data}(\pi\mid G)}[\log p(G,\pi)]$;
- Now, for reasons I do not get, $p_\text{data}(\pi \mid G) = q(\pi \mid G)$.
<aside>
💡 Training with such a $p_\text{data}(\pi \mid G)$ will drive the posterior of the model distribution $p(\pi \mid G)$ close to the proposal distribution $q(\pi \mid G)$, therefore improving the quality of our estimate of the marginal probability.
</aside>
- For the ordering, the authors used either a fixed ordering or a uniform random ordering during training.
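As a toy illustration of the marginalisation above (not the authors' code): given a trained model that scores $\log p(G, \pi)$ and a proposal $q(\pi \mid G)$ we can sample from and evaluate, the estimate of $\log p(G)$ is a log-mean-exp of importance weights.

```python
import math


def log_p_graph(G, log_p_joint, sample_ordering, log_q, num_samples=64):
    """Importance-sampling estimate of log p(G) = log E_{q(pi|G)}[p(G, pi) / q(pi|G)].
    `log_p_joint`, `sample_ordering` and `log_q` are assumed to come from a trained
    model and its proposal distribution; only the estimator skeleton is shown."""
    log_w = []
    for _ in range(num_samples):
        pi = sample_ordering(G)                         # pi ~ q(. | G)
        log_w.append(log_p_joint(G, pi) - log_q(G, pi))
    # log-mean-exp of the importance weights, for numerical stability
    m = max(log_w)
    return m + math.log(sum(math.exp(w - m) for w in log_w) / num_samples)


# With a uniform proposal over the n! orderings of an n-node graph:
#   sample_ordering(G)  could shuffle range(n)
#   log_q(G, pi)        is -sum(math.log(k) for k in range(1, n + 1))   # log(1/n!)
```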
And?