What?
A network architecture for predicting rankings, matchings and permutations, plus an analog of Gumbel-Softmax for permutation matrices (Gumbel-Softmax itself operates on categorical vectors).
Why?
Matchings and permutations are important for aligning, canonicalising and sorting data. We want to solve tasks where matchings are not provided and have to be treated as latent. REINFORCE-style estimators are impractical for such models, and there has been no reparametrisation option so far.
How?

(Figure from the original paper.)
- Sinkhorn operator:
- As we saw yesterday, we can use a softmax with temperature to soften a discrete distribution. In the zero-temperature limit it becomes discrete again (returning the $\arg\max$).
- The authors show an analogous result for matrices.
- Sinkhorn Operator (normalisation) works as follows:
- $X \in \mathbb{R}^{N \times N}$ is the matrix we will normalise;
- $S^0(X) = \exp{(X)}$ (element-wise)
- $S^l(X) = \mathcal{T}_c(\mathcal{T}_r(S^{l-1}(X)))$
- $\mathcal{T}_c$ is column-wise normalisation: $\mathcal{T}_c(X) = X \oslash (\mathbf{1}_N\mathbf{1}_N^T X)$
- $\mathcal{T}_r$ is row-wise normalisation: $\mathcal{T}_r(X) = X \oslash (X\mathbf{1}_N\mathbf{1}_N^T)$
- here $\oslash$ denotes element-wise division and $\mathbf{1}_N$ is the $N$-dimensional all-ones vector
- $S(X)=\lim_{l\rightarrow \infty}S^l(X)$
- There is a result (Sinkhorn's theorem) that $S(X)$ is a doubly stochastic matrix, i.e. all its rows and columns are stochastic vectors.
- Now we can pose choosing a permutation as the solution to a (linear) assignment problem:
- $M(X) = \arg\max_{P \in \mathcal{P}_N}\langle P,X\rangle_F$, where $\mathcal{P}_N$ is the set of $N \times N$ permutation matrices;
- $\langle P,X\rangle_F$ is the Frobenius inner product.
- Sounds fancy, but it's just the dot product of the matrices flattened into vectors.
- The paper proves a theorem (Theorem 1) stating that, in the limit, the Sinkhorn operator applied to the temperature-scaled $X$ converges to $M(X)$ (a numerical check follows this block):
- $M(X) = \lim_{\tau\rightarrow0^+}S(X/\tau)$
- $S(X/\tau) = \arg\max_{P \in \mathcal{B}_N}\langle P,X \rangle_F + \tau h(P)$,
- $\mathcal{B}_N$ is the Birkhoff polytope of dimension $N$, i.e. the set of doubly stochastic matrices of dimension $N$.
- $h(P)$ is the entropy of a doubly stochastic matrix $P$: $h(P) = -\sum_{i,j} P_{i,j} \log(P_{i,j})$.
- This holds almost surely if the entries of $X$ are drawn from a distribution that is absolutely continuous w.r.t. the Lebesgue measure on $\mathbb{R}$ (which guarantees that the maximiser $M(X)$ is unique).
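To make this concrete, here is a minimal NumPy/SciPy sketch of a truncated Sinkhorn operator, together with a numerical check that $S(X/\tau)$ approaches $M(X)$ as $\tau \rightarrow 0^+$. The function name, iteration count and test values are my own choices for illustration, not from the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.special import logsumexp

def sinkhorn(X, n_iters=100):
    """Truncated Sinkhorn operator S^l(X): exp(X) followed by alternating
    row/column normalisation, computed in log-space for numerical stability."""
    log_P = np.asarray(X, dtype=float)
    for _ in range(n_iters):
        log_P = log_P - logsumexp(log_P, axis=1, keepdims=True)  # T_r: rows
        log_P = log_P - logsumexp(log_P, axis=0, keepdims=True)  # T_c: columns
    return np.exp(log_P)

rng = np.random.default_rng(0)
N = 5
X = rng.normal(size=(N, N))

# Hard matching M(X) = argmax_P <P, X>_F, solved exactly by the Hungarian
# algorithm (linear_sum_assignment minimises, hence the sign flip).
rows, cols = linear_sum_assignment(-X)
M = np.zeros((N, N))
M[rows, cols] = 1.0

# As tau -> 0+, S(X / tau) should approach M(X).
for tau in (1.0, 0.1, 0.01):
    P = sinkhorn(X / tau)
    print(f"tau={tau}: max |S(X/tau) - M(X)| = {np.abs(P - M).max():.4f}")
```

Every output of `sinkhorn` is (approximately) doubly stochastic; the temperature only controls how close it sits to a vertex of the Birkhoff polytope.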
- Sinkhorn Networks
- Imagine a task where we want to learn a mapping from scrambled objects $\tilde{X}$ (scrambled eggs, if you like) to non-scrambled ones $X$.
- We can write this down as a regression problem with the error measured between $X$ and the unpermuted $\tilde{X}$: $f(\theta, X, \tilde{X}) = \sum_{i=1}^{M} \lVert X_i - P^{-1}_{\theta, \tilde{X}_i}\tilde{X}_i \rVert^2;$
- To parameterise a solution with a network, we can set $P_{\theta, \tilde{X}} = M(g(\tilde{X}, \theta))$, where $g$ can be a neural network.
- This is non-differentiable because of $M$.
- To make it differentiable, we can use $S(g(\tilde{X}, \theta)/\tau)$ instead!
- To ensure the uniqueness of $M$ (so that Theorem 1 applies), we add noise to the output of $g$.
- For the reconstruction to be independent of the permutation, the authors want this property:
- $P_{\theta, P', \tilde{X}}(P'\tilde{X}) = P'(P_{\theta, \tilde{X}}\tilde{X})$.
- To achieve the above, the authors process each of the elements independently with the same network and then apply the softened Sinkhorn operator (see the Figure above).
- At test time, we can use the hard $M(g(\tilde{X}, \theta))$ instead (a training sketch follows this list).
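Below is a minimal PyTorch sketch of this setup in the spirit of the paper's number-sorting task: each element of $\tilde{X}$ is processed independently by the same small MLP `g`, the stacked logits go through the soft Sinkhorn operator, and the loss compares $X$ with $P^{-1}\tilde{X}$ (for a permutation, $P^{-1} = P^T$). The architecture sizes, noise scale and training loop are my assumptions for illustration, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def sinkhorn(log_alpha, n_iters=20):
    # Differentiable truncated Sinkhorn operator, in log-space.
    for _ in range(n_iters):
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=-1, keepdim=True)  # rows
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=-2, keepdim=True)  # columns
    return log_alpha.exp()

class SortNet(nn.Module):
    # g maps each scalar independently (shared weights) to a row of n logits,
    # which makes the stacked n x n score matrix permutation-equivariant.
    def __init__(self, n, hidden=32):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, n))

    def forward(self, x_tilde, tau=1.0, noise_std=1e-3):
        logits = self.g(x_tilde.unsqueeze(-1))                  # (batch, n, n)
        logits = logits + noise_std * torch.randn_like(logits)  # break ties so M is unique
        return sinkhorn(logits / tau)

n, batch = 8, 64
net = SortNet(n)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(1000):
    x = torch.sort(torch.rand(batch, n), dim=1).values             # unscrambled X
    perm = torch.stack([torch.randperm(n) for _ in range(batch)])
    x_tilde = torch.gather(x, 1, perm)                             # scrambled input
    P = net(x_tilde)                                               # soft permutation
    recon = torch.einsum('bij,bi->bj', P, x_tilde)                 # P^T x_tilde ~ P^{-1} x_tilde
    loss = ((x - recon) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

At test time one would swap the soft `sinkhorn` output for the hard matching (e.g. via `scipy.optimize.linear_sum_assignment` on the logits), as the note above describes.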
- Probabilistic latent representation:
- Consider $p(y|X) \propto \exp{(X(y))}\,\mathbf{1}_{y\in\mathcal{Y}}$;
- We can parameterise the above with $X(y) = \langle X, y \rangle$; sampling then amounts to choosing the category $y$ that wins after injecting Gumbel noise (the Gumbel-max trick).
- $P$ follows the Gumbel-Matching distribution if it has the distribution arising from the Gumbel perturbation of $p(y|X)$ with $X(P) = \langle X,P\rangle_F$ (replace $y$ by $P$), i.e. the distribution of $M(X+\epsilon)$ for a matrix $\epsilon$ of i.i.d. Gumbel noise.
- The above is not differentiable, but we can relax (pun intended):
- $P$ follows the Gumbel-Sinkhorn distribution with parameter $X$ and temperature $\tau$ if it has the distribution of $S((X+\epsilon)/\tau)$, with the same Gumbel noise $\epsilon$.
- Almost surely, Gumbel-Sinkhorn samples converge to Gumbel-Matching samples as $\tau \rightarrow 0^+$ (the same limit as in Theorem 1).
- Having said all of the above, we are thrilled to announce that neither of the two distributions has a tractable density. But one can use likelihood-free methods instead! (A sampling sketch follows this list.)
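For completeness, a minimal sketch of reparametrised sampling from Gumbel-Sinkhorn, using the standard inverse-CDF transform to generate Gumbel noise; `gumbel_sinkhorn` and its default parameters are my own, not from the paper:

```python
import torch

def gumbel_sinkhorn(X, tau=0.1, n_iters=20, eps=1e-20):
    # Reparametrised sample S((X + g) / tau), with g a matrix of i.i.d.
    # standard Gumbel noise: g = -log(-log(u)), u ~ Uniform(0, 1).
    u = torch.rand_like(X)
    g = -torch.log(-torch.log(u + eps) + eps)
    log_alpha = (X + g) / tau
    for _ in range(n_iters):
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=-1, keepdim=True)
        log_alpha = log_alpha - torch.logsumexp(log_alpha, dim=-2, keepdim=True)
    return log_alpha.exp()

X = torch.randn(4, 4, requires_grad=True)
P = gumbel_sinkhorn(X)        # differentiable w.r.t. X: gradients flow through
print(P.sum(0), P.sum(1))     # rows and columns all close to 1: doubly stochastic
```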
And?
- Again, as with yesterday's note, this paper is out of my comfort zone, so take everything I said above with a grain of salt.
- Intuitively, processing elements independently before the Sinkhorn step looks like a drawback of the method. It takes a cardinalist view of the problem (project each input to some fixed place in a latent space with order semantics on it), whereas the problem itself seems ordinalist (find the correct relative ordering). It was therefore surprising to read that the sorting example generalised to out-of-distribution data in terms of the range of the numbers; it would be interesting to understand where my intuition is wrong. Also, independent processing does not seem crucial for the results to hold: transformers are also permutation-equivariant, but they do take dependencies between elements into account.
- For me, the connection between the Sinkhorn operator in the limit and the solution to the assignment problem is mind-blowing.
- One of the benchmarks considered here is doing inference over permutations of the connectome of C. elegans 🤯🤯🤯
This note is a part of my paper notes series. You can find more here or on Twitter. I also have a blog.