Barking up the right tree: an approach to search over molecule synthesis DAGs

What?

A model + task specification to obtain molecules that we can synthesize in real life.

Why?

Generating a molecule with a model tells us what to synthesize, but not how. Knowing how would be very useful. Moreover, the model might generate a molecule that cannot actually be made in the lab. How can we avoid that?

How?

source: original paper

The authors consider two problems:

G1: generating molecules for further screening and scoring;
G2: finding molecules satisfying some requirements.

The main idea is to frame generation as a synthesis task, so that the output tells us how to synthesize the molecule in real life and is less likely to be a molecule that cannot be made at all. They turn this into a sequential decision-making problem:

A model can take three types of actions:
- add a node
  - a building block
  - a product node
- choose a building block (if the previous action added a building-block node, we need to select which molecule it is)
- connect nodes
  - After a product node has been added, we need to choose which nodes participate in making the product.
  - "Stop making this product" and "stop synthesis" actions are also available.

Masking is applied to avoid choosing the same molecule again.

The process above can be represented as a DAG, which gives us an idea of how to synthesize the molecule in the lab.
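
To make the action sequence concrete, here is a minimal Python sketch that replays a flat list of actions into a DAG. It is a simplification and not the paper's implementation: the action encoding (`"B"`, `"P"`, `"STOP"`), the `SynthesisDAG` class, and the string-concatenating `toy_react` oracle are all hypothetical, and the "add node" and "choose molecule / connect" steps are merged into one tuple each.

```python
from dataclasses import dataclass, field


@dataclass
class SynthesisDAG:
    """Toy synthesis DAG: edges run from reactants to the product they form."""
    building_blocks: list = field(default_factory=list)  # leaf molecules
    products: dict = field(default_factory=dict)         # product -> reactants


def toy_react(reactants):
    """Hypothetical reaction oracle: just names the product after its inputs."""
    return "(" + "+".join(reactants) + ")"


def build_dag(actions, react):
    """Replay actions into a SynthesisDAG.

    Each action is one of (assumed encoding):
      ("B", molecule)     add a building-block node and choose its molecule
      ("P", [reactants])  add a product node and connect its reactant nodes
      ("STOP",)           stop synthesis
    """
    dag = SynthesisDAG()
    for action in actions:
        kind = action[0]
        if kind == "B":
            molecule = action[1]
            # Masking: a molecule that is already in the DAG cannot be re-added.
            if molecule in dag.building_blocks:
                raise ValueError(f"{molecule} is masked (already chosen)")
            dag.building_blocks.append(molecule)
        elif kind == "P":
            reactants = list(action[1])
            known = set(dag.building_blocks) | set(dag.products)
            missing = [r for r in reactants if r not in known]
            if missing:
                raise ValueError(f"unknown reactant nodes: {missing}")
            dag.products[react(reactants)] = reactants
        elif kind == "STOP":
            break
        else:
            raise ValueError(f"unknown action type: {kind}")
    return dag


actions = [
    ("B", "A"), ("B", "B"), ("P", ["A", "B"]),
    ("B", "C"), ("P", ["(A+B)", "C"]),
    ("STOP",),
]
dag = build_dag(actions, toy_react)
final_molecule = list(dag.products)[-1]  # the last product is the target
print(final_molecule, dag.products[final_molecule])
```

Reading `dag.products` from the final molecule back to the building blocks gives the synthesis route, which is the point of generating the DAG rather than the molecule alone.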

How do the authors solve the problem?

General model:

source: original paper
- RNN to get a latent $z$
- The RNN takes the embedding of the previous action as input.
- An action may pick a molecule (which is a graph); a GGNN is used to get its embedding.
- The output of the RNN goes into different heads (one head per action type).
- When building a product, we need to know what the reaction yields. The authors assume access to an oracle for that.
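
As a rough illustration of one such head combined with masking, the sketch below scores candidate building blocks against the RNN output and renormalizes over the choices that are still allowed. Scoring by a dot product between the RNN output and molecule embeddings is an assumption made for illustration, and the function names `masked_softmax` and `building_block_head` are hypothetical; real embeddings would come from the GGNN rather than being hard-coded.

```python
import math


def masked_softmax(logits, mask):
    """Softmax over logits; entries with mask[i] == False get probability 0."""
    allowed = [l for l, m in zip(logits, mask) if m]
    if not allowed:
        raise ValueError("every choice is masked")
    top = max(allowed)  # subtract the max for numerical stability
    exps = [math.exp(l - top) if m else 0.0 for l, m in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def building_block_head(hidden, block_embeddings, already_used):
    """Distribution over building blocks given the RNN output `hidden`.

    Assumption: logits are dot products between `hidden` and each embedding.
    `already_used` holds indices that must not be chosen again.
    """
    logits = [sum(h * e for h, e in zip(hidden, emb)) for emb in block_embeddings]
    mask = [i not in already_used for i in range(len(block_embeddings))]
    return masked_softmax(logits, mask)


hidden = [1.0, 0.0]                                  # toy RNN output
embeddings = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0]]    # toy molecule embeddings
probs = building_block_head(hidden, embeddings, already_used={0})
print(probs)
```

Block 0 has the highest raw score but is masked, so all probability mass is shared between blocks 1 and 2.
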

G1 (generate a ton of molecules):
- An autoencoder over DAGs of molecular graphs (DoG-AE) is used here.
- Wasserstein AEs are used here: $\min_{\phi, \theta}\mathbb{E}_{\mathcal{M}\sim p(\mathcal{M})}\mathbb{E}_{q_\phi(z\mid \mathcal{M})}[-\log p_\theta(\mathcal{M}\mid z)]+\lambda \mathcal{D}(q_\phi(z), p(z))$, where $\mathcal{D}$ is the maximum mean discrepancy (MMD).
- The encoder is interesting here. The authors use hierarchical (two-level) message passing:
  - first level is molecular level to get an embedding of the blocks
  - second level is message passing on the synthesis DAG
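
The two levels of the encoder can be sketched as the same routine applied twice: once over the atoms of each molecule, and once over the synthesis DAG whose node features are the resulting molecule embeddings. This is a toy under loud assumptions: it uses plain sum aggregation over undirected edges instead of the gated GGNN update, sum pooling as the readout, and hypothetical function names (`message_passing`, `pool`, `embed_molecule`, `embed_dag`).

```python
def message_passing(features, edges, steps=2):
    """Sum-aggregation message passing.

    features: one feature vector (list of floats) per node
    edges:    (i, j) index pairs, treated as undirected
    """
    h = [list(f) for f in features]
    for _ in range(steps):
        incoming = [[0.0] * len(v) for v in h]
        for a, b in edges:
            for k in range(len(h[a])):
                incoming[b][k] += h[a][k]
                incoming[a][k] += h[b][k]
        # Toy update: add aggregated messages to the node state (no gating).
        h = [[x + m for x, m in zip(v, inc)] for v, inc in zip(h, incoming)]
    return h


def pool(node_vectors):
    """Sum-pool node vectors into a single graph-level vector."""
    return [sum(column) for column in zip(*node_vectors)]


def embed_molecule(atom_features, bonds):
    """Level 1: message passing over atoms, then pooling."""
    return pool(message_passing(atom_features, bonds))


def embed_dag(molecules, dag_edges):
    """Level 2: message passing over the DAG of molecule embeddings."""
    molecule_vectors = [embed_molecule(atoms, bonds) for atoms, bonds in molecules]
    return pool(message_passing(molecule_vectors, dag_edges))


# Two toy molecules: a two-atom one and a single-atom one, joined by one DAG edge.
molecules = [([[1.0], [0.0]], [(0, 1)]), ([[1.0]], [])]
z_input = embed_dag(molecules, dag_edges=[(0, 1)])
print(z_input)
```

In the actual model the DAG-level vector would parameterize $q_\phi(z\mid \mathcal{M})$; here it is only printed.
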

G2 (generate a molecule with given properties):
- Hill climbing from GuacaMol (a cross-entropy method that can be viewed as REINFORCE with some reward shaping).
- Train a generative model.
- Fine-tune the decoder (sample a lot of candidates and fine-tune on top-K).
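
The sample / keep top-K / fine-tune loop can be sketched on a toy "generative model" that is just a weighted distribution over a handful of candidates. Everything here is an assumption for illustration: the `hill_climb` function, its parameters, the additive weight update standing in for fine-tuning, and the toy scores. In the real setting the samples are synthesis DAGs and the update is gradient-based fine-tuning of the decoder.

```python
import random


def hill_climb(weights, score, rounds=5, n_samples=50, top_k=10, step=1.0, seed=0):
    """Toy hill climbing: sample, keep the top-K by score, upweight them.

    weights: dict mapping candidate -> unnormalized sampling weight
    score:   callable mapping candidate -> float (higher is better)
    Returns the final normalized sampling distribution.
    """
    rng = random.Random(seed)
    weights = dict(weights)  # do not mutate the caller's dict
    for _ in range(rounds):
        items = list(weights)
        samples = rng.choices(items, weights=[weights[i] for i in items], k=n_samples)
        elite = sorted(samples, key=score, reverse=True)[:top_k]
        # Stand-in for fine-tuning on the elite set: raise their sampling weight.
        for item in elite:
            weights[item] += step
    total = sum(weights.values())
    return {item: w / total for item, w in weights.items()}


toy_scores = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 3.0}   # hypothetical property scores
initial = {name: 1.0 for name in toy_scores}            # uniform starting model
final = hill_climb(initial, toy_scores.get)
print(final)
```

Each round shifts probability mass toward whatever scored best among the samples, which is why the method behaves like REINFORCE with a thresholded reward.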

And?

It's folklore already, but SMILES is such a strong baseline!
While SMILES-based generation performs well, it can give molecules which we cannot synthesize, and the proposed approach is far better than SMILES on that measure.