What?
Hierarchical RL (TD3 + Wolpertinger) for synthesising feasible molecules.
Why?
As we've seen in the motivational section here, ML models often generate molecules that are impossible to obtain in real life. We need to change this.
How?
(source: original paper)
- The paper uses RL to synthesise a molecule
- To reduce a humongous action space, it follows a hierarchical approach:
- First, a reaction template is chosen with a policy $f(s)$ (the paper uses $R$, for reactant, instead of $s$ to denote the state)
- Second, a policy $\pi$ outputs a continuous action $a = \pi(s, f(s))$.
- Invalid reaction templates are masked out: $T = T \odot T_\text{mask}$
- Since there is a discrete choice in the middle of the pipeline, Gumbel Softmax is used to make backprop possible: $T = \text{GumbelSoftmax}(T, \tau)$, where $\tau$ is the temperature. There are more details about the Gumbel trick here.
- Now, since the action $a$ is continuous but the actual action space (the set of reactants) is discrete, we look up the reactant whose embedding is closest to $a$ ($k$NN can return multiple candidates); a rough sketch of the whole selection step is below.
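To make the pipeline concrete, here is a minimal PyTorch sketch of the action-selection step described above. This is not the authors' code: the network names (`f_net`, `pi_net`), the precomputed `reactant_emb` matrix, and all shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def select_action(state, f_net, pi_net, template_mask, reactant_emb, tau=1.0, k=1):
    """state: embedding of the current reactant (the paper's R), 1-D tensor."""
    # 1) score the reaction templates and zero out chemically invalid ones
    t_logits = f_net(state)                          # T
    t_masked = t_logits * template_mask              # T = T ⊙ T_mask
    # 2) Gumbel-Softmax keeps the discrete template choice differentiable
    t_onehot = F.gumbel_softmax(t_masked, tau=tau, hard=True)
    # 3) second-level policy proposes a continuous action a = π(s, f(s))
    a = pi_net(torch.cat([state, t_onehot], dim=-1))
    # 4) Wolpertinger step: take the k reactants whose embeddings are nearest to a
    dists = torch.cdist(a.unsqueeze(0), reactant_emb).squeeze(0)
    candidates = dists.topk(k, largest=False).indices
    return t_onehot, candidates                      # chosen template + candidate reactant ids
```

At training time the $k$ candidate reactants could then be evaluated and the best product/reward kept, though (as noted below) the authors report using $k=1$ throughout.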
And?
- This is a very nice application (with modifications) of the Wolpertinger approach to a highly useful real-life problem.
- I found the description of the RL part confusing. On the one hand, the authors clearly state that they use TD3. But then they start describing DDPG with target networks on top of it as if it were some other algorithm.
- kNN is another confusing part of the paper. The authors first say that they generate $k$ neighbours and return the best product/reward among them. But later they say "we have only used k=1 during both the training and inference phases of our algorithm" for fair comparison.
- I don't really understand how action embeddings are precomputed. Are they just random projections?
- Using $R$ in an RL paper for something other than the reward is 🤯🤯🤯
- This paper, which we reviewed, seems to address the same problem (synthetic feasibility) with a similar approach.
- There is a lot of description of chemical datasets and software I have never heard of. There might be something important on that side, but for me it is just a way to get a better reward function for this environment.
This note is a part of my paper notes series. You can find more here or on Twitter. I also have a blog.