Machine Learning in the Generation of Novel Molecules

A dissertation submitted in partial fulfillment of the requirements of Master of Professional Studies in Data Science

After a very busy year, I was pleased to submit my dissertation at the end of October. Researching an application of machine learning in cheminformatics has been interesting, and I hope I have clearly described a use for sequence-to-sequence autoencoders in the representation and generation of chemical structures.

Should you wish to read it, I have embedded my dissertation in the page below. Alternatively, you can download a PDF copy here.


Techniques to sample from the vast space of possible molecular structures can aid in the design of novel chemicals with desired properties. Existing approaches rely on large databases (\(n > 10^7\)) and have high computational requirements. We demonstrate an efficient encoding of chemical space for a small data set of organic molecules known to undergo specific biotransformations (\(n < 10^3\)), leveraging techniques from natural language processing (NLP). We provide examples of the generation of semantically correct molecular representations. Furthermore, we show that the generated molecular representations can be assessed to determine whether they are likely to undergo specific biotransformations that are present in the initial data set.
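
To give a feel for what treating molecules as text can look like, here is a minimal sketch of a character-level one-hot encoding of the kind commonly fed to a sequence-to-sequence autoencoder. It rests on assumptions not stated in the abstract: that molecules are written as SMILES strings, that each character is one token, and that a space is used for padding. The three example molecules (ethanol, benzene, acetic acid) and the helper names `build_vocab`, `one_hot` and `decode` are illustrative only and are not taken from the dissertation.

```python
from typing import Dict, List

PAD = " "  # padding symbol (assumption: never appears inside a SMILES string)


def build_vocab(smiles: List[str]) -> Dict[str, int]:
    """Map every character seen in the data set to an integer index."""
    chars = sorted({ch for s in smiles for ch in s})
    return {ch: i for i, ch in enumerate([PAD] + chars)}


def one_hot(s: str, vocab: Dict[str, int], max_len: int) -> List[List[int]]:
    """Encode a string as a max_len x len(vocab) matrix of 0s and 1s."""
    rows = []
    for ch in s.ljust(max_len, PAD):  # right-pad short strings
        row = [0] * len(vocab)
        row[vocab[ch]] = 1
        rows.append(row)
    return rows


def decode(matrix: List[List[int]], vocab: Dict[str, int]) -> str:
    """Invert one_hot by taking the most active position in each row."""
    inverse = {i: ch for ch, i in vocab.items()}
    return "".join(inverse[row.index(max(row))] for row in matrix).rstrip(PAD)


# Illustrative molecules only: ethanol, benzene, acetic acid.
molecules = ["CCO", "c1ccccc1", "CC(=O)O"]
vocab = build_vocab(molecules)
max_len = max(len(s) for s in molecules)
encoded = [one_hot(s, vocab, max_len) for s in molecules]
```

An autoencoder trained on matrices like these would learn to compress each one into a fixed-length vector and reconstruct it, and `decode` would turn the reconstructed rows back into a string. Character-level splitting is a simplification: two-letter atoms such as `Cl` or `Br` would be broken into two tokens, so a real tokenizer would need to handle them.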

10th of November, 2019.