This is going to be the first in a series of posts on what I am calling “DIY Drug Discovery”. Admittedly, though, this title is hyperbolic. Discovering a new drug requires bringing it through a series of trials, which is very hard, if not impossible for an individual to do themselves. What I’m really going to be discussing is DIY drug screening. I believe in the near future individuals will be able to discover candidate drug molecules using commodity GPU hardware and open source software, such as scikit-learn.org, https://keras.io/, http://www.rdkit.org/ and https://deepchem.io/.

# Background on chemical space and drug discovery

The set of all possible molecules, which is known as chemical space, is incredibly vast. The Chemical Abstracts Service (CAS) registry lists 49,037,297 known molecules (molecules that have actually be synthesized). The number of different molecules that have been synthesized, in both public and private settings, may reach towards 100,000,000. Yet this is only tiny fraction of the number of possible molecules. Recent research has tried to enumerate the number of possible molecules containing Carbon, Nitrogen, Oxygen, Hydrogen, Oxygen, and halogens which may be of interest for drug discovery. So far, possible molecules up to a size of 17 atoms have been enumerated and a large database called GBD has been created. Enumerating possible molecules isn’t as simple as enumerating possible molecular graphs. There are many physical constraints that the authors had to take into account. For instance:

• bonding - many types of bonds are not possible – covalent bonds must be between an electron donor and an electron acceptor. A Chlorine atom cannot bond to a flourine and a sodium cannot bond to a magnesium.

• geometrical constraints - Due to the nature of atomic orbitals, each atom prefers to bond in particular configuration. For instance, carbon prefers to bond tetrahedrally and when forming rings prefers to form hexagons. Pentagons and other shapes are not necessarily impossible, but they are generally high unstable. For instance, cubane, where carbon is forced to bond at 90 degrees, is highly unstable and explosive material. Carbon bonding in a triangle, or an isolated pentagon or decagon is impossible.

• functional group stability - Many arrangements may be structurally stable in isolation but are highly reactive, making them near impossible to contain in the real world. For instance, if an oxygen or nitrogen atom is bonded to a non-aromatic C=C, this creates an Enol or enamine group which is highly reactive. The authors of the GBD database used 12 seperate filters to limit what types of functional groups could exist.

• stereochemistry - the basic idea of this constraint is that “two things cannot be in the same place at the same time”. For instance, a molecule cannot loop back and intersect itself.

• synthesizability - the molecule most be synthesizable. For instance molecules with more than three carbon rings or allenes (C=C=C) are exceedingly difficult to synthesize.

These basic physical constraints reduce the size of the chemical space significantly. For instance the authors of GBD note “The vast majority of fused small ring systems are high strained and reactive. 96.7% of molecules with carbon are removed by this constraint. However, even after applying 29 filters, the GBD database still contains 10^11 molecules. This is a large number, but tractable in the sense that it could be stored on the very latest computer hardware. (storing the database in SMILES format would consume about 63 terabytes. The entire database is not available for public consumption, only a subset of 50 million compounds is downloadable). Keep in mind that this number was obtained after applying many constraints - the number of organic molecules of drug like size with normal atoms (CNOH + halogens) and up to 4 rings has been estimated to be around 10^60!

Apart from whether a molecule can physically exist and be synthesized, there are further constraints when it comes to making a drug:

• solubility the molecule must be soluble in the blood.

• toxicity the molecule cannot be toxic.

• permeability & lipophilicity can it get through any biological membranes it needs to, such cell wall or the blood brain barrier?

• binding target affinity will it actually bind to the target of interest?

Traditionally drug design has been done through a trial and error approach. A professor once described it to me as “trying to build an airplane by attaching the wings in at random orientations, and then seeing if it flies.” Just as computer simulations are used to design highly aerodynamic airplanes, so that less actual physical assembly and testing needs to be done, physics based simulation holds the promise of being able to design and optimize drug molecules before they are actually syntheized.

However quantum simulations are extremely computationally intensive and approximate methods often are not accurate enough, so would be very helpful if machine learning could be used to screen molecules instead.

# Basic concepts of fingerprinting

The first step in molecular machine learning is encoding the structure of the molecule in a form that is amenable to machine learning. This is where a lot of research is currently focused. A useful representation encodes features that are relevant and is efficient, so as to avoid the curse of dimensionality. Fortunately, there is a way method of featurazation called fingerprinting which already has a long history of development in the world of drug discovery.

Fingerprinting was originally designed to solve the problem of molecular substructure search – how do I find all the molecules in my database with a particular substructure? A related problem is molecular similarity - given a molecule, how do I find the molecules in my database that are most similar. Only later were fingerprints applied for machine learning, or what has traditionally called quantitative structure property relationships (QSPR).

Fingerprinting creates an efficient representation of the molecular graph. The basic process of fingerprinting is as follows: First the algorithm generates a set of patterns. For instance, enumeration of different paths is common:

 0-bond paths: 1-bond paths: C OC O C==C N CN 2-bond paths: 3-bond paths: OC=C OC=CN C=CN

Storing all this data would result in an enormous representation. The trick of fingerprinting is to “hash” each of these features, which essentially means they act as seeds to a random number generator called a hash function. The hash function generates a bit string. Typically the hash function is chosen so that 4 or 5 bits per pattern are non-zero in the bit string. Next, all of the bit strings are OR’ed together.

This process is very useful for substructure searching because every bit of the substructure’s fingerprint will be on (=1) in the molecule’s fingerprint. Accidental collisions between patterns or substructures are possible, but if the bit density is low (known as a sparse representation) they are rare. Ideally, the on bit density of the bitstring should be tuned so as to reach a particular discriminatory power that is needed (ie. the chance that two structurally different molecules in my database have the same fingerprint is only 1%). There is a tradeoff between discriminatory power and the efficiency of the representation.

In practice, creating a bit density is done through folding. A very long, very sparse fingerprint is ‘folded’ down (with OR operations) to create a fingerprint with a desired length and good bit density. In case this is not obvious what this means, it can be literally thought of as folding the vector onto itself.

Now let’s test some fingerprints. We use rdkit.org, an open source cheminformatics library. Our data consists of https://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system strings, which encode molecular graphs into a string of characters, and experimentally determined solubility values.

from rdkit import Chem
from rdkit.Chem.EState import Fingerprinter
from rdkit.Chem import Descriptors
from rdkit.Chem.rdmolops import RDKFingerprint
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn import cross_validation
from keras import regularizers
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from sklearn import cross_validation
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge, LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV
from rdkit.Avalon.pyAvalonTools import GetAvalonFP

data = pd.read_table('../other_data_sets/solubility_data.txt', sep=' ', skipfooter=0)

def estate_fingerprint_and_mw(mol):
return np.append(FingerprintMol(mol)[0], Descriptors.MolWt(x))

data['Mol'] = data['SMILES'].apply(Chem.MolFromSmiles)
num_mols = len(data)

#Create X and y
#Convert to Numpy arrays
y = data['Solubility'].values

Next we make a bunch of different fingerprints. To do this, I have created a fingerprint object, which stores the name of the fingerprint and contains a method for applying the fingerprint and then converting the output into a NumPy array.
from rdkit.Chem.rdMolDescriptors import
from rdkit.Chem.AtomPairs.Sheridan import GetBPFingerprint
from rdkit.Chem.EState.Fingerprinter import FingerprintMol
from rdkit.Avalon.pyAvalonTools import GetAvalonFP #GetAvalonCountFP  #int vector version
from rdkit.Chem.AllChem import  GetMorganFingerprintAsBitVect, GetErGFingerprint
from rdkit.DataStructs.cDataStructs import ConvertToNumpyArray
import rdkit.DataStructs.cDataStructs

def ExplicitBitVect_to_NumpyArray(bitvector):
bitstring = bitvector.ToBitString()
intmap = map(int, bitstring)
return np.array(list(intmap))

class fingerprint():
def __init__(self, fp_fun, name):
self.fp_fun = fp_fun
self.name = name
self.x = []

def apply_fp(self, mols):
for mol in mols:
fp = self.fp_fun(mol)
if isinstance(fp, tuple):
fp = np.array(list(fp[0]))
if isinstance(fp, rdkit.DataStructs.cDataStructs.ExplicitBitVect):
fp = ExplicitBitVect_to_NumpyArray(fp)
if isinstance(fp,rdkit.DataStructs.cDataStructs.IntSparseIntVect):
fp = np.array(list(fp))

self.x += [fp]

if (str(type(self.x[0])) != "<class 'numpy.ndarray'>"):
print("WARNING: type for ", self.name, "is ", type(self.x[0]))

def make_fingerprints(length = 512, verbose=False):
fp_list = [
#fingerprint(lambda x : GetBPFingerprint(x, fpfn=AtomPair),
#            "Physiochemical properties (1996)"), ##NOTE: takes a long time to compute
fingerprint(lambda x : GetHashedAtomPairFingerprintAsBitVect(x, nBits = length),
&qfuot;Atom pair (1985)"),
fingerprint(lambda x : GetHashedTopologicalTorsionFingerprintAsBitVect(x, nBits = length),
"Topological torsion (1987)"),
fingerprint(lambda x : GetMorganFingerprintAsBitVect(x, 2, nBits = length),
"Morgan circular "),
fingerprint(FingerprintMol, "Estate (1995)"),
fingerprint(lambda x: GetAvalonFP(x, nBits=length),
"Avalon bit based (2006)"),
fingerprint(lambda x: np.append(GetAvalonFP(x, nBits=length), Descriptors.MolWt(x)),
"Avalon+mol. weight"),
fingerprint(lambda x: GetErGFingerprint(x), "ErG fingerprint (2006)"),
fingerprint(lambda x : RDKFingerprint(x, fpSize=length),
"RDKit fingerprint")
]

for fp in fp_list:
if (verbose): print("doing", fp.name)
fp.apply_fp(list(data['Mol']))

return fp_list

fp_list = make_fingerprints()


The following code compares the fingerprints we have created, using 20 fold [cross validation](https://en.wikipedia.org/wiki/Cross-validation) and a ridge regression model.
def test_model_cv(model, x, y, cv=20):
scores = cross_validation.cross_val_score(model, x, y, cv=cv, n_jobs=-1, scoring='neg_mean_absolute_error')

scores = -1scores

return scores.mean()

def test_fingerprints(fp_list, model, y, verbose = True):

fingerprint_scores = {}

for fp in fp_list:
if verbose: print("doing ", fp.name)
fingerprint_scores[fp.name] = test_model_cv(model, fp.x, y)

sorted_names = sorted(fingerprint_scores, key=fingerprint_scores.__getitem__, reverse=False)

print("\\begin{tabular}{c c}")
print("           name        &  avg abs error in CV (kJ/cc) \\\\")
print("\\hline")
for i in range(len(sorted_names)):
name = sorted_names[i]
print("%30s & %5.3f \\\\" % (name, fingerprint_scores[name]))
print("\\end{tabular}")

test_fingerprints(fp_list, Ridge(alpha=1e-9), y, verbose=True)


doing  Atom pair (1985)
doing  Topological torsion (1987)
doing  Morgan circular
doing  Estate (1995)
doing  Avalon bit based (2006)
doing  Avalon+mol. weight
doing  ErG fingerprint (2006)
doing  RDKit fingerprint

\begin{tabular}{c c}
name        &  avg abs error in CV (kJ/cc) \\
\hline
Estate (1995) & 0.699 \\
Topological torsion (1987) & 1.492 \\
ErG fingerprint (2006) & 1.697 \\
RDKit fingerprint & 2.869 \\
Morgan circular  & 2.928 \\
Atom pair (1985) & 3.043 \\
Avalon+mol. weight & 20.190 \\
Avalon bit based (2006) & 26.482 \\
\end{tabular}

def test_model_cv(model, x, y, cv=20):
scores = cross_validation.cross_val_score(model, x, y, cv=cv, n_jobs=-1, scoring='neg_mean_absolute_error')

scores = -1scores

return scores.mean()

def test_fingerprint_vs_size(model, num_sizes_to_test = 20, max_size=2048, cv = 20, verbose=False, makeplots=False):

fp_list = make_fingerprints(length = 10) #test
num_fp = len(fp_list)

sizes = np.linspace(100,max_size,num_sizes_to_test)

scores_vs_size = np.zeros([num_fp, num_sizes_to_test])

num_fp = 0
for i in range(num_sizes_to_test):
if verbose: print(i, ",", end='')
length = sizes[i]
fp_list = make_fingerprints(length = int(length))
num_fp = len(fp_list)
for j in range(num_fp):
scores_vs_size[j,i] = test_model_cv(model, fp_list[j].x, y)

if (makeplots):
import matplotlib.pyplot as plt
plt.clf()
fig = plt.figure(figsize=(10,10))
fp_names = []
for i in range(num_fp):
plt.plot(sizes, scores_vs_size[i,:],'-')
fp_names += [fp_list[i].name]
plt.title('Ridge regression, average CV score vs fingerprint length',fontsize=25)
plt.ylabel('average mean absolute error in CV ',fontsize=20)
plt.xlabel('fingerprint length', fontsize=20)
plt.legend(fp_names,fontsize=15)
plt.ylim([.2,4])
plt.show()

return scores_vs_size

scores_vs_size = test_fingerprint_vs_size(Ridge(alpha=1e-9), verbose=True, makeplots=True)

Unfortunately, we the error shooting up as the fingerprint length increases, which is not the expected behaviour. This is likely due to the fact that we are not tuning the regularization parameter alpha in the ridge regression -- we are just using the default value. The value of alpha is either set too high or too low. One solution would be to run a grid search or other hyperparamter optimization for each size we want to test. However, this is computationally expensive. An easier solution is to use Bayesian Ridge Regression in scikit-learn:

Now we see expected behaviour -- fingerprint performance increases as the length is increased, and eventually converges around 2000 bits. Note that the Estate has a fixed length of 80 and is more like a descriptor set rather than a fingerprint. Another point is that Lasso regression (L1 regularization) generally does slightly better than Ridge regression with fingerprints, since fingerprints can have elements which are always zero or almost always zero. [L1 regularization introduces a penalty term which has a gradient that forces coefficients corresponding to variables with no predictive power to be precisely zero, whereas L2 has trouble getting coefficients to be precisely zero.] In what follows, I use the Estate fingerprint, as I was comparing Estate with some other applications. I create a new data matrix *X* and then feed into machine learning models:

def estate_fingerprint(mol):
return FingerprintMol(mol)[0]

#Scale X to unit variance and zero mean
data['Fingerprint'] = data['Mol'].apply(estate_fingerprint)

X = np.array(list(data['Fingerprint']))

st = StandardScaler()
X = np.array(list(data['Fingerprint']))
X = st.fit_transform(X)


Next we do grid searches to tune the hyperparameters of the KernelRidge, Ridge, GaussianProcessRegressor, and RandomForestRegressor models in scikit-learn:
KRmodel = GridSearchCV(KernelRidge(), cv=10,
param_grid={"alpha": np.logspace(-10, -5, 10),
"gamma": np.logspace(-12, -9, 10), "kernel" : ['laplacian', 'rbf']}, scoring='neg_mean_absolute_error', n_jobs=-1)

KRmodel = KRmodel.fit(X, y)
Best_KernelRidge = KRmodel.best_estimator_
print("Best Kernel Ridge model")
print(KRmodel.best_params_)
print(-1*KRmodel.best_score_)

Best Kernel Ridge model
{'alpha': 3.5938136638046258e-10, 'gamma': 1e-10, 'kernel': 'laplacian'}
0.710770578217

Rmodel = GridSearchCV(Ridge(), cv=20,
param_grid={"alpha": np.logspace(-10, -5, 30),}, scoring='neg_mean_absolute_error', n_jobs=-1)

Rmodel = Rmodel.fit(X, y)
Best_Ridge = Rmodel.best_estimator_
print("Best Ridge model")
print(Rmodel.best_params_)
print(-1*Rmodel.best_score_)

Best Ridge model
{'alpha': 1e-10}
0.699188630722

GPmodel = GridSearchCV(GaussianProcessRegressor(normalize_y=True), cv=20,
param_grid={"alpha": np.logspace(-15, -10, 30),}, scoring='neg_mean_absolute_error', n_jobs=-1)
GPmodel = GPmodel.fit(X, y)
Best_GaussianProcessRegressor = GPmodel.best_estimator_
print("Best Gaussian Process model")
print(GPmodel.best_params_)
print(-1*GPmodel.best_score_)

Best Gaussian Process model
{'alpha': 2.2122162910704501e-15}
0.965220889797

RFmodel = GridSearchCV(RandomForestRegressor(), cv=20,
param_grid={"n_estimators": np.linspace(50, 150, 25).astype('int')}, scoring='neg_mean_absolute_error', n_jobs=-1)

RFmodel = RFmodel.fit(X, y)
Best_RandomForestRegressor = RFmodel.best_estimator_
print("Best Random Forest model")
print(RFmodel.best_params_)
print(-1*RFmodel.best_score_)

Best Random Forest model
{'n_estimators': 91}
0.64131656385


# Testing out different machine learning models

import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import rbf_kernel

def make_scatter_plot(y_pred_train, y_pred_test, y_train, y_test, title='', figsize=(6,4), fontsize=16):
plt.clf()
plt.figure(figsize=figsize)
plt.scatter(y_train,y_pred_train, label = 'Train', c='blue')
plt.title(title,fontsize=fontsize+5)
plt.xlabel('Experimental Solubility (mol/L)', fontsize=fontsize)
plt.ylabel('Predicted Solubility (mol/L)', fontsize=fontsize)
plt.scatter(y_test,y_pred_test,c='lightgreen', label='Test', alpha = 0.8)
plt.legend(loc=4)
plt.show()

def test_models_and_plot(x, y, model_dict, plots=True):
''' test a bunch of models and print out a sorted list of CV accuracies
inputs:
x: training data features, numpy array or Pandas dataframe
y: training data labels, numpy array or Pandas dataframe
model_dict: a dictionary of the form {name : model()}, where 'name' is a string
and 'model()' is a scikit-learn model object.
'''

mean_scores = {}
percent_errors = {}

for (name, model) in model_dict.items():
#print("running %s" % name)
scores = cross_validation.cross_val_score(model, x, y, cv=20, n_jobs=-1, scoring='neg_mean_absolute_error')
scores = -1*scores
mean_score = scores.mean()
mean_scores[name] = mean_score

X_train, X_test, y_train, y_test = cross_validation.train_test_split(x, y, test_size=0.1)

model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)

y_pred_test  = model.predict(X_test)

percent_error = np.mean( 100*np.abs(y_test -y_pred_test)/np.abs(y_pred_test))

percent_errors[name] = percent_error

fulltitle = name+'\n mean % error: '+str(percent_error)

if plots:
make_scatter_plot(y_pred_train, y_pred_test, y_train, y_test, title=fulltitle, figsize = (8,6))

sorted_names = sorted(percent_errors, key=mean_scores.__getitem__, reverse=False)

print("\\begin{tabular}{c c c}")
print("           name     &      % test err   & .    abs error in CV \\\\ ")
print("\\hline")
for i in range(len(sorted_names)):
name = sorted_names[i]
print("%30s & %5.3f & %5.3f \\\\" % (name, percent_errors[name], mean_scores[name]))
print("\\end{tabular}")

alpha_grid = {'alpha': np.logspace(1e-11,1e-1,8)}

model_dict = {
'Linear Regression': LinearRegression(),
'Kernel Ridge Regression': Best_KernelRidge,
'Ridge Regression':Best_Ridge,
'Guassian Process Regressor': Best_GaussianProcessRegressor,
'Support Vector Regression': SVR(),
'KNeighborsRegressor': KNeighborsRegressor(),
'Neural Network': MLPRegressor(alpha=100,max_iter=8000, hidden_layer_sizes=[8,6], early_stopping=False),
'Random forest': Best_RandomForestRegressor
}

test_models_and_plot(X, y, model_dict)

\begin{tabular}{c c c}
name     &      % test err   & .    abs error in CV \\
\hline
Support Vector Regression & 81.770 & 0.641 \\
Random forest & 46.825 & 0.643 \\
Kernel Ridge Regression & 55.056 & 0.685 \\
Ridge Regression & 40.707 & 0.699 \\
Gradient Boosted Trees & 40.320 & 0.755 \\
KNeighborsRegressor & 73.877 & 0.840 \\
Guassian Process Regressor & 38.252 & 0.967 \\
Neural Network & 35.076 & 1.043 \\
Linear Regression & 212.043 & 1384058953421.953 \\
\end{tabular}


The output is formatted in LaTeX so a nice table can be produced, which ends up looking like this:

# Conclusion

The best performing model in this case was support vector regression, followed closely by Random Forrest and Kernel Ridge Regression. In general, I find that KRR and random forest are usually the best performing models, and this is the first time SVR has won out. However, in molecular machine learning, the model that is used is really not the limiting factor, provided one does their due diligence and explores several classes of models and optimizes their hyperparameters. The component which really determines success is the featurization technique used. The machine learning model can be thought of an black box machine that extracts the nonlinear dependence of Y on X. As long as the black box is large enough and very well tuned and regularized, we don’t have to worry about it much. What we need to worry about more is creating the right X from our molecule data so that the black box can do its job. Traditionally, human created fingerprints and descriptors have been used, which are based on physics and chemical intuition about what molecular features are important. In the future ML can also be applied to fingerprint design – this will be discussed in a future post.

Another thing we observe is the large difference between the test and CV errors. The test error is for a particular train-test split, and appears to have high variance. The CV error is averaged over all test-train splits. The percentage error depends highly on whether one is working with a very soluble or insoluble molecule. For some applications the error profile of this model may be acceptable, in others it may be too crude. The error is definitely larger than a different application I applied this same code to, which was predicting decomposition energy. In that case I obtained average errors on the order of ~20% or less.