Testing out decision trees, AdaBoosted trees, and random forest

Recently I experimented with decision trees for classification, to get a better idea of how they work. First I created some 2 dimensional training data with 2 categories, using sci-kit-learn:


import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons, make_circles, make_classification

n_samples = 1000

X1, Y1 = make_classification(n_samples = n_samples, n_features=2, n_redundant=0, 
 n_informative=2, n_clusters_per_class=1, 
 class_sep=1, random_state=0)

X2, Y2 = make_moons(n_samples = n_samples, noise=0.3, random_state=0)

Next I wrote some code to test a decision tree with different numbers of splits for the two data sets:

from sklearn.tree import DecisionTreeClassifier

def split(X):
 mid = len(X)//2
 return X[0:mid], X[mid:]

def plot_dt_vs_num_splits(X, Y, max_num_splits = 5, plot_step = .02):
 #split train and test data
 Xtrain, Xtest = split(X)
 Ytrain, Ytest = split(Y)
 #make meshgrid for plotting 
 x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
 y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
 xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
 np.arange(y_min, y_max, plot_step))


splits = [1,2,5,10,100]
 num_splits = len(splits)
 for n in range(num_splits):
 dt = DecisionTreeClassifier(criterion = 'gini', max_depth=splits[n])
 dt.fit(Xtrain, Ytrain)

#plot on mesh
 Z = dt.predict(np.c_[xx.ravel(), yy.ravel()])
 Z = Z.reshape(xx.shape)
 plt.subplot(1, num_splits, n+1)
 plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
 plt.scatter(X[:, 0], X[:, 1], marker='o', c=Y) 
 plt.title("n_splits = " + str(splits[n]) +
 "\n train data accuracy = " + str(dt.score(Xtrain,Ytrain)) +
 "\n test data accuracy = " + str(dt.score(Xtest,Ytest)), fontsize=15)

plot_dt_vs_num_splits(X1, Y1)
plot_dt_vs_num_splits(X2, Y2)



Here decision trees are exhibiting their classic weakness – so-called “high bias”, ie overfitting. Here, the tree was able to fit the data
first set of data (1000 points, 2 Gaussians) perfectly after 10 splits. In the case of the second data sets (1000 points, ‘moons’), the test data
accuracy decreases going from 10 to 100 splits, while the training data accuracy improves to 1, which is a sign of overfitting.

That is why people use boosted ensembles of trees. An ensemble is a weighted sum of models. In ensembles, many simple models can be combined to create a more complex one. The real utility of ensembles comes from how they are trained though, using boosting methods (wikipedia), which are a set of different techniques for training ensembles while preventing (or more technically delaying) overfitting.


(max depth of base estimators = 2) 


Random Forest
(max depth of base estimators = 2)
Random forest is currently one of the most widely used classification techniques in business. Trees have the nice feature that it is possible to explain in human-understandable terms how the model reached a particular decision/output.  Here random forest outperforms Adaboost, but the ‘random’ nature of it seems to be becoming apparent.. 



more commentary will follow.

Talks on boosted trees

Peter Prettenhofer – Gradient Boosted Regression Trees in scikit-learn
Trevor Hastie (a more detailed talk on Boosting / Ensembles)
MIT : boosting (a mishmash of general concepts) 
Nando de Freitas : decision trees 
Nando de Freitas : Random forest 

Leave a Comment

Filed under machine learning, Uncategorized

The AI control problem is more pressing than the consciousness question

A lot of people commented during and after the recent Ethics of AI conference at NYU that we still don’t know what the necessary conditions for consciousness are, and that this problem lingered like an elephant in the room. The implication seemed to be that this problem cast a pale on a lot of the work that was discussed at the conference. One commentator even summarized the conference as a ‘road to nowhere‘ at least partially because of this issue. 

The ‘conditions for consciousness problem’ is critically important, and the reasons for this were articulated especially well by the panelist Susan Schneider. Several important ‘forks in the road’ in the future development of mankind hinge on whether we think the AIs we create are conscious:
Continue reading

Leave a Comment

Filed under philosophy

Some subtler problems with Wikipedia articles

Wikipedia has a lot of problems, including some subtle but serious ones that seem difficult to fix without radical changes to how the platform operates.

I’ve watched the growth of Wikipedia since my first edit, which was in 2004. Since then, I’ve accrued 16,044 edits. 10,287 of those were in 2007, when I was very active as an anti-vandalism patroller. Over the years I’ve created 68 pages in total. Wikipedia has always had obvious problems such as vandalism, systemic bias, and link dropping, which are being addressed by a variety of concerted efforts.  Lately I’ve been noticing a lot of more subtle problems with Wikipedia articles, which has caused me to seek out higher quality sources of information. To put it bluntly, Wikipedia articles are just not very well written. They lack logical progression and consistency in their style and level of technical depth. Of course, it’s difficult for the Wikipedia platform to achieve either, since many different authors are constantly adding and subtracting sentences from every article.  Continue reading

Leave a Comment

Filed under Uncategorized

Recursion is slow

Recursion is something that many computer science majors consider elegant. However, in simulation, speed far outweighs how many lines of code are underneath. [That is one reason why physicists still code in Fortran.]

Continue reading

Leave a Comment

Filed under programming, python, Uncategorized

An introduction to the water structure problem

Although the macroscopic properties of water have been heavily studied, there are things we don’t understand about this ubiquitous substance. In this post, I will provide an introduction to the problem of describing water’s structure. At first glance, the idea of a liquid having structure seems preposterous. Indeed, liquids cannot maintain a structural arrangement of atoms like solids can. Instead, the atoms/molecules tumble past each other in constant state of motion. This allows for the defining property of the liquid state – the ability to fill a container.   Continue reading


Filed under non-technical, research, Uncategorized

Our Nature Communciations paper – ice-like phonons in liquid water

Our paper, “The hydrogen-bond network of water supports propagating optical phonon-like modes” was published on January 4th in Nature Communications (full open access pdf). A press release about our work has been issued by the Stony Brook Newsroom and picked up by news aggregator Phys.org.

Our work shows that propagating vibrations or phonons can exist in water, just as in ice. The work analyzes both experimental data and the results of extensive molecular dynamics simulations performed with a rigid model (TIP4P/eps), a flexible model (TIP4P/2005f), and an ab-initio based polarizable model (TTM3F). Many of these simulations were performed on the new supercomputing cluster at Stony Brook’s Institute for Advanced Computational Science.

Continue reading


Filed under non-technical, research, Uncategorized

“Accelerando” by Charles Stross

by Charles Stross
2006, 415 pg

“There is an intrinsic unknowability about the technological singularity. Most writers leave it safely offstage or invent reasons why it doesn’t happen. Not Charles Stross. Accelerando lives up to its name, and is the most unflinching look into radical optimism I’ve seen.” – Vernor Vinge

During winter break I finally read Accelerando.  I say “finally” because this book was first recommended to me in 2009 at the (now defunct) Rensselaer Polytechnic Institute transhumanism club. Accelerando is notable as being perhaps the first novel to have a storyline which traverses directly through a technological singularity.
Continue reading

Leave a Comment

Filed under books, singularity, technology, Uncategorized

Maximum entropy priors

How do we assign priors?

If we don’t have any prior knowledge, then the obvious solution is to use the principle of indifference. This principle says that if we have no reason for suspecting one outcome over any other, than all outcomes must be considered equally likely. Jakob Bernoulli called this the “principle of insufficient reason”, a play on the “principle of sufficient reason”, which asserts that everything must have a reason or cause. This may be the case, but if we are ignorant of reasons, we cannot say that one outcome will be more likely than any other.
Continue reading

Leave a Comment

Filed under Bayesian inference, machine learning, python, Uncategorized

More counterintuitive Bayesian reasoning problems

Remember how in my last post I said Bayesian reasoning is counter-intuitive? It’s simultaneously maddening and fascinating because clearly, given we accept with certainty the assumptions that go into model/hypothesis selection and the prior, the application of Bayes’ theorem gives the correct probability for each model/hypothesis in light of the evidence presented.

Last time I gave the canonical example of a test for a disease. Humans tend to not take into account that if the base rate (over all frequency of having a disease) is very low, then a positive result on a test may not be very meaningful. If the probability of the test giving a false positive is 1%, and the base rate is also 1%, then the chance that you have disease given a positive result is only 50%. Once you understand this, the non-intuitive nature goes away.

Here I will give two more examples of highly non-intuitive Bayesian problems.

The Monty Hall problem
The famous Monty Hall problem can be solved with Bayes’ rule. A statement of the problem is:

There are three doors, labelled 1, 2, 3. A single prize has been hidden between one of them. You get to select one door. Initially your chosen door will not be opened. Instead, the gameshow host will choose one of the other two doors, and he will do so in such a way as to not reveal the prize. After this, you will be given a fresh choice among the 2 remaining doors – you can stick with your first choice, or switch to the other closed door.

Imagine you picked door 1. The gameshow host picks door 3, opening to reveal no prize. Should you (a.) stick with door 1, (b.) switch to door 2, or (c.) does it make no difference?

If you have not heard of this problem before, think it over a while.

The Bayes’ theorem solution is as follows: Denote the possible outcomes as:
 D_1 = the prize is between door 1
 D_2 = the prize is between door 2
 D_3 = the prize is between door 3

We know  P(D_1) = \frac{1}{3}. The question is what is [/latex]P(D_2 | D)[/latex]? We use the symbol  D to denote the data/evidence we have, which is the fact that the gameshow host opened door 3. We can solve this using Bayes’ theorem:

 P(D_2|E) = \frac{P(E |D_2)P(D_2) }{ P(E) }
 P(D_2|E) = \frac{P(E |D_2)P(D_2) }{P(E |D_1)P(D_1) +P(E |D_2)P(D_2)}
 \quad = \frac{( 1)(\frac{1}{2})} { (\frac{1}{3})(\frac{1}{2}) + (1)(\frac{1}{3})}
 \quad = \frac{2}{3}

Therefore it is better to switch to door 2. By switching to door 2, we double our chance of winning from  \frac{1}{3} to  \frac{2}{3}. The tricky part of the calculation is calculating the normalizing factor  P(E), where we must consider the probability the game show host will open door 3 when the prize is behind door 1 (=1/2) and the case where the prize is behind door 2 (=1).

Note the following mind-blowing shortcut to solving the problem:

Since door 3 was opened, we know  P(D_3) = 0. The gameshow host did nothing to interfere with door 1. Thus  P(D_1) = 1/3 as it was in the beginning. Now, we know  P(D_1) + P(D_2) + P(D_3) = 1, so  P(D_2) = 2/3!

Bayesian model comparison
Bayes’ theorem allows us to compare the likelihoods of different models being true. To take a concrete example, let’s assume we have black box with a button attached. Whenever we hit the button, and a light on top of the box blinks either green or red. We hit the button a number times, obtaining a sequence:


Let’s say say we model the black box as a ‘bent coin’. Thus, our model says that each outcome is statistically independent from previous outcomes and the probability of getting green is  p_g. Using Bayes’ rule, we can infer the most likely value of  p_g for this model, and compute the probability of any  p_g given a sequence of observations. In the interest of space, I won’t solve it here.

We might have a different model, though. We may model the black box as a dice. If the dice lands on 1, the green light goes on, otherwise, the red light goes on. This corresponds to our earlier, more general model, but with  p_g fixed at  p_g = \frac{1}{6}. The first model has a free parameter,  p_g, while the second model does not.

The method of Bayesian model comparison can tell us which model is more likely. Instead of analyzing the situation with a single model, we now consider both models at the same time. I like to use the term ‘metamodel’ for this. In our metamodel, we assume equal prior probabilities for each model. We denote the sequence of flashes we observed as  s. Model 1 is denoted  \mathcal {H}_1, and model 2 is denoted  \mathcal {H}_2 (the symbol  \mathcal{H} stands for ‘hypothesis’, a word which we take as synonymous with ‘model’).

 P(\mathcal{H}_1 | s) = \frac{ P( s | \mathcal{H}_1) P(\mathcal{H}_1) }{ P(s) }
 P(\mathcal{H}_2 | s) = \frac{ P( s | \mathcal{H}_2) P(\mathcal{H}_1) }{ P(s) }

The relative probability of model 2 over model 1 is encoded in the ratio of the posterior probabilities:

 \frac{P(\mathcal{H}_1 | s) }{P(\mathcal{H}_2 | s) } = \frac{P(s| \mathcal{H}_1)}{P(s|\mathcal{H}_2)}

The ratio tells us the relative probability that model 1 is correct. Note that absolute probabilities of model 1 and model 2 can be computed from this using the fact that  P(\mathcal{H}_1|s)+P(\mathcal{H}_2 |s)=1 That’s all on model comparison for now. A more detailed discussion can be found in MacKay’s book.

The case of the blood stains
This problem is taken directly MacKay’s book:

Two people have left traces of their own blood at the scene of a
 crime. A suspect, Oliver, is tested and found to have type ‘O’
 blood. The blood groups of the two traces are found to be of type
 ‘O’ (a common type in the local population, having frequency 60%)
 and of type ‘AB’ (a rare type, with frequency 1%). Do these data
 (type ‘O’ and ‘AB’ blood were found at scene) give evidence in
 favour of the proposition that Oliver was one of the two people
 present at the crime?

At first glance, lawyer may easily convince the jury that the presence of the type ‘O’ blood stain provides evidence that Oliver was present. The lawyer may argue the while the degree of weight it should carry may be small, since type ‘O’ is fairly common, nonetheless the presence of type ‘O’ should count as positive evidence. However, this is not the case!

Denote the proposition ‘the suspect and one unknown person were present’ by  S. The alternative,  \bar{S}, states that “two unknown people from the population were present”.

If we assume that the suspect, Oliver, was present, then the likelihood of the data is simply the likelihood of having a person with blood type ‘AB’:
 P(D|S) = p_{AB} = .01

The likelihood of the other case is the likelihood that two unknown people drawn from the population have blood types ‘AB’ and ‘O”:
 P(D|\bar{S}) = 2 p_{AB}p_{O} = .083

The second case is more likely. The likelihood ratio is
 \frac{.01}{.083} = .83

Thus the data actually provides weak evidence against the supposition that Oliver was present. Why is this?

We can gain some insight by first considering another suspect, Alberto, who has blood type  AB. We denote the hypothesis that Alberto was present by  S', and the hypothesis that he wasn’t present  \bar{S}. In this case, the ratio is:


Clearly, in this case, the evidence does support the hypothesis that Alberto was there. MacKay elaborates: (my emphasis added)

Now let us change the situation slightly; imagine that 99% of people are of blood type O, and the rest are of type AB. Only these two blood types exist in the population. The data at the scene are the same as before. Consider again how these data influence our beliefs about Oliver, a suspect of type O, and Alberto, a suspect of type AB. Intuitively, we still believe that the presence of the rare AB blood provides positive evidence that Alberto was there. But does the fact that type O blood was detected at the scene favour the hypothesis that Oliver was present? If this were the case, that would mean that regardless of who the suspect is, the data make it more probable they were present; everyone in the population would be under greater suspicion, which would be absurd. The data may be compatible with any suspect of either blood type being present, but if they provide evidence for some theories, they must also provide evidence against other theories.

Here is another way of thinking about this: imagine that instead of two people’s blood stains there are ten (independent stains), and that in the entire local population of one hundred, there are ninety type O suspects and ten type AB suspects. Consider a particular type O suspect, Oliver: without any other information, and before the blood test results come in, there is a one in 10 chance that he was at the scene, since we know that 10 out of the 100 suspects were present. We now get the results of blood tests, and find that nine of the ten stains are of type AB, and one of the stains is of type O. Does this make it more likely that Oliver was there? No, there is now only a one in ninety chance that he was there, since we know that only one person present was of type O.

MacKay continues to elaborate this problem by doing a more explicit calculation. In the end the conclusion is:

If there are more type O stains than the average number expected under hypothesis  \bar{S}, then the data give evidence in favour of the presence of Oliver. Conversely, if there are fewer type O stains than the expected number under  \bar{S}, then the data reduce the probability of the hypothesis that he was there.

Note the similarity with the drug test example. The base rate of blood stains must be considered.

Bayesian statistics in court
Ideally, a jury would apply Bayesian reasoning to rank the likelihood of different hypotheses. The chance that a person is a suspect is denoted  S, and the probability is encoded in the ratio  \frac{S}{\bar{S}}. In the words of MacKay:

“In my view, a jury’s task should generally be to multiply together carefully evaluated likelihood ratios from each independent piece of evidence with an equally carefully reasoned prior probabilities.”

The potential for Bayesian methods to improve the criminal justice system is huge. The issue though is that statistics can also be easily manipulated by subtle changes in the inputs. Judges and juries can easily be misled if they have no understanding of statistics. One solution is to train the jury in Bayesian statistics during the course of the case, and this has been used by lawyers to help juries understand complicated blood stain DNA evidence. However, many judges (who usually lack a deep understanding of statistics) are immensely skeptical of whether the jury can properly analyze complex statistical data without being hoodwinked. From their point of view, statistics are too opaque. There is the question of the confidence that can be placed in the jury to properly apply Bayesian methodology, even after training. Should juries be quizzed on their ability to do Bayesian reasoning before being allowed to deliberate? The challenge is to explain complicated statistical methodologies in a way that lay people can understand, and no solution has yet been found that all parties agree upon. For this reason, the use of Bayesian statistics in courts has been banned in the UK. Obviously, this is not at all an optimal situation.

Leave a Comment

Filed under Bayesian inference, machine learning, statistics, Uncategorized

Exclusion zone water

Note: this is rather technical. 

This spring I had the pleasure of speaking briefly with a distinguished engineer, inventor, businessperson, and benefactor of science. He explained how he has recently become interested in the work of Prof. Gerald Pollack, who discovered what he calls the “4th phase of water”. The very term “4th phase of water” immediately raised an alarm bell in my head, since there are actually 19 or so known phases of water. I decided to check out what this “4th phase” was. It turns out this ‘phase’ has so far only been observed at the boundary with an odd material called Nafion, so really, it’s interfacial water with special properties, not a new phase of the liquid itself. My research focus the past three years has been understanding the microscopic details underlying the dielectric properties of water.  I am very interested in the structure and behavior of water around proteins and dissolved ions (and have read numerous papers on the subject) so naturally I am interested in Dr Pollack’s claims. Additionally, Pollack has shown that he can use the exclusion done phenomena to build a device that filters out nanospheres, and he claims his discovery can be used for desalination technology. He has not yet actually presented a functioning desalination apparatus, but he has filed a patent for the technology.
Continue reading


Filed under pseudoscience, research, technical, Uncategorized