# Tag Archives: machine learning

## Testing out decision trees, AdaBoosted trees, and random forest

Recently I experimented with decision trees for classification, to get a better idea of how they work. First I created some 2-dimensional training data with two categories, using scikit-learn:
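The setup can be sketched roughly as follows; the dataset parameters and classifier settings here are illustrative assumptions, not necessarily the originals:

```python
# Sketch: 2-D, two-class training data from scikit-learn, then the three
# classifiers discussed in the post fit on it. All hyperparameters are
# illustrative choices.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=0)

classifiers = {
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "AdaBoosted trees": AdaBoostClassifier(n_estimators=50, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, clf.score(X, y))   # training accuracy, for a rough comparison
```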

Filed under machine learning, Uncategorized

## Maximum entropy priors

How do we assign priors?

If we don’t have any prior knowledge, then the obvious solution is to use the principle of indifference. This principle says that if we have no reason for suspecting one outcome over any other, then all outcomes must be considered equally likely. Jakob Bernoulli called this the “principle of insufficient reason”, a play on the “principle of sufficient reason”, which asserts that everything must have a reason or cause. This may be the case, but if we are ignorant of the reasons, we cannot say that one outcome will be more likely than any other.

Filed under Bayesian inference, machine learning, python, Uncategorized

## More counterintuitive Bayesian reasoning problems

Remember how in my last post I said Bayesian reasoning is counter-intuitive? It’s simultaneously maddening and fascinating because clearly, given we accept with certainty the assumptions that go into model/hypothesis selection and the prior, the application of Bayes’ theorem gives the correct probability for each model/hypothesis in light of the evidence presented.

Last time I gave the canonical example of a test for a disease. Humans tend not to take into account that if the base rate (the overall frequency of having the disease) is very low, then a positive result on a test may not be very meaningful. If the probability of the test giving a false positive is 1%, and the base rate is also 1%, then the chance that you have the disease given a positive result is only 50%. Once you understand this, the non-intuitive nature goes away.
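The arithmetic behind that 50% figure is a one-liner; here it is spelled out, assuming a perfect-sensitivity test (the post doesn't state the sensitivity, so that is an assumption):

```python
# Bayes' theorem for the disease-test example: base rate 1%,
# false-positive rate 1%, sensitivity assumed to be 100%.
base_rate = 0.01            # P(disease)
p_pos_given_disease = 1.0   # P(+ | disease), assumed perfect sensitivity
p_pos_given_healthy = 0.01  # P(+ | healthy), the false-positive rate

p_pos = (p_pos_given_disease * base_rate
         + p_pos_given_healthy * (1 - base_rate))
p_disease_given_pos = p_pos_given_disease * base_rate / p_pos
print(round(p_disease_given_pos, 3))  # ≈ 0.503, i.e. about 50%
```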

Here I will give two more examples of highly non-intuitive Bayesian problems.

The Monty Hall problem
The famous Monty Hall problem can be solved with Bayes’ rule. A statement of the problem is:

There are three doors, labelled 1, 2, 3. A single prize has been hidden behind one of them. You get to select one door. Initially your chosen door will not be opened. Instead, the gameshow host will open one of the other two doors, and he will do so in such a way as to not reveal the prize. After this, you will be given a fresh choice between the two remaining doors – you can stick with your first choice, or switch to the other closed door.

Imagine you picked door 1. The gameshow host opens door 3, revealing no prize. Should you (a.) stick with door 1, (b.) switch to door 2, or (c.) does it make no difference?

If you have not heard of this problem before, think it over a while.

The Bayes’ theorem solution is as follows. Denote the possible outcomes as:
$H_1$ = the prize is behind door 1
$H_2$ = the prize is behind door 2
$H_3$ = the prize is behind door 3

We know $P(H_1) = P(H_2) = P(H_3) = 1/3$. The question is: what is $P(H_2 | D)$? We use the symbol $D$ to denote the data/evidence we have, which is the fact that the gameshow host opened door 3. We can solve this using Bayes’ theorem:

$$P(H_2 | D) = \frac{P(D | H_2) P(H_2)}{P(D)} = \frac{1 \cdot \tfrac{1}{3}}{\tfrac{1}{2}} = \frac{2}{3}$$

Therefore it is better to switch to door 2. By switching to door 2, we double our chance of winning from $1/3$ to $2/3$. The tricky part of the calculation is the normalizing factor $P(D) = P(D|H_1)P(H_1) + P(D|H_2)P(H_2) = \tfrac{1}{2}\cdot\tfrac{1}{3} + 1\cdot\tfrac{1}{3} = \tfrac{1}{2}$, where we must consider the probability that the gameshow host opens door 3 when the prize is behind door 1 ($=1/2$) and when it is behind door 2 ($=1$).
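The whole calculation fits in a few lines of Python, which makes the role of each likelihood explicit:

```python
# Bayes' theorem for the Monty Hall problem.
# D = "host opened door 3"; H_i = "prize is behind door i".
priors = {1: 1/3, 2: 1/3, 3: 1/3}
# P(D | H_i): if the prize is behind door 1 the host picks door 2 or 3
# at random (1/2); if behind door 2 he must open door 3 (1); if behind
# door 3 he cannot open it (0).
likelihood = {1: 1/2, 2: 1.0, 3: 0.0}

p_D = sum(likelihood[h] * priors[h] for h in priors)            # = 1/2
posterior = {h: likelihood[h] * priors[h] / p_D for h in priors}
print(posterior)   # posterior for door 2 is 2/3: switch!
```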

Note the following mind-blowing shortcut to solving the problem:

Since door 3 was opened, we know $P(H_3 | D) = 0$. The gameshow host did nothing to interfere with door 1. Thus $P(H_1 | D) = 1/3$, as it was in the beginning. Now, we know $P(H_1 | D) + P(H_2 | D) + P(H_3 | D) = 1$, so $P(H_2 | D) = 2/3$!

Bayesian model comparison
Bayes’ theorem allows us to compare the likelihoods of different models being true. To take a concrete example, let’s assume we have a black box with a button attached. Whenever we hit the button, a light on top of the box blinks either green or red. We hit the button a number of times, obtaining a sequence:

GRRGGRGRRGRG…

Let’s say we model the black box as a ‘bent coin’. Thus, our model says that each outcome is statistically independent of previous outcomes and the probability of getting green is $p$. Using Bayes’ rule, we can infer the most likely value of $p$ for this model, and compute the probability of any $p$ given a sequence of observations. In the interest of space, I won’t solve it here.
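For the curious, here is one way the inference could go, under the assumption of a uniform prior on $p$ (an assumption the post doesn't state): with a Beta(1, 1) prior, the posterior after $g$ greens and $r$ reds is Beta(1 + g, 1 + r).

```python
# Conjugate-prior sketch of inferring p for the 'bent coin' model,
# assuming a uniform (Beta(1, 1)) prior over p.
sequence = "GRRGGRGRRGRG"
g = sequence.count("G")
r = sequence.count("R")

alpha, beta = 1 + g, 1 + r                         # posterior Beta(alpha, beta)
posterior_mean = alpha / (alpha + beta)
posterior_mode = (alpha - 1) / (alpha + beta - 2)  # equals the MLE g/(g+r)
print(posterior_mean, posterior_mode)
```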

We might have a different model, though. We may model the black box as a die. If the die lands on 1, the green light goes on; otherwise, the red light goes on. This corresponds to our earlier, more general model, but with $p$ fixed at $1/6$. The first model has a free parameter, $p$, while the second model does not.

The method of Bayesian model comparison can tell us which model is more likely. Instead of analyzing the situation with a single model, we now consider both models at the same time. I like to use the term ‘metamodel’ for this. In our metamodel, we assume equal prior probabilities for each model. We denote the sequence of flashes we observed as $D$. Model 1 is denoted $\mathcal{H}_1$, and model 2 is denoted $\mathcal{H}_2$ (the symbol $\mathcal{H}$ stands for ‘hypothesis’, a word which we take as synonymous with ‘model’).

The relative probability of model 2 over model 1 is encoded in the ratio of the posterior probabilities:

$$\frac{P(\mathcal{H}_2 | D)}{P(\mathcal{H}_1 | D)} = \frac{P(D | \mathcal{H}_2)\, P(\mathcal{H}_2)}{P(D | \mathcal{H}_1)\, P(\mathcal{H}_1)} = \frac{P(D | \mathcal{H}_2)}{P(D | \mathcal{H}_1)}$$

(the priors cancel, since we took them to be equal). This ratio tells us the relative probability that the two models are correct. The absolute probabilities of model 1 and model 2 can be computed from it using the fact that $P(\mathcal{H}_1 | D) + P(\mathcal{H}_2 | D) = 1$. That’s all on model comparison for now. A more detailed discussion can be found in MacKay’s book.
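A numerical sketch of the comparison, under the assumptions of a uniform prior on $p$ for model 1 and equal model priors: the evidence for model 1 integrates out $p$, giving $P(D|\mathcal{H}_1) = \int_0^1 p^g (1-p)^r \, dp = g!\,r!/(g+r+1)!$, while model 2 has no free parameter.

```python
# Bayes factor between the 'die' model (p fixed at 1/6) and the
# 'bent coin' model (uniform prior over p), for the observed sequence.
from math import factorial

sequence = "GRRGGRGRRGRG"
g, r = sequence.count("G"), sequence.count("R")

# Evidence for model 1: integral of p^g (1-p)^r over p in [0, 1].
evidence_h1 = factorial(g) * factorial(r) / factorial(g + r + 1)
# Evidence for model 2: likelihood with p fixed at 1/6.
evidence_h2 = (1/6) ** g * (5/6) ** r

bayes_factor = evidence_h2 / evidence_h1
print(bayes_factor)   # < 1 here: half greens are hard to explain with p = 1/6
```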

The case of the blood stains
This problem is taken directly from MacKay’s book:

Two people have left traces of their own blood at the scene of a
crime. A suspect, Oliver, is tested and found to have type ‘O’
blood. The blood groups of the two traces are found to be of type
‘O’ (a common type in the local population, having frequency 60%)
and of type ‘AB’ (a rare type, with frequency 1%). Do these data
(type ‘O’ and ‘AB’ blood were found at scene) give evidence in
favour of the proposition that Oliver was one of the two people
present at the crime?

At first glance, a lawyer may easily convince the jury that the presence of the type ‘O’ blood stain provides evidence that Oliver was present. The lawyer may argue that while the weight it carries may be small, since type ‘O’ is fairly common, the presence of type ‘O’ blood should nonetheless count as positive evidence. However, this is not the case!

Denote the proposition ‘the suspect and one unknown person were present’ by $S$. The alternative, $\bar{S}$, states that “two unknown people from the population were present”.

If we assume that the suspect, Oliver, was present, then the likelihood of the data is simply the likelihood of having a person with blood type ‘AB’:

$$P(D | S) = p_{AB} = 0.01$$

The likelihood of the other case is the likelihood that two unknown people drawn from the population have blood types ‘AB’ and ‘O’:

$$P(D | \bar{S}) = 2\, p_{O}\, p_{AB} = 2 \times 0.6 \times 0.01 = 0.012$$

The second case is more likely. The likelihood ratio is

$$\frac{P(D | S)}{P(D | \bar{S})} = \frac{0.01}{0.012} \approx 0.83$$

Thus the data actually provides weak evidence against the supposition that Oliver was present. Why is this?
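The numbers check out in two lines, using the frequencies quoted above:

```python
# Likelihood ratio for the Oliver blood-stain problem.
p_O, p_AB = 0.6, 0.01

# If Oliver (type O) was present, the unknown person accounts for the AB stain:
likelihood_S = p_AB
# If two unknown people were present, either one could have left the O stain
# (hence the factor of 2):
likelihood_Sbar = 2 * p_O * p_AB

ratio = likelihood_S / likelihood_Sbar
print(round(ratio, 3))  # 0.833 — weak evidence *against* Oliver's presence
```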

We can gain some insight by first considering another suspect, Alberto, who has blood type ‘AB’. We denote the hypothesis that Alberto was present by $S'$, and the hypothesis that he wasn’t present by $\bar{S'}$. In this case, the ratio is:

$$\frac{P(D | S')}{P(D | \bar{S'})} = \frac{p_{O}}{2\, p_{O}\, p_{AB}} = \frac{1}{2 \times 0.01} = 50$$

Clearly, in this case, the evidence does support the hypothesis that Alberto was there. MacKay elaborates (my emphasis added):

Now let us change the situation slightly; imagine that 99% of people are of blood type O, and the rest are of type AB. Only these two blood types exist in the population. The data at the scene are the same as before. Consider again how these data influence our beliefs about Oliver, a suspect of type O, and Alberto, a suspect of type AB. Intuitively, we still believe that the presence of the rare AB blood provides positive evidence that Alberto was there. But does the fact that type O blood was detected at the scene favour the hypothesis that Oliver was present? If this were the case, that would mean that regardless of who the suspect is, the data make it more probable they were present; everyone in the population would be under greater suspicion, which would be absurd. The data may be compatible with any suspect of either blood type being present, but if they provide evidence for some theories, they must also provide evidence against other theories.

Here is another way of thinking about this: imagine that instead of two people’s blood stains there are ten (independent stains), and that in the entire local population of one hundred, there are ninety type O suspects and ten type AB suspects. Consider a particular type O suspect, Oliver: without any other information, and before the blood test results come in, there is a one in 10 chance that he was at the scene, since we know that 10 out of the 100 suspects were present. We now get the results of blood tests, and find that nine of the ten stains are of type AB, and one of the stains is of type O. Does this make it more likely that Oliver was there? No, there is now only a one in ninety chance that he was there, since we know that only one person present was of type O.

MacKay continues to elaborate this problem by doing a more explicit calculation. In the end the conclusion is:

If there are more type O stains than the average number expected under hypothesis $\bar{S}$, then the data give evidence in favour of the presence of Oliver. Conversely, if there are fewer type O stains than the expected number under $\bar{S}$, then the data reduce the probability of the hypothesis that he was there.

Note the similarity with the disease test example. The base rate of the blood types must be considered.

Bayesian statistics in court
Ideally, a jury would apply Bayesian reasoning to rank the likelihood of different hypotheses. Denoting the hypothesis that a suspect is guilty by $H$, the weight of the evidence $D$ is encoded in the ratio $P(H|D)/P(\bar{H}|D)$. In the words of MacKay:

“In my view, a jury’s task should generally be to multiply together carefully evaluated likelihood ratios from each independent piece of evidence with equally carefully reasoned prior probabilities.”

The potential for Bayesian methods to improve the criminal justice system is huge. The issue though is that statistics can also be easily manipulated by subtle changes in the inputs. Judges and juries can easily be misled if they have no understanding of statistics. One solution is to train the jury in Bayesian statistics during the course of the case, and this has been used by lawyers to help juries understand complicated blood stain DNA evidence. However, many judges (who usually lack a deep understanding of statistics) are immensely skeptical of whether the jury can properly analyze complex statistical data without being hoodwinked. From their point of view, statistics are too opaque. There is the question of the confidence that can be placed in the jury to properly apply Bayesian methodology, even after training. Should juries be quizzed on their ability to do Bayesian reasoning before being allowed to deliberate? The challenge is to explain complicated statistical methodologies in a way that lay people can understand, and no solution has yet been found that all parties agree upon. For this reason, the use of Bayesian statistics in courts has been banned in the UK. Obviously, this is not at all an optimal situation.

## A physicist’s visualization of Bayes’ theorem

Have you noticed that everyone is talking about Bayes’ theorem nowadays?

Bayes’ theorem itself is not very complicated. The human mind, however, is extremely bad at gaining an intuitive grasp of reasoning based on Bayes’ theorem. The counter-intuitive nature of Bayesian reasoning, combined with the jargon and intellectual baggage that usually accompany descriptions of Bayes’ theorem, can make it difficult to wrap one’s mind around. I am a very visual thinker; therefore, I quickly came up with a visualization of the theorem. A little Googling shows that there are many different ways of visualizing Bayes’ theorem. A few months ago I came across a visualization of Bayes’ theorem which I found somewhat perplexing. Even though mathematical truths are universal, they are internalized differently by every individual. I would love to hear whether others find my visualization approach useful. It is a very physicist-oriented visualization.

Filed under Mathematics, technical

## Building a Kernel for 3D Shape Recognition Using Neural Networks

Note: this writeup describes work I did with Dr. Garret Kenyon during an SULI internship at Los Alamos National Laboratory in 2010

There are many divergent approaches to computer vision. As yet, no generally accepted approach exists. Figure 1 shows the diverse fields that overlap with computer vision research. Because of the many difficulties of computer vision and the absence of a general approach, most research has gone into systems for specialized applications. Much of this research has been successful: for instance, we now have algorithms for fingerprint recognition, facial recognition and the classification of simple geometric objects. But we do not have a system which can look at everyday scenes and find objects.

The first approaches to computer vision were highly geometrical in nature. For instance, an algorithm might begin by isolating the edges of the image using various filters, convolutions and spatial derivatives. Next, the computer builds a map of these edges and attempts to isolate the boundaries of various objects. In some cases the computer may try to classify vertices (where edges meet) as convex or concave. Then the computer searches a database of pre-programmed object information to find the nearest match. Accomplishing this search may involve mathematical techniques such as tree-search algorithms or gradient descent. Such an algorithm might be able to successfully detect the orientation of simple geometric objects such as cubes or pyramids based on the way the vertices are arranged. The distance and orientation of objects could be calculated using a combination of stereo vision and/or projective geometry. However, such a system would probably fail with curved objects.
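The first step of that pipeline can be sketched concretely. This toy example applies Sobel filters (one standard choice of edge-detecting convolution; the original systems may have used others) to a tiny grayscale image and thresholds the gradient magnitude:

```python
# Toy edge detection: convolve with Sobel kernels to approximate the
# spatial derivatives, then threshold the gradient magnitude.
import numpy as np

def sobel_edges(img, threshold=1.0):
    """Return a boolean edge map from the gradient magnitude."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
    ky = kx.T
    h, w = img.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            patch = img[i:i + 3, j:j + 3]
            gx[i, j] = np.sum(kx * patch)   # horizontal derivative
            gy[i, j] = np.sum(ky * patch)   # vertical derivative
    return np.hypot(gx, gy) > threshold

# A 6x6 image, dark on the left half, bright on the right:
img = np.zeros((6, 6))
img[:, 3:] = 1.0
print(sobel_edges(img).astype(int))   # edges flagged along the boundary
```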

While these “geometric” approaches have proven to be fruitful, they have also proven to be quite limited. For this reason, many people are turning to biologically inspired approaches as a solution to the general problem of computer vision.  Originally, computer models of the visual cortex were developed to test the current theories about its function. When they were used for object recognition they were found to outperform all the other approaches in certain cases.

Figure 1 – Some of the disciplines involved in computer vision research. (PD, Wikimedia)

Biological vision
A “biologically-inspired” approach is essentially a reverse engineering of the primate vision system. For this reason it is worth describing briefly what is known about the primate/human vision system. Vision involves much more than “taking a picture”. Focusing and detection of light is just the first stage in a complex system which allows humans to construct a mental picture of the environment. Very much like the pixels on a digital camera, the human eye contains millions of rods and cones which detect light. However, even on the retina itself, pixel data starts to be encoded for processing. Rods and cones are wired together with amacrine, bipolar, horizontal and ganglion cells to create receptive fields that, loosely speaking, are sensitive to small points of light and edges.

These signals then travel via the Lateral Geniculate Body (LGB) of the thalamus to the visual cortex, located in the back of the brain. There, signals pass through several areas, known as V1, V2, V3, V4, and V5 (MT). Each of these areas is a complex neural network containing tens of millions of neurons; altogether there are roughly 150 million neurons in the human visual cortex. In these neural networks there are numerous feedforward and feedback connections between the areas and lateral connections within each area. At each stage, more sophisticated and abstract processing takes place. For instance, neurons in V1 called “complex cells” are able to detect edge orientation, lines and direction of motion. This abstracted information is sent to higher areas for more processing. Loosely speaking, cells in V3 and V5 help distinguish local and global motion, while cells in V4 appear to be selective for spatial frequency, orientation and color. Other cells coordinate stereo vision. In MT, 2D boundaries of objects might be detected. Figuring out each of these processing steps and mapping the web of connections between the areas is still a nascent field of research.

After processing in the visual cortex, visual data moves into the Inferior Temporal Gyrus (IT), the temporal lobe, prefrontal cortex and the rest of the brain. Although our understanding of IT is quite poor, it is believed that this region is responsible for mapping a class (such as “car”, “tree”, “pencil”) to parts of the visual data. It appears that some class neurons are very general (“human face”) while other class neurons are more specific (“mother’s face”). There is probably extensive feedback between IT and the lower areas to check different object hypotheses. Such feedback systems would also be essential for learning. Eye movements are directed by the brain to get a better idea of what objects are. Some of the larger eye movements are made consciously, but the vast majority are actually unconscious. These eye movements are a very useful area to study but are not relevant to this research, since we will be processing still images with a computer.

Figure 2: An example of stimuli used by Field, Hayes, and Hess 1993. The object of this test is to detect the contour.

A paper by Yamane, Carlson et al. entitled “A neural code for three-dimensional object shape in macaque inferotemporal cortex” hypothesizes that certain neurons or collections of neurons in IT are responsible for detecting 3D surface patches. They confirmed this by presenting macaque monkeys with a variety of shape stimuli and finding the stimuli that gave the best response for each neuron. This finding is significant to this research because it implies that biological brains analyze surface patches in addition to edges.

“Class descriptions” in the higher brain must be very abstract to facilitate the many different ways an object may appear. For instance, the same object can appear to be large or small and can be viewed from many different orientations. This ability may have to be learned anew for each object or the brain may be able to visualize how an object will appear from different angles on its own. Most likely, there is a mix of both techniques. Object recognition is further complicated by variations in lighting, coloration and other objects (occlusions) that may be in the way. Currently, computer vision systems are very likely to fail when any of these variations are present. Biologically inspired computer vision provides an avenue to overcome such limitations.

Biologically inspired computer vision
The ultimate goal of this project is to build a neural network which can detect various shapes. The first part of the project was to develop a test called the “3D Amoeba task”. This test can be used to compare the computer’s performance with humans. In the test, an image is flashed to the viewer for a very short amount of time. The objective is to say whether there is an “amoeba” present, or whether there is just “debris”. An “amoeba” is a random, lumpy, closed 3D shape. “Debris” are pieces that may be lumpy but are not closed – they are flakes. Debris are scattered uniformly throughout space and may occlude the amoeba. In this project a program was written to create 3D amoebas using random combinations of spherical harmonics, which are basis functions in 3D space.
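The idea of building a lumpy closed surface from spherical harmonics can be sketched as follows. This is not the original program; it hand-codes a few low-order real harmonics (with normalization constants absorbed into the random coefficients) to keep the example dependency-free:

```python
# Sketch: a random 'amoeba' as a base sphere plus a random combination
# of a few low-order real spherical harmonics.
import numpy as np

rng = np.random.default_rng(0)

def amoeba_radius(theta, phi, coeffs):
    """Radius of a lumpy closed surface at polar angle theta, azimuth phi."""
    y = [
        np.cos(theta),                          # ~ Y_1^0
        np.sin(theta) * np.cos(phi),            # ~ Y_1^1 (real part)
        3 * np.cos(theta) ** 2 - 1,             # ~ Y_2^0
        np.sin(theta) ** 2 * np.cos(2 * phi),   # ~ Y_2^2 (real part)
    ]
    return 1.0 + sum(c * yi for c, yi in zip(coeffs, y))

coeffs = 0.2 * rng.standard_normal(4)     # small coefficients: gentle lumps
theta = np.linspace(0, np.pi, 20)
phi = np.linspace(0, 2 * np.pi, 40)
T, P = np.meshgrid(theta, phi)
R = amoeba_radius(T, P, coeffs)           # radius sampled over the sphere
print(R.shape, float(R.min()), float(R.max()))
```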

This project builds off a facet of the Petavision project, which simulated the visual cortex using the Roadrunner supercomputer. Initial experiments simulated 100,000 neurons, and this number was later increased to one million. It was able to complete the “2D Amoeba task”, which was to isolate a wiggly 2D figure amongst debris.
A similar task can be found in a paper by Field, et al. published in 1993 (see figure 2).

Figure 3: Four examples of amoebas and debris.

Next, we started to make the kernel, which will store information about surface patches. An amoeba is created and analyzed. There are eight pieces of data to store: x position, y position, Gaussian curvature, two normal vector components, spatial frequency and orientation. The normal vector components are encoded as two angles, theta and phi, measured with respect to the direction of sight. This is simpler than storing them as (x,y,z) components, and more intuitive as well.
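The angular encoding amounts to a standard Cartesian-to-spherical conversion. A small sketch, assuming the line of sight is the z axis (the original convention isn't specified):

```python
# Convert a unit surface normal (nx, ny, nz) into the two angles
# theta and phi, with the line of sight taken as the z axis.
import math

def normal_to_angles(nx, ny, nz):
    """Return (theta, phi): polar angle from the view axis, and azimuth."""
    theta = math.acos(nz)      # angle between the normal and the line of sight
    phi = math.atan2(ny, nx)   # azimuth in the image plane
    return theta, phi

# A normal pointing straight back at the viewer:
print(normal_to_angles(0.0, 0.0, 1.0))   # (0.0, 0.0)
```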

Figure 4 – Amoeba training parameters

The program is run many times for many different amoeba surfaces until a large body of data is collected. This data can then be mapped onto an eight-dimensional array of “neurons”. Each neuron has a Gaussian response curve for each parameter. So, for each patch of an amoeba, the array of neurons is “lit up” in a certain way. These arrays can be used to train a neural network. Essentially, the neural network learns how to associate given stimuli with curvature and normal vector orientation, which it can’t detect directly. Ultimately it will be able to predict the general characteristics of nearby surface patches, assuming the object is closed and smooth. In this way it will be able to detect amoebas, since they are always closed and smooth, while debris have discontinuities along their edges.
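The “lit up” pattern along a single parameter axis can be sketched like this; the grid size and tuning width are illustrative assumptions:

```python
# Gaussian tuning curves: each 'neuron' responds most strongly to its
# preferred parameter value, so one stimulus activates a bump of
# nearby neurons in the array.
import numpy as np

def population_response(stimulus, preferred_values, sigma=0.1):
    """Gaussian response of an array of neurons to one stimulus value."""
    return np.exp(-((stimulus - preferred_values) ** 2) / (2 * sigma ** 2))

preferred = np.linspace(0.0, 1.0, 11)   # 11 neurons tiling one parameter axis
response = population_response(0.32, preferred)

strongest = preferred[np.argmax(response)]
print(round(float(strongest), 1))   # 0.3 — the neuron nearest the stimulus
```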

In conclusion, this project resulted in a program to create a “kernel” of data for training a neural network. This is the first step in the larger project of creating a neural network to detect amoebas. Eventually it is hoped a similar methodology could be used to detect and distinguish arbitrary smooth closed shapes.