Simulation and Bootstrapping

Simulation

Many people think of a coin toss when they think about randomness. While randomness is everywhere in nature, a coin toss has the extra quality of being a completely known process. We do not know what the outcome of the next toss is going to be, but we know the average behavior quite well, viz., the outcome will be either a Head or a Tail, each of which is equally likely. It is this perfect knowledge that leads us to use a coin toss when making a fair decision.

A coin toss is a random mechanism that we can repeat at will to produce random quantities with a known behavior. Another such device is a die. But such practical devices were limited in number and scope before the advent of high-speed computers.

A remarkable fact about computers is that they can be instructed to produce random quantities whose general behavior is completely under our control. We can, for example, make the computer toss a coin with exactly a specified amount of bias for Heads! We can easily turn the computer into a 100-faced die, something that is almost impossible to construct in practice. In fact, computers can be made to perform more exotic (as well as more useful) random computations, which have started to play a dominant role in statistics. In this last tutorial we shall take a look at these.

Putting it in perspective

Man has devised various techniques to deal with chance. These may be classified into roughly three categories:
  1. Wait and see: Here we patiently wait until the end of the random activity. This time-honored method remains the most frequently used one even today. Who can deny the usefulness of observing the sky carefully to learn about its random patterns? However, learning by trial and error alone may prove too expensive. When we want to design a space station to maximize its chance of capturing some rare phenomenon, it would prove a bit too costly to simply rely on trial and error!
  2. Use probability models: Often we have enough information (theoretical and/or based on prior observations) about the mechanism underlying a random phenomenon. For example, if a careful scrutiny of a die does not reveal any difference between its faces, one may assume that the die is fair. Such information can be expressed as a probability distribution:
    The outcome X can take values 1,...,6 each with equal probability (taken to be 1/6 to keep the sum 1).
    Using this mathematical formulation of our knowledge about the random phenomenon we can now analyze the distribution mathematically. For example, consider the following gambling game
    A die is rolled, and the casino owner gives you Re 1 for each point shown. For example, you get Rs 6 if the die shows a 6. How much will you get on average per roll if you play this game 10000 times?
    The brute force approach is to really play it 10000 times, and then divide the total amount by 10000. A smarter way is to use the probability distribution of the die. We know that on average we get 1 about 1/6 of the time, 2 about 1/6 of the time, and so on. So on average we shall get
    (1+2+3+4+5+6)/6 = 3.5
    per roll. Here the math turned out to be simple. But in more realistic models the math is often too tough to handle even with a computer!
  3. Simulate probability models: Here also we start with a model. But instead of dealing with it mathematically, we make the computer perform (or simulate) the random experiment following the model. This is possible because computers can generate random numbers.
    Actually, computers generate only pseudo-random numbers. These are not random but only behave like random numbers. However, we shall disregard the difference here (see the short illustration after this list).
    Now it is just like the "wait and see" strategy, except that we do not have to wait long, since computers can perform many repetitions of a random experiment very fast. Also, no real monetary loss is involved.
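As a small aside, the pseudo-random stream in R can be fixed with set.seed, which is handy if you want the examples in this tutorial to give the same numbers every time (the seed 42 below is an arbitrary choice):

set.seed(42)  #fix the pseudo-random seed
runif(3)      #three pseudo-random numbers
set.seed(42)  #reset the seed...
runif(3)      #...and the same three numbers appear again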
Relationship between MCMC and simulation
This little section may sound pedantic, but is not essential for the discussion that follows.

MCMC is a form of simulation, because MCMC generates random numbers and bases the inference on them. However, there are two important differences between MCMC and the simulation that we shall discuss here.
  1. In MCMC we do not simulate from the target distribution directly. Instead, we simulate a Markov chain that converges to the target distribution. Thus, MCMC performs approximate simulation (a tiny sketch follows this list).
  2. The primary use of MCMC is in Bayesian statistics where we seek to simulate from the posterior distribution of the parameters. Classical simulation on the other hand simulates fresh data.
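As a tiny illustration of the first point, here is a minimal Metropolis sampler (just a sketch, not something we shall need later) whose chain converges to the N(0,1) target; the random-walk proposal is an arbitrary choice.

chain = numeric(5000)
chain[1] = 0  #starting value
for(i in 2:5000) {
   prop = chain[i-1] + rnorm(1)                     #propose a move around the current value
   accProb = min(1, dnorm(prop)/dnorm(chain[i-1]))  #acceptance probability for the N(0,1) target
   chain[i] = if(runif(1) < accProb) prop else chain[i-1]
}
hist(chain[-(1:1000)])  #discard a burn-in; the rest behaves roughly like N(0,1)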

Simulation using R

Let us see the simulation approach in action for the gambling example before we move on to more serious applications.
values = 1:6 #the possible values
sample(values, 10, replace=T)
This last line asks R to sample 10 numbers from the vector values. The ``replace=T'' allows the same number to occur multiple times. Run the last line repeatedly to see how the output changes randomly. Now to solve the problem use
money = sample(values, 10000, replace=T)
avg = mean(money)
avg
This mean is indeed pretty close to the theoretical 3.5. But will that always be the case? After all, it is a random game. So it is a good idea to repeat the above simulation a number of times, say 100 times. The two lines computing money and avg below are as before, but they are now wrapped in some extra lines. This is an example of a for loop.

avgList = c() #an empty vector
for(i in 1:100) {
   money = sample(values, 10000, replace=T)
   avg = mean(money)
   avgList = c(avgList,avg) #add avg to avgList
}

mean(avgList)
var(avgList)

To repeat some commands a fixed number of times (say 200) in R you put the commands inside a for loop like this

for(i in 1:200) {
   #Your commands
   #go here
}

For example, you may try
for(i in 1:5) {
   print("hello")
}
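Incidentally, sample can also draw values with unequal probabilities through its prob argument; this is how one would make the computer toss the biased coin mentioned at the beginning (a small sketch, with 0.7 as an arbitrary bias for Heads):

biasedCoin = sample(c("H","T"), 10000, replace=T, prob=c(0.7,0.3))
table(biasedCoin)/10000  #roughly 70% Heads and 30% Tails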

Simulating from a distribution

The fundamental concept behind the above technique is that R can ``roll a die'' or, rather, ``simulate a die''. Statistically this means generating a random number from among the values 1,...,6 with probability 1/6 each. This is by no means the only distribution R can generate numbers from. We can, for example, generate a random number from the Uniform(0,1) distribution. Such a number can take any value between 0 and 1, each value being equally likely. Let us generate 1000 such numbers.
data = runif(1000, min=0, max=1)
hist(data)
R has many probability distributions built into it, and it has ready-made functions (like runif) to generate data from them. The functions all share a common naming pattern: they start with the letter `r' followed by an abbreviation of the distribution's name. Here are some examples.
data = rnorm(1000, mean=1, sd=2) #Normal with mean 1, sd 2
hist(data)
data = rpois(1000, lambda=1) #Poisson(1)
hist(data)
data = rexp(1000, rate=2) #Exponential with mean 1/2
hist(data)
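Besides the r-functions for generating data, each built-in distribution also comes with companions prefixed d, p and q, for the density, the CDF and its inverse respectively. We shall use dnorm and qnorm later in this tutorial; here is a quick taste:

dnorm(0)      #density of N(0,1) at 0, about 0.399
pnorm(1.96)   #P(Z <= 1.96), about 0.975
qnorm(0.975)  #the inverse of pnorm, about 1.96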

Two realistic examples

An imperfect particle counter
A huge radioactive source emits particles at random time points. A particle detector detects them. However, the detector is imperfect and gets ``jammed'' for 1 sec every time a particle hits it. No further particles can be detected within this period. After the 1 sec is over, the detector starts functioning normally again. We want to know the fraction of particles that the detector misses. Since this fraction may depend on the rate at which the source emits particles, we want to get a plot like the following:

[Figure: the desired plot of the fraction of missed particles against the emission rate]

Random emission of particles from a huge radioactive source is a well-studied process. A popular model is to assume that the time gaps between successive emissions are independent and have Exponential distributions.
gaps = rexp(999,rate=1)
The question now is to determine which of the particles will fail to be detected. A particle is missed if it comes within 1 sec of its predecessor. So the number of missed particles is precisely the number of gaps that are below 1.
miss = sum(gaps < 1)
miss
Now we want to repeat this experiment a number of times, say 50 times.

missList = c()
for(i in 1:50) {
   gaps = rexp(999,rate=1)
   miss = sum(gaps < 1)
   missList = c(missList,miss)
}
mean(missList)
var(missList)

All this was done for rate = 1. For other rates we need to repeat the entire process afresh. So we shall use yet another for loop.

rates = seq(0.1,3,0.1)
mnList = c()
vrList = c()
for(lambda in rates) {
   missList = c()
   for(i in 1:50) {
      gaps = rexp(1000,rate=lambda)
      miss = sum(gaps < 1)
      missList = c(missList,miss)
   }

   mn = mean(missList)
   vr = var(missList)
   mnList = c(mnList,mn)
   vrList = c(vrList, vr)
}

Now we can finally make the plot. (Each run has 1000 gaps, so dividing the mean count by 10 converts it to a percentage.)
plot(rates,mnList/10,ty="l",ylab="% missed")
We shall throw in two error curves (mean ± 2 standard deviations) for good measure.
up = mnList + 2*sqrt(vrList)
lo = mnList - 2*sqrt(vrList)

lines(rates,up/10,col="red")
lines(rates,lo/10,col="blue")

Exercise: (This exercise is difficult.) If the counter were perfect, we could estimate the actual rate from the hitting times of the particles: just take the reciprocal of the average of the gaps. But if we apply the same formula to the imperfect counter, then we may get a wrong estimate of the true rate. We want to make a correction graph that may be used to correct the underestimate. Explain why the following R program achieves this. You will need to look up the online help of cumsum and diff (a small illustration of these two functions follows the exercise).
rates = seq(0.1,3,0.1)
avgUnderEst = c()
for(lambda in rates) {
   underEst = c()
   for(i in 1:50) {
      gaps = rexp(1000,rate=lambda)
      hits = cumsum(gaps)
      obsHits = c(hits[1], hits[gaps>1])
      obsGaps = diff(obsHits)
      underEst = c(underEst,1/mean(obsGaps))
   }
   avgUnderEst = c(avgUnderEst,mean(underEst))
}

plot(avgUnderEst,rates,ty="l")
Can you interpret the plot?
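As a hint for reading the program above, here is what cumsum and diff do to a toy vector of gaps (the numbers are arbitrary):

gaps = c(0.5, 2.0, 0.3, 1.5)  #toy gaps
hits = cumsum(gaps)           #hitting times: 0.5 2.5 2.8 4.3
diff(hits)                    #recovers the later gaps: 2.0 0.3 1.5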

Simulating a galaxy
A galaxy is not a single homogeneous body and so it has different mass densities at different points. When we say that the mass density of a galaxy at a point (x,y,z) is f(x,y,z) we mean that for any region R in that galaxy the chance of detecting a body in R is
    (1/M) ∫_R f(x,y,z) dx dy dz,
where M is the total mass of the galaxy.

There are various models that specify different forms for f. Many of these consider the galaxy to be confined to some 2D plane, so that the density may be written as f(x,y). The typical process for simulating a galaxy has two components:
  1. First we simulate the initial state of the galaxy from some probability model.
  2. Next, we let it evolve deterministically following appropriate laws of motion.
Such simulations are extremely computation-intensive. We shall therefore confine ourselves to a very simple situation. To keep life simple we shall assume that the x- and y-coordinates of the stars are independent normal random variables. We shall work with just 500 stars of equal masses. (It is easy to allow random masses, but in our simple implementation the simulation would then take an awfully long time to run!) Each star initially has a tangential velocity proportional to its distance from the center.
x = rnorm(500,sd=4)
y = rnorm(500,sd=4)
vx = -0.5*y
vy = 0.5*x
Let us plot the initial stage of our galaxy.
oldpar = par(bg="black") #we want black background
plot(x,y,pch=".",col="white") #and white stars
par(oldpar) #restore the default (black on white)
This does not look much like a galaxy. We have to let it evolve over time. This basically means solving an n-body problem numerically. A popular method is the Barnes-Hut algorithm, but we shall apply Newton's laws in R in a brute-force (and slow) way. The commands are in the script file newton.r.
source("newton.r")
Here we shall perform the simulation for 100 time steps, updating the picture of the galaxy every 10 steps. With only 500 points it is difficult to see much pattern, but still one may see spiral tentacles forming.
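The contents of newton.r are not reproduced here; purely as an illustration of what a brute-force update of this kind might look like (this is an assumption on our part, not the actual script), the time-stepping could be coded roughly as follows:

#Hypothetical sketch only -- NOT the actual newton.r
dt = 0.1; G = 1; eps = 0.1  #illustrative time step, gravitational constant and softening
for(t in 1:100) {
   ax = numeric(500); ay = numeric(500)
   for(i in 1:500) {
      dx = x - x[i]; dy = y - y[i]
      r3 = (dx*dx + dy*dy + eps)^1.5  #softened cube of the distance
      ax[i] = G*sum(dx/r3)            #net pull on star i from all the others
      ay[i] = G*sum(dy/r3)
   }
   vx = vx + ax*dt; vy = vy + ay*dt   #update velocities...
   x = x + vx*dt; y = y + vy*dt       #...and positions
   if(t %% 10 == 0) plot(x, y, pch=".")  #redraw the galaxy every 10 steps
}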

The simulation here is admittedly very crude and inefficient. If you are interested, you may see more realistic examples (with references) at this website.

Taking error bars into account

Most statistical techniques are designed assuming the observations to be made with infinite precision, an assumption that is far from true in astronomy. Astronomers often have error bars on their measurements that are hard to incorporate into regular statistical procedures. However, error bars can be taken care of quite easily using simulation as we shall see now.

Consider the principal component analysis from yesterday's lab, where we analyzed the SDSS quasar data. Let us look at it once again.
quas = read.table('SDSS_quasar.dat',head=T)
names(quas)
We shall leave out the first three columns as they do not provide any real information. Among the remaining variables we have some variables named like sig_u. These give the error bars for the corresponding measurements like u_mag. We want to carry out principal component analysis keeping these errors in mind. To keep things simple we shall consider only the first 100 rows.
n=100
quas1 = quas[1:n,-c(1:3)]
attach(quas1)
Consider the first row in the data.
quas1[1,]
Here u_mag=19.921 and sig_u=0.042. We shall interpret these as follows:
If we carry out the same measurement again and again we might get other values. If X denotes a typical such value, then X is a random variable with mean 19.921 and standard deviation 0.042.
For ease of exposition we shall assume that X has a Gaussian distribution with this mean and standard deviation. It is quite easy to use any other distribution (even skewed ones, if the error bars are asymmetric).

Let us pretend that we are measuring the u-band magnitude again for the same quasar. We may simulate the new measurement (u_rnd, say) as
u_rnd = rnorm(1,mean=u_mag[1],sd=sig_u[1])
u_rnd
We want to carry this out for all the n=100 quasars.
u_rnd = rnorm(n,u_mag,sig_u)
u_rnd
We want to do this for all the measurements for which error bars are available, and pack the new simulated measurements into a data frame.
u_rnd = rnorm(n,u_mag,sig_u)
g_rnd = rnorm(n,g_mag,sig_g)
r_rnd = rnorm(n,r_mag,sig_r)
i_rnd = rnorm(n,i_mag,sig_i)
z_rnd = rnorm(n,z_mag,sig_z)
K_rnd = rnorm(n,K_mag,sig_K)
H_rnd = rnorm(n,H_mag,sig_H)
J_rnd = rnorm(n,J_mag,sig_J)
   
dat_rnd = data.frame(u_rnd,g_rnd,r_rnd,i_rnd,z_rnd,
                     K_rnd,H_rnd,J_rnd,
                     z,X.ray,M_i,Radio)
Now let us carry out principal component analysis of this simulated data set and plot the standard deviations for the principal components.
sdv = prcomp(dat_rnd,scale.=TRUE)$sdev #scale.=TRUE standardizes the variables before the PCA
plot(sdv,ty="l",xlab='PC',ylab='std dev')
If we carry out the above procedure (simulate, analyze, plot) again and again, then we shall get an idea of how the standard deviations are affected by measurement errors. The following R script (also stored in simpca.r) does precisely this.

sdv = c()
for(i in 1:20) {
   u_rnd = rnorm(n,u_mag,sig_u)
   g_rnd = rnorm(n,g_mag,sig_g)
   r_rnd = rnorm(n,r_mag,sig_r)
   i_rnd = rnorm(n,i_mag,sig_i)
   z_rnd = rnorm(n,z_mag,sig_z)
   K_rnd = rnorm(n,K_mag,sig_K)
   H_rnd = rnorm(n,H_mag,sig_H)
   J_rnd = rnorm(n,J_mag,sig_J)
   
   dat_rnd = data.frame(u_rnd,g_rnd,r_rnd,i_rnd,
                        K_rnd,H_rnd,J_rnd,z_rnd,
                        z,X.ray,M_i,Radio)
   
   sdv = rbind(sdv,prcomp(dat_rnd,scale.=TRUE)$sdev)
} 

plot(sdv[1,],type="l",xlab='PC',ylab='std dev')
apply(sdv,1,lines)

Notice from the plot that most of the standard deviations are not affected at all by the measurement errors. Roughly speaking, this means that the principal component analysis is doing a good job of discovering the structure of the data even in presence of measurement errors.

Bootstrapping

So far we have been simulating from completely specified distributions. But suppose that someone gives us some data and asks us to simulate from whatever distribution the data came from. Since we do not have the distribution itself we cannot apply the above methods directly.

However, as it happens, there is an approximate way out that is simple to implement in R. Let us first see it using an artificial example.

Suppose someone has some data that was originally generated from the N(0,1) distribution.
origData = rnorm(1000,mean=0,sd=1)
Then we shall simply resample from this data set.
newData = sample(origData,500, replace=T) 
This is called (one form of) bootstrapping.

It is instructive to compare this with the true distribution.
hist(newData,prob=T)
x = seq(-3,3,0.1)
lines(x,dnorm(x))
Now let us apply this on some real data.
hip = read.table("HIP.dat",head=T)
attach(hip)
We shall consider the Vmag variable. We want to generate 10 new values from its distribution (which is unknown to us).
newVmag = sample(Vmag,10,replace=T)
newVmag
Well, that was easily done, but the question is
What in the universe does this newVmag mean?
Does this mean we have generated 10 new stars? How can that be when we are just re-using the old values?

Common sense, please!

First let us beware of the following mistaken notion.
Mistaken notion: Bootstrapping is a way to increase the sample size without any extra real sampling.
If that were true, you could just keep on generating further samples from the existing data and get away with pretending that they are new stars. Well, common sense tells us that this cannot be true.
Great lesson: Never place statistics above common sense.
By repeatedly picking from the already available sample we are not adding anything to the information. In particular, if the original sample presents a biased view of the underlying population, then none of the resamples can cure that distortion.

Why then should we do bootstrapping at all?

The next example explains why this is useful.

Astronomical data sets are often riddled with outliers (values that are far from the rest). To get rid of such outliers one sometimes ignores a few of the extreme values. One such method is the trimmed mean.
x = c(1,2,3,4,5,6,100,7,8,9)
mean(x)
mean(x,trim=0.1) #ignore the most extreme 10% of points at BOTH ends
mean(x[x!=1 & x!=100]) #the same thing done by hand
We shall compute 10%-trimmed mean of Vmag from the Hipparcos data set.
mean(Vmag,trim=0.1)
We want to estimate the standard error of this estimate. For the ordinary mean we had a simple formula, but unfortunately no such simple formula exists for the trimmed mean. So we shall use the bootstrap here, as follows: we shall generate 100 resamples, each of the same size as the original sample.
trmean = c()
for(i in 1:100) {
   resamp = sample(Vmag,length(Vmag),replace=T)
   trmean = c(trmean,mean(resamp,trim=0.1))
}
sd(trmean)
Bootstrapping is such a popular statistical technique that R has a package called boot to perform bootstrapping.
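For completeness, here is roughly how the same computation could be written with boot (a sketch; the hand-rolled loop above is perfectly adequate for our purpose):

library(boot)
trim10 = function(data, indices) mean(data[indices], trim=0.1)  #statistic to bootstrap
b = boot(Vmag, trim10, R=100)  #100 bootstrap resamples
sd(b$t[,1])                    #bootstrap standard error, as before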

Permutation test

Here is a variation of the same theme: resampling, i.e., randomly sampling from the original sample and pretending that it is a new sample.

The situation that we are discussing next involves two samples, say, the Hyades and the non-Hyades stars. Suppose that we want to see whether their median colors are different or not. (By this, as usual, we mean to test whether the medians of the distributions underlying the two samples are the same or not.) Here we shall assume that the shapes of the two distributions are the same. Since the samples are pretty large, it is reasonable to look at the difference of the medians of the two samples.
source("hyad.r")
colH = B.V[HyadFilter]
colnH = B.V[!HyadFilter & !is.na(B.V)]
m = length(colH)
n = length(colnH)
median(colH)-median(colnH)
If the population medians are the same, then we should expect this difference to be near zero, as well. So if this difference is too large, then we have reason to suspect that the underlying distributions have different medians. The question therefore is how large is ``too large"?

Sometimes statistical theory dictates a rule to decide this. But more often there is no theory to help. Then permutation tests provide an easy way out.

Imagine first what happens if the medians are the same. Then the two underlying distributions are actually the same (since their shapes already match by assumption). So the two samples together are basically like a single large sample from this common distribution. If this is the case, calling a Hyades star non-Hyades (or vice versa) would really not make any difference.

So we shall mix up all stars together, and then pick any m of them and call them Hyades, while the remaining n would be labeled nonHyades. This should be as good as the original sample if the two distributions are indeed the same.
pool = c(colH,colnH)
mixed = sample(pool) # sample generates a random permutation
newH = mixed[1:m]
newnH = mixed[(m+1):(m+n)]
median(newH) - median(newnH)
If the two distributions were really the same then this new difference should be close to the difference based on the original data.

Of course, this apparently takes us back to the original question: how close is ``close enough"? But now it is easier to answer since we can repeat this random mixing many times.
pool = c(colH,colnH)
d = c()

for(i in 1:1000) {
   mixed = sample(pool)
   newH = mixed[1:m]
   newnH = mixed[(m+1):(n+m)]
   newDiff = median(newH) - median(newnH)
   d = c(d,newDiff)
}

Now that we have 1000 values showing what the typical difference looks like if the distributions were the same, we can see where our original difference lies relative to these values. First let us make a histogram of the typical values:
hist(d)
Well, the original value seems quite far from the range of typical values, right?
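To see this on the histogram itself, we can mark the original difference with a vertical line:

abline(v = median(colH) - median(colnH), col="red")  #the original difference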

One way to make this idea precise is to compute the p-value, which is defined as the chance of observing a difference even more extreme than the original value. Here ``more extreme" means ``larger in absolute value".
orig = median(colH)-median(colnH)
sum(abs(d)>=abs(orig))/1000
A p-value smaller than 0.05, say, strongly indicates that the medians of the distributions are not the same.

Kernel density estimation

We have already mentioned how important the distribution underlying a sample is, and how the distribution may be specified via its CDF. An (almost) equivalent way of specifying the distribution is via its density function, which is just the derivative of the CDF. The simplest way to get an idea about the density function underlying a data set is to make its histogram. The diagram below shows a density and the typical shape of the histogram that you can expect for a data set with this density. Note the similarity in shape.

[Figure: a density function and the typical shape of a histogram of data with this density]

Here is how we can use histograms to estimate the density of Vmag.
hist(Vmag, prob=T)
Typically densities are smooth functions, while histograms are jagged. A better estimate may be obtained using kernels as we have seen in the theory class.

Let us start with a very simple example to get the procedure right. We shall apply the Gaussian kernel to a sample consisting of just two points!
source("krnl1.r")
We shall take some time to understand the different parts of the plot that pops up. Recall that the kernel density estimator based on data points x_1,...,x_n has the formula
    (1/(n*h)) * [ K((x - x_1)/h) + ... + K((x - x_n)/h) ],
where K is the kernel used and h is the bin width. In our example, the sum has just two terms,
    (1/(2*h)) * K((x - x_1)/h)  and  (1/(2*h)) * K((x - x_2)/h),
that are half the Gaussian densities centred around the two points. These are shown in red. The sum (which is the kernel density estimate) is shown in black.
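The script krnl1.r itself is not reproduced here, but the same picture can be drawn by hand in a few lines (a sketch assuming two data points at -1 and 1 and a bin width of 0.5; these values are arbitrary):

pts = c(-1, 1)   #the two data points
h = 0.5          #the bin width
grid = seq(-4, 4, 0.01)
parts = sapply(pts, function(p) dnorm(grid, mean=p, sd=h)/2)  #the two half-densities
plot(grid, rowSums(parts), type="l")  #their sum: the kernel density estimate (black)
lines(grid, parts[,1], col="red")
lines(grid, parts[,2], col="red")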

Now notice the little pop up window that looks something like this:

[Screenshot: a slider for adjusting the bin width]

Move the slider to change the bin width (which controls how peaked the kernel is). See how a large value of the bin width produces a flat estimate, while a small value produces two peaks (one for each point). You may not readily notice that the black curve gets flatter as the bin width increases, until you observe how the x-axis grows and the y-axis shrinks.

Next we shall move over to a more realistic situation where we shall work with 100 observations that are artificially generated from the standard Gaussian distribution. The aim now will be to compare the estimated density with the actual density (which we know since the data set is artificial).
source("krnl2.r")
Again play with the slider and convince yourself that small bin widths produce jagged estimates, while large ones produce flat estimates. The best value is somewhere in the middle. The theory class has taught you how to find a good bin width. You can also let R find a good value for you, as we shall see next. We shall use the Hipparcos data set once again and the function density.
dd = density(Vmag)
plot(dd$x,dd$y,ty="l")
It might be more interesting to see the estimate overlayed on top of the histogram:
hist(Vmag,prob=T)
lines(dd$x,dd$y,col="red")
What bin width did R use? To see this look inside dd:
dd
If you would like to try a different bin width (say 0.1) then that is OK too:
dd2 = density(Vmag,bw = 0.1)
lines(dd2$x,dd2$y,col="blue")

Exercise: All these examples work with the Gaussian kernel. R allows you to use a number of other kernels as well: Epanechnikov, Rectangular, and Triangular, among others. Try these out to see that the choice of the kernels does not matter as much as the bin width does. Here is how you specify the kernel:
dd3 = density(Vmag,kernel="epan") #abbrevs are OK
lines(dd3$x,dd3$y,col="green")

Q-Q plot

Suppose that you are given a random sample and asked to check if it comes from the standard Gaussian distribution. There are many ways to go about it, and the Q-Q plot is one of the easiest.

To understand this process let us use a small data set:
x = c(2.3, -0.3, 0.4, 2.0, -1.1, 1.0, 2.1, 3.4, -0.4, 0.5, 1.2,
      2.1)
n = length(x)
n
First we shall sort them in increasing order:
sortedX = sort(x)
sortedX
Now pick a number in this sorted list. Say we pick the 3rd number, which is -0.3. What is the fraction of numbers that are less than or equal to this? Clearly, it is 3/n = 3/12 = 0.25.
frac = 3/n
So we see that a fraction 0.25 of the numbers in the data are less than or equal to -0.3.

Now it is time to compare it with what we should see for truly standard Gaussian data. We shall find the number M such that a standard Gaussian random variable will be ≤ M with chance 0.25. Then ideally M should be close to our observed -0.3.
M = qnorm(frac)
Well, due to a little technicality we subtract 1/(2n) from 0.25. This hardly matters when n is large.
M = qnorm(frac -  1/(2*n) )
M
This M is far from -0.3, and so we become suspicious that our sample is not from the standard Gaussian distribution.

So far we have worked with just the 3rd number in sortedX, which we picked arbitrarily. We now repeat the process for all the points.
frac = (1:n)/n
M = qnorm(frac - 1/(2*n))
plot(M,sortedX)
This is called the Q-Q plot of the sample against the standard Gaussian distribution. If the points are close to the y=x line then we can be hopeful that the sample is indeed from a standard Gaussian distribution. To facilitate the comparison let us add the y=x line to the plot.
abline(a=0,b=1)
Well, our data is far from standard Gaussian!

Comparison with the Gaussian distribution is so very common that R provides a function qqnorm for it.
qqnorm(x)
R does not provide any readymade function for making a Q-Q plot against other distributions. But one can always follow the method shown here to make a Q-Q plot against any distribution for which R can compute the CDF (or, rather, its inverse using a q-function).
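For instance, to check a sample against the Exponential(1) distribution one can repeat the recipe above with qexp in place of qnorm (a small sketch on artificial data, so that we know the answer):

y = rexp(100, rate=1)  #artificial sample from Exponential(1)
n = length(y)
M = qexp((1:n)/n - 1/(2*n), rate=1)  #theoretical quantiles
plot(M, sort(y))
abline(a=0, b=1)  #the points should hug this line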