Simulation and Bootstrapping
Simulation
Many people think of a coin toss when they think about
randomness. While
everything in nature is random, a coin toss has the extra quality
of being a completely known process. We do not know what the
outcome of the next coin is going to be, but we know the average
behavior quite well, viz., the outcome will be either a
Head or a Tail, each of which is equally likely. It is this
perfect knowledge that leads us to use a coin toss when making a
fair decision.
A coin toss is a random mechanism that we can repeat at our will
to produce random quantities with a known behavior. Another such
a device is a die. But such practical devices used to be limited in
number and scope before the advent of high speed computers.
A remarkable fact about computers is that they can be instructed
to produce random quantities whose general behavior is
completely under control. We can, for example make the computer
toss a coin
with exactly a specified amount of bias for Head! We can turn the
computer easily into a 100faced die, something that is almost to
construct in practice. In fact, computers can be made to perform
more exotic (as well more useful) random computations that have
started to play a dominant role in statistics.
In this last tutorial we shall take a look at these.
Putting it in perspective
Man has devised various techniques to deal with chance. These may
be classified into roughly three categories:

Wait and see: Here we patiently wait until the end of the
activity. This timehonored method
remains the most frequently used one even today.
Who can deny the usefulness of observing the sky carefully to learn about the random patterns?
However, learning by just trialanderror may
prove too expensive. When we want to design a space station to
maximize its chance of capturing some rare phenomenon, it would
prove it bit too costly to simply rely on trial and error!

Use probability models: Often we have enough information
(theoretical and/or based on prior observations)
about the mechanism underlying a random phenomenon.
For example, if a careful scrutiny of a die does not
reveal any difference between the faces of a die, one may assume
that the die is fair. Such information can be expressed as a
probability distribution:
The outcome X can take values 1,...,6 each with equal
probability (taken to be 1/6 to keep the sum 1).
Using this mathematical formulation of our
knowledge about the random phenomenon we can now analyze the
distribution mathematically. For example, consider the following
gambling game
A die is rolled, and the casino owner gives you Re 1 for each
point shown. For example, you get Rs 6 if the die shows a 6. How much
will you get on average per roll if you play this game 10000
times?
The brute force approach is to really play it 10000 times, and
then divide the total amount by 10000. A smarter way is to use
the probability distribution of the die. We know that on
average we get 1 about 1/6 of the times, 2 about 1/6 of the times
and so on. So on an average we shall get
(1+2+3+4+5+6)/6 = 3.5
per roll. Here the math turned out to be simple. But in more
realistic models the math is often too tough to handle even
with a computer!

Simulate probability models: Here also we start with a
model. But instead of dealing with it mathematically, we make the
computer to perform (or simulate) the random experiment following
the model. This is possible because computers can generate random
numbers
Actually computers generate only pseudorandom
numbers. These are not random but only behave like random
numbers. However, we shall disregard the
difference here. 
Now it is just like the "wait and see" strategy, except that we
do not have to wait long, since computers can perform many
repetitions of a random experiment very fast. Also, no real
monetary loss is involved.
Relationship between MCMC and simulation
This little section may sound pedantic, but is not essential
for the discussion that follows.
MCMC is a form of simulation, because MCMC generates random
numbers and bases the inference on them. However, there are two
important differences between MCMC and the simulation that we
shall discuss here.
 In MCMC we do not simulate from the target distribution
directly. Instead, we simulate a Markov Chain that
converges to the target distribution. Thus, MCMC performs
approximate simulation.
 The primary use of MCMC is in Bayesian statistics where
we seek to simulate from the posterior distribution of the
parameters. Classical simulation on the other hand simulates
fresh data.
Simulation using R
Let us see the simulation approach in action for the gambling example
before we move on to more serious applications.
values = 1:6 #the possible values
sample(values, 10, replace=T)
This last line asks R to sample
10 numbers from the
vector values
. The ``replace=T
'' allows
the same number to occur multiple times. Run the last line
repeatedly to see that how the output changes randomly. Now to
solve the problem use
money = sample(values, 10000, replace=T)
avg = mean(money)
avg
This mean is indeed pretty close to the theoretical 3.5. But will
it always be the case? After all it is random game. So it is a
good idea to repeat the above simulation a number of times, say,
100 times. The blue lines below are as before. But these are now
enveloped in some extra lines (shown in bold). This is an example
of a for loop.
avgList = c() #an empty vector
for(i in 1:100) {
money = sample(values, 10000, replace=T)
avg = mean(money)
avgList = c(avgList,avg) #add avg to avgList
}
mean(avgList)
var(avgList)

  

To repeat some commands for a fixed number of times (say 200) in R
you put the commands inside a for loop like this
for(i in 1:200) {
#Your commands
#go here
}
For example, you may try
for(i in 1:5) {
print("hello")
}
 
  

Simulating from a distribution
The fundamental concept behind the above technique is that R can
``roll a die'' or, rather, ``simulate a die''. Statistically
this means generating a random number from among the
values 1,...,6 with probabilities 1/6 each. This is by no
means the only distribution R can generate numbers from. We
can for example, generate a random number from the
Uniform(0,1) distribution. Such a number can take any value
between 0 and 1, each value being equally likely.
Let us generate 100 such numbers.
data = runif(1000, min=0, max=1)
hist(data)
R has many probability distributions built into it, and it has
readymade functions (like runif
) to generate data from
them. The functions all share a common naming pattern. They start
with the letter `r' followed by an acronym for the distribution.
Here are some examples.
data = rnorm(1000, mean=1, sd=2) #Normal(1,2)
hist(data)
data = rpois(1000, lambda=1) #Poisson(1)
hist(data)
data = rexp(1000, rate=2) #Exponential with mean 1/2
hist(data)
Two realistic examples
An imperfect particle counter
A huge radio active source emits particles at random time points. A
particle
detector is detecting them. However, the detector is imperfect
and gets ``jammed'' for 1 sec every time a particle hits it. No
further particles can be detected within this period. After the 1
sec is over, the detector again starts functioning normally. We
want to
know the fraction of particles that the detector is
missing. Since this fraction may depend on the rate at which the
source emits particles we want to get a plot like the following:

Want a plot like this 
Random emission of particles from a huge radio active source is
wellstudied process. A popular model is to assume that the time
gaps between successive emissions are independent and have
Exponential distributions.
gaps = rexp(999,rate=1)
The question now is to determine which of the particles will
fail to be detected. A particle is missed if it comes within 1
sec of its predecessor. So the number of missed particles is
precisely the number of gaps that are below 1.
miss = sum(gaps < 1)
miss
Now we want to repeat this experiment a number of times, say 50
times.
missList = c()
for(i in 1:50) {
gaps = rexp(999,rate=1)
miss = sum(gaps < 1)
missList = c(missList,miss)
}
mean(missList)
var(missList)
All these are done for rate = 1. For other rates we need to repeat
the entire process afresh. So we shall use yet another
for
loop.
rates = seq(0.1,3,0.1)
mnList = c()
vrList = c()
for(lambda in rates) {
missList = c()
for(i in 1:50) {
gaps = rexp(1000,rate=lambda)
miss = sum(gaps < 1)
missList = c(missList,miss)
}
mn = mean(missList)
vr = var(missList)
mnList = c(mnList,mn)
vrList = c(vrList, vr)
}
Now we can finally make the plot.
plot(rates,mnList/10,ty="l",ylab="% missed")
We shall throw in two error bars for a good measure.
up = mnList + 2*sqrt(vrList)
lo = mnList  2*sqrt(vrList)
lines(rates,up/10,col="red")
lines(rates,lo/10,col="blue")
 Exercise: (This exercise is difficult)
We can estimate the actual rate from the hitting times of the
particles if the counter were perfect. The formula is to take the
reciprocal of the average of the gaps. But if we apply the same
formula for the imperfect counter, then we may get a wrong estimate
of the true rate. We want to make a correction
graph that may be used to correct the under estimate.
Explain why the following R program achieves this. You will need
to look up the online help of cumsum and diff .
rates = seq(0.1,3,0.1)
avgUnderEst = c()
for(lambda in rates) {
underEst = c()
for(i in 1:50) {
gaps = rexp(1000,rate=lambda)
hits = cumsum(gaps)
obsHits = c(hits[1], hits[gaps>1])
obsGaps = diff(obsHits)
underEst = c(underEst,1/mean(obsGaps))
}
avgUnderEst = c(avgUnderEst,mean(underEst))
}
plot(avgUnderEst,rates,ty="l")
Can you interpret the plot?

Simulating a galaxy
A galaxy is not a single homogeneous body and so it has different
mass densities at different points. When we say that the mass
density of a galaxy at a point (x,y,z) is f(x,y,z)
we mean that for any region R in that galaxy the chance of
detecting a body is
where M is the total mass of the galaxy.
There are various models that specify different forms for
f. Many of these consider the galaxy to be confined in
some 2D plane, so that the density may be written as
f(x,y). The typical process for simulating a galaxy has
two components:

First
we simulate the initial state of the galaxy from some probability model.
 Next, we let it evolve deterministically following
appropriate laws of motion.
Such simulations are extremely computation intensive. We
shall therefore confine ourselves to a very simple situation.
To keep life
simple we shall assume that the x and
ycoordinates of the stars are independent normal random
variables. We shall work with just 500 stars of equal masses. (It
is easy to allow random masses, but in our simple
implementation the simulation would then take awfully long to
run!) Each star initially has tangential velocity proportional to
the distance from the center.
x = rnorm(500,sd=4)
y = rnorm(500,sd=4)
vx = 0.5*y
vy = 0.5*x
Let us plot the initial stage of our galaxy.
oldpar = par(bg="black") #we want black background
plot(x,y,pch=".",col="white") #and white stars
par(oldpar) #restore the default (black on white)
This does not look much like a galaxy. We have to let it
evolve over time. This basically means solving an nbody
problem numerically. A popular method is the HutBarnes
algorithm. But we shall apply Newton's laws using R in a brute
force (and slow) way. The commands are in the script file
newton.r.
source("newton.r")
Here we shall perform the simulation for 100 time steps, updating
the picture of the galaxy every 10 steps. With only 500 points it
is difficult to see much pattern, but still one may see spiral
tentacles forming.
The simulation here is admittedly very crude and inefficient. If
you are interested, you may see more realistic examples (with
references) at this
website.
Taking error bars into account
Most statistical techniques are designed assuming the
observations to be made with infinite precision, an assumption
that is far from true in astronomy. Astronomers often have error
bars on their measurements that are hard to incorporate into
regular statistical procedures. However, error bars can be taken
care of quite easily using simulation as we shall see now.
Consider the principal component analysis from yesterday's lab,
where we analyzed the SDSS quasar data. Let us look at it once
again.
quas = read.table('SDSS_quasar.dat',head=T)
names(quas)
We shall leave out the first three columns as they do not provide
any real information. Among the remaining variables we have some
variables named like sig_u
. These give the error
bars for the corresponding measurements like u_mag
.
We want to carry out principal component analysis keeping these
errors in mind. To keep things simple we shall consider only
the first 100 rows.
n=100
quas1 = quas[1:n,c(1:3)]
attach(quas1)
Consider the first row in the data.
quas1[1,]
Here u_mag=19.921
and sig_u=0.042.
We
shall interpret these as follows:
If we carry out the same measurement again and again we might get
other values. If X denotes a typical such value, then
X is a random variable with mean 19.921 and
standard deviation 0.042.
For the ease of exposition we shall assume that X has a
Gaussian distribution with this mean and standard deviation.
It is quite easy to use any other distribution (even skewed ones,
if the error bars are asymmetric).
Let us pretend that we are measuring the
brightness of the UV band again for the same quasar. We may
simulate the new measurement (u_rnd
, say) as
u_rnd = rnorm(1,mean=u_mag[1],sd=sig_u[1])
u_rnd
We want to carry this out for all the n=100 quasars.
u_rnd = rnorm(n,u_mag,sig_u)
u_rnd
We want to do this for all the measurements for which error bars
are available, and pack the new simulated measurements into a
data frame.
u_rnd = rnorm(n,u_mag,sig_u)
g_rnd = rnorm(n,g_mag,sig_g)
r_rnd = rnorm(n,r_mag,sig_r)
i_rnd = rnorm(n,i_mag,sig_i)
z_rnd = rnorm(n,z_mag,sig_z)
K_rnd = rnorm(n,K_mag,sig_K)
H_rnd = rnorm(n,H_mag,sig_H)
J_rnd = rnorm(n,J_mag,sig_J)
dat_rnd = data.frame(u_rnd,g_rnd,i_rnd,z_rnd,
K_rnd,H_rnd,J_rnd,
z,X.ray,M_i,Radio)
Now let us carry out principal component analysis of this
simulated data set and plot the standard deviations for the
principal components.
sdv = prcomp(dat_rnd,stand=T)$sdev
plot(sdv,ty="l",xlab='PC',ylab='std dev')
If we do the above procedure (simulate, analyze, plot) again and
again, then we shall have an idea how the standard deviations are
affected by measurement errors. The following R script (also
stored in simpca.r) does precisely this.
sdv = c()
for(i in 1:20) {
u_rnd = rnorm(n,u_mag,sig_u)
g_rnd = rnorm(n,g_mag,sig_g)
r_rnd = rnorm(n,r_mag,sig_r)
i_rnd = rnorm(n,i_mag,sig_i)
z_rnd = rnorm(n,u_mag,sig_u)
K_rnd = rnorm(n,K_mag,sig_K)
H_rnd = rnorm(n,H_mag,sig_H)
J_rnd = rnorm(n,J_mag,sig_J)
dat_rnd = data.frame(u_rnd,g_rnd,r_rnd,i_rnd,
K_rnd,H_rnd,J_rnd,z_rnd,
z,X.ray,M_i,Radio)
sdv = rbind(sdv,prcomp(dat_rnd,stand=T)$sdev)
}
plot(sdv[1,],type="l",xlab='PC',ylab='std dev')
apply(sdv,1,lines)
Notice from the plot that most of the standard deviations are not
affected at all by the measurement errors. Roughly speaking, this means that the
principal component analysis is doing a good job of discovering
the structure of the data even in presence of measurement errors.
Bootstrapping
So far we are simulating from completely specified
distributions. But suppose that someone gives some data and asks
us to simulate from whatever is the distribution of the
data. Since we do not have the distribution itself we cannot
apply the above methods directly.
However, as it happens, there is an approximate way out that is
simple to implement in R.
Let us first see it using an artificial example.
Suppose someone has some data that was originally generated from
N(0,1) distribution.
origData = rnorm(1000,mean=0,sd=1)
Then we shall simply resample from this data set.
newData = sample(origData,500, replace=T)
This is called (one form of) bootstrapping.
It is instructive to compare this with the true distribution.
hist(newData,prob=T)
x = seq(3,3,0.1)
lines(x,dnorm(x))
Now let us apply this on some real data.
hip = read.table("HIP.dat",head=T)
attach(hip)
We shall consider the Vmag
variable. We want to
generate 10 new values from its distribution (which is unknown to
us).
newVmag = sample(Vmag,10,replace=T)
newVmag
Well, that was easily done, but the question is
What in the universe does this newVmag
mean?
Does this mean we have generated 10 new galaxies? How can that be
when we are just reusing the old values?
Common sense, please!
First let us beware of the following mistaken notion.

  

Mistaken
notion: Bootstrapping is a way to increase the
sample size without any extra real sampling.
 
  

If that were true, you could just keep on
generating further samples from existing data and get away by
pretending that they are new galaxies. Well, common sense
tells us that this cannot be true.

  

Great
lesson: Never place statistics above common sense.
 
  

By repeatedly picking from the already available sample we are
not adding anything to the information. In particular, if the
original sample presents a biased view of the underlying
population,
then none of the resamples can cure that distortion.
Why then should we do bootstrapping at all?
The next example explains why this is useful.
Astronomical data sets are often
riddled with outliers (values that are far from the rest). To get
rid of such outliers one sometimes ignores a few of the extreme
values. One such method is the trimmed mean.
x = c(1,2,3,4,5,6,100,7,8,9)
mean(x)
mean(x,trim=0.1) #ignore the extreme 10% points in BOTH ends
mean(x[x!=1 & x!=100])
We shall compute 10%trimmed mean of Vmag
from the Hipparcos data set.
mean(Vmag,trim=0.1)
We want to estimate the standard error of this estimate. For
ordinary mean we had a simple formula, but unfortunately such a
simple formula does not exist for the trimmed mean. So we shall
use bootstrap here as follows. We shall generate 100 resamples
each of same size as the original sample.
trmean = c()
for(i in 1:100) {
resamp = sample(Vmag,length(Vmag),replace=T)
trmean = c(trmean,mean(resamp,trim=0.1))
}
sd(trmean)

  

Bootstrapping is such a popular statistical technique that R has
a package called boot to perform bootstrapping.
 
  

Permutation test
Here is a variation of the same theme: resampling
i.e., randomly sampling from the original sample and
pretending that it is a new sample.
The situation that we are discussing next involves two samples,
say, the Hyades and the nonHyades stars. Suppose that we want to
see if the their medians colors are different or not. (By this,
as usual, we
mean to test if the medians of the distributions underlying
the two samples are same or not.) Here we shall assume that the
shapes of the two distributions are the same.
Since the samples are pretty
large, it is reasonable to look at the differences of the medians
of the two samples.
source("hyad.r")
colH = B.V[HyadFilter]
colnH = B.V[!HyadFilter & !is.na(B.V)]
m = length(colH)
n = length(colnH)
median(colH)median(colnH)
If the population medians are the same, then we should expect
this difference to be near zero, as well. So if this difference
is too large, then we have reason to suspect that the underlying
distributions have different medians. The question therefore is
how large is ``too large"?
Sometimes statistical theory dictates a rule to decide this. But
more often there is no theory to help. Then permutation
tests provide an easy way out.
Imagine first what happens if the medians are the same. Then the
two underlying distributions are actually the same (since their
shapes already match by assumption). So the two samples together
is basically like a single large sample from this common
distribution. If this is the case, calling a Hyades star as
nonHyades (or vice versa) would really not make any
difference.
So we shall mix up all stars together, and then pick any m
of them and call them Hyades, while the remaining n would
be labeled nonHyades. This should be as good as the original
sample if the two distributions are indeed the same.
pool = c(colH,colnH)
mixed = sample(pool) # sample generates a random permutation
newH = mixed[1:m]
newnH = mixed[(m+1):(m+n)]
median(newH)  median(newnH)
If the two distributions were really the same then this new
difference should be close to the difference based on the
original data.
Of course, this apparently takes us back to the original
question: how close is ``close enough"? But now it is easier to
answer since we can repeat this random mixing many times.
pool = c(colH,colnH)
d = c()
for(i in 1:1000) {
mixed = sample(pool)
newH = mixed[1:m]
newnH = mixed[(m+1):(n+m)]
newDiff = median(newH)  median(newnH)
d = c(d,newDiff)
}
Now that we have 1000 values of how the typical difference should
look like if the distributions were the same, we can see where
our original difference lies w.r.t. these values. First let us
make a histogram of the typical values:
hist(d)
Well, the original value seems quite far from the range of
typical values, right?
One way to make this idea precise is
to compute the pvalue, which is defined as the
chance of observing a difference even more extreme than the
original value. Here ``more extreme" means ``larger in absolute
value".
orig = median(colH)median(colnH)
sum(abs(d)>=abs(orig))/1000
A pvalue smaller than 0.05, say, strongly indicates that
the medians of the distributions are not the same.
Kernel density estimation
We have already mentioned how important the distribution
underlying a sample is, and how the distribution may be specified
via its CDF. An (almost) equivalent way of specifying the
distribution is via its density function, which is just
the derivative of the CDF. The simplest way to get an idea about
the density function underlying a data set is to make its
histogram. The diagram below shows a density and the typical shape
of the histogram that you can expect for a data set with this
density. Note the similarity in shape.

A density function and a typical histogram 
Here is how we can use histograms to estimate the density of
Vmag
.
hist(Vmag, prob=T)
Typically densities are smooth functions, while histograms are
jagged. A better estimate may be obtained using kernels as
we have seen in the theory class.
Let us start with a very simple example to get the procedure
right. We shall apply the
Gaussian kernel to a sample consisting of just two points!
source("krnl1.r")
We shall take some time to understand the different parts of the
plot that pops up. Recall that the kernel density estimator has
the formula
where K is the kernel used. In our example, the sum has
just two terms
that are half the Gaussian densities centred around the two points.
These are shown in red. The sum (which is the kernel density
estimate) is shown in black.
Now notice the little pop up window that looks something like
this:

Kernel pop up 
Move the slider to change the bin width (which controls how
peaked the kernel is). See how a large value of the bin width
produces a flat estimator, while a small value produces two peaks
(one for each point). You may not readily notice that the black
curve is getting flatter with increasing bin width, until you
notice how the xaxis is growing and the yaxis
shrinking.
Next we shall move over to a more realistic situation where we
shall work with 100 observations that are artificially generated
from the standard Gaussian distribution. The aim now will be to
compare the estimated density with the actual density (which we
know since the data set is artificial).
source("krnl2.r")
Again play with the slider and convince yourself that small bin
widths produce jagged estimated, while large ones produce flat
estimates. The best value is somewhere in the middle. The
theoretical class has taught you how to find a good bin
width. You can also let R find a good value for you as we shall
see next. We shall use the Hipparcos data set once again and the
function density
.
dd = density(Vmag)
plot(dd$x,dd$y,ty="l")
It might be more interesting to see the estimate overlayed on top
of the histogram:
hist(Vmag,prob=T)
lines(dd$x,dd$y,col="red")
What bin width did R use? To see this look inside
dd
:
dd
If you would like to try a different bin width (say 0.1) then
that is OK
too:
dd2 = density(Vmag,bw = 0.1)
lines(dd2$x,dd2$y,col="blue")
 Exercise:
All these examples work with the Gaussian kernel. R allows you to
use a number of other kernels as well: Epanechnikov, Rectangular,
and Triangular, among others.
Try these out to see that the choice of the kernels does not
matter as much as the bin width does. Here is how you specify the
kernel:
dd3 = density(Vmag,kernel="epan") #abbrevs are OK
lines(dd3$x,dd3$y,col="green")

QQ plot
Suppose that you are given a random sample and asked to check if
it comes from the standard Gaussian distribution. There are many
ways to go about it, and the QQ plot is one of the easiest.
To understand this process let us use a small data set:
x = c(2.3, 0.3, 0.4, 2.0, 1.1, 1.0, 2.1, 3.4, 0.4, 0.5, 1.2,
2.1)
n = length(x)
n
First we shall sort them in increasing order:
sortedX = sort(x)
sortedX
Now pick a number in this sorted list. Say we pick the 3rd
number, which is 0.3.
What is the fraction of numbers that are less than or equal to this?
Clearly, it is
3/n = 3/12 = 0.25.
frac = 3/n
So we see that 0.25 fraction of the
numbers are less than or equal to 0.3 in the data.
Now it is time to compare it with what we should see for a true
standard Gaussian data. We shall find
the number M such that the a standard Gaussian random
variable will be ≤ M with chance 0.25. Then ideally
M should be close to our observed 0.3.
M = qnorm(frac)
Well, due to a little technicality we subtract 1/2n from
0.25. This hardly matters when n is large.
M = qnorm(frac  1/(2*n) )
M
This M is far from 0.3, and so we become suspicious that
our sample is not from the standard Gaussian distribution.
So far we are working with just the 3rd number in
sortedX
, which we picked
arbitrarily. We now repeat the process for all the points.
frac = (1:n)/n
M = qnorm(frac  1/(2*n))
plot(M,sortedX)
This is called the QQ plot of the sample against the
standard Gaussian distribution. If the points are close to the
y=x line then we can be hopeful that the sample is indeed
from a standard Gaussian distribution. To facilitate the
comparision let us add the y=x line to the plot.
abline(a=0,b=1)
Well, our data is far from standard Gaussian!
Comparison with the Gaussian distribution is so very common that
R provides a function qqnorm
for it.
qqnorm(x)
R does not provide any readymade function for making a QQ plot
against other distributions. But one can always follow the method
shown here to make a QQ plot against any distribution for which
R can compute the CDF (or, rather, its inverse using a
qfunction).