Bootstrapping and Monte Carlo methods

A Monte Carlo method generally refers to a method that relies on simulated random numbers in some way. For instance, bootstrapping may be considered to be a particular case of a Monte Carlo method, since it relies on random resampling.


Monte Carlo integration and importance sampling

Most of this module will focus on bootstrapping, but we begin with a toy example illustrating Monte Carlo methods in general.

Quite often a quantity of interest in statistics may be expressed as an integral that we wish to evaluate. For instance, in Bayesian analysis, one is often interested in the posterior mean of a particular parameter, which may be expressed as an integral.
For purely illustrative purposes, suppose that we wish to integrate the following function over the interval (0,1):

   y=function(x) sin(x*(1-x))/(1+x+sqrt(x))

There does not appear to be any closed-form solution, so we turn to numerical methods.

R does have a function, integrate, that performs adaptive quadrature of functions of one variable:

   integrate(y,0,1)

To approximate this integral using Monte Carlo methods, we note that it may be written as E(y(U)), the expectation of y(U), where U is a uniform random variable on the interval (0,1). As is typical in statistics, we will use a sample mean to approximate the true mean E(y(U)):

   u=runif(10000)
   mean(y(u))

To get a sense of how precise our estimate is (pretending for the purpose of illustration that we do not know the correct value of the integral), we can estimate the standard deviation of this sample mean as follows:

   sqrt(var(y(u))/10000)

By the central limit theorem, we know that our sample mean will be roughly normally distributed, so we can give a 95% confidence interval for the correct value by adding and subtracting 1.96 times the above standard deviation.
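
For instance, one way to compute such an interval from the quantities above (using est and se.mc merely as convenient names) is:

   est = mean(y(u))
   se.mc = sqrt(var(y(u))/10000)
   est + c(-1.96, 1.96)*se.mc   # approximate 95% confidence interval for the integral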

Next, we consider a method for reducing the variance of our estimator known as importance sampling. Suppose we have a random variable Z whose density function f(z) is positive on (0,1). Then the integral of y(x) that we wish to find may be rewritten as the expectation of y(Z)/f(Z). (To see why, recall that the expectation of a function g(Z) is the integral of g(z) times the density f(z); taking g(z) = y(z)/f(z), the density cancels and we are left with the integral of y(z).) In our toy example, suppose that Z has a beta density with parameters (2, 2.5):

   z=rbeta(10000,2,2.5)
   mean(y(z)/dbeta(z,2,2.5))

Now for the standard deviation calculation:

   sqrt(var(y(z)/dbeta(z,2,2.5))/10000)

Note that this standard deviation is roughly one third of the original standard deviation. The reason for this is that the beta(2,2.5) density happens to be somewhat close in shape to a constant times the function y(x) that we are trying to integrate, as seen in the following plot:

   x=seq(0,1,len=200)
   plot(x,y(x),type="l",
      main="Solid line: y(x); dashed line: k*dbeta(x,2,2.5)")
   k = y(.5)/dbeta(.5,2,2.5)
   lines(x,k*dbeta(x,2,2.5),lty=2,col=2)

The intuition behind importance sampling, therefore, is that y(z)/dbeta(z,2,2.5) is much closer to a constant (and therefore has smaller variance) than y(u) itself. Since the standard deviation of the estimator is reduced to about one third of its original size, and the standard deviation of a sample mean scales as the inverse square root of the sample size, we would need a sample roughly nine times as large without importance sampling as with it in order to obtain a result of comparable precision.
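
As a quick numerical check of this nine-fold claim, we can compare the two Monte Carlo variances directly; the ratio will fluctuate a bit from run to run but should be roughly nine:

   var(y(u)) / var(y(z)/dbeta(z,2,2.5))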


Nonparametric bootstrapping of regression standard errors

Here, we revisit the Hipparcos data set and the 92 stars identified as Hyades stars in the exploratory data analysis and regression module:

   hip = read.table("http://astrostatistics.psu.edu/datasets/HIP_star.dat",
      header=T,fill=T)
   attach(hip)
   x1=(RA>50 & RA<100)
   x2=(pmRA>90 & pmRA<130)
   x3=(pmDE>-60 & pmDE< -10) # Space in '< -' is necessary!
   x4=x1&x2&x3
   x5=x4 & (DE>0) & (e_Plx<5)

Earlier, we considered a linear model for the relationship between DE and pmDE among these 92 stars:

   plot(DE[x5],pmDE[x5],pch=20)
   m=lm(pmDE[x5] ~ DE[x5])
   abline(m,lwd=2,col=2)

The red line on the plot is the usual least-squares line, for which estimation is easy and asymptotic theory gives easy-to-calculate standard errors for the coefficients:

   summary(m)$coef

However, suppose we wish to use a resistant regression method such as lqs.

   library(MASS)
   m2=lqs(pmDE[x5] ~ DE[x5])
   abline(m2,lwd=2,col=3)
   m2

In this case, it is not so easy to obtain standard errors for the coefficients. Thus, we will turn to bootstrapping. In a standard, or nonparametric, bootstrap, we repeatedly draw samples of size 92 from the empirical distribution of the data, which in this case consist of the (DE, pmDE) pairs. We use lqs to fit a line to each sample, then compute the sample covariance of the resulting coefficient vectors. The procedure works like this:

   x=DE[x5]
   y=pmDE[x5]
   m2B=NULL
   for (i in 1:200) {
      s=sample(92,replace=T) # resample the indices 1:92 with replacement
      m2B=rbind(m2B, lqs(y[s]~x[s])$coef)
   }

We may now find the sample covariance matrix for m2B. The (marginal) standard errors of the coefficients are obtained as the square roots of the diagonal entries of this matrix:

   cov(m2B)
   se=sqrt(diag(cov(m2B)))
   se

The logic of the bootstrap procedure is that we are estimating an approximation of the true standard errors. The approximation involves replacing the true distribution of the data (unknown) with the empirical distribution of the data. This approximation may be estimated with arbitrary accuracy by a Monte Carlo approach, since the empirical distribution of the data is known and in principle we may sample from it as many times as we wish. In other words, as the number of bootstrap samples increases, we get a better estimate of the true value of the approximation. On the other hand, the quality of the approximation itself depends on the original sample size (92, in our example), and no amount of resampling can improve that.

An alternative way to generate a bootstrap sample in this example is to form a new value of each response (y) by adding a randomly selected residual from the original lqs fit to the corresponding predicted value. Thus, we resample not the entire bivariate structure but merely the residuals; a sketch is given below, though as an exercise you might try implementing it yourself first. Note that this approach is not a good idea if you have reason to believe that the distribution of residuals is not the same for all points. For instance, if there is heteroscedasticity or if the residuals contain structure not explained by the model, this residual-resampling approach is not warranted.
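
One possible sketch of this residual-resampling bootstrap, using the x and m2 objects defined above (the object names res, fit, ystar, and m2R are choices made here for illustration):

   res = residuals(m2)   # residuals from the original lqs fit
   fit = fitted(m2)      # fitted values from the original lqs fit
   m2R=NULL
   for (i in 1:200) {
      ystar = fit + sample(res, replace=T)   # fitted values plus resampled residuals
      m2R=rbind(m2R, lqs(ystar ~ x)$coef)
   }
   sqrt(diag(cov(m2R)))  # bootstrap standard errors under residual resampling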


Using the boot package in R

There is a boot package in R that contains many functions relevant to bootstrapping. As a quick example, we show here how to obtain the same kind of bootstrap results as those obtained above (for the lqs model of pmDE regressed on DE for the Hyades stars).

   library(boot)
   mystat=function(a,b)           # a: the whole dataset; b: a vector of bootstrap indices
      lqs(a[b,2]~a[b,1])$coef
   m2B2 = boot(cbind(DE[x5],pmDE[x5]),
      mystat, 200)
   names(m2B2)

As explained in the help file, the boot function requires as input a function that accepts as arguments the whole dataset and a vector of indices defining the bootstrap sample. This is why we defined the mystat function above. To see output similar to that obtained earlier for the m2B object, look in m2B2$t:

   cov(m2B2$t)
   sqrt(diag(cov(m2B2$t)))

Compare with the output provided by print.boot and the plot produced by plot.boot:

   m2B2
   plot(m2B2)

Another related function, for producing bootstrap confidence intervals, is boot.ci.
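
For example, a percentile interval for the slope (the second coefficient) could be requested as follows; see the boot.ci help file for the other interval types it offers:

   boot.ci(m2B2, type="perc", index=2)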


Parametric bootstrapping of regression standard errors

We now return to the regression problem studied earlier.

Sometimes, resampling is done from a theoretical distribution rather than from the original sample. For instance, if simple linear regression is applied to the regression of pmDE on DE, we obtain a parametric estimate of the distribution of the residuals, namely, normal with mean zero and standard deviation estimated from the regression:

   summary(m)

Remember that m was defined above as lm(pmDE[x5]~DE[x5]). We observe a residual standard error of 4.449.

A parametric bootstrap scheme proceeds by simulating a new set of pmDE (or y) values using the model

   y = 21.9 - 3.007*DE[x5] + 4.449*rnorm(92)

Then, we refit a linear model using y as the new response, obtaining slightly different values of the regression coefficients. If this is repeated, we obtain an approximation of the joint distribution of the regression coefficients for this model.
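
A minimal sketch of this scheme, reusing the coefficient estimates and residual standard error reported above (the choice of 200 replications and the object names mP and ystar are arbitrary):

   mP=NULL
   for (i in 1:200) {
      ystar = 21.9 - 3.007*DE[x5] + 4.449*rnorm(92)   # simulate new responses from the fitted model
      mP=rbind(mP, lm(ystar ~ DE[x5])$coef)
   }
   cov(mP)                # approximate joint covariance of the coefficients
   sqrt(diag(cov(mP)))    # parametric bootstrap standard errors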

Naturally, the same approach could be tried with other regression methods such as those discussed in the EDA and regression module, but careful thought should be given to the parametric model used to generate the new residuals. In the normal case discussed here, the parametric bootstrap is simple but not really necessary, because standard linear regression theory already gives a very good approximation to the joint distribution of the regression coefficients when the errors are independent, homoscedastic, and normal. One possible use of this method is in a model that assumes the absolute residuals are exponentially distributed, in which case least absolute deviation regression as discussed in the EDA and regression module can be justified. The reader is encouraged to implement a parametric bootstrap using the rq function found in the "quantreg" package.
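
One possible sketch of that exercise, assuming double-exponential errors (so that the absolute residuals are exponentially distributed) with their scale estimated by the mean absolute residual; these modeling choices and the object names mLAD, b, mQ, and ystar are made here purely for illustration:

   library(quantreg)
   mLAD = rq(pmDE[x5] ~ DE[x5])        # least absolute deviation fit (default tau = 0.5)
   b = mean(abs(residuals(mLAD)))      # scale of the assumed double-exponential errors
   mQ=NULL
   for (i in 1:200) {
      # new responses: fitted values plus signed exponential (i.e., double-exponential) errors
      ystar = fitted(mLAD) + b*rexp(92)*sample(c(-1,1),92,replace=T)
      mQ=rbind(mQ, rq(ystar ~ DE[x5])$coef)
   }
   sqrt(diag(cov(mQ)))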


Estimating a p-value (parametric bootstrap for Kolmogorov-Smirnov)

In the hypothesis testing module, we conducted a permutation test of the null hypothesis that the 92 Hyades stars have the same distribution of B minus V as the 2586 non-Hyades stars. Because there are 2678!/(92!2586!) = 3.78 x 10^142 different ways to select 92 objects from 2678 objects, we had to depend on a random sample of reassignments:

   H=B.V[x5]
   nH=B.V[!x5 & !is.na(B.V)]
   tlist=NULL
   all=c(H,nH)
   for(i in 1:5000) {
      s=sample(2678,92) # randomly choose 92 of the 2678 indices
      tlist=c(tlist, t.test(all[s],all[-s],
        var.eq=T)$stat) # add t-stat to list
   }

Even though the sample in tlist appears to be very close to standard normal according to a histogram and a Q-Q plot, the Kolmogorov-Smirnov test rejects the null hypothesis of standard normality:

   ks.test(tlist,"pnorm")

However, if we use a Kolmogorov-Smirnov test in which we allow the mean of the theoretical normal distribution to be estimated from the sample, we see no evidence of a departure from normality according to the usual output:

   newkstest=ks.test(tlist,"pnorm",mean=mean(tlist))
   newkstest

Note in the above function call that ks.test passes the "mean=" option directly to the pnorm function. The ks.test function does not know beforehand how many options, if any, might be passed along to the distribution function it is given; therefore, the definition of ks.test uses three dots (...) in its argument list to stand for an unspecified number of additional arguments.
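
As a toy illustration of this mechanism (the function callcdf below is not part of ks.test; it is defined here only to show how ... works), note how the extra argument is forwarded untouched:

   callcdf = function(x, cdf, ...) cdf(x, ...)
   callcdf(1.96, pnorm)            # no extra arguments: standard normal CDF
   callcdf(1.96, pnorm, mean=1)    # "mean=1" is passed straight through to pnorm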

Unfortunately, the Kolmogorov-Smirnov test is not quite valid if the theoretical distribution against which we are testing depends upon estimates derived from the data. (Imagine the extreme case: We could test the sample distribution against itself, which is simply an estimate derived from the data!) Therefore, we can get a better sense of how valid the second p-value is by using a Monte Carlo approach.

To carry out the Monte Carlo test, first observe the value of the test statistic for our K-S test by typing

   newkstest$stat

The value that we seek is the probability that X is at least as large as this observed statistic, where X is the test statistic obtained by applying the same K-S test to a sample of size 5000 drawn from the null distribution:

   ksstat=NULL
   for(i in 1:1000) {
      x=rnorm(5000)
      ksstat=c(ksstat,
         ks.test(x,pnorm,mean=mean(x))$stat)
   }

Now let's look at a histogram of the test statistics and estimate a p-value:

   hist(ksstat,nclass=40)
   ns=newkstest$stat
   abline(v=ns,lty=2,col=2)
   mean(ksstat>=ns)

Note that the approximate p-value above is smaller than the original p-value reported by newkstest, though it is still not small enough to provide strong evidence that the tlist sample is not normal with unit variance.

We conclude this section by briefly explaining the parenthetical remark "parametric bootstrap for Kolmogorov-Smirnov" in its title and discussing a related nonparametric bootstrap method. The above procedure is equivalent to a parametric bootstrap procedure in which we draw samples not from a standard normal, but from a normal distribution with mean equal to mean(tlist). This is because the "mean=mean(x)" option implies that we are applying a particular K-S test that is invariant to the mean of the distribution.
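
To see this equivalence concretely, the null distribution of the statistic could just as well have been simulated as follows (a sketch; ksstat2 is simply a new name chosen here, and up to Monte Carlo error its values behave like those in ksstat):

   ksstat2=NULL
   for(i in 1:1000) {
      x=rnorm(5000, mean=mean(tlist))   # draw from a normal centered at mean(tlist) rather than 0
      ksstat2=c(ksstat2,
         ks.test(x,pnorm,mean=mean(x))$stat)
   }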

An interesting exercise is to implement a nonparametric bootstrap in this example that would be based only on resampling. In this case, we cannot merely repeatedly use

   s=sample(5000,replace=T)
   ks.test(tlist[s],pnorm,mean=mean(tlist[s]))$stat

The reason is that this statistic is biased. Therefore, it is necessary to correct for the bias. Unfortunately, this is not a trivial correction, and it essentially requires a modified version of ks.test: whereas the ordinary ks.test statistic is based on the maximum distance between the EDF of the sample and the theoretical CDF, the modified version must subtract off the bias before finding the maximum.
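
As a starting point for such a modification, the ordinary one-sample statistic can be computed directly from the EDF and the theoretical CDF; the following sketch reproduces the value reported by ks.test (inserting the bias correction is left to the reader):

   n = length(tlist)
   xs = sort(tlist)
   Fx = pnorm(xs, mean=mean(tlist))               # theoretical CDF at the sorted sample points
   max(pmax((1:n)/n - Fx, Fx - (0:(n-1))/n))      # maximum EDF-CDF distance
   ks.test(tlist, pnorm, mean=mean(tlist))$stat   # same value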