Likelihood Computations and Random Numbers

This tutorial focuses on likelihood computations and random number generation.  We also discuss EM algorithms and mixture models.  

Maximum likelihood estimation in a simple case

Let's reconsider the flux measurements in the gamma ray burst dataset.
   grb <- read.table("",
      header=T, skip=1)
   flux <- grb[,2]
The histogram suggests that the univariate distribution has roughly the shape of an exponential distribution (we'll speak more about what this means later). Let us replot these data in a particular (and particularly common) manner besides the histogram that is also suggestive of an exponential distribution.

As a first step, let us calculate something akin to the (x,y) coordinates of the empirical distribution function -- the function that has a jump of size 1/n at every one of the sorted data points.
   n <- length(flux)
   xx <- sort(flux)
   yy <- (1:n)/n
We could now obtain the empirical cdf by connecting the (xx,yy) points using a stair-step pattern. However, we'll look at these points slightly differently.

The exponential distribution has a distribution function given by F(x) = 1-exp(-x/mu) for positive x, where mu>0 is a scalar parameter equal to the mean of the distribution. This implies among other things that log(1-F(x)) = -x/mu is a linear function of x in which the slope is the negative reciprocal of the mean. Let us then look for the characteristic linear pattern if we plot log(1-F(x)) against x using the empirical distribution function for F:
   plot(xx, log(1-yy+1/n), xlab="flux", 
You may recall from the EDA and regression tutorial that this plot looks like the plot of time vs. flux that we produced as part of a regression analysis. This is only a coincidence; the two plots are fundamentally different. The plot seen here is a univariate plot, whereas the time vs. flux plot was a bivariate plot. We are ignoring the time variable here.

The plot certainly looks linear, so let us proceed on the assumption that the flux data are a sample from an exponential distribution with unknown parameter mu.

The overriding question of this section is this: How shall we estimate mu?

As mentioned above, mu is equal to the mean of this population. For a quick refresher on some probability theory, let us recall why this is so: The first step in going from the distribution function F(x) = 1 - exp(-x/mu) to the mean, or expectation, is to obtain the density function by differentiating: f(x) = exp(-x/mu)/mu. Notice that we typically use F(x) to denote the distribution function and f(x) to denote the density function. Next, we integrate x*f(x) over the interval 0 to infinity, which gives the mean, mu.

Since mu is the population mean, it is intuitively appealing to simply estimate mu using the sample mean. This method, in which we match the population moments to the sample moments and then solve for the parameter estimators, is called the method of moments. Though it is a well-known procedure, we focus instead on a much more widely used method (for good reason) called maximum likelihood estimation.

The first step in maximum likelihood estimation is to write down the likelihood function, which is nothing but the joint density of the dataset viewed as a function of the parameters. Next, we typically take the log, giving what is commonly called the log likelihood function. Remember that all logs are natural logs unless specified otherwise.

The log likelihood function in this case is (with apologies for the awkward notation)
l(mu) = -n log(mu) - x1/mu - ... - xn/mu
A bit of calculus reveals that l(mu) is therefore maximized at the sample mean. Thus, the sample mean is not only the method of moments estimator in this case but the maximum likelihood estimate as well.

In practice, however, it is sometimes the case that the linear-looking plot produced earlier is used to estimate mu. As we remarked, the negative reciprocal of the slope should give mu, so there is a temptation to fit a straight line using, say, least-squares regression, then use the resulting slope to estimate mu.
   mean (flux) # This is the MLE
   m1 <- lm(log(1-yy+1/n) ~ xx)
   -1/m1$coef[2] # An alternative estimator
There is a possible third method that I am told is sometimes used for some kinds of distributions. We start with a histogram, which may be viewed as a rough approximation of the density:
   h <- hist(flux)
All of the information used to produce the histogram is now stored in the h object, including the midpoints of the bins and their heights on a density scale (i.e., a scale such that the total area of the histogram equals one).

To see how to use this information, note that the logarithm of the density function is log f(x) = -log(mu) - x/mu, which is a linear function of x. Thus, plotting the logarithm of the density against x might be expected to give a line.
   counts <- h$counts
   dens <- h$density[counts>0]
   midpts <- h$mids[counts>0]
   plot(midpts, log(dens))
When using linear regression to estimate the slope of the linear pattern just produced, I am told that it is standard to weight each point by the number of observations it represents, which is proportional to the reciprocal of the variance of the estimated proportion of the number of points in that bin. We can obtain both the weighted and unweighted versions here. We can then obtain an estimate of mu using either the intercept, which is -log(mu), or the slope, which is -1/mu:
   m1 <- lm(log(dens) ~ midpts)
   m2 <- lm(log(dens) ~ midpts,
   exp(-m1$coef[1]) # This is one estimate
   -1/m1$coef[2] # This is another
   exp(-m2$coef[1]) # Yet another
   -1/m2$coef[2] # And another
We have thus produced no fewer than six different estimators of mu (actually seven, except that the MLE and the method of moments estimator are the same in this case). How should we choose one of them?

There are a couple of ways to answer this question. One is to appeal to statistical theory. The method of maximum likelihood estimation is backed by a vast statistical literature that shows it has certain properties that may be considered optimal. The method of moments is also a well-established method, but arguably with less general theory behind it than the method of maximum likelihood. The regression-based methods, on the other hand, are all essentially ad hoc.

A second way to choose among estimators is to run a simulation study in which we repeatedly simulate datasets (whose parameters are then known to us) and test the estimators to see which seems to perform best. In order to do this, we will need to be able to generate random numbers, which is the next topic in this tutorial.

Standard error of the MLE

Based on asymptotic theory (i.e., the mathematics of statistical estimators as the sample size tends to infinity), we know that for many problems, the sampling distribution of the maximum likelihood estimator (for theta) is roughly normally distributed with mean theta and variance equal to the inverse of the Fisher information of the dataset. For an i.i.d. sample of size n, the Fisher information is defined to be n times the mean of the square of the first derivative of the log-density function. Equivalently, it is -n times the mean of the second derivative of the log-density function. In many cases, it is easier to use the second derivative than the square of the first derivative.

In the previous example, the log-density function is -log(mu) - x/mu. The second derivative with respect to the parameter is 1/mu^2 - 2x/mu^3. To find the Fisher information, consider x to be a random variable; we know its expectation is mu, so the expectation of the second derivative equals -1/mu^2. We conclude that the Fisher information equals n/mu^2. Intuitively, this means that we get more "information" about the true value of mu when this value is close to zero than when it is far from zero. For the exponential distribution with mean mu, this makes sense.

Earlier we calculated the MLE. Let's give it a name:
   mu.hat <- mean (flux) # The MLE found earlier
The standard error of mu.hat, based on the Fisher information, is the square root of the inverse of the Fisher information evaluated at mu.hat:
   sqrt(mu.hat/n) # SE based on (expected) information
   mu.hat + 1.96 * c(-1,1) * sqrt(mu.hat/n) # approx. 95% CI
The Fisher information calculated above is sometimes called the expected information because it involves an expectation. As an alternative, we can use what is called the observed information, which is the negative second derivative of the log-likelihood function. In this example, the log-likelihood function evaluated at the estimate mu.hat is equal to
   -n * log(mu.hat) - sum(flux) / mu.hat
and the negative second derivative of this function, evaluated at mu.hat, is
   -n / mu.hat^2 + 2*sum(flux) / mu.hat^3 # observed information at MLE
Notice in this case (though not in every model) that the observed information evaluated at the MLE is equal to the expected information evaluated at the MLE:
   n / mu.hat^2 # expected information at MLE
Do you see why?

Generating random numbers in R

First, some semantics: "Random numbers" does not refer solely to uniform numbers between 0 and 1, though this is what "random numbers" means in some contexts. We are mostly interested in generating non-uniform random numbers here.

R handles many common distributions easily. To see a list, type"distribution",package="stats")
Let's consider the well-known normal distribution as an example:
The four functions 'rnorm', 'dnorm, 'pnorm', and 'qnorm' give random normals, the normal density (sometimes called the differential distribution function), the normal cumulative distribution function (CDF), and the inverse of the normal CDF (also called the quantile function), respectively. Almost all of the other distributions have similar sets of four functions. The 'r' versions are rbeta, rbinom, rcauchy, rchisq, rexp, rf, rgamma, rgeom, rhyper, rlogis, rlnorm, rmultinom, rnbinom, rnorm, rpois, rsignrank, rt, runif, rweibull, and rwilcox (there is no rtukey because generally only ptukey and qtukey are needed). Additional distributions are available in other packages.

As an example, suppose we wish to simulate a vector of 10 independent, standard (i.e., mean 0 and standard deviation 1) normal random variables. We use the rnorm function for this purpose, and its defaults are mean=0 and standard deviation=1. Thus, we may simply type
Suppose we wish to simulate a large number of normal random variables with mean 10 and standard deviation 3, then check a histogram against two normal density functions, one based on the true parameters and one based on estimates, to see how it looks. We'll use 'col=2, lty=2, lwd=3' to make the curve based on the true parameters red (color=2), dashed (line type=2), and wider than normal (line width=3). Also note that we are requesting 100 bins in the histogram (nclass=100) and putting it on the same vertical scale as the density functions (freq=FALSE).
   z <- rnorm(200000, mean=10, sd=3)
   x <- seq(min(z),max(z),len=200)
   lines(x,dnorm(x, mean=10, sd=3),col=2, lty=2, lwd=3) 
We can find out what proportion of the deviates lie outside 3 standard deviations from the true mean, a common cutoff used by physical scientists. We can also see the true theoretical proportion:
In the first line above, we are using sum to count the number of TRUE's in the logical vector (abs((z-10)/3)>3). This works because logical values are coerced to 0's and 1's when necessary.

The function dnorm has a closed form: With mean=0 and sd=1, dnorm(x) equals exp(-x2/2)/sqrt(2*pi). By contrast, the CDF, given by pnorm, has no closed form and must be numerically approximated. By definition, pnorm(x) equals the integral of dnorm(t) as t ranges from minus infinity to x. To find a p-value (i.e., the probability of observing a statistic more extreme than the one actually observed), we use pnorm; to construct a confidence interval (i.e., a range of reasonable values for the true parameter), we use the inverse, qnorm.
The first line above summarizes the well-known 68, 95, 99.7 rule for normal distributions (these are the approximate proportions lying within 1, 2, and 3 standard deviations from the mean). The second line gives the critical values used to construct a 90% confidence interval for a parameter when its estimator is approximately normally distributed.

Let us now briefly consider an example of a discrete distribution, which means a distribution on a finite or countably infinite set (as opposed to a continuous distribution like the normal). The Poisson distribution, which has a single real-valued parameter lambda, puts all of its probability mass on the nonnegative integers. A Poisson distribution, often used to model data consisting of counts, has mean and variance both equal to lambda.
   k <- 0:10
   dpois(k,lambda=2.5) # or equivalently,
Next, simulate some Poisson variables:
   x <- rpois(10000,lambda=2.5)

We can also illustrate the CDF for these Poisson variables:

   x.un <- sort(unique(x))

The power-law or Pareto distribution

A commonly used distribution in astrophysics is the power-law distribution, more commonly known in the statistics literature as the Pareto distribution. The R package 'actuar' actually defines this distribution, but because it is an extremely simple distribution, we will simply write the necessary functions. (For the purposes of this tutorial, do not load the 'actuar' package.)

The density function for the Pareto is f(x)=aba/x(a+1) for x>b. Here, a and b are fixed positive parameters, where b is the minimum possible value. (Note: The 'actuar' package shifts x by b, so f(x)=aba/(x+b)(a+1) for x>0.) As an example, consider the log N = -1.5 * log S relationship, where S is the apparent brightness and N is the number of standard candles randomly located in transparent space. Thus, a Pareto distribution with (a+1) = 1.5 is a reasonable, if simplistic, model for the brightness of observed standard candles in space. The b parameter merely reflects the choice of units of measurement.

As another example, consider the Salpeter function, the simple but widely known expression of the initial mass function (IMF), in which the mass of a randomly selected newly formed star has a Pareto distribution with parameter a=1.35.

It turns out that a Pareto random variable is simply b*exp(X), where X is an exponential random variable with rate=a (i.e., with mean=1/a). However, rather than exploiting this simple relationship, we wish to build functions for the Pareto distribution from scratch. Our default values, which may be changed by the user, will be a=0.5 and b=1.
   dpareto <- function(x, a=0.5, b=1) a*b^a/x^(a+1)
Next, we integrate the density function to obtain the distribution function, which is F(x)=1-(b/x)a for x>=b (and naturally F(x)=0 for x < b):
   ppareto <- function(x, a=0.5, b=1) (x > b)*(1-(b/x)^a)
Note that (x > b) in the above function is coerced to numeric, either 0 or 1.

Inverting the distribution function gives the quantile function. The following simplistic function is wrong unless 0<u<1, so a better-designed function should do some error-checking.
   qpareto <- function(u, a=0.5, b=1) b/(1-u)^(1/a)
Finally, to simulate random Pareto random variables, we use the fact that whenever the quantile function is applied to a uniform random variable, the result is a random variable with the desired distribution:
   rpareto <- function(n, a=0.5, b=1) qpareto(runif(n),a,b)
Creating functions in R, as illustrated above, is a common procedure. Note that each of the arguments of a function may be given a default value, which is used whenever the user calls the function without specifying the value of this parameter. Also note that each of the above functions consists of only a single line; however, longer functions may be created by enclosing them inside curly braces { }.

A few simple plots

The commands below create plots related to the four functions just created.
   x <- seq(1,50,len=200)
   u <- seq(.005,.9,len=200)
   z <- rpareto(200)
   dotchart(log10(z), main="200 random logged Pareto deviates",
The above commands illustrate some of the many plotting capabilities of R. The par function sets many graphical parameters, for instance, 'mfrow=c(2,2)', which divides the plotting window into a matrix of plots, set here to two rows and two columns. In the plot commands, 'type' is set here to "l" for a line plot; other common options are "p" for points (the default), "b" for connected dots, and "n" for nothing (to create axes only). Other options used: 'lty' sets the line type (1=solid, 2=dashed, etc.), 'col' sets color (1=black, 2=red, 3=green, etc.), 'main' puts a string into the plot title, and 'cex.main' sets the text magnification. Type '? par' to see a list of the many plotting parameters and options.

A simulation study

Let us posit that the random variable X has a Pareto distribution with parameters a=1.35 and b=1. We will simulate multiple datasets with this property and then apply several different estimation methods to each. To simplify matters, we will assume that the value of b is known, so that the goal is estimation of a.

To evaluate the estimators, we will look at their mean squared error (MSE), which is just what it sounds like: The average of the squared distances from the estimates to the true parameter value of a=1.35.

To illustrate the estimators we'll evaluate, let's start by simulating a single dataset of size 100:
   d <- rpareto(100, a=1.35)
Here are the estimators we'll consider:
  1. The maximum likelihood estimator. Since the density with b=1 is given by f(x) = a/x^(a+1), the log likelihood function is
    l(a) = n log(a) - (a+1)(log x1 + log x2 + ... + log xn).
    The maximizer may be found using calculus to equal n/(log x1 + ... + log xn). For our dataset, this may be found as follows:


    We used the sum of logarithms above where we could have used the equivalent mathematical expression given by the log of the product. Sometimes the former method gives more numerically stable answers for very large samples, though in this case "100/log(prod(d))" gives exactly the same answer.

  2. The method of moments estimator. By integrating, we find that the mean of the Pareto distribution with b=1 is equal to a/(a-1). (This fact requires that a be greater than 1.) Setting a/(a-1) equal to the sample mean and solving for a gives 1/(1-1/samplemean) as the estimator.

  3. What we'll call the EDF (empirical distribution function) estimator. Since log(1-F(x)) equals -a log(x) when b=1, by plotting the sorted values of log(d) against log(n/n), log((n-1)/n), ..., log(1/n), we should observe roughly a straight line. We may then use least-squares regression to find the slope of the line, which is our estimate of -a:

       lsd <- log(sort(d))
       lseq <- log((100:1)/100)
       plot(lsd, lseq)
       tmp <- lm(lseq ~ lsd)


  4. What we'll call the unweighted histogram estimator. Since log f(x) equals log(a) - (a+1) log(x) when b=1, if we plot the values of log(d) against histogram-based estimates of the log-density function, we should observe roughly a straight line with slope -(a+1) and intercept log(a). Let's use only the slope, since that is the feature that is most often the focus of a plot that is supposed to illustrate a power-law relationship.
       hd <- hist(d,nclass=20,plot=F)
       counts <- hd$counts
       ldens <- log(hd$density[counts>0])
       lmidpts <- log(hd$mids[counts>0])
       plot(lmidpts, ldens)
       tmp <- lm(ldens~lmidpts)
  5. What we'll call the weighted histogram estimator. Exactly the same as the unweighted histogram estimator, but we'll estimate the slope using weighted least squares instead of ordinary least squares. The weights should be proportional to the bin counts.
       plot(lmidpts, ldens)
       tmp <- lm(ldens~lmidpts, 
Now let's write a single function that will take a vector of data as its argument, then return all five of these estimators.
   five <- function(d) {
      lsd <- log(sort(d))
      n <- length(d)
      lseq <- log((n:1)/n)
      m1 <- lm(lseq ~ lsd)$coef
      hd <- hist(d,nclass=n/5,plot=F)
      counts <- hd$counts
      ldens <- log(hd$density[counts>0])
      lmidpts <- log(hd$mids[counts>0])
      m2 <- lm(ldens~lmidpts)$coef
      m3 <- lm(ldens~lmidpts,
      out <- c(max.lik=1/mean(log(d)), ,
The very last line of the function, "return(out)", is the value that will be returned. (We could also have simply written "out".) Let's test this function on our dataset:
There is no good way to compare these estimators based on a single sample like this. We now need to simulate multiple samples. Let's begin by taking n=100.
   n.100 <- NULL
   for(i in 1:250) { 
      dd <- rpareto(100, a=1.35)
      n.100 <- rbind(n.100, five(dd))
Now we can get estimates of the biases of the estimators (their expectations minus the true parameter) and their variances. Note that we'll use the biased formula for the variance (i.e., the one that uses n instead of n-1 in the denominator) for a technical reason explained below.
   bias.100 <- apply(n.100,2,mean) - 1.35
   var.100 <- apply(n.100,2,var) * (249/250)
It is a mathematical identity that the mean squared error (MSE) equals the square of the bias plus the variance, as we may check numerically for (say) the first column of n.100. However, the identity only works if we use the biased formula for the variance, which is why we used the multiplier (249/250) above.
   bias.100[1]^2 + var.100[1]
Thus, we can construct the MSEs and view the results as follows:
   mse.100 <- bias.100^2 + var.100
   rbind(bias.100, var.100, mse.100)
Finally, let's repeat the whole experiment using samples of size 200.
   n.200 <- NULL
   for(i in 1:250) { 
      dd <- rpareto(200, a=1.35)
      n.200 <- rbind(n.200, five(dd))
   bias.200 <- apply(n.200,2,mean) - 1.35
   var.200 <- apply(n.200,2,var) * (249/250)
   mse.200 <- bias.200^2 + var.200
   rbind(bias.200, var.200, mse.200)

EM algorithms

The class of algorithms called EM algorithms is enormously important in statistics. There are many, many different specific algorithms that can be called EM algorithms, but they have this in common: They seek to iteratively maximize a likelihood function in a situation in which the data may be thought of as incompletely observed.

The name "EM algorithm" has its genesis in a seminal 1977 paper by Dempster, Laird, and Rubin in the Journal of the Royal Statistical Society, Series B. Many distinct algorithms published prior to 1977 were examples of EM, including the Lucy-Richardson algorithm for image deconvolution that is apparently quite well known in astronomy. The major contribution of Dempster et al was to unify these algorithms and prove certain facts about them. Interesting historical note: They even "proved" an untrue fact that was refuted six years (!) later (even thirty years ago, publications in statistics churned through the pipeline at a snail's pace).

We'll derive a simple EM algorithm on a toy example: We'll pretend that some of the gamma ray burst flux measurements were right-censored, as follows:
   cflux <- flux
   cflux[flux>=60] <- 60
   n <- length(cflux)
   yy <- (1:n)/n
The situation may be viewed like this: The complete dataset is a set of n observations from an exponential distribution with unknown mean mu, say, X1, ..., Xn. What we observe, however, is Z1, ..., Zn, where Zi is defined as min(Xi, 60). The log likelihood for the observed data is as follows:
   m <- sum(flux>=60)
   s <- sum(cflux)
   loglik <- function(mu)
As it turns out, this log likelihood function can be maximized explicitly:
   mle <- s/(n-m)
However, we will construct an EM algorithm anyway for two reasons: First, it is instructive to see how the EM operates. Second, not all censoring problems admit a closed-form solution like this one does!

We start by writing down the complete-data log-likelihood for the sample. This is straightforward because the complete data are simply a random sample from an exponential distribution with mean mu. Next, we pick a starting value of mu, say mu0. Then comes the tricky part: We take the conditional expectation of the complete data log likelihood, conditional on the observed data and assuming that mu0 is the correct parameter. (This will have to happen on the chalkboard!) The result is a function of both mu and mu0, and construction of this function is called the E (expectation) step. Next, we maximize this function over mu, which is called the M (maximization) step. The result of the maximization becomes our next iterate, mu1, and the process repeats.

Let's start with mu0=20. Carrying out the calculations described above yields the following iterative scheme:
   mu <- 20
   mu <- s/n + m*mu/n; loglik(mu)
   mu <- s/n + m*mu/n; loglik(mu)
   # repeat the last line a few times
Notice that the value of the (observed data) log likelihood increases at each iteration. This is the fundamental property of any EM algorithm! In fact, it is very helpful when debugging computer code, since there must be a bug somewhere whenever the log-likelihood is ever observed to decrease. Notice also that the value of mu has converged to the true maximum likelihood estimator after a few iterations.

An EM algorithm for a mixture: A simple case

Let's try a two-component mixture of normals model on the quasar absorption line dataset that Tom showed in his lecture:
   qso <- scan("", 
      skip=1, nlines=104)[2*(1:104)]
As a Ph.D. student, I developed the R package 'mixtools', which consists of various functions for analyzing mixture models.  The functions in the package originally dealt with analyzing mixtures of regressions, but has since grown to include a wide array of other mixture procedures.  First, let us load the 'mixtools' package and look at the corresponding help file:

   install.packages("mixtools",lib="V:/") # lib=... is not always necessary!                  
   library(mixtools, lib.loc="V:/")

Let us also look at the function to implement an EM algorithm for a mixture of normals:


Notice that if you specify 'arbmean=FALSE' or 'arbvar=FALSE', you can get a scale mixture of normals or a location mixture of normals, respectively (a scale mixture is when the component means are all the same and a location mixture is when the component variances are all the same).  In addition, you can specify starting values as well as the number of components you are interested in fitting (by using the option 'k'). 

Let's look at a histogram of the quasar absorption line dataset:

   hist(qso, nclass=20)

It appears that a location mixture of normals may be appropriate.  In fact, we can perform a formal chi-square test where the null hypothesis is the component variances are the same versus the alternative that they are different.  This can be accomplished by:

   test.equality(qso, lambda=c(.9,.1), mu=c(1,.6), sigma=.001, arbvar=FALSE)

Thus, we proceed with fitting a scale mixture of normals:

   out <- normalmixEM(qso, lambda=c(.9,.1), mu=c(1,.6), sigma=.001, arbvar=FALSE)
   hist(qso, nclass=20)
   plot(out, density=TRUE)

This seems to converge to a sensible solution. However, try some other starting values (or let the algorithm generate them for you) and see what happens. If you find a different solution (i.e., a different local maximum), how does its log-likelihood value compare to that of the first solution?