grb <- read.table("http://astrostatistics.psu.edu/datasets/GRB_afterglow.dat",
header=T, skip=1)
flux <- grb[,2]
hist(flux)
The histogram suggests that the univariate distribution has roughly
the shape of an exponential distribution (we'll speak more about what this
means later). Let us replot these data in a particular (and particularly
common) manner besides the histogram
that is also suggestive of an exponential distribution.n <- length(flux) xx <- sort(flux) yy <- (1:n)/nWe could now obtain the empirical cdf by connecting the (xx,yy) points using a stair-step pattern. However, we'll look at these points slightly differently.
plot(xx, log(1-yy+1/n), xlab="flux",
ylab="log(1-F(flux))")
You may recall from the EDA and regression
tutorial that this plot looks like the plot of time vs. flux that
we produced as part of a regression analysis. This is only a coincidence;
the two plots are fundamentally different. The plot seen here is
a univariate plot, whereas the time vs. flux plot was a
bivariate plot. We are ignoring the time variable here.mean (flux) # This is the MLE m1 <- lm(log(1-yy+1/n) ~ xx) m1 -1/m1$coef[2] # An alternative estimatorThere is a possible third method that I am told is sometimes used for some kinds of distributions. We start with a histogram, which may be viewed as a rough approximation of the density:
h <- hist(flux)All of the information used to produce the histogram is now stored in the h object, including the midpoints of the bins and their heights on a density scale (i.e., a scale such that the total area of the histogram equals one).
counts <- h$counts dens <- h$density[counts>0] midpts <- h$mids[counts>0] plot(midpts, log(dens))When using linear regression to estimate the slope of the linear pattern just produced, I am told that it is standard to weight each point by the number of observations it represents, which is proportional to the reciprocal of the variance of the estimated proportion of the number of points in that bin. We can obtain both the weighted and unweighted versions here. We can then obtain an estimate of mu using either the intercept, which is -log(mu), or the slope, which is -1/mu:
m1 <- lm(log(dens) ~ midpts)
m2 <- lm(log(dens) ~ midpts,
weights=counts[counts>0])
exp(-m1$coef[1]) # This is one estimate
-1/m1$coef[2] # This is another
exp(-m2$coef[1]) # Yet another
-1/m2$coef[2] # And another
We have thus produced no fewer than six different estimators of mu (actually seven,
except that the MLE and the method of moments estimator are the same in this case).
How should we choose one of them?mu.hat <- mean (flux) # The MLE found earlierThe standard error of mu.hat, based on the Fisher information, is the square root of the inverse of the Fisher information evaluated at mu.hat:
sqrt(mu.hat/n) # SE based on (expected) information mu.hat + 1.96 * c(-1,1) * sqrt(mu.hat/n) # approx. 95% CIThe Fisher information calculated above is sometimes called the expected information because it involves an expectation. As an alternative, we can use what is called the observed information, which is the negative second derivative of the log-likelihood function. In this example, the log-likelihood function evaluated at the estimate mu.hat is equal to
-n * log(mu.hat) - sum(flux) / mu.hatand the negative second derivative of this function, evaluated at mu.hat, is
-n / mu.hat^2 + 2*sum(flux) / mu.hat^3 # observed information at MLENotice in this case (though not in every model) that the observed information evaluated at the MLE is equal to the expected information evaluated at the MLE:
n / mu.hat^2 # expected information at MLEDo you see why?
help.search("distribution",package="stats")
Let's consider the well-known normal distribution as an example:
?NormalThe four functions 'rnorm', 'dnorm, 'pnorm', and 'qnorm' give random normals, the normal density (sometimes called the differential distribution function), the normal cumulative distribution function (CDF), and the inverse of the normal CDF (also called the quantile function), respectively. Almost all of the other distributions have similar sets of four functions. The 'r' versions are rbeta, rbinom, rcauchy, rchisq, rexp, rf, rgamma, rgeom, rhyper, rlogis, rlnorm, rmultinom, rnbinom, rnorm, rpois, rsignrank, rt, runif, rweibull, and rwilcox (there is no rtukey because generally only ptukey and qtukey are needed). Additional distributions are available in other packages.
rnorm(10)Suppose we wish to simulate a large number of normal random variables with mean 10 and standard deviation 3, then check a histogram against two normal density functions, one based on the true parameters and one based on estimates, to see how it looks. We'll use 'col=2, lty=2, lwd=3' to make the curve based on the true parameters red (color=2), dashed (line type=2), and wider than normal (line width=3). Also note that we are requesting 100 bins in the histogram (nclass=100) and putting it on the same vertical scale as the density functions (freq=FALSE).
z <- rnorm(200000, mean=10, sd=3) hist(z,freq=FALSE,nclass=100) x <- seq(min(z),max(z),len=200) lines(x,dnorm(x, mean=10, sd=3),col=2, lty=2, lwd=3) lines(x,dnorm(x,mean=mean(z),sd=sqrt(var(z))))We can find out what proportion of the deviates lie outside 3 standard deviations from the true mean, a common cutoff used by physical scientists. We can also see the true theoretical proportion:
sum(abs((z-10)/3)>3)/length(z) 2*pnorm(-3)In the first line above, we are using sum to count the number of TRUE's in the logical vector (abs((z-10)/3)>3). This works because logical values are coerced to 0's and 1's when necessary.
pnorm(1:3)-pnorm(-(1:3)) qnorm(c(.05,.95))The first line above summarizes the well-known 68, 95, 99.7 rule for normal distributions (these are the approximate proportions lying within 1, 2, and 3 standard deviations from the mean). The second line gives the critical values used to construct a 90% confidence interval for a parameter when its estimator is approximately normally distributed.
k <- 0:10 dpois(k,lambda=2.5) # or equivalently, exp(-2.5)*2.5^k/factorial(k)Next, simulate some Poisson variables:
x <- rpois(10000,lambda=2.5) table(x) mean(x) var(x)
We can also illustrate the CDF for these Poisson variables:
hist(x,freq=F) x.un <- sort(unique(x)) x.cdf=ppois(x.un,lambda=2.5) x.sim=cumsum(table(x)/10000) cbind(x.sim,x.cdf)
dpareto <- function(x, a=0.5, b=1) a*b^a/x^(a+1)Next, we integrate the density function to obtain the distribution function, which is F(x)=1-(b/x)a for x>=b (and naturally F(x)=0 for x < b):
ppareto <- function(x, a=0.5, b=1) (x > b)*(1-(b/x)^a)Note that (x > b) in the above function is coerced to numeric, either 0 or 1.
qpareto <- function(u, a=0.5, b=1) b/(1-u)^(1/a)Finally, to simulate random Pareto random variables, we use the fact that whenever the quantile function is applied to a uniform random variable, the result is a random variable with the desired distribution:
rpareto <- function(n, a=0.5, b=1) qpareto(runif(n),a,b)Creating functions in R, as illustrated above, is a common procedure. Note that each of the arguments of a function may be given a default value, which is used whenever the user calls the function without specifying the value of this parameter. Also note that each of the above functions consists of only a single line; however, longer functions may be created by enclosing them inside curly braces { }.
par(mfrow=c(2,2))
x <- seq(1,50,len=200)
plot(x,dpareto(x),type="l")
plot(x,ppareto(x),type="l",lty=2)
u <- seq(.005,.9,len=200)
plot(u,qpareto(u),type="l",col=3)
z <- rpareto(200)
dotchart(log10(z), main="200 random logged Pareto deviates",
cex.main=.7)
par(mfrow=c(1,1))
The above commands illustrate some of the many plotting capabilities of R.
The
par
function sets many graphical
parameters, for instance, 'mfrow=c(2,2)', which divides the plotting window into a matrix
of plots, set here to two rows and two columns. In the
plot
commands, 'type' is set here to
"l" for a line plot; other common options are "p" for points (the default),
"b" for connected dots, and "n" for nothing (to create
axes only). Other options used:
'lty' sets the line type (1=solid, 2=dashed, etc.), 'col' sets color (1=black, 2=red,
3=green, etc.), 'main' puts a string into the plot title, and 'cex.main' sets the text magnification.
Type
'? par'
to see a list of the many plotting parameters and options.
d <- rpareto(100, a=1.35)Here are the estimators we'll consider:
hd <- hist(d,nclass=20,plot=F) counts <- hd$counts ldens <- log(hd$density[counts>0]) lmidpts <- log(hd$mids[counts>0]) plot(lmidpts, ldens) tmp <- lm(ldens~lmidpts) abline(tmp,col=2) -1-as.numeric(tmp$coef[2])
plot(lmidpts, ldens)
tmp <- lm(ldens~lmidpts,
weights=counts[counts>0])
abline(tmp,col=2)
-1-as.numeric(tmp$coef[2])
five <- function(d) {
lsd <- log(sort(d))
n <- length(d)
lseq <- log((n:1)/n)
m1 <- lm(lseq ~ lsd)$coef
hd <- hist(d,nclass=n/5,plot=F)
counts <- hd$counts
ldens <- log(hd$density[counts>0])
lmidpts <- log(hd$mids[counts>0])
m2 <- lm(ldens~lmidpts)$coef
m3 <- lm(ldens~lmidpts,
weights=counts[counts>0])$coef
out <- c(max.lik=1/mean(log(d)),
meth.mom=1/(1-1/mean(d)),
EDF=-as.numeric(m1[2]),
unwt.hist=-1-as.numeric(m2[2]),
wt.hist=-1-as.numeric(m3[2]))
return(out)
}
The very last line of the function, "return(out)", is the value that will be returned.
(We could also have simply written "out".)
Let's test this function on our dataset:
five(d)There is no good way to compare these estimators based on a single sample like this. We now need to simulate multiple samples. Let's begin by taking n=100.
n.100 <- NULL
for(i in 1:250) {
dd <- rpareto(100, a=1.35)
n.100 <- rbind(n.100, five(dd))
}
Now we can get estimates of the biases of the estimators
(their expectations minus
the true parameter) and their variances. Note that we'll use the
biased formula for the variance (i.e., the one that uses n instead of n-1
in the denominator) for a technical reason explained below.
bias.100 <- apply(n.100,2,mean) - 1.35 var.100 <- apply(n.100,2,var) * (249/250)It is a mathematical identity that the mean squared error (MSE) equals the square of the bias plus the variance, as we may check numerically for (say) the first column of n.100. However, the identity only works if we use the biased formula for the variance, which is why we used the multiplier (249/250) above.
mean((n.100[,1]-1.35)^2) bias.100[1]^2 + var.100[1]Thus, we can construct the MSEs and view the results as follows:
mse.100 <- bias.100^2 + var.100 rbind(bias.100, var.100, mse.100)Finally, let's repeat the whole experiment using samples of size 200.
n.200 <- NULL
for(i in 1:250) {
dd <- rpareto(200, a=1.35)
n.200 <- rbind(n.200, five(dd))
}
bias.200 <- apply(n.200,2,mean) - 1.35
var.200 <- apply(n.200,2,var) * (249/250)
mse.200 <- bias.200^2 + var.200
rbind(bias.200, var.200, mse.200)
cflux <- flux cflux[flux>=60] <- 60 n <- length(cflux) yy <- (1:n)/n plot(sort(cflux),log(1-yy+1/n))The situation may be viewed like this: The complete dataset is a set of n observations from an exponential distribution with unknown mean mu, say, X1, ..., Xn. What we observe, however, is Z1, ..., Zn, where Zi is defined as min(Xi, 60). The log likelihood for the observed data is as follows:
m <- sum(flux>=60)
s <- sum(cflux)
loglik <- function(mu)
-(n-m)*log(mu)-s/mu
As it turns out, this log likelihood function can be maximized explicitly:
mle <- s/(n-m)However, we will construct an EM algorithm anyway for two reasons: First, it is instructive to see how the EM operates. Second, not all censoring problems admit a closed-form solution like this one does!
mu <- 20 loglik(mu) mu <- s/n + m*mu/n; loglik(mu) mu <- s/n + m*mu/n; loglik(mu) # repeat the last line a few timesNotice that the value of the (observed data) log likelihood increases at each iteration. This is the fundamental property of any EM algorithm! In fact, it is very helpful when debugging computer code, since there must be a bug somewhere whenever the log-likelihood is ever observed to decrease. Notice also that the value of mu has converged to the true maximum likelihood estimator after a few iterations.
qso <- scan("http://www.astrostatistics.psu.edu/datasets/QSO_absorb.txt",
skip=1, nlines=104)[2*(1:104)]
As a Ph.D. student, I developed the R package 'mixtools', which consists of
various functions for analyzing mixture models. The functions in the
package originally dealt with analyzing mixtures of regressions, but has since
grown to include a wide array of other mixture procedures. First, let us
load the 'mixtools' package and look at the corresponding help file:
install.packages("mixtools",lib="V:/")
# lib=... is not always necessary!
library(mixtools, lib.loc="V:/")
help(package="mixtools")
Let us also look at the function to implement an EM algorithm for a mixture of normals:
?normalmixEM
Notice that if you specify 'arbmean=FALSE' or 'arbvar=FALSE', you can get a scale mixture of normals or a location mixture of normals, respectively (a scale mixture is when the component means are all the same and a location mixture is when the component variances are all the same). In addition, you can specify starting values as well as the number of components you are interested in fitting (by using the option 'k').
Let's look at a histogram of the quasar absorption line dataset:
hist(qso, nclass=20)
It appears that a location mixture of normals may be appropriate. In fact, we can perform a formal chi-square test where the null hypothesis is the component variances are the same versus the alternative that they are different. This can be accomplished by:
test.equality(qso, lambda=c(.9,.1), mu=c(1,.6), sigma=.001, arbvar=FALSE)
Thus, we proceed with fitting a scale mixture of normals:
out <- normalmixEM(qso, lambda=c(.9,.1), mu=c(1,.6), sigma=.001, arbvar=FALSE) out[2:5] hist(qso, nclass=20) plot(out, density=TRUE)
This seems to converge to a sensible solution. However, try some other starting values (or let the algorithm generate them for you) and see what happens. If you find a different solution (i.e., a different local maximum), how does its log-likelihood value compare to that of the first solution?