Let's generate from a bivariate normal distribution in which the standard deviations of the components are 2 and 3 where the correlation between the components is -1/2. For simplicity, let the mean of the vectors be the origin. We need to figure out what the covariance matrix looks like.

The diagonal elements of the covariance matrix are the marginal variances, namely 4 and 9. The off-diagonal element is the covariance, which equals the correlation times the product of the marginal standard deviations, or -3:

sigma <- matrix(c(4,-3,-3,9),2,2) sigmaWe now seek to find a matrix M such that M times its transpose equals sigma. There are many matrices that do this; one of them is the transpose of the Cholesky square root:

M <- t(chol(sigma)) M %*% t(M)We now recall that if Z is a random vector and M is a matrix, then the covariance matrix of MZ equals M cov(Z) M

Z <- matrix(rnorm(400),2,200) # 2 rows, 200 columns X <- t(M %*% Z)The transpose above is taken so that X becomes a 200x2 matrix, since R prefers to have the columns as the vector components rather than the rows. Let us now plot the randomly generated normals and find the sample mean and covariance.

plot(X) Xbar <- apply(X,2,mean) S <- cov(X)We can compare the S matrix with the sigma matrix, but it is also nice to plot an ellipse to see what shape these matrices correspond to. The car package, which we used in the EDA and regression tutorial, has the capability to plot ellipses. You might not need to run the install.packages function below since this package may already have been installed during the earlier tutorials. However, the library function is necessary.

install.packages("car",lib="V:/") library(car,lib.loc="V:/")To use the ellipse function in the car package, we need the center (mean), shape (covariance), and the radius. The radius is the radius of a circle that represents the "ellipse" for a standard bivariate normal distribution. To understand how to provide a radius, it is helpful to know that if we sum the squares of k independent standard normal random variables, the result is (by definition) a chi-squared random variable on k degrees of freedom. Thus, for a standard bivariate normal vector, the squares of the radii should be determined by the quantiles of the chi-squared distribution on 2 degrees of freedom. Let us then construct an ellipse with radius based on the median of the chi-squared distribution. Thus, this ellipse should contain roughly half of the points generated. We'll also produce a second ellipse, based on the true mean and covariance matrix, for purposes of comparison.

ellipse(Xbar, S, sqrt(qchisq(.5,2))) ellipse(c(0,0), sigma, sqrt(qchisq(.5,2)), col=3, lty=2)

combo <- read.csv("http://astrostatistics.psu.edu/datasets/COMBO17.csv",na.strings="NA") names(combo) attach(combo) xy <- cbind(BjMAG, S280MAG-BjMAG) xy <- xy[Mcz>.7 & Mcz<.9,] xy <- na.omit(xy) plot(xy[,1],xy[,2], pch=20, xlab=expression(M[B]), ylab=expression("(280-B)"[rest]), main="z between 0.7 and 0.9", cex.lab=1.5)Notice that the plot command uses the expression function. This is one way to put mathematical notation on an R plot. See the help function for plotmath for details.

To duplicate the covariance and correlation matrices seen in the lecture, let us consider a subset of the 65 combo variables:

subcombo <- combo[,c(2,5,6,8,9,10,12,14)] names(subcombo) var(subcombo) # Covariance matrix cor(subcombo) # Correlation matrixNow let's do a quick principal components analysis using the singular value decomposition. First, we obtain the SVD:

s <- svd(subcombo) names(s) U <- s$u V <- s$v D <- diag(s$d)There are three matrices that make up the SVD: U, D, and V. The U and V matrices are 3462x8 and 8x8 matrices, respectively, and they are both orthogonal. We can check the orthogonality of U and V by simply multiplying:

t(U) %*% U t(V) %*% VThe SVD has the property that X (in this case, subcombo) is equal to U %*% D %*% t(V):

max(abs(subcombo - U %*% D %*% t(V))) # Should be zeroNow let's obtain the principal components scores. To do this, we must first center each column of the subcombo matrix. To do this, we'll use the sweep function:

cent.subcombo <- sweep(subcombo,2,apply(subcombo,2,mean))To check that this has worked, let's verify that the mean of each column of cent.subcombo equals zero:

apply(cent.subcombo,2,mean)We don't need the old 's' object any more, so let's rename it now so that it is the SVD for the centered dataset:

s <- svd(cent.subcombo) U <- s$u V <- s$v D <- diag(s$d)To obtain the principal component scores for each galaxy, we merely multiply U by D.

pcscores <- U %*% D plot(pcscores[,1:2])Each of the first two principal components (plotted above) is a linear combination of the original eight variables. To find out what these linear combinations are, we may examine the V matrix. Remember, if we multiply the principal component scores by t(V), we obtain the original centered dataset.

As a refresher for mixtures, suppose there exist k multivariate normal distributions (subpopulations). To select our sample, someone in a closed room first rolls a k-sided die (not necessarily a fair die), then selects a random member of the subpopulation indicated by the die. The

To take a simple example, suppose you are given a dataset consisting

Note here that model-based clustering using mixture models is not the same thing as discriminant analysis, in which we are given not only observations but also their known class memberships. The goal in discriminant analysis is to build a rule for classifying future observations based on a training sample, whereas clustering usually has broader goals than this: To discover the classes in the first place.

There is a CRAN package that does model-based clustering assuming normal distributions. It is called mclust. There is an updated version of mclust on CRAN, but we'll use an older version called mclust02.

install.packages("mclust02",lib="V:/") # The lib= part might not be needed library(mclust02,lib.loc="V:/")Let's first fit a two-component normal mixture model (i.e., search for two multivariate normal clusters).

mc2 <- Mclust(xy, minG=2, maxG=2) names(mc2) mc2$mu mc2$sigmaLet's take a look at 50% ellipses of the clustering solution to see what the solution "looks like".

plot(xy[,1],xy[,2], pch=20, xlab=expression(M[B]), ylab=expression("(280-B)"[rest]), main="z between 0.7 and 0.9", cex.lab=1.5) r <- sqrt(qchisq(.5,2)) for(i in 1:mc2$G) ellipse(mc2$mu[,i], mc2$sigma[,,i], r, col=1+i)Is two components the "best" solution? There is no best answer to this question in general, but we can do things like consider model selection criteria to try to decide. By default, the Mclust uses BIC to search for a best model from 1 to 9 components:

mc <- Mclust(xy) mc$GSo let's see what the three-component solution looks like:

plot(xy[,1],xy[,2], pch=20, xlab=expression(M[B]), ylab=expression("(280-B)"[rest]), main="z between 0.7 and 0.9", cex.lab=1.5) for(i in 1:mc$G) ellipse(mc$mu[,i], mc$sigma[,,i], r, col=1+i)

quas <- read.table("http://astrostatistics.psu.edu/datasets/SDSS_quasar.dat",header=T) dim(quas)We want to get rid of the missing values. However, in this case missing values create more havoc than usual due to the fact that we will be working with covariance matrices. Thus, we will eliminate all rows with missing values:

quas[quas==0 | quas==-1 | quas==-9] <- NA quas <- na.omit(quas) dim(quas)This leaves us with a much smaller dataset, but for purposes of illustration it will serve well. Once the principal component loadings are determined, we can then apply these loadings, or a simplified version thereof, to the

In principal components analysis, we wish to reduce the number of variables. The method is to find the "best" linear combinations of all existing variables. To understand what is "best" in this context, consider the 22 quantitative measurement columns in the quas dataset (the first column is the SDSS designation of the object). Each row may be considered to be a point in 22-dimensional Euclidean space. Thus, the entire dataset consists of a cloud of 279 22-dimensional points. The "best" linear combination here will be the single vector in 22-space parallel to which the variance of these 279 points is the greatest. The second-best will be the single vector orthogonal to the first along which the variance is the greatest, and so on.

We will implement principal components in R using two distinct approaches. One approach is to use the princomp function. Another is to obtain the same results from scratch using an eigenvalue decomposition. We will use the former approach for analysis and interpretation; the latter approach is presented only to help you understand how the method works mathematically.

To create a single object containing all the principal components information you will need, type

pc <- princomp(quas[,-1])Note that we omit the first column from the analysis since it is not a quantitative measurement.

Let's see what kind of information is carried in pc.

names(pc) ?princompBefore explaining what each of these things means, let's briefly show how to obtain the important bits, namely pc$sdev and pc$loadings, from scratch using an eigenvalue/eigenvector decomposition of the sample covariance matrix. The square roots of the eigenvalues give pc$sdev and the matrix of normalized eigenvectors gives pc$loadings. (Note, however, that a normalized eigenvector is still a normalized eigenvector if multiplied by -1; therefore, some of the columns of the eigenvector matrix differ from the corresponding columns of pc$loadings by a sign change.)

In other words, it is possible to reconstruct all of the information in pc by using

s <- cov(quas[,-1]) es <- eigen(s)One may compare sqrt(es$val) with pc$sdev and es$vec with pc$load to verify that they are the same except for sign changes in some columns of pc$load.

If one invokes the princomp command with cor=TRUE, then the eigen decomposition is performed on the correlation matrix, obtained via cor(quas[,-1]), rather than the covariance matrix. Which method is more appropriate in this case? To answer this question, let's examine the standard deviations of the columns of quas:

apply(quas[,-1],2,sd)Note that the variation of the R.A and Dec. columns is far larger than that of any other column. Thus, we should not be surprised if these two columns dominate the first two principal components. In fact, since these two columns together with z give position of the object, we might want to extract them from the principal components analysis altogether, retaining them unchanged in the reduced dataset. However, we could essentially put all variables on an equal footing in terms of variability by using the correlation rather than the covariance (this is equivalent to standardizing each of the variables to have standard deviation equal to 1 before performing the princomp analysis). In the following development, we use essentially the first approach.

The two most important pieces of information in a principal components analysis are the variances explained (eigenvalues) and variable loadings (eigenvectors). The former may be viewed graphically using a technique called a screeplot:

screeplot(pc)In the above plot, we see that the variance of the first two PCs dwarfs the others. To see what this means, we must look at the loadings for these first two PCs:

loadings(pc)This last command prints a lot of information. Scroll up to see the loadings of components 1 and 2, with any loadings less than 0.1 in absolute value suppressed as unimportant. (In reality, the loadings for the first principal component are a vector of real numbers, scaled so that the sum of their squares equals 1. Each element in the vector gives the relative weight of the corresponding variable.)

To see

pc$load[,1]We may conclude from the output above that the first principal component consists essentially of nothing other than R.A (recall that the standard deviation of R.A was much larger than that of the other variables, so this result is really not surprising).

It is also unsurprising to see that the second principal component consists almost entirely of the Declination:

pc$load[,2]These two principal components together comprise over 99.8% of the total variance of these variables, which makes it difficult to see easily the effect of the remaining principal components. As explained earlier, one way to deal with the problem of variables on vastly different scales is by analyzing the correlation matrix rather than the covariance matrix. However, in this case, the two variables causing the trouble are easy to identify; thus, we'll proceed by performing a new principal components analysis on the remaining columns of quas after R.A and Dec. are removed:

pc2 <- princomp(quas[,-(1:3)]) screeplot(pc2)In the new screeplot, we see three or four PCs with relatively large variance, one to four PCs with moderate variance, and the rest with relatively small variance. Let's see what the variable loadings for these first five PCs are:

loadings(pc2)Again, it is necessary to scroll up to see the important output.

Examining these loadings, the first PC is somewhat difficult to interpret, but the second is basically an average of all the "_mag" variables. Notably, the three variables (u_mag, g_mag, r_mag) always occur with roughly the same weights in the first few PCs, indicating that we may replace these three with a single variable equal to their mean. The same is true of (i_mag, z_mag) and (J_mag, H_mag, K_mag). We could thus reduce these 8 variables to 3.

Another approach is to analyze only the principal component scores themselves, which are contained in pc$scores. This 279x22 matrix contains exactly the same information as the original dataset, but the axes have been rotated so that the first axis is the most important in explaining information, followed by the second, etc. Based on our analysis, only 5 or 6 of these PCs should be very variable.

pairs(pc2$scores[,1:6],pch=".")The drawback to the above plots, of course, is that many of them are difficult to interpret.

A biplot for a principal components analysis is a way of seeing both the PC scores and the factor loadings simultaneously.

biplot(pc2, choices=1:2)In summary, principal components provides an objective way to decide, based on data alone, how to reduce the dimensionality of a dataset to ease interpretability. However, substantive astronomical knowledge should be at least as important as such considerations (e.g., if M_i is known to be important, then maybe it should be kept regardless of what PC analysis says).

We will analyze the Shapley galaxy dataset, which may be downloaded by typing

shap <- read.table("http://www.astrostatistics.psu.edu/datasets/Shapley_galaxy.dat",head=T)Let us have a look:

dim(shap) names(shap) pairs(shap,pch=46)It looks like we have to deal with some missing Mag observations in column 3:

shap[shap[,3]==0,3] <- NA pairs(shap,pch=46)Next, let's make a rough cut using the V variable and color those points red:

attach(shap) a <- V>12000 & V<16000 pairs(shap,pch=46,col=1+a)We'd like to search for clusters in space. Let's plot R.A against Dec. and use different colors for different V values: black, red, green, and blue for V/1000 in (12,13), (13,14), (14,15), and (15,16), respectively.

plot(shap[a,1:2], pch=20, col=V[a]%/%1000 - 12)Now we begin to apply some clustering techniques. Most of these are contained in the cluster package, which must be loaded. We will consider the technique called agnes (agglomerative nesting) first.

library(cluster) ?agnesThere are two options of agnes that we care about: "stand" and "method". The first of these should be set to TRUE if we want agnes to standardize the variables before clustering. Let's decide whether this makes sense by checking the variability of each variable (first, we'll reduce the dataset to just those variables and quasars we want to use for the clustering):

shap2 <- shap[a,c(1,2,4)] apply(shap2,2,sd)We see that because of the units used, the V variable has much higher variance than the other two variables. Therefore, if we were to apply agnes using "stand=FALSE", we would essentially be clustering only on V, which would not be very interesting. One solution here is to convert the (R.A, Dec., V) points into (x, y, z) points in Euclidean space. Another, slightly rougher, solution is simply to use "stand=TRUE", which is what we'll do here:

ag <- agnes(shap2,stand=TRUE)Note that we have used the default value of "method", which is "average". Let's take a look at a dendrogram:

plot(ag,which=2)You can see the plotting options for an R object of class "agnes" by reading the help file for plot.agnes. You can also see what information is included in the ag object by looking at the help file for agnes.object.

The dendrogram is hard to read near the leaves because there are 1629 observations in this dataset. To obtain the order in which the original observations must be rearranged to obtain the order seen in the dendrogram, look at ag$order. Since we don't really care about retaining the original order of the galaxies, let's simply reorder them as in the dendrogram:

shap2 <- shap2[ag$order,]In ag$height, there are 1628 values, one for each of the merges. (Each merge reduces the number of clusters by one, so in reducing from 1629 observations to 1 cluster we must have 1628 merges.) We can use these values to determine where to make cuts in the dataset: Whenever ag$height[i] is high enough, we should make a cut between observations i and i+1.

Let's try making a cut at a height of 4:

pltree(ag) abline(col=2, h=4) which(ag$hei>4) abline(col=2, v=.5+ which(ag$hei>4))Although the four cuts identified by the which function result in 5 clusters, several of these clusters consist of only a few observations and could perhaps be lumped together with other clusters. On the other hand, using a height cutoff of 3.5 instead of 4 leads to four good-sized clusters:

which(ag$hei>3.5)Let's produce a categorical variable for the clusters:

agclust <- cut(1:1629,lab=1:4, breaks <- c(0,188,1278,1535,1629)) table(agclust)Now we may use this categorical variable to add color to a pairs plot:

pairs(shap2,pch=20, col=as.numeric(agclust))Note that the factor agclust must be explicitly coerced to a numeric vector using as.numeric.

Above, we used the method="average" option. Two other commonly used options are "single", which tends to produce stringy clusters because two clusters are considered as close as their closest elements; and "complete", which is the opposite of "single" in the sense that two clusters are considered as close as the most distant elements.

There are many clustering/partitioning algorithms, far more than we can present here. One way to see the many options in R is to look at the list of functions for the cluster package. There are also a couple of clustering algorithms in the standard R package, namely hierarchical clustering (hclust) and k-means clustering (kmeans).