Multivariate Techniques

This module concerns two general multivariate techniques, principal components and (various methods of) clustering.


Principal components

In this module, we return first to the SDSS quasar dataset.

   quas=read.table( "http://astrostatistics.psu.edu/datasets/SDSS_quasar.dat",
      header=T)
   dim(quas)

As before, we want to get rid of the missing values. However, missing values create more havoc than usual here because we will be working with covariance matrices. Thus, we will eliminate all rows containing missing values:

   quas[quas==0 | quas==-1 | quas==-9]=NA
   quas=quas[!apply(quas,1,
      function(a) any(is.na(a))),]
   dim(quas)

This leaves us with a much smaller dataset, but for purposes of illustration it will serve well. Once the principal component loadings are determined, we can then apply these loadings, or a simplified version thereof, to the whole dataset.

In the exploratory data analysis and regression module, we reduced the number of variables by simply getting rid of some of the columns of quas. In principal components analysis, the goal is similar: We wish to reduce the number of variables. However, the method is different: We will find the "best" linear combinations of all existing variables.

To understand what is "best" in this context, consider the 22 quantitative measurement columns in the quas dataset (the first column is the SDSS designation of the object). Each row may be considered to be a point in 22-dimensional Euclidean space. Thus, the entire dataset consists of a cloud of 279 22-dimensional points. The "best" linear combination here will be the single vector in 22-space parallel to which the variance of these 279 points is the greatest. The second-best will be the single vector orthogonal to the first along which the variance is the greatest, and so on.
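
As a toy two-dimensional illustration of this idea (a sketch with simulated data, separate from the quasar analysis), the projection of a correlated point cloud onto the 45-degree direction has a larger variance than either original coordinate:

   set.seed(1)
   x=rnorm(200); y=x+rnorm(200,sd=0.4)   # points scattered about the line y=x
   plot(x,y,asp=1)
   var(x); var(y)                        # variance along each original axis
   var((x+y)/sqrt(2))                    # larger: variance along the 45-degree direction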

We will implement principal components in R using two distinct approaches. One approach is to use the princomp function. Another is to obtain the same results from scratch using an eigenvalue decomposition. We will use the former approach for analysis and interpretation; the latter approach is presented only to help you understand how the method works mathematically.

To create a single object containing all the principal components information you will need, type

   pc=princomp(quas[,-1])

Note that we omit the first column from the analysis since it is not a quantitative measurement.

Let's see what kind of information is carried in pc.

   names(pc)
   ?princomp

Before explaining what each of these things means, let's briefly show how to obtain the important bits, namely pc$sdev and pc$loadings, from scratch using an eigenvalue/eigenvector decomposition of the sample covariance matrix. The square roots of the eigenvalues give pc$sdev and the matrix of normalized eigenvectors gives pc$loadings. (Note, however, that a normalized eigenvector is still a normalized eigenvector if multiplied by -1; therefore, some of the columns of the eigenvector matrix differ from the corresponding columns of pc$loadings by a sign change.)

In other words, it is possible to reconstruct all of the information in pc by using

   s=cov(quas[,-1])
   es=eigen(s)

One may compare sqrt(es$val) with pc$sdev and es$vec with pc$load to verify that they agree, apart from sign changes in some columns of pc$load and a factor of sqrt((n-1)/n) in the standard deviations (princomp divides by n rather than n-1 when forming the covariance matrix).
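
As a quick numerical check using the objects created above (a sketch), one can confirm both relationships directly; the factor (n-1)/n is the divisor correction just mentioned:

   n=nrow(quas)
   max(abs(sqrt(es$val*(n-1)/n)-pc$sdev))       # should be essentially zero
   max(abs(abs(es$vec)-abs(unclass(pc$load))))  # should be essentially zero (signs aside)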

If one invokes the princomp command with cor=TRUE, then the eigen decomposition is performed on the correlation matrix, obtained via cor(quas[,-1]), rather than the covariance matrix. Which method is more appropriate in this case? To answer this question, let's examine the standard deviations of the columns of quas:

   sqrt(apply(quas[,-1],2,var))

Note that the variation of the R.A and Dec. columns is far larger than that of any other column. Thus, we should not be surprised if these two columns dominate the first two principal components. In fact, since these two columns together with z give the position of the object, we might want to exclude them from the principal components analysis altogether, retaining them unchanged in the reduced dataset. However, we could essentially put all variables on an equal footing in terms of variability by using the correlation rather than the covariance (this is equivalent to standardizing each variable to have standard deviation equal to 1 before performing the princomp analysis). In the following development, we essentially take the first approach.
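
For reference, the correlation-based alternative would simply add cor=TRUE to the princomp call (pc.cor below is just an illustrative name; we will not use it further):

   pc.cor=princomp(quas[,-1], cor=TRUE)
   pc.cor$sdev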

The two most important pieces of information in a principal components analysis are the variances explained (eigenvalues) and the variable loadings (eigenvectors). The former may be viewed graphically using a scree plot:

   screeplot(pc)

In the above plot, we see that the variance of the first two PCs dwarfs the others. To see what this means, we must look at the loadings for these first two PCs:

   loadings(pc)

This last command prints a lot of information. Scroll up to see the loadings of components 1 and 2, with any loadings less than 0.1 in absolute value suppressed as unimportant. (In reality, the loadings for the first principal component are a vector of real numbers, scaled so that the sum of their squares equals 1. Each element in the vector gives the relative weight of the corresponding variable.)

To see all of the loadings for the first principal component (and only those), type

   pc$load[,1]
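
As a quick check of the normalization just mentioned, the squared loadings of any single component sum to 1:

   sum(pc$load[,1]^2)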

We may conclude from the output above that the first principal component consists essentially of nothing other than R.A (recall that the standard deviation of R.A was much larger than that of the other variables, so this result is really not surprising).

It is also unsurprising to see that the second principal component consists almost entirely of the Declination:

   pc$load[,2]
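
One way to see the proportion of total variance explained by each component (the figure cited in the next paragraph) is with summary, or directly from pc$sdev:

   summary(pc)
   cumsum(pc$sdev^2)/sum(pc$sdev^2)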

These two principal components together comprise over 99.8% of the total variance of these variables, which makes it difficult to see the effect of the remaining principal components. As explained earlier, one way to deal with the problem of variables on vastly different scales is to analyze the correlation matrix rather than the covariance matrix. However, in this case, the two variables causing the trouble are easy to identify; thus, we'll proceed by performing a new principal components analysis on the remaining columns of quas after R.A and Dec. are removed:

   pc2=princomp(quas[,-(1:3)])
   screeplot(pc2)

In the new screeplot, we see three or four PCs with relatively large variance, one to four PCs with moderate variance, and the rest with relatively small variance. Let's see what the variable loadings for these first five PCs are:

   loadings(pc2)

Again, it is necessary to scroll up to see the important output.

Examining these loadings, the first PC is somewhat difficult to interpret, but the second is basically an average of all the "_mag" variables. Notably, the three variables (u_mag, g_mag, r_mag) always occur with roughly the same weights in the first few PCs, indicating that we may replace these three with a single variable equal to their mean. The same is true of (i_mag, z_mag) and (J_mag, H_mag, K_mag). We could thus reduce these 8 variables to 3.
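
For illustration, a reduced set of magnitude variables along these lines might be built as follows (a sketch; quas.red and the combined column names are invented here):

   quas.red=data.frame(
      ugr_mag=rowMeans(quas[,c("u_mag","g_mag","r_mag")]),
      iz_mag=rowMeans(quas[,c("i_mag","z_mag")]),
      JHK_mag=rowMeans(quas[,c("J_mag","H_mag","K_mag")]))
   dim(quas.red)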

Another approach is to analyze only the principal component scores themselves, which are contained in pc$scores. This 279x22 matrix contains exactly the same information as the original dataset, but the axes have been rotated so that the first axis is the most important in explaining information, followed by the second, etc. Based on our analysis, only 5 or 6 of these PCs should be very variable.

   pairs(pc$scores[,1:6],pch=".")

The drawback to the above plots, of course, is that many of them are difficult to interpret.

In summary, principal components analysis provides an objective way to decide, based on the data alone, how to reduce the dimensionality of a dataset to ease interpretation. However, substantive astronomical knowledge should be given at least as much weight as such considerations (e.g., if M_i is known to be important, then perhaps it should be kept regardless of what the PC analysis says).


Clustering via agglomerative nesting (agnes)

We turn now to a few of the many methods of clustering. The goal of a clustering algorithm is to identify structure within a multivariate cloud of points by assigning each point to one of a small number of groups (some clustering algorithms don't provide specific assignments for each point but instead tell how likely each point is to belong to each group).

We will analyze the Shapley galaxy dataset, which may be downloaded by typing

   shap=read.table(
      "http://www.astrostatistics.psu.edu/datasets/Shapley_galaxy.dat",
      head=T)

Let's have a look:

   dim(shap)
   names(shap)
   pairs(shap,pch=46)

It looks like we have to deal with some missing Mag observations in column 3:

   shap[shap[,3]==0,3]=NA
   pairs(shap,pch=46)

Next, let's make a rough cut using the V variable and color those points red:

   attach(shap)
   a=V>12000 & V<16000
   pairs(shap,pch=46,col=1+a)

We'd like to search for clusters in space. Let's plot R.A against Dec and use different colors for different V values: black, red, green, and blue for V/1000 in (12,13), (13,14), (14,15), and (15,16), respectively.

   plot(shap[a,1:2], pch=20,
      col=V[a]%/%1000 - 11)   # map the four velocity bins to colors 1 (black) through 4 (blue)

Now we begin to apply some clustering techniques. Most of these are contained in the cluster package, which must be loaded. We will consider the technique called agnes (agglomerative nesting) first.

   library(cluster)
   ?agnes

There are two options of agnes that we care about: "stand" and "method". The first of these should be set to TRUE if we want agnes to standardize the variables before clustering. Let's decide whether this makes sense by checking the variability of each variable (first, we'll reduce the dataset to just those variables and galaxies we want to use for the clustering):

   shap2=shap[a,c(1,2,4)]
   sqrt(apply(shap2,2,var))

We see that because of the units used, the V variable has much higher variance than the other two variables. Therefore, if we were to apply agnes using "stand=FALSE", we would essentially be clustering only on V, which would not be very interesting. One solution here is to convert the (R.A, Dec., V) points into (x, y, z) points in Euclidean space. Another, slightly rougher, solution is simply to use "stand=TRUE", which is what we'll do here:

   ag=agnes(shap2,stand=TRUE)
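
(As an aside, the other solution mentioned above, converting the (R.A, Dec., V) points to Euclidean (x, y, z) coordinates, might look roughly like the sketch below, assuming R.A and Dec. are given in degrees and treating V as a stand-in for radial distance; we do not pursue it here.)

   ra=shap2[,1]*pi/180; dec=shap2[,2]*pi/180   # R.A and Dec. in radians
   xyz=cbind(x=shap2[,3]*cos(dec)*cos(ra),
      y=shap2[,3]*cos(dec)*sin(ra),
      z=shap2[,3]*sin(dec))                    # V used as a rough radial coordinate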

Note that we have used the default value of "method", which is "average". Let's take a look at a dendrogram:

   plot(ag,which=2)

You can see the plotting options for an R object of class "agnes" by reading the help file for plot.agnes. You can also see what information is included in the ag object by looking at the help file for agnes.object.
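
To pull up those help files directly:

   ?plot.agnes
   ?agnes.object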

The dendrogram is hard to read near the leaves because there are 1629 observations in this dataset. To obtain the order in which the original observations must be rearranged to obtain the order seen in the dendrogram, look at ag$order. Since we don't really care about retaining the original order of the galaxies, let's simply reorder them as in the dendrogram:

   shap2=shap2[ag$order,]

In ag$height, there are 1628 values, one for each of the merges. (Each merge reduces the number of clusters by one, so in reducing from 1629 observations to 1 cluster we must have 1628 merges.) We can use these values to determine where to make cuts in the dataset: Whenever ag$height[i] is high enough, we should make a cut between observations i and i+1.

Let's try making a cut at a height of 4:

   pltree(ag)                 # plot the dendrogram
   abline(col=2, h=4)         # horizontal line at the proposed cut height
   which(ag$hei>4)            # positions (in dendrogram order) of the cuts
   abline(col=2, v=.5+
      which(ag$hei>4))        # vertical lines separating the resulting clusters

Although the four cuts identified by the which function result in 5 clusters, several of these clusters consist of only a few observations and could perhaps be lumped together with other clusters. On the other hand, using a slightly higher cutoff of 4.5 instead of 4 leads to four good-sized clusters:

   which(ag$hei>4.5)

Let's produce a categorical variable for the clusters:

   agclust=cut(1:1629,lab=1:4,
      breaks=c(0,188,1278,1535,1629))
   table(agclust)

Now we may use this categorical variable to add color to a pairs plot:

   pairs(shap2,pch=20,
      col=as.numeric(agclust))

Note that the factor agclust must be explicitly coerced to a numeric vector using as.numeric.

Above, we used the method="average" option. Two other commonly used options are "single", which tends to produce long, stringy clusters because the distance between two clusters is taken to be the distance between their closest members; and "complete", which is the opposite of "single" in that the distance between two clusters is taken to be the distance between their most distant members.
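
For comparison, one might rerun the clustering with single linkage and inspect its dendrogram (ag.single below is just an illustrative name):

   ag.single=agnes(shap2,stand=TRUE,method="single")
   plot(ag.single,which=2)   # dendrogram for single linkage, for comparison with the one above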

There are many clustering/partitioning algorithms, far more than we can present here. One way to see the many options in R is to look at the list of functions in the cluster package. There are also a couple of clustering algorithms in R's standard stats package, namely hierarchical clustering (hclust) and k-means clustering (kmeans).
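
As a final sketch, here is how k-means might be applied to the same standardized variables; the choice of four clusters (to parallel the agnes result) and the name km are ours:

   km=kmeans(scale(shap2),centers=4,nstart=25)   # standardize, then k-means with 4 clusters and 25 random starts
   table(km$cluster)                             # cluster sizes
   pairs(shap2,pch=20,col=km$cluster)            # color the pairs plot by k-means cluster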