This module concerns two general multivariate techniques: principal
components and (various methods of) clustering. We return first to the
SDSS quasar dataset.
As before, we want to get rid of the missing values. However, in this case missing
values create more havoc than usual due to the fact that we will be working with
covariance matrices. Thus, we will eliminate all rows with missing values:
quas[quas==0 | quas==-1 | quas==-9] = NA
quas = na.omit(quas)
This leaves us with a much smaller dataset, but for purposes of illustration it will serve
well. Once the principal component loadings are determined, we can then apply these loadings,
or a simplified version thereof,
to the whole dataset.
In the exploratory data analysis and regression
module, we reduced the number of variables by simply getting rid
of some of the columns of quas. In principal components analysis,
the goal is similar: We wish to reduce the number of variables. However,
the method is different: We will find the "best" linear combinations of
all existing variables.
To understand what is "best" in this context,
consider the 22 quantitative measurement columns in the
quas dataset (the first column is the SDSS designation of the object). Each
row may be considered to be a point in 22-dimensional Euclidean space. Thus,
the entire dataset consists of a cloud of 279 22-dimensional points. The
"best" linear combination here will be the single vector in 22-space parallel to
which the variance of these 279 points is the greatest. The second-best
will be the single vector orthogonal to the first along which the variance is
the greatest, and so on.
We will implement principal components in R using two distinct approaches.
One approach is to use the princomp
function. Another is to obtain the same results from scratch using an
eigenvalue/eigenvector decomposition of the sample covariance matrix.
We will use the former approach for analysis and interpretation; the latter
approach is presented only to help you understand how the method works.
To create a single object containing all the principal components
information you will need, type
Note that we omit the first column from the analysis since it is not a quantitative variable.
Let's see what kind of information is carried in pc.
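A minimal sketch, assuming the cleaned data frame is named quas:

```r
pc = princomp(quas[,-1])   # PCA on the covariance matrix of the quantitative columns
names(pc)                  # "sdev" "loadings" "center" "scale" "n.obs" "scores" "call"
```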
Before explaining what each of these things means, let's briefly show
how to obtain the important bits, namely pc$sdev and pc$loadings, from scratch
using an eigenvalue/eigenvector decomposition of the sample covariance
matrix. The square roots of the eigenvalues give pc$sdev and the matrix
of normalized eigenvectors gives pc$loadings. (Note, however, that a normalized
eigenvector is still a normalized eigenvector if multiplied by -1; therefore,
some of the columns of the eigenvector matrix differ from the corresponding
columns of pc$loadings by a sign change.)
In other words, it is possible to reconstruct all of the information in
pc by using
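In sketch form:

```r
es = eigen(cov(quas[,-1]))   # eigendecomposition of the sample covariance matrix
sqrt(es$val)                 # compare with pc$sdev
es$vec                       # compare with pc$loadings (up to column sign changes)
```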
One may compare sqrt(es$val) with pc$sdev and es$vec with pc$loadings to verify
that they agree, up to sign changes in some columns of pc$loadings (and up to the
constant factor sqrt((n-1)/n) in the standard deviations, since princomp divides
by n while cov divides by n-1).
If one invokes the princomp command with cor=TRUE, then the
eigen decomposition is performed on the correlation matrix, obtained via
cor(quas[,-1]), rather than the covariance matrix. Which method is more
appropriate in this case? To answer this question, let's examine
the standard deviations of the columns of quas:
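For instance:

```r
apply(quas[,-1], 2, sd)   # standard deviation of each quantitative column
```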
Note that the variation of the R.A and Dec. columns is far larger than that of any other
column. Thus, we should not be surprised if these two columns dominate the first two principal
components. In fact, since these two columns together with z give position of the object, we
might want to extract them from the principal components analysis altogether, retaining them
unchanged in the reduced data set. However, we could essentially put all variables on
an equal footing in terms of variability by using the correlation rather than the covariance
(this is equivalent to standardizing each of the variables to have standard deviation equal
to 1 before performing the princomp analysis). In the following development, we use essentially
the first approach.
The two most important pieces of information in a principal components analysis are the
variances explained (eigenvalues) and variable loadings (eigenvectors). The former may
be viewed graphically using a technique called a screeplot:
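In R, the screeplot for a princomp object is produced by:

```r
screeplot(pc)   # barplot of the variances of the principal components
```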
In the above plot, we see that the variance of the first two PCs dwarfs the others. To
see what this means, we must look at the loadings for these first two PCs:
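The command in question is presumably the summary method, which by default suppresses small loadings:

```r
summary(pc, loadings=TRUE)   # sdevs, variance proportions, and loadings
                             # (|loading| < 0.1 suppressed by the default cutoff)
```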
This last command prints a lot of information. Scroll up to see the loadings of components 1
and 2, with any loadings less than 0.1 in absolute value suppressed as unimportant. (In reality,
the loadings for the first principal component are a vector of real numbers, scaled so that
the sum of their squares equals 1. Each element in the vector gives the relative weight of
the corresponding variable.)
To see all of the loadings for the first principal component (and only those), type
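For instance:

```r
pc$loadings[, 1]   # every loading for the first principal component
```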
We may conclude from the output above that the first principal component consists
essentially of nothing other than R.A (recall that the standard deviation of R.A was much larger than
that of the other variables, so this result is really not surprising).
It is also unsurprising to see that the second principal component consists almost entirely of the Dec. variable.
These two principal components together comprise over 99.8% of the total variance of these variables, which
makes it difficult to see easily the effect of the remaining principal components. As explained earlier,
one way to deal with the problem of variables on vastly different scales is by analyzing the correlation
matrix rather than the covariance matrix. However, in this case, the two variables causing the trouble
are easy to identify; thus, we'll proceed by performing a new principal components analysis on the remaining
columns of quas after R.A and Dec. are removed:
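A sketch, assuming R.A and Dec. occupy columns 2 and 3 of quas:

```r
pc2 = princomp(quas[, -(1:3)])   # drop designation, R.A, and Dec.
screeplot(pc2)
```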
In the new screeplot, we see three or
four PCs with relatively large variance, one to four PCs with moderate variance, and the
rest with relatively small variance. Let's see what the variable loadings for these first five PCs
look like. Again, it is necessary to scroll up to see the important output.
Examining these loadings, the first PC is somewhat difficult to interpret, but the second is
basically an average of all the "_mag" variables. Notably, the three variables (u_mag, g_mag, r_mag)
always occur with roughly the same weights in the first few PCs, indicating that we may replace these
three with a single variable equal to their mean. The same is true of (i_mag, z_mag) and
(J_mag, H_mag, K_mag). We could thus reduce these 8 variables to 3.
Another approach is to analyze only the principal component
scores themselves, which are contained in pc$scores. This 279x22 matrix contains exactly the
same information as the original dataset, but the axes have been rotated so that the first axis is
the most important in explaining information, followed by the second, etc. Based on our analysis,
only 5 or 6 of these PCs should be very variable.
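One plausible version of such plots:

```r
pairs(pc$scores[, 1:6], pch=".")   # pairwise scatterplots of the first six PC scores
```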
The drawback to the above plots, of course, is that many of them are difficult to interpret.
In summary, principal components provides an objective way to decide, based on data alone,
how to reduce the dimensionality of a dataset to ease interpretability. However,
substantive astronomical knowledge should be at least as important as
such considerations (e.g., if M_i is known to be important, then maybe it should be kept
regardless of what PC analysis says).
Clustering via agglomerative nesting (agnes)
We turn now to a few of the many methods of clustering. The goal of a clustering algorithm
is to identify structure within a multivariate cloud of points by assigning each point to one
of a small number of groups (some clustering algorithms don't provide specific assignments
for each point but instead tell how likely each point is to belong to each group).
We will analyze the
Shapley galaxy dataset, which may be downloaded by typing
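The original command is not shown; a sketch using read.table, where the URL is an assumption and should be replaced with the dataset's actual location:

```r
shap = read.table("http://astrostatistics.psu.edu/datasets/Shapley_galaxy.dat",
                  header=TRUE)   # URL is an assumption
```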
Let's have a look:
It looks like we have to deal with some missing Mag observations in column 3:
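The cleaning command itself is missing; since bare column names such as V are used below, something like the following is presumably intended:

```r
shap = na.omit(shap)   # drop the rows with missing Mag values
attach(shap)           # make columns (V, R.A, Dec., ...) directly accessible
```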
Next, let's make a rough cut using the V variable and color those points red:
a=V>12000 & V<16000
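The accompanying plot command is missing; one plausible sketch (column names assumed) highlights the selected galaxies in red:

```r
plot(R.A, V, pch=20)                     # all galaxies
points(R.A[a], V[a], pch=20, col="red")  # galaxies in the rough V cut
```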
We'd like to search for clusters in space. Let's plot R.A against Dec and use different
colors for different V values:
black, red, green, and blue for V/1000 in (12,13), (13,14), (14,15), and (15,16), respectively
(in the default palette, colors 1 through 4 are black, red, green, and blue):
plot(R.A[a], Dec.[a], col=V[a] %/% 1000 - 11)
Now we begin to apply some clustering techniques. Most of these are contained in the
cluster package, which must be loaded. We will consider the technique called
agnes (agglomerative nesting) first.
There are two options of agnes
that we care about: "stand" and "method". The first of these should be
set to TRUE if we want
to standardize the variables before clustering. Let's decide whether this makes
sense by checking the variability of each variable (first, we'll reduce the dataset to
just those variables and galaxies we want to use for the clustering):
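A sketch of the reduction, assuming the clustering variables are named R.A, Dec., and V:

```r
shap2 = shap[, c("R.A", "Dec.", "V")]  # column names are assumptions
apply(shap2, 2, sd)                    # standard deviation of each variable
```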
We see that because of the units used, the V variable has much higher variance
than the other two variables. Therefore, if we were to apply
agnes using "stand=FALSE", we would
essentially be clustering only on V, which would not be very interesting.
One solution here is to convert the (R.A, Dec., V) points into (x, y, z) points
in Euclidean space. Another, slightly rougher, solution is simply to use "stand=TRUE",
which is what we'll do here:
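A sketch of the call, assuming the reduced data frame is named shap2:

```r
library(cluster)
ag = agnes(shap2, stand=TRUE)   # agglomerative nesting on standardized variables
```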
Note that we have used the default value of "method", which is "average".
Let's take a look at a dendrogram:
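One way to draw it, using the plot method for agnes objects:

```r
plot(ag, which.plots=2)   # which.plots=2 selects the dendrogram
```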
You can see the plotting options for an R object of class "agnes" by
reading the help file for plot.agnes.
You can also see what information is included in the ag object by
looking at the help file for agnes.object.
The dendrogram is hard to read near the leaves because there are 1629
observations in this dataset. To obtain the order in which the original
observations must be rearranged to obtain the order seen in the dendrogram,
look at ag$order. Since we don't really care about retaining the original order
of the galaxies, let's simply reorder them as in the dendrogram:
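A sketch of the reordering:

```r
shap2 = shap2[ag$order, ]   # rows now appear in dendrogram (leaf) order
```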
In ag$height, there are 1628 values, one for each of the merges. (Each merge reduces
the number of clusters by one, so in reducing from 1629 observations to 1 cluster we
must have 1628 merges.) We can use these values to determine where to make cuts in the
dataset: Whenever ag$height[i] is high enough, we should make a cut between observations
i and i+1.
Let's try making a cut at a height of 4:
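A sketch of locating the cuts:

```r
which(ag$height > 4)   # positions i at which to cut between observations i and i+1
```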
Although the four cuts identified at this height result in 5 clusters, several of
these clusters consist of only a few observations and
could perhaps be lumped together with other clusters. On the other hand, using a height cutoff
of 3.5 instead of 4 leads to four good-sized clusters:
Let's produce a categorical variable for the clusters:
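The original construction is not shown; one way to build the grouping, using cutree on the tree converted to hclust form (cutting at height 3.5 reproduces the cuts described above):

```r
agclust = factor(cutree(as.hclust(ag), h=3.5))   # cluster label for each galaxy
table(agclust)                                   # sizes of the resulting clusters
```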
Now we may use this categorical variable to add color to a pairs plot:
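A sketch of the colored pairs plot:

```r
pairs(shap2, col=as.numeric(agclust), pch=20)   # one color per cluster
```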
Note that the factor agclust must be explicitly coerced to a numeric vector using
as.numeric in order to serve as a vector of plotting colors.
Above, we used the method="average" option. Two other commonly used options
are "single", which tends to produce stringy clusters because the distance between
two clusters is taken to be the distance between their closest elements; and "complete",
which is the opposite of "single" in the sense that the distance between two clusters
is the distance between their most distant elements.
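For comparison, the alternative linkages can be fit the same way:

```r
ag.single   = agnes(shap2, stand=TRUE, method="single")    # nearest-neighbor linkage
ag.complete = agnes(shap2, stand=TRUE, method="complete")  # farthest-neighbor linkage
```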
There are many clustering/partitioning algorithms, far more than we can present here.
One way to see the many options in R is to look at the list of functions for the
cluster package. There are also a couple of
clustering algorithms in the standard stats package, namely hierarchical clustering
(hclust) and k-means clustering (kmeans).
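A brief sketch of these base-R alternatives (the choice of 4 clusters is purely illustrative):

```r
d  = dist(scale(shap2))               # Euclidean distances on standardized variables
hc = hclust(d, method="average")      # hierarchical clustering, akin to agnes
km = kmeans(scale(shap2), centers=4)  # k-means partition into 4 clusters
table(km$cluster)
```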