This tutorial deals with a few multivariate techniques including clustering
components. We begin with a short introduction to generating multivariate
normal random vectors.
Multivariate normal distributions
We'll start off by generating some multivariate normal random vectors.
There are packages that do this automatically, such as the mvtnorm package
available from CRAN, but it is easy and instructive to do from first principles.
Let's generate from a bivariate normal distribution in which the standard
deviations of the components are 2 and 3 where the correlation between
the components is -1/2. For simplicity, let the mean of the vectors be the
origin. We need to figure out what the covariance matrix looks like.
The diagonal elements of the covariance matrix are the marginal variances, namely
4 and 9. The off-diagonal element is the covariance, which equals the
correlation times the product of the marginal standard deviations, or -3:
We now seek to find a matrix M such that M times its transpose equals sigma.
There are many matrices that do this; one of them is the transpose of the
Cholesky square root:
M <- t(chol(sigma))
M %*% t(M)
We now recall that if Z is a random vector and M is a matrix, then the
covariance matrix of MZ equals M cov(Z) Mt. It is very easy to
simulate normal random vectors whose covariance matrix is the identity
matrix; this is accomplished whenever the vector components are independent
standard normals. Thus, we obtain a multivariate normal random vector
with covariance matrix sigma if we first generate a standard normal vector
and then multiply by the matrix M above. Let us create a dataset
with 200 such vectors:
Z <- matrix(rnorm(400),2,200)
# 2 rows, 200 columns
X <- t(M %*% Z)
The transpose above is taken so that X becomes a 200 x 2 matrix, since R
prefers to have the columns as the vector components rather than the rows.
Let us now plot the randomly generated normals and find the sample mean
Xbar <- apply(X,2,mean)
S <- cov(X)
We can compare the S matrix with the sigma matrix, but it is also nice to plot
an ellipse to see what shape these matrices correspond to. The car package,
which we used in the
EDA and regression tutorial, has the capability
to plot ellipses. You might not need to run the install.packages
function below since this package may already have been installed
in the previous tutorial. However, the library function is necessary.
To use the ellipse function in the car package, we need the center (mean),
shape (covariance), and the radius. The radius is the radius of a circle
that represents the "ellipse" for a standard bivariate normal distribution.
To understand how to provide a radius, it is helpful to know that if we sum
the squares of k independent standard normal random variables, the result
is (by definition) a chi-squared random variable on k degrees of freedom.
Thus, for a standard bivariate normal vector, the squares of the radii should
be determined by the quantiles of the chi-squared distribution on 2 degrees
of freedom. Let us then construct an ellipses with radius based on the median
of the chi-squared distribution. Thus, this ellipse should contain roughly
half of the points generated. We'll also produce a second ellipse, based
on the true mean and covariance matrix, for purposes of comparison.
sqrt(qchisq(.5,2)), col=3, lty=2)
Let's apply some of the bivariate normal results of the previous section
to looking for clusters in an astronomical dataset. The
COMBO-17 dataset provides brightness measurements on 3462 galaxies. Here,
we use a subset of this dataset to try to reproduce a figure that appears
COMBO17.html web page (not completely successfully, it turns out,
though I'm not quite sure why).
combo <- read.csv(
xy <- cbind(BjMAG, S280MAG-BjMAG)
xy <- xy[Mcz>.7 &
xy <- na.omit(xy)
main="z between 0.7 and 0.9",
Notice that the
command uses the
function. This is one way
to put mathematical notation to an R plot. See the help function for
In model-based clustering, the assumption is (usually) that the multivariate
sample is a random sample from a mixture of multivariate normal distributions.
A mixture in this case is a weighted sum of different normal
Think of it like this: Supppose there exist k multivariate normal distributions
select our sample, someone in a closed room first rolls a k-sided die (not
necessarily a fair die), then selects a random member of the subpopulation
indicated by the die. The only thing we get to observe is the final
observation; we do not know which number came up on the die or any of the
characteristics (parameters) of the normal subpopulations.
To take a simple example, suppose you are given a dataset consisting only
heights of a sample of individuals. You know that there are two subpopulations,
males and females, and that heights in each subpopulation are roughly normally
distributed. Is it possible, without knowing the sexes corresponding to the
measurements you are given, to estimate the means and standard deviations
for each sex, along with the proportion of males? The answer is yes.
Note here that model-based clustering using mixture models is not the same
thing as disciminant analysis, in which we are given not only observations but
also their known class memberships. The goal in discriminant analysis is to
build a rule for classifying future observations based on a training sample,
whereas clustering usually has broader goals than this: To discover the
classes in the first place.
There is a CRAN package that does model-based clustering assuming normal
distributions. It is called mclust:
Let's first fit a two-component normal mixture model (i.e., search for two
multivariate normal clusters).
mc2 <- Mclust(xy, minG=2, maxG=2)
Let's take a look at 50% ellipses of the clustering solution to see what the
solution "looks like".
r <- sqrt(qchisq(.5,2))
for(i in 1:mc2$G)
ellipse(mc2$mu[,i], mc2$sigma[,,i], r, col=1+i)
Is two components the "best" solution? There is no best answer to this question
in general, but we can do things like consider model selection criteria
to try to decide. By default, the Mclust uses BIC to search for a best
model from 1 to 9 components:
mc <- Mclust(xy)
So let's see what the three-component solution looks like:
main="z between 0.7 and 0.9",
for(i in 1:mc$G)
ellipse(mc$mu[,i], mc$sigma[,,i], r, col=1+i)
We now look at a different dataset, the SDSS quasar dataset. Download this dataset from
We want to get rid of the missing values. However, in this case missing
values create more havoc than usual due to the fact that we will be working with
covariance matrices. Thus, we will eliminate all rows with missing values:
quas[quas==0 | quas==-1 | quas==-9]=NA
quas <- na.omit(quas)
This leaves us with a much smaller dataset, but for purposes of illustration it will serve
well. Once the principal component loadings are determined, we can then apply these loadings,
or a simplified version thereof,
to the whole dataset.
In principal components analysis,
we wish to reduce the number of variables.
The method is to find the "best" linear combinations of
all existing variables.
To understand what is "best" in this context,
consider the 22 quantitative measurement columns in the
quas dataset (the first column is the SDSS designation of the object). Each
row may be considered to be a point in 22-dimensional Euclidean space. Thus,
the entire dataset consists of a cloud of 279 22-dimensional points. The
"best" linear combination here will be the single vector in 22-space parallel to
which the variance of these 279 points is the greatest. The second-best
will be the single vector orthogonal to the first along which the variance is
the greatest, and so on.
We will implement principal components in R using two distinct approaches.
One approach is to use the
function. Another is to obtain the same results from scratch using an eigenvalue
We will use the former approach for analysis and interpretation; the latter
approach is presented only to help you understand how the method works
To create a single object containing all the principal components
information you will need, type
Note that we omit the first column from the analysis since it is not a quantitative
Let's see what kind of information is carried in pc.
Before explaining what each of these things means, let's briefly show
how to obtain the important bits, namely pc$sdev and pc$loadings, from scratch
using an eigenvalue/eigenvector decomposition of the sample covariance
matrix. The square roots of the eigenvalues give pc$sdev and the matrix
of normalized eigenvectors gives pc$loadings. (Note, however, that a normalized
eigenvector is still a normalized eigenvector if multiplied by -1; therefore,
some of the columns of the eigenvector matrix differ from the corresponding
columns of pc$loadings by a sign change.)
In other words, it is possible to reconstruct all of the information in
pc by using
One may compare sqrt(es$val) with pc$sdev and es$vec with pc$load to verify
that they are the same except for sign changes in some columns of pc$load.
If one invokes the princomp command with cor=TRUE, then the
eigen decomposition is performed on the correlation matrix, obtained via
cor(quas[,-1]), rather than the covariance matrix. Which method is more
appropriate in this case? To answer this question, let's examine
the standard deviations of the columns of quas:
Note that the variation of the R.A and Dec. columns is far larger than that of any other
column. Thus, we should not be surprised if these two columns dominate the first two principal
components. In fact, since these two columns together with z give position of the object, we
might want to extract them from the principal components analysis altogether, retaining them
unchanged in the reduced data set. However, we could essentially put all variables on
an equal footing in terms of variability by using the correlation rather than the covariance
(this is equivalent to standardizing each of the variables to have standard deviation equal
to 1 before performing the princomp analysis). In the following development, we use essentially
the first approach.
The two most important pieces of information in a principal components analysis are the
variances explained (eigenvalues) and variable loadings (eigenvectors). The former may
be viewed graphically using a technique called a
In the above plot, we see that the variance of the first two PCs dwarfs the others. To
see what this means, we must look at the loadings for these first two PCs:
This last command prints a lot of information. Scroll up to see the loadings of components 1
and 2, with any loadings less than 0.1 in absolute value suppressed as unimportant. (In reality,
the loadings for the first principal component are a vector of real numbers, scaled so that
the sum of their squares equals 1. Each element in the vector gives the relative weight of
the corresponding variable.)
To see all of the loadings for the first principal component (and only those), type
We may conclude from the output above that the first principal component consists
essentially of nothing other than R.A (recall that the standard deviation of R.A was much larger than
that of the other variables, so this result is really not surprising).
It is also unsurprising to see that the second principal component consists almost entirely of the
These two principal components together comprise over 99.8% of the total variance
of these variables, which
makes it difficult to see easily the effect of the remaining principal components.
As explained earlier,
one way to deal with the problem of variables on vastly different scales is by
analyzing the correlation
matrix rather than the covariance matrix. However, in this case, the two
variables causing the trouble
are easy to identify; thus, we'll proceed by performing a new principal
components analysis on the remaining
columns of quas after R.A and Dec. are removed:
In the new screeplot, we see three or
four PCs with relatively large variance, one to four PCs with moderate variance, and the
rest with relatively small variance. Let's see what the variable loadings for these first five PCs
Again, it is necessary to scroll up to see the important output.
Examining these loadings, the first PC is somewhat difficult to interpret, but the second is
basically an average of all the "_mag" variables. Notably, the three variables (u_mag, g_mag, r_mag)
always occur with roughly the same weights in the first few PCs, indicating that we may replace these
three with a single variable equal to their mean. The same is true of (i_mag, z_mag) and
(J_mag, H_mag, K_mag). We could thus reduce these 8 variables to 3.
Another approach is to analyze only the principal component
scores themselves, which are contained in pc$scores. This 279x22 matrix contains exactly the
same information as the original dataset, but the axes have been rotated so that the first axis is
the most important in explaining information, followed by the second, etc. Based on our analysis,
only 5 or 6 of these PCs should be very variable.
The drawback to the above plots, of course, is that many of them are difficult to interpret.
A biplot for a principal components analysis is a way of seeing both the
PC scores and the factor loadings simultaneously.
In summary, principal components provides an objective way to decide, based on data alone,
how to reduce the dimensionality of a dataset to ease interpretability. However,
substantive astronomical knowledge should be at least as important as
such considerations (e.g., if M_i is known to be important, then maybe it should be kept
regardless of what PC analysis says).
Clustering via agglomerative nesting (agnes)
We turn now to a few of the many methods of clustering. The goal of a clustering algorithm
is to identify structure within a multivariate cloud of points by assigning each point to one
of a small number of groups (some clustering algorithms don't provide specific assignments
for each point but instead tell how likely each point is to belong to each group).
We will analyze the
Shapley galaxy dataset, which may be downloaded by typing
Let's have a look:
It looks like we have to deal with some missing Mag observations in column 3:
Next, let's make a rough cut using the V variable and color those points red:
a=V>12000 & V<16000
We'd like to search for clusters in space. Let's plot R.A against Dec and use different
colors for different V values:
black, red, green, and blue for V/1000 in (12,13), (13,14), (14,15), and (15,16), respectively.
col=V[a]%/%1000 - 12)
Now we begin to apply some clustering techniques. Most of these are contained in the
cluster package, which must be loaded. We will consider the technique called
agnes (agglomerative nesting) first.
There are two options of
that we care about: "stand" and "method". The first of these should be
set to TRUE if we want
to standardize the variables before clustering. Let's decide whether this makes
sense by checking the variability of each variable (first, we'll reduce the dataset to
just those variables and quasars we want to use for the clustering):
We see that because of the units used, the V variable has much higher variance
than the other two variables. Therefore, if we were to apply
agnes using "stand=FALSE", we would
essentially be clustering only on V, which would not be very interesting.
One solution here is to convert the (R.A, Dec., V) points into (x, y, z) points
in Euclidean space. Another, slightly rougher, solution is simply to use "stand=TRUE",
which is what we'll do here:
Note that we have used the default value of "method", which is "average".
Let's take a look at a dendrogram:
You can see the plotting options for an R object of class "agnes" by
reading the help file for
You can also see what information is included in the ag object by
looking at the help file for
The dendrogram is hard to read near the leaves because there are 1629
observations in this dataset. To obtain the order in which the original
observations must be rearranged to obtain the order seen in the dendrogram,
look at ag$order. Since we don't really care about retaining the original order
of the galaxies, let's simply reorder them as in the dendrogram:
In ag$height, there are 1628 values, one for each of the merges. (Each merge reduces
the number of clusters by one, so in reducing from 1629 observations to 1 cluster we
must have 1628 merges.) We can use these values to determine where to make cuts in the
dataset: Whenever ag$height[i] is high enough, we should make a cut between observations
i and i+1.
Let's try making a cut at a height of 4:
Although the four cuts identified by the
function result in 5 clusters, several of these clusters consist of only a few observations and
could perhaps be lumped together with other clusters. On the other hand, using a height cutoff
of 3.5 instead of 4 leads to four good-sized clusters:
Let's produce a categorical variable for the clusters:
Now we may use this categorical variable to add color to a pairs plot:
Note that the
agclust must be explicitly coerced to a numeric vector using
Above, we used the method="average" option. Two other commonly used options
are "single", which tends to produce stringy clusters because two clusters
are considered as close as their closest elements; and "complete", which is the opposite
of "single" in the sense that two clusters are considered as close as the most distant elements.
There are many clustering/partitioning algorithms, far more than we can present here.
One way to see the many options in R is to look at the list of functions for the
cluster package. There are also a couple of
clustering algorithms in the standard R package, namely hierarchical clustering
and k-means clustering