Exploratory Data Analysis and Regression

This module demonstrates some of the capabilities of R for exploring univariate properties of quantitative variables and relationships among two or more such variables.

Univariate explorations

We begin by examining the Hipparcos dataset found at http://astrostatistics.psu.edu/datasets/HIP_star.html. Type

Recall the variable names in this dataset using the names function. By using attach, we can automatically create temporary variables with these names (these variables are not saved as part of the R session, and they are superseded by any other R objects of the same names).

names(hip)
attach(hip)

After using the attach command, we can obtain, say, individual summaries of the variables:

summary(Vmag)
summary(B.V)

We can also use summary on the entire 'hip' matrix.

summary(hip)

Next, summarize some of this information graphically using a box-and-whisker plot showing median, quartile and outliers for the four variables Vmag, pmRA, pmDE, and B.V (the last variable used to be B-V, or B minus V, but R does not allow certain characters). These are the 2nd, 6th, 7th, and 9th columns of 'hip'.

boxplot(hip[,c(2,6,7,9)])

Our first attempt looks bad due to different scales of the variables, so we construct an array of four single-variable plots:

par(mfrow=c(2,2))
for(i in c(2,6,7,9))
boxplot(hip[,i],main=names(hip)[i])
par(mfrow=c(1,1))

The boxplot command does more than produce plots; it also returns output that can be more closely examined. Below, we use the same command as earlier, but we produce the plot and save the output.

b=boxplot(hip[,c(2,6,7,9)])
names(b)

'b' is an object called a list. To understand its contents, read the help for boxplot. Suppose we wish to see all of the outliers in the pmRA variable, which is the second of the four variables in the current boxplot:

b\$names[2]
b\$out[b\$group==2]

Next, we'll make a more elaborate boxplot. Suppose we wish to examine the values of Vmag, with objects broken into categories according to the B.V variable:

boxplot(Vmag~cut(B.V,breaks=(-1:6)/2),
notch=T, varwidth=T, las=1, tcl=.5,
xlab=expression("B minus V"),
ylab=expression("V magnitude"),
main="Can you find the red giants?",
cex=1, cex.lab=1.4, cex.axis=.8, cex.main=1)
axis(2, labels=F, at=0:12, tcl=-.25)
axis(4, at=3*(0:4))

The notches in the boxes, produced using "notch=T", can be used to test for differences in the medians (see boxplot.stats for details). With "varwidth=T", the box widths are proportional to the square roots of the sample sizes. The "cex" options all give scaling factors, relative to default: "cex" is for plotting text and symbols, "cex.axis" is for axis annotation, "cex.lab" is for the x and y labels, and "cex.main" is for main titles. The two axis commands are used to add an axis to the current plot. The first such command above adds smaller tick marks at all integers, whereas the second one adds the axis on the right.

Bivariate plots

The boxplot in the plot above is telling us something about the bivariate relationship between the two variables. Yet it is probably easier to grasp this relationship by producing a scatter plot.

plot(Vmag,B.V)

The above plot is a bit busy because of the default plotting character, set let's use a different one:

plot(Vmag,B.V,pch=".")

Let's now produce the same graph but selecting just for Hyades stars. This open cluster should be concentrated both in the sky coordinates RA and DE, and also in the proper motion variables pm_RA and pm_DE. We start by noticing a concentration of stars in the RA distribution:

plot(Vmag,RA,pch=".")
x1= (RA>50 & RA<100)

Next, we select in the proper motions. (As our cuts through the data are parallel to the axes, this variable-by-variable classification approach is sometimes called Classification and Regression Trees or CART, a very common multivariate classification procedure.)

plot(pmRA[x1],pmDE[x1],pch=20)
plot(pmRA[x1],pmDE[x1],pch=20,
xlim=c(0,200),ylim=c(-200,100))
x2=(pmRA>90 & pmRA<130)
x3=(pmDE>-60 & pmDE< -10) # Space in '< -' is necessary!
x4=x1&x2&x3

After our selection, we replot the HR diagram of Vmag vs. B.V. This shows the Zero Age Main Sequence, plus four red giants, with great precision. Outliers in the original Hipparcos dataset have been effectively removed. However, let's have a final look at the stars we have identified using the pairs command to produce all bivariate plots for pairs of variables. We'll exclude the first and fifth columns (the HIP identifying number and the parallax, which is already known to lie in a narrow band).

pairs(hip[x4,-c(1,5)],pch=20)

Note that indexing a matrix or vector using negative integers has the effect of excluding the corresponding entries.

We see that there is one outlying star in the DE variable, indicating that this star is not actually one of the Hyades, and one outlying star in the e_Plx variable, indicating that its measurements are not reliable. We exclude both points:

x5=x4 & (DE>0) & (e_Plx<5)
pairs(hip[x5,-c(1,5)],pch=20)

Note that the x5 variable, a vector of TRUE and FALSE, may be summed to reveal the number of TRUE's.

sum(x5)

As a final look at these data, let's consider the original plot of Vmag versus B.V but make the 92 stars we just identified look bigger (pch=20 instead of 46) and color them red (col=2 instead of 1):

plot(Vmag,B.V,pch=c(46,20)[1+x5],
col=1+x5)

Linear least-squares regression

Consider the relationship between DE and pmDE among the 92 stars identified above:

plot(DE[x5],pmDE[x5],pch=20)

If we wish to fit a linear regression (least squares) line to this plot, we may do so using the lm (linear model) function. Note the "response ~ predictor(s)" format used in formulas for functions like lm .

m=lm(pmDE[x5] ~ DE[x5])
abline(m,lwd=2,col=2)

The m object just created is an object of class lm. The class of an object in R can help to determine how it is treated by functions such as print and summary.

m # same as print(m)
summary(m)

There is a lot of information contained in m that is not displayed by print or summary:

names(m)
plot(m\$fit,m\$resid)

The residual plot produced above reveals no irregularities. Note that when referring to the items in a list by name, as in m\$fit or m\$resid above, it is only necessary to type enough of the name to uniquely identify it.

We now look at a different dataset, the SDSS quasar dataset. Download this dataset from http://www.astrostatistics.psu.edu/datasets/SDSS_quasar.html.

dim(quas)
names(quas)

We won't need all of these columns. Let's keep only a subset.

quas=quas[,c("z","r_mag","i_mag",
names(quas)

Creating all bivariate plots using pairs would take a bit of time with this size dataset; let's do some exploration using a subset of 2000 rows, which we may obtain using the sample function:

s=sample(nrow(quas),2000)
pairs(quas[s,-c(1,2)],pch=20)

There appear to be some strange outliers in columns like Radio and X.ray. It seems that values of 0, -1, and -9 are intended to signify missing data. Let's remove all 0's and -1's and -9's by changing them to NA, then use attach to create new temporary variables with the same names as the columns and redo the bivariate scatterplots. Remember, any R objects with the same names as one of the columns will override these temporary variables.

quas[quas==0 | quas==-1 | quas==-9]=NA
attach(quas)
pairs(quas[s,],pch=20)

Several of these plots look interesting. Let's take a look at the relationship between z and M_i (using all 46420 points):

plot(z,M_i,pch=".")
abline(lm(M_i~z),col=2)

Clearly a straight line does not describe this relationship well. How about a quadratic relationship? Note: fitting a quadratic curve to data is still linear regression. This is because we are modelling the response (M_i) using a linear combination of two variables (z and z2).

z2=z^2
m2=lm(M_i~z+z2)
x=seq(min(z),max(z),len=200)
y=predict(m2,data.frame(z=x,z2=x^2))
lines(x,y,lwd=2,col=3)

Note that we used the seq function to create a sequence of x-axis (predictor) values to plot. We also used the predict function to obtain the y-axis values predicted by model m2 for the point in the sequence x. Finally, we used the lines function to add line segments to the current plot (in order, in connect-the-dots fashion).

Other types of regression

Let's try a non-linear fit, given by loess or lowess:

m3=lowess(z,M_i)
lines(m3,col=4,lwd=2)

Let's consider another bivariate relationship:

plot(X.ray,i_mag,pch=46)
abline(lm(i_mag~X.ray),col=2)

It is possible to apply weighted least squares. For instance, if the variances of the response variables are known, a typical weight is the reciprocal of variance. in this example, the difference is not large:

m4=lm(i_mag~X.ray,weights=1/sig_i^2)
abline(m4,col=3)

We can also try resistant regression, using the lqs function in the MASS package.

library(MASS)
m5=lqs(i_mag~X.ray)
abline(m5,col=4)

Finally, let's use least absolute deviation regression. To do this, we'll use a function called rq (regression quantiles) in the "quantreg" package. This package is not part of the standard distribution of R, so we'll need to download it. In order to do this, we must tell R where to store the installed library using the install.packages function.

install.packages("quantreg",lib="v:/",
CRAN="http://lib.stat.cmu.edu/R/CRAN")
library(quantreg, lib.loc="v:/")

Assuming the quantreg package is loaded, we may now compare the least-squares fit (red) with the least absolute deviations fit (green). In this example, the two fits are nearly identical:

m6=lm(i_mag~X.ray)
m7=rq(i_mag~X.ray)
names(m7)
plot(X.ray,i_mag,pch=46)
abline(m6,col=2)
abline(m7,col=3)