# Exploratory data analysis

### Introduction

In this tutorial we shall learn to perform simple statistical analysis and plotting of data with R. The parts involving astronomical information are based on the notes by Prof. David Hunter.

### Getting astronomical data

All the data sets used in these tutorials are already made available locally, in a format ready to be fed into R. However, it is instructive to know how to obtain astronomical data sets from the web as well as from raw formats like the Flexible Image Transport System (FITS) format.

#### Data from the web

This section has nothing to do with R. It is just an overview of online astronomical data resources. You may skip this section without any loss of continuity. Also the internet is an ever-changing place. So the URLs may have changed since the time of writing.

The astronomical community has a vast complex of online databases. Many databases are hosted by data centers such as

The Virtual Observatory (VO) is developing new flexible tools for accessing, mining and combining datasets at distributed locations; see the Web sites for the international, European, and U.S VO for information on recent developments.

The VO Web Services and Summer Schools provide helpful entries into these new capabilities.

We shall treat here only input of tabular data such as catalogs of astronomical sources. One of the multivariate tabular datasets used here is a dataset of stars observed with the European Space Agency's Hipparcos satellite during the 1990s. It gives a table with 9 columns and 2719 rows giving Hipparcos stars lying between 40 and 50 parsecs from the Sun. The dataset was acquired using CDS's Vizier Catalogue Service as follows:
1. In Web browser, go to http://vizier.u-strasbg.fr/viz-bin/VizieR?-source=I/239/hip_main
2. Set Max Entries to 9999, Output layout ASCII table.
3. Clear the ``Compute r'' and ``Compute Position'' checkboxes.
4. Set parallax constraint "20 .. 25" to gives stars between 40 and 50 pc.
5. Retrieve 9 properties: HIP, Vmag, RA(ICRS), DE(ICRS), Plx, pmRA, pmDE, e_Plx, and B-V.
6. Submit query.
7. Use ASCII editor (e.g., kwrite, emacs, notepad, but not MSWords or Openoffice) to trim header to one line with variable names.
8. Trim trailer.
9. Indicate missing values by NA (more on this later).
10. Save ASCII file on disk for ingestion into R.

#### Preparing a data file for R

Here we shall deal with only ASCII files (ones with simple English letters, digits and punctutations). As an example consider a file with three variables X, Y and Z.
```
This a sample data file.
X  Y  Z      remark
1  2  3      first case
-9     4      check!
3  1    23
34 23  12 This is a really
long comment.

```
Unfortunately, this file is not ready for ingestion into R. We shall learn to modify this step by step.

Step 1: Remove unnecessary lines (or commnt them out):
```
#This a sample data file.
X  Y  Z      remark
1  2  3      first case
-9     4      check!
3  1    23
34 23  12 This is a really
long comment.

```
The first non-comment line in the file should either give the names of the variables, or be the first line of the data set.

Step 2: Use quotes for strings with spaces in them. Also each case should be in its own line.
```
#This a sample data file.
X  Y  Z      remark
1  2  3      'first case'
-9     4      check!
3  1    23
34 23  12 'This is a really long comment.'

```

Step 3: Indicate missing values by `NA`s:
```
#This a sample data file.
X  Y  Z      remark
1  2  3      'first case'
-9  NA   4      check!
3  1    23    NA
34 23  12 'This is a really long comment.'

```
Now the file is ready to be read into R. We shall soon see how to read this file into R using the function `read.table`.

If this looks like too much work, you may use the comma-separated-values (csv) format like this:
```
#This a sample data file.
X,  Y,  Z,      remark
1,  2,  3,      first case
-9,   ,  4,      check!
3,  1,    23,
34, 23,  12, This is a really long comment.

```
The R function to read a csv file is `read.csv`.

#### From a FITS file

FITS is a standard format to store the image obtained from a telescope. While R does not have built-in facilities to extract data from a FITS file, there are various add-on packages for this purpose. One such is the `FITSio` package, which is available for free on the web. It allows both reading and writing FITS files. Here is a typical example (adapted from the online help of the `FITSio` package):
```library("FITSio") #This loads the package
#assuming that it is

Z <- matrix(1:15, ncol = 3) #Just create a matrix.

writeFITSim(Z, file = "test.fits") #Dump the matrix
#as a FITS image file.
X
```
You'll notice that `X` has much more information in it that `Z` ever had. This is because a FITS file has slots for many pieces of information, which R has filled in with default values. To understand the contents of `X` fully you need to know the FITS format.

### Reading the data into R

In this tutorial we shall mainly work with data sets in the form of astronomical catalogs (and not FITS files). Let us assume that the Hipparcos data set that we just talked about is in
F:\astro\HIP.dat
ready for R. We have already learned how to use the absolute path `F:\astro\HIP.dat` to load the data set into R. Now we shall learn a two step process that is usually easier. First navigate to the correct folder/directory
```setwd("F:/astro") #notice the forward slash
getwd() #just to make sure
```
The function `setwd` means "set working directory".

```hip = read.table("HIP.dat", header=T)
```
 The advantage of using `setwd` is that you have to type the name of the folder/directory only once. All files (data/script) in that folder can then be referred to by just the their names.
After the loading is complete we should make sure that things are as they should be. So we check the size of the data set, the variable names.
```dim(hip)
names(hip)
```
Let us take a look at the first 3 rows of the data set.
```hip[1:3,]
```

 Exercise:What command should you use to see the first 2 columns?

There is a variable called RA in the data set. It corresponds to column 3 in the data set. To see its values you may use either
```hip[,3]
```
or
```hip[,"RA"]
```
Incidentally, the first column is just an index variable without any statistical content. So let us get rid of it.
```hip = hip[,-1]
```

### Summarizing data

The following diagram shows four different data sets along a number line.

 Four data sets shown along a number line

Notice that the points in the red data set (topmost) and the black data set (third from top) are more or less around the same center point (approximately 2). The other two data sets are more or less around the value 0. We say that the red and black data sets have the same central tendency, while the other data sets have a different central tendency.

Again, notice that the points in the red and blue data sets (the topmost two) are tightly packed, while the other two data sets have larger spreads. We say that the bottom two data sets have larger dispersion than the top two.

#### Central tendency

When summarizing a data set we are primarily interested in learning about its central tendency and dispersion. The central tendency may be obtained by either the `mean` or `median`. The median is the most central value of a variable. To find these for all the variables in our data set we `apply` the `mean` and `median` function on the columns.
```apply(hip,2,mean)
```
Have you noticed the mean of the last variable? It is `NA` or ``Not Available'', which is hardly surprising since not all the values for that variable were present in the original data set. We shall learn later how to deal with missing values (`NA`s).

 Exercise: Find the `median` of all the variables.

#### Dispersion

Possibly the simplest (but not the best) way to get an idea of the dispersion of a data set is to compute the min and max. R has the functions `min` and `max` for this purpose.
```apply(hip,2,min)
apply(hip,2,max)
```
In fact, we could have applied the `range` function to find both min and max in a single line.
```apply(hip,2,range)
```
The most popular way to find the dispersion of a data set is by using the variance (or its positive square root, the standard deviation). The formula is
where is the mean of the data. The function `var` and `sd` compute the variance and standard deviation.
```var(hip[,"RA"])
sd(hip[,"RA"])
```
Another popular measure of dispersion is the median absolute deviation (or MAD) which is proportional to the median of the absolute distances of the values from the median. It is given by the following formula.
The constant of proportionality happens to be the magic number 1.4826 for some technical reason. For example, if we have just 3 values 1,2 and 4, then
• the median is 2,
• absolute deviations from median are |1-2|=1, |2-2|=0 and |4-2|=2,
• median of the absolute deviations is 1,
The function `mad` computes this.
```mad(c(1,2,4))
```
For example,
```mad(hip[,1])
```
We want to compute both the median and MAD using one function. So we write
```f = function(x) c(median(x), mad(x))
f(hip[,1])
```

 Exercise: What will be the result of the following? ```apply(hip,2,f) ```

There is yet another way to measure the dispersion of a data set. This requires the concept of a quantile. Some examinations report the grade of a student in the form of percentiles. A 90-percentile student is one whose grade is exceeded by 10% of all the students. The quantile is the same concept except that it talks about proportions instead of percentages. Thus, the 90-th percentile is 0.90-th quantile.

 Exercise: The median of a data set is the most central value. In other words, exactly half of the data set exceeds the median. So for what value of p is the median the p-th quantile?

The R function `quantile` (not surprisingly!) computes quantiles.
```quantile(hip[,1],0.10)
quantile(hip[,1],0.50)
median(hip[,1])
```
The 0.25-th and 0.75-th quantiles are called the first quartile and the third quartile, respectively.

 Exercise: What is the second quartile?

```quantile(hip[,1],c(0.25,0.50,0.75))
```
The difference between first and third quartiles is another measure of the dispersion of a data set, and is called the InterQuartile Range (IQR). There is function called `summary` that computes quite a few of the summary statistics.
```summary(hip)
```

 Exercise: Look up the online help for the functions `cov` and `cor` to find out what they do. Use them to find the covariance and correlation between `RA` and `pmRA`.

### Handling missing values

So far we have ignored the `NA` problem completely. The next exercise shows that this is not always possible in R.

 Exercise: The function `var` computes the variance. Try `apply`ing it to the columns of our data set.

`NA` denotes missing data in R. It is like a different kind of number in R (just like Inf, or NaN). Any mathematics with `NA` produces only `NA`
```NA + 2
NA - NA
```
The function `is.na` checks for presence of `NA`s in a vector or matrix.
```x = c(1,2,NA)
is.na(x)
any(is.na(x))
```
The function `any` reports TRUE if there is at least one TRUE in its argument vector. The `any` and `is.na` combination is very useful. So let us make a function out of them.
```hasNA = function(x) any(is.na(x))
```

 Exercise: What is the consequence of this? ```apply(hip,2,hasNA) ```

This exercise shows that only the last variable has `NA`s in it. So naturally the following commands
```min(B.V)
max(B.V)
mean(B.V)
```
all return `NA`. But often we want to apply the function on only the non-NAs. If this is what we want to do all the time then we can omit the `NA` from the data set it self in the first place. This is done by the `na.omit` function
```hip1 = na.omit(hip)
dim(hip)
dim(hip1)
```
This function takes a very drastic measure: it simply wipes out all rows with at least one `NA` in it.
```apply(hip,2,mean)
apply(hip1,2,mean)
```
Notice how the other means have also changed. Of course, you may want to change only the `B.V` variable. Then you need
```B.V1 = na.omit(hip[,"B.V"])
```

 Exercise: Compute the variances of all the columns of `hip1` using `apply`.

There is another way to ignore the `NA`s without omitting them from the original data set.
```mean(hip[,"B.V"],na.rm=T)
var(hip[,"B.V"],na.rm=T)
```
Here `na.rm` is an argument that specifies whether `NA`s should be removed. By setting it equal to `T` (or `TRUE`) we are asking the function to remove all the obnoxious `NA`s.

You can use this inside `apply` as well
```apply(hip,2,var,na.rm=T)
```

### Attaching a data set

A data set in R is basically a matrix where each column denotes a variable. The `hip` data set, for example, has 8 variables (after removing the first column) whose names are obtained as
```names(hip)
```
To access the RA variable we may use
```hip[,"RA"] # too much to type
```
or
```hip[,3] # requires remembering the column number
```
Fortunately, R allows a third mechanism to access the individual variables in a data set that is often easier. Here you have to first `attach` the data set
```attach(hip)
```
This unpacks the data set and makes its columns accessible by name. For example, you can now type
```RA # instead of hip[,"RA"]
mean(RA)
hasNA(RA)
```
We can of course still write
```hip[,"RA"]
```

### Making plots

Graphical representations of data are a great way to get a ``feel'' about a data set, and R has a plethora of plotting functions.

#### Boxplots

Consider the two data sets shown along a number line.

 Two data sets

When we look at the data sets for the first time our eyes pick up the following details:
• the blue data set (topmost) has smaller spread than the red one
• the central tendency of blue data set is more to the right than the red one
• there are some blue points somewhat away from the bulk of the data.
In other words, our eye notices where the bulk of the data is, and is also attracted by points that are away from the bulk. The boxplot is a graphical way to show precisely these aspects.

 Boxplots for the two data sets

It requires some knowledge to interpret a boxplot (often called a box-and-whiskers plot). The following diagram might help.

 An annotated boxplot

Let us use the `boxplot` function on our data set.
```boxplot(Vmag)
```
Boxplots are usually more informative when more than one variable are plotted side by side.
```boxplot(hip)
```
The size of the box roughly gives an idea about the spread of the data.
 Boxplots are not supposed to be terribly informative, but they are often handy for obtaining a rough idea about a data set.

#### Scatterplots

Next let us make a scatterplot.
```plot(RA,DE)
```
This produces a scatterplot, where each pair of values is shown as a point. R allows a lot of control on the appearance of the plot. See the effect of the following.
```plot(RA,DE,xlab="Right ascension",ylab="Declination",
main="RA and DE from Hipparcos data")
```
You may change the colour and point type.
```plot(RA,DE,pch=".",col="red")
```
Sometimes it is important to change the colours of some points. Suppose that we want to colour red all the points with `DE` exceeding 0. Then the `ifelse` function comes handy.
```cols = ifelse(DE>0,"red","black")
cols
```
This means "`cols` is red if `DE`>0, else it is black".
```plot(RA,DE,col=cols)
```
You may similarly use a vector for `pch` so that different points are shown differently. There are many other plotting options that you can learn using the online help. We shall explain some of these during these tutorials as and when needed.
 To learn about the different plotting options in R you need to look up the help of the `par` function. ```?par ``` It has a long list of options. Before attempting to make your first publication-quality graph with R you should better go through this list.

 Exercise: Make a scatterplot of `RA` and `pmRA`. Do you see any pattern?

Instead of making all such plots separately for different pairs of variables we can make a scatterplot matrix
```plot(hip,pch=".")
```

#### Histograms

Histograms show how densely or sparsely the values of a variable lie at different points.
```hist(B.V)
```
The histogram shows that the maximum concentration of values occurs from 0.5 to 1. The vertical axis shows the number of values. A bar of height 600, standing on the range 0.4 to 0.6, for example, means there are 600 values in that range. Some people, however, want to scale the vertical axis so that the total area of all the rectangles is 1. Then the area of each rectangle denotes the probability of its range.
```hist(B.V,prob=T)
```

#### Multiple plots

Sometimes we want more than one plot in a single page (mainly to facilitate comparison and printing). The way to achieve this in R is rather weird. Suppose that we want 4 plots laid out as a 2 by 2 matrix in a page. Then we need to write
```oldpar = par(mfrow=c(2,2))
```
The `par` function sets graphics options that determines how subsequent plots should be made.
 The `par` function controls the global graphics set up. All future plots will be affected by this function. Every time it is called the old set up is returned by the function. It is a good idea to save this old set up (as we have in a variable called `oldpar`) so that we can restore the old set up later.
Here `mfrow` means multi-frame row-wise. The vector `c(2,2)` tells R to use a 2 by 2 layout. Now let us make 4 plots. These will be added to the screen row by row.
```x = seq(0,1,0.1)
plot(x,sin(x),ty="l")
hist(RA)
plot(DE,pmDE)
boxplot(Vmag)
```
To restore the original ``one plot per page'' set up use
```par(oldpar)
```

Sometimes we want to add something (line, point etc) to an existing plot. Then the functions `abline`, `lines` and `points` are useful.
```plot(RA,DE)
abline(a=-3.95,b=0.219)
```
This adds the line y = a + bx to the plot. Also try
```abline(h=0.15)
abline(v=18.5)
```
To add curved lines to a plot we use the `lines` function.
```x = seq(0,10,0.1)
plot(x,sin(x),ty="l")
lines(x,cos(x),col="red")
```
We can add new points to a plot using the `points` function.
```points(x,(sin(x)+cos(x))/2,col="blue")
```
There are more things that you can add to a plot. See, for example, the online help for the `text` and `rect` functions.

Sometimes we have to work with only a subset of the entire data. We shall illustrate this next by selecting only the Hyades stars from the data set. To do this we shall use the facts
 This are borrowed from Prof Hunter's notes, where he uses astronomy knowledge to obtain these conditions by making suitable plots. The interested reader is encouraged to look into his notes for details.
that the Main Sequence Hyades stars have
• `RA` in the range (50,100)
• `DE` in the range (0,25)
• `pmRA` in the range (90,130)
• `pmDE` in the range (-60,-10)
• `e_Plx` <5
• `Vmag` >4 OR `B.V` <0.2 (this eliminates 4 red giants)
Let us see how we apply these conditions one by one. First, we shall `attach` the data set so that we may access each variable by its name.
```attach(hip)
```
Next we shall apply the conditions as filters.
```filter1 =  (RA>50 & RA<100 & DE>0 & DE<25)
filter2 = (pmRA>90 & pmRA<130 & pmDE>-60 & pmDE< -10)
filter3  =  filter1 & filter2 & e_Plx<5
HyadesFilter = filter3 & (Vmag>4 |B.V <0.2)
```
The `&` denotes (as expected) logical AND while the vertical bar `|` denotes logical OR.

We are going to need this filter in the later tutorials. So it is a good idea to save the following lines in a script file called, say, hyad.r. (Do not type these in R all over again now! They are for future use. Just copy and paste them to a file called `hyad.r`.)
```hip = read.table("HIP.dat",head=TRUE)
attach(hip)
filter1 =  (RA>50 & RA<100 & DE>0 & DE<25)
filter2 = (pmRA>90 & pmRA<130 & pmDE>-60 & pmDE< -10)
filter3  =  filter1 & filter2 & (e_Plx < 5)
HyadesFilter = filter3 & (Vmag>4 | B.V<0.2)
```

By the way, the filters are all just vectors of `TRUE`s and `FALSE`s. The entry for a star is `TRUE` if and only if it is a Hyades star.

Now we shall apply the filter to the data set. This produces a new (filtered) data set which we have called `hyades`. Finally we `attach` this data set.
```hyades = hip[HyadFilter,]
You'll see a bunch of warning messages when you `attach` the filtered data set. This is because the old (unfiltered) variables are now being superseded by the new (filtered) variables of the same name.
```dim(hyades)