Start learning R with an astronomical dataset

Obtaining astronomical datasets

The astronomical community has a vast complex of on-line databases.  Many databases are hosted by data centers such as NASA's archive research centers, the Centre des Donnees astronomiques de Strasbourg (CDS), the NASA/IPAC Extragalactic Database (NED), and the Astrophysics Data System (ADS).   The Virtual Observatory (VO) is developing new flexible tools for accessing, mining and combining datasets at distributed locations;  see the Web sites for the international, European, and U.S. VO for information on recent developments.  The VO Web Services, Summer Schools, and Core Applications provide helpful entries into these new capabilities.

We initially treat here only input of tabular data such as catalogs of astronomical sources.  We give two examples of interactive acquisition of tabular data.  One of the multivariate tabular datasets used here is a dataset of stars observed with the European Space Agency's Hipparcos satellite during the 1990s.  It gives a table with 9 columns and 2719 rows giving Hipparcos stars lying between 40 and 50 parsecs from the Sun. The dataset was acquired using CDS's Vizier Catalogue Service as follows:

   In Web browser, go  http://vizier.u-strasbg.fr/viz-bin/VizieR?-source=I/239/hip_main
   Set Max Entries to 9999, Output layout ASCII table with |-separated values
   Remove "Compute r" and "Compute Position" buttons
   Set parallax constraint "20 .. 25" to gives stars between 40 and 50 pc
   Retrieve 9 properties: HIP, V, RA, Dec, Plx, pmRA, pmDE, e_Plx, and B-V
   Submit Query
   Use ASCII editor to trim header to one line with variable names
   Trim trailer
   Save ASCII file on disk for ingestion into R


ASCII data input into R

Enter R by typing "R" (UNIX) or double-clicking to execute Rgui.exe (Windows).  In the commands below, we start by extracting some system and user information, the R.version you are using, and some of its capabilities.  There are more advanced functions to create, open and close connections to (compressed) files, pipes, urls, and sockets.  Files can be opened for reading or writing in text or binary modes.  citation tells how to cite R in publications.  R is released under the GNU Public Licence, as indicated by copyright.

   Sys.info()
   R.version
   capabilities()
   citation()
   ?copyright

Note: R is case-sensitive. The up-arrow key can retrieve previous commands, which may be edited.

Second, we navigate R to the work directory containing the ASCII dataset.  list.files has a variety of useful options, and is only one of several utilities interfacing to the computer's files. In the setwd command, note that in Windows, path (directory) names are not case-sensitive and may contain either forward slashes or backward slashes; in the latter case, a backward slash must be written as "\\" when enclosed in quotation marks.

   setwd("c:/data_directory")
   getwd()
   list.files()

A user-provided ASCII table is converted into R's `data frame' using read.table.   Here are two examples which read the table and convert it into an array named "x".:

   # These are examples only; they will (probably) produce errors:
   x = read.table("MyData1", header=TRUE, comment.char="%")
   x = read.table("c:\\documents and settings\\MyName\\Desktop\\MyData.tsv", sep="|", na.strings="-9")


The "#" mark is used for internal comments within the R commands.  The default column separators are spaces or tabs. If header=TRUE, variable names are assumed to occupy the first line.  Variable names may not have embedded spaces, and non-alphanumeric characters (such as "_") are converted to ".".  Full astronomical headers (e.g. FITS, VOTable) should be commented out with comment.char.  Empty cells, or cells with na.strings values, are considered to be Not Available (NA), a T/F logical constant. This can be tested later using is.na(x), or removed from consideration using na.omit(x).  Mistakes in the data file are often returned as Not a Number (NaN).

As preparation for the exercises to follow, we now download a cleaned-up version of the Hipparcos dataset described above into R. The first step is to copy the file HIP_star.dat, found at http://astrostatistics.psu.edu/datasets/HIP_star.html, into some directory and then use the setwd command to change the working directory to this directory. Then, type

   x = read.table("HIP_star.dat",header=T,fill=T)


Characterize the dataset

The following R commands list the dimensions of the dataset and print the variable names (from the single-line header).  Then we list the first row, and list the first 20 rows for the 7th column.  Finally, we find the sum of the 3rd column. 

   dim(x)
   names(x)
   x[1,]
   x[1:20,7]
   sum(x[,3])

List the maximum, minimum, mean, median, and mean absolute deviation (similar to standard deviation)  of each column.  First we do this using a for-loop, which is a slow process in R.  c is a generic R function that combines its arguments into a vector. Then we do it in a more efficient fashion using the apply command.  Here the "2" refers to columns in the x array; a "1" would refer to rows.  print is a generic R command that prints the result. colMeans is faster than the apply command.

   for(i in 1:ncol(x)) {
      print(c(max(x[,i]), min(x[,i]), median(x[,i]), mad(x[,i])))
   }
   apply(x, 2, max)
   apply(x, 2, min)
   apply(x, 2, median)
   apply(x, 2, mad)
   colMeans(x)

Note that the curly braces {} in the for loop above are optional because there is only a single command inside.

A vector can be sorted using the Shellsort or Quicksort algorithms; rank returns the order of values in a numeric vector; and order returns a vector of indices that will sort a vector. The last of these functions, order, is often the most useful of the three, because it allows one to reorder all of the rows of a matrix according to one of the columns:

   sort(x[1:10,3])
   x[order(x[1:10,3]),]

Each of the above lines gives the sorted values of the first ten entries of the third column, but the second line reorders each of the ten rows in this order. Note that neither of these commands actually alters the value of x, but we could reassign x to equal its sorted values if desired.


Learning R syntax

Arithmetic in R is straightforward.  Some common operators are: + for addition, - for subtraction, * for multiplication, / for division, %/% for integer division,  %% for modular arithmetic, ^ for exponentiation.  The help page for these operators may accessed by typing, say,

   ?'+'

Some common built-in functions are exp for the exponential function, sqrt for square root, log10 for base-10 logarithms, and cos for cosine. The syntax resembles "sqrt(z)".   Comparisons are made using < (less than), <= (less than or equal), == (equal to) with the syntax "a >= b".  To test whether a and b are exactly equal and return a TRUE/FALSE value (for instance, in an "if" statement), use the command identical(a,b) rather a==b.  Compare the following two ways of comparing the vectors a and b:

   a=c(1,2);b=c(1,3)
   a==b
   identical(a,b)

Also note that in the above example, 'all(a==b)' is equivalent to 'identical(a,b)'.

R also has other logical operators such as & (AND), | (OR), ! (NOT). There is also an xor (exclusive or) function. Each of these four functions performs elementwise comparisons in much the same way as arithmetic operators:

   a=c(TRUE,TRUE,FALSE,FALSE);b=c(TRUE,FALSE,TRUE,FALSE)
   !a
   a & b
   a | b
   xor(a,b)

However, when 'and' and 'or' are used in programming, say in 'if' statements, generally the '&&' and '||' forms are preferable. These longer forms of 'and' and 'or' evaluate left to right, examining only the first element of each vector, and evaluation terminates when a result is determined. Some other operators are listed here.

It is important to understand how R uses the "=" sign.  The syntax "y <- 22" means "y is assigned the value 22".  Recently, R has been changed to accept the syntax "y = 22" to have the same meaning; we use this latter terminology here, although most documentation is older and uses the "<-" terminology.  The expression "y == x^2" evaluates as TRUE or FALSE, depending upon whether y equals x squared, and performs no assignment.

Let's continue with simple characterization of the dataset: find the row number of the object with the smallest value of the 4th column using which.min.  A longer, but instructive, way to accomplish this task creates a long vector of logical constants (tmp), mostly FALSE with one TRUE, then pick out the row with "TRUE". 

   which.min(x[,4])
   tmp= (x[,4]==min(x[,4]))
   (1:nrow(x))[tmp]    #   or equivalently,
   which(tmp)

The cut function divides the range of x into intervals and codes the values of x according to which interval they fall.  It this is a quick way to group a vector into bins.  Use the "breaks" argument to either specify a vector of bin boundaries, or give the number of intervals into which x should be cut.  Bin string labels can be specified.  Cut converts numeric vectors into an R object of class  "factor" which can be ordered and otherwise manipulated; e.g. with command levels.  A more flexible method for dividing a vector into groups using user-specified rules is given by split.

   table(cut(x[,"Plx"],breaks=20:25))

The command above uses several tricks. Note that a column in a matrix may be referred to by its name (e.g., "Plx") instead of its number. The notation '20:25' is short for 'c(20,21,22,23,24,25)' and in general, 'a:b' is the vector of consecutive integers starting with a and ending with b (this also works if a is larger than b). Finally, the table command tabulates the values in a vector or factor.


Dealing with blank cells

We may want to characterize the blank cells for which R has inserted the value NA.  First, we use is.na to produce a matrix, the same size as x, with TRUE entries where NA appears.  The sum command here will coerce TRUE to 1 and FALSE to 0, and add the matrix to give the total number of NA's in x. 

   sum(is.na(x))

Now we print the whole row wherever an NA is found.  First, we create a function called 'f' and use it in the apply command analogously to the built-in functions mean and median above.  tmp here is a vector of TRUE and FALSE, where a TRUE indicates that the corresponding row in x has an NA. The last command displays all such rows.

   f = function(a) any(is.na(a))
   tmp=apply(x,1,f)
   x[tmp,]

We can do the same operation without using the "tmp" array in a more elegant, but less obvious, fashion:

   x[apply(x,1,function(t) any(is.na(t))),]

The write command is used to create a text file of an object. Alternatives to the write command include write.table, and write.matrix.  write.matrix is most efficient for large datasets, but requires that you load the MASS package.  All of these commands produce text versions of R objects. To save R objects solely for loading later into R, use the function save. R objects that have been saved can later be reacquired using load.

   library(MASS)
   write.matrix(x[tmp,],file="Rows_with_NA.dat")

The last command created a file in the current directory; feel free to verify its contents!

The user ends an R session by typing the q() command. At this point, R will ask whether to save the workspace image. If the user chooses "yes", then all of the objects currently in memory are stored in a file called .RData and the commands typed during the session are appended to the .Rhistory file. These objects are automatically loaded when R is next invoked from that directory, and the old commands may be accessed using the up-arrow key.


R scripts

While R is often run interactively, one often wants to carefully construct R scripts and run them later.  A file containing R code can be run using the source command. In addition, R may be run in batch mode.

The editor Emacs, together with "Emacs speaks statistics", provides a nice way to produce R scripts.