Overview of Summer School in Statistics for Astronomers IV

June 9-14, 2008

G. Jogesh Babu


This is an overview of the statistical concepts and methods covered in the summer school. Eric Feigelson starts with an overview of astrostatistics, giving a brief description of modern astronomy and astrophysics. He describes how many statistical concepts have their roots in astronomy, starting with Hipparchus in the 2nd c. BC. He discusses:

  • Relevance of statistics in astronomy today
  • State of astrostatistics today
  • Methodological challenges for astrostatistics in the 2000s

Derek Young starts the computer lab session with an introduction to the R programming language and descriptive statistics. R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. Descriptive statistics describe the basic features of the data in an observational study and provide simple summaries of the sample and the measures. Commonly used techniques such as graphical description, tabular description, and summary statistics are illustrated using R.
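As a rough illustration of the summary statistics mentioned above, here is a minimal sketch. It is written in Python rather than the R used in the lab, the function name describe is hypothetical, and the magnitudes are invented for the example.

```python
import statistics

def describe(sample):
    """Return basic summary statistics for a numeric sample."""
    # Quartiles with the module's default ("exclusive") interpolation method
    q1, _, q3 = statistics.quantiles(sample, n=4)
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "stdev": statistics.stdev(sample),  # sample standard deviation (n - 1 denominator)
        "min": min(sample),
        "max": max(sample),
        "iqr": q3 - q1,                     # interquartile range
    }

# Hypothetical magnitudes, invented purely for illustration
magnitudes = [12.1, 12.4, 11.9, 12.8, 12.3, 12.0, 14.5]
summary = describe(magnitudes)
```

The single large value (14.5) pulls the mean above the median, which is the kind of feature a summary table or plot is meant to expose.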

Derek Young also presents exploratory data analysis (EDA), an approach to analyzing data for the purpose of formulating hypotheses worth testing, which complements the tools of conventional statistics for testing hypotheses. EDA employs a variety of techniques (mostly graphical) to:

  • maximize insight into a data set
  • uncover underlying structure
  • extract important variables
  • detect outliers and anomalies
  • test underlying assumptions
  • develop parsimonious models, and
  • provide a basis for further data collection through surveys or experiments

Mosuk Chow introduces the basic principles of probability theory, which is at the heart of statistical analysis. The topics include conditional probability, Bayes' theorem (on which Bayesian analysis is based), expectation, variance, standard deviation (which helps in constructing unit-free estimates), the density of a continuous random variable (as opposed to density as defined in physics), the normal (Gaussian) distribution, the chi-square distribution (not to be confused with the chi-square statistic), and other important distributions. They also include some probability inequalities and the Central Limit Theorem.
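For a binary hypothesis H and data D, Bayes' theorem reads P(H|D) = P(D|H) P(H) / [P(D|H) P(H) + P(D|not H) P(not H)]. The sketch below evaluates it in Python; the function name and the numbers (a hypothetical classifier flagging rare objects) are assumptions made for illustration, not material from the lecture.

```python
def posterior(prior, p_data_given_h, p_data_given_not_h):
    """P(H | D) from Bayes' theorem for a binary hypothesis H."""
    # Total probability of the data under both alternatives
    evidence = p_data_given_h * prior + p_data_given_not_h * (1.0 - prior)
    return p_data_given_h * prior / evidence

# Hypothetical numbers: 1% of sources belong to a rare class; a classifier
# flags 95% of true members and 5% of non-members.
p_member_given_flag = posterior(prior=0.01,
                                p_data_given_h=0.95,
                                p_data_given_not_h=0.05)
```

With these invented numbers a flagged source is still more likely a non-member than a member, because the class is rare to begin with.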

Mosuk Chow also lectures on correlation and regression, including the correlation coefficient, the underlying principles of linear and multiple linear regression, least squares estimation, ridge regression, and principal components, among others. This lecture is followed by a discussion of linear regression issues in astronomy by Eric Feigelson. He compares different regression lines used in astronomy, and illustrates them with the Faber-Jackson relation.
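The following Python sketch computes the correlation coefficient and two least squares lines for the same data: the regression of y on x, and the regression of x on y re-expressed as a slope in the same plane. Which lines the lecture actually compares is not stated here, so this pairing is an assumption, and the function name and data are hypothetical.

```python
import math

def regression_lines(x, y):
    """Least squares fits and the correlation coefficient for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx                      # ordinary least squares of y on x
    return {
        "slope_y_on_x": slope,
        "intercept_y_on_x": mean_y - slope * mean_x,
        # Regression of x on y, written as dy/dx; requires sxy != 0
        "slope_x_on_y": syy / sxy,
        "r": sxy / math.sqrt(sxx * syy),   # Pearson correlation coefficient
    }

# Invented data for illustration
fit = regression_lines([1, 2, 3, 4, 5], [1.2, 1.9, 3.2, 3.8, 5.1])
```

The two slopes agree only when the correlation is perfect; their ratio equals r squared, which is one reason the choice of regression line matters for scattered data.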

Descriptive statistics are typically distinguished from inferential statistics. While the lab sessions on descriptive statistics provide tools to describe what the data show, inferential statistics helps to reach conclusions that extend beyond the immediate data. For instance, statistical inference is used to judge whether an observed difference between groups in a study is a dependable one or one that might have happened by chance. James Rosenberger's lecture on statistical inference focuses on methods of point estimation, confidence intervals for unknown parameters, and the basic principles of hypothesis testing.
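A point estimate and a confidence interval for a mean can be sketched as follows. This Python example uses the large-sample normal approximation, which is an assumption (a t-based interval would be more appropriate for a sample this small); the function name and data are hypothetical.

```python
import math
import statistics

def mean_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the population mean."""
    n = len(sample)
    center = statistics.mean(sample)                    # point estimate
    std_err = statistics.stdev(sample) / math.sqrt(n)   # standard error of the mean
    # Two-sided critical value of the standard normal distribution
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)
    return center - z * std_err, center + z * std_err

# Invented measurements for illustration
measurements = [9.8, 10.1, 10.0, 10.3, 9.9, 10.2, 10.0, 9.7]
lower, upper = mean_confidence_interval(measurements)
```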

Likelihood is the hypothetical probability that an event that has already occurred would yield a specific outcome. The concept differs from that of a probability in that a probability refers to the occurrence of future events, while a likelihood refers to past events with known outcomes. Maximum likelihood estimation (MLE) is a popular statistical method for fitting a mathematical model to data. Modeling real-world data by maximum likelihood offers a way of tuning the free parameters of the model to provide a good fit. Thomas Hettmansperger's lecture includes the maximum likelihood method for linear regression, an alternative to the least squares method. He also presents the Cramér-Rao inequality, which sets a lower bound on the error (variance) of an unbiased estimator of a parameter. It helps in finding the `best' estimator. Hettmansperger also discusses the analysis of data from two or more different populations by considering mixture models. Here the likelihood calculations are difficult, so he introduces an iterative device called the EM algorithm. Derek Young illustrates likelihood computations and the EM algorithm using R.
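To make the EM idea concrete, here is a minimal Python sketch for a mixture of two one-dimensional Gaussians. It is not the R code from the lab; the function names, the crude starting values, and the simulated data are all assumptions chosen for illustration.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def em_two_gaussians(data, n_iter=100):
    """EM iterations for a two-component Gaussian mixture in one dimension."""
    # Crude start (an assumption): components at the data extremes
    mu1, mu2 = min(data), max(data)
    sigma1 = sigma2 = (max(data) - min(data)) / 4.0 or 1.0
    weight = 0.5  # mixing proportion of component 1
    n = len(data)
    for _ in range(n_iter):
        # E-step: probability that each point belongs to component 1
        resp = []
        for x in data:
            p1 = weight * normal_pdf(x, mu1, sigma1)
            p2 = (1.0 - weight) * normal_pdf(x, mu2, sigma2)
            resp.append(p1 / (p1 + p2))
        # M-step: weighted maximum likelihood updates of the parameters
        n1 = sum(resp)
        n2 = n - n1
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n2
        var1 = sum(r * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1
        var2 = sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(resp, data)) / n2
        sigma1 = max(math.sqrt(var1), 1e-6)  # floor guards against a collapsing component
        sigma2 = max(math.sqrt(var2), 1e-6)
        weight = n1 / n
    return weight, (mu1, sigma1), (mu2, sigma2)

# Simulated data from two populations (invented for illustration)
rng = random.Random(42)
sample = ([rng.gauss(0.0, 1.0) for _ in range(300)]
          + [rng.gauss(5.0, 1.0) for _ in range(300)])
weight, (mu1, sigma1), (mu2, sigma2) = em_two_gaussians(sample)
```

Each iteration alternates between assigning soft memberships (E-step) and refitting each component to its weighted points (M-step), which sidesteps direct maximization of the mixture likelihood.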

Thomas Hettmansperger's second lecture is on nonparametric statistics. Nonparametric (or distribution-free) inferential statistical methods are procedures which, unlike parametric statistics, make few or no assumptions about the probability distributions of the population. Here, the model structure is not specified a priori but is instead determined from the data. As nonparametric methods make fewer assumptions, their applicability is much wider than that of the corresponding parametric methods. In this lecture he describes some simple nonparametric procedures such as the sign test, the Mann-Whitney two-sample test, and the Kruskal-Wallis test for comparing several samples.
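The sign test is simple enough to sketch in a few lines. The Python function below (a hypothetical name, not taken from the lecture) computes an exact two-sided p-value from the binomial distribution, dropping zero differences as is conventional.

```python
import math

def sign_test_p_value(differences):
    """Exact two-sided sign test of H0: the median difference is zero."""
    nonzero = [d for d in differences if d != 0]  # ties carry no sign information
    n = len(nonzero)
    n_pos = sum(1 for d in nonzero if d > 0)
    k = min(n_pos, n - n_pos)
    # Under H0 the number of positive signs is Binomial(n, 1/2)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Invented paired differences: nine positive, one negative
p_value = sign_test_p_value([0.4, 1.2, 0.3, 0.8, -0.2, 0.5, 0.9, 1.1, 0.6, 0.7])
```

Only the signs enter the calculation, so no assumption about the shape of the underlying distribution is needed.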

Bayesian inference is statistical inference in which evidence or observations are used to update or to newly infer the probability that a hypothesis may be true. The name "Bayesian" comes from the frequent use of Bayes' theorem. As evidence accumulates, the degree of belief in a hypothesis ought to change. In many statistical problems, failure to take prior knowledge into account can lead to inferior conclusions. Of course, the quality of a Bayesian analysis depends on how well one can convert the prior information into a mathematical prior probability. Thomas Loredo describes various methods for parameter estimation, model assessment, etc., and illustrates them with examples from astronomy.
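A minimal sketch of belief updating, assuming a Beta prior on an unknown success probability and binomial data (a standard conjugate pair, chosen here for simplicity and not necessarily an example used in the lecture). The function names and counts are hypothetical.

```python
def update_beta(alpha, beta, successes, failures):
    """Posterior Beta parameters after binomial data, given a Beta(alpha, beta) prior."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (1.0, 1.0)  # uniform prior on the unknown probability

# Evidence arriving in two batches (invented counts)
after_first = update_beta(*prior, successes=3, failures=1)
after_second = update_beta(*after_first, successes=4, failures=2)

# The same evidence processed all at once gives the same posterior
all_at_once = update_beta(*prior, successes=7, failures=3)
```

The posterior from the first batch serves as the prior for the second, which is the sense in which belief changes as evidence accumulates.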

The lecture on Multivariate analysis by James Rosenberger introduces the statistical analysis of data containing observations of two or more variables that may depend on each other. The methods include principal components analysis, which reduces the number of variables, and canonical correlation. The lecture covers many important topics including testing of hypotheses, constructing confidence regions for multivariate parameters, multivariate regression, and discriminant analysis (supervised learning). Derek Young covers computational aspects of multivariate analysis in an interactive lab session.
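The first principal component is the direction of largest variance, that is, the leading eigenvector of the sample covariance matrix. The Python sketch below finds it by power iteration, a technique chosen here only because it needs no linear algebra library; the function names and data are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix (n - 1 denominator) of a list of observations."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)] for i in range(p)]

def first_principal_component(rows, n_iter=500):
    """Leading eigenvector and eigenvalue of the covariance matrix by power iteration."""
    cov = covariance_matrix(rows)
    p = len(cov)
    # Asymmetric start vector; the method fails if it happens to be
    # orthogonal to the leading eigenvector or if the data have no variance.
    v = [1.0 / (j + 1) for j in range(p)]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Variance along the component (Rayleigh quotient of the unit vector v)
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    return v, variance

# Invented two-variable data lying exactly on the line y = 2x
direction, variance = first_principal_component([(0, 0), (1, 2), (2, 4), (3, 6)])
```

For these points all of the variance lies along one direction, so a single component describes the two variables completely.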

G. J. Babu introduces a resampling procedure called the bootstrap. It is essentially about how to get the most out of repeated use of the data. The bootstrap is similar to the Monte Carlo method, but the `simulation' is carried out from the data itself. It is a very general, mostly non-parametric procedure, and is widely applicable. Applications to regression, cases where the procedure fails, and cases where it outperforms traditional procedures are also discussed. The lecture also covers curve fitting (model fitting or goodness of fit) using the bootstrap procedure. This procedure is important because the commonly used Kolmogorov-Smirnov procedure does not work in the multidimensional case, or when the parameters of the curve are estimated from the data. Some of these procedures are illustrated using R in a lab session on hypothesis testing and bootstrapping by Derek Young.
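The resampling idea can be sketched as follows: draw samples with replacement from the data, recompute the statistic each time, and read an interval off the resulting distribution. This Python example uses the simple percentile interval, which is only one of several bootstrap intervals and is an assumption here; the function name and data are hypothetical.

```python
import random
import statistics

def bootstrap_percentile_ci(data, stat=statistics.median, n_boot=2000,
                            level=0.95, seed=0):
    """Percentile bootstrap confidence interval for a statistic of the data."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Each replicate: resample n points with replacement, recompute the statistic
    replicates = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    tail = (1.0 - level) / 2.0
    lower = replicates[int(tail * n_boot)]
    upper = replicates[int((1.0 - tail) * n_boot) - 1]
    return lower, upper

# Invented observations for illustration
observations = [2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 6.1, 3.3, 4.9, 2.5]
ci_lower, ci_upper = bootstrap_percentile_ci(observations)
```

No distributional form is assumed for the data; the observed sample stands in for the unknown population.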

The lecture on Model selection, evaluation, and likelihood ratio tests by Bruce Lindsay covers model selection procedures, starting with the chi-square test, Rao's score test, and the likelihood ratio test. The discussion also includes cross-validation.
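As an illustration of a likelihood ratio test, the Python sketch below tests whether Poisson counts are consistent with a specified rate. The Poisson setting, the function names, and the counts are assumptions made for the example; the p-value relies on the large-sample chi-square approximation with one degree of freedom.

```python
import math

def chi2_sf_1df(x):
    """P(chi-square with 1 degree of freedom > x), via the normal tail."""
    return math.erfc(math.sqrt(x / 2.0))

def poisson_lrt(counts, rate0):
    """Likelihood ratio test of H0: rate = rate0 for independent Poisson counts."""
    n = len(counts)
    total = sum(counts)
    rate_hat = total / n  # maximum likelihood estimate of the rate
    if total == 0:
        stat = 2.0 * n * rate0
    else:
        # 2 * [loglik(rate_hat) - loglik(rate0)]; the log(x!) terms cancel
        stat = 2.0 * (total * math.log(rate_hat / rate0) - n * (rate_hat - rate0))
    stat = max(stat, 0.0)  # guard against tiny negative rounding error
    return stat, chi2_sf_1df(stat)

# Invented counts compared against a hypothesized rate of 4 per bin
statistic, p_value = poisson_lrt([10, 12, 9, 11], rate0=4.0)
```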

The two lectures on Time Series & Stochastic Processes by John Fricks and Eric Feigelson provide an overview of time series analysis and, more generally, stochastic processes, including time domain procedures, state space models, kernel smoothing, and illustrations with examples from astronomy. The first lecture also includes a number of commonly used examples, such as Poisson processes, and focuses on spectral methods for inference. A brief discussion of the Kalman filter is also included.
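One basic spectral tool is the periodogram, which measures how much of a series' variability sits at each frequency. The Python sketch below computes it for an evenly sampled series with a direct (slow, O(n^2)) discrete Fourier transform; the function name, the normalization, and the test signal are assumptions made for illustration.

```python
import cmath
import math

def periodogram(series):
    """Periodogram of an evenly sampled series at frequencies k/n, k = 0..n//2."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]  # remove the mean before transforming
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        power.append(abs(coeff) ** 2 / n)
    return power

# Invented signal: a sinusoid completing 5 cycles over 64 samples
signal = [math.sin(2.0 * math.pi * 5 * t / 64) for t in range(64)]
power = periodogram(signal)
peak_index = max(range(len(power)), key=power.__getitem__)
```

The index of the largest value identifies the dominant frequency in cycles per series length.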

Monte Carlo methods are a collection of techniques that use pseudo-random (computer simulated) values to estimate solutions to mathematical problems. In the tutorial on Markov chain Monte Carlo (MCMC), Murali Haran discusses Monte Carlo methods for Bayesian inference. In particular, the MCMC method for evaluating expectations with respect to a probability distribution is illustrated. Monte Carlo methods can also be used for a variety of other purposes, including estimating maxima or minima of functions (as in likelihood-based inference). MCMC procedures have been used successfully in the search for extra-solar planets.
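A minimal random-walk Metropolis sampler shows how MCMC estimates expectations. This Python sketch targets a standard normal distribution so the answers are known in advance; the function name, step size, chain length, and burn-in are assumptions chosen for illustration, not settings from the tutorial.

```python
import math
import random

def metropolis(log_target, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target density."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    x = start
    log_p = log_target(x)
    chain = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)  # symmetric proposal
        log_p_new = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x))
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        chain.append(x)
    return chain

# Target: standard normal, specified up to a constant by its log density
chain = metropolis(lambda x: -0.5 * x * x, start=0.0, n_samples=20000, step=2.0)
kept = chain[1000:]  # discard an initial burn-in segment
est_mean = sum(kept) / len(kept)
est_second_moment = sum(x * x for x in kept) / len(kept)
```

Only ratios of the target density are needed, which is why the method suits Bayesian posteriors whose normalizing constant is unknown.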

In his lecture on Spatial Statistics, Murali Haran teaches spatial point processes, the intensity function, homogeneous and inhomogeneous Poisson processes, and estimation of Ripley's K function (a statistic useful for point pattern analysis).
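Ripley's K at radius r can be estimated as the region's area times the number of ordered point pairs within distance r, divided by n(n - 1). The Python sketch below implements that naive estimator with no edge correction, a simplification that practical analyses usually avoid; the function name and point pattern are hypothetical.

```python
import math

def ripley_k(points, radius, area):
    """Naive estimate of Ripley's K function (no edge correction)."""
    n = len(points)
    pairs = 0
    for i in range(n):
        for j in range(n):
            if i != j:
                d = math.hypot(points[i][0] - points[j][0],
                               points[i][1] - points[j][1])
                if d <= radius:
                    pairs += 1  # ordered pair (i, j) within the radius
    return area * pairs / (n * (n - 1))

# Invented pattern: the four corners of the unit square
corners = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
k_at_1 = ripley_k(corners, radius=1.0, area=1.0)
```

For a homogeneous Poisson process K(r) equals pi r^2, so estimates above that value suggest clustering and estimates below it suggest regularity.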

Jia Li covers data mining techniques for classifying data into clusters (including k-means, model-based clustering, and single linkage (friends of friends) and complete linkage clustering algorithms) in her lecture on Cluster analysis.
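Of the algorithms listed, k-means is the easiest to sketch: alternate between assigning each point to its nearest center and moving each center to the mean of its points. The Python version below uses a deterministic farthest-first initialization, which is an assumption made to keep the example reproducible; the function names and points are hypothetical.

```python
import math

def assign_labels(points, centers):
    """Index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

def kmeans(points, k, n_iter=100):
    """Basic k-means clustering with farthest-first initialization."""
    centers = [tuple(points[0])]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far
        farthest = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(tuple(farthest))
    for _ in range(n_iter):
        labels = assign_labels(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(coord) / len(members)
                                         for coord in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers, assign_labels(points, centers)

# Invented points forming two well-separated groups
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, k=2)
```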