Testing and Estimation
A simple example
Consider the computer-generated picture below which is supposed to
mimic a photograph taken by a low resolution telescope. Is it reasonable
to say that there are two distinct stars in the picture?
The next image is obtained by reducing the resolution of the
telescope. Each star is now more blurred. Indeed, their distance
in the picture is almost overwhelmed in the blur, and they appear
to be an elongated blur caused by a single star.
The last picture is of the same two stars taken through a very
low resolution telescope. Based on this image alone there
is just no reason to believe that we are seeing two stars.
What we did just now is an informal version of a statistical test
of hypothesis. Each time we tried to answer the question ``Are
there two stars or one?'' based on a blurry picture. The two possibilities
``There is just one star'' as opposed to
``There are two stars''
are called two hypotheses. The simpler of the two gets the
name null hypothesis while the other is called the
alternative hypothesis. Here we shall take the first one
as the null, and the second one as the alternative hypothesis.
The blurry picture we use to make a decision is called the
statistical data (which always contains random
errors in the form of blur/noise). Notice how our final verdict
changes with the
amount of noise present in the data.
Let us follow the thought process that led us to a conclusion
based on the first picture. We first mentally made a note of the
amount of blur. Next we imagined the centers of the bright blobs. If
there are two stars then they are most likely to be at these centers. Now we
compare the distance between these centers and the amount of
blur present. If the distance seems too small compared to the
blur then we pass off the entire bundle as a single star.
This is precisely the idea behind most statistical tests. We
shall see this for the case of the two sample t-test.
Do Hyades stars differ in color from the rest?
Recall in the Hipparcos data set we had 92 Hyades stars and 2586
non-Hyades stars. We want to test the null hypothesis
``The Hyades stars have the same color as the non-Hyades stars''
versus the alternative hypothesis
``They have different colors.''
First let us get hold of the data for both the groups. For this
we shall use the little script we had saved earlier. This script
is going to load the entire Hipparcos data set, and extract the
Hyades stars. So you must make sure that both
HIP.dat and hyad.r are saved on your
desktop. Right-click on the links in the last line, choose
"Save link as..." or "Save target as..." (and be careful that the
files do not get saved as .txt files). Then you must navigate R to
the desktop. The simplest way to achieve this is to use the File
> Change dir... menu item. This will open a pop-up like the one in
the figure. Click on Desktop, and then OK.
[Figure: Changing directory in R (Windows version)]
source("hyad.r")   #run the saved script (it loads HIP.dat and extracts the Hyades stars)
color = B.V
H = color[HyadFilter]
nH = color[!HyadFilter & !is.na(color)]
m = length(H)
n = length(nH)
In the definition of nH above, we needed to exclude the missing (NA)
values of color. H is a list of m numbers and
nH is a list of n numbers.
First we shall make an estimate of the ``blur'' present in the
data. For this we shall compute the pooled estimate of the variance,
whose square root is the pooled standard deviation.
blur.H = var(H)
blur.nH = var(nH)
blur.pool = ((m-1)*blur.H + (n-1)*blur.nH)/(m+n-2)   #pooled variance
Next we shall find the difference of the two means:
meanDiff = mean(H)-mean(nH)
Finally we have to compare the difference with the blur. One way
is to form their ratio.
(meanDiff/sqrt(blur.pool))/sqrt(1/m + 1/n)
This last factor (which is a constant) is there only for
technical reasons (you may think of it as a special constant to
make the ``units match'').
The important question now is ``Is this ratio small or large?''
For the image example we provided a subjective answer. But in
statistics we have an objective way to proceed. Before we see
that let us quickly learn a one-line shortcut to compute the above
ratio using the t.test function:
t.test(H,nH,var.eq=T) #we shall explain the "var.eq" soon
Do you see the ratio in the output? Also this output tells us
whether the ratio is to be considered small or large. It does so
in a somewhat diplomatic way using a number called the
p-value. Here the p-value is near 0, meaning that
if the colors were really the same then the chance of observing
a ratio this large in magnitude (or larger) is almost 0.
Typically, if the p-value is smaller than 0.05 then we reject the
null hypothesis. So we conclude that the mean of the color of
Hyades stars is indeed different from that of the rest.
A rule of thumb: For any statistical test (not just the
t-test) reject the null hypothesis if and only if the
p-value is below 0.05. Such a test wrongly rejects a true
null hypothesis at most 5% of the time.
The var.eq=T option means we are assuming that the
colors of the Hyades and non-Hyades stars have more or less the
same variance. If we do not want to make this assumption, we
should simply write
t.test(H,nH)
Remember that t.test is for comparing means.
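The link between the hand-computed ratio and t.test can be checked on made-up numbers. The following is only a sketch: the two samples are simulated stand-ins (their means and standard deviations are assumptions, not Hipparcos values), and var.equal is the full name of the argument abbreviated as var.eq above.

```r
# Simulated stand-ins for the two color samples (hypothetical values)
set.seed(1)
H  = rnorm(92,   mean = 0.9, sd = 0.3)
nH = rnorm(2586, mean = 0.7, sd = 0.3)
m = length(H)
n = length(nH)

# Pooled variance and the ratio computed "by hand"
blur.pool = ((m-1)*var(H) + (n-1)*var(nH))/(m+n-2)
ratio = (mean(H)-mean(nH))/sqrt(blur.pool)/sqrt(1/m + 1/n)

# The same comparison through t.test, with and without the
# equal-variance assumption
pooled = t.test(H, nH, var.equal = TRUE)
welch  = t.test(H, nH)

ratio              # hand-computed ratio
pooled$statistic   # agrees with the hand-computed ratio
welch$statistic    # usually slightly different
pooled$p.value
```

The pooled version reproduces the ratio exactly, while the default call uses separate variance estimates for the two groups.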
Chi-squared tests for categorical data
Suppose that you are to summarize the result of a public
examination. It is not reasonable to report the grades obtained
by each and every student in a summary report. Instead, we break
the range of grades into categories like A,B,C etc and
then report the numbers
of students in each category. This gives an overall idea about
the distribution of grades.
The cut function in R does precisely this.
bvcat = cut(color, breaks=c(-Inf,0.5,0.75,1,Inf))
Here we have broken the range of values of the color (B.V)
variable into 4 categories:
(-Inf, 0.5], (0.5, 0.75], (0.75, 1] and (1, Inf].
The result (stored in
bvcat) is a vector that
records the category in which each star falls.
It is possible to tabulate this information for Hyades and
non-Hyades stars in the same table.
To perform a chi-squared test of the null hypothesis that the
true population proportions falling in the four categories are
the same for both the Hyades and non-Hyades stars, use the
chisq.test function.
Since we already know these two groups differ with respect to the
B.V variable, the result of this test is not too
surprising. But
it does give a qualitatively different way to compare these two
distributions than simply comparing their means.
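The tabulation and the test of homogeneity described above might look as follows. This is a sketch on simulated data: the colors and the group labels are invented for illustration, and the variable group plays the role that the Hyades filter plays in the lab.

```r
# Hypothetical colors for two groups of stars (not the Hipparcos values)
set.seed(2)
color = c(rnorm(100, 0.9, 0.3), rnorm(900, 0.7, 0.3))
group = rep(c("Hyades", "other"), c(100, 900))

# Same four categories as in the text
bvcat = cut(color, breaks = c(-Inf, 0.5, 0.75, 1, Inf))

# Cross-tabulate category against group: a 4 x 2 table of counts
tab = table(bvcat, group)
tab

# Chi-squared test of homogeneity on the table
result = chisq.test(tab)
result$statistic
result$parameter   # degrees of freedom: (4-1)*(2-1)
result$p.value
```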
The test above is usually called a chi-squared test of
homogeneity. If we observe only one sample, but we wish to test
whether the categories occur in some pre-specified proportions, a
similar test (and the same R function) may be applied. In this
case, the test is usually called the chi-squared test of
goodness-of-fit. We shall see an example of this next.
Consider once again the Hipparcos data.
We want to know if the stars in the
Hipparcos survey come equally from all corners of the sky. In
fact, we shall focus our attention only on the right ascension (RA).
First we shall break the range of RA into 20 equal intervals (each of
width 18 degrees), and find how many stars fall in each bin.
count = table(cut(RA,breaks=seq(0,360,len=21))) #21 break points give 20 bins
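A goodness-of-fit test on such bin counts can be sketched as below. The right ascensions here are simulated (uniform by construction), so this only illustrates the mechanics. When chisq.test is given a single vector of counts it tests, by default, whether all categories are equally likely.

```r
# Hypothetical right ascensions, uniform on (0, 360) by construction
set.seed(3)
RA = runif(2000, 0, 360)

# 20 bins of width 18 degrees
count = table(cut(RA, breaks = seq(0, 360, by = 18)))
length(count)

# Goodness-of-fit test against equal proportions in all bins
gof = chisq.test(count)
gof$parameter   # degrees of freedom: number of bins minus 1
gof$p.value
```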
There is yet another way (a better way in many situations) to
perform the same test. This is called the Kolmogorov-Smirnov
test.
ks.test(RA,"punif",0,360)
Here punif is the name of the distribution with
which we are comparing the data.
punif denotes the CDF of the
uniform distribution, the range being from 0 to 360. Thus, here
we are testing whether
RA takes all values from 0
to 360 with equal likelihood, or whether some values are taken more
or less frequently.
The Kolmogorov-Smirnov test has the advantage that we do not need
to group the data into
categories as for the chi-squared test.
Remember that chisq.test and ks.test are for comparing distributions.
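To see how the Kolmogorov-Smirnov test reacts, one may compare a sample that is uniform by construction with one that is clearly not. Both samples below are simulated, and the names flat and skewed are made up for this sketch.

```r
set.seed(4)
flat   = runif(500, 0, 360)        # uniform on (0, 360)
skewed = 360 * rbeta(500, 2, 5)    # piles up at the low end of (0, 360)

# Compare each sample with the uniform distribution on (0, 360)
ks.flat   = ks.test(flat,   "punif", 0, 360)
ks.skewed = ks.test(skewed, "punif", 0, 360)

ks.flat$p.value
ks.skewed$p.value   # expected to be very small
```

No binning is involved: the test works directly with the sorted data.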
Finding (or, rather, guessing about) the unknown based on
approximate information (data) is the aim of statistics. Testing is
one aspect of it where we seek to answer yes-no questions about
the unknown. The problem of estimation is about guessing the
values of unknown quantities.
There are many methods of estimation, but most start off with a
statistical model of the data. This is a statement of how
the observed data set is (probabilistically) linked with the
unknown quantity of interest.
For example, if I am asked to estimate p based on the data
Head, Head, Head, Tail, Tail, Head, Head, Tail, Head, Tail, Tail, Head
then I cannot make head-or-tail of the question. I need to link
this data set with p through a statement like
A coin with probability p of showing Head was tossed 12 times and the data
set was the result.
This is a statistical model for the data. Now the problem of
estimating p from the data looks like a meaningful one.
Statistical models provide the link between the observed data and
the unknown reality. They are indispensable in any statistical
analysis. Misspecification or over-simplification of the
statistical model is the most frequent cause behind misuse of
statistics.
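Once the coin model is written down, estimating p becomes a concrete computation. The sketch below assumes that p is the probability of Head and uses the maximum likelihood estimate, which for this model is the proportion of Heads. The numerical maximization is included only to show that the likelihood leads to the same value.

```r
tosses = c("Head","Head","Head","Tail","Tail","Head",
           "Head","Tail","Head","Tail","Tail","Head")
heads = sum(tosses == "Head")
N = length(tosses)

# Closed-form estimate: proportion of Heads
pHat = heads/N

# Log-likelihood of p under the model "N independent tosses"
loglik = function(p) heads*log(p) + (N - heads)*log(1 - p)

# Numerical maximization over (0, 1)
pNum = optimize(loglik, interval = c(0.001, 0.999), maximum = TRUE)$maximum

pHat
pNum   # agrees with pHat up to the tolerance of optimize
```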
Estimation using R
You might be thinking that R has some in-built tool that can
solve all estimation problems. Well, there isn't. In fact, due to
the tremendous diversity among statistical models and estimation
methods no statistical software can have tools to cope
with all estimation problems. R tackles estimation
in three major ways.
- Many books/articles give formulas to estimate various
quantities. You may use R as a calculator to implement them.
With enough theoretical background you may be able to come up
with your own formulas that R will happily compute for you.
- Sometimes estimation problems lead to complicated
equations. R can solve such equations for you numerically.
- For some frequently used statistical methods (like
regression or time series analysis) R has the estimation methods
built into it.
To see estimation in action let us load the Hipparcos data set.
hip = read.table("HIP.dat",head=T)
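If we assume the statistical model that the Vmag
variable has a normal distribution with unknown mean μ and
unknown variance σ² then it is known from the literature
that a good estimator of μ is the sample mean $\bar{X}$ and a
95% confidence interval is
$$\bar{X} \pm t_{n-1,\,0.975}\,\frac{s}{\sqrt{n}},$$
where n is the number of observations, s is the sample standard
deviation and $t_{n-1,\,0.975}$ is the 97.5% point of the t
distribution with n-1 degrees of freedom (about 1.96 for large n).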
This means that the true value of μ will lie in this
interval with 95% chance. You may think of μ as
a peg on the wall and the confidence interval as a hoop thrown at
it. Then the hoop will miss the peg in only about 5% of
the throws.
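Using R as a calculator for such an interval might look like the sketch below. The magnitudes are simulated (the mean 8 and standard deviation 1.3 are assumptions, not Hipparcos values), and the interval uses the t quantile. For large samples this is practically the same as using 1.96.

```r
# Hypothetical magnitudes standing in for a real data column
set.seed(5)
Vmag = rnorm(2000, mean = 8, sd = 1.3)
n = length(Vmag)

# Point estimate and 95% confidence interval "by hand"
muHat = mean(Vmag)
halfWidth = qt(0.975, df = n - 1) * sd(Vmag)/sqrt(n)
ci = c(muHat - halfWidth, muHat + halfWidth)
ci

# The one-sample t.test reports the same interval
t.test(Vmag)$conf.int
```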
Find the estimate and confidence interval for μ based
on the observed values of
Vmag using R as a calculator.
Next we shall see a less trivial example.
The data set comes from NASA's Swift satellite. The statistical
problem at hand is modeling the X-ray afterglow of gamma ray
bursts. First, read in the dataset GRB.dat
(right-click, save on your desktop, without any .txt extension).
dat = read.table("GRB.dat",head=T)
flux = dat[,2]
Suppose that it is known that the
flux variable has
an Exponential distribution. This means that its density function
is of the form
$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0.$$
Here λ is a parameter, which must be positive. To
get a feel of the density function let us plot it for different
values of λ.
x = seq(0,200,.1)
y = dexp(x,1)   #density with lambda = 1
plot(x,y,type="l")   #repeat with other values of lambda
Now let us look at the histogram of the observed flux values. The
problem of estimating λ based on the data may be
considered as finding a value of λ for which
the density is as close as possible to the histogram.
We shall first try to achieve this interactively. For this you
need to download interact.r to your desktop first.
Running this script in R should open a tiny window, as shown below, with a slider.
[Figure: Screenshot of the tiny window]
Move the slider to see how the density curve moves over the
histogram. Choose a position of the slider for which the density
curve appears to be a good approximation.
It is known from the theory of the Exponential distribution that
the Maximum Likelihood Estimate (MLE) of λ
is the reciprocal of the sample mean.
Compute this estimate using R and store it in a variable
called lambdaHat. Draw the density curve on top of the
histogram using the following commands.
y = dexp(x,lambdaHat)
lines(x,y)   #assumes the histogram was drawn on the density scale, e.g. with prob=T
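The whole estimation step can be tried on simulated data, where the true λ is known. In the sketch below the fluxes are generated with λ = 0.05 (an arbitrary choice, not related to GRB.dat). The closed-form MLE is compared with a numerical maximization of the likelihood, which is the route one would take when no formula is available.

```r
# Hypothetical fluxes from an Exponential distribution with lambda = 0.05
set.seed(6)
flux = rexp(300, rate = 0.05)

# Closed-form MLE: reciprocal of the sample mean
lambdaHat = 1/mean(flux)

# Numerical MLE: minimize the negative log-likelihood
negloglik = function(lambda) -sum(dexp(flux, lambda, log = TRUE))
lambdaNum = optimize(negloglik, interval = c(1e-4, 1))$minimum

lambdaHat
lambdaNum   # agrees with lambdaHat up to the tolerance of optimize

# Histogram on the density scale with the fitted curve on top
hist(flux, prob = TRUE)
x = seq(0, max(flux), length.out = 500)
lines(x, dexp(x, lambdaHat))
```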
Some nonparametric tests
The remaining part of this tutorial deals with a class of methods
called nonparametric inference. These may be skipped without loss
of continuity. Also the theory part for this will be covered in a
later theory class. So you may like to save the rest of this lab
until that time. I have tried to make the lab largely
self-explanatory though, with a little theoretical discussion to
explain terms and concepts.
In statistics we often come across questions like
``Is a given data set from such-and-such a distribution?''
Before we can answer this question we need to know what is meant
by a distribution. We shall talk about this first.
In statistics, we work with random
variables. Chance controls their values.
The distribution of a random variable is the rule by which
chance governs the values.
If you know the distribution then you know the
chance of this random variable taking value in any given range.
This rule may be expressed in various ways. One popular way is
via the Cumulative Distribution Function (CDF), that we
illustrate now. If a random variable (X) has distribution with CDF
F(x) then for any given number a the chance that
X will be ≤ a is F(a).
So you see how powerful a CDF is: in a sense it stores as much
information as the entire distribution.
Suppose that I choose a random star from the Hipparcos data
set. If I tell you that its
Vmag is a random variable
with the following CDF then what is the chance that it takes values
≤ 10? Also find out the value x such that
the chance of
Vmag being ≤ x is 0.4.
For the first question the required probability is F(10)
which, according to the
graph, is slightly above 0.8.
In the second part we need F(x) = 0.4. From the graph it
seems to be about 7.5.
Just to make sure that these are indeed meaningful, let us load
the Hipparcos data set and check:
hip = read.table("HIP.dat",head=T)
n = length(Vmag) #total number of cases
count = sum(Vmag<=10) #how many <= 10
count/n #should be slightly above 0.8
This checks the answer of the first problem. For the second:
count = sum(Vmag<=7.5) #how many <= 7.5
count/n #should be around 0.4
So the CDF above does describe the Vmag data. It is a
common practice in statistics to regard the distribution as the
ultimate truth behind the data. We want to infer about the
underlying distribution based on the data. When we look at a data
set we actually try to look at the underlying distribution
through the data set!
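The counting done above is exactly what the empirical CDF of a data set computes. A minimal sketch, again on simulated magnitudes (hypothetical values, not the Hipparcos column), uses the built-in functions ecdf and quantile:

```r
# Hypothetical magnitudes
set.seed(7)
Vmag = rnorm(2000, mean = 8, sd = 1.3)

# Empirical CDF: Fhat(a) is the proportion of values <= a
Fhat = ecdf(Vmag)
Fhat(10)
mean(Vmag <= 10)       # the same proportion, counted directly

# Going the other way: the value below which about 40% of the data lie
x40 = quantile(Vmag, 0.4)
x40
Fhat(x40)              # close to 0.4

plot(Fhat)             # the step-function graph of the empirical CDF
```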
R has many standard distributions already built into it. This
basically means that R has functions to compute their CDFs.
These functions all start with the letter p.
A random variable has standard Gaussian distribution. We know
that R computes its CDF using the function
pnorm. How do we
find the probability that the random variable takes values
≤ a, for a given number a?
The answer may be found as pnorm(a).
For every p-function there is a
q-function that is basically its inverse.
A random variable has standard Gaussian distribution. Find
x such that the random variable is ≤ x with
a given chance p.
Now we shall use the function
qnorm: the answer is qnorm(p).
OK, now that we are through our little theory session, we are
ready for the nonparametric lab.
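Before moving on, the p- and q-functions can be tried out with a few concrete numbers. The particular values 1.96, 0.8 and 90 are chosen here only for illustration.

```r
# CDF of the standard Gaussian: chance of a value <= 1.96
pnorm(1.96)            # about 0.975

# Inverse of the CDF: the point below which 97.5% of the chance lies
qnorm(0.975)           # about 1.96

# A q-function undoes the corresponding p-function
pnorm(qnorm(0.8))      # 0.8, up to rounding error

# The same pairing exists for other distributions, e.g. uniform on (0, 360)
punif(90, 0, 360)      # 0.25
qunif(0.25, 0, 360)    # 90
```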
One sample nonparametric tests
In a sense this is the simplest possible of all tests. Here
we shall consider the data set LMC.dat that stores the measured
distances to the Large Magellanic Cloud. (As always, you'll need
to save the file on your desktop.)
LMC = read.table("LMC.dat",head=T)
data = LMC[,2] #These are the measurements
We want to test if the
measurements exceed 18.41 on an average. Now, this does not
mean asking whether the average of the data exceeds 18.41, which is trivial
to find out. The question here is actually about the underlying
distribution. We want to know if the median of the underlying
distribution exceeds 18.41, which is a less trivial question,
since we are not given that distribution.
We shall use the sign test to see if the median is 18.41 or larger.
First we find how many of the
observations are above this value:
abv = sum(data>18.41)
Clearly, if this number is large, then we should think that the
median (of the underlying distribution) exceeds 18.41. The
question is how large is ``large
enough''? For this we consult the binomial distribution to get
the p-value. The rationale behind this should come from
the theory class.
n = nrow(LMC)
pValue = 1-pbinom(abv-1,n,0.5)
We shall learn more about p-values in a later
tutorial. For now, we shall use the following rule of thumb:
If this p-value is below 0.05, say, we shall conclude that
the median is indeed larger than 18.41.
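The sign test computation can be checked on a small made-up data set. The ten measurements below are hypothetical, not the LMC values. The built-in binom.test function with a one-sided alternative gives the same p-value as the pbinom formula.

```r
# Ten hypothetical distance measurements
data = c(18.3, 18.5, 18.6, 18.45, 18.7, 18.38, 18.52, 18.49, 18.6, 18.55)

abv = sum(data > 18.41)   # observations above 18.41
n = length(data)

# Chance of seeing abv or more values above the median if the
# true median were 18.41 (each value is above with chance 1/2)
pValue = 1 - pbinom(abv - 1, n, 0.5)

# The same p-value from the exact binomial test
check = binom.test(abv, n, p = 0.5, alternative = "greater")$p.value

abv      # 8
pValue   # 56/1024, about 0.055
check    # same as pValue
```

Here 8 out of 10 values lie above 18.41, yet the p-value is slightly above 0.05, which shows how little ten observations can establish.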
Wilcoxon's Signed Rank Test
As you should already know from the theoretical class, there is a
test called Wilcoxon's Signed Rank test that is better (if a little
more complicated) than the sign test. R provides the function
wilcox.test for this purpose. We shall work with the
Hipparcos data set this time, which we have already loaded.
We want to see if the median of the distribution of the
pmRA variable is 0 or not.
Since the p-value is pretty large (above 0.05, say) we
shall not reject the hypothesis that the median is 0. R itself
has come to the same conclusion.
Incidentally, there is a little caveat. For Wilcoxon's Signed Rank Test
to be applicable we need the underlying distribution to be
symmetric around the median. To get an idea about this we should
draw the histogram.
Well, it looks pretty symmetric (around a center line).
But would you apply Wilcoxon's
Signed Rank Test to a variable whose histogram is lopsided?
No, such a variable is not at all symmetric!
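The role of symmetry can be explored with two simulated samples, both of which have median 0 in the population: one symmetric, the other strongly skewed. The sample names sym and skew are made up for this sketch, and no particular p-values are claimed.

```r
set.seed(8)
sym  = rnorm(200, mean = 0, sd = 5)     # symmetric around its median 0
skew = rexp(200, rate = 1) - log(2)     # median 0, but skewed to the right

# Wilcoxon's signed rank test of "median is 0" on each sample
pSym  = wilcox.test(sym,  mu = 0)$p.value
pSkew = wilcox.test(skew, mu = 0)$p.value
pSym
pSkew   # may come out small even though the true median is 0

# Histograms to judge the symmetry by eye
hist(sym)
hist(skew)
```

For the skewed sample a small p-value would reflect the lack of symmetry rather than a median different from 0, which is why the histogram should be checked first.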
Two-sample nonparametric tests
Here we shall be talking about the shape of distributions. Let us
first make sure of this concept. Suppose that I make a list of
all the people in a capitalist country
and make a histogram of their incomes.
I should get a histogram like this:
[Figure: Income histogram of a capitalist country]
But if the same thing is done for a socialist country we shall
see something like this (where almost everybody is in the middle):
[Figure: Income histogram of a socialist country]
Clearly, the shapes of the histograms differ for the two
populations. We must be careful about the term ``shape'' here. For
example, the following two histograms are for two capitalist
countries, one rich the other poor.
[Figure: Income histograms of two capitalist countries]
Here the histograms have the same shape but differ in
location. You can get one from the other by applying a
shift in the location.
This is a very common phenomenon in statistics. If we have two
populations with comparable structure they often have similar
shapes, and differ in just a location shift. In such a situation
we are interested in knowing about the amount of shift. If the
amount of shift is zero, then the two populations behave
identically.
Wilcoxon's rank sum test is one method to learn about the amount
of shift based on samples from the two populations.
We shall start with the example done in the theory class, where
we had two samples:
x = c(37,49,55,57)
y = c(23,31,46)
m = length(x)
n = length(y)
We are told that both the samples come from populations with the
same shape, though one may be a shifted version of the other.
Our aim is to check if indeed there is any shift or not.
We shall consider the pooled data (i.e., all the 7 numbers
taken together) and rank them
pool = c(x,y)
r = rank(pool)
The first m=4 of these ranks are for the first
sample. Let us add them up:
H = sum(r[1:m])
If the two distributions are really identical (i.e.,
if there is no shift) then we should have (according to the
theory class) H close to the value
m*(m+n+1)/2 #Remember: * means multiplication
Can the H that we computed from our data be considered
``close enough'' to this value? We shall let R determine that:
wilcox.test(x,y)
The output mentions a W and a p-value. The p-value here is about
0.11, which is above 0.05, so with samples this small we cannot
claim a shift in location, even though H looks well above 16.
W is essentially what we had called
H. Well, H was 21, while W is 11. This
is because R has the habit of subtracting m(m+1)/2
from H and calling it W. Indeed for our data set
m(m+1)/2 = 10,
which perfectly accounts for the difference between our H
and R's W.
You may recall from the theory class that H is called
Wilcoxon's rank sum statistic, while W is the Mann-Whitney
statistic (denoted by U in class).
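The relation between H and W can be verified directly on the two small samples from the text. The sketch below recomputes both and reads the statistic and the p-value off the object returned by wilcox.test.

```r
x = c(37, 49, 55, 57)
y = c(23, 31, 46)
m = length(x)
n = length(y)

# Rank the pooled data and add the ranks of the first sample
r = rank(c(x, y))
H = sum(r[1:m])            # Wilcoxon's rank sum statistic: 21

# Subtracting m(m+1)/2 gives the Mann-Whitney statistic
W = H - m*(m + 1)/2        # 11

res = wilcox.test(x, y)
res$statistic              # W = 11
res$p.value                # 4/35, about 0.114
```

With 4 and 3 observations there are only 35 equally likely ways to split the 7 ranks, which is why the exact two-sided p-value is a multiple of 1/35.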
Whether or not the test detects a shift, it is natural to ask for an
estimate of the amount of shift.
wilcox.test(x,y,conf.int=T)
The output gives two different forms of estimates. It
gives a single value (a point estimate) which is 16.
Before that it gives a 95% confidence interval: -9 to 34. This
basically means that
we can say with 95% confidence that
the true value of the shift is between -9 and 34.
Note that this interval contains 0, so a shift of zero is not ruled out.
We shall talk more about confidence intervals later. For now let
us see how R got the point estimate 16. It used the so-called
Hodges-Lehmann formula, that we have seen in the theoretical
class. To see how this method works we shall first form the following
table:
  | 23 31 46
37| 14  6 -9
49| 26 18  3
55| 32 24  9
57| 34 26 11
Here the rows are headed by the first sample, and columns by the
second sample. Each entry is obtained by subtracting the column
heading from the row heading (e.g., -9 = 37-46). This table, by the
way, can be created very easily with the function outer:
outer(x,y,"-")
The outer function is a very useful (and somewhat
tricky!) tool in R. It takes two vectors x and y,
say, and some function f(x,y). Then it computes a matrix
where the (i,j)-th entry is f(x[i],y[j]).
Now we take the median of all the entries in the table:
median(outer(x,y,"-"))
This gives us the Hodges-Lehmann estimate of the location shift.
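The table-and-median recipe can be compared with what wilcox.test reports. In the sketch below the call is wrapped in suppressWarnings because, with samples this small, R is expected to warn that the requested confidence level cannot be achieved exactly.

```r
x = c(37, 49, 55, 57)
y = c(23, 31, 46)

# All pairwise differences x[i] - y[j], as a 4 x 3 matrix
diffs = outer(x, y, "-")
diffs

# Hodges-Lehmann estimate: median of the pairwise differences
shift = median(diffs)
shift                      # 16

# Smallest and largest differences
range(diffs)               # -9 and 34

# The estimate reported by wilcox.test
fit = suppressWarnings(wilcox.test(x, y, conf.int = TRUE))
fit$estimate
fit$conf.int
```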