Random numbers in R

It is often necessary to simulate random numbers in R. There are many functions available to accomplish this and related tasks such as evaluating the density, distribution function, and quantile function of a distribution.

Distributions intrinsic to R

R handles many common distributions easily. To see a list, type

help.search("distribution",package="stats")

Let's consider the well-known normal distribution as an example:

?Normal

The four functions 'rnorm', 'dnorm', 'pnorm', and 'qnorm' give random normals, the normal density (sometimes called the differential distribution function), the normal cumulative distribution function (CDF), and the inverse of the normal CDF (also called the quantile function), respectively. Almost all of the other distributions have similar sets of four functions. The 'r' versions are rbeta, rbinom, rcauchy, rchisq, rexp, rf, rgamma, rgeom, rhyper, rlogis, rlnorm, rmultinom, rnbinom, rnorm, rpois, rsignrank, rt, runif, rweibull, and rwilcox (there is no rtukey because generally only ptukey and qtukey are needed).

As an example, suppose we wish to simulate a vector of 10 independent, standard (i.e., mean 0 and standard deviation 1) normal random variables. We use the rnorm function for this purpose, and its defaults are mean=0 and standard deviation=1. Thus, we may simply type

rnorm(10)

Suppose we wish to simulate a large number of normal random variables with mean 10 and standard deviation 3, then check a histogram against two normal density functions, one based on the true parameters and one based on estimates, to see how it looks. We'll use 'col=2, lty=2, lwd=3' to make the curve based on the true parameters red (color=2), dashed (line type=2), and wider than normal (line width=3). Also note that we are requesting 100 bins in the histogram (nclass=100) and putting it on the same vertical scale as the density functions (freq=FALSE).

z=rnorm(200000, mean=10, sd=3)
hist(z,freq=FALSE,nclass=100)
x=seq(min(z),max(z),len=200)
lines(x,dnorm(x, mean=10, sd=3),col=2, lty=2, lwd=3)
lines(x,dnorm(x,mean=mean(z),sd=sqrt(var(z))))

We can find out what proportion of the deviates lie outside 3 standard deviations from the true mean, a common cutoff used by physical scientists. We can also see the true theoretical proportion:

sum(abs((z-10)/3)>3)/length(z)
2*pnorm(-3)

In the first line above, we are using sum to count the number of TRUE's in the logical vector (abs((z-10)/3)>3). This works because logical values are coerced to 0's and 1's when necessary.
The function dnorm has a closed form: With mean=0 and sd=1, dnorm(x) equals exp(-x2/2)/sqrt(2*pi). By contrast, the CDF, given by pnorm, has no closed form and must be numerically approximated. By definition, pnorm(x) equals the integral of dnorm(t) as t ranges from minus infinity to x. To find a p-value (i.e., the probability of observing a statistic more extreme than the one actually observed), we use pnorm; to construct a confidence interval (i.e., a range of reasonable values for the true parameter), we use the inverse, qnorm.

pnorm(1:3)-pnorm(-(1:3))
qnorm(c(.05,.95))

The first line above summarizes the well-known 68, 95, 99.7 rule for normal distributions (these are the approximate proportions lying within 1, 2, and 3 standard deviations from the mean). The second line gives the critical values used to construct a 90% confidence interval for a parameter when its estimator is approximately normally distributed.

Let us now briefly consider an example of a discrete distribution, which means a distribution on a finite or countably infinite set (as opposed to a continuous distribution like the normal). The Poisson distribution, which has a single real-valued parameter lambda, puts all of its probability mass on the nonnegative integers. A Poisson distribution, often used to model data consisting of counts, has mean and variance both equal to lambda.

k=0:10
dpois(k,lambda=2.5) # or equivalently,
exp(-2.5)*2.5^k/factorial(k)

Next, simulate some Poisson variables:

x=rpois(10000,lambda=2.5)
table(x)
mean(x)
var(x)

The power-law or Pareto distribution

A commonly used distribution in astrophysics is the power-law distribution, more commonly known in the statistics literature as the Pareto distribution. There are no built-in R functions for dealing with this distribution, but because it is an extremely simple distribution it is easy to write such functions.

The density function for the Pareto is f(x)=aba/x(a+1) for x>b. Here, a and b are fixed positive parameters, where b is the minimum possible value. As an example, consider the log N = -1.5 * log S relationship, where S is the apparent brightness and N is the number of standard candles randomly located in transparent space. Thus, a Pareto distribution with (a+1) = 1.5 is a reasonable, if simplistic, model for the brightness of observed standard candles in space. The b parameter merely reflects the choice of units of measurement.

It turns out that a Pareto random variable is simply b*exp(X), where X is an exponential random variable with rate=a (i.e., with mean=1/a). However, rather than exploiting this simple relationship, we wish to build functions for the Pareto distribution from scratch. Our default values, which may be changed by the user, will be a=0.5 and b=1.

dpareto=function(x, a=0.5, b=1) a*b^a/x^(a+1)

Next, we integrate the density function to obtain the distribution function, which is F(x)=1-(b/x)a for x>=b (and naturally F(x)=0 for x < b):

ppareto=function(x, a=0.5, b=1) (x > b)*(1-(b/x)^a)

Note that (x > b) in the above function is coerced to numeric, either 0 or 1.

Inverting the distribution function gives the quantile function. The following simplistic function is wrong unless 0<u<1, so a better-designed function should do some error-checking.

qpareto=function(u, a=0.5, b=1) b/(1-u)^(1/a)

Finally, to simulate random Pareto random variables, we use the fact that whenever the quantile function is applied to a uniform random variable, the result is a random variable with the desired distribution:

rpareto=function(n, a=0.5, b=1) qpareto(runif(n),a,b)

Creating functions in R, as illustrated above, is a common procedure. Note that each of the arguments of a function may be given a default value, which is used whenever the user calls the function without specifying the value of this parameter. Also note that each of the above functions consists of only a single line; however, longer functions may be created by enclosing them inside curly braces { }.

A few simple plots

The commands below create plots related to the four functions just created.

par(mfrow=c(2,2))
x=seq(1,50,len=200)
plot(x,dpareto(x),type="l")
plot(x,ppareto(x),type="l",lty=2)
u=seq(.005,.9,len=200)
plot(u,qpareto(u),type="l",col=3)
z=rpareto(200)
dotchart(log10(z), main="200 random logged Pareto deviates",
cex.main=.7)
par(mfrow=c(1,1))

The above commands illustrate some of the many plotting capabilities of R. The par function sets many graphical parameters, for instance, 'mfrow=c(2,2)', which divides the plotting window into a matrix of plots, set here to two rows and two columns. In the plot commands, 'type' is set here to "l" for a line plot; other common options are "p" for points (the default), "b" for connected dots, and "n" for nothing (to create axes only). Other options used: 'lty' sets the line type (1=solid, 2=dashed, etc.), 'col' sets color (1=black, 2=red, 3=green, etc.), 'main' puts a string into the plot title, and 'cex.main' sets the text magnification.
Type '? par' to see a list of the many plotting parameters and options.