smooth.spline {stats}  R Documentation 
Fits a cubic smoothing spline to the supplied data.
smooth.spline(x, y = NULL, w = NULL, df, spar = NULL, cv = FALSE, all.knots = FALSE, nknots = NULL, keep.data = TRUE, df.offset = 0, penalty = 1, control.spar = list())
x 
a vector giving the values of the predictor variable, or a list or a twocolumn matrix specifying x and y. 
y 
responses. If y is missing, the responses are assumed
to be specified by x . 
w 
optional vector of weights of the same length as x ;
defaults to all 1. 
df 
the desired equivalent number of degrees of freedom (trace of the smoother matrix). 
spar 
smoothing parameter, typically (but not necessarily) in
(0,1]. The coefficient λ of the integral of the
squared second derivative in the fit (penalized log likelihood)
criterion is a monotone function of spar , see the details
below. 
cv 
ordinary (TRUE ) or “generalized” crossvalidation
(GCV) when FALSE . 
all.knots 
if TRUE , all distinct points in x are used as
knots. If FALSE (default), a subset of x[] is used,
specifically x[j] where the nknots indices are evenly
spaced in 1:n , see also the next argument nknots . 
nknots 
integer giving the number of knots to use when
all.knots=FALSE . Per default, this is less than n, the
number of unique x values for n > 49. 
keep.data 
logical specifying if the input data should be kept
in the result. If TRUE (as per default), fitted values and
residuals are available from the result. 
df.offset 
allows the degrees of freedom to be increased by
df.offset in the GCV criterion. 
penalty 
the coefficient of the penalty for degrees of freedom in the GCV criterion. 
control.spar 
optional list with named components controlling the
root finding when the smoothing parameter spar is computed,
i.e., missing or NULL , see below.
Note that this is partly experimental and may change with general spar computation improvements!
spar is only searched for in the interval
[low, high].

The x
vector should contain at least four distinct values.
Distinct here means “distinct after rounding to 6 significant
digits”, i.e., x
will be transformed to
unique(sort(signif(x, 6)))
, and y
and w
are
pooled accordingly.
The computational λ used (as a function of
spar
) is
lambda = r * 256^(3*spar  1)
where
r = tr(X' W X) / tr(Σ),
Σ is the matrix given by
Sigma[i,j] = Integral B''[i](t) B''[j](t) dt,
X is given by X[i,j] = B[j](x[i]),
W is the diagonal matrix of weights (scaled such that
its trace is n, the original number of observations)
and B[k](.) is the kth Bspline.
Note that with these definitions, f_i = f(x_i), and the Bspline basis representation f = X c (i.e., c is the vector of spline coefficients), the penalized log likelihood is L = (y  f)' W (y  f) + λ c' Σ c, and hence c is the solution of the (ridge regression) (X' W X + λ Σ) c = X' W y.
If spar
is missing or NULL
, the value of df
is used to
determine the degree of smoothing. If both are missing, leaveoneout
crossvalidation (ordinary or “generalized” as determined by
cv
) is used to determine λ.
Note that from the above relation,
spar
is spar = s0 + 0.0601 * log(lambda),
which is intentionally different from the Splus implementation
of smooth.spline
(where spar
is proportional to
λ). In R's (log λ) scale, it makes more
sense to vary spar
linearly.
Note however that currently the results may become very unreliable
for spar
values smaller than about 1 or 2. The same may
happen for values larger than 2 or so. Don't think of setting
spar
or the controls low
and high
outside such a
safe range, unless you know what you are doing!
The “generalized” crossvalidation method will work correctly when
there are duplicated points in x
. However, it is ambiguous what
leaveoneout crossvalidation means with duplicated points, and the
internal code uses an approximation that involves leaving out groups
of duplicated points. cv=TRUE
is best avoided in that case.
An object of class "smooth.spline"
with components
x 
the distinct x values in increasing order, see
the Details above. 
y 
the fitted values corresponding to x . 
w 
the weights used at the unique values of x . 
yin 
the y values used at the unique y values. 
data 
only if keep.data = TRUE : itself a
list with components x , y and w
of the same length. These are the original (x_i,y_i,w_i),
i=1,...,n, values where data$x may have repeated values and
hence be longer than the above x component; see details.

lev 
leverages, the diagonal values of the smoother matrix. 
cv.crit 
crossvalidation score, “generalized” or true, depending
on cv . 
pen.crit 
penalized criterion 
crit 
the criterion value minimized in the underlying
.Fortran routine ‘sslvrg’. 
df 
equivalent degrees of freedom used. Note that (currently)
this value may become quite unprecise when the true df is
between and 1 and 2.

spar 
the value of spar computed or given. 
lambda 
the value of λ corresponding to spar ,
see the details above. 
iparms 
named integer(3) vector where ..$ipars["iter"]
gives number of spar computing iterations used. 
fit 
list for use by predict.smooth.spline , with
components

call 
the matched call. 
The default all.knots = FALSE
and nknots = NULL
entails
using only O(n^{0.2})
knots instead of n for n > 49. This cuts speed and memory
requirements, but not drastically anymore since R version 1.5.1 where
it is only O(nk) + O(n) where nk is
the number of knots.
In this case where not all unique x
values are
used as knots, the result is not a smoothing spline in the strict
sense, but very close unless a small smoothing parameter (or large
df
) is used.
R implementation by B. D. Ripley and Martin Maechler
(spar/lambda
, etc).
This function is based on code in the GAMFIT
Fortran program by
T. Hastie and R. Tibshirani (http://lib.stat.cmu.edu/general/),
which makes use of spline code by Finbarr O'Sullivan. Its design
parallels the smooth.spline
function of Chambers & Hastie (1992).
Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S, Wadsworth & Brooks/Cole.
Green, P. J. and Silverman, B. W. (1994) Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall.
Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models. Chapman and Hall.
predict.smooth.spline
for evaluating the spline
and its derivatives.
attach(cars) plot(speed, dist, main = "data(cars) & smoothing splines") cars.spl < smooth.spline(speed, dist) (cars.spl) ## This example has duplicate points, so avoid cv=TRUE lines(cars.spl, col = "blue") lines(smooth.spline(speed, dist, df=10), lty=2, col = "red") legend(5,120,c(paste("default [C.V.] => df =",round(cars.spl$df,1)), "s( * , df = 10)"), col = c("blue","red"), lty = 1:2, bg='bisque') detach() ## Residual (Tukey Anscombe) plot: plot(residuals(cars.spl) ~ fitted(cars.spl)) abline(h = 0, col="gray") ## consistency check: stopifnot(all.equal(cars$dist, fitted(cars.spl) + residuals(cars.spl))) ## artificial example y18 < c(1:3,5,4,7:3,2*(2:5),rep(10,4)) xx < seq(1,length(y18), len=201) (s2 < smooth.spline(y18)) # GCV (s02 < smooth.spline(y18, spar = 0.2)) plot(y18, main=deparse(s2$call), col.main=2) lines(s2, col = "gray"); lines(predict(s2, xx), col = 2) lines(predict(s02, xx), col = 3); mtext(deparse(s02$call), col = 3) ## The following shows the problematic behavior of 'spar' searching: (s2 < smooth.spline(y18, con = list(trace=TRUE,tol=1e6, low= 1.5))) (s2m < smooth.spline(y18, cv = TRUE, con = list(trace=TRUE,tol=1e6, low= 1.5))) ## both above do quite similarly (Df = 8.5 + 0.2)