silhouette {cluster} R Documentation

Compute or Extract Silhouette Information from Clustering

Description

Compute silhouette information according to a given clustering in k clusters.

Usage

```
silhouette(x, ...)
## Default S3 method:
silhouette(x, dist, dmatrix, ...)
## S3 method for class 'partition':
silhouette(x, ...)
## S3 method for class 'clara':
silhouette(x, full = FALSE, ...)

sortSilhouette(object, ...)
## S3 method for class 'silhouette':
summary(object, FUN = mean, ...)
## S3 method for class 'silhouette':
plot(x, nmax.lab = 40, max.strlen = 5,
     main = NULL, sub = NULL, xlab = expression("Silhouette width "* s[i]),
     col = "gray", do.col.sort = length(col) > 1, border = 0,
     cex.names = par("cex.axis"), do.n.k = TRUE, do.clus.stat = TRUE, ...)
```

Arguments

`x`: an object of appropriate class; for the `default` method, an integer vector with k different integer cluster codes, or a list with such an `x$clustering` component. Note that silhouette statistics are only defined if 2 <= k <= n-1.

`dist`: a dissimilarity object inheriting from class `dist` or coercible to one. If not specified, `dmatrix` must be.

`dmatrix`: a symmetric dissimilarity matrix (n * n), specified instead of `dist`, which can be more efficient.

`full`: logical specifying if a full silhouette should be computed for a `clara` object. Note that this requires O(n^2) memory, since the full dissimilarity (see `daisy`) is needed internally.

`object`: an object of class `silhouette`.

`...`: further arguments passed to and from methods.

`FUN`: function used to summarize silhouette widths.

`nmax.lab`: integer indicating the number of labels which is considered too large for single-name labeling of the silhouette plot.

`max.strlen`: positive integer giving the length to which strings are truncated in silhouette plot labeling.

`main, sub, xlab`: arguments to `title`; these have sensible non-NULL defaults here.

`col, border, cex.names`: arguments passed to `barplot()`; note that the default used to be `col = heat.colors(n), border = par("fg")` instead. `col` can also be a color vector of length k for clusterwise coloring; see also `do.col.sort`.

`do.col.sort`: logical indicating if the colors `col` should be sorted “along” the silhouette; this is useful for casewise or clusterwise coloring.

`do.n.k`: logical indicating if n and k “title text” should be written.

`do.clus.stat`: logical indicating if cluster size and averages should be written to the right of the silhouettes.
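As a brief illustration of the `dist` and `dmatrix` arguments, the following sketch passes the same dissimilarities in both forms to the default method and compares the resulting silhouette widths. The toy vectors `x` and `cl` are made up for this example and are not part of the package.

```r
library(cluster)

# Toy data (an assumption for illustration): two groups on a line
x  <- c(1, 2, 3, 10, 11, 12)
cl <- c(1L, 1L, 1L, 2L, 2L, 2L)

d  <- dist(x)        # object of class "dist"
dm <- as.matrix(d)   # symmetric n x n dissimilarity matrix

s.dist <- silhouette(cl, dist = d)
s.dmat <- silhouette(cl, dmatrix = dm)

# Both calls should give the same silhouette widths
all.equal(s.dist[, "sil_width"], s.dmat[, "sil_width"])
```

Passing `dmatrix` is mainly of interest when the full matrix already exists, since it avoids a conversion.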

Details

For each observation i, the silhouette width s(i) is defined as follows:
Put a(i) = average dissimilarity between i and all other points of the cluster to which i belongs (if i is the only observation in its cluster, s(i) := 0 without further calculations). For all other clusters C, put d(i,C) = average dissimilarity of i to all observations of C. The smallest of these d(i,C) is b(i) := min_C d(i,C), and can be seen as the dissimilarity between i and its “neighbor” cluster, i.e., the nearest one to which it does not belong. Finally,

s(i) := ( b(i) - a(i) ) / max( a(i), b(i) ).
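The definition can be traced by hand on a tiny data set. The following base-R sketch is illustrative only: the helper name `sil_width_by_hand` and the toy data are assumptions made for this example, not part of the package, and it ignores the degenerate case where a(i) and b(i) are both zero.

```r
# Hypothetical helper (not part of 'cluster'): silhouette widths computed
# directly from the definition, for dissimilarities 'dm' and an integer
# cluster vector 'cl'.
sil_width_by_hand <- function(dm, cl) {
  dm <- as.matrix(dm)
  ks <- sort(unique(cl))
  vapply(seq_along(cl), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1L) return(0)          # singleton cluster: s(i) := 0
    # a(i): d(i,i) = 0, so summing over the own cluster and dividing by
    # (cluster size - 1) averages over the *other* members only
    a <- sum(dm[i, own]) / (sum(own) - 1)
    # d(i,C) for every other cluster C; the smallest is b(i)
    b <- min(vapply(ks[ks != cl[i]],
                    function(k) mean(dm[i, cl == k]), numeric(1)))
    (b - a) / max(a, b)
  }, numeric(1))
}

# Toy data: two well-separated groups on a line
x  <- c(1, 2, 3, 10, 11, 12)
cl <- c(1L, 1L, 1L, 2L, 2L, 2L)
s  <- sil_width_by_hand(dist(x), cl)
round(s, 4)
```

For the first observation, a(1) = (1 + 2)/2 = 1.5 and b(1) = (9 + 10 + 11)/3 = 10, so s(1) = 8.5/10 = 0.85.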

`silhouette.default()` is now based on C code donated by Romain Francois (the R version is still available as `cluster:::silhouette.default.R`).

Observations with a large s(i) (close to 1) are very well clustered; a small s(i) (around 0) means that the observation lies between two clusters; and observations with a negative s(i) are probably placed in the wrong cluster.
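To make this interpretation concrete, the sketch below deliberately mislabels one observation in a made-up data set and then picks out the observations with negative silhouette width. The data and the variable names are assumptions for illustration only.

```r
library(cluster)

# Toy data (made up for illustration); observation 4 (x = 10) is
# deliberately given the label of the "wrong" group.
x  <- c(1, 2, 3, 10, 11, 12)
cl <- c(1L, 1L, 1L, 1L, 2L, 2L)

sil <- silhouette(cl, dist(x))

# Observations with negative silhouette width, together with the
# cluster each of them is closest to on average
neg <- which(sil[, "sil_width"] < 0)
cbind(obs = neg, neighbor = sil[neg, "neighbor"])
```

Here a(4) = (9 + 8 + 7)/3 = 8 and b(4) = (1 + 2)/2 = 1.5, so s(4) = -6.5/8 = -0.8125, and the `neighbor` column points to cluster 2 as the better fit.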

Value

`silhouette()` returns an object, `sil`, of class `silhouette` which is an [n x 3] matrix with attributes. For each observation i, `sil[i,]` contains the cluster to which i belongs as well as the neighbor cluster of i (the cluster, not containing i, for which the average dissimilarity between its observations and i is minimal), and the silhouette width s(i) of the observation. The `colnames` correspondingly are `c("cluster", "neighbor", "sil_width")`.
`summary(sil)` returns an object of class `summary.silhouette`, a list with components

`si.summary`: numerical `summary` of the individual silhouette widths s(i).

`clus.avg.widths`: numeric (rank 1) array of clusterwise summaries of the silhouette widths, computed with `FUN` (`mean` by default).

`avg.width`: the total mean `FUN(s)` where `s` are the individual silhouette widths.

`clus.sizes`: `table` of the k cluster sizes.

`call`: if available, the call creating `sil`.

`Ordered`: logical identical to `attr(sil, "Ordered")`, see below.
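A short sketch of how these components might be accessed, using the `ruspini` data shipped with the package; the variable names `sil` and `ss` are chosen for this example only.

```r
library(cluster)
data(ruspini)

sil <- silhouette(pam(ruspini, 4))
ss  <- summary(sil)

ss$avg.width         # overall mean silhouette width
ss$clus.avg.widths   # one mean per cluster
ss$clus.sizes        # table of cluster sizes

# A different summary function can be supplied via FUN
summary(sil, FUN = median)$avg.width
```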

`sortSilhouette(sil)` orders the rows of `sil` as in the silhouette plot, by cluster (increasingly) and decreasing silhouette width s(i).
`attr(sil, "Ordered")` is a logical indicating if `sil` is ordered as by `sortSilhouette()`. In that case, `rownames(sil)` will contain case labels or numbers, and `attr(sil, "iOrd")` the ordering index vector.
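The ordering and its attributes can be inspected as in the following sketch. It uses the default method on a clustering vector so that the input is not already sorted; the variable names are assumptions for this example.

```r
library(cluster)
data(ruspini)

cl   <- pam(ruspini, 4)$clustering
sil  <- silhouette(cl, dist(ruspini))   # default method
ssil <- sortSilhouette(sil)

colnames(ssil)               # "cluster" "neighbor" "sil_width"
attr(ssil, "Ordered")        # TRUE after sorting
head(attr(ssil, "iOrd"))     # first few entries of the ordering index
head(ssil[, "cluster"], 10)  # rows grouped by increasing cluster code
```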

Note

While `silhouette()` is intrinsic to the `partition` clusterings, and hence has a (trivial) method for these, it is straightforward to get silhouettes for hierarchical clusterings via `silhouette.default()`, with the result of `cutree()` and a distance object as input.

By default, for `clara()` partitions, the silhouette is just for the best random subset used. Use `full = TRUE` to compute (and later possibly plot) the full silhouette.

References

Rousseeuw, P.J. (1987) Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math., 20, 53–65.

Chapter 2 of Kaufman, L. and Rousseeuw, P.J. (1990), see the references in `plot.agnes`.

See Also

`partition.object`, `plot.partition`.

Examples

```
library(cluster) # provides silhouette(), pam(), clara(), agnes(), daisy() and the data sets
data(ruspini)
pr4 <- pam(ruspini, 4)
str(si <- silhouette(pr4))
(ssi <- summary(si))
plot(si) # silhouette plot
plot(si, col = c("red", "green", "blue", "purple"))# with cluster-wise coloring

si2 <- silhouette(pr4$clustering, dist(ruspini, "canberra"))
summary(si2) # has small values: "canberra"'s fault
plot(si2, nmax= 80, cex.names=0.6)

op <- par(mfrow= c(3,2), oma= c(0,0, 3, 0),
          mgp= c(1.6,.8,0), mar= .1+c(4,2,2,2))
for(k in 2:6)
  plot(silhouette(pam(ruspini, k=k)), main = paste("k = ",k), do.n.k=FALSE)
mtext("PAM(Ruspini) as in Kaufman & Rousseeuw, p.101",
      outer = TRUE, font = par("font.main"), cex = par("cex.main"))
par(op)

## clara(): standard silhouette is just for the best random subset
data(xclara)
set.seed(7)
str(xc1k <- xclara[sample(nrow(xclara), size = 1000) ,])
cl3 <- clara(xc1k, 3)
plot(silhouette(cl3))# only of the "best" subset of 46
## The full silhouette: internally needs the full dist object
## (about 4 MB for the n = 1000 subsample used here):
sf <- silhouette(cl3, full = TRUE) ## this is the same as
s.full <- silhouette(cl3$clustering, daisy(xc1k))
## compare versions numerically, not as strings:
if(getRversion() >= "2.3.0")
  stopifnot(all.equal(sf, s.full, check.attributes = FALSE, tolerance = 0))
## color dependent on original "3 groups of each 1000":
plot(sf, col = 2+ as.integer(names(cl3$clustering) ) %/% 1000,
     main ="plot(silhouette(clara(.), full = TRUE))")

## Silhouette for a hierarchical clustering:
ar <- agnes(ruspini)
si3 <- silhouette(cutree(ar, k = 5), # k = 4 gave the same as pam() above
                  daisy(ruspini))
plot(si3, nmax = 80, cex.names = 0.5)
## 2 groups: Agnes() wasn't too good:
si4 <- silhouette(cutree(ar, k = 2), daisy(ruspini))
plot(si4, nmax = 80, cex.names = 0.5)
```
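As a numeric companion to the k = 2, ..., 6 plots in the examples, the average silhouette width can be tabulated for each k. This is only a sketch of one common way to compare partitions; the variable names `ks` and `avg.width` are assumptions, and the average width is a heuristic rather than a definitive criterion.

```r
library(cluster)
data(ruspini)

ks <- 2:6
# Overall average silhouette width of the PAM partition for each k
avg.width <- sapply(ks, function(k)
  summary(silhouette(pam(ruspini, k = k)))$avg.width)
names(avg.width) <- paste0("k=", ks)

round(avg.width, 3)
ks[which.max(avg.width)]   # k with the largest average width
```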

[Package cluster version 1.11.5 Index]