Penn State University Eberly College of Science Center for Astrostatistics

Multivariate classification & analysis


Classification Society of North America (CSNA)
    Metasite with links to classification meetings, journals, discussion groups, commercial and on-line software.

    Collection of multivariate clustering techniques implemented in the core R package.  DAISY computes dissimilarities between objects with different types of variables.  Partitioning Around Medoids (PAM) partitions the dataset using the k-medoid method, which is robust against outliers.  Clustering Large Applications (CLARA) partitions large datasets.  Fuzzy Analysis (FANNY) gives a fuzzy partitioning.  Agglomerative clustering (AGNES) and divisive clustering (DIANA) give hierarchical structures.  Monothetic Analysis (MONA) uses binary variables.  This site gives stand-alone Fortran implementations.  From the book Finding Groups in Data: An Introduction to Cluster Analysis by L. Kaufman and P. J. Rousseeuw (1990).

Normal mixture models

Several codes are available that classify and characterize multivariate datasets as mixtures of Gaussian populations via likelihood methods, often using the EM Algorithm and Bayesian principles. Snob uses the minimum message length method of machine learning.

    EMMIX by G. McLachlan of University of Queensland
    MCLUST by C. Fraley and A. Raftery of University of Washington
    AutoClass C by P. Cheeseman of NASA's Ames Research Center
    Snob by D. Dowe of Monash University
    FastEM by the Auton Lab (CMU) and the PiCA Collaboration
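
The likelihood approach shared by these codes can be illustrated with a minimal sketch of the EM Algorithm for a two-component Gaussian mixture in one dimension. This is not the code of any package listed above; the function names, the starting values, and the simulated sample are assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_two_gaussians(data, n_iter=200):
    """Fit a two-component 1-D Gaussian mixture by EM (illustrative only)."""
    n = len(data)
    mean_all = sum(data) / n
    var_all = sum((x - mean_all) ** 2 for x in data) / n
    # Crude starting values (an assumption): means at the data extremes.
    weights = [0.5, 0.5]
    means = [min(data), max(data)]
    variances = [var_all, var_all]
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each component.
        resp = []
        for x in data:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            total = p[0] + p[1]
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weights, means, and variances from those probabilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # floor guards against a collapsing component
    return weights, means, variances


# Simulated sample: 300 points near 0 and 200 points near 6 (made-up data).
rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(300)] + [rng.gauss(6.0, 1.0) for _ in range(200)]
weights, means, variances = em_two_gaussians(sample)
```

The packages above go well beyond this sketch: they handle multivariate data and full covariance matrices, and they choose the number of components rather than fixing it at two.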

Weka Knowledge Explorer
    Machine learning algorithms for data mining including multivariate classifiers, decision trees, neural nets, GUIs, resampling and more.  Written in Java.

Machine Learning Library in C++ (MLC++)
    Data mining and multivariate classification package including data manipulation, a variety of categorizers (on attributes, thresholds, nearest neighbor, perceptron, decision tree), induction algorithms, and visualization tools for data and trees.  From Silicon Graphics Inc.

GRB Tool Shed
    Interactive environment for the analysis of astronomical gamma-ray bursts from NASA's BATSE experiment.  Emphasizes multivariate classification including supervised decision trees, K* nearest neighbor, Naive Bayes, normal mixtures using the EM Algorithm, K means, COBWEB, backpropagation neural networks, and Kohonen networks.  Based on the Weka machine learning package.  By Jon Hakkila (College of Charleston) and colleagues.

Feasible solution algorithms
    Algorithms for the common high-breakdown estimation criteria and for finding the minimum volume ellipsoid in multivariate datasets.  By D. Hawkins, University of Minnesota, and distributed by StatLib.

Oblique classifier 1 (OC1)
    Partitioning of multivariate datasets using oblique and axis-parallel hyperplanes.  Written in C by S. Salzberg of Johns Hopkins University.

Software for clustering and multivariate analysis
    Metasite with descriptions of on-line programs and packages.  From Fionn Murtagh (Univ. London)

    Clustering algorithm based on dynamic altering of hierarchies.

Fast Algorithm for Classification Trees
    Tree-structured classification similar to CART.

    Library of several dozen subroutines from NIST for the multivariate clustering algorithms in the 1975 monograph by J. A. Hartigan.

Cluster analysis
    Six programs computing dissimilarities, partitioning using medoids, k-medoid clustering, fuzzy clustering, agglomerative and divisive hierarchical clustering, clustering of binary data.

    Average-linkage hierarchical clustering.

Hierarchical clustering
    Algorithm for agglomerative clustering using various criteria (Ward's minimum variance, single linkage, average linkage, complete linkage, McQuitty's method, median method, centroid method).
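
As an illustration of how the linkage criterion enters agglomerative clustering, the following is a minimal sketch covering three of the criteria named above (single, complete, and average linkage). It is a naive implementation written for readability, not the listed algorithm; the function name `agglomerate` and the toy points are assumptions.

```python
import math


def agglomerate(points, n_clusters, linkage="single"):
    """Naive agglomerative clustering; returns index clusters and the merge history."""

    def cluster_distance(a, b):
        # All pairwise Euclidean distances between members of clusters a and b.
        pairs = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":
            return min(pairs)            # closest pair
        if linkage == "complete":
            return max(pairs)            # farthest pair
        if linkage == "average":
            return sum(pairs) / len(pairs)
        raise ValueError("unknown linkage: " + linkage)

    clusters = [[i] for i in range(len(points))]  # start with singletons
    merges = []
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        dist, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merges.append((list(clusters[a]), list(clusters[b]), dist))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


# Toy data (made up): two tight pairs and one isolated point.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (20.0, 20.0)]
clusters, merges = agglomerate(toy_points, n_clusters=3, linkage="average")
```

This version recomputes every inter-cluster distance at each step, so it is only practical for small datasets. Ward's, McQuitty's, median, and centroid methods are usually implemented through distance-update formulas rather than by rescanning all pairs.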

Hierarchical clustering
    Algorithm for single-linkage and minimum intra-cluster variance clustering.  Applied Statistics algorithm #58.

k-means clustering
    k-means clustering minimizing intra-cluster variance.
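
For readers unfamiliar with the criterion, a minimal sketch of the standard Lloyd iteration for k-means follows. It alternates between assigning points to the nearest center and recomputing each center as the mean of its members, which never increases the intra-cluster variance. It is not the listed code; the function names, the random initialization, and the toy data are assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k distinct data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center becomes the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster empties
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Toy data (made up): two well-separated squares of four points each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
centers, labels = kmeans(toy_points, k=2)
```

The result depends on the starting centers, so in practice the iteration is usually restarted from several initializations and the solution with the lowest total intra-cluster variance is kept.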

Multivariate analysis

R Package
    Package in Pascal developed for ecological spatio-temporal multivariate datasets based on the monograph by L. & P. Legendre (1983).  Functionalities include autocorrelation using correlograms (Moran's I and Geary's c indices), hierarchical agglomerative clustering, k-means clustering, chronological clustering for multivariate time series, analysis of variance, geometrical connectors (nearest neighbor, Gabriel's connection, Delaunay triangulation), Mantel's two-sample statistic, multidimensional scaling by principal coordinates analysis, and univariate periodogram.  [This package should not be confused with the enormous R statistical package modeled after S-Plus.]

    Large multivariate analysis and graphical display package designed for ecologists and geographers.  Includes principal components analysis with instrumental variables, correspondence analysis, coinertia analysis, contingency tables, discriminant analysis, fuzzy correspondence analysis, Rao's diversity coefficient, Moran's I and Geary's c randomization tests for spatial autocorrelation, Wartenberg's multivariate spatial correlation analysis, and partial triadic analysis of k-tables.  From the bioinformatic group at Universite de Lyon for Macintosh and Windows 95 platforms.

Fast Minimum Covariance Determinant (MCD)
    This is a highly robust estimator of multivariate location and scatter based on the subset of points whose covariance matrix has the lowest determinant.  Efficient method for large datasets.  By P. Rousseeuw and K. Van Driessen of University of Antwerp.
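
The idea of searching for the subset with the lowest covariance determinant can be sketched with repeated concentration steps ("C-steps"): fit a mean and covariance to a subset, then replace the subset with the h points of smallest Mahalanobis distance. The sketch below is restricted to two dimensions and omits the refinements that make the published method fast on large datasets; the function names, the number of random starts, and the simulated data are assumptions.

```python
import random


def mean_cov(pts):
    """Sample mean and covariance (sxx, sxy, syy) of a list of 2-D points."""
    n = len(pts)
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    sxx = sum((x - mx) ** 2 for x, _ in pts) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in pts) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in pts) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def det(cov):
    sxx, sxy, syy = cov
    return sxx * syy - sxy * sxy


def mahalanobis_sq(p, center, cov):
    """Squared Mahalanobis distance, using the closed-form 2x2 inverse."""
    sxx, sxy, syy = cov
    dx, dy = p[0] - center[0], p[1] - center[1]
    return (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det(cov)


def concentrate(points, h, start, max_iter=100):
    """Iterate C-steps from a starting subset until the h-subset stops changing."""
    subset = sorted(start)
    for _ in range(max_iter):
        center, cov = mean_cov([points[i] for i in subset])
        if det(cov) <= 1e-12:
            return None  # degenerate (e.g. collinear) subset: give up on this start
        ranked = sorted(range(len(points)),
                        key=lambda i: mahalanobis_sq(points[i], center, cov))
        new_subset = sorted(ranked[:h])  # keep the h closest points
        if new_subset == subset:
            break
        subset = new_subset
    center, cov = mean_cov([points[i] for i in subset])
    return det(cov), center, cov, subset


def mcd_sketch(points, h, n_starts=50, seed=0):
    """Best result over several random 3-point starts (lowest determinant wins)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        result = concentrate(points, h, rng.sample(range(len(points)), 3))
        if result is not None and (best is None or result[0] < best[0]):
            best = result
    return best


# Simulated data (made up): 40 clean points near the origin, 10 outliers near (10, 10).
rng = random.Random(1)
clean = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(40)]
outliers = [(rng.gauss(10.0, 0.5), rng.gauss(10.0, 0.5)) for _ in range(10)]
points = clean + outliers
best_det, center, cov, subset = mcd_sketch(points, h=38)
```

In this simulated example the selected subset should exclude the contaminating points, so the estimated center stays near the origin, whereas the ordinary sample mean would be pulled toward (10, 10).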

Minimum Volume Ellipsoid (MINVOL)
    Computes highly robust location and scatter matrix.  By P. Rousseeuw of University of Antwerp.

Multivariate data analysis software
    Collection of subroutines for principal components analysis, partitioning, hierarchical clustering, discriminant analyses (linear, multiple, k-nearest neighbors), correspondence analysis, multidimensional scaling, Sammon mapping, and Kohonen self-organizing feature map.  From Fionn Murtagh (Univ. London).

    Self-contained data management and analysis system well-adapted to very large multivariate datasets.  Includes fast searches and data mining, ANOVA, linear modeling, clustering, and life table analysis.  For Windows.

    Interactive Projection Pursuit, providing 1- and 2-dimensional projections of multivariate data for interactive discovery of structure. The user chooses and graphically investigates interesting projections. From Case Western Reserve University. C and Fortran algorithms installed as a library for S-Plus.

Projection pursuit
    Two-dimensional exploratory projection pursuit.

Multivariate skewness and kurtosis
Probabilities of R2
    Distribution function of the squared multiple correlation coefficient.

Linear dependency analysis for multivariate data
Multivariate linear regression by least median of squares.
Minimum volume ellipsoid estimator
    Robust estimator of multivariate location and dispersion.

    Hypothesis testing for means and spreads for multivariate Gaussian data.