Summary: This course - the first in a series of three - provides a foundation for understanding cluster analysis in unlabeled data. The target audience comprises undergraduate and graduate students majoring in engineering and science, as well as practicing engineers and scientists interested in either research about or applications of clustering to real-world problems such as data mining, image analysis and bioinformatics. The subject matter is widely available in a number of standard textbooks given in the references below. The course begins with a discussion of the general nature of clustering. Three problems are identified: tendency assessment, partitioning and validation. Two types of data are discussed: object vector data and pairwise relational data. Next, I develop the mathematical structure needed to carry out clustering algorithms, discussing the notions of similarity, label vectors, partition matrices (U) and point prototypes (V). The second part of the course contains a description of (and pseudocode for) one algorithm from each of the four major categories of clustering methods. Specifically, I discuss and illustrate with a numerical example: (i) the U-only model for single linkage clustering; (ii) the V-only model for clustering with Kohonen's self-organizing map; (iii) the (U,V) model for clustering with the hard and fuzzy c-means models; and (iv) the (U,V,+) model for clustering using the expectation-maximization algorithm for Gaussian mixture decomposition.
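To make the (U,V) notation concrete, here is a minimal sketch of fuzzy c-means in Python with NumPy. It is an illustration under stated assumptions, not the pseudocode used in the course: the function name `fuzzy_c_means`, the random initialization of U, the Euclidean norm, the fuzzifier default m = 2 and the stopping tolerance are all choices made for this example. U is stored as a c x n partition matrix whose columns sum to 1, and V as a c x p matrix of point prototypes.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Alternate between prototype (V) and membership (U) updates.

    X is n x p object data; returns U (c x n) and V (c x p).
    All defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random fuzzy partition: every column of U sums to 1.
    U = rng.random((c, n))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        # Prototype update: weighted means with weights u_ik ** m.
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)
        # Distances from every prototype to every object (c x n).
        D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        D = np.fmax(D, 1e-12)  # guard against division by zero
        # Membership update: u_ik proportional to d_ik ** (-2 / (m - 1)).
        inv = D ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=0, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, V
```

Taking `U.argmax(axis=0)` hardens the fuzzy partition into a label vector, which is one way to connect the three representations (label vectors, U and V) named in the summary.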
Summary: This last module in the series discusses just one approach to the interesting and important problem of clustering in very large (VL) data. The target audience is graduate students majoring in engineering and science, and practicing engineers and scientists interested in either research about or applications of clustering to very large real-world problems that occur in data mining, image analysis and bioinformatics. Almost none of the subject matter in this course is available in textbooks; almost all of it is the object of (my own) current research, and as such, it reflects my own biases, prejudices, background and interests. I have supplied references that contain pointers to many nice papers on these topics that use related or competitive methods proposed and studied by others. I begin with a characterization of VL data. For me, this means any data set that you cannot load into your computer. This is not an objective definition, but it is easy to understand and practical: there is always a data set too big for any computer you use, and hence VL for you. There are two main approaches to clustering in VL data: distributed clustering, and progressive sampling followed by extension. I discuss the first approach briefly, but it seems much more difficult to me than the second approach. Next, I define progressive sampling followed by (non-iterative) extension. This idea is pretty general: it can accelerate most (but not all) iterative algorithms that estimate parameters with loadable data (this is true for both clustering and classifier design!), and it provides a means for approximating the outputs of many algorithms for unloadable data. So, one of the main points of this third course is to establish the basic ideas of progressive sampling and extension.
The method of clustering in VL data by (sampling + extension) is developed and illustrated with four clustering algorithms: (i) extended fast fuzzy c-means (eFFCM) for segmentation of VL images; (ii) generalized fast fuzzy c-means (geFFCM) for clustering in VL object data (VL sets of feature vectors in p dimensions); (iii) generalized fast expectation-maximization (geFEM) for clustering by Gaussian mixture decomposition in VL object data; and (iv) extended non-Euclidean relational fuzzy c-means (eNERF) for clustering in VL (square) relational data. These four methods are presented in the spirit of active research - i.e., parts of them clearly need improvement and more testing, and I expect much of this material to be replaced by better approaches as our understanding of clustering using this approach matures.
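The (sampling + extension) idea can be sketched in a few lines of Python. This is a simplified illustration, not eFFCM, geFFCM, geFEM or eNERF: it draws one random sample at a fixed rate instead of growing the sample progressively until it passes a statistical test, and it uses plain fuzzy c-means on the sample. The names `fcm_prototypes`, `memberships`, `sample_then_extend` and the `chunks` callable (which yields loadable blocks of the unloadable data) are hypothetical.

```python
import numpy as np

def memberships(X, V, m=2.0):
    """Fuzzy c-means memberships (c x n) of objects X for fixed prototypes V."""
    D = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
    D = np.fmax(D, 1e-12)  # guard against division by zero
    inv = D ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)

def fcm_prototypes(Xs, c, m=2.0, n_iter=100, seed=0):
    """Run fuzzy c-means on a loadable sample Xs; return prototypes V (c x p)."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, Xs.shape[0]))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        V = (W @ Xs) / W.sum(axis=1, keepdims=True)
        U = memberships(Xs, V, m)
    return V

def sample_then_extend(chunks, sample_rate, c, m=2.0, seed=0):
    """Cluster a sample, then label all data in one non-iterative pass.

    `chunks` is a zero-argument callable returning an iterator over
    n_i x p arrays, so the full data set never has to be in memory.
    """
    rng = np.random.default_rng(seed)
    # Pass 1: keep each object with probability `sample_rate`.
    sample = np.vstack([ch[rng.random(len(ch)) < sample_rate] for ch in chunks()])
    # Iterative clustering happens only on the (loadable) sample.
    V = fcm_prototypes(sample, c, m, seed=seed)
    # Pass 2 (extension): one membership computation per chunk, no iteration.
    labels = np.concatenate(
        [memberships(ch, V, m).argmax(axis=0) for ch in chunks()]
    )
    return V, labels
```

The design point is that the expensive iterative loop touches only the sample, while the rest of the data is visited once with the prototypes held fixed.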
Summary: This course - the second in a series of three - discusses several approaches to the first and third problems of clustering identified in module I - viz., pre-clustering tendency assessment and post-clustering cluster validation. The target audience comprises advanced undergraduate and graduate students majoring in engineering and science, and practicing engineers and scientists interested in either research about or applications of clustering to real-world problems such as data mining, image analysis and bioinformatics. Some of the subject matter in this course is available in textbooks (most notably some of the material about cluster validity functionals), and some of it is the object of (my) current research. The references contain pointers to some excellent papers on these topics, and on a number of related or competitive methods that have been proposed and studied by others. I begin with a simple numerical example that establishes the necessity for both assessment and validity. Then, I discuss the visual assessment of tendency (VAT) family of algorithms (VAT, sVAT and coVAT). These algorithms produce images that enable a user to make useful guesses about the number of clusters to seek in relational data before proceeding with a partitioning method for finding the clusters. Since object data can always be converted to relational form by computing pairwise distances, these methods are well defined for all types of unlabeled numerical data. The coVAT algorithm provides a means for estimating the number of clusters in each of the four problems associated with rectangular relational data: row clusters, column clusters, joint (pure) clusters, and mixed co-clusters. The second half of this course presents some examples of cluster validation using scalar measures or indices of cluster validity. Several examples from each of the three major categories (crisp, fuzzy and probabilistic) of indices are presented.
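The VAT reordering mentioned earlier in this summary can be sketched as follows. This is a minimal illustration of the basic VAT ordering only (not sVAT or coVAT), written from the commonly published description of the algorithm; the function names `pairwise_distances` and `vat_order` and the choice of Euclidean distance are assumptions of this example. The first helper shows the conversion of object data to relational form; the second reorders the dissimilarity matrix so that clusters tend to appear as dark blocks along the diagonal when the result is displayed as an image.

```python
import numpy as np

def pairwise_distances(X):
    """Convert n x p object data to an n x n Euclidean dissimilarity matrix."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

def vat_order(R):
    """Return the VAT permutation P and the reordered matrix R[P][:, P].

    R is a symmetric n x n dissimilarity matrix.
    """
    n = R.shape[0]
    # Start from one end of the most dissimilar pair of objects.
    i, _ = np.unravel_index(np.argmax(R), R.shape)
    order = [int(i)]
    remaining = np.ones(n, dtype=bool)
    remaining[i] = False
    # d[q] = smallest dissimilarity between object q and any selected object.
    d = R[i].astype(float).copy()
    for _ in range(n - 1):
        # Pick the unselected object closest to the selected set.
        j = int(np.argmin(np.where(remaining, d, np.inf)))
        order.append(j)
        remaining[j] = False
        d = np.minimum(d, R[j])
    P = np.array(order)
    return P, R[np.ix_(P, P)]
```

Displaying the reordered matrix as a grayscale image (for example with an image-plotting routine of your choice) gives the kind of picture from which a user can guess the number of clusters before partitioning.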
This course concludes with a numerical example that compares 23 indices of all three types on clusters in 12 sets of data drawn from mixtures of Gaussian distributions having either 3 or 6 components. Some indices of each type do pretty well in this example, while others do very badly. I don't think this problem has a general "solution", but since we use clustering in many, many applications, we keep trying to find good indices to validate algorithmic outputs.
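As a small illustration of what a scalar validity index looks like, the sketch below computes two classical fuzzy indices, the partition coefficient and the partition entropy, from a c x n fuzzy partition matrix U. These two are chosen only because they are short to state; the summary does not say which 23 indices are compared, so nothing here should be read as a list of them. The function names are hypothetical.

```python
import numpy as np

def partition_coefficient(U):
    """Mean squared membership; ranges from 1/c (fuzziest) to 1 (crisp)."""
    return float(np.sum(U ** 2) / U.shape[1])

def partition_entropy(U):
    """Mean membership entropy; ranges from 0 (crisp) to ln(c) (fuzziest)."""
    # Treat 0 * log(0) as 0 by replacing zeros inside the logarithm only.
    U_safe = np.where(U > 0, U, 1.0)
    return float(-np.sum(U * np.log(U_safe)) / U.shape[1])
```

A common way to use such indices is to compute them for partitions obtained at several values of c and compare the results, keeping in mind the point made above that no single index works well on every data set.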