
Cluster Analysis and Data Mining

Ronald S. King
Mercury Learning and Information
Publication Date: 
Number of Pages: 
Paperback with CDROM
[Reviewed by Robert W. Hayden]

Readers may be familiar with regression analysis or analysis of variance. Those types of analysis are techniques for solving certain classes of statistical problems. Cluster analysis, on the other hand, is a name for a class of problems, for which many (partial?) solutions exist. As an example, imagine that you are an automobile manufacturer planning advertising for a plain but reliable and inexpensive model. A first look at the data shows that the average age of buyers of this model is 45 years. A cluster analysis would look for identifiable groups (or clusters) within the data that might help you target your advertising more precisely. Here perhaps there are major clusters of buyers in their mid-twenties and in their seventies. These would be groups making far less than their career peak income who appreciate reliable and inexpensive transportation. There may in fact be very few buyers with ages near 45.
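The point of the automobile example can be made concrete with a small simulation. The following is only a sketch under stated assumptions: the ages are invented for illustration (they are not data from the book), the helper `two_means_1d` is a hypothetical name, and a bare-bones two-cluster k-means stands in for whatever clustering method one might actually choose.

```python
import random
import statistics


def two_means_1d(values, iterations=50):
    """Lloyd's k-means with k=2 on one-dimensional data (illustrative helper)."""
    centers = [min(values), max(values)]  # simple deterministic starting centers
    groups = [[], []]
    for _ in range(iterations):
        groups = [[], []]
        for v in values:
            # Assign each value to the nearer of the two current centers.
            nearest = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[nearest].append(v)
        # Move each center to the mean of the values assigned to it.
        new_centers = [statistics.mean(g) if g else c
                       for g, c in zip(groups, centers)]
        if new_centers == centers:
            break  # assignments have stabilized
        centers = new_centers
    return centers, groups


rng = random.Random(0)
# Hypothetical buyer ages: one group in the mid-twenties, one in the seventies.
ages = ([rng.gauss(25, 3) for _ in range(500)]
        + [rng.gauss(72, 4) for _ in range(400)])

overall_mean = statistics.mean(ages)
centers, groups = two_means_1d(ages)
near_45 = sum(1 for a in ages if 40 <= a <= 50)

print(f"overall mean age: {overall_mean:.1f}")
print(f"cluster centers: {centers[0]:.1f} and {centers[1]:.1f}")
print(f"buyers aged 40 to 50: {near_45} of {len(ages)}")
```

With data built this way, the overall mean lands in the mid-forties even though almost no simulated buyer is near that age, while the two cluster centers sit close to the groups that actually exist.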

The above example also illustrates that cluster analysis is usually an exploratory technique. Typically we do not have a hypothesis about the clusters; we simply wish to discover whether there are any. The usual next step is to act on the basis of what we find. The auto manufacturer might now target advertising at 22-year-olds and 75-year-olds rather than 45-year-olds. The proof of the pudding is in how this affects sales rather than in a formal hypothesis test.

It takes a fairly large data set to support searching for clusters, especially if there is no a priori bound on how many might be present. This, and its exploratory nature, has led to cluster analysis often being used in data mining with “big data,” as the title of the book at hand suggests.

Now we come to the sad subject of whether this is the best book from which you might learn more about cluster analysis. No, it is not. The author has certainly done his homework in finding relevant references, and those might be of value, and indeed might be better books to read. The main problem with this one is a lack of careful thought about communication and pedagogy. Specifics are too numerous to mention, so two brief examples may serve as representative.

The book begins with four pages of reasonable text. The stated prerequisites are “elementary statistics plus a brief exposure to data structures.” Yet on page 5 we encounter a formal definition of an abstract metric space. After some more scary stuff, page 10 seems to be a worked example. But no context is given and we just have a table of numbers. By some process this generates a sequence of four additional tables, but no calculations are shown, only the results. In addition, the (meaningful) column labels, which change at each step, are all illegible. (It appears the table was once in color and was here reduced to black and white without anyone checking the results. This is but one of far too many signs of no editorial supervision and no proofreading.)

A second example is Chapter 8 on cluster validity, that is, determining whether the clusters found are real in some sense. This has sections that review hypothesis testing, Monte Carlo methods, and random number generation, all in 34 pages and in far more detail than is needed in the final six pages that are actually about cluster validity. There the author gives very little detail, instead referring the reader to the cited references.

The result is that it is hard to identify an appropriate audience for this work. Beginners will be lost while the well-prepared may find the main content skimpy and hard to follow.

After a few years in industry, Robert W. Hayden taught mathematics at colleges and universities for 32 years and statistics for 20 years. In 2005 he retired from full-time classroom work. He now teaches statistics online and does summer workshops for high school teachers of Advanced Placement Statistics. He contributed the chapter on evaluating introductory statistics textbooks to the MAA's Teaching Statistics.
