Making Sense of Data II: A Practical Guide to Data Visualization, Advanced Data Mining Methods, and Applications

Glenn J. Myatt and Wayne P. Johnson
John Wiley
Reviewed by Patricia Humphrey

This monograph is aimed at professionals in business, government, or other sectors who might be thrust into a data mining/statistical analysis project with no knowledge of the field. As the title implies, it is the second in the series and goes more deeply into the topics of visualization, clustering, and predictive analysis than the first volume. The mathematics is kept at a very elementary level, so the book should be accessible to all its intended readers. The intent is really to make the readers conversant with what might be involved in such a project, not to make them technical experts.

The first chapter provides an overview and introduces the steps typically required to prepare data for analysis of any type (“cleaning” the data, transforming variables, etc.). The second chapter gives good descriptions and examples of the types of graphics one should examine in a data mining project. The third and fourth chapters describe analytic clustering and predictive modeling. The fifth chapter gives the reader some examples of how data mining might be employed in certain industries, as well as two fairly detailed case studies: one involving microRNA analysis in the detection of cancers and another on credit scoring.

The biggest strength of the book is the good descriptions, easily accessible to a nontechnician, of the logic behind such complex topics as cluster analysis (methods for classifying “like” observations based on similarities and differences in the set of variables), multiple and logistic regression, as well as discriminant analysis and principal components analysis. The targeted “manager” will understand the basic concept, while most details are left to the experts.

Appendix B is really a user’s manual for the Traceis software, which will perform all the analyses described in the book. One can obtain the software free from the authors’ website (using the package requires obtaining a license key from the authors). Commercial software of this type would cost a user hundreds of dollars; it is entirely possible that Traceis won’t be free for long.

My quibbles with the book are few; they mostly concern a couple of the authors’ comments on data preparation and cleaning. For example, while they do suggest bivariate data visualizations such as scatterplot matrices (useful in themselves for identifying outliers in the data, something not mentioned), they also suggest computing correlation matrices. These can be extremely misleading without accompanying plots, as correlations are useful measures of association only if the relationship is linear.
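To make the point about correlations concrete, here is a minimal sketch in Python. It is not taken from the book or from Traceis; the helper `pearson` and the toy data are illustrative assumptions. It compares the correlation coefficient for a perfectly linear relationship with that for a perfectly quadratic one over the same x values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the sums of squared deviations.
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Toy data (assumed for illustration): x values symmetric about zero.
xs = list(range(-5, 6))
linear = [2 * x + 1 for x in xs]   # y depends on x linearly
quadratic = [x ** 2 for x in xs]   # y depends on x exactly, but not linearly

r_linear = pearson(xs, linear)
r_quadratic = pearson(xs, quadratic)

print(f"linear relationship:    r = {r_linear:.3f}")
print(f"quadratic relationship: r = {r_quadratic:.3f}")
```

In the quadratic case y is completely determined by x, yet the coefficient comes out at essentially zero because the positive and negative halves of the curve cancel. A correlation matrix alone would suggest no association, while a scatterplot would show the relationship immediately.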

All in all, this would be a very useful reference or introduction to data mining for someone new to the field.

Patricia Humphrey is an Associate Professor of Statistics in the Department of Mathematical Sciences at Georgia Southern University. She has been a member of Project NExT-SE since 1998, and has served as a co-leader for Section NExT for several years. She was Chair of the SIGMAA-StatEd for 2008 and is a member of the MAA-ASA Joint Committee on Statistics Education. She is the author of several ancillary technology manuals for introductory statistics.


1. Introduction.

1.1 Overview.

1.2 Definition.

1.3 Preparation.

1.3.1 Overview.

1.3.2 Accessing tabular data.

1.3.3 Accessing unstructured data.

1.3.4 Understanding the variables and observations.

1.3.5 Data cleaning.

1.3.6 Transformation.

1.3.7 Variable reduction.

1.3.8 Segmentation.

1.3.9 Preparing data to apply.

1.4 Analysis.

1.4.1 Data mining tasks.

1.4.2 Optimization.

1.4.3 Evaluation.

1.4.4 Model forensics.

1.5 Deployment.

1.6 Outline of book.

1.6.1 Overview.

1.6.2 Data visualization.

1.6.3 Clustering.

1.6.4 Predictive analytics.

1.6.5 Applications.

1.6.6 Software.

1.7 Summary.

1.8 Further reading.

2. Data visualization.

2.1 Overview.

2.2 Visualization design principles.

2.2.1 General principles.

2.2.2 Graphics design.

2.2.3 Anatomy of a graph.

2.3 Tables.

2.3.1 Simple tables.

2.3.2 Summary tables.

2.3.3 Two-way contingency tables.

2.3.4 Supertables.

2.4 Univariate data visualization.

2.4.1 Bar chart.

2.4.2 Histograms.

2.4.3 Frequency polygram.

2.4.4 Box plots.

2.4.5 Dot plot.

2.4.6 Stem-and-leaf plot.

2.4.7 Quantile plot.

2.4.8 Q-Q plot.

2.5 Bivariate data visualization.

2.5.1 Scatterplot.

2.6 Multivariate data visualization.

2.6.1 Histogram matrix.

2.6.2 Scatterplot matrix.

2.6.3 Multiple box plot.

2.6.4 Trellis plot.

2.7 Visualizing groups.

2.7.1 Dendrograms.

2.7.2 Decision trees.

2.7.3 Cluster image maps.

2.8 Dynamic techniques.

2.8.1 Data brushing.

2.8.2 Nearness selection.

2.8.3 Sorting and rearranging.

2.8.4 Searching and filtering.

2.9 Summary.

2.10 Further reading.

3. Clustering.

3.1 Overview.

3.2 Distance measures.

3.2.1 Overview.

3.2.2 Numeric distance measures.

3.2.3 Binary distance measures.

3.2.4 Mixed variables.

3.2.5 Other measures.

3.3 Agglomerative hierarchical clustering.

3.3.1 Overview.

3.3.2 Single linkage.

3.3.3 Complete linkage.

3.3.4 Average linkage.

3.3.5 Other methods.

3.3.6 Selecting groups.

3.4 Partitioned-based clustering.

3.4.1 Overview.

3.4.2 k-means.

3.4.3 Worked example.

3.4.4 Miscellaneous partitioned-based clustering.

3.5 Fuzzy clustering.

3.5.1 Overview.

3.5.2 Fuzzy k-means.

3.5.3 Worked examples.

3.6 Summary.

3.7 Further reading.

4. Predictive analytics.

4.1 Overview.

4.1.1 Predictive modeling.

4.1.2 Testing model accuracy.

4.1.3 Evaluating regression models’ predictive accuracy.

4.1.4 Evaluating classification models’ predictive accuracy.

4.1.5 Evaluating binary models’ predictive accuracy.

4.1.6 ROC charts.

4.1.7 Lift chart.

4.2 Principal component analysis.

4.2.1 Overview.

4.2.2 Principal components.

4.2.3 Generating principal components.

4.2.4 Interpretation of principal components.

4.3 Multiple linear regression.

4.3.1 Overview.

4.3.2 Generating models.

4.3.3 Prediction.

4.3.4 Analysis of residuals.

4.3.5 Standard error.

4.3.6 Coefficient of multiple determination.

4.3.7 Testing the model significance.

4.3.8 Selecting and transforming variables.

4.4 Discriminant analysis.

4.4.1 Overview.

4.4.2 Discriminant function.

4.4.3 Discriminant analysis example.

4.5 Logistic regression.

4.5.1 Overview.

4.5.2 Logistic regression formula.

4.5.3 Estimating coefficients.

4.5.4 Assessing and optimizing the results.

4.6 Naïve Bayes classifiers.

4.6.1 Overview.

4.6.2 Bayes theorem and the independence assumption.

4.6.3 Independence assumption.

4.6.4 Classification process.

4.7 Summary.

4.8 Further reading.

5. Applications.

5.1 Overview.

5.2 Sales and marketing.

5.3 Industry-specific data mining.

5.3.1 Finance.

5.3.2 Insurance.

5.3.3 Retail.

5.3.4 Telecommunications.

5.3.5 Manufacturing.

5.3.6 Entertainment.

5.3.7 Government.

5.3.8 Pharmaceuticals.

5.3.9 Healthcare.

5.4 MicroRNA data analysis case study.

5.4.1 Defining the problem.

5.4.2 Preparing the data.

5.4.3 Analysis.

5.5 Credit scoring case study.

5.5.1 Defining the problem.

5.5.2 Preparing the data.

5.5.3 Analysis.

5.5.4 Deployment.

5.6 Data mining non-tabular data.

5.6.1 Overview.

5.6.2 Data mining chemical data.

5.6.3 Data mining text.

5.7 Further reading.

Appendix A. Matrices.

A.1 Overview of matrices.

A.2 Matrix addition.

A.3 Matrix multiplication.

A.4 Transpose of a matrix.

A.5 Inverse of a matrix.

Appendix B. Software.

B.1 Software overview.

B.1.1 Software objectives.

B.1.2 Access and installation.

B.1.3 User interface overview.

B.2 Data preparation.

B.2.1 Overview.

B.2.2 Reading in data.

B.2.3 Searching the data.

B.2.4 Variable characterization.

B.2.5 Removing observations and variables.

B.2.6 Cleaning the data.

B.2.7 Transforming the data.

B.2.8 Segmentation.

B.2.9 Principal component analysis.

B.3 Tables and graphs.

B.3.1 Overview.

B.3.2 Contingency tables.

B.3.3 Summary tables.

B.3.4 Graphs.

B.3.5 Graph matrices.

B.4 Statistics.

B.4.1 Overview.

B.4.2 Descriptive statistics.

B.4.3 Confidence intervals.

B.4.4 Hypothesis tests.

B.4.5 Chi-square test.

B.4.6 ANOVA.

B.4.7 Comparative statistics.

B.5 Grouping.

B.5.1 Overview.

B.5.2 Clustering.

B.5.3 Associative rules.

B.5.4 Decision trees.

B.6 Prediction.

B.6.1 Overview.

B.6.2 Linear regression.

B.6.3 Discriminant analysis.

B.6.4 Logistic regression.

B.6.5 Naïve Bayes.

B.6.6 kNN.

B.6.7 CART.

B.6.8 Neural networks.

B.6.9 Apply model.