
Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining

Glenn J. Myatt
John Wiley
Reviewed by Patricia Humphrey

This monograph (not really a textbook, although there are minimal exercises at the end of each chapter) is aimed at professionals in business, government, or other fields who might be thrust into a data mining or statistical analysis project with no knowledge of the field. The writing is at a very low mathematical level, so the book should be accessible to all its intended readers. The intent seems mostly to make readers conversant with what might be involved in such a project, rather than to make them technical experts.

The organization of the book mirrors the steps one should take in such a project: problem definition, preparation of data, constructing tables and graphs, doing statistics, grouping observations into like clusters and forming associative rules, making predictions (via models, neural nets, etc.), and finally producing a report or other such deliverable and implementing recommendations. At each step, the book presents an overview of what might be undertaken rather than an exhaustive catalog of all possible analyses. The details of data cleaning, for example, which is acknowledged to be “one of the most time-consuming parts of a data analysis,” are mercifully left to the experts: technical staff (IT types) and those with the subject matter expertise to make the necessary judgments.

The biggest strength of the book is a good description of the logic behind such complex topics as cluster analysis, decision/classification and regression trees and neural networks, easily accessible to a nontechnician. The targeted “manager” will understand the basic concept, while details are left to the experts.

There are some errors and some (to this reviewer, major) omissions in the text, however. In discussing t-tests for comparing two means, the author describes only the pooled test. While pooling is probably appropriate for his example, it runs against current wisdom: with technology, it is safer to always use the unpooled test. He states that the hypotheses in an analysis of variance are that the sample means are or are not equal, when the test in fact concerns the population means. He also states (p. 92) that if “r is around 0 then there appears to be little or no relationship between the variables.” This last statement gives rise to much bad statistics; a correlation around 0 simply means there is little or no linear relationship between the variables. One must always plot the data, since many curved relationships give correlations around 0. In fact, he examines a linear correlation for variables that exhibit a definite curved relationship.
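The point about correlation can be illustrated with a minimal Python sketch. The data here are made up for illustration and do not come from the book: y is an exact function of x (a parabola), yet the Pearson correlation is zero because the relationship has no linear component. The function name `pearson_r` is an assumption of this sketch.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from its definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: x symmetric about 0, y perfectly determined by x.
x = list(range(-5, 6))
y = [v ** 2 for v in x]

r = pearson_r(x, y)
print(f"r = {r:.3f}")  # the linear correlation vanishes despite a perfect curved relationship
```

A scatterplot of these points would show the curve immediately, which is why plotting should come before any reading of r.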

Now to the major omissions. While correctly stating that a chi-squared test of association does not tell you what sort of relationship might be present, he fails to go back to the table and examine the expected values and the contributions to the statistic; these do give a clear indication of the relationship. The author focuses solely on simple linear regression (or transforming data to linearize it); in most data mining situations, the focus will be on building a multiple regression model, perhaps some form of logistic regression or another categorical model.
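As a rough sketch of what examining the contributions involves, the Python below computes expected counts and per-cell contributions for a two-way table. The counts are hypothetical, not taken from the book, and the function name `chi_square_breakdown` is an assumption of this sketch.

```python
def chi_square_breakdown(observed):
    """Return expected counts, per-cell contributions, and the chi-squared statistic."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Expected count under independence: (row total * column total) / grand total.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Each cell contributes (observed - expected)^2 / expected to the statistic.
    contributions = [
        [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(observed, expected)
    ]
    chi2 = sum(sum(row) for row in contributions)
    return expected, contributions, chi2

# Hypothetical 2x2 table of counts (rows: groups, columns: outcomes).
observed = [[30, 10], [20, 40]]
expected, contributions, chi2 = chi_square_breakdown(observed)

for obs_row, exp_row, con_row in zip(observed, expected, contributions):
    for o, e, c in zip(obs_row, exp_row, con_row):
        direction = "more" if o > e else "fewer"
        print(f"observed {o}, expected {e:.1f} ({direction} than expected), contribution {c:.2f}")
print(f"chi-squared = {chi2:.2f}")
```

Cells with large contributions, read together with whether the observed count falls above or below the expected count, indicate where and in which direction the association lies.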

Lastly, some of his references (Agresti for categorical data analysis and Kleinbaum et al. for applied regression analysis, for example) will be inaccessible to the target audience.

Patricia Humphrey is an Associate Professor of Statistics in the Department of Mathematical Sciences at Georgia Southern University. She has been a member of Project NExT-SE since 1998, and has served as a co-leader for Section NExT for several years. She is Chair-Elect of the SIGMAA-StatEd for 2007. She is the author of several ancillary technology manuals for introductory statistics.



1. Introduction.

1.1 Overview.

1.2 Problem definition.

1.3 Data preparation.

1.4 Implementation of the analysis.

1.5 Deployment of the results.

1.6 Book outline.

1.7 Summary.

1.8 Further reading.

2. Definition.

2.1 Overview.

2.2 Objectives.

2.3 Deliverables.

2.4 Roles and responsibilities.

2.5 Project plan.

2.6 Case study.

2.7 Summary.

2.8 Further reading.

3. Preparation.

3.1 Overview.

3.2 Data sources.

3.3 Data understanding.

3.4 Data preparation.

3.5 Summary.

3.6 Exercises.

3.7 Further reading.

4. Tables and graphs.

4.1 Introduction.

4.2 Tables.

4.4 Summary.

4.5 Exercises.

4.6 Further reading.

5. Statistics.

5.1 Overview.

5.2 Descriptive statistics.

5.3 Inferential statistics.

5.4 Comparative statistics.

5.5 Summary.

5.6 Exercises.

5.7 Further reading.

6. Grouping.

6.1 Introduction.

6.2 Clustering.

6.3 Associative rules.

6.4 Decision trees.

6.5 Summary.

6.6 Exercises.

6.7 Further reading.

7. Prediction.

7.1 Introduction.

7.2 Simple regression models.

7.3 K-nearest neighbors.

7.4 Classification and regression trees.

7.5 Neural networks.

7.6 Other methods.

7.7 Summary.

7.8 Exercises.

7.9 Further reading.

8. Deployment.

8.1 Overview.

8.2 Deliverables.

8.3 Activities.

8.4 Deployment scenarios.

8.5 Summary.

8.6 Further reading.

9. Conclusions.

9.1 Summary of process.

9.2 Example.

9.3 Advanced data mining.

9.4 Further reading.

Appendix A Statistical tables.

A.1 Normal distribution.

A.2 Student’s t-distribution.

A.3 Chi-square distribution.

A.4 F-distribution.

Appendix B Answers to exercises.