
Mining Imperfect Data: With Examples in R and Python

Ronald K. Pearson
Reviewed by Brian Borchers
This is a revised and reorganized edition of a book first published in 2005.  See our review of that earlier edition.
Mining Imperfect Data is a practical guide to statistical approaches to identifying outliers, missing data, and other types of "bad data" that many statistical estimators and machine learning models can be sensitive to.  In addition to techniques for identifying anomalous data points, the book shows how outliers can influence standard statistical estimates and discusses alternative methods that are less sensitive to anomalous data.
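To make the point about sensitivity concrete, here is a minimal sketch that is not taken from the book. It compares the familiar three-sigma rule (mean and standard deviation) with a median/MAD rule, often called the Hampel identifier, on a small made-up sample. The data values, the function names, and the threshold of 3 are all assumptions chosen for illustration.

```python
import statistics

def mad_scale(values):
    """Median absolute deviation, scaled by 1.4826 so that it
    estimates the standard deviation for Gaussian data."""
    med = statistics.median(values)
    return 1.4826 * statistics.median([abs(v - med) for v in values])

def three_sigma_flags(values, t=3.0):
    """Flag points more than t standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [abs(v - mean) > t * sd for v in values]

def hampel_flags(values, t=3.0):
    """Flag points more than t scaled MADs from the median."""
    med = statistics.median(values)
    scale = mad_scale(values)
    return [abs(v - med) > t * scale for v in values]

# Hypothetical measurements near 10, plus one gross error.
clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3]
contaminated = clean + [100.0]

print("mean:  ", statistics.mean(clean), "->", statistics.mean(contaminated))
print("median:", statistics.median(clean), "->", statistics.median(contaminated))
print("three-sigma flags:", three_sigma_flags(contaminated))
print("Hampel flags:     ", hampel_flags(contaminated))
```

In this toy sample the single bad value doubles the mean and inflates the standard deviation so much that the three-sigma rule fails to flag it, while the median barely moves and the median/MAD rule isolates the bad point.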
The book begins with a chapter that gives a broad overview of the process of gathering data, available software, and types of anomalies that can appear in data sets.  This is followed by chapters on outliers in univariate data, multivariate data, and time series data.  Chapters five and six deal with further anomalies, including missing data, inliers (values that appear too frequently), and problems caused by coarsening of continuous data.  These chapters are organized so that a reader can easily jump to a particular kind of anomalous data.  In the second section of the book, the author describes an approach to analyzing the sensitivity of a data analysis to small numbers of outliers or other anomalous data.  The author calls this sampling-based approach generalized sensitivity analysis.  The final chapter presents conclusions and a list of seven important open issues in handling imperfect data.
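The flavor of a sampling-based sensitivity check can be conveyed with a short sketch. This is only a loose illustration of the general idea of comparing results across random subsamples; it is not the author's generalized sensitivity analysis procedure, and the data, the helper name `subsample_spread`, the subsample size, and the number of draws are all assumptions.

```python
import random
import statistics

def subsample_spread(values, estimator, size, draws=200, seed=0):
    """Apply an estimator to random subsamples (drawn without
    replacement) and return the range of the results."""
    rng = random.Random(seed)  # fixed seed for repeatability
    results = [estimator(rng.sample(values, size)) for _ in range(draws)]
    return max(results) - min(results)

# Hypothetical measurements near 10 with one gross error.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 100.0]

mean_spread = subsample_spread(data, statistics.mean, size=6)
median_spread = subsample_spread(data, statistics.median, size=6)

print("range of subsample means:  ", mean_spread)
print("range of subsample medians:", median_spread)
```

An estimate that changes a great deal depending on whether a few particular points land in the subsample is sensitive to those points. Here the subsample means vary widely while the subsample medians stay close together.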
This is a practical and applied book that will be accessible to students and practitioners of data science with minimal background in mathematical statistics.  The book is filled with examples drawn from publicly available data sets.  The author makes extensive use of R and (to a lesser extent) Python in these examples.  The book is well referenced but does not have exercises and has a skimpy index.  Although it would probably not be suitable for use as a course textbook, Mining Imperfect Data will be a useful reference for researchers, students, and practitioners.


Brian Borchers is a professor of mathematics at New Mexico Tech and the editor of MAA Reviews.