Mining Imperfect Data: Dealing with Contamination and Incomplete Records

Ronald K. Pearson
Reviewed by Maulik A. Dave

As the title suggests, this book is about mining imperfect data. Data mining can be described as the use of automated procedures to extract useful information and insight from large datasets. Large datasets can contain incomplete data, which complicates the problem of data mining. The book explains the various kinds of imperfection in data, and the mathematical methods for dealing with such data form its main subject.

The first chapter introduces the subject and summarizes the other chapters. It introduces outliers, missing data, misalignments, and unexpected structures, and discusses situations in which data anomalies are or are not bad. After presenting some procedures for dealing with anomalies, it introduces generalized sensitivity analysis (GSA). The second chapter goes further into the subject by classifying outliers into univariate outliers, multivariate outliers, and time-series outliers. Univariate outliers are described in detail in the third chapter. The description mainly covers outlier detection procedures, their performance, and their applications to real datasets.
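To give a concrete sense of what a univariate outlier detection procedure looks like, here is a minimal sketch in Python. It uses a median and MAD (median absolute deviation) rule, which is one common robust approach. The review does not say which procedures the book covers, so the function name `mad_outliers`, the threshold of 3, and the sample data are all assumptions made purely for illustration.

```python
import statistics

def mad_outliers(values, threshold=3.0):
    """Flag points lying more than `threshold` scaled MADs from the median.

    Illustrative assumption: this rule is not taken from the book under review.
    """
    med = statistics.median(values)
    abs_dev = [abs(v - med) for v in values]
    # 1.4826 makes the MAD comparable to a standard deviation for Gaussian data.
    scale = 1.4826 * statistics.median(abs_dev)
    if scale == 0:
        # More than half the points coincide with the median; flag any that differ.
        return [d > 0 for d in abs_dev]
    return [d > threshold * scale for d in abs_dev]

# Hypothetical measurements with one obviously anomalous value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0, 10.0]
flags = mad_outliers(data)
outliers = [v for v, is_out in zip(data, flags) if is_out]
```

Because both the center (median) and the scale (MAD) are robust, a single extreme value does not inflate the threshold the way it would with a mean and standard deviation rule.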

The next three chapters are on data characterization. Before datasets are analyzed in detail using standard procedures, the data undergo preliminary processing. This preliminary processing, referred to as data pretreatment, is described in the fourth chapter, which covers topics such as noise and noninformative variables, imputation strategies, and filters. Data characterization is classified into two kinds: characterization via functional equations and characterization via inequalities. After describing various characterization techniques, the author presents six criteria for a “good” characterization. The author views GSA as a metaheuristic for characterizing the quality of data analysis results. A full chapter is devoted to describing GSA. The description includes guidelines for defining scenarios, “exchangeability”, and sampling schemes. This is followed by a case study and a discussion of extensions to the basic GSA framework.

Chapter 7 is on fixed datasets. It discusses four strategies for sampling fixed datasets: random selection, subset deletion, comparisons, and partially systematic sampling. The last chapter offers concluding remarks and poses four open questions on the subject. The bibliography spans 14 pages.
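The idea of sampling a fixed dataset and comparing results across the samples can be sketched roughly as follows. This is only a loose illustration of two of the named strategies (random selection and subset deletion), not the book's GSA procedure. The helper names `random_subsets`, `deletion_subsets`, and `spread`, as well as the data, are hypothetical.

```python
import random
import statistics

def random_subsets(data, n_subsets, subset_size, seed=0):
    """Random selection: draw subsets without replacement from a fixed dataset."""
    rng = random.Random(seed)
    return [rng.sample(data, subset_size) for _ in range(n_subsets)]

def deletion_subsets(data):
    """Subset deletion (leave-one-out variant): drop one observation at a time."""
    return [data[:i] + data[i + 1:] for i in range(len(data))]

def spread(statistic, subsets):
    """Range of a statistic across subsets; a large range suggests sensitivity."""
    results = [statistic(s) for s in subsets]
    return max(results) - min(results)

# Hypothetical fixed dataset containing one anomalous value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 25.0, 10.0]

subsets = deletion_subsets(data)
mean_spread = spread(statistics.mean, subsets)
median_spread = spread(statistics.median, subsets)

# Random selection produces subsets of a chosen size instead.
samples = random_subsets(data, n_subsets=5, subset_size=4)
```

In this toy case the mean varies far more across subsets than the median does, which is the kind of comparison that shows how strongly an analysis result depends on a few anomalous points.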

The reader is not expected to have advanced knowledge of mathematics. An understanding of functions, summations, statistics, and graph plots is sufficient background for the book. The main emphasis is on developing the conceptual background in the area of imperfect data. The book tends to be oriented towards GSA. There are many practical examples. A data mining professional can find the book useful. A researcher at an early stage can get a good introduction to the subject and a large number of references for further study.

Dr. Maulik A. Dave received his PhD degree from the Indian Institute of Science in 1998. His major areas of interest include compilers, programming languages, parallel processing, and verification of computer software systems. Contact him at Visit to know more about his work.

Preface; Chapter 1: Introduction; Chapter 2: Imperfect Datasets: Characters, Consequences, and Causes; Chapter 3: Univariate Outlier Detection; Chapter 4: Data Pretreatment; Chapter 5: What Is a "Good" Data Characterization?; Chapter 6: Generalized Sensitivity Analysis; Chapter 7: Sampling Schemes for a Fixed Dataset; Chapter 8: Concluding Remarks and Open Questions; Bibliography; Index