
Modeling With Data: Tools and Techniques for Scientific Computing

Ben Klemens
Princeton University Press
Reviewed by William J. Satzer

The aim of this book is to show how to carry out analyses of data sets — most particularly, computationally intensive analyses of very large data sets. It is split about half and half between data-oriented computing (using primarily C and the SQL database query language) and statistics. It is not an introductory statistics book. Instead, it is directed toward graduate students or independent researchers as a supplement to the standard first-year statistics texts.

That is not to say that the author does not have strong feelings about teaching statistics. Consider this, from the first chapter: “Statistics has two goals, which directly conflict. The first is to find patterns in static... The second goal is a fight against apophenia, the human tendency to invent patterns in random static.”

As a consequence, the author strongly favors separating the teaching of descriptive statistics from inferential statistics. This is something of an aside in the book, but it is thought-provoking for anyone who has taught statistics and pondered the universe of confusions that inferential statistics can create in the minds of students. The author practices his belief by concentrating first on descriptive statistics and emphasizing how one can build statistical models and gain considerable understanding without resorting to inferential tests. There is nothing exotic or trendy here: rarely does the author need anything beyond ordinary least squares, maximum likelihood estimation, and bootstrapping. But he makes good and creative use of the basic tools.

The half of the book that focuses on data-oriented computing spends a fair amount of time on teaching the basics of the C language. The author does this competently and with humor, but it will never make for fascinating reading. C code is prominent throughout the book, but things get more interesting once we get to the statistical modeling and see some examples.

The book would benefit from more extended examples and perhaps a more detailed case study or two. Where the author shines is in his common sense and in the practical tips he offers along the way. I have never seen a better short summary of the common probability distributions than the one that appears on page 235 with the heading “Every probability distribution tells a story.”

This is not a book for everyone. However, if you or your students are interested in getting down and dirty with massive amounts of data, and writing code to make sense of it, then this would be a great book for you or for them.

Bill Satzer is a senior intellectual property scientist at 3M Company, having previously been a lab manager at 3M for composites and electromagnetic materials. His training is in dynamical systems and particularly celestial mechanics; his current interests are broadly in applied mathematics and the teaching of mathematics.


Preface xi
Chapter 1. Statistics in the modern day 1

Chapter 2. C 17
2.1 Lines 18
2.2 Variables and their declarations 28
2.3 Functions 34
2.4 The debugger 43
2.5 Compiling and running 48
2.6 Pointers 53
2.7 Arrays and other pointer tricks 59
2.8 Strings 65
2.9 *Errors 69
Chapter 3. Databases 74
3.1 Basic queries 76
3.2 *Doing more with queries 80
3.3 Joins and subqueries 87
3.4 On database design 94
3.5 Folding queries into C code 98
3.6 Maddening details 103
3.7 Some examples 108
Chapter 4. Matrices and models 113
4.1 The GSL's matrices and vectors 114
4.2 apop_data 120
4.3 Shunting data 123
4.4 Linear algebra 129
4.5 Numbers 135
4.6 *gsl_matrix and gsl_vector internals 140
4.7 Models 143
Chapter 5. Graphics 157
5.1 plot 160
5.2 *Some common settings 163
5.3 From arrays to plots 166
5.4 A sampling of special plots 171
5.5 Animation 177
5.6 On producing good plots 180
5.7 *Graphs--nodes and flowcharts 182
5.8 Printing and LaTeX 185
Chapter 6. *More coding tools 189
6.1 Function pointers 190
6.2 Data structures 193
6.3 Parameters 203
6.4 *Syntactic sugar 210
6.5 More tools 214

Chapter 7. Distributions for description 219
7.1 Moments 219
7.2 Sample distributions 235
7.3 Using the sample distributions 252
7.4 Non-parametric description 261
Chapter 8. Linear projections 264
8.1 *Principal component analysis 265
8.2 OLS and friends 270
8.3 Discrete variables 280
8.4 Multilevel modeling 288
Chapter 9. Hypothesis testing with the CLT 295
9.1 The Central Limit Theorem 297
9.2 Meet the Gaussian family 301
9.3 Testing a hypothesis 307
9.4 ANOVA 312
9.5 Regression 315
9.6 Goodness of fit 319
Chapter 10. Maximum likelihood estimation 325
10.1 Log likelihood and friends 326
10.2 Description: Maximum likelihood estimators 337
10.3 Missing data 345
10.4 Testing with likelihoods 348
Chapter 11. Monte Carlo 356
11.1 Random number generation 357
11.2 Description: Finding statistics for a distribution 364
11.3 Inference: Finding statistics for a parameter 367
11.4 Drawing a distribution 371
11.5 Non-parametric testing 375

Appendix A: Environments and makefiles 381
A.1 Environment variables 381
A.2 Paths 385
A.3 Make 387
Appendix B: Text processing 392
B.1 Shell scripts 393
B.2 Some tools for scripting 398
B.3 Regular expressions 403
B.4 Adding and deleting 413
B.5 More examples 415
Appendix C: Glossary 419
Bibliography 435
Index 443