
The Mathematics of Data

Michael W. Mahoney, John C. Duchi, and Anna C. Gilbert, editors
American Mathematical Society/SIAM
IAS/Park City Mathematics Series 25
Reviewed by John D. Cook

What should you expect from a book titled The Mathematics of Data? Nearly anything. There are numerous elementary books with similar titles that don’t go far beyond showing the reader how to compute the standard deviation. But what if you saw that the book was published by AMS and SIAM? That changes everything. You know it won’t be elementary, and it will probably be high quality, which is indeed the case here.

The Mathematics of Data, edited by Michael Mahoney, John Duchi, and Anna Gilbert, consists of six chapters. Five of these contain advanced but not terribly surprising content, but one chapter is an outlier. The five synoptic chapters are concerned with probability, numerical linear algebra, and optimization. The final chapter looks at applications of homological algebra.

Anyone wanting to work in data science would do well to learn probability, linear algebra, and optimization. Much of data science is constructed from these basic ingredients. For example, linear regression, the workhorse of basic statistics, assumes data contain random errors and solves for the optimal fit to the data using linear algebra. However, The Mathematics of Data combines the basic ingredients of probability, linear algebra, and optimization in some uncommon and interesting ways.

In regression, the linear system represents something involving randomness, but the matrix itself is constant and all manipulations are deterministic. Two of the chapters in The Mathematics of Data, titled Lectures on Randomized Numerical Linear Algebra and Randomized Methods for Matrix Computations, look at using randomization as part of the process of carrying out matrix calculations. Randomized matrix multiplication, for example, approximates the product of two matrices by summing a random sample of the rank-one products (each a column of the first matrix times the corresponding row of the second) whose full sum is the exact product.
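To make the idea of randomized matrix multiplication concrete, here is a minimal sketch in Python with NumPy. It is not taken from the book: the function name `randomized_matmul`, the choice to sample with probabilities proportional to the product of column and row norms, and the test matrices are all assumptions made for illustration. The sketch also assumes that at least one column-row pair is nonzero.

```python
import numpy as np


def randomized_matmul(A, B, num_samples, rng=None):
    """Approximate A @ B from a random sample of rank-one terms.

    The exact product is the sum over k of outer(A[:, k], B[k, :]).
    We draw indices k with replacement and rescale each sampled term
    so that the estimate is unbiased. (Hypothetical helper, for
    illustration only.)
    """
    rng = np.random.default_rng() if rng is None else rng

    # Sampling weights: product of column norms of A and row norms of B.
    weights = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    probs = weights / weights.sum()

    # Sample indices of rank-one terms, with replacement.
    idx = rng.choice(A.shape[1], size=num_samples, replace=True, p=probs)

    # Rescale each sampled term by 1 / (num_samples * probability).
    scale = 1.0 / (num_samples * probs[idx])
    return (A[:, idx] * scale) @ B[idx, :]


# Small demonstration on arbitrary nonnegative test matrices.
rng = np.random.default_rng(0)
A = rng.random((30, 200))
B = rng.random((200, 20))
approx = randomized_matmul(A, B, num_samples=100, rng=rng)
exact = A @ B
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```

The rescaling is what makes the expected value of the estimate equal to the exact product. How good the approximation is depends heavily on the matrices and on the number of samples.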

Two of the chapters, Optimization Algorithms for Data Analysis and Introductory Lectures on Stochastic Optimization, focus on optimization. The former looks at applications coming out of statistics, such as logistic regression, but is itself deterministic. The latter assumes that the objective functions being optimized are corrupted by random noise. Probability is in the background of the former but more in the foreground of the latter.

The last of the synoptic chapters is entitled Four Lectures on Probabilistic Methods for Data Science. This chapter focuses on probability in high dimensions and dimension-reduction results such as the Johnson-Lindenstrauss lemma.
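The Johnson-Lindenstrauss lemma says, roughly, that a set of points in a high-dimensional space can be mapped into a much lower-dimensional space while approximately preserving all pairwise distances. The following sketch illustrates one standard construction, a scaled Gaussian random projection. It is not drawn from the book; the function names and the dimensions chosen are assumptions for illustration.

```python
import numpy as np


def random_projection(X, target_dim, rng=None):
    """Project the rows of X to target_dim dimensions.

    Uses a Gaussian matrix scaled by 1/sqrt(target_dim) so that squared
    lengths are preserved in expectation. (Hypothetical helper.)
    """
    rng = np.random.default_rng() if rng is None else rng
    G = rng.normal(size=(X.shape[1], target_dim)) / np.sqrt(target_dim)
    return X @ G


def pairwise_distances(X):
    """Return the distances between all distinct pairs of rows of X."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    i, j = np.triu_indices(X.shape[0], k=1)
    return dists[i, j]


# 20 points in 2000 dimensions, projected down to 500 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2000))
Y = random_projection(X, target_dim=500, rng=rng)

ratios = pairwise_distances(Y) / pairwise_distances(X)
print("distance ratios range from", ratios.min(), "to", ratios.max())
```

Ratios close to 1 mean the projection distorted the distances only slightly. A smaller target dimension gives a cheaper representation at the cost of more distortion.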

The final chapter is Homological Algebra and Data. This chapter presents a very different approach to data analysis, a new approach that one hopes will yield insights complementary to those coming out of more traditional methods. Homology reveals qualitative information about data. It is quantitative in the sense that homology groups are practical to compute, but it is quantitative at a higher level of abstraction than a traditional method such as regression.

As the author points out, topological methods are more robust than traditional methods, but also weaker. “Topological data analysis is more fundamental than revolutionary: such methods are not intended to supplant analytic, probabilistic, or spectral techniques.”

What does topology have to do with a discrete set of data? In a nutshell, if you look at small neighborhoods around each data point, these neighborhoods can fuse together to create interesting topological spaces. How large should the neighborhoods be? That is an important question, and the answer is that one should focus on features that are robust to the choice of radius; persistent homology is so named because it looks for features that persist over a range of radius choices.
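The simplest feature one can track this way is the number of connected components, which corresponds to homology in dimension zero. The sketch below is a toy illustration only, not the book's method and far short of a full persistent homology computation: it counts how many pieces the union of balls around each point has at a given radius, using a union-find structure. The function name and the sample points are made up for the example.

```python
import math


def count_components(points, radius):
    """Count connected components of the union of balls of a given radius.

    Two balls of radius r overlap when their centers are at most 2r apart.
    (Hypothetical helper; it tracks only components, not loops or voids.)
    """
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= 2 * radius:
                parent[find(i)] = find(j)

    return len({find(i) for i in range(len(points))})


# Two well-separated clusters of made-up points in the plane.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
for radius in [0.1, 0.6, 2.0, 5.0, 8.0]:
    print(radius, count_components(points, radius))
```

In this example the count stays at two over a wide range of radii before everything merges into one piece. That persistence over many radius choices is the kind of signal that suggests the two clusters are a real feature of the data rather than an artifact of one particular radius.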

It would be interesting to look into the future, say 50 or 100 years, and find a newly written book about the mathematics of data. What topics might it contain? Maybe there would be a chapter on homology, but it would be in the more conventional section of the book. What might be in a more speculative final chapter?

John D. Cook is an independent consultant working in data privacy.