Mathematics of Data Science: A Computational Approach to Clustering and Classification

Authors: Daniela Calvetti and Erkki Somersalo
Series: Data Science Book Series
Publisher: SIAM
Publication Date: 01/30/2021
Number of Pages: 189
Format: Paperback
Price: $64.00
ISBN: 978-1-611976-36-6
Category: textbook

This is a textbook on algorithms for clustering and classification in machine learning and their application to mining text and image data. Algorithms discussed in the book include k-means, k-medoids, linear discriminant analysis (LDA), principal components analysis (PCA), self-organizing maps, nonnegative matrix factorization, tree-based classifiers, and support vector machines. The most notable omissions are neural networks and spectral clustering.

The book begins with a very brief chapter reviewing basic facts about orthogonal matrices, eigenvalues and eigenvectors, and the singular value decomposition. This review may be useful for students who have already been exposed to these topics, but it would not be adequate for students who have not had previous exposure. The chapters on the various algorithms share a similar structure. The mathematical formulation of the clustering or classification problem is derived (typically as an optimization problem), and then one or more algorithms for solving the problem are given. The algorithms are then demonstrated on example data sets. Two chapters on text and image mining discuss how to prepare these kinds of data sets for clustering or classification. A final chapter on the PageRank algorithm is somewhat disconnected from the rest of the book.

The presentation of the algorithms in this book is quite clear and uses consistent notation. The book should thus be readily accessible to advanced undergraduate and beginning graduate students. However, the book does not delve deeply into the convergence analysis or other theoretical properties of the methods. Furthermore, the authors do little to compare the methods or to discuss how to choose a method for a particular data analysis.

Since the chapters are short and largely independent of each other, the book would make a good reference for practitioners. This structure would also make it easy for an instructor to pick and choose material from different chapters. However, the lack of exercises will be discouraging to many instructors. I expect that this book will find more use as a reference to the various methods than as a course textbook.

Brian Borchers is a professor of mathematics at New Mexico Tech and the editor of MAA Reviews.