Modern Data Science with R

Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton
Chapman & Hall/CRC
Texts in Statistical Science

The Basic Library List Committee suggests that undergraduate mathematics libraries consider this book for acquisition.

[Reviewed by Robert W. Hayden]

Here we have an introduction to data science for students with (the authors say) just an introductory statistics course behind them. Such textbooks are in a situation similar to that of discrete mathematics textbooks 30 years ago, in that the content of the first course is far from settled. This book commands attention because of the three authors involved. Baumer has extensive experience doing data science outside of academia. Kaplan is known for an innovative course in statistics that takes the needs of data science into account. Horton is known for his cutting-edge contributions to statistics education. So whether or not the book matches the course one has or envisions, the opinions of these authors are worthy of consideration. Their choice of topics appears to be one reasonable possibility for a first course. Perhaps the most unusual choice is a welcome chapter on professional ethics.

At present, there are different kinds of introductory books on data science. There are many books aimed at managers that read like ad copy and do not contain much real information. There are how-to books, possibly aimed at statisticians and computer scientists in industry who have been asked to do some data science. The book at hand is a textbook for a college course offering a high-level overview of the discipline, with an emphasis on the kinds of problems data science addresses and the kinds of solutions it offers, without going into great implementation detail for any one method.

The phrase “with R” in the title means that the computing language used (very heavily) here is the statistical programming language R. The other obvious choice would be Python. To a first approximation, R does everything statistical and can be made to perform the non-statistical aspects of data science with add-in packages, while Python is a general-purpose programming language that can do statistics and data science with add-in packages. There are many commercial products used in industry as well, but these are not so common in academia.

As a textbook this one offers the advantage of very clear writing. Many examples are intrinsically much more interesting than is usual in introductory statistics textbooks, though your reviewer might have preferred more examples that seemed to solve real problems rather than connect with the things 19-year-olds find amusing. Often the examples seem to be fragments of a whole and the real point of the analysis is unclear (other than to be an example). The analyses are presented in R code, but there is heavy reliance on special add-in packages for data science. It seems doubtful that one would become a good programmer in the underlying language from this textbook, though one could acquire solid data science skills.

The main reservations one might have about this book are likely to be pedagogical, and here it may be more a matter of fitting the book to an audience than the authors’ choices being right or wrong. Like many introductory survey textbooks in the non-mathematical disciplines, this one is very broad in concepts and not very deep in details. The huge number of methods along with all the different R packages and the immense volume of syntax could be overwhelming. In most contexts a programming prerequisite might well be advisable. The code examples are lightly commented and not always explained in detail. Many students may struggle to get beyond tweaking the example code to do the homework exercises. The mathematical content is also sometimes higher than advertised (introductory statistics often has no mathematics prerequisite, so we would seem to be assuming only high school graduation), but here it is more a matter of using notation that is second nature to the authors but may not be to the readers.

In summary, this is a high-quality book from respected authors that gives a credible selection of the attitudes and methods of data science. It would certainly be suitable for mathematicians who want to know what data science is. Anyone who has been teaching introductory statistics and is contemplating teaching data science should know this book is almost entirely coding, with bits of statistics here and there.

As a textbook, it may be challenging for students. Adopters should think carefully about the level of programming skills expected coming into, and out of, the course, and adjust assignments and evaluations accordingly. In some contexts the book might provide a tour of data science without expecting students to be able to implement much until they take further courses. To aid them in that, this book includes 14 pages of references with citations of relevant ones at the ends of chapters. There are also 38 pages of indices.

After a few years in industry, Robert W. Hayden taught mathematics at colleges and universities for 32 years and statistics for 20 years. In 2005 he retired from full-time classroom work. He contributed the chapter on evaluating introductory statistics textbooks to the MAA’s Teaching Statistics.

Introduction to Data Science

Prologue: Why data science?

Data visualization

A grammar for graphics

Data wrangling

Tidy data and iteration

Professional Ethics

Statistics and Modeling

Statistical foundations

Statistical learning and predictive analytics

Unsupervised learning


Topics in Data Science

Interactive data graphics

Database querying using SQL

Database administration

Working with spatial data

Text as data

Network science

Epilogue: Towards “big data”


Packages used in this book

Introduction to R and RStudio

Algorithmic thinking

Reproducible analysis and workflow

Regression modeling

Setting up a database server