Data Computing: An Introduction to Wrangling and Visualization With R

Daniel Kaplan

Publisher: Project Mosaic
Publication Date: 2016
Number of Pages: 222
Format: Paperback
Price: 23.13
ISBN: 9780983965848
Category: Textbook

MAA Review

[Reviewed by Robert W. Hayden, on 12/13/2016]

The title of this book may not convey much to mathematicians. We can locate its general neighborhood with buzz words and phrases such as “Big Data,” “Data Mining,” “Data Science,” or “Analytics,” though all of those may be a bit vague. Such topics are likely to be taught in a Computer Science or Business Department, though mathematicians may become involved if they already provide a service introductory statistics course to such departments.

It may be helpful to contrast the content of this book with what we cover in such an introductory statistics course. There we learn about carefully designed surveys and experiments that employ random sampling or assignment. Careful design and implementation is often costly, so we end up with relatively small batches of numbers. This increases the possibility that what we see could just be a fluke of this small set of observations. So we use probability models for random sampling or assignment to predict how large an effect those processes might have had. Effects larger than that are attributed to something else, such as the population or the effect of a treatment.

In contrast, this book is more concerned with the analysis of large quantities of data, typically collected for some other purpose. For example, a company may decide to see if the customer records they have kept for the purpose of filling orders can be used to guide future advertising campaigns. There is no random assignment or sampling, and no care was taken to collect the data most useful to improving advertising. Nor is the data likely to be in a form that can be used for that purpose. Our book looks at issues like reformatting the data to answer the question at hand, cleaning the data to remove errors and inconsistencies, and connecting the data to other data sources. For example, we may have no data on the income of our customers, but the ZIP code in their address might serve as a link to a database that provides income statistics for the general area, if not individual customers. Usually we are not seeking to estimate a parameter or test a prior hypothesis but rather looking for patterns. We might place our ads in different media depending on whether our customers seem to be seniors or yuppies.
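To make the ZIP code example concrete, here is a minimal sketch of the cleaning and linking steps described above. It is not taken from the book, which uses R; it is written in plain Python, and every name and number in it (the customer records, the `income_by_zip` table, the helper functions) is invented for illustration.

```python
# Hypothetical customer records kept for filling orders (all values invented).
# The ZIP codes are stored inconsistently, as often happens with raw data.
customers = [
    {"name": "Ana", "zip": " 55105 ", "orders": 3},  # stray whitespace
    {"name": "Ben", "zip": 2139, "orders": 1},       # integer, leading zero lost
    {"name": "Cy", "zip": "99999", "orders": 5},     # not in the lookup table
]

# Hypothetical area-level table: median household income by ZIP code.
income_by_zip = {"55105": 72000, "02139": 89000}


def clean_zip(raw):
    """Normalize a ZIP code to a five-character, zero-padded string."""
    return str(raw).strip().zfill(5)


def attach_area_income(records, lookup):
    """Keep every customer; add the area income, or None if the ZIP is unmatched."""
    joined = []
    for record in records:
        zip_code = clean_zip(record["zip"])
        joined.append({**record, "zip": zip_code, "area_income": lookup.get(zip_code)})
    return joined


joined = attach_area_income(customers, income_by_zip)
for row in joined:
    print(row["name"], row["zip"], row["area_income"])
```

Note that the linked figure describes the customer's general area, not the customer, and that unmatched ZIP codes are kept with a missing value rather than silently dropped.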

Tools for finding patterns are generally graphical rather than computational. Having found patterns, we may then hone our graphics to best display them. And that may be the end of the story. Formal inference usually plays a minor role, as we are not trying to make inferences about a larger population nor to determine causality. Instead, we may just want to make predictions about the future, based on the optimistic assumption that that future will be like the past.

The process described involves lots of processing of lots of data, and so we will want to use a computer. The “with R” phrase in the subtitle means that this book uses the statistical programming language R. That is a reasonable choice though some will prefer Python. Both are free, so software costs should not be an issue. Instruction in R here is a bit sketchy, often limited to numerous examples. The emphasis is usually on the task at hand, and R add-ons geared to such tasks are often used. This is a reasonable choice, but a reader should not expect to become proficient with the core features of unadorned R here.

This book has no prerequisites in statistics or computing. In particular, it does not assume an introductory statistics course. In fact, this book might serve as a useful prerequisite to such a course for students who will actually be working with raw data. Said course really serves two audiences: preparation of researchers to do statistics, and preparation of citizens to evaluate statistics done by those researchers. The latter group may never work with raw data, and would find this book of little use, but researchers who do work with real data often find themselves wishing they had somewhere learned the lessons this book teaches. You can find those researchers waiting in line outside the offices in the Statistics Department or ITS.

The content of this book is relevant to a sizeable audience, and there is not much competition in terms of books providing this information with no prerequisites. So the book can be highly recommended to anyone working with (or helping others work with) even moderately sized batches of numbers.

As a textbook for students, however, this often seems abstract and unmotivated, a failing it shares with too many mathematics textbooks. For example, the first chapter discusses some issues in designing a database to hold data. The beginner is likely to have little appreciation of why this might be a good idea, and almost certainly did not sign up for the course to learn this. An alternate approach might be to present some real or realistic examples of wanting to extract information from data for a specific purpose. Examples could begin with carefully sanitized data sets as found in introductory statistics textbooks, and then move on to less friendly layouts, and end with a discussion of data storage desiderata.

In addition, many of the examples are of the form “given this data the following code produces this output” without much explanation of why we wanted that output, how we figured out how to get it, or what we might do with it once we have it. Our author has an excellent reputation as a teacher, and so presumably he either handles this issue in class, or has students happy with this abstract approach. Persons thinking of adopting this as a textbook might want to see more on the printed page to address this issue.

The only other caveat is that the version under review here seems quite preliminary, with many minor typos and other errors. Of these, the only ones likely to cause significant misunderstanding are some marginal graphics too small to read and some others that depend on color but are here printed in black and white. Otherwise, with the reservations stated above, this book is very highly recommended.

After a few years in industry, Robert W. Hayden (bob@statland.org) taught mathematics at colleges and universities for 32 years and statistics for 20 years. In 2005 he retired from full-time classroom work. He now teaches statistics online at statistics.com and does summer workshops for high school teachers of Advanced Placement Statistics. He contributed the chapter on evaluating introductory statistics textbooks to the MAA's Teaching Statistics.