
Introduction to Data Science for Social and Policy Research

Jose Manuel Magallanes Reyes
Cambridge University Press
Reviewed by William J. Satzer

Data science could be defined as that interdisciplinary field whose goal is to collect, explore, analyze and draw useful conclusions about data. This is pretty vague but it just might capture most of the things that people now refer to by that name. By that definition, the title of this book is a misnomer, as the author essentially admits on the second page of his introduction.

This is a pretty good book all the same; the title just doesn’t tell the story. The author’s goal is to teach “the first steps to becoming a user of tools” to deal effectively with data. This is a good introduction to that basic task, but it has its limitations. The “for social and policy research” part of the title is also somewhat restrictive, but it is the author’s primary interest and obviously influences the sources and kinds of data that he considers.

Essentially no data analysis is taught here. Instead the author focuses on what needs to be done before analysis and modeling can begin. That preparatory work has five steps: identify relevant data sources and confirm that they are trustworthy; get the data; clean it; format it; and then integrate the multiple sources of data and save the results.

To carry out these tasks, programming tools of some kind are essential. The author chooses to teach the basics of the R and Python languages because they are accessible, well documented and now widely used. He patiently takes the reader through installation and set-up procedures, and then introduces data structures and the elements of programming. He does this in detail, but to follow along the reader must install software packages to match the environment that the author uses. (These include RStudio and Python from Anaconda.)

Getting the data (importing data into one’s workspace) is a good deal easier once programming tools are in place, but there are complications. Data come from many sources and in a variety of formats. The author offers suggestions and examples for handling many of them. While data in PDF files or spreadsheets are common enough, other data have special formats that are most accessible via APIs (application programming interfaces). Here he also specifically considers data sources and formats of particular value to policy analysts. Most of these are not terribly different from the kinds used in science and engineering applications.
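To make the point about varied formats concrete, here is a minimal Python sketch. It is not taken from the book. It reads comparable records from CSV text and from a JSON payload of the sort an API might return. The sample values are placeholders, and the function names `read_csv_records` and `read_json_records` are hypothetical.

```python
import csv
import io
import json

# Placeholder data standing in for a downloaded file and an API response.
CSV_TEXT = "region,year,population\nRegion A,2017,100\nRegion B,2017,200\n"
JSON_TEXT = '{"results": [{"region": "Region A", "year": 2017, "budget": 50.0}]}'


def read_csv_records(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def read_json_records(text, key="results"):
    """Parse a JSON payload and return the list stored under `key`."""
    return json.loads(text)[key]


csv_records = read_csv_records(CSV_TEXT)
json_records = read_json_records(JSON_TEXT)
```

Note that every CSV field arrives as a string while JSON preserves numeric types, which is one reason a separate cleaning and formatting step is needed before sources can be combined.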

Next comes dealing with dirty data, something that always seems to surprise those new to data science. Data cleaning is a topic often swept under the rug, but no one can really begin data analysis until it happens. The author’s treatment here is particularly good and thorough. Then, clean data in hand, he describes various techniques for integrating and storing the data in a way that enables easy access for analysis.
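For readers unfamiliar with what cleaning and integrating involve, the following sketch may help. It illustrates the general idea under simple assumptions and is not the author's own code: the records are placeholders, and `clean_record` and `integrate` are hypothetical helper names.

```python
import json

# Placeholder raw records from two hypothetical sources.
source_a = [
    {"region": " Region A ", "year": "2017", "population": "1,000"},
    {"region": "region b", "year": "2017", "population": ""},
]
source_b = [
    {"region": "Region A", "budget": 50.0},
    {"region": "Region B", "budget": 75.0},
]


def clean_record(rec):
    """Normalize the key text and convert numeric strings; empty becomes None."""
    pop = rec["population"].replace(",", "").strip()
    return {
        "region": rec["region"].strip().title(),
        "year": int(rec["year"]),
        "population": int(pop) if pop else None,
    }


def integrate(cleaned, other, key="region"):
    """Left-join `other` onto `cleaned` using a shared key."""
    lookup = {row[key]: row for row in other}
    return [{**row, **lookup.get(row[key], {})} for row in cleaned]


cleaned = [clean_record(r) for r in source_a]
merged = integrate(cleaned, source_b)
serialized = json.dumps(merged, indent=2)  # text ready to be written to a file
```

The join works only because cleaning first made the region names match across sources, which is why cleaning has to precede integration.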

The book is well organized and clearly written. It does occasionally have a whiff of the computer manual genre, but that is probably unavoidable given the subject. There are plenty of worked-through examples but no exercises. The book’s only index is to R and Python commands. It needs a full index.

Bill Satzer was a senior intellectual property scientist at 3M Company. His training is in dynamical systems and particularly celestial mechanics; his current interests are broadly in applied mathematics and the teaching of mathematics.

Part I. Get Started:
1. Introduction
2. Setting up the tools
3. Basics of R and Python
Part II. Collect and Clean:
4. Collecting data
5. Cleaning data
Part III. Format and Storage:
6. Formatting the 'clean' data
7. Integrating and storing