Finding the equation of a line that gives the best fit to a set of data points is a common calculus exercise. You simply write down what it means to minimize the squared vertical distance between a line and the data, take partial derivatives with respect to the slope and intercept variables, and set them to zero.
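The homework version really is only a few lines. Here is a minimal sketch in Python (the data are made up for illustration): setting the partial derivatives of the squared error to zero and solving gives closed-form expressions for the slope and intercept.

```python
def least_squares_line(xs, ys):
    """Fit y = a + b*x by minimizing the sum of squared vertical distances.

    Setting the partial derivatives of sum((y_i - a - b*x_i)^2) with
    respect to a and b to zero yields the closed-form solution below.
    """
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / \
        sum((x - xbar) ** 2 for x in xs)
    a = ybar - b * xbar
    return a, b

# Illustrative (invented) data, roughly on the line y = 1 + 2x
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
a, b = least_squares_line(xs, ys)
```

That is the whole exercise: two averages, one ratio, done. What it leaves out is the subject of the rest of this post.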

When I first heard that there were entire courses in regression, I was dumbfounded. How could you possibly stretch regression out to fill a semester? Little did I know you could not only make a semester-long course out of regression, you could make a *career* of it.

The calculus exercise is a caricature of regression. It does give the parameters for the least-squares line, but it tells you nothing about the degree of uncertainty in those parameters. The Gauss-Markov theorem says that the parameters found in the homework problem are the same as the parameters that answer a much more sophisticated question, one that does account for uncertainty. Even for the simplest instance of regression, linear regression in one variable, the mathematics is not trivial. But more importantly, there are questions about modeling that are orthogonal to purely mathematical concerns.
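To see concretely what the calculus exercise leaves out, here is a sketch of the textbook standard-error formulas for simple linear regression. The data are invented, and the formulas assume independent errors with equal variance; in practice one would use an established package rather than hand-rolled code.

```python
import math

def line_fit_with_se(xs, ys):
    """Least-squares line plus standard errors for intercept and slope.

    The point estimates are exactly those of the calculus exercise; the
    standard errors quantify the uncertainty the exercise ignores.
    Assumes independent errors with constant variance.
    """
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    a = ybar - b * xbar
    # Residual variance: n - 2 degrees of freedom, two parameters estimated
    s2 = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / (n - 2)
    se_b = math.sqrt(s2 / sxx)
    se_a = math.sqrt(s2 * (1 / n + xbar ** 2 / sxx))
    return a, b, se_a, se_b

# Same invented data as before
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
a, b, se_a, se_b = line_fit_with_se(xs, ys)
```

Even this is only a first step: the standard errors are themselves model-dependent, which is where the questions below come in.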

- What kind of probability distribution adequately describes the difference between predicted and actual values?
- Is the error distribution close enough to Gaussian that it can safely be assumed to be Gaussian?
- Are some of the data points suspicious?
- Do any points have undue influence on the analysis?
- Should one minimize squared error, as is conventional, or consider a more robust alternative?
- How should one diagnose how well the model fits?
- How does uncertainty in model parameters translate into uncertainty in projections using the model?
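To illustrate the questions about influence and robustness, here is a toy comparison, on made-up data, of the ordinary least-squares slope against the Theil-Sen estimator, a classic robust alternative that takes the median of all pairwise slopes.

```python
import statistics

def ls_slope(xs, ys):
    # Ordinary least-squares slope: sensitive to outliers
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
            / sum((x - xbar) ** 2 for x in xs))

def theil_sen_slope(xs, ys):
    # Median of all pairwise slopes: robust to a few wild points
    slopes = [(ys[j] - ys[i]) / (xs[j] - xs[i])
              for i in range(len(xs)) for j in range(i + 1, len(xs))
              if xs[j] != xs[i]]
    return statistics.median(slopes)

# Invented data on the line y = 2x, except one gross outlier at x = 4
xs = [0, 1, 2, 3, 4]
ys = [0, 2, 4, 6, 100]
```

On this data the single wild point drags the least-squares slope from 2 up past 20, while the median of pairwise slopes stays at 2, a small demonstration of why "undue influence" and "robust alternative" are questions worth asking.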

The next steps after linear regression in one variable are not hard to imagine: you can look at non-linear models with several variables. The questions above remain, but new questions arise. In what way do you want to allow non-linearity? That is, what kind of non-linearities can you allow that fit the data better while remaining tractable to work with? And what if your data are not independent? What if your data are binary or categorical? Things quickly become far more complicated than the homework problem we started with.
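As one small illustration of the binary-data case, here is a toy logistic regression fit by gradient ascent on the log-likelihood. Everything here is invented for illustration, and a real analysis would use an established package rather than this hand-rolled loop.

```python
import math

def logistic_fit(xs, ys, steps=5000, lr=0.1):
    """Fit P(y = 1 | x) = 1 / (1 + exp(-(a + b*x))) to 0/1 outcomes.

    The log-likelihood is concave, so plain gradient ascent with a small
    fixed step converges. This is a sketch, not production code.
    """
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += y - p          # gradient w.r.t. intercept
            gb += (y - p) * x    # gradient w.r.t. slope
        a += lr * ga / n
        b += lr * gb / n
    return a, b

# Invented binary outcomes that become more likely as x grows
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
a, b = logistic_fit(xs, ys)
```

The least-squares machinery above no longer applies directly here; fitting, diagnostics, and uncertainty all have to be rethought, which is exactly why the subject keeps growing.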

Derek Young’s book *Handbook of Regression Methods* gives statisticians guidance in selecting and evaluating regression models. Someone with no experience with regression would not get much out of the book, because it addresses questions they haven’t thought of asking.

The variety of sophisticated statistical models available is more than any statistician can thoroughly understand, remember, and have experience with. A book like Young’s *Handbook* is valuable when you need to use a model that isn’t top of mind, giving mathematical details but also practical advice regarding considerations that are not obvious from a mathematical description.

The *Handbook* begins with linear regression and ANOVA (ANalysis Of VAriance, though that term is somewhat misleading) first in one variable and then several variables. It then goes through advanced diagnostic methods for addressing concerns such as the bulleted questions above. Finally, the book covers a variety of advanced regression models such as generalized linear models and other nonlinear models, semi-parametric and non-parametric regression, multi-level models, etc.

In short, Young’s *Handbook of Regression Methods* is a valuable resource for using advanced regression methods in practice.

John D. Cook is a consultant working in applied mathematics and statistics.