My response to “When Is It Crucial to Standardize the Variables in a Regression Model?”


I came across this Minitab blog post on LinkedIn the other day: When Is It Crucial to Standardize the Variables in a Regression Model? To my great surprise, the author states that it is “when your regression model contains polynomial terms or interaction terms” because of “multicollinearity”:

You should standardize the variables when your regression model contains polynomial terms or interaction terms. While these types of terms can provide extremely important information about the relationship between the response and predictor variables, they also produce excessive amounts of multicollinearity.

This directly contradicts what others have suggested. For example, in the blog post “When Can You Safely Ignore Multicollinearity?”, Paul Allison (Professor of Sociology at the University of Pennsylvania, where he teaches statistics) writes:

2. The high VIFs are caused by the inclusion of powers or products of other variables. If you specify a regression model with both x and x^2, there’s a good chance that those two variables will be highly correlated. Similarly, if your model has x, z, and xz, both x and z are likely to be highly correlated with their product. This is not something to be concerned about, however, because the p-value for xz is not affected by the multicollinearity. This is easily demonstrated: you can greatly reduce the correlations by “centering” the variables (i.e., subtracting their means) before creating the powers or the products. But the p-value for x^2 or for xz will be exactly the same, regardless of whether or not you center. And all the results for the other variables (including the R^2 but not including the lower-order terms) will be the same in either case. So the multicollinearity has no adverse consequences.

The author of the Minitab blog, however, insists that the coefficient for the lower-order term changes, so if a researcher is interested in this effect, centering should be done. He claimed that the reason for this change is multicollinearity.

Well, it is easy to show that the coefficient for the lower-order term does change after centering. Suppose our true model is y = ax^2 + bx + c, and the mean of x is m. After centering x, the model becomes y = a'(x-m)^2 + b'(x-m) + c'. Expanding this gives the equivalent form y = a'x^2 + (b' - 2a'm)x + (a'm^2 - b'm + c'). Matching coefficients with the original model, it is clear that the coefficient for the higher-order term does not change, i.e. a' = a, while the coefficient for the lower-order term becomes b' = b + 2am. So yes, the coefficient for the lower-order term does change, even though the models before and after centering are equivalent and the value of b is correct on the original scale of x.
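To double-check the algebra, here is a minimal numerical sketch (the true coefficients a = 2, b = 3, c = 1 and the simulated sample of x are arbitrary choices of mine, not from either blog post):

```python
# Numerical check of the centering algebra above.
import numpy as np

rng = np.random.default_rng(0)
a, b, c = 2.0, 3.0, 1.0
x = rng.normal(loc=5.0, scale=2.0, size=1000)
y = a * x**2 + b * x + c
m = x.mean()

# Fit the quadratic on the raw scale and on the centered scale.
a_raw, b_raw, c_raw = np.polyfit(x, y, 2)
a_cen, b_cen, c_cen = np.polyfit(x - m, y, 2)

print(a_raw, a_cen)                   # both ~2: the quadratic coefficient is unchanged
print(b_cen, b_raw + 2 * a_raw * m)   # b' = b + 2am, as derived above
```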

Why is this happening? It becomes clear once we look at the effect of x on y. Before centering, it is not difficult to show that dy/dx = 2ax + b, so b is the effect of x on y when x = 0. After centering, dy/dx = 2a(x-m) + b', so b' is actually the effect when x = m, i.e. when x is at its mean value.
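If you prefer a symbolic check of those two slopes, here is a short sympy sketch (the symbol names are mine; b_c and c_c stand in for b' and c'):

```python
# Symbolic confirmation that b is the slope at x = 0 and b' the slope at x = m.
import sympy as sp

x, m, a, b, b_c, c, c_c = sp.symbols("x m a b b_c c c_c")

raw = a * x**2 + b * x + c
centered = a * (x - m)**2 + b_c * (x - m) + c_c

print(sp.diff(raw, x).subs(x, 0))       # prints b:   the effect at x = 0
print(sp.diff(centered, x).subs(x, m))  # prints b_c: the effect at x = m
```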

The author of the Minitab blog says that the coefficient and p-value for the lower-order term x change. Of course they change! He is effectively saying that the effect of x on y when x = 0 is different from the effect of x on y when x is at its mean. Of course it is different! It is a parabola! He is comparing apples with oranges, which has nothing to do with multicollinearity.

Still, one may say: if I am interested in the effect of x on y when x is at its mean, then centering is a good method. True, but it is not “crucial” to do so. You can still derive the centered equation after fitting a model to the uncentered data!
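The following sketch with statsmodels shows both points at once on simulated data (the data-generating coefficients are arbitrary assumptions of mine): the x^2 coefficient and its p-value are identical with and without centering, and the centered lower-order coefficient can be recovered from the uncentered fit as b + 2am, with no need to refit.

```python
# Fit the same quadratic with raw and with centered x, then compare.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(50.0, 10.0, size=500)
y = 0.02 * x**2 - 1.5 * x + 4.0 + rng.normal(scale=5.0, size=500)
m = x.mean()

def fit(v):
    # Design matrix: intercept, linear term, quadratic term.
    X = sm.add_constant(np.column_stack([v, v**2]))
    return sm.OLS(y, X).fit()

raw, cen = fit(x), fit(x - m)

# The x^2 coefficient and its p-value agree between the two fits;
# only the lower-order term differs, and it differs by exactly 2*a*m.
print(raw.params[2], cen.params[2], raw.pvalues[2], cen.pvalues[2])
print(cen.params[1], raw.params[1] + 2 * raw.params[2] * m)
```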

It is also not very clear why it is so important to show the effect of x on y when x is at its mean. I understand that the mean can be a better representative of a variable than 0, but a consequence of centering your data is that your results are no longer directly comparable to others'. Say Researcher 1 and Researcher 2 conduct two separate studies using two separate samples. Unless both samples are extremely large, the mean of x in Sample 1 will likely differ from that in Sample 2; say the former is 50 and the latter is 60. If they both center their data before fitting the model, the coefficient for the lower-order term x represents the effect of x on y when x = 50 in one study and when x = 60 in the other. Obviously these are not directly comparable. Without centering, the interpretation is the same in both studies: the coefficient for the lower-order term x represents the effect of x on y when x = 0, which is comparable regardless of differences in the means of x across samples. You see, centering does not give you a better model at all; it causes confusion and makes modelling results harder to interpret.

And why are we so obsessed with the mean anyway? If I model age against the chance of getting cancer, I am less interested in the effect at the mean age; I want to focus on old age, when most people get cancer.

Minitab seems to be a reasonably popular statistics package, though I have never used it. It is just disappointing that its blog is of such low quality and so misleading.
