Thursday, September 24, 2009

Rescaling continuous predictors in regression models

Michael A. Babyak, PhD
Duke University Medical Center


If a friend asked me how far I travel to work every day, I could tell them that I live about 26,400 feet away from my office. I am far more likely, however, to say that I live 5 miles from my office. And if that friend were from a nation that used the metric system, I would likely say I lived about 8 kilometers away. I prefer expressing the distance in miles or kilometers because the scale is more immediately meaningful to the person asking the question. All three of those values, expressed in feet, miles, or kilometers, refer to exactly the same distance. They are simply rescaled versions of one another.

In regression models, the scale we choose for the variables under study can be recast in a similar fashion. Because the regression coefficient represents the expected change in y for a one unit change in x (the predictor), the magnitude of that coefficient is partly determined by the length of the units being used. Just as changing the scale from feet to miles in my initial example made it easier for my friend to comprehend or intuit a distance, we can rescale the variables in a regression model in order to help others better comprehend the potential meaning of regression coefficients.

In the present brief tutorial, I will show how to rescale a predictor so that the clinical or theoretical meaning of the regression coefficient for that predictor might become clearer. Let me emphasize at the outset that rescaling a predictor in a regression has absolutely no effect on the magnitude of the relation being studied—the slope itself will not change its steepness, nor will the p-values or variance explained be changed. Rescaling is merely a means of communicating the nature of the regression line in different, hopefully more intuitive language. Let’s turn now to an example with real world data in order to demonstrate this idea more concretely.


Some real world data: A regression coefficient in its original scale

I’ve selected a small subset of data from a study our group published a few years ago (1) in order to illustrate rescaling. The data are from 87 men ranging in age from about 40 to 85 years, each of whom had their systolic blood pressure (SBP) measured in a reliable fashion. Let’s suppose we are interested in the (admittedly unimaginative) hypothesis that age is associated with SBP. In our first regression model, we enter the predictor age in its original units, years. In other words, age is scaled such that one unit is equal to one year. The unit of measurement for SBP is millimeters of mercury (mmHg). The regression solution turns out to be SBP = 96.6 + 0.44*Age. We purposely avoid talking about p-values to emphasize the point that linear rescaling has nothing to do with statistical significance. As an aside, the regression coefficient above is typically labeled “unstandardized” which means that the x and y variables have been modeled in their original scaling units.

The interpretation of our regression equation is straightforward: for every unit (one year) increase in age, SBP is expected to be higher by 0.44 units. The relation between scaling and the regression coefficient is represented graphically in Figure 1.




Figure 1. Systolic blood pressure regressed on age in 87 men with known heart disease. Age is scaled in one-year units. The distance between the predicted values for ages one year apart is equivalent to the unstandardized regression coefficient. In this case, a one-year increase in age is associated with a 0.44 mmHg increase in systolic blood pressure.

If we select a one-year distance on the x-axis (in this case, let’s say between ages 60 and 61), trace the vertical lines from each age to the regression line, and then draw a perpendicular from those intersection points to the y-axis, we find the predicted SBP for persons aged 60 and 61 years, respectively. These predicted SBP values, of course, correspond precisely to the values we can generate directly using the regression equation. For a 60 year old, the predicted SBP is 96.6 + 0.44*60 = 123 mmHg; and for a 61 year old, the predicted SBP is 96.6 + 0.44*61 = 123.44 mmHg. That barely discernible distance between the two horizontal lines crossing the y-axis in Figure 1 is the regression coefficient, or slope, of 0.44 mmHg of SBP. Because the relation is linear, we can actually say that for any two people one year apart in age, we would, on average, expect the older person to have a 0.44 mmHg higher systolic blood pressure than the younger one.
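The arithmetic above can be checked with a few lines of Python. This is only a sketch: the intercept and slope are the values reported in the text, while the function name predicted_sbp is a hypothetical helper introduced here for illustration.

```python
# Coefficients reported in the text: SBP = 96.6 + 0.44 * Age
INTERCEPT = 96.6
SLOPE_PER_YEAR = 0.44


def predicted_sbp(age_years: float) -> float:
    """Predicted systolic blood pressure (mmHg) from the fitted line."""
    return INTERCEPT + SLOPE_PER_YEAR * age_years


sbp_60 = predicted_sbp(60)
sbp_61 = predicted_sbp(61)

# For a straight line, predictions one unit apart differ by exactly the slope
# (up to floating-point rounding).
one_year_difference = sbp_61 - sbp_60

print(round(sbp_60, 2), round(sbp_61, 2), round(one_year_difference, 2))
```

The same difference is obtained for any other pair of ages one year apart, which is what linearity means here.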


On the face of it, 0.44 mmHg seems a very small effect. Less than half of a millimeter of mercury is unlikely to have much clinical relevance; in fact, it is well within the range of measurement error of a given single blood pressure assessment. To understand whether the relation between the predictor and outcome might be important, however, we need to examine the scaling of the predictor. Specifically, we can ask whether the original one unit distance being compared on the predictor scale is theoretically or substantively meaningful. In our SBP example, one year simply may not be a very long time in terms of the pathogenesis of high blood pressure. We might therefore rescale the age predictor so that the resulting regression coefficient reflects a distance across enough years such that the requisite physiological changes have a chance to occur. In the next sections we’ll see how to perform this rescaling, and also consider a few rescaling strategies.

Rescaling a predictor variable

The scale of a predictor can be changed by performing virtually any mathematical operation on the original predictor (one obvious exception is multiplying or dividing by 1). Dividing the predictor by a constant before performing the regression analysis will suit our present purpose. Let’s say that we consulted experts in blood pressure and they suggested that, based on their clinical experience, it takes roughly a decade for the aging process to change the average person’s systolic blood pressure in a meaningful way. To represent this in our regression we would rescale age by dividing the original age variable by 10 and then replace the original age variable with the newly created “age divided by 10” variable in the equation. Remember that age is still a continuous variable; we have not created any groups! I mention this latter point because in my experience some researchers seem to believe that creating artificial groups out of continuous variables is a good way to get meaningful coefficients. Many prominent statisticians, however, feel quite strongly that creating groups out of continuous measures is a severely flawed practice (2-6) whose pitfalls would vastly outweigh the benefit of more understandable coefficients. Turning back to our rescaling example, Figure 2 depicts the regression of SBP on age, but this time with the x-axis for age rescaled in units of 10 to reflect our new decade-sized units.




Figure 2. The same data as shown in Figure 1, but with age rescaled such that one unit is equal to ten years. The distance between the predicted values from ages a decade apart is now 4.4. The values just above the x-axis represent the original 1-year age units, while the values below represent the newly scaled decade units.

Notice that the regression line is exactly the same as it was with the original one-year units. The only difference is that the meaning of a unit has changed: one unit on the new rescaled age variable is now equal to 10 years rather than 1 year. The values in Figure 2 below the x-axis are the new decade-length units, while the original one-year units are now above the axis. Again, nothing has changed about the association itself: the fitted line, the test statistic, the p-value, and the variance explained all remain exactly the same. (The coefficient and its standard error are both multiplied by 10, so their ratio is unchanged.) We have merely redefined the units as decades rather than as years, much in the same way I used miles rather than feet to describe the distance from my house to my office.

We again trace vertically from selected values of decades that are one unit apart. In this case, we select the 6th (60 years) and 7th decade (70 years). Tracing across to the y-axis from the point at which these lines intersect with the regression line yields the regression coefficient, this time based on the rescaled age units. Because the relation is linear, the coefficient is exactly 10 times the size of the distance seen when age was in one-year units. Thus, now we have an estimate that reflects the change in blood pressure for a difference in age that is more consistent with our experts’ belief about the biology underlying blood pressure changes and aging.
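A small simulation can make this concrete. The sketch below uses made-up data, not the study's data: ages are drawn at random between 40 and 85, and SBP is generated from the reported line plus noise with an assumed standard deviation of 15 mmHg. The helper fit_line is a hypothetical function written for this illustration.

```python
import math
import random

# Simulated data for illustration only; these are NOT the study's data.
random.seed(0)
age_years = [random.uniform(40, 85) for _ in range(87)]
sbp = [96.6 + 0.44 * a + random.gauss(0, 15) for a in age_years]


def fit_line(x, y):
    """Least-squares fit of y on a single predictor x.

    Returns (slope, intercept, r_squared, t_statistic_for_slope).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    residual_variance = (syy - slope * sxy) / (n - 2)
    t_statistic = slope / math.sqrt(residual_variance / sxx)
    return slope, intercept, r_squared, t_statistic


# Age in its original one-year units.
slope_year, intercept_year, r2_year, t_year = fit_line(age_years, sbp)

# Age in decades: divide by the scaling constant. The variable stays continuous.
age_decades = [a / 10 for a in age_years]
slope_decade, intercept_decade, r2_decade, t_decade = fit_line(age_decades, sbp)

print(f"per year:   slope={slope_year:.3f}  R^2={r2_year:.3f}  t={t_year:.3f}")
print(f"per decade: slope={slope_decade:.3f}  R^2={r2_decade:.3f}  t={t_decade:.3f}")
```

In this sketch the per-decade slope is ten times the per-year slope, while the intercept, the variance explained, and the t statistic for the slope (and therefore its p-value) are the same in both fits.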

You also might notice by now that we could have taken a different, even simpler approach to rescaling. In the above example, we also could have just multiplied the original regression coefficient, 0.44, by the scaling constant, 10, to get the 4.4. This simpler technique will work with any model that is linear. In a regression model where the linear composite of the weighted predictors is subject to a nonlinear function, such as in logistic regression or survival models, this approach can still be taken if we understand the nonlinear part of the model. For example, if the outcome was a truly binary event, like a hospitalization, we might develop a logistic regression model with age in years predicting the probability of hospitalization. It might produce an odds ratio of, say, 1.01 for every one-year increase in age. To determine the odds ratio for an increase of 10 years, we can simply raise the one-year odds ratio to the 10th power: 1.01^10 ≈ 1.10. This is a perfectly legitimate way to rescale. My own preference is to rescale at the level of the predictor variable rather than the coefficient because, especially with more complex models, such as those with interactions or nonlinear terms, I find that there are fewer ways to make errors or forget what I have done and where.
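The odds ratio calculation can be sketched in Python as follows. The one-year odds ratio of 1.01 is the illustrative value from the text; the variable names are assumptions made for this example. The second route shows why the power rule works: on the log-odds scale the model is linear, so the coefficient is multiplied by the scaling constant and then exponentiated.

```python
import math

or_per_year = 1.01      # illustrative one-year odds ratio from the text
scaling_constant = 10   # years per new unit (one decade)

# Odds ratios multiply across units, so a 10-year difference
# corresponds to the one-year odds ratio raised to the 10th power.
or_per_decade = or_per_year ** scaling_constant

# Equivalent route on the log-odds (linear) scale:
# multiply the coefficient by the constant, then exponentiate.
log_odds_per_year = math.log(or_per_year)
or_per_decade_via_log = math.exp(scaling_constant * log_odds_per_year)

print(round(or_per_decade, 3), round(or_per_decade_via_log, 3))
```

Note that simply multiplying the odds ratio itself by 10 would be wrong; the multiplication belongs on the log-odds coefficient.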

Other ways to select a scaling distance

Turning back to the issue of selecting a rescaling value, it may be easier to identify a theoretically meaningful distance for some predictors than for others. For example, some clinicians have suggested to me that a patient’s score on the Beck Depression Inventory (BDI) would have to change by at least 4 points before they would consider it a ‘real’ change. Therefore, if the BDI were used as a predictor in a regression model, rescaling the BDI by dividing the original score by 4 might be a useful option. The resulting coefficient would represent the expected difference in the outcome variable given a clinically meaningful increase on the BDI score.

What if we can’t identify a convention for an appropriate distance on the predictor? Perhaps the predictor is a pencil-and-paper psychometric measure with a somewhat arbitrary or relatively unstudied scale. Some opt for using the standard deviation of the predictor as the scaling distance. For example, if the standard deviation of age is 16, age is rescaled to units of 16 years (by dividing the original age variable by 16). Using the standard deviation as the scaling factor is not a bad idea in many cases, but there are instances, which we will discuss in a short while, where it might be less than ideal.

A better alternative might be the interquartile range (IQR). The IQR is defined as the distance between the 25th and 75th percentiles. The advantage of the IQR as a scaling factor is that, unlike the standard deviation, it will always reflect values of the predictor that are relatively well-represented in the sample. The IQR-rescaled predictor also has an attractive intuitive interpretation: the resulting regression coefficient compares a person in the middle of the upper half of the predictor distribution to a person in the middle of the lower half of the distribution. In other words, it compares a person with a typical ‘high’ value on the predictor to a person with a typical ‘low’ value. In our blood pressure and age example, the IQR for age was right about 13 years. We rescale age by dividing it by 13, and find that the new regression coefficient is about 5.7 (0.44*13 = 5.72). Figure 3 shows the new scaling, comparing the predicted SBP from a person at the 75th percentile of age to that of a person at the 25th percentile. Thus the SBP for a typical ‘older’ person is expected to be about 5.7 mmHg higher than that of a typical ‘younger’ person, an expected change that is likely clinically meaningful.
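The standard deviation and IQR rescalings can be sketched in Python with the standard library. The data below are simulated for illustration and are not the study's data, so the printed IQR and coefficients will not match the values in the text. The helper ols_slope is a hypothetical function written for this example.

```python
import random
import statistics

# Simulated data for illustration only; these are NOT the study's data.
random.seed(1)
age_years = [random.uniform(40, 85) for _ in range(87)]
sbp = [96.6 + 0.44 * a + random.gauss(0, 15) for a in age_years]


def ols_slope(x, y):
    """Least-squares slope of y regressed on a single predictor x."""
    mean_x = statistics.fmean(x)
    mean_y = statistics.fmean(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


# With n=4, statistics.quantiles returns the 25th, 50th, and 75th percentiles.
# Other software may use a slightly different percentile convention.
q25, q50, q75 = statistics.quantiles(age_years, n=4)
iqr = q75 - q25
sd = statistics.stdev(age_years)

slope_per_year = ols_slope(age_years, sbp)
slope_per_sd = ols_slope([a / sd for a in age_years], sbp)
slope_per_iqr = ols_slope([a / iqr for a in age_years], sbp)

print(f"IQR of age = {iqr:.1f} years, SD of age = {sd:.1f} years")
print(f"slope per year = {slope_per_year:.3f}")
print(f"slope per SD   = {slope_per_sd:.3f}")
print(f"slope per IQR  = {slope_per_iqr:.3f}")
```

In each case the rescaled slope equals the per-year slope multiplied by the scaling constant, so the IQR-scaled coefficient is the expected SBP difference between the 75th and 25th percentiles of age.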
 


Figure 3. Systolic blood pressure and age, with age rescaled to the interquartile range. The resulting regression coefficient is now readily interpretable as comparing the expected blood pressure for a person in the middle of the upper half of the age distribution with the expected blood pressure for a person in the middle of the lower half of the age distribution. Such a scaling is particularly useful when the original units are relatively arbitrary. Again, the values just above the x-axis represent the original 1-year age units, while the values below represent the newly scaled IQR (13 year) units.

One reason that the IQR scaling might be preferable to the SD is the case of a highly skewed predictor variable. If the predictor is highly skewed (and I note here that it is perfectly fine to have a skewed predictor in a regression model provided the range is adequate; there is absolutely no normality assumption with respect to the predictors), the distance over a standard deviation might result in an unlikely or even impossible comparison. For example, if a predictor has a mean of 10 and a standard deviation of 15, there may be few, if any, pairs of well-represented values of the predictor that are actually 15 units apart, thus rendering the regression coefficient relatively meaningless.



Figure 4. A hypothetical predictor variable with a skewed distribution. The upper panel shows the distribution of a predictor with standard deviation of 11.8, mean of 7.8, and median of 5. The vertical blue line represents the median. If the standard deviation (11.8 units) is used as the rescaling constant, it may be difficult to find a location on the distribution where real cases exist for both points of comparison. In fact, if we center the comparison interval of 11.8 units (vertical red lines) around the median, the interval would extend below the actual range of the predictor. The lower panel displays the same distribution, but now the rescaling constant is the interquartile range. The red vertical line on the left now represents the 25th percentile, while the vertical red line on the right represents the 75th percentile. When the interquartile range is used as a rescaling constant (again using the median as the centerpoint), there will always be at least one location at which the points of comparison fall on representative values of the predictor, irrespective of the shape of the distribution. The regression coefficient rescaled to the interquartile range also has an intuitive interpretation: the regression coefficient now compares a case whose predictor value is exactly in the middle of the upper half of the predictor distribution (the 75th percentile) with a case whose predictor value is exactly in the middle of the lower half (the 25th percentile).


The top panel in Figure 4 displays the distribution for a variable, which, owing to some extreme values on the right hand side, has a very large standard deviation. If we try to compare patients a standard deviation apart, there are apparently very few locations along the range of the predictor where this makes sense in terms of comparing values that are represented by more than a few patients. In fact, if we center the 1 SD distance over the median, the bottom value is actually below the possible values for the scale. In contrast, if we select the IQR as the rescaling constant, we are guaranteed that a comparison can be made between two locations on the distribution that are, by definition, right in the middle of the low scores and right in the middle of the high scores—regardless of the shape of the distribution.
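The contrast between the two scaling constants can be sketched with a small, made-up set of right-skewed scores. These values are hypothetical and are not the data behind Figure 4; they were chosen only so that a few extreme values inflate the standard deviation.

```python
import statistics

# Hypothetical right-skewed scores (NOT the data shown in Figure 4).
scores = [0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5,
          6, 6, 7, 8, 9, 10, 12, 15, 20, 30, 45, 60]

median = statistics.median(scores)
sd = statistics.stdev(scores)
q25, _, q75 = statistics.quantiles(scores, n=4)
iqr = q75 - q25

# A one-SD comparison interval centered on the median.
sd_low, sd_high = median - sd / 2, median + sd / 2

print(f"observed range: {min(scores)} to {max(scores)}")
print(f"median = {median}, SD = {sd:.1f}, IQR = {iqr:.2f}")
print(f"1-SD interval around the median: {sd_low:.1f} to {sd_high:.1f}")
print(f"IQR interval: {q25} to {q75}")

# The SD interval dips below the smallest observed score,
# while both IQR endpoints lie inside the observed range.
sd_interval_leaves_range = sd_low < min(scores)
iqr_interval_in_range = min(scores) <= q25 < q75 <= max(scores)
print(sd_interval_leaves_range, iqr_interval_in_range)
```

With these scores the lower end of the one-SD interval is negative even though no score is below zero, whereas the 25th and 75th percentiles necessarily sit among observed values.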

Although the examples I have presented here are from a very simple one-predictor linear regression model, rescaling also is applicable to models with multiple predictors. In addition, as I mentioned earlier, rescaling can be used in models of all types, including popular ones such as logistic and time-to-event models. In logistic regression, for example, the odds ratio compares the odds of the outcome event for cases one unit apart on the predictor. If we used age to predict a truly binary variable, such as the occurrence of a myocardial infarction, the OR from the unscaled regression (age in one-year units) would be interpreted as the multiplicative change in the odds of an event for every one-year increase in age. Rescaling to decades would yield an odds ratio that compares people 10 years apart in age, and so on. For the IQR approach, the OR is again nicely intuitive: it represents the odds of the event occurring for a typical person in the middle of the upper half of the distribution compared to the odds of the event for a typical person in the middle of the lower half of the distribution.

Summary

When a variable under study is measured on a continuum, most statisticians recommend that you preserve its continuous form for the analysis. The original scale of the continuous variable, however, may not lend itself to immediately meaningful coefficients. Rescaling the predictor variable may produce more substantively meaningful coefficients while preserving the maximum amount of information from the continuous measure. Rescaling the predictor has no impact whatsoever on the strength of the association or on significance levels. The choice of the scaling distance may be based on subject knowledge of the phenomenon under study or on intuitively convenient distances (such as decades of age). When a theoretical basis is not available for selecting a scaling value, the distance between the 75th and 25th percentile of the predictor may be a useful choice for a scaling distance.



About the Author
Mike Babyak is a Professor of Medical Psychology at Duke University Medical Center. His professional interests include multivariate modeling and philosophy of science. He is an editorial board member and former statistical editor of Psychosomatic Medicine.

*** This post was peer-reviewed.  *** 




References

1. Blumenthal JA, Sherwood A, Babyak MA, Watkins LL, Waugh R, Georgiades A, Bacon SL, Hayano J, Coleman RE, Hinderliter A. Effects of exercise and stress management training on markers of cardiovascular risk in patients with ischemic heart disease: A randomized controlled trial. JAMA 2005;293:1626-34.

2. Cohen J. The cost of dichotomization. Appl Psychol Meas 1983;7:249-53.

3. Harrell FE. Problems Caused by Categorizing Continuous Variables. 2008.

4. MacCallum RC, Zhang S, Preacher K, Rucker D. On the Practice of Dichotomization of Quantitative Variables. Psychol Methods 2002;7:19-40.


5. Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med 2006;25:127-41.

6. Streiner DL. Dichotomization and manipulation of numbers: Reply. Can J Psychiatry 2003;48:430.