Multiple regression

Author: Dr Simon Moss

Purpose of multiple regression

Multiple regression analysis is used to examine the relationship between one numerical variable, called a criterion, and a set of other variables, called predictors. In addition, multiple regression analysis is used to investigate the correlation between two variables after controlling another covariate.

To illustrate, consider a researcher who wants to know whether or not the number of jars of Chicken Tonite you consume--the dependent variable or criterion--relates to frequency of psychotic behaviours, frequency of crossing roads, and age--the independent variables or predictors.

Multiple regression serves two functions. First, this technique yields an equation that predicts the dependent variable or criterion from the various independent variables or predictors. This function, however, is seldom utilised in psychology. Second, and more importantly, this technique identifies the independent variables that relate to the dependent variable, after controlling the other variables. For instance, this procedure can determine the relationship between Chicken Tonite and crossing roads in individuals who are equivalent on psychotic behaviours, age, and so on.

Procedure

To demonstrate multiple regression, you should first access SPSS and create a data file that resembles the following.

To subject these data to multiple regression:

Select the "Analyse" menu and choose "Regression" and then "Linear".
Designate "Chicken" as the dependent variable.
Designate "Psycho", "Crossing", and "Age" as the independent variables.
Although optional, you can also press "Statistics", tick "Part and Partial correlations", and then press "Continue".
Press OK. An extract of the output is presented below.

Derivation of the equation

Regression assumes the dependent or criterion variable can be predicted from the independent or predictor variables using an equation. The equation is assumed to resemble the following formula:

Dependent variable = B0 + B1 x iv1 + B2 x iv2 + B3 x iv3 ...

The B values denote numbers, such as 3.6. Multiple regression estimates these B values using the method of least squares. These B values are then reported in the column labelled B. Locate this column in the previous output. These B values can then be used to specify the equation. Note that 8.09E-02 denotes 8.09 x 10^-2, which equals 0.0809; that is, the decimal point is moved two places to the left. In this case, the equation is:

Chicken = 4.933 + 0.630 x Psycho + 0.081 x Crossing - 0.151 x Age

This equation can be used to predict the dependent variable from the independent variables. To illustrate, suppose that a person scored 2 for "Psycho", 8 for "Crossing", and 10 for "Age". Now, enter these values into the equation: 4.933 + 0.630 x 2 + 0.081 x 8 - 0.151 x 10 = 5.33. That is, we predict this participant will score 5.33 on the dependent variable or criterion "Chicken".
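The same arithmetic can be sketched in Python. The function name is an illustrative choice, not part of SPSS; the coefficients are copied from the B column reported in this example.

```python
# Coefficients taken from the B column of the example output.
B0, B_PSYCHO, B_CROSSING, B_AGE = 4.933, 0.630, 0.081, -0.151

def predict_chicken(psycho, crossing, age):
    """Predicted jars of Chicken Tonite for one participant."""
    return B0 + B_PSYCHO * psycho + B_CROSSING * crossing + B_AGE * age

predicted = predict_chicken(psycho=2, crossing=8, age=10)
print(round(predicted, 2))  # 5.33
```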

The accuracy of this equation

This equation is regarded as accurate if the residuals are minimal. To illustrate the concept of residuals:

Predict the dependent variable for the first five participants in the data file, using the previous equation.
For each of these subjects, compute the difference between the predicted dependent variable and the actual dependent variable.
This difference is called the residual.
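As a rough sketch of these steps, the following Python computes residuals for three hypothetical participants; the scores are invented for illustration and are not the data file used in this article. The residual is taken as the actual score minus the predicted score, which is the sign convention SPSS uses.

```python
def predict_chicken(psycho, crossing, age):
    """Prediction from the equation derived in this example."""
    return 4.933 + 0.630 * psycho + 0.081 * crossing - 0.151 * age

# Hypothetical participants: (psycho, crossing, age, actual chicken)
participants = [(2, 8, 10, 6.0), (5, 3, 12, 6.5), (1, 10, 20, 3.0)]

residuals = []
for psycho, crossing, age, actual in participants:
    predicted = predict_chicken(psycho, crossing, age)
    residuals.append(actual - predicted)  # positive: the equation under-predicted

print([round(e, 3) for e in residuals])
```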

Indeed, SPSS can be utilised to compute the predicted dependent variable and the residuals associated with each individual.

Specifically, execute multiple regression as usual, but do not press OK.
Instead, press "Save".
Tick "Unstandardised" in the list associated with "Predicted values".
Tick "Unstandardised" in the list associated with "Residuals".
Press "Continue" and then "OK". This procedure will create two new columns in the datasheet, as shown below.

Furthermore, a statistic called Multiple R represents the correlation between the predicted dependent variable and the actual dependent variable, a value that ideally approaches 1. SPSS also provides the square of this value. This value, R squared, represents the proportion of variance in the dependent variable that is explained by the independent variables. To appreciate this concept:

Suppose that you only considered individuals who were equivalent on the independent variables.
Presumably, the variability of the dependent variable in the entire sample would exceed the variability of the dependent variable in this subsample.
R squared represents this reduction in the variance. These indices are presented below.
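To make these indices concrete, the sketch below fits a least-squares equation to a small hypothetical data set with two predictors, then computes the residuals, Multiple R, and R squared. The data and the helper names (fit_ols, pearson) are assumptions made for illustration; SPSS performs the equivalent computation internally.

```python
from math import sqrt

def fit_ols(rows, y):
    """Return [B0, B1, ...] that minimise the sum of squared residuals."""
    X = [[1.0] + list(r) for r in rows]           # leading 1 for the constant B0
    p = len(X[0])
    # Augmented normal equations [X'X | X'y]
    A = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(p)]
    for c in range(p):                            # Gauss-Jordan elimination
        pivot_row = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot_row] = A[pivot_row], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(p):
            if r != c:
                factor = A[r][c]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][p] for i in range(p)]

def pearson(a, b):
    """Pearson correlation between two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))

# Hypothetical data: (psycho, crossing) for eight participants, and chicken
predictors = [(1, 5), (2, 3), (3, 8), (4, 1), (5, 9), (6, 2), (7, 7), (8, 4)]
chicken = [3, 4, 6, 5, 8, 7, 9, 10]

b = fit_ols(predictors, chicken)
predicted = [b[0] + b[1] * p1 + b[2] * p2 for p1, p2 in predictors]
residuals = [y - yhat for y, yhat in zip(chicken, predicted)]

mean_y = sum(chicken) / len(chicken)
ss_total = sum((y - mean_y) ** 2 for y in chicken)     # variability in the whole sample
ss_residual = sum(e ** 2 for e in residuals)           # variability left unexplained

r_squared = 1 - ss_residual / ss_total
multiple_r = pearson(predicted, chicken)
print(round(multiple_r, 3), round(r_squared, 3))
```

For a least-squares equation that includes a constant, the square of Multiple R and the value 1 - (residual variability / total variability) coincide, which is why SPSS can report R squared as the proportion of variance explained.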

The value of R squared that was derived from the sample tends to exceed the value of R squared in the population. For instance:

SPSS may suggest that R squared equals 0.394. In the population, R squared may only equal 0.350.
This bias is especially acute in small samples.
To alleviate this problem, SPSS computes an adjusted R squared--an estimate of the population R squared.
The adjusted R squared is always less than R squared.
Unfortunately, this adjusted R squared tends to exceed the genuine population R squared when the sample size is small.
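The adjustment can be sketched with the usual formula, adjusted R squared = 1 - (1 - R squared) x (n - 1) / (n - k - 1), where n is the sample size and k is the number of predictors. The sample sizes below are hypothetical, because the size of the example data file is not stated here.

```python
def adjusted_r_squared(r_squared, n, k):
    """Shrink R squared according to sample size n and number of predictors k."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

# R squared of 0.394 with three predictors; n = 30 and n = 300 are hypothetical
small_sample = adjusted_r_squared(0.394, n=30, k=3)
large_sample = adjusted_r_squared(0.394, n=300, k=3)
print(round(small_sample, 3), round(large_sample, 3))  # 0.324 0.388
```

The adjustment is larger in the small sample, consistent with the claim that the bias in R squared is most acute when few participants are available.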

The F value and the corresponding significance or p value reflect whether or not R squared significantly differs from zero. When this significance value is less than alpha or 0.05, the R squared value significantly differs from 0. That is, the dependent variable is significantly related to the independent variables. When this significance value exceeds alpha or 0.05, the R squared value does not significantly differ from 0. In other words, the dependent variable is not significantly related to the independent variables. These F and p values are presented below.
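The F value can be recovered from R squared, the sample size n, and the number of predictors k. The sketch below assumes a hypothetical n of 30; the p value would then be read from the F distribution with k and n - k - 1 degrees of freedom, a step that is not computed here.

```python
def f_statistic(r_squared, n, k):
    """F test of whether R squared differs from zero."""
    df_regression = k
    df_residual = n - k - 1
    f = (r_squared / df_regression) / ((1 - r_squared) / df_residual)
    return f, df_regression, df_residual

# R squared of 0.394 with three predictors; n = 30 is hypothetical
f, df1, df2 = f_statistic(0.394, n=30, k=3)
print(round(f, 2), df1, df2)  # 5.63 3 26
```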

Controlling other variables

Multiple regression can also ascertain which of the independent variables relate to the dependent variable. You might believe that Pearson's correlation can answer this question. For example, suppose the correlation between "Psycho" and "Chicken" was significant. You might then conclude that "Psycho" significantly relates to "Chicken". However:

Individuals who yield high "Psycho" scores might also yield high "Crossing" scores which, in turn, may influence "Chicken".
The researcher may thus want to know whether "Psycho" relates to "Chicken" after controlling "Crossing".
That is, does "Psycho" relate to "Chicken" even in people who are the same on "Crossing"?
Multiple regression can be applied to answer this question.
That is, multiple regression can determine the relationship between a criterion and predictor after controlling the effect of spurious variables or mediators.

To resolve this issue, examine the column labelled "Sig" in the coefficients table, which provides the p values. In particular:

The p value associated with "Psycho" is less than alpha or 0.05, which suggests the corresponding B value significantly differs from 0.
Because this B value differs from 0, we conclude that "Psycho relates to Chicken after controlling for Crossing and Age".

In other words, some element of Psycho that is unrelated to Crossing or Age must be correlated with Chicken. To appreciate this claim, you should recognise that:

Psycho entails several elements or components.
For example, Psycho might be elevated because individuals are afflicted with schizophrenia, personality disorders, or dementia.
In other words, variability in Psycho across individuals arises from variability in schizophrenia, personality disorders, and dementia.
In this example, however, Psycho is related to Chicken after controlling Crossing and Age.
Hence, the element of Psycho that is related to Age--perhaps dementia--cannot explain this relationship.
Instead, the elements of Psycho that are unrelated to Crossing and Age, perhaps the level of schizophrenia or personality disorders, must be correlated with Chicken.
In short, only the unique element of Psycho--the proportion of variance that is unrelated to the other independent variables--is related to Chicken.

The p value associated with "Crossing" exceeds alpha or 0.05. Hence, the corresponding B value does not differ significantly from 0. We thus conclude that "Crossing does not relate to Chicken after controlling for Psycho and Age".

Interpret the other indices

To reiterate, the B value associated with "Psycho" significantly differed from 0. The next step is to interpret the sign of this B value in the coefficients table, which is presented again below.

In this instance, the B value is positive.
Accordingly, the independent variable is positively related to the dependent variable.
That is, raising "Psycho" could also raise "Chicken".

In some instances, a significant B value may be negative. In this situation:

The independent variable must be inversely related to the dependent variable, after controlling for the other variables.
That is, raising the independent variable could reduce the dependent variable.

Advanced topic: Part correlations

The final step is to interpret the part correlations--often called semi-partial correlations. Specifically:

These values reflect the correlation between each predictor and criterion, after controlling the other variables.
Strictly speaking, to compute this correlation, the unique variance of each predictor--that is, the element that is unrelated to the other predictors--is derived. The correlation between the criterion and this unique variance is then computed.
The square of these part or semi-partial correlations can also be computed.
For instance, the square of 0.591 is approximately 0.35.
This squared semi-partial correlation reflects the rise in R squared after that predictor is added to the equation.
In this instance, R squared increased by 0.35 after "Psycho" was added to "Crossing" and "Age".

To compute the part correlation between two variables Y and A, after controlling another variable B, the following formula would be utilised.
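In its standard form for a single controlled variable, the part correlation is (rYA - rYB x rAB) / sqrt(1 - rAB^2), where rYA, rYB, and rAB are the Pearson correlations between each pair of variables. A minimal Python sketch follows; the function name and the correlation values are hypothetical.

```python
from math import sqrt

def part_correlation(r_ya, r_yb, r_ab):
    """Part (semi-partial) correlation of Y with A, after B is removed from A."""
    return (r_ya - r_yb * r_ab) / sqrt(1 - r_ab ** 2)

# B unrelated to both Y and A: the part correlation equals the raw correlation
unrelated_control = part_correlation(r_ya=0.6, r_yb=0.0, r_ab=0.0)

# B moderately related to both Y and A: the part correlation shrinks
related_control = part_correlation(r_ya=0.6, r_yb=0.5, r_ab=0.5)

print(round(unrelated_control, 3), round(related_control, 3))  # 0.6 0.404
```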

This formula is merely presented to demonstrate the part correlation is elevated if:

The independent variable, A, is highly correlated with the dependent variable, Y
The variable that is controlled, B, is not highly correlated with the dependent variable or the independent variable.

Suppressor variables

Consider the following table that reveals the correlation between each pair of measures. In this instance, Psycho does not seem to correlate significantly with Chicken.

In contrast, consider the following table, which provides the output that emerged from multiple regression. In this instance, Psycho does relate to Chicken. The impact of Psycho on Chicken emerged only after the other predictors were controlled. Researchers utilise the following terminology to describe this instance. They claim that Crossing or Age had suppressed the relationship between Psycho and Chicken. This relationship emerged only after these suppressor variables were controlled.

In short, a predictor might be unrelated to the criterion according to a correlation matrix but related to the criterion according to a multiple regression. This situation can arise if the relationship between two of the variables is negative, but the relationship between all the other variables is positive. In addition, this situation can arise if the relationship between all the variables is negative. To illustrate, suppose the relationships between Chicken, Psycho, and Crossing correspond to the following model.

This model reveals that Psycho influences Chicken via two pathways. According to the top pathway, Psycho directly promotes Chicken. According to the bottom pathway, Psycho indirectly reduces Chicken. That is, Psycho impedes Crossing and thus Chicken. Therefore,

These two pathways might nullify one another.
In other words, Psycho will not be correlated with Chicken.
However, when Crossing is controlled, the bottom pathway is effectively obstructed.
Hence, when Crossing is controlled, Psycho will be positively related to Chicken.

In short, according to this model, the correlation between Psycho and Chicken should not be significant. In addition, when a multiple regression is conducted, and Crossing is controlled, Psycho should be positively related to Chicken. The previous correlation matrix and regression coefficients thus corroborate this model.
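The suppression pattern can be illustrated with hypothetical correlations rather than the values in the tables. With two predictors, the standardised coefficient of predictor A is (rYA - rYB x rAB) / (1 - rAB^2). In the sketch below, Psycho is uncorrelated with Chicken, is negatively correlated with Crossing, and Crossing is positively correlated with Chicken; all three correlations are assumptions chosen to match the model described above.

```python
def standardised_coefficient(r_ya, r_yb, r_ab):
    """Standardised regression coefficient of A when A and B both predict Y."""
    return (r_ya - r_yb * r_ab) / (1 - r_ab ** 2)

# Hypothetical correlations
r_chicken_psycho = 0.0     # the two pathways nullify one another
r_chicken_crossing = 0.5   # Crossing promotes Chicken
r_psycho_crossing = -0.5   # Psycho impedes Crossing

beta_psycho = standardised_coefficient(
    r_chicken_psycho, r_chicken_crossing, r_psycho_crossing)
print(round(beta_psycho, 3))  # 0.333
```

Although the raw correlation is zero, the coefficient for Psycho is positive once Crossing is controlled.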

Standardised Bs

Sometimes, several independent variables are significant. You may thus want to identify the independent variable that bestows the greatest impact on the dependent variable. In this endeavour, you may be tempted to utilise the B values. However, the magnitude of these B values depends on both:

The extent to which the independent variable is related to the dependent variable--reflecting the importance of this predictor
The variance or standard deviation of that independent variable.

Therefore, these B values do not represent importance only. To overcome this problem, SPSS converts all of the raw data into z scores, which are presented below, and recomputes the B values. These z scores are computed by deducting the mean from the original values and then dividing by the standard deviation. Accordingly, the mean of these z scores is zero and the standard deviation is 1. The recomputed B values, which are called standardised, are displayed in the column labelled Beta. Because these Beta values were derived from z scores, all the independent variables yield the same variance. Hence, the magnitude of these Beta values represents the importance of each independent variable. Roughly speaking, Beta values that are larger in absolute size tend to reflect more important independent variables.
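A short sketch of this standardisation, using hypothetical ages. Recomputing the regression on z scores is equivalent to multiplying each unstandardised B by the standard deviation of its predictor and dividing by the standard deviation of the criterion, which is what the second function does; both function names are illustrative.

```python
from statistics import mean, stdev

def z_scores(values):
    """Deduct the mean, then divide by the (n - 1) standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def standardised_beta(b, predictor, criterion):
    """Convert an unstandardised B into the corresponding Beta."""
    return b * stdev(predictor) / stdev(criterion)

age = [10, 12, 20, 25, 31, 40]           # hypothetical values
z_age = z_scores(age)
print(round(mean(z_age), 6), round(stdev(z_age), 6))

# Sanity check: a perfect relationship y = 2x has B = 2 but Beta = 1
beta = standardised_beta(2, [1, 2, 3, 4], [2, 4, 6, 8])
```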

Dummy coding

In general, the independent variables need to be numerical. Fortunately, multiple regression can also exploit dichotomous independent variables--that is, independent variables that entail two categories. Gender is one example. To this end,

One category is assigned a 0; the other category is assigned a 1.
To illustrate, in the data file, create a new variable called "gender".
Assign a 0 to the first 20 subjects to represent males.
Assign a 1 to the other subjects to represent females.

Execute the regression and include this variable. An extract of the output is presented below.

To interpret this added predictor:

When the B value significantly exceeds 0, the dependent variable is higher in the group assigned a 1.
When the B value is significantly less than 0, the dependent variable is higher in the group assigned a 0.
In this instance, gender did not significantly impinge on Chicken Tonite.

Dummy coding can be extended to independent variables that entail more than two categories. For example, suppose the first five participants were actually hermaphrodites. To represent this situation, you first need to discard the column labelled "Gender". Add two new columns--"Male" and "Female". In the "Male" column, 1s represent males and 0s represent the others. In the "Female" column, 1s represent females and 0s represent the others. An example is presented below.
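As an illustration, the two dummy columns can be built from category labels as follows. The six labels are hypothetical. The final check shows that a third column for the omitted category would carry no new information, because it equals 1 minus the sum of the other two columns.

```python
# Hypothetical category labels for six participants
categories = ["hermaphrodite", "male", "female", "male", "hermaphrodite", "female"]

male = [1 if c == "male" else 0 for c in categories]
female = [1 if c == "female" else 0 for c in categories]

# A third dummy column is fully determined by the first two
hermaphrodite = [1 if c == "hermaphrodite" else 0 for c in categories]
redundant = all(h == 1 - (m + f) for h, m, f in zip(hermaphrodite, male, female))

print(male)       # [0, 1, 0, 1, 0, 0]
print(female)     # [0, 0, 1, 0, 0, 1]
print(redundant)  # True
```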

You might be tempted to include a column labelled "Hermaphrodite" as well, but this step would yield a problem called singularity. Specifically, this column would simply equal 1 minus the sum of the two other columns, which represents an extreme form of an issue called multicollinearity. To interpret the outcome of these analyses:

When a dummy variable attains significance, the corresponding category (e.g. males) must differ from the omitted category (i.e. hermaphrodites).
When the dummy variable does not attain significance, the corresponding category (e.g. males) does not differ significantly from the omitted category (i.e. hermaphrodites).
In the example below, males do not differ significantly from hermaphrodites.
Likewise, females do not significantly differ from hermaphrodites.

Sometimes, you would prefer to ascertain whether or not each category differs from the average participant. In this instance, you might be interested in whether or not males and females differ from the average participant. In other words, you might not be interested in whether or not males and females differ from hermaphrodites. To fulfil this purpose, you need to adjust the dummy variables. Specifically, replace the 0s with -1 for participants who pertain to the reference category. That is, substitute the 0s with -1 for each hermaphrodite. The following output will then emerge.

According to this output, neither males nor females differ significantly from the average participant. Whether or not hermaphrodites differ significantly from the average participant cannot be determined from these data. Researchers could repeat the analysis and designate another gender, such as males, as the reference category.
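The adjusted coding described above, in which the reference category receives -1 in every column, can be sketched as follows. The labels and the function name are hypothetical.

```python
# Hypothetical category labels for six participants
categories = ["hermaphrodite", "male", "female", "male", "hermaphrodite", "female"]
REFERENCE = "hermaphrodite"

def effect_code(category, target, reference=REFERENCE):
    """1 for the target category, -1 for the reference category, 0 otherwise."""
    if category == target:
        return 1
    if category == reference:
        return -1
    return 0

male = [effect_code(c, "male") for c in categories]
female = [effect_code(c, "female") for c in categories]

print(male)    # [-1, 1, 0, 1, -1, 0]
print(female)  # [-1, 0, 1, 0, -1, 1]
```

Designating another category as the reference only requires changing REFERENCE and the two target labels.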

Which predictors should be included?

Consider a researcher who has measured 50 predictors of Chicken Tonite consumption--perhaps five personality variables, eight family variables, six economic variables, and so forth. Unless the sample size is very large, the researcher cannot include all 50 predictors in the same regression analysis. How many regression analyses should this researcher conduct? Perhaps the researcher could:

Undertake one regression analysis in which the predictors are demographics--such as sex, age, and height--and the five personality variables.
Undertake a second analysis in which the predictors are sex, age, and height but with the eight family variables instead.
Undertake a third regression analysis in which the predictors are sex, age, and height but with the six economic variables instead, and so on.

This strategy seems reasonable. Nevertheless, to decide which strategy is optimal, the researcher should reflect upon the benefits of examining several predictors in the same regression analysis. To demonstrate, suppose five personality variables are subjected to the same regression analysis. Thus, by definition, each personality variable is examined after controlling the other personality variables. The question, therefore, becomes what are the benefits of controlling other variables. Four benefits of controlling variables can be differentiated:

Purity of measures and treatments
Spurious variables
Mediators
Power

In this instance, the researcher most likely decided to:

Include variables that correspond to one category, such as personality factors, in the same analysis to ensure only the unique--or purified--facet of each measure is examined
Include demographic variables, such as sex and age, in the analysis because these demographics might be spuriously related to the independent and dependent variables
Not include any mediators--perhaps because the study was not designed to examine mediation
Not include variables to increase power--perhaps because the researcher could not identify a predictor that is strongly related only to the dependent variable.

In short, to decide which predictors to include in the multiple regression analysis, you should consider whether you want to purify measures, control spurious factors, examine mediators, or increase power. Nevertheless, in practice, researchers tend to include variables that correspond to a similar category in the same analysis. They also control demographics that seem to correlate with both the dependent and independent variables.

Illustration of the format used to report multiple regression

A multiple regression analysis was used to examine whether psychotic behaviour, crossing roads, and age were related to the number of jars of Chicken Tonite that individuals consume. The B and t values that emerged are presented in Table 1. Table 1 reveals that psychoticism increases the volume of Chicken Tonite that individuals consume after controlling age and the frequency with which they crossed roads. None of the other predictors achieved significance.

Table 1. Output of the regression that related consumption of Chicken Tonite to psychoticism, crossing roads, and age
