Sample syntax for regression analyses

Introduction

To execute regression analyses in SPSS, researchers often prefer to create a syntax file rather than select the various menus. If researchers use syntax, they can repeat their analyses with other datasets efficiently, without needing to select the various menus and options again. This article presents sample syntax that illustrates the various phases researchers often complete, such as recoding items, managing missing data, checking internal consistency, testing assumptions, and then conducting the regression analyses.

How to use the syntax

To use the syntax:

• Open SPSS; select "New" and then "Syntax" from the "File" menu to open a syntax editor
• Follow the instructions below to create the syntax or code for the analyses
• If a line of code extends past the right-hand edge of the screen, press enter and indent the continuation a few spaces. SPSS sometimes ignores code that extends too far to the right
• Lines that begin with an asterisk are comments; SPSS ignores these lines
• When ready to execute the code, simply highlight the relevant sections. Then use the 'Run' menu to execute this code.
• Some error messages might appear. Sometimes, these messages appear because of unnecessary spaces between lines or after full stops. Delete these spaces.

Step 1. Reverse score or recode items

First, identify items that need to be reverse scored. For example, suppose four items are used to assess anxiety. Suppose high scores on two of these items reflect high anxiety, whereas low scores on the other two items reflect high anxiety--the scores on these two items will need to be reverse scored. That is, researchers need to ensure that high scores on each item correspond to high levels of that variable. Copy and paste the syntax below.

 RECODE item3 (1=5) (2=4) (3=3) (4=2) (5=1) INTO item3r.
 RECODE item4 (1=5) (2=4) (3=3) (4=2) (5=1) INTO item4r.
 EXECUTE.

In this code, for example, the first line changes all the 1s to 5s, the 2s to 4s, the 4s to 2s, and the 5s to 1s in item3. This line then creates a new column in the data file, called item3r, that stores these reversed scores. To reverse score:

• Simply change the names of each item
• For example, you might need to reverse score a column called qn4 rather than item3
• Therefore, replace item3 with qn4.
• Also, replace item3r with qn4r or whatever you want to call the new column
• If you need to reverse score more than two items, simply copy and paste one of these lines, and change the labels accordingly

Other adjustments might be necessary:

• If you used a 7-point scale, you need to change the numbers
• The numbers would become (1=7) (2=6) and so forth
• Finally, highlight each line as well as the execute command
• Use the 'run' menu to execute this syntax
• Examine the end of your data file to observe the outcome
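The recodes above follow a simple rule: the new score equals the scale minimum plus the scale maximum, minus the old score. The article's tool is SPSS, but a short Python sketch (with hypothetical item values) can verify the mapping:

```python
def reverse_score(value, low=1, high=5):
    """Reverse-score a Likert response: on a 1-5 scale, 1 becomes 5,
    2 becomes 4, 3 stays 3, and so on (new = low + high - old)."""
    return low + high - value

print([reverse_score(v) for v in [1, 2, 3, 4, 5]])  # [5, 4, 3, 2, 1]
print(reverse_score(2, high=7))                     # 6, on a 7-point scale
```

The same rule explains the 7-point adjustment above: for a 1-to-7 scale, the constant is 1 + 7 = 8, so (1=7) (2=6) and so forth.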

Step 1b. Manage missing data

The next step is often to substitute missing data with accurate estimates, using a technique called expectation maximization (for information about the underlying rationale, see Expectation Maximization). To undertake this technique, which is necessary only if some participants have not answered all of the key questions:

• Identify which of the variables that you want to examine are numerical--that is, variables in which everyone is assigned a real number, such as a Likert scale
• Identify which of the variables that you want to examine are categorical--that is, variables in which everyone is assigned a category or name, such as hair color

Type the following syntax:

 MVA VARIABLES=height weight item1 item2 item3 item4 gender haircolor
  /MAXCAT=25
  /CATEGORICAL=gender haircolor
  /EM(TOLERANCE=0.001 CONVERGENCE=0.0001 ITERATIONS=25
   OUTFILE='C:\My Documents\Sample data file.sav').
• Instead of the variables height weight item1 item2 item3 item4 gender haircolor in the first row, specify the names of your variables--that is, the names of your columns in SPSS
• Instead of the variables gender and haircolor in the third row, specify the name of your categorical variables
• Instead of the name 'C:\My Documents\Sample data file.sav', type the name of a new file in which you would like the updated data stored--that is, the data after the missing values have been substituted with estimates.

Use the updated data, unless the data are not missing at random. That is, this updated file is not applicable if the missingness is related to the values of a variable, even after other variables are controlled. Hence:

• In the output that appears, determine whether the p value associated with Little's MCAR test is significant
• If this p value is not significant, the data are called "missing completely at random"--which implies that expectation maximization is suitable
• If this p value is significant, proceed to the table labeled Separate Variance t Tests
• If none of these p values is significant, the data are called "missing at random"--which also implies that expectation maximization is suitable
• If expectation maximization is not suitable, you need to recognize that your data could be biased. Furthermore, substitution of missing data with estimates is not appropriate
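Before running MVA, it can also help to know how much data is actually missing. The analyses themselves run in SPSS; the following Python sketch, with hypothetical variable names and values, merely tallies missing cells per variable:

```python
# Tally missing cells per variable; None marks a missing response.
# Variable names and values are hypothetical.
data = {
    "item1": [3, None, 4, 5],
    "item2": [2, 2, None, None],
    "gender": [0, 1, 1, 0],
}

missing = {name: sum(value is None for value in column)
           for name, column in data.items()}
print(missing)  # {'item1': 1, 'item2': 2, 'gender': 0}
```

If no variable contains missing values, this step can be skipped entirely.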

Step 2. Examine alpha reliability or internal consistency

The next step is to compute Cronbach's alpha for each of your scales. Copy and paste the following syntax.

 RELIABILITY
  /VARIABLES = item1 item2 item3
  /FORMAT=NOLABELS
  /SCALE(ALPHA) = ALL /MODEL=ALPHA
  /SUMMARY=TOTAL.
• Identify the items associated with one of your scales or subscales.
• For example, in your data file, perhaps qn1, qn2, qn3r, and qn4r pertain to anxiety
• Replace 'item1 item2 item3' with 'qn1 qn2 qn3r qn4r' in the previous syntax
• Use the 'run' menu to execute this syntax
• Examine the output to observe the alpha value

Alpha values that exceed 0.7 are considered acceptable (e.g., Nunnally & Bernstein, 1994). Suppose the alpha value is less than 0.7. In this instance:

• Proceed to the column called 'Alpha if item deleted'
• This column represents what the alpha would be if each item was deleted
• Suppose the alpha rises to 0.8 after qn1 is deleted
• This finding suggests that qn1 should be deleted from now on
• You should then repeat the process but exclude qn1

You can continue removing items even if alpha exceeds 0.7. However, if the scale is popular, only remove items if necessary. That is, only remove items if alpha rises considerably.

Then, repeat this process with other scales or subscales. That is, copy and then paste these five lines. Replace the items with the next subscale, and continue.
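The alpha that RELIABILITY reports can be computed from the item variances and the variance of the total score: alpha = k/(k - 1) × (1 - sum of item variances / variance of total), where k is the number of items. A Python sketch with hypothetical responses illustrates the formula:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha for a scale.
    items: one list of scores per item (no missing values).
    alpha = k/(k-1) * (1 - sum of item variances / variance of total)."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variances = sum(pvariance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variances / pvariance(totals))

# Hypothetical responses from five participants on three anxiety items
qn1 = [4, 5, 3, 4, 2]
qn2 = [4, 4, 3, 5, 2]
qn3r = [5, 4, 2, 4, 3]
print(round(cronbach_alpha([qn1, qn2, qn3r]), 2))  # 0.86
```

When the items move together, as here, the total-score variance dwarfs the summed item variances and alpha approaches 1.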

Step 3. Compute the scale scores

You now need to compute the scores for each scale or subscale. For example, suppose that qn1, qn2, qn3r, and qn4r pertain to anxiety. However, suppose that qn1 was deleted, because this item reduces alpha. You thus need to create a new column that is the average of qn2, qn3r, and qn4r. This column would represent the level of anxiety. Accordingly, copy and paste the following syntax:

 COMPUTE anxiety = mean(qn2, qn3r, qn4r).
 EXECUTE.

In this syntax, each COMPUTE line creates one of these new columns. The word after 'COMPUTE' is merely the label for your new column. For example:

• Suppose you want to create a scale that reflects openness
• You would then replace 'anxiety' with 'openness'
• The labels in the brackets represent the items associated with this scale.
• Remember to utilize the recoded items where applicable
• Column labels must begin with a letter and comprise only letters, numbers, or the underscore

Continue this process for each scale or subscale. That is, copy and paste the COMPUTE line. If your data includes 6 scales, then you need 6 COMPUTE commands. Highlight these lines, as well as the EXECUTE command, and then run.
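The mean() function in SPSS averages whichever of the listed items are present, so a participant who skipped one item still receives a scale score. A Python sketch of this behavior, with hypothetical scores:

```python
def scale_score(*item_scores):
    """Average the available item scores, skipping missing values,
    which mirrors how SPSS's mean() function behaves."""
    present = [s for s in item_scores if s is not None]
    return sum(present) / len(present) if present else None

# Hypothetical rows of qn2, qn3r, and qn4r for two participants
print(scale_score(4, 5, 3))     # 4.0
print(scale_score(4, None, 2))  # 3.0 -- the missing item is ignored
```

If you already substituted missing values with expectation maximization in Step 1b, no cells should be missing at this point anyway.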

Step 4. Assess multicollinearity

You now need to examine the correlation between all scales and other key variables. To create a correlation matrix, copy and paste the following syntax:

 CORRELATIONS
  /VARIABLES=anxiety openness height weight gender
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.

Replace 'anxiety openness height weight gender' with all your scales and subscales. You can also include other numerical variables, such as height, as well as dichotomous variables, such as gender. The two categories in these dichotomous variables should be labeled as 0 and 1 respectively in the data file. Finally, highlight and then execute this syntax.

Correlations that exceed 0.8 or so indicate multicollinearity. That is, these correlations suggest the two variables overlap unduly. These two variables should not be included in the same analysis, unless one is the dependent variable and the other is an independent variable. You might want to collapse these two variables into one scale.

Indeed, if the sample size is small, such as less than 100, correlations that exceed 0.7 might indicate multicollinearity. Regardless, the correlations often provide some insight into the hypotheses as well.
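This rule of thumb can be checked mechanically: compute every pairwise correlation and flag those that exceed 0.8 in absolute value. A Python sketch with hypothetical scale scores (in this example, 'stress' nearly duplicates 'anxiety', so that pair would be flagged):

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical scale scores; 'stress' nearly duplicates 'anxiety'
scales = {
    "anxiety": [3.0, 2.5, 4.0, 3.5, 1.5, 2.0],
    "stress":  [3.2, 2.4, 4.1, 3.6, 1.4, 2.2],
    "height":  [170, 182, 165, 174, 190, 168],
}

# Flag any pair whose correlation exceeds the 0.8 rule of thumb
for a, b in combinations(scales, 2):
    r = pearson(scales[a], scales[b])
    if abs(r) > 0.8:
        print(f"possible multicollinearity: {a} & {b} (r = {r:.2f})")
```

In SPSS itself, the same numbers appear in the CORRELATIONS output; the sketch merely shows the arithmetic behind the flagging rule.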

Sometimes, the correlation matrix comprises too many variables and is thus unwieldy. You could potentially divide the matrix into three tables. That is:

• The first table examines the correlations between one set of variables--perhaps the dependent measures.
• The second table examines the correlations between the remaining variables.
• The third table demonstrates how the first set correlates with the second set.
• This final matrix is created by placing the term 'WITH' between these sets, as illustrated below.
 CORRELATIONS
  /VARIABLES=anxiety openness WITH gender height weight
  /PRINT=TWOTAIL NOSIG
  /MISSING=PAIRWISE.

Step 5. Examine outliers

You can next identify multivariate outliers--individuals whose profile of scores diverges appreciably from a typical person. These outliers might indicate the individual is not a member of your target population. Alternatively, these outliers might represent errors or compromise normality. To identify these outliers, copy and paste the following syntax:

 REGRESSION
  /MISSING PAIRWISE
  /STATISTICS COEFF OUTS R ANOVA
  /NOORIGIN
  /DEPENDENT other
  /METHOD = ENTER anxiety openness gender
  /SAVE MAHAL.
• Replace 'anxiety openness gender' with your scales, subscales, numerical variables, and dichotomous variables.
• In addition, replace 'other' with a variable that does not appear in the list next to ENTER.
• For example, perhaps you have specified an 'ID' number for each person. You would then replace 'other' with 'id'

Sometimes, all the variables appear in the list next to ENTER. In this instance, create a new column of numbers in the data file. Replace 'other' with the name of this column. Highlight and execute this syntax.

This syntax creates a new column in the data file called mah_1. High values represent potential multivariate outliers. To identify outliers:

• Open Microsoft Excel. Type "=CHIINV(0.01, 50)" in one of the cells--that is, type everything that appears within these quotation marks.
• Change 50 to the number of variables that appeared after 'ENTER' before--which represents the degrees of freedom
• A value should appear in the cell. Mahalanobis values that appreciably exceed this value are outliers.
• Outliers should usually be deleted.

Step 6. Specify possible regressions

Before you proceed, specify the regressions you plan to undertake. For example, you might need to know how personality affects IQ. Suppose your study includes 5 measures of personality. Suppose your study includes 2 measures of IQ: verbal and spatial.

In this example, you would undertake two regressions. For the first regression, the dependent variable would be verbal IQ. The independent variables would be the five personality measures. For the second regression, the dependent variable would be spatial IQ. The independent variables would again be the five personality measures.

In short, for each regression, you need to specify the dependent variable. Then, you need to specify the independent variables. Sometimes, this step is simple; sometimes, this step demands some creativity. Mediated and moderated models will be discussed later.

Step 7. Undertake your first regression, partly as practice

To undertake your first regression analysis, copy and paste the following syntax:

 REGRESSION
  /MISSING PAIRWISE
  /STATISTICS COEFF OUTS R ANOVA
  /NOORIGIN
  /DEPENDENT verbaliq
  /METHOD = ENTER extravrt neurot openness
  /SAVE PRED COOK RESID.
 DESCRIPTIVES VARIABLES=res_1
  /STATISTICS=SKEWNESS KURTOSIS.
 GRAPH
  /SCATTERPLOT(BIVAR)=pre_1 WITH res_1
  /MISSING=LISTWISE.
• Replace 'verbaliq' with the name of your dependent variable, such as 'anxiety'.
• Replace 'extravrt neurot openness' with the names of your independent variables

Before you examine the output of this regression, you need to assess the assumptions:

• Switch to the data file, and locate the column labeled Coo_1
• These values represent Cook's distances; values that exceed 1, or that are appreciably greater than almost all the other Cook's distances, represent influential cases--individuals who greatly affect the output.
• These individuals should be deleted, and the regression should be executed again.
• However, before you execute a regression analysis again, delete the columns Coo_1, Res_1, and Pre_1

Second, after you remove these influential cases, you need to examine whether the residuals--stored in the column labeled Res_1--are normally distributed.

• In the output, locate the skewness and kurtosis of the residuals.
• Divide the skewness by the corresponding standard error
• Divide the kurtosis by the corresponding standard error
• If neither ratio exceeds about 3.29 in absolute value, the residuals are normal at an alpha of .001
• You can conclude "No consequential departure from the assumption of normal residuals was observed".
• If either ratio exceeds 3.29 in absolute value, the residuals are not normal.
• If so, use a more conservative alpha, such as .025 or even .01
• That is, findings are significant only if p < .025 or p < .01.
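The skewness check above can be sketched in Python. Note that the standard error used here, sqrt(6/n), is a common approximation (the kurtosis check is analogous, with standard error sqrt(24/n)); SPSS's exact formulas differ slightly in small samples:

```python
from math import sqrt
from statistics import pstdev

def skewness_z(residuals):
    """Skewness of the residuals divided by its approximate standard
    error, sqrt(6/n). Ratios beyond about 3.29 in absolute value
    suggest non-normality at an alpha of .001. (The kurtosis check
    is analogous, with standard error sqrt(24/n).)"""
    n = len(residuals)
    m = sum(residuals) / n
    sd = pstdev(residuals)
    skew = sum(((r - m) / sd) ** 3 for r in residuals) / n
    return skew / sqrt(6 / n)

# A symmetric set of residuals has essentially zero skewness
print(skewness_z([-2, -1, -1, 0, 0, 0, 1, 1, 2]))   # essentially zero
# A long right tail produces a clearly positive ratio
print(skewness_z([1, 1, 1, 1, 1, 1, 1, 1, 5]))      # positive
```

In practice you read these two numbers straight from the DESCRIPTIVES output for res_1 and divide by the reported standard errors.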

Third, you should examine the assumptions of homoscedasticity and linearity.

• The scatterplot that appears can be utilized to examine linearity and homoscedasticity
• If the scatterplot yields a U or inverted U shape, linearity might be violated
• You might need to reconsider the model or transform variables in some way
• If the scatterplot yields a fan shape, homoscedasticity might be violated
• That is, the points might become more dispersed at one end of the x axis
• This violation is not especially consequential, but might indicate that some moderator should be included.
• If so, use a more conservative alpha, such as .025 or even .01
• That is, findings are significant only if p < .025 or p < .01.

Finally, examine the output of the regression. Specifically, examine the ANOVA table to ascertain whether the R is significant. Examine the coefficients table to ascertain which variables are significant.

Once you have completed the first regression, switch to the data file. Delete the columns Coo_1, Res_1, and Pre_1. Ensure that you delete these columns after each regression. Finally, you can conduct other regression analyses as well.

Step 8. Undertake logistic regression

Sometimes, your dependent variable is dichotomous, such as gender. In these instances, you should undertake logistic regression rather than multiple regression. Copy and paste the following syntax:

 LOGISTIC REGRESSION VARIABLES gender
  /METHOD=ENTER height weight
  /CRITERIA=PIN(.05) POUT(.10) ITERATE(20) CUT(.5).
• Replace 'gender' with the name of your dependent variable, which should comprise two categories.
• Replace 'height weight' with the names of your independent variables, which should be numerical or dichotomous
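Logistic regression models the log odds that a case falls into category 1; the coefficients in the output can be converted into a predicted probability, which the CUT(.5) criterion then classifies. A Python sketch with hypothetical coefficients:

```python
from math import exp

def predicted_probability(intercept, coefs, values):
    """Convert the linear predictor of a logistic regression into the
    predicted probability of category 1."""
    log_odds = intercept + sum(b * x for b, x in zip(coefs, values))
    return 1 / (1 + exp(-log_odds))

# Hypothetical coefficients for height and weight; log odds = -1.5
p = predicted_probability(-40.0, [0.2, 0.05], [175, 70])
print(round(p, 3))  # 0.182 -- below CUT(.5), so classified as category 0
```

The intercept and coefficients here are invented for illustration; in practice they come from the "Variables in the Equation" table of the SPSS output.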

Step 9. Conduct moderated regression analyses

If you want to examine whether one numerical variable moderates or changes the relationship between two other variables, moderated regression is often useful. For example, age might affect the relationship between personality and IQ. To undertake moderated regression analyses, first ascertain the mean of all your independent variables and moderators. In this example, the researcher would compute the means of personality and age. Specifically, copy and paste the following syntax, but replace 'extravrt neurot openness age' with the names of your independent variables and moderators.

 DESCRIPTIVES VARIABLES= extravrt neurot openness age
  /STATISTICS=MEAN.

Next, create new columns that, in essence, combine each independent variable with each moderator. In this example, the researcher would create one column that combines extravrt with age. The researcher would then create another column that combines neurot with age, and so forth. That is, copy and paste the following syntax.

 COMPUTE ext_x_age = (extravrt - 1.76)*(age - 1.45).
 COMPUTE neu_x_age = (neurot - 2.46)*(age - 1.45).
 COMPUTE ope_x_age = (openness - 2.54)*(age - 1.45).
 EXECUTE.
• In this syntax, the label after 'COMPUTE' is simply a name for the new column
• The numbers reflect the mean of the corresponding column. For example, 1.45 represents the mean of age
• Replace extravrt with the name of an independent variable; replace age with the name of your moderator
• Replace 1.76 with the mean of this independent variable; replace 1.45 with the mean of your moderator
• The number of COMPUTE lines should equal the number of independent variables multiplied by the number of moderators
• Change the other lines to include your other independent variables or moderators.
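The COMPUTE lines above mean-center each variable and multiply the results. A Python sketch of the same arithmetic, with hypothetical scores:

```python
def centered_products(iv, moderator):
    """Mean-center an independent variable and a moderator, then
    multiply them to form the product (interaction) term."""
    iv_mean = sum(iv) / len(iv)
    mod_mean = sum(moderator) / len(moderator)
    return [(x - iv_mean) * (z - mod_mean) for x, z in zip(iv, moderator)]

# Hypothetical extraversion and age scores for four participants
extravrt = [1, 2, 4, 3]
age = [1, 2, 2, 1]
print(centered_products(extravrt, age))  # [0.75, -0.25, 0.75, -0.25]
```

Centering before multiplying reduces the correlation between the product term and its constituent variables, which is why the means are subtracted first (Aiken & West, 1991).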

Finally, include these new columns--called products or interactions--into the regression after the ENTER command. Whenever you include these products, also include the constituent independent variables and moderators. Copy and paste the following example.

 REGRESSION
  /MISSING PAIRWISE
  /STATISTICS COEFF OUTS R ANOVA
  /NOORIGIN
  /DEPENDENT verbaliq
  /METHOD = ENTER extravrt neurot openness age ext_x_age neu_x_age ope_x_age
  /SAVE PRED COOK RESID.
 DESCRIPTIVES VARIABLES=res_1
  /STATISTICS=SKEWNESS KURTOSIS.
 GRAPH
  /SCATTERPLOT(BIVAR)=pre_1 WITH res_1
  /MISSING=LISTWISE.
• Instead of extravrt neurot openness age ext_x_age neu_x_age ope_x_age, include the name of your products, together with the constituent independent variables and moderators.
• Execute the analysis.
• Suppose a product term is significant. For example, suppose ext_x_age is significant.
• This finding suggests age affects the relationship between extraversion and IQ (Aiken & West, 1991)
• You would then need to create a graph to explore this finding further.

Sometimes, these moderated regression analyses include too many variables. If so, you could examine each independent variable or moderator separately. In this instance, you could examine extravrt, neurot, and openness separately.

Step 10. Conduct mediation analyses

Two main techniques can be undertaken to examine mediation (for more information on these techniques, see Mediation analyses; for more information on analyses with multiple mediators, see Mediation analyses with multiple mediators).

The most common, but not necessarily the most effective, technique is called the Baron and Kenny approach or the causal steps approach (see Baron & Kenny, 1986). For example, suppose your study includes the following hypothesis: The association between personality and IQ is mediated by anxiety and depression. To assess this hypothesis, you need to undertake three to four steps.

First, show the independent variables are associated with the dependent variables. You could undertake two regression analyses:

• Regression 1: Dependent = verbaliq; Independents = extravrt neurot openness
• Regression 2: Dependent = spatiaiq; Independents = extravrt neurot openness
• Utilize the instructions in Section 7 to conduct these regression analyses

Second, show the independent variables are associated with the mediators

• Regression 1: Dependent = anxiety; Independents = extravrt neurot openness
• Regression 2: Dependent = depressn; Independents = extravrt neurot openness
• Utilize the instructions in Section 7 to conduct these regression analyses

Third, show the mediators are associated with the dependent variables, after controlling the independent variables

• Regression 1: Dependent = verbaliq; Independents = extravrt neurot openness anxiety depressn
• Regression 2: Dependent = spatiaiq; Independents = extravrt neurot openness anxiety depressn
• Utilize the instructions in Section 7 to conduct these regression analyses

Finally, show the independent and dependent variables are unrelated once you control the mediators

• To assess this hypothesis, you undertake a hierarchical regression
• The first step includes the mediators; the second step includes the independent variables
• Regression 1: Dependent = verbaliq; Step 1 = anxiety depressn; Step 2 = extravrt neurot openness
• Regression 2: Dependent = spatiaiq; Step 1 = anxiety depressn; Step 2 = extravrt neurot openness
• Utilize the instructions in Section 7 to conduct these regression analyses

Copy and paste the syntax below. You should simply replace 'anxiety depressn' with your mediators, and 'extravrt neurot openness' with your independent variables.

 REGRESSION
  /MISSING PAIRWISE
  /STATISTICS COEFF OUTS R ANOVA CHANGE
  /NOORIGIN
  /DEPENDENT verbaliq
  /METHOD = ENTER anxiety depressn
  /METHOD = ENTER extravrt neurot openness
  /SAVE PRED COOK RESID.
 DESCRIPTIVES VARIABLES=res_1
  /STATISTICS=SKEWNESS KURTOSIS.
 GRAPH
  /SCATTERPLOT(BIVAR)=pre_1 WITH res_1
  /MISSING=LISTWISE.
 REGRESSION
  /MISSING PAIRWISE
  /STATISTICS COEFF OUTS R ANOVA CHANGE
  /NOORIGIN
  /DEPENDENT spatiaiq
  /METHOD = ENTER anxiety depressn
  /METHOD = ENTER extravrt neurot openness
  /SAVE PRED COOK RESID.
 DESCRIPTIVES VARIABLES=res_1
  /STATISTICS=SKEWNESS KURTOSIS.
 GRAPH
  /SCATTERPLOT(BIVAR)=pre_1 WITH res_1
  /MISSING=LISTWISE.

The CHANGE command in the third row yields some vital output. Specifically, locate the p value associated with F change in the second step. If this value is not significant, the independent variables were not significantly related to the dependent variable after the mediators are controlled. In other words, full mediation is operating.

If this value is significant, the independent variables were significantly related to the dependent variable even after the mediators are controlled. In other words, full mediation is not operating.

References

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Thousand Oaks: Sage.

Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173-1182.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory. New York: McGraw-Hill.