
Expectation maximization--to manage missing data

Author: Dr Simon Moss

Overview

Expectation maximization is an effective technique that is often used in data analysis to manage missing data (for further discussion, see Schafer, 1997; Schafer & Olsen, 1998). Indeed, expectation maximization overcomes some of the limitations of other techniques, such as mean substitution or regression substitution. These alternative techniques generate biased estimates and, specifically, underestimate the standard errors. Expectation maximization overcomes this problem.

Execution using SPSS

Many statistical packages can now implement expectation maximization. To execute this technique with SPSS:

  • Choose Missing Value Analysis from the Analyze menu.
  • Transfer all numerical variables that are related to the study or issue into the box labelled Quantitative Variables. Exclude irrelevant variables, such as ID.
  • Transfer all categorical variables that are related to the study or issue into the box labelled Categorical Variables.
  • Select the EM option.
  • Press the EM button, and select Save completed data.
  • Choose Write a new data file. Press File and type a filename.
  • Open this new file, which should include the original data together with estimates of the missing values.

    Rationale underpinning Expectation Maximization

    Iteration process

    To illustrate expectation maximization, consider the following extract of data. Missing values are observed for depression, age, and height.

    ID     depression    age          height       wage
    1      5             32           (missing)    32,010
    2      (missing)     17           173          31,600
    3      7             (missing)    169          48,020
    4      5             24           186          17,400
    .      .             .            .            .
    .      .             .            .            .
    100    4             45           201          7,800

    To undertake expectation maximization, the software package, such as SPSS, executes the following steps. First, the means, variances, and covariances are estimated from the individuals whose data are complete. In particular, the computer would generate the following information:

    • The means of depression, age, height, and wage are 4.71, 37.50, 183.21, and 45504.43 respectively.
    • These values were derived only from the individuals whose data are complete. Rows 1, 2, and 3, for example, were disregarded.
    • Similarly, the variances of depression, age, height, and wage are 3.55, 9.43, 194.43, and 14403.12 respectively and appear on the diagonal of the table.
    • Finally, the other numbers represent the covariance between each pair of variables, which is merely the correlation multiplied by the standard deviations of both variables. Hence, covariances are similar to correlations but do not range from -1 to 1.

                  depression    age          height      wage
    depression    3.55
    age           7.42          9.43
    height        184.42        1643.32      194.43
    wage          43042.345     143254.43    14425.54    14403.12
    Mean          4.71          37.50        183.21      45504.43
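The relationship between covariance and correlation described above can be checked with a few lines of Python. This is only a minimal sketch: the ages and heights are hypothetical values chosen for illustration, not the article's data, and the helper names `covariance` and `correlation` are assumptions of this example.

```python
from statistics import fmean, stdev

def covariance(xs, ys):
    """Sample covariance of two equally long lists."""
    mx, my = fmean(xs), fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation, computed as the average product of z-scores."""
    mx, my, sx, sy = fmean(xs), fmean(ys), stdev(xs), stdev(ys)
    products = (((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)

# Hypothetical complete cases (not taken from the article's data set).
ages = [32, 17, 24, 45, 38]
heights = [181, 173, 186, 201, 178]

cov = covariance(ages, heights)
r = correlation(ages, heights)

# Covariance equals the correlation multiplied by both standard deviations.
rebuilt = r * stdev(ages) * stdev(heights)
print(cov, rebuilt, r)
```

The two printed covariances agree up to rounding, whereas the correlation is confined to the range from -1 to 1.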

    Second, maximum likelihood procedures, a special class of formulas, are used to estimate regression equations that relate each variable to each other variable. For example, these procedures might generate the formulas:

    • Depression = -15.3 + .01 x age + .004 x height + .0005 x wage
    • Age = 7.3 + .34 x depression + .002 x height + .0003 x wage
    • Height = 19.2 + .53 x depression + .021 x age + .0004 x wage
    • Wage = 7.3 + .44 x depression + .031 x age + .0021 x height

    The maximum likelihood procedures are designed to ensure these formulas predict the means, variances, and covariances more accurately than any other formulas (see Dempster, Laird, & Rubin, 1977). That is, suppose the researcher could calculate the probability of generating these means, variances, and covariances if these equations were correct. Suppose the probability is approximately .00004. Any other formulas would generate lower probabilities.
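The idea that maximum likelihood estimates produce a higher probability than any alternative can be illustrated in a deliberately simplified setting. The sketch below assumes a single normally distributed variable rather than a set of regression equations, and the depression scores are hypothetical. It compares the log-likelihood at the maximum likelihood estimates with the log-likelihood at two other sets of values.

```python
from math import log, pi
from statistics import fmean, pvariance

def normal_log_likelihood(data, mean, variance):
    """Log-likelihood of the data under a normal distribution."""
    n = len(data)
    squared_error = sum((x - mean) ** 2 for x in data)
    return -0.5 * n * log(2 * pi * variance) - squared_error / (2 * variance)

# Hypothetical depression scores, for illustration only.
scores = [5, 7, 5, 4, 6, 3, 5]

# For a normal distribution, the maximum likelihood estimates are the
# sample mean and the variance computed with n (not n - 1) in the denominator.
ml_mean, ml_variance = fmean(scores), pvariance(scores)

best = normal_log_likelihood(scores, ml_mean, ml_variance)
shifted = normal_log_likelihood(scores, ml_mean + 0.5, ml_variance)
inflated = normal_log_likelihood(scores, ml_mean, ml_variance * 1.5)
print(best, shifted, inflated)
```

Any other choice of mean or variance yields a lower log-likelihood than `best`, which is the sense in which other formulas "would generate lower probabilities".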

    Third, these formulas can be used to estimate the missing values. To illustrate:

    • Consider the equation Depression = -15.3 + .01 x age + .004 x height + .0005 x wage.
    • This equation can then be used to estimate the depression for individuals who did not provide this information.
    • For the second case, 17, 173, and 31600 would be substituted into this equation.
    • Depression for this person would be 1.362.
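The substitution for the second case can be reproduced directly. The function name `predict_depression` is an assumption of this sketch; the coefficients and input values are the ones quoted above.

```python
def predict_depression(age, height, wage):
    """Illustrative regression equation quoted in the text."""
    return -15.3 + 0.01 * age + 0.004 * height + 0.0005 * wage

# Case 2: age 17, height 173, wage 31,600; depression is missing.
estimate = predict_depression(age=17, height=173, wage=31600)
print(round(estimate, 3))  # 1.362
```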

    The same process can be used to estimate the missing values associated with the other variables. This process could generate the following data. The estimated values are marked with an asterisk.

    ID     depression    age       height     wage
    1      5             32        181.43*    32,010
    2      1.362*        17        173        31,600
    3      7             19.53*    169        48,020
    4      5             24        186        17,400
    .      .             .         .          .
    .      .             .         .          .
    100    4             45        201        7,800

    Using these data, the means, variances, and covariances are then estimated again. As the following table shows, these estimates might change slightly because more data is included.

                  depression    age          height      wage
    depression    3.35
    age           7.72          10.01
    height        182.42        1743.82      194.41
    wage          43019.315     125254.93    15125.51    14353.11
    Mean          4.91          37.87        179.29      45504.45

    The regression equations are then calculated again, using maximum likelihood procedures. These equations might now be marginally different:

    • Depression = -14.9 + .03 x age + .005 x height + .0010 x wage
    • Age = 7.1 + .24 x depression + .001 x height + .0002 x wage
    • Height = 20.2 + .61 x depression + .022 x age + .0003 x wage
    • Wage = 8.3 + .49 x depression + .041 x age + .0022 x height

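    These revised equations are then used to estimate the missing values again. The estimated values are marked with an asterisk.

    ID     depression    age       height     wage
    1      5             32        182.93*    32,010
    2      1.291*        17        173        31,600
    3      7             20.01*    169        48,020
    4      5             24        186        17,400
    .      .             .         .          .
    .      .             .         .          .
    100    4             45        201        7,800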

    This sequence of processes (the calculation of means, variances, and covariances; the formulation of regression equations; and the estimation of missing values) is undertaken iteratively. By default, SPSS repeats this process up to 25 times, stopping earlier if the estimates change only negligibly. This default can be increased, however.
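The iterative loop described above can be sketched in Python for the simplest non-trivial case: two variables, each with some missing values. This is a simplified illustration under stated assumptions, not the procedure SPSS implements. It uses ordinary regression imputation on hypothetical data and omits the adjustment to the variances and covariances that a full expectation maximization routine applies. The names `moments` and `iterative_impute`, and the tolerance of 1e-6, are assumptions of this example; the limit of 25 iterations mirrors the default mentioned above.

```python
from statistics import fmean

def moments(xs, ys):
    """Return (mean_x, mean_y, var_x, var_y, cov_xy) for paired lists."""
    mx, my = fmean(xs), fmean(ys)
    n = len(xs)
    vx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    vy = sum((y - my) ** 2 for y in ys) / (n - 1)
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return mx, my, vx, vy, cxy

def iterative_impute(rows, max_iter=25, tol=1e-6):
    """rows holds (x, y) pairs; None marks a missing value (never both)."""
    # Step 1: initial estimates from the complete cases only.
    complete = [(x, y) for x, y in rows if x is not None and y is not None]
    est = moments([x for x, _ in complete], [y for _, y in complete])
    filled, n_iter = list(rows), 0
    for n_iter in range(1, max_iter + 1):
        mx, my, vx, vy, cxy = est
        # Steps 2 and 3: regress each variable on the other, then impute.
        filled = [
            (mx + cxy / vy * (y - my) if x is None else x,
             my + cxy / vx * (x - mx) if y is None else y)
            for x, y in rows
        ]
        # Re-estimate the means, variances, and covariance from all rows.
        new_est = moments([x for x, _ in filled], [y for _, y in filled])
        change = max(abs(new - old) for new, old in zip(new_est, est))
        est = new_est
        if change < tol:  # the estimates change only negligibly
            break
    return filled, est, n_iter

# Hypothetical (age, height) pairs, for illustration only.
rows = [(32, 181), (17, 173), (None, 169), (24, 186),
        (45, None), (38, 178), (29, None), (None, 190)]
filled, est, n_iter = iterative_impute(rows)
print(n_iter, filled)
```

Observed values are never altered; only the entries that were originally `None` are replaced on each pass.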

    Error distribution

    One problem with many techniques designed to estimate missing values is that the standard errors are underestimated. To illustrate, consider the estimate of depression for the second person. This estimate was derived from his or her age, height, and wage. Suppose many estimates of depression were derived from the age, height, and wage of the participants. As a consequence, the extent to which depression is related to age, height, and wage would be overestimated.

    This process, therefore, disregards the possibility that depression depends on more than age, height, and wage. Many other random factors could increase or decrease these estimates of depression.
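The overestimation described above can be demonstrated with a small sketch. It assumes a single predictor (age) and hypothetical data: missing depression scores are filled in from a regression line with no random error added, and the correlation between age and depression is compared before and after. The helper names `pearson_r` and `fit_line` are assumptions of this example.

```python
from statistics import fmean

def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def fit_line(xs, ys):
    """Least-squares intercept and slope for predicting ys from xs."""
    mx, my = fmean(xs), fmean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

# Hypothetical complete cases, plus ages of people with no depression score.
ages_complete = [24, 32, 45, 38, 29, 51]
depression_complete = [3, 4, 6, 5, 4, 7]
ages_missing_depression = [17, 60, 41]

# Imputed scores fall exactly on the regression line: no random error.
intercept, slope = fit_line(ages_complete, depression_complete)
imputed = [intercept + slope * age for age in ages_missing_depression]

r_before = pearson_r(ages_complete, depression_complete)
r_after = pearson_r(ages_complete + ages_missing_depression,
                    depression_complete + imputed)
print(r_before, r_after)
```

Because the imputed points carry no scatter around the line, the correlation after imputation is stronger than the correlation in the complete cases.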

    Therefore, to ensure the estimates are more realistic, the software package introduces some error to the variances and covariances. That is, rather than generate the table...

                  depression    age          height      wage
    depression    3.35
    age           7.72          10.01
    height        182.42        1743.82      194.41
    wage          43019.315     125254.93    15125.51    14353.11
    Mean          4.91          37.87        179.29      45504.45

    ...from the data, some of these values are modified slightly to...

                  depression    age          height      wage
    depression    3.34
    age           7.71          10.00
    height        182.41        1743.81      194.42
    wage          43019.312     125254.94    15125.53    14353.10
    Mean          4.91          37.87        179.29      45504.45

    As this table shows, the modification is minor, affecting only the final decimal place in this example. By default, in SPSS, these errors follow a normal distribution. Other alternatives, however, can be specified, such as mixed normal and Student's t, both of which require specification of some parameter.

    Applicability of expectation maximization

    Pattern of missing data

    Expectation maximization is applicable whenever the data are missing completely at random or missing at random, but unsuitable when the data are not missing at random. To illustrate, consider the following extract of data. Conceivably, individuals who do not answer questions about depression tend to be very depressed. In other words, the likelihood of missing data on this variable is related to their level of depression.

    In addition, individuals who do not answer questions about depression might be older, because the stigma of this affective disorder might be more potent in an older generation. Thus, the likelihood of missing data on depression is related to their age.

    ID     depression    age          height       wage
    1      5             32           (missing)    32,010
    2      (missing)     17           173          31,600
    3      7             (missing)    169          48,020
    4      5             24           186          17,400
    .      .             .            .            .
    .      .             .            .            .
    100    4             45           201          7,800

    Suppose the missing data on one variable, such as depression, are unrelated to the actual level of that variable and to the level of the other measured variables, such as age, height, or wage. In this instance, researchers designate the data as missing completely at random, and expectation maximization is applicable.

    Suppose, instead, that missing data on one variable, such as depression, are related to the level of the other measured variables, such as age, height, or wage. However, once these variables are controlled, suppose that missing data on depression are unrelated to the actual level of depression. For example, perhaps individuals who do not answer questions about depression tend to be older. Once age is controlled, however (analogous to examining one age group only), missing data on depression might be unrelated to depression. In this instance, researchers designate the data as missing at random, rather than missing completely at random, and expectation maximization is still applicable.

    Sometimes, however, missing data on one variable, such as depression, are still related to scores on that variable after the other factors are controlled. That is, the most depressed individuals might be least likely to answer, even among individuals of the same age, height, and wage. In this instance, researchers designate the data as not missing at random, and expectation maximization is no longer applicable.
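The three patterns of missing data can be contrasted with a small simulation. This is a sketch under stated assumptions: the records are randomly generated rather than real, the probabilities of 0.3, 0.6, and 0.1 and the cut-offs of age 50 and depression 7 are arbitrary illustrative choices, and the labels MCAR, MAR, and MNAR abbreviate missing completely at random, missing at random, and not missing at random.

```python
import random

def delete_depression(records, mechanism, rng):
    """Return depression scores with some replaced by None."""
    masked = []
    for depression, age in records:
        if mechanism == "MCAR":
            p_missing = 0.3                              # same for everyone
        elif mechanism == "MAR":
            p_missing = 0.6 if age >= 50 else 0.1        # depends on observed age
        elif mechanism == "MNAR":
            p_missing = 0.6 if depression >= 7 else 0.1  # depends on the hidden score
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        masked.append(None if rng.random() < p_missing else depression)
    return masked

def missing_rate(masked, keep):
    """Proportion of missing scores among the rows flagged in keep."""
    flags = [score is None for score, kept in zip(masked, keep) if kept]
    return sum(flags) / len(flags)

rng = random.Random(0)
# Hypothetical (depression, age) records.
records = [(rng.randint(1, 10), rng.randint(18, 70)) for _ in range(2000)]

mnar = delete_depression(records, "MNAR", rng)
high = [depression >= 7 for depression, _ in records]
low = [depression < 7 for depression, _ in records]
rate_high = missing_rate(mnar, high)
rate_low = missing_rate(mnar, low)
print(rate_high, rate_low)
```

Under the MNAR setting, the most depressed individuals are missing far more often than the rest. In practice, this difference could not be computed, because the scores needed to form the two groups are precisely the ones that are missing.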

    Establishing missing completely at random

    Several procedures can be undertaken to establish whether the data are missing completely at random, missing at random, or not missing at random. First, for each variable, researchers can assess whether the other variables differ between individuals who responded to that variable and individuals who did not.

    For example, a series of t-tests or a logistic regression analysis can be undertaken to assess whether individuals who generated responses on the depression scale and individuals who did not generate responses on the depression scale differ on age, height, or wage. Non-significant findings indicate that, perhaps, missing data on this variable are random; otherwise, at least one variable should differ between individuals who responded to this variable and individuals who did not respond to this variable.
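One of the comparisons described above can be sketched as a separate-variance (Welch) t test on age, contrasting individuals with and without a depression score. The records are hypothetical and the function name `welch_t` is an assumption of this example. The sketch returns only the t statistic and its degrees of freedom; converting these to a p value requires a t distribution from a statistics package, which is left out here.

```python
from math import sqrt
from statistics import fmean, variance

def welch_t(group_a, group_b):
    """Return (t, df) for a separate-variance (Welch) t test."""
    na, nb = len(group_a), len(group_b)
    va, vb = variance(group_a) / na, variance(group_b) / nb
    t = (fmean(group_a) - fmean(group_b)) / sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical (depression, age) records; None marks a missing score.
records = [(5, 32), (None, 61), (7, 29), (5, 24),
           (None, 58), (4, 45), (6, 37), (None, 66)]

age_missing = [age for depression, age in records if depression is None]
age_present = [age for depression, age in records if depression is not None]

t, df = welch_t(age_missing, age_present)
print(t, df)
```

A large absolute t value, as in this contrived example, suggests that missing depression scores are related to age, so the data would not be missing completely at random.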

    Similarly, using SPSS or other packages, researchers can compute Little's MCAR test. A non-significant finding is consistent with the assumption that the data are missing completely at random, and hence expectation maximization is applicable. To conduct this test, undertake expectation maximization as usual, and the test will appear by default.

    Establishing missing at random

    If the data are not missing completely at random, they might nevertheless be missing at random. To establish this possibility, undertake expectation maximization as usual. Proceed to the table labelled Separate Variance t Tests. If all the p values exceed .05 or alpha, the data are missing at random. Expectation maximization is thus warranted.

    References

    Allison, P. D. (2001). Missing data. Thousand Oaks, CA: Sage Publications.

    Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum.

    Dempster, A., Laird, N., & Rubin, D. (1977). Likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

    Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.

    Jones, M. P. (1996). Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association, 91, 222-230.

    Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91, 473-489.

    Scheuren, F. (2005). Multiple imputation: How it began and continues. The American Statistician, 59, 315-319.

    Schafer, J. L. (1997). Analysis of incomplete multivariate data. London: Chapman & Hall. Book No. 72, Chapman & Hall series Monographs on Statistics and Applied Probability.

    Schafer, J. L. (1999). Multiple imputation: A primer. Statistical Methods in Medical Research, 8, 3-15.

    Schafer, J. L. & Olsen, M. K. (1998). Multiple imputation for multivariate missing-data problems: A data analyst's perspective. Multivariate Behavioral Research, 33, 545-571.








    Last Update: 6/26/2016