
Properties of excellent measures

Author: Dr Simon Moss

Measurement validity

To substantiate the legitimacy or utility of a study, researchers need to show their measures are valid. They need to demonstrate their measures of performance, personality, cognition, and so forth are suitable and effective. Usually, this property is called measurement validity. Sometimes, this property is called construct validity, although this term is sometimes reserved for a more specific characteristic.

Traditionally, to validate measures, most researchers rely on establishing content validity and examining correlations between the instrument under investigation and alternative scales or variables. Recently, however, researchers have begun to consider other criteria, such as structural fidelity, process engagement, generalizability, and consequences.

Content validity

To substantiate the utility of measures, researchers must establish content validity (see Messick, 1995). In particular, they must first establish the boundaries or scope of this construct. For example, to develop a measure of work engagement, researchers must decide whether this construct is merely an emotional state--related to enthusiasm and excitement--or also entails cognitive properties, such as a tendency to underestimate the passage of time, as well as behavioral manifestations, such as working extensively without breaks.

Second, and related to this goal, researchers must delineate the various facets of this construct. They might decide that engagement comprises two main facets: low levels of distraction and high levels of energy. In addition, these two facets could be subdivided into three more specific elements: affective, cognitive, and behavioral manifestations. Usually, a variety of activities--such as discussions with experts, observations of employees, examinations of documents, as well as an analysis of theories--are undertaken to establish the boundaries and facets of this construct.

After these activities are completed, researchers need to ensure the measure represents each of these facets. Furthermore, some facets are more important than others. For example, perhaps an energetic mood is more inclined to affect key outcomes, such as productivity or commitment to the organization. Researchers should ensure these important facets affect the score on this measure more than do the unimportant facets, a principle called ecological sampling (Brunswik, 1956).
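To illustrate, the following sketch shows one simple way to honor this principle: weighting facets by importance when computing an overall score. The facet scores and weights here are hypothetical, chosen only for illustration; in practice, the weights would be derived from evidence about which facets most affect key outcomes.

    import numpy as np

    # Hypothetical facet scores for one respondent on a 1-7 rating scale:
    # affective, cognitive, and behavioral manifestations of engagement.
    facet_scores = np.array([5.2, 4.8, 6.1])

    # Hypothetical importance weights; a facet that is more inclined to
    # affect key outcomes, such as an energetic mood, receives more weight.
    weights = np.array([0.5, 0.3, 0.2])

    # Weighted composite: important facets affect the score more than do
    # the unimportant facets, consistent with ecological sampling.
    engagement = np.dot(weights, facet_scores) / weights.sum()
    print(round(engagement, 2))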

Correlations with other variables and measures: Convergent validity

To verify the validity of measures, researchers examine and report the correlations between their instrument and other procedures or variables (Messick, 1995). These correlations should align with hypotheses derived from theories.

For example, suppose a researcher wants to construct a measure that predicts depression subtly rather than overtly. To measure depression, participants are asked to describe one pleasant event and one unpleasant event, such as their last day at school, as well as their reactions to these two episodes. The number of times they use the words "I", "me", or "my" is determined. The number of times they use the words "we", "us", and "our" or refer to other persons is also calculated. The difference between these calculations represents a measure of depression and loneliness. That is, individuals who use the words "I", "me", or "my"--and neither utilise the terms "we", "us", and "our" nor refer to other persons--are susceptible to loneliness and depression (Pennebaker & Stone, 2003).
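This scoring procedure is straightforward to implement. The following Python sketch computes the difference score from one hypothetical narrative; for simplicity, it counts only the pronouns listed above and omits references to other persons.

    import re

    # Hypothetical narrative from one participant.
    text = "I remember my last day at school. My friends and I celebrated, and we laughed."

    words = re.findall(r"[a-z']+", text.lower())
    singular = {"i", "me", "my"}   # self-focused pronouns
    plural = {"we", "us", "our"}   # other-focused pronouns

    n_singular = sum(word in singular for word in words)
    n_plural = sum(word in plural for word in words)

    # Higher difference scores imply a more self-focused narrative, taken
    # here as a subtle index of depression and loneliness.
    print(n_singular - n_plural)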

To establish the validity of this measure, researchers should conduct studies to show that:

  • Scores on this measure correlate with scores on other measures of depression or loneliness
  • Scores on this measure correlate with scores on variables that should be positively or negatively related to depression, such as anxiety, self-esteem, and engagement.

When researchers demonstrate the instrument under examination correlates, either positively or negatively, with other measures of this variable--or with other variables that should be related--they establish a criterion that is often called convergent validity (Campbell & Fiske, 1959).
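To illustrate this criterion, the following sketch computes the correlation between a new measure and an established instrument. The data are simulated, not drawn from any published study; with real data, the two sets of scores would come from the same participants.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated scores on the new measure and on an established
    # depression inventory for 100 participants.
    new_measure = rng.normal(size=100)
    established = 0.6 * new_measure + rng.normal(scale=0.8, size=100)

    # Convergent validity: two measures of the same construct should
    # correlate substantially, in the direction theory predicts.
    r = np.corrcoef(new_measure, established)[0, 1]
    print(f"convergent correlation: r = {r:.2f}")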

Often, the instrument is constructed for a specific purpose. For example, this measure of depression might be constructed to predict which employees will depart from the organization prematurely. Accordingly, researchers need to measure depression at one time and then assess which employees departed prematurely a year or so later.

When researchers demonstrate the instrument under examination correlates with this key outcome, they establish a standard that is sometimes called criterion validity. Convergent validity and criterion validity are sometimes used interchangeably. However, the term criterion validity primarily applies when the measure correlates with a key outcome--the outcome it was intended to predict.
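The following sketch illustrates this standard with simulated data: depression scores at baseline, and a binary record of which employees departed prematurely a year later. With a binary criterion, the Pearson formula yields the point-biserial correlation.

    import numpy as np

    rng = np.random.default_rng(1)

    # Simulated baseline depression scores and turnover status a year
    # later (1 = departed prematurely, 0 = remained).
    depression = rng.normal(size=200)
    departed = (depression + rng.normal(size=200) > 1.0).astype(int)

    # Criterion validity: the measure should correlate with the key
    # outcome it was constructed to predict.
    r = np.corrcoef(depression, departed)[0, 1]
    print(f"criterion correlation: r = {r:.2f}")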

Correlations with other variables and measures: Discriminant validity

In addition to convergent validity, researchers also need to establish discriminant validity (Campbell & Fiske, 1959). That is, they need to show their measure is distinct from other variables or constructs.

To illustrate, perhaps this measure does not reflect depression and loneliness, but represents social anxiety, introversion, or many other variables. To reject these possibilities, and thus to establish discriminant validity, researchers need to conduct studies to show that:

  • Scores on this measure do not correlate too highly with scores on instruments that reflect social anxiety, introversion, or other variables. Correlations that exceed .7, even after corrections for random error, might be excessive.
  • Scores on this measure correlate with other key outcomes, even after social anxiety, introversion, or other variables are controlled. This finding would reveal the measure is useful in its own right, even if highly related to other variables. Multiple regression analysis, discriminant function analysis, or canonical correlation might be conducted to establish this criterion, as sketched below.
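The following sketch illustrates both checks with simulated scores. The rival constructs, coefficients, and sample size are arbitrary; only the logic of the checks matters.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 200

    # Simulated scores: the new measure, two rival constructs, and an outcome.
    social_anxiety = rng.normal(size=n)
    introversion = rng.normal(size=n)
    new_measure = 0.5 * social_anxiety + rng.normal(size=n)
    outcome = 0.4 * new_measure + 0.2 * social_anxiety + rng.normal(size=n)

    # Check 1: the measure should not correlate too highly (e.g., above .7)
    # with instruments that reflect rival constructs.
    r = np.corrcoef(new_measure, social_anxiety)[0, 1]
    print(f"correlation with social anxiety: r = {r:.2f}")

    # Check 2: multiple regression in which the rival constructs are
    # controlled; a substantial coefficient for the new measure shows it
    # predicts the outcome over and above these rivals.
    X = np.column_stack([np.ones(n), social_anxiety, introversion, new_measure])
    coefs, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    print(f"unique contribution of new measure: b = {coefs[3]:.2f}")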

Process engagement

To substantiate the validity of measures, researchers should accrue evidence to verify that participants do indeed engage in the processes the instrument or task is purported to reflect. That is, researchers tend to assume that a measure or instrument gauges some cognitive process or operation--but this assumption is often not substantiated directly (see Embretson, 1983).

For example, in an extensive series of papers, Vrij and his colleagues have identified the mannerisms and vocal characteristics that might reflect deception (e.g., Vrij, Edward, & Bull, 2001). Conceivably, researchers could develop a protocol that practitioners could follow to assess whether someone is acting deceptively. For example, practitioners might assess whether or not respondents pause before they answer a question, commit many speech errors, and speak rapidly--all possible indices of deception, at least in particular contexts (Vrij, Akehurst, & Morris, 1997; Vrij, Edward, & Bull, 2001; Vrij, Edward, Roberts, & Bull, 2000).

Nevertheless, to validate this measure, researchers would need to show these mannerisms or behaviors do indeed reflect the purported processes that are assumed to underlie these manifestations of deception. For example, rapid speaking is assumed to reflect anxiety. Hence, researchers would need to show that rapid speaking does indeed coincide with other indices of anxiety.

A variety of methods can be applied to demonstrate the measure or instrument does indeed engage or reflect the purported processes. These methods include "think aloud" protocols, in which participants are asked to share their thoughts and images while completing the measure, as well as eye movement records. In addition, examinations of correlations, perhaps using confirmatory factor analysis, together with more complex mathematical modeling of response times, can be applied (Messick, 1995; Snow & Lohman, 1989; for an example, see Schweikert, 1980).
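As a concrete illustration of the simplest of these methods, the following sketch correlates one behavioral index, speech rate, with scores on a validated anxiety inventory. The data are simulated; with real data, a negligible correlation would challenge the assumption that rapid speaking reflects anxiety.

    import numpy as np

    rng = np.random.default_rng(3)

    # Simulated observations: speech rate (words per minute) during an
    # interview and scores on a validated anxiety inventory.
    anxiety = rng.normal(loc=50, scale=10, size=80)
    speech_rate = 120 + 0.8 * (anxiety - 50) + rng.normal(scale=8, size=80)

    # If rapid speaking does reflect anxiety, this behavioral index should
    # coincide with an independent index of the same process.
    r = np.corrcoef(speech_rate, anxiety)[0, 1]
    print(f"speech rate x anxiety: r = {r:.2f}")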

Structural fidelity

To evaluate the validity of measures, researchers often do not consider a key facet of measures--the validity or suitability of the scoring method, sometimes called structural fidelity (Loevinger, 1957). That is, a variety of scoring methods can be applied. Often, researchers merely compute the average or total response on a set of items. In some instances, however, this method is not suitable.

To illustrate, Vagg and Spielberger (1998) constructed an instrument that is intended to measure job stress. Respondents are instructed to estimate the severity and frequency of a series of events, such as "frequent interruptions" and "meeting deadlines" on a rating scale. To represent job stress, researchers could merely calculate the average of the items. Alternatively, they could multiply the frequency and severity of each item and then compute the average of these products. Indeed, theoretically, a variety of different methods could be applied.

As Messick (1995) emphasizes, to ascertain the optimal method, research needs to be undertaken to determine how the various facets combine to generate an overall outcome. Perhaps, for example, research might show that stress primarily depends on the frequency, not severity, of obstacles. Accordingly, researchers should not merely multiply the frequency and severity of each item and then compute the average of these products; this method would overestimate the importance of severity.
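As a concrete illustration, the following sketch simulates ratings in which, per the example above, strain depends mainly on frequency, and then compares how well each candidate composite predicts that criterion. All values are simulated; under these assumptions, the method whose composite tracks the criterion more closely exhibits greater structural fidelity.

    import numpy as np

    rng = np.random.default_rng(4)
    n_people, n_items = 150, 10

    # Simulated ratings of each stressor on 1-9 scales: how often it
    # occurs (frequency) and how taxing it is (severity).
    frequency = rng.integers(1, 10, size=(n_people, n_items))
    severity = rng.integers(1, 10, size=(n_people, n_items))

    # Criterion in which strain depends mainly on frequency, per the example.
    strain = frequency.mean(axis=1) + rng.normal(scale=1.0, size=n_people)

    # Two candidate scoring methods for the same instrument.
    score_mean = (frequency + severity).mean(axis=1) / 2  # average of ratings
    score_product = (frequency * severity).mean(axis=1)   # mean of products

    # Compare how well each composite predicts the criterion.
    for name, score in [("mean", score_mean), ("product", score_product)]:
        r = np.corrcoef(score, strain)[0, 1]
        print(f"{name:7s} r = {r:.2f}")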

Consequences as evidence of validity

To evaluate the validity of measures, researchers often neglect to consider the consequences of administering these instruments. For example, after participants complete an instrument that characterizes their abilities, they might be more likely to be allocated to jobs they enjoy; this instrument, thus, might promote some desirable consequences. Nevertheless, after completing this instrument, participants might be inclined to focus their attention on their deficits, denting their confidence and ultimately diminishing their wellbeing. In other words, instruments can evoke both positive and negative consequences.

Researchers, therefore, should establish that the instrument or measure does not ultimately compromise the psychological states and capacities of individuals (Messick, 1995). If the measure does provoke damage, researchers must consider the source of this problem; for example, the measure might represent extraneous variance.

Measurement reliability

Assessing item clarity

Results observed by Kulas and Stachowski (2009) imply that a simple technique can be used to assess the clarity of items in scales. In particular, researchers should ensure the rating scale includes an odd number of options, at least while establishing the validity of the measure. That is, a middle option should be included. If the frequency with which individuals choose the middle option is particularly high, the corresponding item might be unclear.

Specifically, Kulas and Stachowski (2009) showed that endorsements of the middle option are relatively slow. Furthermore, when the middle option was often endorsed, clarity was reported to be low. Participants conceptualized the middle option as implying "it depends". Accordingly, if the middle option is often endorsed, the item might need to be contextualized more specifically. That is, a specific example might need to be included.
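A sketch of this diagnostic appears below, using simulated responses to a 5-point scale; item 4 is deliberately constructed to attract the middle option. Items whose middle-option endorsement rate is unusually high are flagged for review. The threshold of .4 is arbitrary, chosen only for illustration.

    import numpy as np

    rng = np.random.default_rng(5)

    # Simulated responses of 200 people to five items on a 5-point scale,
    # where option 3 is the middle option; item 4 is deliberately noisy.
    responses = rng.integers(1, 6, size=(200, 5))
    responses[:, 3] = np.where(rng.random(200) < 0.6, 3, responses[:, 3])

    # Proportion of respondents endorsing the middle option on each item;
    # unusually high rates flag items that may need a more specific context.
    middle_rate = (responses == 3).mean(axis=0)
    for i, rate in enumerate(middle_rate):
        flag = "  <- consider contextualizing" if rate > 0.4 else ""
        print(f"item {i + 1}: {rate:.2f}{flag}")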

References

Brunswik, E. (1956). Perception and the representative design of psychological experiments (2nd ed.). Berkeley: University of California Press.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.

Cronbach, L. J., & Gleser, G. C. (1965). Psychological tests and personnel decisions (2nd ed.). Urbana: University of Illinois Press.

Embretson, S. (1983). Construct validity: Construct representation versus nomothetic span. Psychological Bulletin, 93, 179-197.

Kulas, J. T., & Stachowski, A. A. (2009). Middle category endorsement in odd-numbered Likert response scales: Associated item characteristics, cognitive demands, and preferred meanings. Journal of Research in Personality, 43, 489-493.

Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3, 635-694 (Monograph Supplement 9).

Messick, S. (1995). Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749.

Pennebaker, J. W., & Stone, L. D. (2003). Words of wisdom: Language use over the life span. Journal of Personality and Social Psychology, 85, 291-301.

Schweikert, R. (1980). Critical-path scheduling of mental processes in a dual task. Science, 209, 704-706.

Snow, R. E., & Lohman, D. F. (1989). Implications of cognitive psychology for educational measurement. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 263-331). New York: Macmillan.

Vagg, P. R., & Spielberger, C. D. (1998). Occupational stress: Measuring job pressure and organizational support in the workplace. Journal of Occupational Health Psychology, 3, 294-305.

Vrij, A., Akehurst, L., & Morris, P. (1997). Individual differences in hand movements during deception. Journal of Nonverbal Behavior, 21, 87-102.

Vrij, A., Edward, K., & Bull, R. (2001). Stereotypical verbal and nonverbal responses while deceiving others. Personality & Social Psychology Bulletin, 27, 899-909.

Vrij, A., Edward, K., Roberts, K. P., & Bull, R. (2000). Detecting deceit via analysis of verbal and nonverbal behavior. Journal of Nonverbal Behavior, 24, 239-263.








Last Update: 6/2/2016