Psychological measurement-instruments - a summary for WSRt of an article by Oostervel & Vorst (2010)

Critical thinking
Article: Oostervel & Vorst (2010)
Psychological measurement-instruments

The construction of measurement-instruments is an important subject, for several reasons:

Certain instruments age because theories about human behaviour change or because social changes make existing instruments obsolete.
New instruments can be necessary because existing instruments are not good enough.
New instruments can be necessary because existing instruments are not suitable for a certain target group.

Measurement preferences
Validity and measurement-quality
Threats to validity of measures

Measurement preferences

Measurement preferences of an instrument: the goal of a measurement-instrument.
This goal concerns a more or less hypothetical property.

The domain of human action

The instrument is usually focused on measuring a property in a global domain of human action.
A domain: a wide area of more or less coherent properties.

Observation methods

Every measurement-instrument uses one or more observation methods. For different properties of different domains, usually different observation methods are used.

performance-tests
questionnaires
observation tests

When properties are measured with different observation methods, it follows that each method measures a different domain of the traits or categories.

Instruments based on the same observation method appear to share a common method-factor, which is usually stronger than the common trait-factor of the same traits measured with different observation methods.

Theory

The development of an instrument is usually based on an elaborated theory, on insights from empirical research, or on ideas from informal knowledge.
Instruments developed on the basis of formal knowledge and an elaborated theory are of better quality than instruments based on informal knowledge and a poorly formulated theory.

Construct

An instrument is the elaboration of a construct that refers to a combination of properties.
Measurement instruments for specific (latent) traits are of better quality than instruments for global traits or composite traits.

Structure

The structure of a test depends on the properties it measures.

With unstructured observation-methods the measurement-conditions are not standardized, so the measurement-results are difficult to compare across persons and situations. Objective scores are difficult to obtain.

Application possibilities

The applications a researcher has in mind for a measurement-instrument can relate to theoretical or descriptive research.
Such research is about the analysis of a great number of observations.

For individual applications, high demands are placed on how well the measurement-preferences are realised.

Costs

An often decisive element in the description of the measurement-preferences of a measurement-instrument is the cost of that instrument.

Dimensionality

An instrument consists of one or more measurement-scales or sub-tests.
More scales refer to more dimensions of the construct and to a subdivision into more latent traits or latent categories.

An instrument that is based on a specific latent trait must be one-dimensional.

Reliability

Three kinds of reliability:

Internal consistency-reliability: the mutual cohesion of the items that form a scale or sub-test.
Repeated (test-retest) reliability: the agreement between repeated measures with the same instrument.
Local reliability: an impression of the reliability of the measurement within a certain range of scores.

Validity

Does the test measure what it is supposed to measure?

Forms of validity:

Impression-validity
Content-validity
Criterion-validity
Process-validity
Construct-validity

Utility

Utility of an instrument: the usefulness of an instrument as it becomes apparent from a cost-benefit analysis.

Standardization

A psychological measurement-instrument does not lead to absolute results, but to relative ones. The individual scores must be compared to the scores of others.
The scores of others form the norm.
Norm-group: the group of people that forms the norm.
Norming consists of converting raw scores to relative norm-scores.
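
The conversion from raw score to norm-score can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the summary does not say which norm-scores the article uses, so the z-score and the mid-rank percentile below are common choices, not the authors' method, and the function name `norm_scores` and the norm-group data are hypothetical.

```python
from statistics import mean, pstdev


def norm_scores(raw_score, norm_group_scores):
    """Return (z-score, percentile rank) of a raw score relative to a norm-group."""
    m = mean(norm_group_scores)
    sd = pstdev(norm_group_scores)  # population SD of the norm-group
    if sd == 0:
        raise ValueError("norm-group has no variation; relative scores are undefined")
    z = (raw_score - m) / sd
    below = sum(1 for s in norm_group_scores if s < raw_score)
    equal = sum(1 for s in norm_group_scores if s == raw_score)
    # Mid-rank percentile: scores tied with the raw score count for half.
    percentile = 100.0 * (below + 0.5 * equal) / len(norm_group_scores)
    return z, percentile


# Hypothetical raw scores of a norm-group (illustration only).
norm_group = [12, 15, 15, 18, 20, 21, 23, 25, 27, 30]
z, percentile = norm_scores(23, norm_group)
print(f"z = {z:.2f}, percentile rank = {percentile:.1f}")
```

The same raw score of 23 would get a different norm-score against a different norm-group, which is why the choice of norm-group matters.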

Validity and measurement-quality

Validity and measurement-quality of measurement-instruments

Impression-validity – a subjective judgment of the measurement-quality

Impression-validity: a subjective judgment of the usability of a measurement-instrument on the basis of the directly observable properties of the test-material.
It is the judgment of test-takers and other laypeople.

Content-validity - content measurement-quality

Content-validity: the judgment about the representativeness of the observations, tasks, and questions for a certain purpose.
This can be determined by giving potential respondents or experts the domain-descriptions and the items of the instrument, and then asking them to sort the items by domain-description.

If the judges largely agree on which items belong to which domain-descriptions, the content-validity is high.

Content-validity is especially important for tests and exams.
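
The sorting task described above can be summarised as the proportion of judges who place each item in its intended domain. The sketch below is an illustration only: the summary does not name an agreement index, so a simple proportion is assumed, and the item names, domain names, and the function `sorting_agreement` are hypothetical.

```python
def sorting_agreement(sortings, intended):
    """Per item: proportion of judges who sorted it into the intended domain."""
    agreement = {}
    for item, domain in intended.items():
        votes = [judge[item] for judge in sortings]
        agreement[item] = sum(v == domain for v in votes) / len(votes)
    return agreement


# Domain each item was written for (hypothetical).
intended = {"item1": "anxiety", "item2": "anxiety", "item3": "mood"}

# How three judges sorted the items (hypothetical).
sortings = [
    {"item1": "anxiety", "item2": "mood", "item3": "mood"},
    {"item1": "anxiety", "item2": "anxiety", "item3": "mood"},
    {"item1": "anxiety", "item2": "mood", "item3": "mood"},
]

per_item = sorting_agreement(sortings, intended)
overall = sum(per_item.values()) / len(per_item)
print(per_item)
print(f"overall agreement = {overall:.2f}")
```

An item with low agreement, such as `item2` here, is a candidate for rewriting or for reassignment to another domain.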

Criterion-validity – predictive value of the measurement

Criterion-validity: the (cor)relation between the test-score and a psychological or social criterion.
It can be determined by examining the relation between test-scores and criterion-scores.
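
As a minimal sketch, the relation between test-score and criterion-score can be expressed as a Pearson correlation. The use of Pearson's r is an assumption (the summary only speaks of a "(cor)relation"), and the scores below are invented for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: test-scores and a later criterion (for example a performance rating).
test_scores = [10, 12, 14, 16, 18, 20]
criterion_scores = [2.0, 2.5, 2.4, 3.1, 3.0, 3.6]

validity_coefficient = pearson_r(test_scores, criterion_scores)
print(f"criterion-validity (r) = {validity_coefficient:.2f}")
```

With real data the coefficient is usually far lower than in this tidy example, and it holds only for the population in which it was estimated.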

Process-validity – procedural measurement-quality

Process-validity: the manner in which the response comes about.
It can be researched with thinking-out-loud protocols or with experiments in which the instructions are varied.

Construct-validity – theoretical measurement-quality

Construct-validity: the degree of agreement between, on the one hand, the strictly formulated, hypothetical relations between the measured construct and other constructs and, on the other hand, the empirically established relations between the instruments that are supposed to measure those constructs.

Convergent validity: measurement-results from different instruments that measure the same construct are coherent or highly correlated.
Divergent validity: measurement-results from different instruments that measure different constructs have a low correlation.
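
The convergent and divergent pattern can be checked mechanically once the correlations between scales are known. The sketch below assumes two constructs (A and B), each measured with two observation methods. All correlations are invented, and the decision rule (every convergent correlation must exceed every divergent one) is a simplification chosen for illustration, not a rule taken from the article.

```python
# Hypothetical correlations between four scales, named <construct>_<method>.
correlations = {
    ("A_questionnaire", "A_rating"): 0.62,         # same construct, different method
    ("B_questionnaire", "B_rating"): 0.58,         # same construct, different method
    ("A_questionnaire", "B_questionnaire"): 0.21,  # different construct, same method
    ("A_rating", "B_rating"): 0.25,                # different construct, same method
    ("A_questionnaire", "B_rating"): 0.10,         # different construct and method
    ("A_rating", "B_questionnaire"): 0.08,         # different construct and method
}


def construct_of(scale):
    """Extract the construct label from a scale name such as 'A_rating'."""
    return scale.split("_")[0]


convergent = [r for (s1, s2), r in correlations.items()
              if construct_of(s1) == construct_of(s2)]
divergent = [r for (s1, s2), r in correlations.items()
             if construct_of(s1) != construct_of(s2)]

# Simplified rule: each convergent correlation exceeds each divergent one.
pattern_supported = min(convergent) > max(abs(r) for r in divergent)
print("convergent:", convergent)
print("divergent:", divergent)
print("pattern supported:", pattern_supported)
```

In these invented numbers the same-method correlations between different constructs (0.21 and 0.25) are higher than the cross-method ones, which is how a common method-factor would show up.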

Homogeneity or consistency-reliability

The coherence between the separate indicators (items) in a scale.
For a psychological scale, it is assumed that the items of which the scale is composed are independent, repeated measures of the same trait.

Homogeneity is determined with different indices:

mean inter-item correlation
split-halves reliability
coefficient alpha

The size of the homogeneity-indices usually depends on the size of the inter-item correlations and on the number of items.
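
The dependence on both the inter-item correlations and the number of items can be made concrete with coefficient alpha. The sketch below uses the standard variance-based formula for alpha and the standardized form based on the mean inter-item correlation. The response data and function names are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Coefficient alpha; `responses` is a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    item_variances = [variance([person[i] for person in responses]) for i in range(k)]
    total_variance = variance([sum(person) for person in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


def standardized_alpha(mean_r, k):
    """Alpha implied by a mean inter-item correlation `mean_r` and `k` items."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# Hypothetical scores of five respondents on a four-item scale.
responses = [
    [3, 4, 3, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
    [4, 3, 4, 4],
]
alpha = cronbach_alpha(responses)
print(f"coefficient alpha = {alpha:.2f}")

# Same mean inter-item correlation, more items: the index goes up.
for k in (5, 10, 20):
    print(k, "items:", round(standardized_alpha(0.30, k), 2))
```

A long scale can therefore reach a high alpha even when its items correlate only modestly, so a high index alone does not show that the items are homogeneous.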

Generalizability of the measurement-quality

The validity and reliability (the measurement-quality) of an instrument are in principle dependent on the population or sample.
For every group of persons that differs on one or more characteristics, the validity and reliability of an instrument must be determined separately.

Paradoxes and measurement-qualities

Subjective judgments of the measurement-quality

The unaided judgment about the measurement-quality of a test can be deceptive and need not bear any relation to the empirically researched measurement-quality.

Content validity

The content validity of an instrument is apparent from a representative choice of items from one or more domains of items.
If the content of an instrument is chosen optimally, this can lead to a less homogeneous instrument.

The property is then measured with items that are diverse in content.

These items elicit a great diversity of responses, which do not lead to a homogeneous one-dimensional scale or sub-test.
If the constructor also wants a high homogeneity or predictive value of the instrument, this will be at the expense of the content representativeness of the instrument.

Predictive value

A good one-dimensional measurement-model requires homogeneous responses to a limited number of repeated measurements. That requires homogeneous items.

The consistency-reliability of homogeneous items is higher than that of heterogeneous items, but the predictive value of homogeneous items is lower than that of heterogeneous items.

With heterogeneous items one usually cannot form a scale with good measurement-properties, but one can form a predictor.
With homogeneous items one can form a scale with good measurement-properties, but usually not a good predictor.

Theoretical measurement-quality

Items must be homogeneous per scale in order to meet the requirements of a one-dimensional measurement-model.

The items of a scale must correlate with (the items of) scales that measure similar properties (convergent relations) and may not correlate highly with (the items of) scales that measure dissimilar properties (divergent relations).
Items may not correlate highly with indications of response tendencies.

If items meet these criteria, the predictive value and content quality of the instrument cannot be optimal.

Homogeneity

Homogeneity of a scale is based on the assumption that items of a scale or sub-test form independent, repeated measures of a property.
These repeated measurements must be mutually coherent for the scale or sub-test to have a high consistency-reliability or homogeneity.
Homogeneous scales or sub-tests threaten the maximal content representativeness of the property and the predictive value of the measurement.

Selection of measurement-quality

A test-constructor can’t maximize all the measurement-qualities in one measurement-instrument, and the test-taker shouldn’t expect all the measurement-qualities in an instrument.

If a constructor has not chosen a measurement-quality to maximize beforehand, the instrument will have arbitrary measurement-qualities.
The constructor can attain the best results by focusing on one measurement-quality during test-construction.

Optimization, probability capitalization, and cross-validation

Most methods of test-construction have an empirical character in which the constructor attains the best result using optimizing choice-procedures, optimizing solution-strategies, or optimizing analyzing-strategies.
Chance can play a big role in this.
By making optimal choices, or using optimal techniques, the constructor can stack coincidence on coincidence. This is probability capitalization (also known as capitalization on chance).

The constructor can gain insight into the effects of probability capitalization by using cross-validation on the results.
With cross-validation, the results of an optimizing strategy become more apparent, and because of this more certainty can be obtained about the (in)stability of the results.

Optimizing procedures: examples

selection of items on the basis of optimal psychometric properties
selection of items on the basis of differences between groups
selection of optimal weights for item-scores and test-scores

Probability capitalization

Optimizing procedures are typically empirical: they lead to optimizing choices made on the basis of empirical data.
There are no theoretical or hypothetical considerations that adjust the selection process.

The data on which selection is based is commonly unreliable to a certain extent.

Probability capitalization means that with an optimizing strategy, the choice is made partly on the basis of chance.
With such a choice, no distinction can be made between true variance and error variance (coincidence).

Probability capitalization does not occur when selection is based only on true differences or true correlations, but it does occur when selection is (also) based on accidentally large differences or correlations.
Chance in this sense is neither systematic nor repeatable.
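
A small simulation can make this visible. In the sketch below every item is pure noise, so no item has a true relation with the criterion. Selecting the items with the largest sample correlations still produces a set that looks predictive in that sample. All numbers (50 people, 40 items, 5 selected, the seed) are arbitrary assumptions for illustration.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


rng = random.Random(0)  # fixed, arbitrary seed so the run is repeatable
n_people, n_items, n_select = 50, 40, 5


def abs_correlations():
    """Draw a fresh sample of pure-noise items and criterion; return |r| per item."""
    items = [[rng.gauss(0, 1) for _ in range(n_people)] for _ in range(n_items)]
    criterion = [rng.gauss(0, 1) for _ in range(n_people)]
    return [abs(pearson_r(item, criterion)) for item in items]


# Sample A: select the items with the largest (purely accidental) correlations.
r_a = abs_correlations()
selected = sorted(range(n_items), key=lambda i: r_a[i], reverse=True)[:n_select]
mean_selected_a = sum(r_a[i] for i in selected) / n_select
mean_all_a = sum(r_a) / n_items

# Sample B: the same item positions in a fresh sample, without new selection.
r_b = abs_correlations()
mean_selected_b = sum(r_b[i] for i in selected) / n_select

print(f"mean |r| of all items in sample A:      {mean_all_a:.2f}")
print(f"mean |r| of selected items in sample A: {mean_selected_a:.2f}")
print(f"mean |r| of the same items in sample B: {mean_selected_b:.2f}")
```

The selected items always look better than the average item in the sample used for selection. In a fresh sample that advantage typically disappears, because it was built on chance alone.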

Sorts of optimizing procedures and techniques

Three common forms of optimizing:

optimizing the psychometric characteristics of measurement and/or prediction by selecting items
optimizing the differences between the mean scores of groups by selecting items
optimizing the quality of measurement or the accuracy of prediction by giving weights to item-scores or test-scores

These procedures are often carried out with optimizing, exploratory analysis-techniques.
The most common techniques are:

exploratory factor-analysis
cluster-analysis
multiple regression-analysis
discriminant-analysis

Measurement-model: about what the constructor wants to measure
Structure-model: about what the constructor wants to predict

Cross-validation: control of instability of outcomes

The central idea of cross-validation: the optimal indices are tested or calculated repeatedly.

The research-sample is divided into two comparable groups.
Out of each of these groups, two new sub-groups (A and B) are composed at random. The sub-groups are then merged into two groups, A and B.
Then the exploratory analysis is carried out twice, once on each group.
The outcomes of the two analyses are compared.

If the optimizing technique or procedure brings similar results in both analyses, then probability capitalization is present to such a small degree that the outcomes are stable or reliable.
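
The procedure can be sketched as follows. This is a simplified illustration, not the authors' implementation: the sample is split at random into groups A and B without first forming comparable sub-groups, the optimizing step is a plain selection of the items that correlate most strongly with a criterion, and the data are simulated with five valid items and fifteen noise items. All names and numbers are assumptions.

```python
import random
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def select_items(items, criterion, people, n_select):
    """Optimizing step: the n_select items with the largest |r| with the criterion within `people`."""
    crit = [criterion[p] for p in people]
    strength = {
        name: abs(pearson_r([scores[p] for p in people], crit))
        for name, scores in items.items()
    }
    return set(sorted(strength, key=strength.get, reverse=True)[:n_select])


rng = random.Random(1)  # fixed, arbitrary seed
n_people = 200

# Simulated data: a latent trait drives the criterion and the five "valid" items.
trait = [rng.gauss(0, 1) for _ in range(n_people)]
criterion = [t + rng.gauss(0, 1) for t in trait]
items = {}
for i in range(5):
    items[f"valid_{i}"] = [t + rng.gauss(0, 1) for t in trait]
for i in range(15):
    items[f"noise_{i}"] = [rng.gauss(0, 1) for _ in range(n_people)]

# Random split of the research-sample into groups A and B.
people = list(range(n_people))
rng.shuffle(people)
group_a, group_b = people[: n_people // 2], people[n_people // 2:]

# The same exploratory, optimizing analysis is run on each group.
selected_a = select_items(items, criterion, group_a, 5)
selected_b = select_items(items, criterion, group_b, 5)
overlap = selected_a & selected_b

print("selected in A:", sorted(selected_a))
print("selected in B:", sorted(selected_b))
print(f"items selected in both groups: {len(overlap)} of 5")
```

A large overlap between the two selections suggests that the outcome is stable. A small overlap suggests that the selection rested largely on chance.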

Threats to validity of measures

Threats with observation-methods:

Disruptive influence of the presence and behavior of the observer on the observed person and his or her behavior.
Expectancy-effect: distorting effect of the expectations of the observer on the observations.
Adjustment-effect: changes in the manner of observation in the course of time.
Category-effect: loss of precision due to the use of global categories.
Order-effect: distorting influence of the first and last observations on the other observations of the series.
Effect of under-performance: distortion of observations due to under-representation of common behavior or events.
Effect of event rate: distortion due to missing observations as a result of the rate of events or behaviors.
Effect of event-complexity: distortion due to missing observations as a result of the complexity of behaviors or events.

Threats with rating-methods:

Halo-effect: a positive distortion on specific traits as a result of a positive first or general impression.
Horn-effect: a negative distortion on specific traits as a result of a negative first or general impression.
Central tendency (regression to the middle): distortion of judgments as a result of a tendency to give average judgments or to give little variation in judgments.
Contrast-effect: distortion due to a tendency to enlarge existing differences between people or differences with the judge.
Leniency-effect: distortion as a result of the tendency to avoid negative judgments or to give relatively positive judgments.
Severity-effect: distortion as a result of the tendency to give relatively negative judgments or relatively few positive judgments.
Logical error: distortion as a result of assuming traits on the basis of psycho-’logical’ connections or of assuming cause and effect.

Psychometric research

The measurement-preferences of a test must be investigated empirically with psychometric research.
