
Summary of A Conceptual Introduction to Psychometrics by Mellenbergh - 1st edition

What is meant by psychometrics? – Chapter 1

Test definitions

Psychometric terminology sometimes differs depending on the types of test applications.

A psychological or educational test: an instrument for the measurement of a person’s maximum or typical performance under standardized conditions, where the performance is assumed to reflect one or more latent attributes.

  • A test is defined to be a measurement instrument. It is for measurement in the first place.

  • A test is defined to measure performance. Two types of performance:

    • Maximum performance tests ask the person to do his or her best to solve one or more problems. The answers can be evaluated on their correctness.

    • Typical performance tests ask the person to respond to one or more tasks where the responses are typical for the person. The person’s responses cannot be evaluated on correctness, but they typify the person.

  • Performance is measured under standardized conditions.

  • Test performance must reflect one or more latent attributes. The test performance is observable, but the latent attributes cannot be observed.

Tests are distinguished from surveys. It is not assumed that survey questions reflect a latent attribute.

Subtest: an independent part of a test.

A (sub)test consists of one or more items.

Item: the smallest possible subtest of a test. The building blocks of a test.

A test consisting of n items is called an n-item test.

One or more latent attributes affect test performance.

The number of latent attributes is the dimensionality of the test.

Dimensionality: equal to the number of latent attributes (variables) that affect test performance.

Unidimensional test: a test that predominantly measures one latent attribute.

Multidimensional test: a test that measures more than one latent attribute.

Two-dimensional test: a test that measures two latent attributes.

Test types

Psychological and educational measurement instruments can be divided into:

  • Mental test: consists of cognitive tasks

  • Physical test: consists of instruments to make somatic or physiological measurements

Maximum performance tests

Performance can be considered in two different respects: whether the performance is accurate and whether the performance is fast.

Classified according to time:

  • Pure power test: consists of problems that the test taker tries to solve. The test taker has ample time to work on each of the test items, even on the most difficult ones.
    The emphasis is on measuring the accuracy of solving the problems.

  • Time-limited power tests: tests are constructed so that the majority of test takers have enough time to solve the problems, and only a small minority needs more time.

  • Speed test: measures the speed of solving problems. Usually, the test consists of very easy items that can be solved by all test takers.
    The test taker is asked to solve the problems as quickly as possible.
    The emphasis is on measuring the time taken to solve the problems.

Maximum performance tests are also classified according to the attributes which they measure.

  • Ability test: an instrument for measuring a person’s best performance in an area that is not explicitly taught in training and educational programs.

  • Achievement tests: an instrument for measuring performance that is explicitly taught in training and educational programs.

Typical performance tests

A typical performance test: an instrument for measuring behavior that is typical for the person.

Frequently, typical performance tests are called questionnaires or inventories.

Three main types:

  • Personality tests (questionnaires): measure a person’s personality characteristics.

  • Interest inventories: measure a person’s interests.

  • Attitude questionnaires: measure a person’s attitude towards something.

Types of test taking situations:

  • The test taker is the same person as the one who is measured.

  • A test taker is a person other than the one being measured.

 

How can you develop maximum performance tests? – Chapter 2

Construct of interest

The test developer must specify the latent variable of interest that has to be measured by the test.

The latent variable is a general term. The term is used when a substantive interpretation is given of the latent variable.

The latent variable (construct) is assumed to affect test takers’ item responses and test scores.

Constructs can vary in many different ways.

  • Constructs vary in content: mental abilities, psychomotor skills, or physical abilities.

  • Constructs may vary in scope, for example, from general intelligence to multiplication skill.

  • Constructs vary from educational to psychological variables.

A good way to start a test development project is to define the construct that has to be measured by the test.

This definition describes the construct of interest, and distinguishes it from other, related, constructs.

 

Usually, the literature on the construct needs to be studied before the definition can be given. Frequently the definition can only be given when other elements of the test development plan are specified.

Measurement mode

Different modes can be used to measure constructs.

  • Self-performance mode:
    The test taker is asked to perform a mental or physical task.

  • Self-evaluation mode:
    The test taker is asked to evaluate his or her ability to perform the task.

  • Other-evaluation mode:
    Others are asked to evaluate a person’s ability to perform a task.

The objectives

The test developer must specify the objectives of the test. Tests are used for many different purposes.

  • Scientific versus practical.

  • Individual level versus group level.

  • Description (describe performances) versus diagnosis (adds a conclusion to a description) versus decision-making (decisions are based on tests).

The population

Target population: the set of persons to whom the test has to be applied.

The test developer must define the target population, and must provide criteria for the inclusion and exclusion of persons.

A target population can be split into distinct subpopulations. The test developer must specify whether subpopulations need to be distinguished and, if so, must define the subpopulations and provide criteria to include persons in them.

The conceptual framework

Test development starts with a definition or description of the construct that has to be measured by the test. But, this is usually not enough to write test items.

A conceptual framework: gives the item writer a handle to write items.

Item response mode

The item response mode needs to be specified before item writing starts.

Distinction:

  • Free- (constructed-) response items

  • Choice (selected-) response items

Free response items are further divided into:

  • Short-answer items

  • Essay items

Different types of choice modes are used in achievement and ability testing:

  • Conventional multiple-choice mode:
    Consists of a stem and two or more options. The options are divided into one correct answer and one or more distractors.
    Usually, choosing the correct option of a multiple-choice item indicates that test takers’ ability or skill is sufficiently high to solve the item.
    Distractors can be constructed to contain specific information on the reasons why the test taker failed to solve the item correctly. The choice of a distractor indicates which deficiency the test taker has and as such can be used for diagnosing specific deficiencies.

  • A dichotomous item response scale has two ordered categories. An answer is correct or incorrect.

  • An ordinal-polytomous scale has more than two ordered categories.
    Partial ordinal-polytomous response scale: the correct option is ordered above the distractors, but the distractors are not ordered among themselves.

Administration mode

A test can be administered to test takers in different ways:

  • Oral
    The test is presented orally by a single test administrator to a single test taker

  • Paper-and-pencil
    The test is presented in the form of a booklet

  • Computerized
    The test is presented on a computer; the order of the items is the same for each of the test takers.

  • Computerized adaptive test administration
    The test is adaptive. The computer program searches for the items that best fit the test taker.

Item writing guidelines

  • Focus on one relevant aspect
    Each item should focus on a single relevant aspect of the specification in order to guarantee good coverage of the important aspects of the achievement or ability.
    An item should also measure only a single aspect of the specification so that test takers’ item responses are unambiguously interpretable.

  • Use independent item content
    The content of different items should be independent.
    Testset: a group of items that may be developed as a single unit that is meant to be administered together.

  • Avoid overly specific and overly general content
    The disadvantage of overly specific item content is that the content may be trivial; the disadvantage of overly general content is that the content may be ambiguous.

  • Avoid items that deliberately deceive test takers

  • Keep vocabulary simple for the population of test takers

  • Put item options vertically

  • Minimize reading time and avoid unnecessary information

  • Use correct language

  • Use non-sensitive language

  • Use a clear stem and include the central idea in the stem

  • Word the item positively, and avoid negatives
    Negatively phrased items are harder to understand and may confuse test takers.

  • Use three options, unless it is easy to write plausible distractors

  • Use one option that is unambiguously the correct or best answer

  • Place the options in alphabetical, logical, or numerical order

  • Vary the location of the correct option across the test

  • Keep the options homogeneous in length, content, grammar, etc.

  • Avoid ‘all-of-the-above’ as the last option

  • Make distractors plausible

  • Avoid giving clues to the correct option

Item rating guidelines

The responses to free- (constructed-) response items have to be rated by raters.

Important guidelines:

  • Rate responses anonymously

  • Rate responses to one item at a time

  • Provide the rater with a frame of reference

  • Separate irrelevant aspects from the relevant performance

  • Use more than one rater

  • Re-rate the free responses

  • Rate all responses to an item on the same occasion

  • Rearrange the order of responses

  • Read a sample of responses

Pilot studies on item quality

Standard practice is that item writers produce a set of concept items and pilot studies are done to test the quality of these concept items.

Generally, at least half of the concept items do not survive the pilot studies, and items that survive are usually revised several times.

Experts’ and test takers’ pilot studies need to be done for both free-response and multiple-choice items.

For free-response items, pilot studies also need to be done on the ratings of test takers’ responses to the items.

Expert’s pilots

The concept items have to be reviewed before they are included in a test.

Items are reviewed on their content, technical aspects, and sensitivity.

  • The content and technical aspects are assessed by experts in both the field of the test and item writing.

Each of the concept items is discussed by a panel of experts.

A good start for the discussion of a multiple-choice item is to look for distractors that panel members could defend as (partly) correct answers.

The reviewing of the items yields qualitative information that is used to rewrite items or to remove concept items that cannot be repaired.

Revised items should be reviewed again by experts until further rewriting is not needed.

The sensitivity of items also needs to be reviewed.

Usually, the panel for the sensitivity review of the items consists of persons other than those on the panel reviewing the content and technical aspects of the items.

The sensitivity review panel is composed of members of different groups.

The panel has to be trained to detect aspects of the items and the tests that may be sensitive to subpopulations.

The sensitivity review provides qualitative information that also could lead to rewriting or removal of concept items.

Test takers’ pilots

The concept items are individually administered to a small group of test takers from the population of interest.

Each of the test takers is interviewed on their thinking while working on an item.

Two versions of the interview can be applied:

  • Concurrent interview: the test taker is asked to think aloud while working on the item.

  • Retrospective interview: the test taker is asked to recollect his or her thinking after completing the item.

Protocols of the interviews are made and the information is used to rewrite or remove concept items.

Compiling the first draft of the test

The concept items that survived the pilot studies are used to compile a concept version of the test that includes instructions for the test takers.

Usually, the instruction contains some example items that test takers have to answer to ensure that they understand the test instructions.

The concept test may consist of a number of subtests that measure different aspects of the ability or achievement.

The conventional way of assembling a maximum performance test is to start with easy items and to end with difficult items.

The concept test is submitted to a group of experts.

The group can be the same as the group that was used in the experts’ pilot study on item quality. The group has expertise in:

  • The content of the ability or achievement

  • Test construction

The experts evaluate the following properties of the concept test:

  • Whether the test instruction is sufficiently clear for the population of test takers

  • Whether the test yields adequate coverage of all aspects of the ability or achievement being measured by the test. (content validation)

  • Whether the test is balanced with respect to multicultural material and references to gender

The comments of the experts are used to compile the first draft of the test.

The first draft of the test is administered in a try-out to a sample of at least 200 test takers from the population of interest.

 

What is a typical performance test? – Chapter 3

The responses to typical performance tests are not evaluated on their correctness but are considered to typify a person.

At the start of a test development project, the researcher needs information on the construct of interest. This information can be obtained from different sources.

A study of the literature on the construct and of existing measurement instruments is nearly always needed at the start of a test development project.

Different types of research can be done on the construct.

  • Focus group method
    Uses small groups of persons who have experiential knowledge about the construct.
    A focus group meets with the test developer to talk about their experiences with the construct.

  • Key informant method
    Uses persons who have expert knowledge about the construct of interest. The test developer interviews these key informants about the constructs.

  • Observation method

The test developer can use information from different sources to define the construct and, later on in the test development process, he or she can use this information for item writing.

Measurement mode

  • Self-report mode
    The test taker answers questions on a typical performance construct

  • Other-report mode
    A person answers questions about another person’s construct

  • Somatic indicator mode
    Uses somatic signs to measure constructs

  • Physical trace mode
    Uses traces that are left behind to measure constructs

Each of these four modes can occur in two different varieties

  • Reactive measurement mode
    When test takers can deliberately distort their construct value

  • Nonreactive measurement mode
    When test takers cannot distort their construct value

The reactive/nonreactive distinction is only used for typical performance measurements, and not for maximum performance measurements.

A maximum performance test asks test takers to do the best they can to perform the task.

Each of the four measurement modes can occur in two versions.

Self-report mode

Test takers are asked to respond to questions or stimuli to assess their attitudes, values, interests, opinions, or personality.

  • Reactive: questionnaires or inventories
    Test takers are fully aware of what questionnaires and inventories are measuring, and they can easily distort their construct values

  • Nonreactive: test takers respond to stimuli, but it is not clear to them what construct is being measured

Other-report mode

Uses other people to report on a given person’s typical performance construct.

  • Reactive: the person is aware that another person reports his or her construct and he or she can try to affect the other’s report

  • Nonreactive: persons cannot adapt their behavior to affect the other’s report

Somatic indicators mode

Uses somatic signs to assess typical performance constructs.

  • Reactive: the test taker can suspect what is being measured, and may deliberately affect the measurement

  • Nonreactive: the person is unaware of being measured by his or her somatic signs

Physical traces mode

Uses traces that persons left behind to assess their typical performance constructs

  • Reactive: the persons are aware that their traces can be noticed by others and may be used to assess their characteristics

  • Nonreactive: the person is unaware that his or her traces may be used for measurement purposes.

The objectives

  • Scientific versus practical

  • Individual level versus group level

  • Used for description versus diagnosis versus decision making

Population

The test developer must define the target population and the inclusion and exclusion criteria of persons.

If subpopulations need to be distinguished, the test developer must define these subpopulations and must provide criteria to include persons in these subpopulations.

The conceptual framework

There are three broad classes of strategies to construct typical performance tests. Within each of these classes, two specific test development methods are distinguished.

  • Intuitive class
    The relation between the construct and the items is of an intuitive nature.

    • Rational method
      Uses a loose description of the construct that is based on the knowledge of experts or members of the target population. The items are written from this intuitive knowledge of the construct.

    • Prototypical method
      Asks members of the target population to think of persons having the construct, and to write down their behavior that is typical for the construct. This prototypical information is used to write test items.

  • Inductive class
    The tests are derived from empirical data

    • Internal method
      Starts from a broad pool of personality or attitude items. The items are administered to a sample of the target population of persons, and the associations are computed between item scores. The test developer searches for clusters of items that are highly associated, and each of these clusters specifies a construct.

    • External method
      Starts from a broad pool of items and a criterion that has to be predicted. The items and the criterion are measured in a sample of persons from the target population. The associations between the persons’ item and criterion scores are computed, and the items that have the highest item-criterion associations are included in the test.

  • Deductive class
    Starts from theoretical or conceptual notions of the construct.

    • Construct method: starts from an explicit theory, and items are derived from this theory.

    • Facet method: does not use an explicit theory. It starts from a conceptual analysis of the construct.

Construct method

Uses a theoretical framework.

The construct is defined, and it is embedded in a network of other constructs. The theory and its network are used to write items.

The facet design method

Generates items from a conceptual analysis of the construct that has to be measured by the test.

Starts with an inventory of the observable behavior that applies to the construct.

This behavior is classified according to a number of aspects, which are called facets. Each of these facets contains a number of facet elements.

Important facets of the construction of typical performance tests are behavioral and situational facets.

  • Behavioral facets classify types of behavior

  • Situational facets classify the situations where the behaviors appear

The facets are crossed, and items are written for each of the combinations of the different facet elements.
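
As a sketch of this crossing step, the snippet below (Python; the facets and their elements are hypothetical, chosen for a construct such as test anxiety) generates one item stub per combination of facet elements:

    from itertools import product

    # Hypothetical behavioral and situational facets for a construct
    # such as test anxiety; each facet contains its facet elements.
    behavioral = ['worrying', 'physical tension']
    situational = ['before an exam', 'during an exam', 'when receiving grades']

    # Crossing the facets: one item is written per element combination,
    # so 2 x 3 = 6 items are generated here.
    for b, s in product(behavioral, situational):
        print(f'Write an item about {b} {s}.')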

Item response mode

A typical performance item consists of a question or statement, and the test taker is asked to answer the question or to react to the statement.

A number of distinctions are made, and these are used to classify the response modes of typical performance items.

  • Open-ended versus closed-ended
    Open-ended items ask the test taker to complete a question or statement.
    Closed-ended items consist of a statement or question and a response scale; the test taker is asked to indicate his or her position on the response scale.

The response scales of closed-ended items are divided into:

  • Frequency response scale:
    Asks the test taker to indicate the frequency of occurrence.

  • Endorsement response scale
    Asks the test taker to indicate his or her degree of endorsement of the statement

Endorsement scales are subdivided into:

  • All-or-none
    Asks the test taker to indicate whether he or she endorses the statement or not

  • Intensity endorsement scale
    Asks the test taker to indicate the degree of his or her endorsement of the statement

The intensity endorsement scales are subdivided into:

  • Discrete intensity endorsement scale
    Asks the test taker to indicate his or her degree of endorsement by choosing one out of more than two ordered categories

  • Continuous intensity endorsement scale
    Asks the test taker to indicate his or her degree of endorsement at a bounded-continuous scale, such as a line segment (visual analogue scale)

Unipolar scale: a response scale that goes from a zero point in one direction only.

Bipolar scale: a response scale that goes from a negative pole to a positive pole.

Dichotomous scale: a scale with only two categories.

Ordinal-polytomous scale: a scale that has more than two ordered categories.

Bounded-continuous scale: a continuous scale that is bounded, for example, by two endpoints.

Administration mode

The main modes to administer typical performance tests are:

  • Oral

    • Face-to-face administration

    • Telephone administration

  • Paper-and-pencil

    • Personal pencil-and-paper administration

    • Mail pencil-and-paper administration

  • Computerized

  • Computerized adaptive

Item writing guidelines

  • Elicit different answers at different construct positions

  • Focus on one aspect per item

  • Avoid making assumptions about test takers

  • Use correct language

  • Use clear and comprehensible wording

  • Use non-sensitive language and content

  • Put the situational or conditional part of a statement at the beginning and the behavioral part at the end

  • Use positive statements

  • Use 5-7 categories in ordinal-polytomous response scales

  • Label each of the categories of a response scale and avoid the use of numbers alone

  • Format response categories vertically

Item rating guidelines

  • Rate answers anonymously

  • Rate the answers to one item at a time

  • Provide the rater with a frame of reference

  • Separate irrelevant from relevant aspects

  • Use more than one rater

  • Rerate the answers

  • Rate all answers to an item on the same occasion

  • Rearrange the order of answers

  • Read a sample of answers

Pilot studies on item quality

Pilot studies are necessary to assess the quality of concept items.

Usually, a large number of concept items has to be revised or has to be removed from the pool of concept items.

Three types of pilot studies:

  • Experts’ pilots

  • Test takers’ pilots

  • Raters’ pilots

Expert’s pilots

Concept items have to be reviewed by experts.

Three types of expertise are needed:

  • Substantive expertise on the content of the concept items

  • Technical expertise on item writing

  • Sensitivity expertise

Test takers’ pilots

The concept items are administered to a small group of test takers from the target population. Each of the test takers is interviewed about their thinking while answering the items.

  • Concurrent interview: asks the test taker to think aloud while answering the items

  • Retrospective interview: asks the test taker to recollect their thinking after completing the items.
    The interviews are recorded and this information is used to revise or remove concept items.

Response tendencies

Responses to typical performance items may be affected by response tendencies.

Response tendency: the differential application of the response scale.

Response style: the differential use of the item response scale by different persons.

A response style varies between persons, but it is relatively constant across measurements of different constructs and across measurements of the same construct on different occasions.

It is a person-specific property.

Important response styles are:

  • Acquiescence
    The tendency to agree with an endorsement statement, independently of the content of the statement

  • Dissentience
    The tendency to disagree with an endorsement statement, independently of the content of the statement

  • Extremity
    The tendency to choose extremes of the item response scale

  • Midpoint
    The tendency to choose the middle of the response scale

Response set: the differential use of the item response scale by different persons and for different constructs.

A response set may differ between persons and between constructs, and it is only relatively stable across measurements of the same construct on different occasions. It is a person/construct-specific property.

Response sets:

  • Social desirability: a person’s tendency to deceive either oneself or others.

  • Self-deception: the tendency to deceive oneself.

  • Impression management: the tendency to deceive others by making good or bad impressions on others.

Social desirability is a person-specific property because it varies between persons.

It is also construct-specific because it may vary between constructs.

The best strategy is to assess social desirability with specific measurement instruments.

  • Self-deception is related to constructs that are mainly relevant for persons themselves

  • Impression management is related to constructs that are mainly relevant to the person’s social relations.

Acquiescence and dissentience can only occur with endorsement items and not in frequency items.

The extremity and midpoint response styles can occur in both.

Acquiescence, dissentience, the extremity and midpoint styles can occur in both the reactive self-report and the reactive other-report measurement modes.

Social desirability can only occur in the reactive self-report mode.

Acquiescence and dissentience can be detected by including both indicative and contra-indicative items in the questionnaire.

The extremity and midpoint response styles are hard to detect.

Compiling the first draft of the test

The concept items that survived the pilot studies are used to compile the first draft of the test and instructions for test takers are added.

Usually, the instruction contains some example items to guarantee that test takers understand the test items.

Balanced test: consists of about 50% indicative and 50% contra-indicative items.

Social desirability items can also be added to the test.

Usually, indicative, contra-indicative, and social desirability items are arbitrarily mixed in the test.

The concept test is submitted to a group of experts. This group can be the same as the group that was used in the experts’ pilot study on item quality.

The group needs to have expertise in both the construct and test construction. The experts evaluate whether the test instruction is sufficiently clear for the population of test takers.

They study the content validation (whether the test adequately covers all aspects of the construct)

The comments of the experts are used to compile the first draft of the test.

The first draft is administered in a try-out to at least 200 test takers from the target population.

The try-out data are analyzed using methods of classical and modern test theory.

 

What are observed test scores? – Chapter 4

The aim of testing is to yield scores of test takers’ maximum or typical performance.

Two main types of test scores are distinguished

  • Observed test scores:
    Computed after the separate test items are scored.
    Derived from the item scores by taking the unweighted or weighted sum of the item scores.
    The latent variable is unobserved, and in general, the latent variable is not a simple sum of item scores.

  • Latent variable (construct) scores:
    To compute the latent variable score, a model is needed that specifies the relation between the latent variable and item responses.
    The latent variable score is derived from the item responses under the assumption of a latent variable item response model.

Item scoring by fiat

Conventionally, items are scored by assigning ordinal numbers to the responses.

The scoring differs slightly between the maximum and typical performance tests.

  • Maximum performance items are scored by assigning 0 to the lowest category, and consecutive rank numbers to subsequent categories.

  • Typical performance items are indicative or contra-indicative of the latent variable that is measured by the test, and the scoring of contra-indicative items has to be reversed with respect to the scoring of indicative items.

  • Dichotomous indicative typical performance items are scored by assigning 0 to the ‘no’ (disagree) category and 1 to the ‘yes’ (agree) category.
    Contra-indicative items are scored the other way around, by assigning 0 to the ‘yes’ and 1 to the ‘no’ category.

  • The categories of ordinal-polytomous items are scored by assigning rank numbers to the categories

  • Bounded-continuous items are scored in measurement units, such as centimeters.

Measurement by fiat: the item scores are assigned to a test taker’s responses without any theoretical justification.

(For example, the scores 1 and 0 assigned to a correct and an incorrect answer, or the scores 1 to 5 assigned to ordered categories, are based on convention (by fiat) and not on psychometric theory.)
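
A sketch of this conventional scoring by fiat (Python with numpy; the responses are hypothetical), including the reversed scoring of contra-indicative items:

    import numpy as np

    # Hypothetical responses of 3 test takers to a 4-item typical
    # performance test, rated on a 5-category scale scored 0 to 4.
    raw = np.array([[4, 0, 3, 1],
                    [2, 1, 4, 0],
                    [3, 3, 2, 2]])

    # Suppose items 2 and 4 (columns 1 and 3) are contra-indicative:
    # their scoring is reversed, so category 0 scores 4, 1 scores 3, etc.
    contra = [1, 3]
    scored = raw.copy()
    scored[:, contra] = 4 - scored[:, contra]
    print(scored)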

The sum score

The score of the jth test taker on the kth item is indicated by Xjk. The conventional test score of the jth test taker on an n-item test is the unweighted sum of his (or her) item scores:

USj = Xj1 + Xj2 + … + Xjn

It may be argued that items differ in importance, and that they should be weighted differently.

The weighted sum score of the jth test taker on an n-item test is:

WSj = w1Xj1 + w2Xj2 + … + wnXjn

where w1 is the weight assigned to the first item, and so on.

A problem with the weighted sum score is that the weights have to be determined before the weighted sum score can be computed.

The weighted and unweighted sum score will be highly correlated if the weights of the weighted sum do not differ substantially from each other.

In modern testing, the unweighted sum score is mostly preferred.
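
A minimal sketch of both sum scores (Python with numpy; the item scores and the weights are hypothetical illustrations):

    import numpy as np

    # Item scores Xjk: rows are test takers j, columns are items k.
    X = np.array([[1, 0, 1, 1],
                  [0, 0, 1, 0],
                  [1, 1, 1, 1]])

    # Unweighted sum score: USj = Xj1 + Xj2 + ... + Xjn
    US = X.sum(axis=1)

    # Weighted sum score: WSj = w1*Xj1 + ... + wn*Xjn
    w = np.array([1.0, 2.0, 1.0, 0.5])  # weights chosen for illustration only
    WS = X @ w

    print(US)  # [3 1 4]
    print(WS)  # [2.5 1.  4.5]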

The (un)weighted sum score combines the scores of n different test items.

A sum score only makes sense if the items predominantly reflect the same attribute.

Item writing based on a conceptual framework intends to write items that measure the attribute of interest.

Conceptually based item writing strives for a unidimensional test or a test of unidimensional subtests.

But, conceptually based item writing does not guarantee that the test is unidimensional.

The sum score also requires that each of the items is of sufficient quality.

The sum score is based on item scores that were assigned by fiat.

The observed test score distribution

S is used to indicate both the unweighted and the weighted sum score when the analysis or use is the same for both types of score.

US and WS are used when the analysis is different.

If a test is administered to a population of N test takers, the frequency distribution of the observed test scores can be computed.

A frequency distribution can be characterized by different parameters (a computational sketch follows this list):

  • Mean

  • Range

  • Variance

  • Standard deviation
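
A quick sketch computing these parameters for hypothetical observed test scores (Python with numpy):

    import numpy as np

    S = np.array([12, 15, 9, 20, 15, 11, 18, 14])  # hypothetical test scores

    print(S.mean())           # mean
    print(S.max() - S.min())  # range
    print(S.var(ddof=1))      # variance (sample estimate)
    print(S.std(ddof=1))      # standard deviation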

 

 

How can observed test scores be analysed? – Chapter 5

Measurement precision of observed test scores

Test scores are used in practical applications.

Measurement precision has two different aspects:

  • Information
    Applies to the test score of a single person
    The within-person aspect of measurement precision

  • Reliability
    Applies to a population of persons.
    The between-persons aspect of measurement precision

The concept of measurement precision applies to observed test scores as well as to latent variable scores.

Information on a single observed score

Functional thought experiment: a thought experiment that fulfills a function within a theory.

True test score: the expected value of the observed test scores of the repeated test administrations in the thought experiment.

Test taker j’s true test score is the expected value of his (or her) independently distributed observed test scores from (hypothetical) repeated administrations of the test to the test taker.

The observed test score is a variable that varies across repeated test administrations.

The true score is constant.

Error of measurement: the difference between test taker j’s observed test score and his (or her) true score.

Test taker j’s error of measurement on an arbitrary measurement occasion is the difference between his (or her) observed test score and his (or her) true test score.

The expected value of the errors of measurement is 0.

The within-person error variance is an index for the precision of the measurement of a person’s true score.

Test taker j’s standard error of measurement: the square root of his (or her) within-person error variance.

Information: the reciprocal of a person’s within-person error variance.

A small amount of information means that test taker j’s observed test scores vary widely around j’s true score across repeated test administrations.

A large amount of information means that j’s observed test scores do not vary widely around j’s true score.
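
The thought experiment can be mimicked in a small simulation (all values hypothetical; Python with numpy): repeated administrations scatter around a fixed true score, and the within-person error variance, the standard error of measurement, and the information follow directly.

    import numpy as np

    rng = np.random.default_rng(seed=1)

    true_score = 25.0  # test taker j's fixed true score
    error_sd = 2.0     # spread of j's errors of measurement

    # Hypothetical repeated administrations of the test to person j
    observed = true_score + rng.normal(0.0, error_sd, size=100_000)

    within_error_var = observed.var()      # within-person error variance
    sem_j = np.sqrt(within_error_var)      # j's standard error of measurement
    information = 1.0 / within_error_var   # reciprocal of the error variance

    print(round(observed.mean(), 2))  # ~25.0: the expected value is the true score
    print(round(sem_j, 2))            # ~2.0
    print(round(information, 3))      # ~0.25: more information = less spread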

Reliability of observed test scores in a population

Reliability: the differentiation of test scores of different test takers from a population.

Psychometrics uses two definitions of reliability

  • A theoretical definition

  • Operational definition.
    Yields procedures to assess reliability.

Reliability concerns the differentiation between the true test scores of different test takers from a population.

The differentiation is good if the test takers’ true scores can be precisely predicted from their observed test scores. Differentiation is bad if the test takers’ true scores cannot be precisely predicted from their observed test scores.

Theoretical reliability: the reliability of the observed test scores is the squared product moment correlation between observed and true test scores in a population of persons.

Parallel tests: tests that measure (1) the same true score with (2) equal within-person error variance, and (3) uncorrelated errors across (hypothetical) repeated test administrations for each of the test takers of a population.

Operational reliability: the reliability of the observed test score is the product moment correlation between observed test scores of parallel tests in a population of persons.
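
Both definitions can be illustrated with simulated population data (hypothetical variances; Python with numpy). With a true score variance of 100 and an error variance of 25, both computations should approach a reliability of 100 / 125 = .80.

    import numpy as np

    rng = np.random.default_rng(seed=2)
    N = 200_000

    T = rng.normal(50.0, 10.0, N)   # true scores in the population
    E1 = rng.normal(0.0, 5.0, N)    # errors of measurement, test form 1
    E2 = rng.normal(0.0, 5.0, N)    # errors of measurement, parallel form 2
    X1, X2 = T + E1, T + E2         # observed scores on the parallel tests

    # Theoretical reliability: squared correlation of observed and true scores
    rel_theoretical = np.corrcoef(X1, T)[0, 1] ** 2

    # Operational reliability: correlation between parallel test scores
    rel_operational = np.corrcoef(X1, X2)[0, 1]

    print(round(rel_theoretical, 3))  # ~0.80
    print(round(rel_operational, 3))  # ~0.80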

Some properties of classical test theory

Classical test theory is based on the definitions of test taker j’s true score and his (or her) error of measurement, and on generalization to a randomly selected person from a population of persons.

The standard error of measurement of a test

Standard error of measurement of a test: the square root of the error variance in the population of persons.

Lower bounds to reliability

The importance of a lower bound is that a high value of a lower bound implies that the theoretical reliability is high.

A well-known lower bound to the reliability is Cronbach’s coefficient alpha.
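
A sketch of coefficient alpha computed from its standard formula, alpha = n / (n − 1) × (1 − sum of the item score variances / variance of the sum score), applied to hypothetical item scores:

    import numpy as np

    # Hypothetical 0/1 item scores: rows are persons, columns are n = 4 items
    X = np.array([[1, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 1, 1, 1],
                  [0, 0, 1, 0],
                  [1, 0, 1, 1]])

    n = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)      # variance of each item
    total_var = X.sum(axis=1).var(ddof=1)  # variance of the sum score

    alpha = (n / (n - 1)) * (1 - item_vars.sum() / total_var)
    print(round(alpha, 3))  # ~0.519 for these data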

Test length and reliability

The reliability of a test depends on the number of test items.

Usually, a longer test is more reliable for measuring the same latent variable than a shorter test.

The length of an original n-item test can be doubled to a 2n-item test by adding an n-item parallel test to the original test.

The relation between the reliability of the doubled test and that of the original test is given by the Spearman-Brown formula for doubled test length.

This formula can also be used when a test is shortened. It is assumed that the test is shortened by removing parallel parts from the test.
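
A sketch of the Spearman-Brown formula in its general form (a standard classical result; this summary states only the doubled-length case). Lengthening by a factor k = 2 doubles the test, and k < 1 corresponds to shortening by removing parallel parts:

    def spearman_brown(rel: float, k: float) -> float:
        """Reliability of a test lengthened (k > 1) or shortened (k < 1)
        by a factor k, assuming the added or removed parts are parallel."""
        return k * rel / (1 + (k - 1) * rel)

    print(spearman_brown(0.60, 2))    # doubled test: 0.75
    print(spearman_brown(0.60, 0.5))  # halved test: ~0.43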

Correlation corrected for attenuation

The correlation between two tests is attenuated by the errors of measurement of each of the two observed test scores.

The correlation corrected for attenuation is the product moment correlation between the true scores of the two tests in a population of persons.
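
The standard correction for attenuation divides the observed correlation by the square root of the product of the two reliabilities; a sketch with hypothetical values:

    import math

    def corrected_for_attenuation(r_xy: float, rel_x: float, rel_y: float) -> float:
        """Estimated correlation between the true scores of tests X and Y."""
        return r_xy / math.sqrt(rel_x * rel_y)

    # Hypothetical: observed correlation .42, reliabilities .80 and .70
    print(round(corrected_for_attenuation(0.42, 0.80, 0.70), 3))  # 0.561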

Signal-to-noise ratio

Signal-to-noise ratio: the ratio of the true score variance and the error variance of the test in a population of persons.
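
Because reliability equals the true score variance divided by the total variance, the signal-to-noise ratio can equivalently be written as rel / (1 − rel); a minimal sketch of both forms:

    def signal_to_noise(true_var: float, error_var: float) -> float:
        return true_var / error_var

    # Equivalent form, since rel = true_var / (true_var + error_var)
    def snr_from_reliability(rel: float) -> float:
        return rel / (1 - rel)

    print(signal_to_noise(100.0, 25.0))  # 4.0
    print(snr_from_reliability(0.8))     # 4.0, the same population values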

Parameter estimation

The classical theory of psychometrics uses two different types of populations

  • The population of observed test scores across repeated test administrations.
    Infinite and defined for each of the test takers by using a functional thought experiment.

  • The population of persons
    Finite; it consists of N persons.
    Characterized by parameters that are also defined at two levels

    • Parameters at the level of the individual test taker

    • Parameters at the level of the population of persons

In statistics, population parameters are estimated from sample data.

The sample data are summarized in a statistic that is used to estimate the corresponding population parameter.

The sample mean is called the estimator of the population mean.

Estimation of population parameters

The population parameters are:

  • Mean (expected value)

  • Variance

  • Reliability

  • Standard error of measurement of the test

These parameters have to be estimated from a sample of persons from the population of interest.

The number of persons in the sample is Ns.

Number of persons in the population is N.

The theoretical reliability can be estimated in two different ways:

  • Estimation from parallel test correlation

  • Estimation from split-half subtest correlation

The standard error of measurement of a test can be estimated in the following ways (a combined sketch follows this list):

  • Estimation from the reliability

  • Estimation from the within-person error variances
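
A combined sketch of two of these estimators (simulated hypothetical data; the two halves are assumed parallel): the reliability is estimated by stepping up the split-half correlation with the Spearman-Brown formula, and the standard error of measurement is then estimated from the reliability via the standard classical relation SEM = SD × sqrt(1 − rel).

    import numpy as np

    rng = np.random.default_rng(seed=3)
    Ns = 500  # sample size

    # Hypothetical scores on two parallel half-tests
    T = rng.normal(20.0, 4.0, Ns)
    half1 = T + rng.normal(0.0, 2.0, Ns)
    half2 = T + rng.normal(0.0, 2.0, Ns)

    r_halves = np.corrcoef(half1, half2)[0, 1]
    rel_full = 2 * r_halves / (1 + r_halves)  # Spearman-Brown step-up

    total = half1 + half2                     # full-test scores
    sem = total.std(ddof=1) * np.sqrt(1 - rel_full)

    print(round(rel_full, 3))  # ~0.89 for these population values
    print(round(sem, 2))       # ~2.8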

 

 

How can items be scored? – Chapter 6

The conventional way of scoring items is by assigning ordinal numbers to the response categories.

Usually, these item scores are ordered with respect to the attribute that the item is assumed to measure. But, this assignment of these ordinal numbers lacks a theoretical justification.

Usually, the analysis of test scores is supplemented by an analysis of the item scores.

Item score distributions

The scores of a given item have a distribution in a population of N persons.

  • Location: the place of the scale where item scores are centered

  • Dispersion: the scatter of the item scores

  • Shape: the form of the distributions

Classical item difficulty and attractiveness

The location of the item score distribution is used to define the classical item difficulty (maximum performance tests) and classical item attractiveness (typical performance tests) concepts.

  • Classical item difficulty: a parameter that indicates the location of the item score distribution in a population of persons.

  • Classical item attractiveness: a parameter that indicates the location of the item score distribution in a population of persons.

The two definitions are the same.

Classical item difficulty and attractiveness are defined in a population of persons. They are population-dependent and may differ between populations.

The mean of the item scores is mainly used as the location parameter.

The mean of a dichotomously scored item is called the item p-value.

Item score variance and standard deviation

The most common parameters that are used in classical item score analysis are the variance and the standard deviation of the item scores.

Items that have a small item score variance have little effect on the test score variance.

The variance of dichotomous item scores is a function of the item p-value.

For a given sample size, the variance has its maximum value at p=.5.
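
A quick check of both facts with hypothetical 0/1 item scores (Python with numpy):

    import numpy as np

    scores = np.array([1, 0, 1, 1, 0, 1, 0, 1])  # hypothetical 0/1 item scores

    p = scores.mean()    # item p-value: 0.625
    print(p * (1 - p))   # variance of a 0/1 item: 0.234375
    print(scores.var())  # the same value, computed directly

    # p(1 - p) is largest at p = .5:
    for pv in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(pv, pv * (1 - pv))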

Classical item discrimination

Location and dispersion parameters yield useful information on the items of a test.

But, these parameters do not indicate the extent to which an item contributes to the aim of a test to assess individual differences in the attribute that is measured by the test.

Classical item discrimination: a parameter that indicates the extent to which the item differentiates between the true test scores of a population of persons.

It is defined in a population of persons and may vary between different populations.

The item-test and item-rest correlations

An appropriate index for discrimination between the true scores would be the product moment correlation between the item score and the true score in the population of persons.

Test taker j’s observed score is the estimator of his true score.

The population item-test correlation is estimated by the sample correlation.

Item-rest correlation: the product moment correlation between the item scores and the rest scores of the test, where the studied item is deleted.

The item reliability index

Item reliability index: the kth item reliability index is the product of the item-test correlation and item score standard deviation of the kth item in a population of persons.
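
A sketch computing the item-test correlation, the item-rest correlation, and the item reliability index for the first item of a hypothetical 0/1 score matrix (Python with numpy):

    import numpy as np

    # Hypothetical 0/1 item scores: rows are persons, columns are items
    X = np.array([[1, 1, 1, 0],
                  [0, 1, 0, 0],
                  [1, 1, 1, 1],
                  [0, 0, 1, 0],
                  [1, 0, 1, 1],
                  [1, 1, 0, 1]])

    k = 0                  # study the first item
    test = X.sum(axis=1)   # test (sum) scores
    rest = test - X[:, k]  # rest score: test score with item k deleted

    item_test = np.corrcoef(X[:, k], test)[0, 1]
    item_rest = np.corrcoef(X[:, k], rest)[0, 1]
    item_rel_index = item_test * X[:, k].std(ddof=1)

    print(round(item_test, 3), round(item_rest, 3), round(item_rel_index, 3))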

Distractor analysis

A maximum performance item presents the test taker with a problem that has to be solved.

The response mode of this type of item is either the free- (constructed-) response mode or the choice (selected-) response mode.

The most common response mode is multiple choice.

  • A multiple-choice item has one correct option and a number of distractors. If the test taker selects the correct option, his (or her) item score is 1; if he (or she) selects a distractor, the item score is 0.
    Classical item difficulty and discrimination indices are used for this type of dichotomous scoring, but the test takers’ distractor choices contain additional information on the multiple-choice item.

Item distractor popularity

Classical difficulty of a dichotomously scored multiple-choice item: the proportion of persons of the population who selected the correct answer to the item.

Item distractor popularity: the proportion of persons of a population who selected the distractor.

The item difficulty and the item distractor popularities are estimated in a sample of test takers by the proportions of test takers of the sample who selected the correct answer and the distractors, respectively.

The distractor popularities yield information on the appropriateness of the distractors. An unpopular distractor is selected by only a small proportion of test takers.

Apparently, most test takers know that the distractor is an incorrect answer to the item, which means that the item can do without this distractor.

Item distractor-item correlations

The usual way of scoring multiple-choice items is by assigning a 1 to the choice of the correct option, and assigning a 0 to the choice of a distractor.

Item distractor-rest correlations: the product moment correlations of the separate dichotomous correct answer/distractor variables and the rest score.

The item distractor-rest correlations yield detailed information on item quality.

A negative distractor-rest correlation indicates that the distractor tends to attract test takers who have lower true scores than the test takers who selected the correct answer, which is how a distractor should function.

A positive distractor-rest correlation indicates that the distractor discriminates in the wrong direction, because the distractor tends to attract test takers who have higher true scores than the test takers who selected the correct answer.
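
A sketch of a distractor analysis for one multiple-choice item (hypothetical data; option 'A' is taken to be the correct answer, and the rest scores stand in for the unobservable true scores):

    import numpy as np

    # Hypothetical option choices of 8 test takers on one item
    choices = np.array(['A', 'B', 'A', 'C', 'A', 'B', 'A', 'A'])

    # Hypothetical rest scores (test score with this item deleted)
    rest = np.array([17, 9, 15, 8, 14, 11, 16, 12])

    # Popularities: proportion of test takers choosing each option
    for opt in ['A', 'B', 'C']:
        print(opt, (choices == opt).mean())

    # Distractor-rest correlations: 0/1 choice indicator vs. rest score;
    # negative values are desirable here, as the distractor then attracts
    # test takers with lower rest scores.
    for opt in ['B', 'C']:
        indicator = (choices == opt).astype(float)
        print(opt, round(np.corrcoef(indicator, rest)[0, 1], 3))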

The internal structure of the test

The observed test score is the unweighted or weighted sum of the item scores.

The basic idea is that in a population of test takers the items that measure the same attribute are more highly associated with each other than associated with items that measure another attribute.

The association between items can be assessed by different coefficients.

  • The conventional approach is to use a correlation coefficient to assess inter-item association. The internal structure of the test is studied by searching for clusters of items that are highly correlated within clusters and less correlated between clusters.

Analysis of inter-item product moment correlations

The conventional approach is to compute product moment correlations between the item scores, and to look for clusters of items that are highly correlated within the cluster and less correlated with items of other clusters.

The product moment correlation between the scores of two items in a population of persons is estimated by the sample product moment correlation.

Phi coefficient: the product moment correlation coefficient between two dichotomous variables.

The phi coefficient between two items can only reach a maximum of 1 if the p-values are equal.
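
A sketch computing the phi coefficient and its ceiling for unequal p-values (the bound sqrt(p2(1 − p1) / (p1(1 − p2))), for p1 ≥ p2, is the standard maximum-phi result and is not given in this summary):

    import numpy as np

    item1 = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # p1 = .50
    item2 = np.array([1, 1, 0, 0, 0, 0, 0, 0])  # p2 = .25

    phi = np.corrcoef(item1, item2)[0, 1]
    print(round(phi, 3))  # 0.577: the maximum already, given this overlap

    # Maximum attainable phi for p1 >= p2:
    p1, p2 = item1.mean(), item2.mean()
    phi_max = np.sqrt((p2 * (1 - p1)) / (p1 * (1 - p2)))
    print(round(phi_max, 3))  # 0.577 < 1 because the p-values are unequal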
