What is reliability? - Chapter 5

What is reliability?

Chapter 5 is about the reliability of a test. Reliability is the extent to which differences in respondents' observed scores correspond with differences in their true scores. The closer this correspondence, the more reliable the test.

According to Classical Test Theory (CTT), reliability can be described in terms of observed scores (Xo), true scores (Xt) and error scores (Xe). Error scores are also called measurement errors.

Factors that cause differences between the observed and the true scores are called sources of error. They produce measurement errors, which create a discrepancy between the observed and the true scores.

In addition to these sources of error, there are temporary or transient factors that can influence the observed scores. Examples are the number of hours of sleep, emotional state, physical condition, guessing, or mismarked answers. The latter means that you know the correct answer but still mark the wrong one. These temporary or transient factors raise or lower the observed scores relative to the true scores.

To find out to what extent the observed scores are a function of measurement errors or of true scores, two questions must be asked:

Which part of the differences in observed scores is a function of real (true-score) inter-individual or intra-individual differences?
Which part of the differences in observed scores is a function of measurement errors?

In other words: Xo = Xt + Xe. The observed scores are determined by the true scores and the measurement errors. The smaller the values of Xe, the better. CTT assumes that measurement errors are random, which means that they are independent of the true scores Xt. In other words, measurement error is just as likely to raise or lower the score of someone with a high true score as that of someone with a low true score. This randomness has two consequences:

The average of all measurement errors within a test is zero.
Measurement errors do not correlate with true scores: rte = 0.

Instead of saying that reliability depends on the consistency between differences in observed scores and differences in true scores, you can also say: reliability depends on the relationships between the variability of the observed score, variability of the true score, and variability of the measurement error score.

Error score variance: Se² = ∑ (Xe - mean of Xe)² / N. The higher Se², the worse the measurement.
True score variance: St² = ∑ (Xt - mean of Xt)² / N.
Observed score variance: So² = ∑ (Xo - mean of Xo)² / N. Or: So² = St² + Se².

Strictly speaking, the variance of a sum also contains a covariance term, so the full formula is: So² = St² + Se² + 2 * rte * St * Se.

However, the true scores and the measurement errors are not correlated (rte = 0), and therefore 2 * rte * St * Se = 0. What remains is: So² = St² + Se².
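The variance decomposition can be checked numerically. The following is a minimal sketch, assuming four hypothetical respondents whose true and error scores were hand-picked (they are not from the book) so that the errors average zero and do not correlate with the true scores. The helper names `mean`, `pvar` and `pcov` are illustrative.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def pvar(xs):
    """Population variance (divide by N), matching the formulas in the text."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def pcov(xs, ys):
    """Population covariance (divide by N)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical scores, chosen only for illustration.
true_scores = [10.0, 10.0, 20.0, 20.0]   # Xt
error_scores = [-2.0, 2.0, -2.0, 2.0]    # Xe
observed = [t + e for t, e in zip(true_scores, error_scores)]  # Xo = Xt + Xe

st2 = pvar(true_scores)
se2 = pvar(error_scores)
so2 = pvar(observed)
rte = pcov(true_scores, error_scores) / sqrt(st2 * se2)

print(mean(error_scores))  # 0.0 (errors average out)
print(rte)                 # 0.0 (errors unrelated to true scores)
print(st2, se2, so2)       # 25.0 4.0 29.0, so So² = St² + Se²
```

In real data Xt and Xe are never observed separately, so this only shows that the algebra holds when the two CTT assumptions are met.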

What are the four types of reliability?

1. Reliability in terms of "proportions of variances"

Rxx (reliability coefficient) = St² / So²

Rxx = 0 means that everyone has the same true score (St² = 0).

Rxx = 1 means that the variance of the true scores is equal to the variance of the observed scores. In other words: there are no measurement errors!

Here is an example of interpretation of Rxx:

Rxx = 0.48 means that 48% of the differences in the observed scores can be attributed to differences in the true scores. Conversely, 1 - 0.48 = 0.52, so 52% of the differences can be attributed to measurement errors.

2. Reliability in terms of "lack of measurement error"

Rxx (reliability coefficient) = St² / So²

So² = St² + Se² (and therefore also: St² = So² - Se²)

Rxx = (So² - Se²) / So² = (So² / So²) - (Se² / So²)

In other words: Rxx = 1 - (Se² / So²): when (Se² / So²) is small, the reliability is high.

3. Reliability in terms of "correlations"

Rxx = Rot², where Rot² is the squared correlation between the observed scores and the true scores.

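Rot is the covariance between observed and true scores divided by So * St. Because the errors do not correlate with the true scores, this covariance equals St², so: Rot = St² / (So * St) = St / So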

Rot² = St² / So².

A reliability of 1.0 indicates that the differences between the observed test scores perfectly match the differences between the true scores. A reliability of 0.0 indicates that the differences between the observed scores are completely unrelated to the differences between the true scores.

4. Reliability in terms of "lack of correlation"

Rxx = 1 - Roe², where Roe² is the squared correlation between the observed scores and the error scores.

Roe = Se² / (So * Se) = Se / So

Roe² = Se² / So² so:

Rxx = 1 - Roe² = 1 - (Se² / So²).

If Roe = 0, then Rxx = 1.0

The greater the correlation between the observed scores and the error scores, the smaller Rxx. So reliability will be relatively high if the observed scores have a low correlation with the error scores.
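The four formulations are algebraically equivalent, which can be verified on a small example. The sketch below assumes hypothetical true and error scores for four respondents (hand-picked so that the errors average zero and are uncorrelated with the true scores) and computes Rxx in all four ways. The helper names are illustrative.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def pvar(xs):
    """Population variance (divide by N)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation based on population (co)variances."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / sqrt(pvar(xs) * pvar(ys))


# Hypothetical scores, chosen only for illustration.
true_scores = [10.0, 10.0, 20.0, 20.0]   # Xt
error_scores = [-2.0, 2.0, -2.0, 2.0]    # Xe
observed = [t + e for t, e in zip(true_scores, error_scores)]  # Xo

rxx_1 = pvar(true_scores) / pvar(observed)        # 1. St² / So²
rxx_2 = 1 - pvar(error_scores) / pvar(observed)   # 2. 1 - Se² / So²
rxx_3 = pearson(observed, true_scores) ** 2       # 3. Rot²
rxx_4 = 1 - pearson(observed, error_scores) ** 2  # 4. 1 - Roe²

for value in (rxx_1, rxx_2, rxx_3, rxx_4):
    print(round(value, 3))  # 0.862 each time (25 / 29)
```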

How is the size of the measurement error expressed?

Although reliability is an important psychometric concept, it does not directly reflect the magnitude of the measurement error of a test. An additional coefficient is therefore needed. The standard error of measurement expresses the average size of the error scores. The greater the standard error of measurement, the greater the average difference between observed scores and true scores, and therefore the lower the reliability of the test.

Standard error of measurement = sem

sem = So * √ (1 - Rxx)

If Rxx = 1, then sem = 0. So: the greater Rxx, the smaller sem.
sem is never greater than So: if Rxx = 0, then sem = So. For a given Rxx, a greater So means a greater sem.
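The formula and its two boundary cases can be written out as a short function. This is a minimal sketch. The function name and the input values (So = 10 with several reliabilities) are assumptions chosen for illustration.

```python
from math import sqrt


def standard_error_of_measurement(so, rxx):
    """sem = So * sqrt(1 - Rxx), with So the observed-score standard deviation."""
    if not 0.0 <= rxx <= 1.0:
        raise ValueError("rxx must lie between 0 and 1")
    return so * sqrt(1.0 - rxx)


print(standard_error_of_measurement(10.0, 1.0))   # 0.0  (no measurement error)
print(standard_error_of_measurement(10.0, 0.75))  # 5.0
print(standard_error_of_measurement(10.0, 0.0))   # 10.0 (sem equals So)
```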

How is the theory of reliability translated into practice?

The theory of reliability is based on three terms: true scores, observed scores, and error scores. But in practice we do not know whether a score is actually the true score of an individual. We also do not know to what extent measurement errors influence the response of an individual. How then do we translate the theory of reliability into practice?

Although we cannot determine with certainty what the reliability or standard error of measurement of a test is, methods have been developed to estimate them. Examples of such techniques are administering two versions of the test or administering the same test twice. In this section, four models are discussed that underlie the estimation of the reliability and standard error of measurement of a test:

the parallel test model;
the tau-equivalent test model;
the essentially tau-equivalent test model;
the congeneric test model.

Each model makes different assumptions about the way in which two or more tests are equivalent.

1. Parallel tests

We speak of parallel tests when two (or more) tests, in addition to the basic assumptions of classical test theory, meet the following three assumptions:

The two tests have the same error variance (se1² = se2²).
The intercept between the true scores on both tests is 0 (so a = 0, in Xt2 = a + b * Xt1).
The slope between the true scores on both tests is 1 (so b = 1, in Xt2 = a + b * Xt1).

These assumptions have six implications:

The true scores of test 1 are identical to the true scores of test 2 (Xt1 = Xt2).
Because the true score of each participant on test 1 is equal to his or her true score on test 2, the two sets of true scores correlate perfectly with each other (rt1t2 = 1).
The variances of the true scores of tests 1 and 2 are identical (st1² = st2²).
The average of the true scores of test 1 is equal to the average of the true scores of test 2.
The variance of the observed scores of test 1 is equal to the variance of the observed scores of test 2 (so1² = so2²).
The reliability of the two tests is the same (R11 = R22).

When the scores of two tests meet all these assumptions and implications, we speak of parallel tests.

Finally, according to CTT, there is one further implication that follows from the above: the correlation between parallel tests equals the reliability. In formula form: ro1o2 = R11 = R22. In other words, when two tests are (perfectly) parallel, the correlation between their observed scores is equal to the reliability of both tests.

The correlation between parallel tests can also be expressed in terms of the variances of the true and observed scores: ro1o2 = St² / So².
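That the correlation between parallel tests equals St² / So² can be illustrated with constructed data. The sketch below assumes eight hypothetical respondents with identical true scores on both tests and two hand-picked error patterns that have equal variance and are uncorrelated with the true scores and with each other. None of these numbers come from the book.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def pvar(xs):
    """Population variance (divide by N)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation based on population (co)variances."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / sqrt(pvar(xs) * pvar(ys))


# Hypothetical data: same true scores on both tests (Xt1 = Xt2).
true_scores = [10.0] * 4 + [20.0] * 4
errors_1 = [-2.0, -2.0, 2.0, 2.0] * 2   # error scores on test 1
errors_2 = [-2.0, 2.0] * 4              # error scores on test 2, same variance

observed_1 = [t + e for t, e in zip(true_scores, errors_1)]
observed_2 = [t + e for t, e in zip(true_scores, errors_2)]

reliability = pvar(true_scores) / pvar(observed_1)    # St² / So²
parallel_correlation = pearson(observed_1, observed_2)  # ro1o2

print(pvar(observed_1), pvar(observed_2))  # 29.0 29.0 (equal observed variances)
print(round(reliability, 3))               # 0.862
print(round(parallel_correlation, 3))      # 0.862
```

Because the observed scores are available in practice while the true scores are not, this equality is what makes the correlation between parallel tests usable as a reliability estimate.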

2. The tau-equivalent test model

In addition to the standard assumptions of classical test theory, the tau-equivalent test model is based on the following two assumptions:

The intercept between the true scores of both tests is 0 (so a = 0, in Xt2 = a + b * Xt1).
The slope between the true scores on both tests is 1 (so b = 1, in Xt2 = a + b * Xt1).

These two assumptions are the same as the second and third assumptions of parallel tests. The difference lies in the first assumption of parallel tests: the tau-equivalent test model does not assume equal error variances. This leads to four implications (the first four that were discussed under parallel tests).

Because of these less strict assumptions, the correlation between tau-equivalent tests is not a valid estimate of the reliability. This is in contrast to parallel tests, where the correlation between the tests is a valid estimate of the reliability.

3. The essentially tau-equivalent test model

In addition to the standard assumptions of classical test theory, the essentially tau-equivalent test model is based on one additional assumption:

The slope between the true scores on both tests is 1 (so b = 1, in Xt2 = a + b * Xt1).

This leads to two implications (both also discussed under parallel tests): rt1t2 = 1 (the correlation between the true scores of the tests is perfect), and st1² = st2² (the variances of the true scores of both tests are equal).

4. Congeneric test model

The last model is the congeneric test model. According to this model, only the assumptions of classical test theory are made, without additional restrictions on the intercept, the slope or the error variances. This results in a single implication, namely that the correlation between the true scores of the tests is perfect: rt1t2 = 1. This model is therefore the least strict and the most general model. Although this model applies more often (the more restrictive models are special cases of it), it offers limited possibilities for estimating reliability.
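The assumptions that separate the four models can be summarised as a small decision rule. The sketch below is only an illustration: the function name, the tolerance parameter and the requirement of a positive slope are assumptions added here, not part of the text. It returns the most restrictive model whose assumptions are met for Xt2 = a + b * Xt1.

```python
def classify_test_pair(intercept, slope, error_var_1, error_var_2, tol=1e-9):
    """Return the strictest CTT test model consistent with the given values.

    intercept (a) and slope (b) describe Xt2 = a + b * Xt1;
    error_var_1 and error_var_2 are the error variances of the two tests.
    """
    if slope <= 0:
        # Assumption of this sketch: the true scores are positively related.
        raise ValueError("slope must be positive")

    slope_is_one = abs(slope - 1.0) <= tol
    intercept_is_zero = abs(intercept) <= tol
    equal_error_variances = abs(error_var_1 - error_var_2) <= tol

    if slope_is_one and intercept_is_zero and equal_error_variances:
        return "parallel"
    if slope_is_one and intercept_is_zero:
        return "tau-equivalent"
    if slope_is_one:
        return "essentially tau-equivalent"
    return "congeneric"


print(classify_test_pair(0.0, 1.0, 4.0, 4.0))  # parallel
print(classify_test_pair(0.0, 1.0, 4.0, 9.0))  # tau-equivalent
print(classify_test_pair(3.0, 1.0, 4.0, 9.0))  # essentially tau-equivalent
print(classify_test_pair(3.0, 0.5, 4.0, 9.0))  # congeneric
```

Each model further down the function drops one assumption of the model above it, which is why the congeneric model is the most general.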

What is the 'Domain Sampling Theory'?

According to this theory, reliability is the average size of the correlations between all possible pairs of tests of N items each, selected from a domain of test items. The logic of this theory is the foundation of generalizability theory, which is discussed extensively in Chapter 13.
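The idea of averaging correlations over pairs of tests can be sketched in a few lines. The example below is a simplification: instead of all possible tests drawn from an item domain, it assumes just three hypothetical test forms (A, B and C), built from the same made-up true scores with different hand-picked error patterns, and averages the correlations over every pair of forms.

```python
from itertools import combinations
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def pvar(xs):
    """Population variance (divide by N)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation based on population (co)variances."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / sqrt(pvar(xs) * pvar(ys))


# Hypothetical data for eight respondents (illustrative only).
true_scores = [10.0] * 4 + [20.0] * 4
error_patterns = {
    "A": [-2.0, -2.0, 2.0, 2.0] * 2,
    "B": [-2.0, 2.0] * 4,
    "C": [-4.0, 4.0, 4.0, -4.0] * 2,   # a noisier form
}
forms = {
    name: [t + e for t, e in zip(true_scores, errors)]
    for name, errors in error_patterns.items()
}

# Correlation for every pair of forms, then the average over all pairs.
pair_correlations = {
    (a, b): pearson(forms[a], forms[b]) for a, b in combinations(sorted(forms), 2)
}
domain_sampling_estimate = mean(list(pair_correlations.values()))

for pair, r in pair_correlations.items():
    print(pair, round(r, 3))
print(round(domain_sampling_estimate, 3))
```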
