Test scores can be used to estimate reliability and to estimate measurement error. In this chapter three methods are discussed for estimating reliability: (1) alternate-forms reliability (also known as the parallel-test method); (2) test-retest reliability; (3) internal consistency. This chapter also looks at the reliability of difference scores, which are used to measure cognitive growth, symptom reduction, personality change, etc.
What does the alternate forms method include?
The first method uses a parallel test to estimate reliability. With this method there are two tests: the test whose scores we are actually interested in, and a second, parallel test that also produces scores. With these two sets of scores, the correlation between the test scores and the parallel-test scores can be calculated. This correlation can then be interpreted as an estimate of reliability. The two tests are parallel if they measure the same set of true scores and have the same error variance. The correlation between two parallel tests is equal to the reliability of the test scores. A practical problem with using a parallel test is that we never know for sure whether it meets the assumptions of classical test theory. We can never be sure that the true scores on the first form are the same as the true scores on the parallel form. Different test forms have different content, which can undermine their parallelism. If the parallel test is not truly parallel to the first test, then the correlation is not a good estimate of the reliability.
Another possible problem with the parallel test is carry-over or contamination effects caused by repeated testing. The test takers may already be influenced by the first test, and their condition may be different during the parallel test. As a result, their performance can differ and the estimate becomes less trustworthy. Under classical test theory, the error in every test is random. If the test takers are affected by the first test, the error scores of the two tests correlate with each other, which is not allowed by classical test theory. This means that the two tests are not completely parallel.
Two assumptions for a parallel test are that the true scores are the same and that the error variance in both tests is the same. The means of the observed scores of both tests must therefore also be the same, and the tests must have the same standard deviations. If all of this holds and we are confident that the two tests measure the same construct, then we can use the correlation as an estimate of reliability. This estimate of reliability is called alternate-forms reliability.
How is the test-retest reliability calculated?
This method is useful for measuring stable psychological constructs such as intelligence and extraversion. The same people take the same test more than once. If the assumptions hold, the correlation between the first scores and the repeated scores can be calculated. This correlation is the test-retest estimate of reliability. The applicability of the test-retest method depends on a number of assumptions. Just as with the parallel test, the true scores must be the same on both occasions. The error variance of the first administration must also be the same as the error variance of the second. If these assumptions are met, the correlation between the scores from the two administrations is an estimate of the reliability of the scores.
The assumption that the true scores are the same for both administrations does not always hold. First, some constructs are less stable than others. Mood tests, for example, are less stable than trait tests: on a mood test, one can feel very happy during the first administration and more depressed a little later, during the second. This gives different true scores and makes the estimate less trustworthy. The length of the interval between the administrations is a second factor in the stability of the tests. With longer intervals, larger psychological changes can occur and the true scores may change; short intervals can cause carry-over or contamination effects. Many test-retest analyses use an interval of 2 to 8 weeks. A third factor is the period in which the tests are conducted. Test takers (especially children) may go through a period of development between the two administrations, and then the true scores are no longer the same.
If the true scores remain the same across the two administrations, the correlation between the two tests indicates the extent to which measurement error influences the test scores. The lower the correlation, the more influence measurement error has had and the less reliable the scores are. A difficulty with the test-retest method is that one is never sure whether the true scores have remained the same. If the true scores change, the correlation reflects not only the influence of measurement error but also the degree of change in the true scores, and these two influences cannot be separated with simple formulas. A test-retest correlation could therefore be low due to differences in the true scores, while the reliability of the test is actually high. The parallel-test method and the test-retest method can be theoretically useful, but in practice they are often difficult: they can be expensive and time-consuming. That is why these methods are not often applied.
How is internal consistency used to estimate reliability?
Internal consistency is a good alternative to the parallel-test and test-retest methods. Its advantage is that only a single test administration is needed. A composite score is a score calculated from multiple items: the total score over the test responses. Internal consistency can therefore be used for any test that has more than one item. The idea behind internal consistency is that parts of a test (items or groups of items) can be treated as different forms of the test. Internal consistency is used in many areas of behavioral science. Two factors influence the reliability of the test scores. The first is whether the parts of the test are equivalent to each other: if the parts correlate strongly with each other, the test is reliable. The length of the test is the second factor: a long test is more reliable than a short test. There are three common ways to estimate internal consistency: the split-half method, the "raw alpha" method and the "standardized alpha" method.
Estimates of split half reliability
The split-half reliability is obtained by splitting the test in two and calculating the correlation between the two parts. In effect, two small parallel tests are created. The split-half method consists of three steps. The first step is to divide the test into two halves and compute a score for each half. The second step is to calculate the correlation between the two half scores. This split-half correlation (rhh) indicates the degree to which the two halves are equivalent. The third step is to enter the correlation into a formula to estimate the reliability (Rxx). This is done with the Spearman-Brown formula:
Rxx = (2 * rhh) / (1 + rhh)
The formula must be used because the correlation concerns half a test, not a whole test as in the other methods. Because it is a correlation within a single test, it is called an internal-consistency estimate of reliability. The two halves of the test must have the same true scores and the same error variance; the means and standard deviations must also be the same. If the two halves do not meet these criteria, the reliability estimate suffers. One could then split the items differently, but because the halves are not parallel, a different split can yield a different correlation. For this reason, the split-half method is not often used.
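The three steps can be sketched in Python. This is a minimal illustration with made-up item scores and an assumed odd-even split; the function names are my own, not from the chapter.

```python
import statistics

def pearson_r(x, y):
    """Plain Pearson correlation between two lists of scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def split_half_reliability(item_scores):
    """item_scores: one list of item scores per person; odd-even split."""
    half1 = [sum(person[0::2]) for person in item_scores]  # items 1, 3, 5, ...
    half2 = [sum(person[1::2]) for person in item_scores]  # items 2, 4, 6, ...
    r_hh = pearson_r(half1, half2)                         # step 2: correlate halves
    return (2 * r_hh) / (1 + r_hh)                         # step 3: Spearman-Brown step-up

# Six hypothetical people answering four items:
scores = [[4, 5, 4, 5], [2, 3, 2, 2], [5, 5, 4, 4],
          [1, 2, 2, 1], [3, 3, 4, 3], [4, 4, 5, 5]]
rxx = split_half_reliability(scores)
```

As the text notes, a different split of the same items can yield a different estimate, which is the main weakness of this method.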
Measuring reliability through internal consistency has an additional problem with regard to power tests and speed tests. In power tests, the test takers have ample time to answer and the questions differ in difficulty. In speed tests, the test takers have a limited time in which to answer as many questions as possible, and the questions are equal in difficulty. If you use the split-half method on a speed test, it mainly reflects the consistency of a person's response speed. Since all questions are of the same difficulty, a test taker will spend approximately the same amount of time on each question. Because of this the estimated reliability is almost always around 1.0, which is why the split-half method is almost never used for speed tests.
Cronbach’s Alpha ("raw" coefficient alpha)
When each item is regarded as a subtest, internal consistency becomes much more useful. Calculating internal consistency at the item level involves two steps. In the first step, all required statistics are calculated. In the second step, these statistics are entered into formulas to estimate the reliability of the entire test.
The most used method is Cronbach’s alpha, also known as raw coefficient alpha. For this we first calculate the variance of the scores over the entire test (sx²). The covariance between each pair of items is then calculated. If the covariance between a pair of items is 0, it is possible that not every item measures the same construct, or that measurement error has a major influence on an item; either way, the test has some problems. After all covariances have been calculated, they are added together. The larger this sum, the more the items agree with each other. The next step is to estimate reliability with the following formula:
α = estimated Rxx = (k / (k - 1)) * (∑cii′ / sx²)
Here k is the number of items in the test and ∑cii′ is the sum of the inter-item covariances.
There are different formulas for calculating Cronbach’s alpha. An equivalent formula, based on the item variances (si²), is: α = estimated Rxx = (k / (k - 1)) * (1 − ∑si² / sx²)
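As a sketch, both versions of the alpha formula can be computed from the same made-up data; under classical test theory they give identical results, because the total-score variance equals the summed item variances plus the summed inter-item covariances. Nothing here is code from the book.

```python
import statistics

def alpha_from_variances(item_scores):
    """Raw alpha via (k/(k-1)) * (1 - sum(si^2)/sx^2)."""
    k = len(item_scores[0])
    totals = [sum(person) for person in item_scores]
    sx2 = statistics.pvariance(totals)             # variance of total scores
    si2 = sum(statistics.pvariance([p[i] for p in item_scores])
              for i in range(k))                   # summed item variances
    return (k / (k - 1)) * (1 - si2 / sx2)

def alpha_from_covariances(item_scores):
    """Raw alpha via (k/(k-1)) * (sum of inter-item covariances / sx^2)."""
    k = len(item_scores[0])
    cols = [[p[i] for p in item_scores] for i in range(k)]

    def pcov(x, y):
        mx, my = statistics.mean(x), statistics.mean(y)
        return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

    sx2 = statistics.pvariance([sum(p) for p in item_scores])
    cov_sum = sum(pcov(cols[i], cols[j])
                  for i in range(k) for j in range(k) if i != j)
    return (k / (k - 1)) * (cov_sum / sx2)

scores = [[4, 5, 4, 5], [2, 3, 2, 2], [5, 5, 4, 4],
          [1, 2, 2, 1], [3, 3, 4, 3], [4, 4, 5, 5]]
a1 = alpha_from_variances(scores)
a2 = alpha_from_covariances(scores)
```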
Standardized coefficient alpha
Another method is to use the general Spearman-Brown formula, also known as the standardized alpha estimate. This method gives roughly the same outcome as the raw Cronbach’s alpha and is available in computer programs such as SPSS. If a test uses standardized scores or z-scores, the standardized alpha gives a better estimate of the reliability. The standardized alpha is based on correlations. As a first step, we calculate the correlation between each pair of items, just as with the raw alpha. These correlations reflect the extent to which the differences between the responses of the participants match. We then calculate the average (r̄ii′) of all these correlations. The next step is to enter this average correlation into the more general form of the Spearman-Brown formula:
Rxx = (k * r̄ii′) / (1 + (k - 1) * r̄ii′)
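The two steps above can be sketched as follows, again with made-up data and my own function names: average all pairwise inter-item correlations, then apply the general Spearman-Brown formula.

```python
import statistics
from itertools import combinations

def pearson_r(x, y):
    """Plain Pearson correlation between two lists of scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def standardized_alpha(item_scores):
    """Standardized alpha: general Spearman-Brown formula applied to
    the average inter-item correlation."""
    k = len(item_scores[0])
    cols = [[p[i] for p in item_scores] for i in range(k)]
    r_bar = statistics.mean(pearson_r(cols[i], cols[j])
                            for i, j in combinations(range(k), 2))
    return (k * r_bar) / (1 + (k - 1) * r_bar)

scores = [[4, 5, 4, 5], [2, 3, 2, 2], [5, 5, 4, 4],
          [1, 2, 2, 1], [3, 3, 4, 3], [4, 4, 5, 5]]
std_alpha = standardized_alpha(scores)
```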
Raw Alpha for binary items: KR20
Many psychological tests have binary items (items with two possible answers). For these tests a special formula can be used to estimate the reliability: the Kuder-Richardson 20 (KR20) formula. This is based on two steps. First all required statistics are collected.
These are, for each item, the proportion of correct answers (p) and the proportion of incorrect answers (q = 1 − p). Then the variance of each item is calculated as si² = pq, along with the variance of the total test scores (sx²). The second step is to enter these statistics into the Kuder-Richardson formula (KR20):
Rxx = (k / (k - 1)) * (1 − ∑pq / sx²)
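A minimal sketch of KR20, with hypothetical 0/1 answer data (the names are mine):

```python
import statistics

def kr20(item_scores):
    """item_scores: one list of 0/1 item scores per person."""
    k = len(item_scores[0])
    n = len(item_scores)
    sx2 = statistics.pvariance([sum(person) for person in item_scores])
    pq_sum = 0.0
    for i in range(k):
        p = sum(person[i] for person in item_scores) / n  # proportion correct
        pq_sum += p * (1 - p)                             # item variance si^2 = p*q
    return (k / (k - 1)) * (1 - pq_sum / sx2)

answers = [[1, 1, 1, 0], [1, 0, 1, 0], [1, 1, 1, 1],
           [0, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
rxx = kr20(answers)
```

For binary items, KR20 gives the same value as the raw alpha formula, since p*q is the (population) variance of a 0/1 item.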
Omega
The omega coefficient applies to the same types of scales as alpha (usually continuous items that are combined into a total test score). Omega is based on the idea that reliability can be defined as a signal-to-noise ratio; in other words: reliability = signal / (signal + noise). A more detailed discussion of omega is beyond the scope of this book.
Assumptions for the alpha and omega
The accuracy of the reliability estimates described above depends on the validity of certain assumptions. In summary, the alpha method only gives accurate reliability estimates when the items are essentially tau-equivalent or parallel (see Chapter 5 for a discussion of these models). Omega is more broadly applicable; it also provides accurate reliability estimates for congeneric tests.
Theory and reality of accuracy and the use of internal consistency estimators
Many researchers do not check the assumptions that underlie alpha. Alpha is the method usually chosen to calculate reliability because it is easy to compute and the test takers are needed only once. Little attention is paid to the assumptions, because the assumptions for alpha are less strict and therefore more easily satisfied. If the items are essentially equivalent to each other, the KR20 and alpha estimates are accurate; the error variances do not have to be equal. If the items are not equivalent, the KR20 and alpha will underestimate reliability. The reliability can also be overestimated: because only one test administration is used in the calculation of alpha, the error variance may be underestimated. In general, Cronbach’s alpha is used the most, because its assumptions are the least demanding and it gives a reasonably trustworthy estimate.
Internal consistency and dimensionality
The internal consistency of items is distinct from the conceptual homogeneity of the items (whether the items are unidimensional). The reliability of a test can be high even if the test measures multiple attributes (conceptual heterogeneity / multidimensionality). An internal-consistency estimate of reliability therefore says little about the conceptual homogeneity or the dimensionality of the test.
Which factors can influence the reliability of test scores?
There are two factors that contribute to internal-consistency reliability. The first factor is the equivalence of the parts of the test, which has a direct effect on the reliability estimate. If the correlation between the parts is positive, the parts are consistent with each other; what matters is the size of the correlation. Items that lower the correlation can be removed from the test or rewritten, which may result in a higher correlation and thus a higher internal consistency and a higher reliability.
The second factor that can affect reliability is the length of the test. Long tests are more reliable than short tests: with longer tests, the true-score variance rises faster than the error variance. Reliability can also be written with this formula:
Rxx = st² / (st² + se²)
Here st² is the true-score variance, se² the error variance, and st² + se² = so² (the observed-score variance). If we double the length of the test, the true-score variance becomes:
st²(doubled) = 4 * st²(original)
From this formula we can conclude that when we double the length of the test, the true-score variance becomes four times as large. The error variance follows a different formula when the test is lengthened:
se²(doubled) = 2 * se²(original)
Here we see that when the test is doubled, the error variance only doubles. Entering these quantities into the reliability formula gives:
Rxx(doubled) = 4 * st²(original) / (4 * st²(original) + 2 * se²(original))
This formula can be converted to the following formula:
Rxx(doubled) = (2 * Rxx(original)) / (1 + Rxx(original))
The general formula for a test that is extended or shortened is a Spearman-Brown formula (prediction formula):
Rxx(new) = (n * Rxx(original)) / (1 + (n - 1) * Rxx(original))   or
Rxx = (k * r̄ii′) / (1 + (k - 1) * r̄ii′)
Here n is the factor by which the test is lengthened or shortened, and Rxx(original) is the reliability estimate of the original version of the test. In the second formula, k is the number of items in the new version of the test and r̄ii′ is the average inter-item correlation.
The average inter-item correlation can be calculated if we know the standardized Alpha and the number of items:
r̄ii′ = Rxx / (k − (k − 1) * Rxx)
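As a sketch, the prophecy formula and its inverse (recovering the average inter-item correlation from a standardized alpha) are straightforward to compute; the numbers below are illustrative only, and the function names are mine.

```python
def prophecy(r_original, n):
    """Predicted reliability when a test is lengthened by a factor n
    with parallel items (Spearman-Brown prophecy formula)."""
    return (n * r_original) / (1 + (n - 1) * r_original)

def average_interitem_r(alpha, k):
    """Average inter-item correlation implied by a standardized alpha
    for a k-item test (inverse of the general Spearman-Brown formula)."""
    return alpha / (k - (k - 1) * alpha)

# Doubling a test with reliability .70 raises the predicted reliability:
r_doubled = prophecy(0.70, 2)
# Average inter-item correlation for a 10-item test with alpha = .81:
r_bar = average_interitem_r(0.81, 10)
```

Note the diminishing returns: each further doubling adds less reliability than the previous one.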
Lengthening a test is therefore useful for reliability, but the new items must be parallel to the items already in the test. Lengthening also has diminishing returns: adding items to an already long test improves reliability less than adding them to a short test.
Heterogeneity and general reliability
Another factor that influences reliability is heterogeneity. The greater the variability (heterogeneity) between the test takers (and their true scores), the greater the reliability coefficient. If one examines a sample with a lot of heterogeneity on a trait, the reliability is higher than in a sample with little heterogeneity on that trait. This has two important implications. The first is that it emphasizes that reliability is a characteristic of test scores, not of the test itself. The second is that differences in heterogeneity can be examined in reliability generalization studies. These studies look at the extent to which reliability estimates from different studies using the same test are similar, and they can be used to identify and understand how the characteristics of a sample influence the reliability of test scores.
How is the reliability of difference scores determined?
There are also studies that look at how much a group of test takers changes compared to another group. This, too, has to do with variability: one wants to know how much variation there is in the amount of change across test takers. One method to see how much a person has changed on a trait is to administer the test twice and subtract one score from the other. This yields the difference score (Di = Xi − Yi). A positive score indicates an increase, a negative score a decrease, and a score of 0 means that no change has taken place.
There are different types of difference scores. A difference score can be calculated within a person (an intra-individual change score), where the same test is taken twice. Another type is the intra-individual discrepancy score, where two measurements are also taken from the same person but a different test is used the second time. In addition, an inter-individual difference score can be calculated, in which two different people take the same test and the score of one person is subtracted from the score of the other.
Estimate the reliability of the difference scores
Estimating the reliability of difference scores requires three things: the reliability of both tests used to calculate the difference scores (Rxx and Ryy); the variability of the observed scores of the tests (sxo², syo², sxo, syo); and the correlation between the observed test scores (rxoyo).
The formula for the reliability of the difference scores is:
Rd = (sxo² * Rxx + syo² * Ryy − 2 * rxoyo * sxo * syo) / (sxo² + syo² − 2 * rxoyo * sxo * syo)
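A sketch of this formula in code, with made-up input values (the function name and inputs are my own):

```python
def difference_score_reliability(rxx, ryy, sx, sy, rxy):
    """Reliability of difference scores from the two tests' reliabilities
    (rxx, ryy), observed standard deviations (sx, sy), and the observed
    correlation between the tests (rxy)."""
    common = 2 * rxy * sx * sy
    return (sx**2 * rxx + sy**2 * ryy - common) / (sx**2 + sy**2 - common)

# Two tests with reliability .85 each and equal SDs of 10:
rd_high_corr = difference_score_reliability(0.85, 0.85, 10, 10, 0.70)
rd_low_corr = difference_score_reliability(0.85, 0.85, 10, 10, 0.30)
```

With these illustrative numbers, the more strongly correlated pair yields a much lower difference-score reliability, and neither value exceeds the average reliability of the two tests (.85).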
Factors that influence the reliability of the difference scores
There are two factors that determine whether a set of difference scores will have good reliability. The first is the correlation between the observed scores of the two tests: as this correlation becomes higher, the reliability of the difference scores decreases. The second factor is the reliability of the two tests used to calculate the difference scores: if the tests have high reliability, the difference scores will generally also have relatively high reliability.
The reliability of the difference scores cannot be higher than the average reliability of the two individual test scores. But the reliability of the difference scores can be much smaller than the reliability of the two individual test scores.
Unequal variability
In some cases, difference scores are not a clear reflection of psychological reality and mainly reflect only one of the two variables. This can happen when the two tests have unequal variability, for example because they use different measurement scales. The scores must then first be standardized before the difference scores are calculated, so that both variables have a mean of 0 and a standard deviation of 1; only then can the test takers be compared accurately. Even so, a difference score is not automatically meaningful once the metrics are the same: it only makes sense to calculate a difference score if the two test scores share a common psychological attribute.
Especially when analyzing discrepancy scores, it is important to first standardize the tests before calculating the difference scores.
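A minimal sketch of this standardization step, with made-up scores on two different metrics:

```python
import statistics

def z_scores(values):
    """Standardize to mean 0 and (population) standard deviation 1."""
    m, s = statistics.mean(values), statistics.pstdev(values)
    return [(v - m) / s for v in values]

test_x = [40, 55, 60, 45, 50]        # e.g. a 0-100 point scale
test_y = [3.0, 4.5, 4.0, 2.5, 3.5]   # e.g. a 1-5 rating scale
# Discrepancy scores computed on the common z-score metric:
diffs = [x - y for x, y in zip(z_scores(test_x), z_scores(test_y))]
```

Subtracting the raw scores directly would let the 0-100 scale dominate; after standardization, both tests contribute on the same metric.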