Summary of Psychometrics: An Introduction by Furr - 3rd edition

Summaries per chapter of the 3rd edition of Psychometrics: An Introduction by Furr

What is psychometrics? - Chapter 1

Psychometrics plays an important role in daily life. Whether you are a student, teacher, parent, psychologist, mathematician, or physicist, everyone has to deal with psychological tests. Psychological tests may have influenced your education, career, health, prosperity, and so on. Psychometrics can even bear on questions of life and death. In some countries, for example, people with a severe cognitive impairment (an intelligence significantly below average) cannot receive the death penalty. But what is significantly below average? And how can we determine whether an individual's intelligence falls below this limit? These kinds of difficult questions are answered through psychological tests. All in all, psychometrics extends beyond psychological research: it plays a role in daily life, and everyone has to deal with it.

How are psychological characteristics measured?

Psychologists use instruments to measure observable events in the physical world. Sometimes psychologists measure a certain type of behavior solely because they are interested in that behavior in itself, but behavioral scientists mainly measure human behavior in order to capture unobservable psychological attributes. We identify a certain observable behavior and assume that it represents a certain unobservable psychological process, attribute, or attitude. You must therefore ensure that what you are measuring is indeed what you aim to measure. In social science, theoretical concepts such as short-term memory are often used to explain differences in human behavior. Psychologists call these theoretical concepts hypothetical constructs or latent variables: theoretical psychological properties, attributes, processes, or states that cannot be observed directly. The procedures or operations with which these hypothetical constructs are measured are called operational definitions.

What is a psychological test?

According to Cronbach, a psychological test is a systematic procedure for comparing the behavior of two or more people. A test must meet three conditions:

  1. the test must consist of samples of behavior;
  2. the behavioral samples must be collected in a systematic manner; and
  3. the purpose of the test must be to compare the behavior of two or more people (inter-individual differences).

It is also possible that we measure the behavior of an individual at different times, in which case we speak of intra-individual differences.    

Different types of testing

Tests can be distinguished by their content, by the type of answer that is used (open-ended or closed-ended), and by the methods used in the measurement.

A distinction is also made between the different purposes of testing: criterion-referenced versus norm-referenced. Criterion-referenced tests (also called domain-referenced tests) are most common in situations where a statement must be made about a specific skill of a person. A single predetermined cutoff score is used to divide people into two groups: people whose score is higher than the cutoff score and people whose score is lower than the cutoff score.

Norm-referenced tests are mainly used to compare the scores of a person with the scores of a norm group. Nowadays, it is difficult to draw a sharp line between criterion-referenced and norm-referenced tests.

Another well-known distinction is that between so-called speed tests and power tests. Speed tests are time-limited: it often happens that not all questions can be answered within the allotted time, and we look at how many questions someone answers correctly in the given time. Power tests are not time-limited; here it is highly likely that a person can attempt all the questions. The questions typically become progressively more difficult, and we look at how many questions people answer correctly.

Finally, the difference between reflective (effect) indicators and formative (causal) indicators is briefly discussed. Scores on intelligence or personality tests are examples of reflective/effect indicators: they are usually considered a reflection, or consequence, of a person's intelligence level. Formative/causal indicators work the other way around. Socio-economic status (SES), for example, can be quantified by combining different indicators such as income, education level, and occupation. In this case the indicators are not caused by SES; rather, the indicators in part define SES. This book focuses on test scores derived from reflective/effect indicators, which is typical of most tests and measurements in psychology.

What is psychometrics?

Psychometrics focuses on the attributes of tests. Just as psychological tests are designed to measure psychological attributes of people, psychometrics is the science concerned with the attributes of psychological tests. Three attributes are important: the type of data a test produces (mainly scores), the reliability of the test, and its validity. Psychometrics is about the procedures with which these test attributes are estimated and evaluated.

Psychometrics rests on two important foundations. The first foundation is the practice of psychological testing and measurement. The use of formal tests to measure skills of individuals goes back 2,000 years, or maybe even 4,000 years. Especially in the last 100 years there has been a huge increase in the number, type, and application of psychological tests. The second foundation is the development of statistical concepts and procedures. From the beginning of the nineteenth century, scientists became increasingly aware of the importance of statistical concepts and procedures. This led to an increase in knowledge about how the quantitative data resulting from psychological tests can be understood and analyzed. Pioneers in this field are Charles Spearman, Karl Pearson, and Francis Galton.

Francis Galton was obsessed with measurement, mainly so-called 'anthropometry': measurements of human characteristics such as the size of the head, the length of an arm, and the physical strength of the body. According to Galton, these physical properties reflect psychological characteristics, and he called such measurements of mental traits 'psychometrics'. Galton was primarily interested in the ways in which people differ; his point of view became known as differential psychology, the study of individual differences.

Psychometrics is the collection of procedures used to measure variability in human behavior and to connect these measurements to psychological phenomena. Psychometrics is a relatively young, but rapidly developing, scientific discipline.

What are the challenges in psychometrics?

Behavioral science resembles many other sciences, but it faces its own measurement challenges.

One of those challenges is to identify and capture the important aspects of a human psychological attribute in a single number.

A second challenge is participant reactivity. When participants know that they are being tested and why, this in itself influences their responses. For example, if a participant knows that the test measures racism but does not want this to show, this influences his or her answers. Examples of participant reactivity are demand characteristics (responding to what the participant thinks the researcher's goal is), social desirability (presenting oneself the way the outside world wants to see you), and malingering (deliberately trying to leave a bad impression).

A third challenge is that psychologists rely on so-called composite scores: scores on items that have something in common are combined. For example, in a questionnaire with ten questions about extraversion, the scores on these ten questions are combined into one total score.

A fourth challenge in psychological measurement is the problem of score sensitivity. Sensitivity refers to the ability of a measure to detect meaningful differences. For example, a psychologist may want to know whether a patient's mood has changed; if the psychologist uses an instrument that is not sensitive enough to pick up small changes, important changes may be missed.

The final challenge is the lack of attention to important psychometric information. Knowledge about psychometrics increases the chance of improvements in testing, and test users should at least choose psychometrically sound instruments.

These challenges should make us think critically about data collected through psychological measurement. For example, we must be aware of the fact that participant reactivity can influence the responses of the participants in a test.

What is the purpose of measuring in psychology? 

The theme that links the chapters of this book is that the ability to identify and characterize psychological differences is the basis of all methods used to evaluate tests.

The purpose of measuring in psychology is to identify and quantify psychological differences that exist between people, over time or in different situations.

What is important when assigning numbers to psychological constructs? - Chapter 2

In psychological tests, numbers are assigned to attributes to show the differences in those attributes between test takers. Measurement is the assignment of numbers to objects, or to characteristics of individuals' behavior, according to certain rules. Scaling is the way in which numbers are assigned to psychological attributes.

What are the fundamental problems with numbers? 

In psychological measurement, numbers are used to represent the level of a psychological characteristic. The numbers can reflect different properties of what is measured, and they can do so in different ways.

Identity

The most basic property in measurement is identity: looking at the differences and similarities between people. On the basis of these differences one can divide the test takers or objects into categories. The categories must satisfy a number of requirements. First, all test takers within a category must be the same on the attribute that this category represents. Second, the categories must be mutually exclusive: each test taker can be classified in only one category. Third, no one may fall outside the categories. Numbers are used here only as labels for the categories; they have no mathematical value, so quantitative meaning is out of the question.

Rank order

Rank order conveys information about the relative amount of a property that people possess: whether you possess a trait to a greater or lesser extent than the other people in the category. Here too, the numbers are only labels. They give meaning to the ranking within the category, but the distances between them have no mathematical meaning.

Quantity

Quantity provides the most information. When numbers express quantity, each person receives a number and it is possible to look at the precise difference between two people. At this level the numbers also have mathematical meaning, and calculations can be made with them. When psychological measurements are made, it is often assumed that the scores have the property of quantity, but this is rarely a well-founded assumption, as we will discuss later.

The number zero

There are two potential meanings of zero. Zero can mean that the property is completely absent (absolute zero), as with reaction time. Zero can also represent an arbitrary amount of a property (arbitrary zero); think of a clock or a thermometer. It is important to determine whether the zero in a psychological test is relative or absolute. It is possible that a test yields a score of zero while the person does possess the characteristic; the zero must then be treated as relative, even if it was initially intended as absolute. Identity, rank order, quantity, and the meaning of zero are important issues in understanding scores on psychological tests.

How can the measured variable be determined?

If the property of quantity is used, the unit of measurement must be clearly defined. An example is length: if you want to know the length of something, you can measure it with a ruler. The ruler is divided into centimeters, so you can express the length in centimeters. In psychology, the unit of measurement is often much less clear or self-evident. There are three ways in which units of measurement can be arbitrary.

  1. The size of the unit is chosen arbitrarily (for example, what counts as one unit of height or weight); once chosen, this decision is fixed and applied consistently afterwards.
  2. The units are not tied to one type of object; they can be applied to many types of objects.
  3. Units can serve different types of measurements. An example is a piece of rope with which you can measure length, but which you can also use to measure the weight of something.

Standard units in the physical world are arbitrary in all three of these ways. Measurements in the psychological world are generally arbitrary only in the first way: you can choose what the unit means and what size is used, but the units are usually tied to a specific object or dimension. An important exception is that standard physical units are sometimes used to measure psychological characteristics, such as cognitive processes that are measured through a person's reaction time.

What role do adding and counting play in psychometrics? 

Both in the physical and in the psychological world, counting is important in the measurements that we perform.

Adding

An important assumption is that the size of the unit does not change when counting the units: every unit is the same size, and adding a unit adds exactly one unit each time. This remains constant even if the conditions of the measurement change (conjoint measurement). In a questionnaire, however, there are questions that are easy and questions that are difficult. As a result, for most questionnaires one cannot simply award one identical point per question; more points could be awarded for questions that are more difficult. But how many points do you assign to a question? This creates a paradox: we want to translate a psychological characteristic into numbers in order to look at quantity, but we cannot do this precisely, because we do not know exactly how large one unit of a psychological characteristic is.

Counting

A point of controversy about the relationship between counting and measuring arises when we count things instead of properties. Counting only amounts to measuring when the count reflects the quantity of a characteristic or property of an object.

Which measuring scales are there? 

Measurement is the assignment of numbers to observations of behavior in such a way that the differences between psychological attributes are clearly visible. There are four measurement levels, or four scales: nominal, ordinal, interval, and ratio.

The nominal scale

The nominal scale is the most fundamental level of measurement. At the nominal level, the test takers are divided into groups, and those who are equal to each other on the attribute are classified together. So there are differences between the groups. You can assign numbers to the groups, but those numbers only label the groups; they have no quantitative meaning and cannot be used in calculations. In daily life, numbers are also assigned to individuals rather than to groups, but that is not the same as nominal measurement. It is therefore important to make clear whether the numbers refer to individuals or to groups (the nominal measurement level).

The ordinal scale

On the ordinal scale, the numbers reflect the rank order of the observations of behavior. Numbers are assigned to individuals within a group, and from these numbers one can see the ranking of the individuals. The numbers only indicate whether you possess a trait to a greater or lesser extent compared with the other people in the group; they say nothing about the exact amount of the property a person has.

The interval scale

The interval scale goes one step further than the ordinal scale. Here the numbers that are assigned also represent a certain amount: they reflect quantitative differences between people on the trait being measured. Furthermore, the interval scale has an arbitrary zero: a score of zero does not mean that the attribute is absent. With the interval scale you can add and subtract quantities, but you cannot meaningfully multiply or divide them (form ratios). Many psychological tests are used and interpreted as if they were based on an interval scale, but in fact the majority of psychological tests are not.

The ratio scale

Ratio scales have an absolute zero: a score of zero means that the attribute is absent. On the ratio scale it is also meaningful to multiply and divide, which is not possible on the interval scale. According to most test experts, there are no psychological tests that reach the ratio level. When measuring reaction time one might think that a ratio scale is used, but this is not the case, because no person can respond in zero milliseconds.

Additional point with measuring scales

In theory it is possible that a score of zero on an interval scale represents some amount of a property; it does not mean that the property is completely absent.

For tests with dichotomous items, binary codes (0 and 1) can be used. Depending on the characteristic being measured, these codes can be interpreted at the nominal level or at the interval level.

What implications does scaling have?

Scaling can have important implications for statistical analyses. Some well-known, basic statistical procedures can only be performed meaningfully with measurements at the interval or ratio level and not with measurements at the nominal or ordinal level. Consider descriptive statistics such as the mean or the correlation. Suppose 'hair color' is the variable to be investigated, with the following categories: 1 = blond (14 people), 2 = black (4 people), 3 = brown (2 people). We could then calculate the 'average hair color' as follows: (14 × 1 + 4 × 2 + 2 × 3) / 20 = 28 / 20 = 1.4. Although it is mathematically possible to calculate the average hair color, it is not meaningful to do so. This is just one of many possible examples showing that scaling influences statistical analyses and the interpretation of outcomes. Some researchers even go so far as to claim that parametric statistics are only valid for interval or ratio data. Regardless of whether this claim is true, most behavioral scientists work under the assumption that their tests and measurements reach the interval level (with the exception of very short or single-item tests, for which setting up an appropriate analytic strategy is often a problem).
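
To make the hair-color example concrete, here is a minimal sketch in Python (with the same made-up frequencies): the mean of nominal codes can be computed, but only the mode is a sensible summary.

```python
import numpy as np

# Hair color coded nominally: 1 = blond (14 people), 2 = black (4), 3 = brown (2)
hair = np.array([1] * 14 + [2] * 4 + [3] * 2)

print(hair.mean())              # 1.4 -- computable, but "average hair color" is meaningless
values, counts = np.unique(hair, return_counts=True)
print(values[counts.argmax()])  # 1 -- the mode (blond) is the only sensible summary here
```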

What are variability and covariability? - Chapter 3

What is variability?

Variability (often quantified as the variance) refers to the differences within a set of test scores or between the values of a psychological attribute. Inter-individual differences are differences that occur between people. Intra-individual differences are differences within one person at different times. Individual differences are very important in psychological testing: the reliability and validity of tests depend on the ability of a test to quantify differences between people. All research in psychology and all scientific applications of psychology depend on the ability of a test to measure individual differences. It is important to know that every area of scientific psychology depends on the existence and quantification of individual differences.

You can display the scores of a group of people, or the scores of one person at different times, quantitatively in a so-called distribution of scores. A distribution of scores is quantitative because the differences between scores are expressed in numbers. The spread of the scores within a distribution is called the variability.

Calculating variability

To describe a distribution of scores, you first have to calculate a few things. First we calculate the mean, since it is the most commonly used measure of central tendency, besides the mode and the median. Second, we calculate the variability, which is done in the following steps (see the sketch after this list):

  1. Calculate the mean by dividing the sum of X by the total number of people or objects, N.
  2. Calculate each deviation as the difference between X and the mean of X.
  3. Calculate the squared deviations by squaring each deviation.
  4. Calculate the variance s² by dividing the sum of the squared deviations by the total N.
  5. Calculate the standard deviation s by taking the square root of the variance: s = √s².
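
A minimal sketch of these five steps, assuming a small set of made-up scores and the divide-by-N (population) formulas used in the text:

```python
import numpy as np

scores = np.array([2, 4, 4, 6, 9])         # hypothetical test scores

mean = scores.sum() / len(scores)          # step 1: mean
deviations = scores - mean                 # step 2: deviations from the mean
squared = deviations ** 2                  # step 3: squared deviations
variance = squared.sum() / len(scores)     # step 4: s^2 (divide by N, not N - 1)
sd = np.sqrt(variance)                     # step 5: standard deviation

print(variance, sd)                        # same results as np.var(scores) and np.std(scores)
```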

Interpretation of the variance and standard deviation

  • The variance and the standard deviation are never less than zero: s² ≥ 0 and s ≥ 0.
  • A single variance or standard deviation can never be interpreted on its own as a large or small value.
  • Comparison is only possible when two or more values are based on the same measuring instrument or variable, for example IQ; only then can you judge whether a value is large or small.
  • The variance and the standard deviation are mainly useful within other concepts, for example in correlations or when estimating the reliability of scores.

Normal distribution

Distribution shapes, such as the normal distribution, give a more qualitative picture because they represent the scores graphically. The variable is placed on the x-axis, for example IQ scores from low to high. The proportion of people who obtained each score is displayed on the y-axis. The idealized shape that emerges is the normal distribution, but real score distributions rarely (almost never) have this perfectly symmetrical, mirror-image shape. Usually the distribution is skewed to the right or skewed to the left. Skewed to the right means that there are more people who score low; skewed to the left means that there are more people who score high.

What is covariability?

With the variance, the differences are calculated within one set of scores. With covariability, also called covariance, the differences in one set of scores are related to the differences in another set of scores. In other words, the covariance describes the association between two variables, for example IQ and GPA, whereas the variance concerns a single variable.

A covariance has characteristics that describe the relationship between the two variables: the direction and the strength of the relationship, and also the consistency between the two variables.

Direction and strength

The relationship between the two variables can be positive or negative. There is a positive (or direct) relationship when high scores on the first variable tend to go together with high scores on the second variable. There is a negative relationship when high scores on the first variable go together with low scores on the second variable (or, equivalently, low scores on the first with high scores on the second).

The strength of a relationship is difficult to judge from the covariance itself.

Consistency

A strong relationship (positive or negative) between two variables shows that there is a high degree of consistency between them. If there is no clear relationship between two variables, then individual differences on one variable are inconsistent with individual differences on the other variable.

Variance is the variability of a single distribution of scores; covariance concerns the joint variability of two distributions of scores. We discussed the calculation of the variance above; the covariance is calculated as follows (see the sketch after these steps):

  1. Calculate the deviations for variable X and for variable Y. You do this by calculating the difference between X and the mean of X, and the difference between Y and the mean of Y.
  2. Calculate the cross-products by multiplying each person's deviation on X by his or her deviation on Y. A positive cross-product means that the person's scores on the two variables are consistent with each other: the person scores either above the mean on both variables or below the mean on both variables. A negative cross-product means that the pattern is inconsistent: the person scores below the mean on one variable (a negative deviation) but above the mean on the other variable (a positive deviation).
  3. Calculate the covariance with the formula Cxy = Σ(cross-products) / N, that is, the sum of the cross-products divided by the total number of people or objects N.
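
A minimal sketch of these three steps, using small made-up sets of scores standing in for IQ and GPA:

```python
import numpy as np

x = np.array([100, 110, 120, 130])   # hypothetical IQ scores
y = np.array([2.8, 3.0, 3.5, 3.9])   # hypothetical GPA scores

dev_x = x - x.mean()                        # step 1: deviations on each variable
dev_y = y - y.mean()
cross_products = dev_x * dev_y              # step 2: cross-products per person
covariance = cross_products.sum() / len(x)  # step 3: Cxy = sum of cross-products / N

print(covariance)                           # positive: high IQ tends to go with high GPA here
```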

The covariance provides clear information about the direction of the relationship but not about the strength of the relationship. Correlation coefficients provide clear information about the direction and strength of the relationship.

Variance-covariance matrix

The variance-covariance matrix is always structured in a certain way, with a number of standard features:

  1. Each variable has a row and a column.
  2. The variances of the variables appear on the diagonal running from top left to bottom right.
  3. All other cells contain the covariances between pairs of variables.
  4. The matrix is symmetrical: all values below the diagonal are identical to the values above the diagonal.

A correlation is easier to interpret than a covariance. A correlation always lies between -1 and +1. If the value is below zero, the relationship between the two variables is negative; if it is above zero, the relationship is positive.

A correlation of zero means that there is no relationship. The closer the correlation is to zero, the weaker (more inconsistent) the relationship between the variables; the farther it is from zero, the stronger (more consistent) the relationship. The formula is: correlation = Rxy = Cxy / (Sx * Sy).
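
A short sketch of this formula with the same kind of made-up data; it should agree with numpy's built-in correlation function:

```python
import numpy as np

def correlation(x, y):
    """Rxy = Cxy / (Sx * Sy), using the divide-by-N statistics used in the text."""
    cxy = np.mean((x - x.mean()) * (y - y.mean()))
    return cxy / (x.std() * y.std())      # np.std defaults to the divide-by-N version

x = np.array([100, 110, 120, 130])        # hypothetical IQ scores
y = np.array([2.8, 3.0, 3.5, 3.9])        # hypothetical GPA scores
print(correlation(x, y))                  # matches np.corrcoef(x, y)[0, 1]
```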

The variance of a composite score is relevant when a psychological test consists of multiple items. For two items, you first calculate the variances s² of the items separately and add them together. You then calculate the correlation between the two items, multiply it by two, and multiply this in turn by the standard deviations of the two items (which you calculated separately first). In the end you add everything together: s²composite = s²1 + s²2 + 2·r12·s1·s2. For tests with more items, a corresponding term is added for every pair of items.

The total test score variance therefore depends solely on the item variances and the correlations between the item pairs, as the sketch below illustrates. This is an important building block for the theory of reliability, which is discussed in a later chapter.
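
The following sketch illustrates this point with hypothetical item responses: the variance of the total (composite) score can be rebuilt exactly from the item variances and the item inter-correlations.

```python
import numpy as np

# item responses: rows = respondents, columns = items (hypothetical data)
items = np.array([[1, 2, 2],
                  [3, 3, 4],
                  [2, 4, 3],
                  [5, 4, 5]], dtype=float)

composite = items.sum(axis=1)              # total test score per respondent

variances = items.var(axis=0)              # s_i^2 per item (divide by N)
sds = items.std(axis=0)
r = np.corrcoef(items, rowvar=False)       # item correlation matrix

rebuilt = variances.sum()
n_items = items.shape[1]
for i in range(n_items):
    for j in range(i + 1, n_items):
        rebuilt += 2 * r[i, j] * sds[i] * sds[j]   # 2 * r_ij * s_i * s_j for each item pair

print(rebuilt, composite.var())            # the two values agree
```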

Binary items yield dichotomous responses: when answering a question, the respondent chooses between two options (yes/no, agree/disagree, or 0/1). For example, we ask people to answer yes or no to a question, or to agree or disagree with a statement; answers can also be scored as right or wrong, or a disorder as present or absent. We usually indicate this with codes: code 0 for a negative response (no, disagree, wrong, absent) and code 1 for a positive response (yes, agree, correct, present). The proportion of 1-responses is p = ΣX / N, and the proportion of 0-responses is q = 1 - p.

You can also calculate the variance of a binary item from p and q: s² = p × q = p × (1 - p). The maximum variance you can get is 0.25, which occurs when p = q = 0.50, because then s² = 0.50 × 0.50 = 0.25.
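
A short sketch with made-up 0/1 responses, showing that p × q equals the ordinary divide-by-N variance of the item:

```python
import numpy as np

responses = np.array([1, 1, 0, 1, 0, 0, 1, 1])   # hypothetical 0/1 answers to one binary item

p = responses.mean()          # proportion of 1-responses: p = sum(X) / N
q = 1 - p
print(p * q)                  # variance computed from p and q
print(responses.var())        # same value: np.var divides by N by default
```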

What should you pay attention to when interpreting test scores?

Problems arise when interpreting test scores. As an example:

  • What is a high score and what is a low score?
  • What does it mean if you score high or if you score low?

Consider for example a score of 35 on neuroticism. Is this a high score? And if this is a high score, what does that mean? Am I neurotic or not at all?

A frame of reference makes scores and percentages easier to interpret. You look at whether a score falls above, below, or exactly on the mean, and how far above or below the mean it falls (think of standard deviations). With this information you can calculate so-called z-scores. A z-score shows how far above or below the mean test score a score falls, expressed in the number of standard deviations: z = (X - mean of X) / Sx, where Sx is the standard deviation of X. For example, z = 0.5 or -0.5 means that the score lies half a standard deviation above or below the mean, which is very close to it; z = 2 or -2 means that the score lies two standard deviations above or below the mean, which is much farther away. Z-scores are easy to compare, even when two completely different variables or measurement units are used, for example weight and optimism. A distribution of z-scores has a mean of 0 and a standard deviation of 1: Z(0; 1). A z-score says something about a score in relation to the rest of the group: how good or bad your score is compared with the average person, but nothing about your abilities in general. The correlation between two variables can also be calculated from z-scores: Rxy = ΣZxZy / N.
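
A minimal sketch of z-scores and of the z-score route to the correlation, with made-up raw scores:

```python
import numpy as np

scores = np.array([85, 90, 100, 110, 115])       # hypothetical raw test scores

z = (scores - scores.mean()) / scores.std()      # z = (X - mean) / s, population SD
print(z)                                         # how many SDs each score lies above/below the mean

other = np.array([10, 12, 11, 15, 16])           # a second hypothetical variable
z_other = (other - other.mean()) / other.std()
print((z * z_other).mean())                      # Rxy = sum(Zx * Zy) / N = np.corrcoef(scores, other)[0, 1]
```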

Although z-scores are easy to compare, they are more difficult to interpret, because many people are not familiar with concepts such as 'standard deviation' or 'distance to the mean'. Therefore T-scores (standardized scores) are often used, with T(50; 10), where 50 is the mean and 10 is the standard deviation: T = z × 10 + 50. Other means and standard deviations can also be used, in which case T = z × s + mean. Another way to interpret scores is with percentiles. An example: an individual obtains a score of 194; in total 75 people took the test, and 52 of them scored lower than 194. So (52/75) × 100 = 69%: the individual's score falls at the 69th percentile, which means that this person scored higher than 69% of the other people who took the test.
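
A short sketch converting z-scores to T-scores and reproducing the percentile calculation from the example above:

```python
import numpy as np

scores = np.array([85, 90, 100, 110, 115])       # hypothetical raw test scores
z = (scores - scores.mean()) / scores.std()

t = 50 + 10 * z                                  # T-scores: mean 50, SD 10
print(t)

# percentile rank of a score of 194 when 52 out of 75 test takers scored lower
print(52 / 75 * 100)                             # ~69: the score falls at the 69th percentile
```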

What are normalized scores?

It is often assumed that a psychological attribute is normally distributed, but this is not always reflected in the data, and then a problem arises. A researcher may think that a property (such as intelligence) is normally distributed while the test data (IQ scores) are not normally distributed. Researchers then often assume that their theory is correct and that the test data do not accurately reflect the distribution of the construct. Researchers have tried to solve this problem with the help of normalization transformations (area transformations), after which the scores are converted to T-scores.

What is dimensionality and what is factor analysis? - Chapter 4

When we measure a physical or psychological attribute of an object or person with a single item, we measure only one aspect of that object or person. Often, however, we administer multiple questions/items that together are meant to reflect a certain dimension or trait; scores summed across such items are called composite scores.

In this chapter, the concept of dimensionality will be discussed. This is done on the basis of three fundamental questions, together with the relevant information from an exploratory factor analysis (EFA):

  1. How many dimensions does a test have?
    1. Unidimensional
    2. Two or more dimensions (multidimensional)
    3. Relevant information from EFA: eigenvalues, scree plot, factor loadings, etc.
  2. Are the dimensions correlated?
    1. Yes: type of scale = multidimensional with correlated dimensions
    2. No: type of scale = multidimensional with uncorrelated dimensions
    3. Relevant information from EFA: rotation method, interfactor correlations
  3. What is the psychological meaning of the dimensions?
    1. Factor analysis
    2. Relevant information from EFA: factor loadings.

What is the dimensionality of a test?

Unidimensional

When a psychological test contains items that all reflect a single attribute of a person, and the responses are not influenced by other attributes of that person, the test is unidimensional. The concept of conceptual homogeneity means that all responses to the items/questions are influenced by one and the same psychological attribute.

If a psychological test contains items that reflect more than one attribute of a person, the test can be subdivided into dimensions (it is multidimensional). Such a test is either multidimensional with correlated dimensions or multidimensional with uncorrelated dimensions.

Multidimensional with correlated dimensions

A test that is multidimensional with correlated dimensions is also called a test with a higher-order factor. This means that there is one higher (general) factor that binds all the subtests together. Subtests are groups of questions that measure different psychological characteristics; these subtests correlate with each other and together form a larger whole.

Subtests are specific factors that are in themselves one-dimensional and the questions within the subtest are conceptually homogeneous.

The full-scale score is a combination of the subscores into a general trait score; the general trait is called the higher-order factor.

  • Unidimensional: a single score for a single psychological characteristic.
  • Multidimensional (with correlated dimensions): subscores that can be added together into a full-scale score.

Multidimensional without correlated dimensions

With this type of test, the subtests do not correlate with each other, so the subscores cannot be added up and combined into a larger whole (there is no higher-order factor).

What is factor analysis?

Factor analysis is the most commonly used statistical procedure for measuring and testing dimensionality. There are two types of factor analysis: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). EFA is the type that is used most often.

Exploratory factor analysis

Suppose you have a test with six items and you want to know how many dimensions are being measured. To find out, you administer the test to, say, a hundred people, enter the data into a computer program, and calculate the correlations between the items. These correlations help to identify and interpret the number of underlying dimensions. Each set of items that correlate relatively highly with each other represents a psychological dimension, also called a factor.

If all items of a test correlate to each other to about the same degree, there is only one set (factor) and then the scale is one-dimensional. If there are two or more sets (factors), the scale is multidimensional.

If the items in set one correlate with the items in set two, we can speak of correlated factors and therefore of a multidimensional test with correlated dimensions. If the items from one set do not correlate with those from the other set, the factors are not correlated and we speak of a multidimensional test without correlated dimensions.

Inspecting all of these correlations by eye is almost impossible if a test contains many items, which is why EFA is usually used.

Implementing and interpreting an EFA

Step 1: First choose the statistical technique that you will use. The most commonly used techniques are principal axis factoring (PAF) and principal components analysis (PCA).

Step 2: Identify the number of factors (dimensions). There is no simple rule for this; you must use guidelines and subjective judgment. Eigenvalues are often used for this purpose, and they can be examined in three ways, illustrated here with an example test (and in the sketch after this list):

  • Eigenvalues: if there is a big drop between the second and the third eigenvalue, this suggests that there are two dimensions within the test. Likewise, a big drop between the fourth and the fifth eigenvalue would suggest four dimensions.

  • Eigenvalue-greater-than-one rule: the number of eigenvalues that are greater than one determines the number of dimensions. If, for instance, three eigenvalues have values above one, this suggests that there are three dimensions within the test.

  • Scree plot: a graphical representation of the eigenvalues within the test. If the line flattens out from the third eigenvalue onwards, this clear flattening point suggests that the number of factors is one less than the position of the flattening point, so again two dimensions.
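
As an illustration (not the book's own analysis), the sketch below simulates responses to six items driven by two underlying factors, computes the eigenvalues of the item correlation matrix with plain numpy, and applies the eigenvalue-greater-than-one rule; all data and settings are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
responses = rng.normal(size=(100, 6))           # 100 respondents, 6 items (noise only)
responses[:, :3] += rng.normal(size=(100, 1))   # items 1-3 share one underlying factor
responses[:, 3:] += rng.normal(size=(100, 1))   # items 4-6 share a second factor

corr = np.corrcoef(responses, rowvar=False)     # item inter-correlation matrix
eigenvalues = np.linalg.eigvalsh(corr)[::-1]    # eigenvalues, largest first

print(eigenvalues)                              # look for the big drop (scree logic)
print((eigenvalues > 1).sum())                  # eigenvalue-greater-than-one rule: ~2 here
```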

Step 3: If the evidence indicates that a scale is multidimensional, we use factor rotation to see whether there is a correlation between the dimensions. There are two types of rotation:

  • Orthogonal rotation: produces dimensions that are uncorrelated with each other.
  • Oblique rotation: produces dimensions that may be correlated with each other.

The idea of a factor rotation is sometimes viewed with reservation or skepticism. How can it be legitimate to change the results so that they 'clarify the psychological meaning of the factors', with a preference for a simple structure? This reservation is not necessary. In short, a factor rotation does not change the relative positions of the items. The purpose of factor analysis is to find a perspective that summarizes the items and describes their mutual relationships. For any given set of correlations between items, countless perspectives are possible that are legitimate and statistically valid. Factor rotation is a tool for finding a perspective that is clear and simple.

Step 4: After the relationship between the dimensions has been established by means of factor rotation, the meaning of the dimensions can be determined. This is done by means of factor loadings. Factor loadings link items to factors (dimensions); the question being answered is which test items are most strongly linked to each dimension. The stronger the items are linked, the clearer the meaning of the dimension. It is of course better if test items are strongly linked to only one dimension and not to multiple dimensions, because cross-loadings make the meaning complicated. In addition, a loading can be positive or negative. A positive loading indicates that people who score high on the item also score high on the underlying factor; a negative loading indicates that people who score high on the item score low on the underlying factor.

Simple structure: if items are strongly linked with only one factor.

When you use an oblique rotation you also have to look at the correlations between the factors.

Confirmatory factor analysis

EFA is used in situations where little is known about the dimensionality of a test. CFA is used when there are already clear ideas about the dimensionality of a test. For example, suppose you have a test with fourteen items that is designed so that seven questions belong to one dimension and seven questions belong to a second dimension. You can then use CFA to test whether this structure actually holds.

Chapter 12 deals with CFA in more detail.

What is reliability? - Chapter 5

What is reliability?

Chapter 5 is about the reliability of a test. Reliability is the extent to which differences in respondents' observed scores correspond with differences in their true scores. The closer this correspondence, the more reliable the test.

According to Classical Test Theory (CTT), reliability can be analyzed in terms of observed scores (Xo), true scores (Xt), and error scores (Xe). Error scores are also called measurement error.

Factors other than the attribute being measured that cause differences between the observed and the true scores are called sources of error. They produce measurement error, which creates a discrepancy between the observed and the true scores.

In addition to "sources of error", there are also temporary or transient factors that can influence the observed scores. Examples of this are the number of hours of sleep, emotional state, physical condition, gambling or misplaced answers. The latter means that if you know the correct answer, you still indicate the wrong answer. These temporary/transient factors decrease or increase the observed scores versus the reliable scores.

To find out to what extent the observed scores are a function of measurement error and to what extent they are a function of true scores, two questions must be asked:

  1. Which part of the observed scores is a function of reliable inter-individual or intra-individual differences?
  2. Which part of the observed scores is a function of measurement errors?

In other words: Xo = Xt + Xe. The observed scores are determined by the true scores and the measurement errors, and the smaller the value of Xe, the better. Measurement errors are assumed to be random, which means that they are independent of the true scores Xt: a measurement error affects someone with a high true score in the same way, and to the same extent, as someone with a low true score. Random error has two characteristics:

  • The average of all measurement errors within a test is zero.
  • Measurement errors do not correlate with true scores, rte = 0.

Instead of saying that reliability depends on the consistency between differences in observed scores and differences in true scores, you can also say: reliability depends on the relationships between the variability of the observed score, variability of the true score, and variability of the measurement error score.

  • Error score variance: Se² = ∑ (Xe minus average Xe) ² / N. The higher Se², the worse the measurement.
  • True score variance: St² = ∑ (Xt minus average Xt) ² / N
  • Observed score variance: So² = ∑ (Xo minus average Xo) ² / N. Or, So² = St² + Se².

This formula should actually be: So² = St² + Se² + 2rte · St · Se.

However, the true scores and the measurement errors are not correlated, so 2rte · St · Se = 0. What remains is: So² = St² + Se².

What are the four types of reliability?

1. Reliability in terms of "proportions of variances"

Rxx (reliability coefficient) = St² / So²

Rxx = 0 means that everyone has the same true score. (St² = 0)

Rxx = 1 means that the variance of the true scores is equal to the variance of the observed scores. In other words: there are no measurement errors!

Here is an example of interpretation of Rxx:

Rxx = 0.48 means that 48% of the differences (variance) in the observed scores can be attributed to true-score differences. Conversely, 1 - 0.48 = 0.52, so 52% of the differences can be attributed to measurement error.
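
A small simulation sketch of these ideas with made-up true scores and error scores: it checks that the mean error is near zero, that the errors are (almost) uncorrelated with the true scores, that So² ≈ St² + Se², and what Rxx = St² / So² then looks like.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

true = rng.normal(100, 15, size=n)      # Xt: hypothetical true scores
error = rng.normal(0, 10, size=n)       # Xe: random measurement error
observed = true + error                 # Xo = Xt + Xe

print(error.mean())                             # close to 0
print(np.corrcoef(true, error)[0, 1])           # close to 0: errors do not track true scores
print(true.var() + error.var(), observed.var()) # So^2 is (approximately) St^2 + Se^2

rxx = true.var() / observed.var()               # reliability as a proportion of variance
print(rxx)                                      # about 0.69 with these settings
```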

2. Reliability in terms of "lack of measurement error"

Rxx (reliability coefficient) = St² / So²

So² = St² + Se² (and therefore also: St² = So² - Se²)

Rxx = (So² - Se²) / So² = (So² / So²) - (Se² / So²)

In other words: Rxx = 1 - (Se² / So²): when (Se² / So²) is small, the reliability is high.

3. Reliability in terms of "correlations"

Rxx = Rot², where Rot² is the squared correlation between the observed scores and the true scores.

Rot = St² / (So * St) = St / So

Rot² = St² / So².

A reliability of 1.0 indicates that the differences between the observed test scores perfectly match the differences between the true scores. A reliability of 0.0 indicates that the differences between the observed scores are completely unrelated to the differences between the true scores.

4. Reliability in terms of "lack of correlation"

Rxx = 1 - Roe², where Roe² is the squared correlation between the observed scores and the error scores.

Roe = Se² / (So * Se) = Se / So

Roe² = Se² / So² so:

Rxx = 1 - Roe² = 1 - (Se² / So²).

If Roe = 0, then Rxx = 1.0

The greater the correlation between the observed scores and the error scores, the smaller Rxx. So reliability will be relatively high if the observed scores have a low correlation with the error scores.

How is the size of the measurement error expressed?

Although reliability is an important psychometric concept, it does not directly reflect the magnitude of the measurement error of a test; additional coefficients are needed for this. The standard measurement error reflects the average size of the error scores. The greater the standard measurement error, the greater the average difference between observed scores and true scores, and therefore the lower the reliability of the test.

Standard measurement error = sem

sem = So * √ (1 - Rxx)

If Rxx = 1, then sem = 0; so the greater Rxx, the smaller sem.
sem is never greater than So; if Rxx = 0, then sem = So.
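
A short numeric sketch with hypothetical values for So and Rxx:

```python
import numpy as np

so = 15.0       # standard deviation of the observed scores (hypothetical)
rxx = 0.84      # estimated reliability (hypothetical)

sem = so * np.sqrt(1 - rxx)    # standard measurement error
print(sem)                     # 6.0: the average size of the error scores on this test
```
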
How is the theory of reliability translated into practice?

The theory of reliability is based on three terms: true scores, observed scores, and error scores. But in practice we do not know whether a score is actually the true score of an individual. We also do not know to what extent measurement errors influence the response of an individual. How then do we translate the theory of reliability into practice?

Although we cannot determine with certainty what the reliability or standard measurement error of a test is, methods have been developed to estimate them. Examples of such approaches are administering two versions of a test or administering the same test twice. In this section, four measurement models are discussed that underpin estimates of the reliability and standard measurement error of a test:

  1. parallel tests;
  2. the tau-equivalent test model;
  3. the essentially tau-equivalent test model;
  4. the congeneric test model.

Each model offers a perspective on how two or more tests are the same.

1. Parallel tests

We speak of parallel tests when two (or more) tests, in addition to the basic assumptions of classical test theory, meet the following three assumptions:

  1. The two tests have the same error variance (se1² = se2²).
  2. The intercept of the relation between the true scores on the two tests is 0 (a = 0 in Xt2 = a + b·Xt1).
  3. The slope of the relation between the true scores on the two tests is 1 (b = 1 in Xt2 = a + b·Xt1).

These assumptions have six implications:

  1. The true scores on test 1 are identical to the true scores on test 2 (Xt1 = Xt2).
  2. Because each participant's true score on test 1 equals his or her true score on test 2, the two sets of true scores correlate perfectly with each other (rt1t2 = 1).
  3. The variances of the true scores of tests 1 and 2 are identical (st1² = st2²).
  4. The mean of the true scores of test 1 is equal to the mean of the true scores of test 2.
  5. The variance of the observed scores of test 1 is equal to the variance of the observed scores of test 2.
  6. The reliabilities of the two tests are the same (R11 = R22).

When the scores of two tests meet all these assumptions and implications, we speak of parallel tests.

Finally, according to CTT, there is one further implication that follows from the above: the correlation between parallel tests equals the reliability. In formula form: ro1o2 = R11 = R22. In other words, when two tests are truly parallel, the correlation between the two tests is equal to the reliability of both tests.

The correlation between parallel tests can also be calculated from the variances of the true and observed scores: ro1o2 = st² / so².

2. The tau-equivalent test model

In addition to the standard assumptions of classical test theory, the tau-equivalent test model is based on the following two assumptions:

  1. The intercept of the relation between the true scores of the two tests is 0 (a = 0 in Xt2 = a + b·Xt1).
  2. The slope of the relation between the true scores of the two tests is 1 (b = 1 in Xt2 = a + b·Xt1).

These two assumptions are the same as those of parallel tests. The difference lies in the omitted assumption: the tau-equivalent test model does not include the assumption of equal error variances. This leads to four implications (the first four that we discussed for parallel tests).

The less strict assumptions mean that the correlation between tau-equivalent tests is not a valid estimate of the reliability. This is in contrast to parallel tests, where the correlation between the tests is therefore a valid estimate of the reliability.

3. The essentially tau-equivalent test model

In addition to the standard assumptions of classical test theory, the essentially tau-equivalent test model is based on one additional assumption:

The slope of the relation between the true scores of the two tests is 1 (b = 1 in Xt2 = a + b·Xt1).

This leads to two implications: rt1t2 = 1 (the correlation between the true scores of the tests is perfect) and st1² = st2² (the variances of the true scores of both tests are equal).

4. Congeneric test model

The last model is the congeneric test model. According to this model, only the basic assumptions of classical test theory are accepted. This results in a single implication, namely that the true scores of the tests correlate perfectly: rt1t2 = 1. This is therefore the least restrictive and most general model. Although this model is the most broadly applicable (the more restricted models are special cases of it), it offers limited possibilities for estimating reliability.

What is the 'Domain Sampling Theory'?

According to this theory, reliability is the average of the correlations between all possible pairs of tests of N items selected from a domain of test items. The logic of this theory is the foundation of generalizability theory, which is discussed extensively in chapter thirteen.

How to empirically estimate the reliability? - Chapter 6

Test scores can be used to estimate reliability and to estimate the measurement error. In this chapter three methods are discussed for estimating reliability: (1) alternate forms reliability (also known as parallel-forms reliability); (2) test-retest reliability; (3) internal consistency. This chapter also looks at the reliability of difference scores, which are used to study cognitive growth, symptom reduction, personality change, and so on.

What does the alternate forms method include?

The first method uses a parallel test to estimate reliability. There are two tests: the test of interest, which yields scores, and a second test, which also yields scores. With these two sets of scores, the correlation between the test scores and the scores on the parallel test can be calculated, and this correlation can be interpreted as an estimate of reliability. The two tests are parallel if both tests measure the same set of true scores and if they both have the same error variance; the correlation between two parallel tests is then equal to the reliability of the test scores. A practical problem with using a parallel test is that we never know for sure whether the parallel test meets the assumptions of classical test theory. We can never be sure that the true scores of the first form are the same as the true scores of the parallel form. Different test forms have different content, which can cause problems with the parallel test. If the parallel test does not truly run parallel to the first test, then the correlation is not a good estimate of the reliability.

Another possible problem with the parallel-test method is carryover or contamination through repeated testing. The test takers may be influenced by having taken the first test, and their condition may be different during the parallel test. As a result, their performance can differ and the reliability estimate is distorted. In classical test theory the error in every test is assumed to be random; if the test takers are affected by the first test, the error scores of the two tests correlate with each other, which the classical test theory does not allow. This means that the two tests are not completely parallel to each other.

Two assumptions for a parallel test are that the true scores are the same and that the error variance in both tests is the same. The means of the observed scores of both tests must then also be the same, and the tests must have the same standard deviations. If all of this holds and we are genuinely convinced that the two tests measure the same construct, then we can use the correlation between them as an estimate of reliability. This estimate is called alternate forms reliability.

How is the test-retest reliability calculated?

This method is useful for measuring stable psychological attributes such as intelligence and extraversion. The same people take the same test on two (or more) occasions. If the assumptions hold, the correlation can be calculated between the first scores and the repeated scores; this correlation is the estimate of the test-retest reliability. The applicability of the test-retest method depends on a number of assumptions. Just as with the parallel test, the true scores must be the same for both administrations, and the error variance of the first administration must be the same as the error variance of the second. If these assumptions are met, we can say that the correlation between the scores of the two administrations is an estimate of the reliability of the scores.

The assumption that the true scores are the same for both administrations cannot always be maintained. First, some constructs are less stable than others: mood, for example, is less stable than a personality trait. In a test about feelings, one can feel very happy during the first administration and more dejected a little later, during the second administration; this produces different true scores and lowers the test-retest correlation. Second, the length of the interval between administrations matters for the stability of the results. Over longer intervals, larger psychological changes can occur, so the true scores may change; very short intervals can cause carryover or contamination effects. Many test-retest analyses use an interval of 2 to 8 weeks. A third factor is the period in which the tests are conducted: respondents (especially children) may go through a developmental spurt between the two administrations, and then too the true scores are no longer the same.

If the true scores remain the same across the two administrations, the correlation between them indicates the extent to which measurement error influences the test scores: the lower the correlation, the more influence measurement error has had and the less reliable the scores are. A difficulty with the test-retest method is that one can never be sure that the true scores have in fact remained the same. If the true scores change, the correlation reflects not only the influence of measurement error but also the degree of change in the true scores, and these two influences cannot be separated with simple formulas. A test-retest correlation can therefore be low because of real differences in true scores even when the reliability of the test is high. The parallel-forms method and the test-retest method are theoretically useful, but in practice they are often difficult: they can be expensive and time-consuming, which is why they are not used very often.

How is internal consistency used to estimate reliability?

Internal consistency is a good alternative to the parallel-forms and test-retest methods. Its advantage is that only one test administration is needed. A composite score is a score calculated from multiple items; it is the total score over the test responses. Internal consistency can therefore be used for tests that have more than one item. The idea is that parts of a test (items or groups of items) can be treated as different forms of the test. Internal consistency is used in many areas of behavioral science. Two factors influence the reliability of the test scores. The first is whether the parts of the test are consistent with one another: if the parts correlate strongly with each other, the test is more reliable. The second factor is the length of the test: a long test is more reliable than a short one. There are three common ways to estimate internal consistency: the split-half method, the "raw alpha" method and the "standardized alpha" method.

Estimates of split half reliability

The split-half reliability is obtained by splitting the test in two and calculating the correlation between the two halves; in effect, two short parallel tests are created. The split-half method proceeds in three steps. The first step is to divide the test into two halves and compute a score for each half. The second step is to calculate the correlation between the two half-test scores. This split-half correlation (rhh) indicates the degree to which the two halves are equivalent. The third step is to enter this correlation into a formula to estimate the reliability (Rxx) of the full test. This is done with the Spearman-Brown formula:

Rxx = (2 * rhh) / (1 + rhh)

The formula is needed because the correlation is based on two half-length tests rather than the whole test, as in the other methods. Because the correlation is computed within one test, it is an internal-consistency estimate of reliability. The two halves must have the same true scores and the same error variance, and their means and standard deviations must be equal. If the halves do not meet these criteria, the estimated reliability is lower. One can then split the items differently, but because the halves are not parallel, a different split can produce a different correlation. For this reason the split-half method is not used very often.
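As an illustration, here is a minimal sketch (not from the book) of the split-half procedure in Python, assuming a small hypothetical matrix of item responses (`scores`, persons by items) and an odd/even split of the items:

```python
import numpy as np

# Hypothetical (persons x items) matrix of item responses.
scores = np.array([
    [3, 4, 2, 5, 3, 4],
    [1, 2, 1, 2, 2, 1],
    [4, 5, 4, 4, 5, 5],
    [2, 3, 2, 3, 2, 3],
    [5, 4, 5, 5, 4, 4],
])

# Step 1: split the test into two halves (here: odd vs. even items) and score each half.
half_1 = scores[:, 0::2].sum(axis=1)
half_2 = scores[:, 1::2].sum(axis=1)

# Step 2: correlation between the two half-test scores (r_hh).
r_hh = np.corrcoef(half_1, half_2)[0, 1]

# Step 3: Spearman-Brown step-up to estimate the full-test reliability.
r_xx = (2 * r_hh) / (1 + r_hh)
print(round(r_hh, 3), round(r_xx, 3))
```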

Measuring reliability through internal consistency poses an additional problem with power tests and speed tests. In a power test, the subjects have ample time to answer and the questions differ in difficulty. In a speed test, the subjects have a limited amount of time in which to answer as many questions as possible, and the questions are of equal difficulty. If the split-half method is applied to a speed test, it mainly reflects the consistency of a person's response speed. Because all questions have the same difficulty, a subject spends roughly the same amount of time on each half, so the split-half reliability is almost always close to 1.0; for this reason it is almost never used for speed tests.

Cronbach’s Alpha ("raw" coefficient alpha)

Internal consistency can be taken a step further by regarding each item as a subtest. Calculating internal consistency at the item level involves two steps. In the first step, the relevant item-level statistics are calculated. In the second step, these statistics are entered into a formula to estimate the reliability of the entire test.

The most widely used method is Cronbach's alpha, also known as the raw coefficient alpha. First the variance of the total scores over the entire test (sx²) is calculated. Then the covariance between each pair of items is calculated. If the covariance between a pair of items is near 0, it is possible that not every item measures the same construct, or that measurement error has a large influence on one of those items; either way, the test has a problem. After all covariances have been calculated, they are summed. The larger this sum, the more the items hang together. The next step is to estimate reliability with the following formula:

α = estimated Rxx = (k / (k - 1)) * (∑c / sx²)

Here k is the number of items in the test, ∑c is the sum of the inter-item covariances, and sx² is the variance of the total test scores.

There are different formulas for calculating Cronbach's alpha. An equivalent formula, based on the item variances (si²), is: α = estimated Rxx = (k / (k - 1)) * (1 - (∑si² / sx²))
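A minimal sketch of the raw coefficient alpha, assuming a hypothetical (persons x items) response matrix such as the one above; the function name `cronbach_alpha` is ours, not the book's:

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Raw coefficient alpha for a (persons x items) response matrix."""
    k = scores.shape[1]                          # number of items
    item_var = scores.var(axis=0, ddof=1)        # item variances (s_i^2)
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of total scores (s_x^2)
    return (k / (k - 1)) * (1 - item_var.sum() / total_var)
```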

Standardized coefficient alpha

Another method uses the general Spearman-Brown formula and is known as the standardized alpha estimate. It gives roughly the same result as the raw Cronbach's alpha, is reported by programs such as SPSS, and gives a clear picture of reliability. If a test uses standardized scores (z-scores), the standardized alpha gives the better estimate of reliability. The standardized alpha is based on correlations. As a first step, we calculate the correlation between each pair of items, just as with the raw alpha; these correlations reflect the extent to which the participants' responses to the items covary. We then calculate the average (r̄ii) of all these inter-item correlations. The next step is to enter this average correlation into this more general form of the Spearman-Brown formula:

Rxx = (k * r̄ii) / (1 + (k - 1) * r̄ii)
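A minimal sketch of the standardized alpha (average inter-item correlation entered into the general Spearman-Brown formula), under the same assumptions about the `scores` matrix:

```python
import numpy as np

def standardized_alpha(scores: np.ndarray) -> float:
    """Standardized alpha based on the average inter-item correlation."""
    k = scores.shape[1]
    corr = np.corrcoef(scores, rowvar=False)     # k x k inter-item correlation matrix
    mean_r = corr[np.triu_indices(k, 1)].mean()  # average of the off-diagonal correlations
    return (k * mean_r) / (1 + (k - 1) * mean_r)
```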

Raw Alpha for binary items: KR20

Many psychological tests have binary items (items with two possible answers). For such tests a special formula can be used to estimate reliability: the Kuder-Richardson 20 (KR20) formula. It involves two steps. First the item statistics are collected.

These are the proportion of correctly answered questions per item (p) and the proportion of incorrectly answered questions (q = 1 - p). The variance of each item is then si² = pq, and the variance of the total test scores is sx². The second step is to enter these statistics into the Kuder-Richardson formula (KR20):

Rxx = (k / (k - 1)) * (1 - (∑pq / sx²))
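A minimal sketch of KR20 for a hypothetical matrix of 0/1 item responses; note that si² = pq corresponds to the population (ddof=0) variance, so the total-score variance is computed the same way here:

```python
import numpy as np

def kr20(scores: np.ndarray) -> float:
    """KR20 for a (persons x items) matrix of binary (0/1) responses."""
    k = scores.shape[1]
    p = scores.mean(axis=0)                      # proportion correct per item
    q = 1 - p                                    # proportion incorrect per item
    total_var = scores.sum(axis=1).var(ddof=0)   # variance of total scores (s_x^2)
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)
```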

Omega

The omega coefficient applies to the same types of scales as the alpha (usually continuous items that are combined into a total score on the test). The omega is based on the idea that reliability can be defined as the ratio of signal and noise. In other words: reliability = signal / (signal + noise). We omit a more detailed discussion of the omega, since that goes beyond the purpose of this book.

Assumptions for the alpha and omega

The accuracy of the reliability estimates described above depends on the validity of certain assumptions. In summary, the alpha method only has accurate reliability estimates when the items are essentially tau-equivalent or parallel (see Chapter 5 for a discussion of these models). The omega is more broadly applicable; the omega also provides accurate reliability estimates for congeneric tests.

Theory and reality of accuracy and the use of internal consistency estimators

Many researchers do not check the assumptions that underlie alpha. Alpha is nonetheless the method usually chosen to estimate reliability, because it is easy to calculate and requires only a single test administration. Little attention is paid to the assumptions because the assumptions behind alpha are less strict and therefore more easily satisfied. If the items are approximately equivalent, the KR20 and alpha estimates are accurate; the error variances do not have to be equal. If the items are not equivalent, KR20 and alpha underestimate reliability. Reliability can also be overestimated: because alpha is based on a single test administration, the error variance may be underestimated, which inflates the estimate. In general, Cronbach's alpha is used most often because its assumptions are the least demanding and it usually gives a reasonable estimate.

Internal consistency and dimensionality

The internal consistency of items is distinct from their conceptual homogeneity (whether the items are unidimensional). The reliability of a test can be high even when the test measures multiple attributes (conceptual heterogeneity / multidimensionality). An internal-consistency reliability estimate therefore says nothing about the conceptual homogeneity or dimensionality of the test.

Which factors can influence the reliability of test scores?

Two factors contribute to internal-consistency reliability. The first is the consistency among the parts of the test, which has a direct effect on the reliability estimate. If the correlations between the parts are positive, the parts are consistent with one another; how reliable the test is depends on the size of those correlations. Items that weaken the inter-item correlations can be removed from the test or rewritten, which can raise the correlations. Higher inter-item correlations mean higher internal consistency and therefore higher reliability.

The second factor that can affect reliability is the length of the test. Long tests are more reliable than short tests. With longer tests, the variance of the true score rises faster than the error variance. Reliability can also be calculated with this formula:

Rxx = st² / (st² + se²)

Here st² is the true-score variance, se² is the error variance, and st² + se² = so² (the observed-score variance). If we double the length of the test, the true-score variance of the lengthened test becomes:

st²-doubled = 4 * st²-original

From this formula we can conclude that when we double the length of the test, the true-score variance becomes four times as large. The error variance follows a different rule when the test is lengthened:

se²-doubled = 2 * se²-original

Here we can see that when the test doubles, the error variance also doubles. After calculating these figures we can enter them in a formula to estimate the reliability:

Rxx-doubled = 4 * st²-original / (4 * st²-original + 2 * se²-original)

This formula can be converted to the following formula:

Rxx-doubled = (2 * Rxx-original) / (1 + Rxx-original)

The general formula for a test that is lengthened or shortened is the Spearman-Brown prophecy formula:

Rxx (lengthened or shortened) = (n * Rxx-original) / (1 + (n - 1) * Rxx-original), or

Rxx = (k * r̄ii) / (1 + (k - 1) * r̄ii)

Here n is the factor by which the test is lengthened or shortened, and Rxx-original is the reliability estimate of the original version of the test. In the second formula, k is the number of items in the new version of the test and r̄ii is the average inter-item correlation.

The average inter-item correlation can be calculated if we know the standardized Alpha and the number of items:

r̄ii = Rxx / (k - (k - 1) * Rxx)

Lengthening a test therefore helps reliability, but the new items added must be parallel to the items already in the test. Moreover, the gains diminish: adding items to an already long test improves reliability less than adding the same number of items to a short test, as illustrated in the sketch below.
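A minimal sketch of the Spearman-Brown prophecy formula with hypothetical numbers, showing the diminishing returns of lengthening a test:

```python
def spearman_brown(r_original: float, n: float) -> float:
    """Predicted reliability when a test is lengthened/shortened by factor n,
    assuming the added items are parallel to the existing ones."""
    return (n * r_original) / (1 + (n - 1) * r_original)

# Doubling a test with reliability .70 vs. doubling one with reliability .90:
print(round(spearman_brown(0.70, 2), 3))  # ~0.824 (gain of about .12)
print(round(spearman_brown(0.90, 2), 3))  # ~0.947 (gain of about .05)
```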

Heterogeneity and general reliability

Another factor that influences reliability is heterogeneity. The greater the variability (heterogeneity) among the test takers (and their true scores), the larger the reliability coefficient. If one examines a trait on which a sample is very heterogeneous, the reliability will be higher than for a trait on which there is little heterogeneity. This has two important implications. The first is that reliability is a property of test scores, not of the test itself. The second is that heterogeneity can be examined in reliability generalization studies. Such studies look at the extent to which reliability estimates from different studies using the same test are similar and how those estimates vary. They can be used to identify and understand how the characteristics of a sample influence the reliability of test scores.

How is the reliability of difference scores determined?

There are also studies that look at how much a group of test takers changes compared to another group. This, too, is a matter of variability: researchers want to know how much variation there is in the amount of change across test takers. One way to see how much a person has changed on a trait is to administer the test twice and subtract the first score from the second, yielding a difference score (Di = Xi - Yi). A positive score indicates improvement, a negative score a decline, and a score of 0 means that no change has taken place.

There are different types of difference scores. A difference score can be calculated within a person (intra-individual score), the same test is taken twice. Another type of difference score is intra-individual discrepancy score, where two measurements are also taken with the same person but a different test is used the second time. In addition, a difference score between two people can be calculated in which two different people take the same test and the score of one person is subtracted from the score of the other person.

Estimate the reliability of the difference scores

Estimating the reliability of difference scores requires three things: the reliability of both tests used to calculate the difference scores (Rxx and Ryy); the variability of the observed scores on the tests (sxo², syo², sxo, syo); and the correlation between the observed test scores (rxoyo).

The formula for the reliability of the difference scores is:

Rd = (sxo² * Rxx + syo² * Ryy - 2 * rxoyo * sxo * syo) / (sxo² + syo² - 2 * rxoyo * sxo * syo)
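A minimal sketch of this formula with hypothetical values (two equally variable tests, each with reliability .80, whose observed scores correlate .50):

```python
def difference_score_reliability(r_xx, r_yy, s_xo, s_yo, r_xoyo):
    """Reliability of difference scores from the tests' reliabilities,
    observed standard deviations, and observed-score correlation."""
    numerator = s_xo**2 * r_xx + s_yo**2 * r_yy - 2 * r_xoyo * s_xo * s_yo
    denominator = s_xo**2 + s_yo**2 - 2 * r_xoyo * s_xo * s_yo
    return numerator / denominator

print(difference_score_reliability(0.80, 0.80, 10, 10, 0.50))  # 0.6
```

Note how the difference-score reliability (.60) falls below the reliability of either test (.80), in line with the points below.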

Factors that influence the reliability of the difference scores

There are two factors that determine whether a set of difference scores will have good reliability. The first is the correlation between the observed scores of the tests. As the correlation between the tests becomes higher, the reliability of the difference scores decreases. The second factor is the reliability of the two tests used to calculate the difference scores. If the tests have a high reliability, the difference scores will generally also have a high reliability.

The reliability of the difference scores cannot be higher than the average reliability of the two individual test scores. But the reliability of the difference scores can be much smaller than the reliability of the two individual test scores.

Unequal variability

In some cases difference scores do not clearly reflect the psychological reality; the difference scores then mainly reflect one of the two variables. This can happen when the two tests have unequal variability, for example because they use different measurement scales. The scores must then first be standardized before difference scores are calculated, so that both variables have a mean of 0 and a standard deviation of 1; only then can the test takers be compared accurately. Even with identical metrics, however, a difference score is only meaningful if the two test scores share a common psychological attribute.

Especially when analyzing discrepancy scores, it is important to first standardize the tests before calculating the difference scores.
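A minimal sketch of standardizing two sets of scores before taking differences (hypothetical data; z-scoring within each variable):

```python
import numpy as np

def standardized_difference(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Difference scores computed on z-scored variables, so that unequal
    variability does not let one variable dominate the difference."""
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)
    return zx - zy
```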

From test scores we can estimate reliability and the measurement error. In this chapter three methods were discussed for estimating reliability: (1) alternate-forms reliability (the parallel-test method); (2) test-retest reliability; and (3) internal consistency. The chapter also discussed the reliability of difference scores, which are used to study, for example, cognitive growth, symptom reduction, and personality change.

What is the importance of reliability? - Chapter 7

This chapter explains how reliability and measurement errors affect the results of behavioral research. Awareness of these effects is crucial for behavioral research.

Which two sources of information can help evaluate an individual test score?

There are two important sources of information that can help us evaluate an individual test score. The first is a point estimate: a value interpreted as the best estimate of someone's standing on a psychological attribute. The second is a confidence interval, which gives a range of values within which a person's true score is likely to lie. If the confidence interval around the true score is wide, we know that the observed score is a poor point estimate of the true score.

Point estimates

Two types of point estimates can be derived from an individual's observed score. The first is based on the observed test score alone: when a test taker completes the test at a given moment, the observed score itself serves as the estimate of the true score. The second point estimate also takes measurement error into account. Using the score on the first administration, we can estimate what the test taker would score on a second administration; this yields an adjusted true score estimate. On a second administration, scores tend to fall closer to the group mean, a phenomenon called regression to the mean. This prediction follows from the logic of classical test theory and random measurement error. The adjusted true score estimate will generally differ from the observed score on the first administration; the magnitude and direction of that difference depend on three factors:

  1. The reliability of the test scores.
  2. The magnitude of the difference between the original observed test score and the average of the test scores.
  3. The direction of the difference between the original score and the average of the test scores. The following formula is used to estimate the adjusted true score.

Xest = Xavg + Rxx * (Xo – Xavg)

Xest is the estimate of the adjusted true score, Xavg is the average of the test score, Rxx is the reliability of the test and Xo is the observed score. The reliability of the test influences the difference between the estimated true score and the observed score. With a smaller reliability, the difference between the estimated true score and the observed score becomes larger. The observed score itself also influences the difference between the estimated true score and the observed score. The difference will be greater with more extreme observed scores.
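A minimal sketch of the adjusted true-score estimate with hypothetical values (observed score 130, test mean 100, reliability .90):

```python
def adjusted_true_score(x_observed: float, x_mean: float, r_xx: float) -> float:
    """Regressed (adjusted) true-score estimate: Xest = Xavg + Rxx * (Xo - Xavg)."""
    return x_mean + r_xx * (x_observed - x_mean)

print(adjusted_true_score(130, 100, 0.90))  # 127.0 -- pulled toward the mean
```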

One reason for not calculating the estimated true score is that an observed score is already a good estimator of the psychological characteristic and there can be little reason to correct it. A second reason is that the estimated value does not always lead to a regression to the average.

Confidence intervals

Confidence intervals reflect the precision of the point estimate of an individual's true score. The width of the confidence interval and the reliability are linked through the standard error of measurement (sem).

sem = so * √(1 - Rxx)

The greater the standard error of measurement, the greater the average difference between observed scores and true scores. To calculate a 95% confidence interval around the estimated true score, using the standard error of measurement, the following formula applies:

95% confidence interval = Xest ± (1.96) (sem)

Xest is the adjusted true score (that is a point estimate of the true score of an individual), sem is the standard measurement error of the test scores and 1.96 (the z-score) indicates that we calculate a 95% confidence interval. The interpretation of a confidence interval is that we can say with 95% certainty that the true score can be found somewhere in the confidence interval. Tests with a high reliability will require a smaller confidence interval than tests with a lower reliability. Reliability affects the confidence, accuracy and precision with which a person's true score is estimated.
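A minimal sketch combining the standard error of measurement and the 95% confidence interval, with hypothetical values (observed SD of 15, reliability .90, estimated true score 127):

```python
import math

def sem(s_observed: float, r_xx: float) -> float:
    """Standard error of measurement: so * sqrt(1 - Rxx)."""
    return s_observed * math.sqrt(1 - r_xx)

def ci_95(x_est: float, s_observed: float, r_xx: float) -> tuple:
    """95% confidence interval around an estimated true score."""
    margin = 1.96 * sem(s_observed, r_xx)
    return x_est - margin, x_est + margin

print(ci_95(127.0, 15, 0.90))  # roughly (117.7, 136.3)
```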

Confidence intervals can be calculated in different ways and with different sizes (95%, 90%, etc.)

The intervals can be calculated with the standard measurement error or the standard estimation error (which is also influenced by reliability). The estimates of the true scores, and the confidence intervals that go with them, are important in making decisions. And reliability plays a major role in this.

On which two factors does the correlation of observed scores from two measurements depend?

According to classical test theory, the correlation between the observed scores of two measures (rxoyo) depends on two factors: the correlation between the true scores of the two psychological constructs (rxtyt) and the reliability of the two measures (Rxx and Ryy).

rxoyo = rxtyt * √ (Rxx * Ryy)

The correlation between two sets of observed scores is:

rxoyo = cxtyt / (sxo * syo)

We can calculate the observed standard deviation with the reliability and the standard deviation of the true scores. See below:

sxo = sxt / √Rxx and syo = syt / √Ryy

The classical test theory shows that the correlation between two measurements is determined by the correlation between psychological constructs and the reliability of the measurements.
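A minimal sketch of this attenuation effect with hypothetical numbers (a true correlation of .60 between the constructs, reliabilities of .80 and .70):

```python
import math

def observed_correlation(r_true: float, r_xx: float, r_yy: float) -> float:
    """Observed correlation implied by classical test theory: r_true * sqrt(Rxx * Ryy)."""
    return r_true * math.sqrt(r_xx * r_yy)

print(round(observed_correlation(0.60, 0.80, 0.70), 3))  # ~0.449
```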

The measurement error suppresses the correlation between measurements

There is a difference between the correlation of the observed scores and the correlation of the true scores. This has four important consequences:


  1. The observed correlations (between measurements) will always be weaker than the correlations of the true scores (between psychological constructs). This is because measurements will never be perfect and imperfect measurements make the observed correlations weaker.
  2. The degree of weakening depends on the reliability of the measurements. Even if only one of the tests has low reliability, the correlation of the observed scores becomes a lot weaker compared to the correlation of the true scores.
  3. Error limits the maximum correlation that can be found. As a result, the observed correlation of two measurements can be lower than expected.
  4. It is possible to estimate the true correlation between two constructs. Researchers can estimate all parts of the formula except the correlation between the true scores. Rearranging the formula gives the following:

    rxtyt = rxoyo / √(Rxx * Ryy)

    This formula is called the correction for attenuation, because it lets researchers see what the correlation would be if it were not weakened by measurement error. The corrected correlation is the correlation that would be observed if both measures were perfectly reliable, in which case the observed correlation equals the true correlation. A minimal sketch follows below.
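A minimal sketch of the correction for attenuation with hypothetical values (an observed correlation of .30 and reliabilities of .70 and .80):

```python
import math

def correct_for_attenuation(r_observed: float, r_xx: float, r_yy: float) -> float:
    """Estimated true-score correlation: r_observed / sqrt(Rxx * Ryy)."""
    return r_observed / math.sqrt(r_xx * r_yy)

print(round(correct_for_attenuation(0.30, 0.70, 0.80), 3))  # ~0.401
```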

In addition to reliability, what else should you pay attention to when looking at the results of a study?

Because measurement error reduces observed correlations, it complicates the interpretation and conduct of research. Results must always be interpreted in light of reliability. An important result of a study is the effect size. Some effect sizes express the extent to which variables are interrelated; others express the magnitude of differences between groups.

An example of an effect size that expresses how strongly two variables are related is the correlation coefficient. Higher reliability yields larger observed effect sizes; lower reliability shrinks them. Three effect sizes are commonly used, each in a different analytical situation: (1) the correlation, usually used to represent the relationship between two continuous variables; (2) Cohen's d, usually used for the relationship between a dichotomous variable and a continuous variable; and (3) eta squared (η²), usually used for the relationship between a categorical variable with more than two levels and a continuous variable.

A second important result of a study is statistical significance. A statistically significant result is treated as a real finding rather than a fluke. The observed effect size has a major influence on statistical significance: the larger the effect size, the more likely the result is to be statistically significant.

The effect of reliability on effect size and statistical significance is very important when looking at the results of a study.

Taking reliability into account when drawing psychological conclusions from a study has three important implications. The first is that researchers should always consider the effects of reliability on the results when they interpret effect sizes and statistical significance. The second is that researchers should use measures with high reliability, so that the problem of attenuation is kept to a minimum. Still, there are two reasons why a researcher sometimes uses measures with lower reliability. The first is that the construct of interest may lie in an area where high reliability is very difficult to achieve. The second is that developing or finding a more reliable measure takes time, money, and effort; researchers weigh the effort they are willing to invest against the reliability they want to achieve. The third implication is that researchers should report reliability estimates for their measures, so that readers can interpret the results properly.

What should you pay attention to in the construction and improvements of tests?

With test construction and improvement, attention is drawn to the consistency of the test parts and then the items are mainly looked at. Test developers test the items from a test and see which items can be removed or which have to be strengthened to improve the quality of psychometric tests.

To see whether an item contributes to internal consistency, the item mean, the item variance, and the item discrimination are examined. It is important to note that the procedures and concepts described below must be applied separately to each dimension measured by the test. For a unidimensional test, the following analyses are performed on all items together as one group; for a multidimensional test, they are performed separately for each dimension.

Item discrimination and other information concerning internal consistency

An important factor for the reliability of internal consistency is the extent to which the test items are consistent with each other. The internal consistency has an intrinsic link with the correlations between the items. With a low correlation, one item has little consistency with the other items and the internal consistency decreases.

To calculate the correlations between items, SPSS can produce an inter-item correlation matrix, but because many tests have many items this is not the most convenient method. Item discrimination is the extent to which an item distinguishes between people who score high on the test and people who score low. High discrimination values are needed for good reliability. There are several ways to quantify item discrimination. One is the item-total correlation: we compute a total score and then calculate the correlation between an item and that total. The item-total correlation reflects the extent to which differences in responses to the item correspond to differences in total scores. A high item-total correlation indicates that the item is consistent with the test as a whole.

In SPSS these appear as corrected item-total correlations; for each item, the correlation with the total score is calculated. It is "corrected" because the item itself is not counted in the total score. Another measure of item discrimination for binary items is the item discrimination index (D). It compares the proportion (p) of high-scoring test takers who answered the item correctly with the proportion of low-scoring test takers who answered the item correctly. The proportion of correct answers on the item is calculated within each of these two groups, and the difference is obtained by subtracting the proportion in the low group from the proportion in the high group.

D = phigh - plow

Items with high D values are better for internal consistency. SPSS offers two other statistics for examining internal consistency: the squared multiple correlation and "Cronbach's alpha if item deleted." The latter gives the alpha of the test when a particular item is removed.
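A minimal sketch (not SPSS output) of two item-discrimination statistics for a hypothetical matrix of binary responses: a corrected item-total correlation and the discrimination index D based on a median split into high- and low-scoring groups:

```python
import numpy as np

def corrected_item_total(scores: np.ndarray, item: int) -> float:
    """Correlation between one item and the total score of the remaining items."""
    rest_total = np.delete(scores, item, axis=1).sum(axis=1)
    return np.corrcoef(scores[:, item], rest_total)[0, 1]

def discrimination_index(scores: np.ndarray, item: int) -> float:
    """D = p(high group) - p(low group) for a binary item, using a median split."""
    totals = scores.sum(axis=1)
    high = totals >= np.median(totals)
    p_high = scores[high, item].mean()
    p_low = scores[~high, item].mean()
    return p_high - p_low
```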

Item variance and Item difficulty (mean)

The item's mean and variance are important factors that can influence the quality of a psychometric test. They can contribute to how consistent an item is with the rest of the items. This is important for the reliability of the test. A variable needs variability to be able to correlate with another variable.

If all test subjects answer the same, there is no variability. Variability is required with good reliability.

The item mean is linked to item variability and thus to psychometric quality: the mean says something about how much variability an item can have, and an item with limited variability contributes little to psychometric quality. For binary items the mean can also be interpreted as the item difficulty. If more people answer one question correctly than another, the difficulty differs; an item mean of 0.70, for example, means that 70% of respondents answered the item correctly. Classical test theory suggests that binary test items should have a mean of about 0.50, because that is where items have maximum variability.

This chapter explains how reliability and measurement errors affect the results of behavioral research. Awareness of these effects is crucial for behavioral research.

What is validity? - Chapter 8

What is validity?

For more than 60 years the following basic definition of validity was assumed: Validity is the extent to which a test measures what it is intended to measure. Although this definition has been used very much and still is, the concept of validity is presented a little too simple with this definition. A better definition would be that validity is the extent to which the interpretation of test scores for a specific purpose is supported by evidence and theory.   

This more advanced definition leads to three important implications:

  1. Validity refers to the interpretation of test scores regarding a specific psychological construct, it is not about the test itself. This means that a measurement is not valid or invalid, but that validity relates to the interpretation and use of measurements.
  2. Validity is a matter of degree, it is not an "all or nothing" construct.
  3. Validity is entirely based on empirical evidence and theory.   

Why is validity important?

Validity is perhaps the most important point regarding the psychometric quality of a test. This point is further substantiated in this section based on the role and importance of validity in psychological research and practice.  

Validity is needed to be able to perform psychological research. The goals of scientific research (describing, predicting, and explaining aspects of the world) depend to a large extent on our ability to manipulate and measure specific variables.

Suppose we are interested in the relationship between violent games and aggressive behavior. The hypothesis is that children show more aggressive behavior when they play a lot of violent games. To test this hypothesis it is essential to measure the variable "aggressive behavior" validly. Only then can we state with reasonable certainty that violent games are indeed associated with increased aggressiveness. If we do not validly measure aggressive behavior, that conclusion must be seriously questioned. Without test validity, then, we cannot make sound statements about the role of video games in aggressiveness.

More generally, we can state that without test validity, test-based decisions about individuals can be wrong and sometimes even harmful.  

What is the current perspective on validity and which facets play a role in this? 

For years there was a traditional perspective on validity, in which three types of validity were identified:

  1. Content validity.
  2. Criterion validity.
  3. Construct validity.

Nowadays construct validity is seen as the essential concept in validity. Construct validity is the extent to which a test score can be interpreted as a reflection of a certain psychological construct.     

Three major organizations (AERA, APA, and NCME) published a revision of Standards for Educational and Psychological Testing in 2014, emphasizing five facets of construct validity. Construct validity is determined by five types of information:

  1. Content;
  2. internal structure;
  3. response process;
  4. associations;
  5. consequences.    

1. Content

First, (construct) validity is determined by the extent to which the actual content of a test matches the content it should contain. In other words, when a test aims to measure intelligence, the items of the test must reflect the important facets of intelligence. This is also referred to as content validity. Content validity is usually evaluated by experts who are familiar with the relevant construct. There are two threats to content validity: (1) construct-irrelevant content and (2) construct underrepresentation. A test must not contain content (items, questions) that is irrelevant to the construct the test is intended to measure; construct-irrelevant content lies outside the focal construct and lowers validity. Second, a test must contain content that covers the full range of the construct; if it does not, validity is again reduced. In short, a test must cover the full breadth of the construct, no more and no less.

A closely related concept is face validity: the extent to which a measure appears to be related to a specific construct in the judgment of nonexperts (such as respondents or legal representatives).

2. Internal structure

A second important consideration for the validity of test interpretations is the internal structure, or dimensionality, of a test: the way in which the parts of the test are related to each other. Here too, an important question is the extent to which the actual structure matches the structure the test should have. The internal structure of a test can be assessed with factor analysis (introduced in Chapter 4 and discussed further in Chapter 12). Some items of a test correlate more strongly with certain items than with others; a set of items that correlate strongly with each other and form a cluster is called a dimension or factor. Factor analysis is used to identify the presence and nature of these factors, to map the associations between factors within a multidimensional test, and to identify which items are linked to which factors.

3. Response process

A third type of validity evidence is the correspondence between the psychological processes that respondents actually use when completing a measure and the processes they should use. Many psychological measures rest on assumptions about these processes. For an item such as "I often go to parties," for example, the researcher assumes that respondents try to remember how often they go to parties and then judge whether that number counts as "often." Various procedures have been developed to assess this response-process evidence. Some rely on direct evidence (for example, simply interviewing the respondents and asking what considerations they had while completing the measure), others on indirect evidence (for example, eye tracking).

4. Associations

A fourth type of proof of validity is the extent to which test scores are associated with other variables. This is about the degree of similarity between the actual associations with other measurements and the associations that the test should have with other measurements.    

There are different types of evidence. Convergent evidence is the extent to which test scores correlate with measures of related constructs (i.e., the extent to which test scores correlate with measures they should correlate with). Discriminant evidence is the extent to which test scores are uncorrelated with measures of unrelated constructs (i.e., the extent to which test scores do not correlate with measures they should not correlate with). In addition, a distinction can be made between concurrent and predictive validity. Concurrent validity is the extent to which test scores correlate with other relevant variables measured at the same time. Predictive validity is the extent to which test scores correlate with relevant measures taken at a later time.

5. Consequences

The fifth and final source of validity information is the consequences of testing. Are the test scores equally valid for men and women? Are the consequences of the test the same for different groups? Are there intended or unintended differential effects of the test on certain groups?

What other perspectives on validity are there?

Not everyone agrees with the validity perspective that we discussed in this chapter. Many other perspectives on validity are possible, but a detailed description of these perspectives goes beyond the purpose of this book. Three other perspectives for assessing validity are, however, briefly explained. 

  1. Criterion validity places less emphasis on the conceptual meaning or interpretation of test scores. Test users sometimes simply want to separate groups by means of a test and do not care which construct underlies it. Criterion validity is the extent to which test scores predict criterion variables; it encompasses concurrent validity and predictive validity. The psychological meaning of the test scores is relatively unimportant, because the only goal is to separate the groups. Nowadays criterion validity is considered part of construct validity.
  2. Another alternative is to find out what test scores actually mean rather than testing specific theoretical hypotheses about them. Researchers examine the actual meaning of the test scores and build an evaluation from there; this is called the inductive approach to validity. Its purpose is to discover the full meaning of the test scores, which means the construct itself may be revised along the way. In applied settings it can be used, for example, to develop a test for a specific job-performance criterion; in research it is used to explore new areas and to develop a theoretical basis for them. Test developers, however, generally do not spend much time and effort on this kind of further exploration of existing measures.
  3. One can also look at validity by emphasizing the connection between tests and psychological constructs. A test is only a valid measurement of the construct if the construct influences the performance of the test subjects during the test. According to Borsboom, Mellenbergh and van Heerden (2004), the first objective of validity is to provide the theoretical explanation for the outcome of the measurement.

In which respects do reliability and validity differ?

It is important to know the difference between reliability and validity. Reliability is the extent to which differences in test scores between people reflect real differences on the measured attribute, whatever that attribute may be. We can discuss the reliability of a test without interpreting the test scores: reliability concerns the consistency of responses to the test, whereas validity concerns the interpretation of the test scores. Validity is also more closely tied to psychological theory, while reliability is a more purely quantitative matter. Conceptually, there is usually no validity without reliability, but reliability is possible without validity: a reliable test is not necessarily valid.

For more than 60 years, the following basic definition of validity was assumed: Validity is the extent to which a test measures what it is intended to measure. Although this definition has been used very much and still is, the concept of validity is presented a little too simply through this definition. A better definition would be that validity is the extent to which the interpretation of test scores for a specific purpose is supported by evidence and theory.   

How to evaluate evidence for convergent and divergent validity? - Chapter 9

In the previous chapter we discussed the conceptual framework of validity, where we could identify five types of proof of validity. One of the types of evidence was convergent and divergent validity: the extent to which test scores have the "right" pattern of associations with other variables. This is discussed further in this chapter. 

What is a nomological network?

Psychological constructs are embedded in a theoretical context. A construct is connected to other psychological constructs, and this web of connections between a construct and related constructs is called a nomological network. According to this network, measures of one construct should be strongly associated with measures of some constructs and only weakly associated with measures of others. For validity it is important that the test scores match these expected associations as closely as possible.

What methods are there for evaluating convergent and discriminant validity?

There are four common methods for examining convergent and discriminant associations, and thus for evaluating convergent and discriminant validity:

  1. Focus on certain associations;
  2. correlation sets;
  3. multitrait multimethod matrices;
  4. quantify construct validity (QCV).

These four methods are discussed in this section. 

1. Focused associations

With some measurements it is fairly clear which specific variables are related to it. For the validity of the interpretations, the relationship between the test scores and those specific variables can then be examined. When the test scores are highly correlated with the variables, there is a strong validity and if the correlations are low, the validity can be called into question. Test developers get more confidence in the test if the correlation with relevant variables is high. These correlations are called validity coefficients. The quality of a test is higher when the validity coefficients are high.  

A process in which validity coefficients are examined across multiple studies is called validity generalization. Most validity evidence comes from relatively small studies, in which the correlation is calculated between the test scores and scores on the criterion variables. Such small studies are common and useful, but they have a drawback: if a test yields excellent validity evidence at one location or in one population, this does not necessarily mean that the same will hold at another location or in another population.

Studies that look at the validity generalization are meant to investigate the usefulness of the test scores. This type of research is a kind of meta-analysis, they combine the results of several smaller studies into one large analysis. There are three important things about validity generalization:

  • It can reveal a general level of the predicted validity of all smaller studies.
  • It can reveal the degree of variability between the smaller studies.
  • It deals with the source of the variability between the smaller studies. Further analysis of small studies may explain differences between these studies.

2. Sets of correlations

The nomological network of a construct can have associations with other constructs of different levels. As a result, when evaluating convergent validity and discriminant validity, a large amount of criterion variables can be considered.

Researchers usually calculate all correlations between the test variable and the criterion variables and then judge, somewhat subjectively, which correlations (and thus which criterion variables) are relevant, that is, which criterion variables belong in the nomological network. This approach to evaluating validity is common. Researchers first collect as much data as possible from many relevant measures; they then examine the correlation patterns, and the patterns that are meaningful for the test are taken into account in evaluating it.

3. Multitrait multimethod matrices

Campbell and Fiske developed the multitrait-multimethod matrix (MTMMM) from the conceptual foundation laid by Cronbach and Meehl. In an MTMMM analysis, construct validity is examined by measuring multiple properties (traits) with multiple methods. The purpose is to obtain a clear evaluation of convergent and discriminant validity. Two important sources of variance can influence the correlations between the measures: property variance and method variance. A high correlation between two measures can mean that they share property variance, but a correlation can also be high because both properties were measured with the same method; the measures then share method variance. This can produce a correlation even when the properties themselves are unrelated: a respondent with low self-esteem, for example, may answer both measures in a similar way simply because the same method was used. A high correlation can therefore indicate shared property variance, shared method variance, or both. Conversely, a correlation can be weak because two different methods were used, even though the properties are actually related. This makes construct validity difficult to interpret: each correlation is a mix of property variance and method variance. The MTMMM analysis organizes the relevant information and makes it easier for researchers to interpret the correlations.

An MTMMM analysis must be a good test of different correlations that represent different property and method variances. This is possible, for example, with two correlations:

  • A correlation in which the same property was tested with two different measurements.
  • A correlation in which different properties were tested with one type of measurement.

The first correlation is expected to be strong and the second correlation to be weaker. When the method variance is included, it can be expected that the first correlation is weaker and the second correlation stronger.

Campbell and Fiske (1959) have derived four types of correlations from the MTMMM:

  1. Heterotrait-heteromethod correlations: these are different traits, measured with different methods.
  2. Heterotrait-monomethod correlations: here different properties are subjected to the same method.
  3. Monotrait-heteromethod correlations: the same property (construct) is measured with different methods.
  4. Monotrait-monomethod correlations: the same property is measured with the same method. These correlations represent reliability: the correlation of a measurement with itself.

The MTMMM analysis provides a clear way to evaluate construct validity by examining property variance and method variance through these different correlations. Convergent validity is examined through the monotrait-heteromethod correlations: correlations between measures that share property variance but not method variance should be larger than correlations between measures that share neither property variance nor method variance (heterotrait-heteromethod), and they should also be larger than correlations between measures that share method variance but not property variance (heterotrait-monomethod).

Today, improvements for the MTMMM analysis are still being looked into. Despite the familiarity of the MTMMM analysis, it is not often used.

4. Quantifying Construct validity (QCV)

With this method, researchers quantify the extent to which the theoretical predictions for the convergent and discriminant correlation fit with the actual obtained correlations. So far, evidence for convergent validity and discriminant validity has been primarily subjective. The one may find the correlation strong while the other experiences it as less strong. The QCV procedure has been developed to obtain as objective and precise validity as possible. This makes the fourth method different from the other three methods.

The effect measures are computed first in the QCV analysis; they indicate the extent to which the actually obtained correlations correspond to the predicted correlations. These effect measures are called r-alerting-CV and r-contrast-CV. High, positive values mean that the actual convergent and discriminant correlations closely resemble the predicted convergent and discriminant correlations. Second, the QCV analysis includes a test of statistical significance, which examines whether the correspondence between the predicted and obtained correlations could have arisen by chance.

The QCV analysis takes place in three phases:

  1. Researchers make clear predictions about the expected convergent and discriminant validity correlations. Good consideration must be given to the criteria attached to the measurements and a prediction must be made of each correlation relevant to the test.
  2. In the second phase the researchers collect the data and the actual convergent and discriminant correlations are calculated. These correlations show the actual correlations between the variable that we are interested in and the criterion variables.
  3. In the third phase, the degree to which the predicted correlations and the actual correlations match is quantified. A good match indicates high validity; a poor match indicates low validity. Two types of results are reported: the effect measures and the statistical significance. For the effect measure r-alerting-CV, the correlation between the predicted correlations and the actual correlations is calculated; a high, positive value means that predicted and actual correlations match well. The same holds for r-contrast-CV: the larger it is, the better for convergent and discriminant validity. The r-contrast-CV is similar to r-alerting-CV, but it is corrected for the intercorrelations among the criterion variables and for the absolute level of the correlations between the focal test and the criterion variables. The statistical significance is also examined in this third phase, taking into account the sample size and the amount of convergent and discriminant validity; a z-test is used to determine whether the observed correspondence could have arisen by chance.

The QCV approach can be useful, but it is not perfect. The effect measures may take low values because of incorrect predictions even when the evidence for validity is strong, and the criterion variables may be poorly chosen. Another criticism is that researchers have sometimes obtained high values for the effect measures even though the predicted convergent and discriminant correlations did not match the actual ones well.

Multiple strategies can be used in the analysis of tests. Although the QCV analysis is not perfect, it still has advantages over the other methods. First, the QCV analysis allows researchers to look closely at the pattern of convergent and discriminant validity that would theoretically make sense. Second, it allows the researchers to make explicit predictions about the associations with other variables. Thirdly, the QCV analysis focuses on the variable of interest. Finally, it provides an interpretable value that reflects the extent to which the actual outcomes match the predicted outcomes, and the QCV analysis also contains statistical significance.

Which factors can influence the validity coefficients? 

In the previous section we discussed strategies that can be used to accumulate and interpret evidence for convergent and/or divergent validity. All of these strategies depend to a greater or lesser extent on the size of the validity coefficients. Validity coefficients are statistical results that represent the degree of association between a test and one or more criterion variables. It is important to be aware of the factors that can influence validity coefficients; for that reason we discuss those factors in this section.

The associations between constructs

A factor that influences correlation is the true association between two constructs. If two constructs are strongly associated with each other then a high correlation is likely to result. With predictions, a correlation is expected, because it is thought that there is a connection between the constructs.

The measurement error and reliability

Measurement error can influence correlations and therefore also validity coefficients. The correlation between observed scores on measures of two constructs is:

rxoyo = rxtyt √( Rxx * Ryy)  

Here rxoyo is the correlation between the two observed scores, rxtyt is the true correlation between the two constructs, Rxx is the reliability of the test variable, and Ryy is the reliability of the criterion variable. To evaluate convergent validity, researchers must compare the obtained correlations with the expected correlations, keeping in mind that two reliabilities are involved: the reliability of the test and the reliability of the criterion measure. If the criterion measure has low reliability, the validity coefficient will be lower as well. When the reliability of one of the measures is low, this can be handled in two ways. The first is to give less weight to the measure with low reliability when judging validity. The second is to adjust the validity coefficient using the correction for attenuation. To adjust the coefficient for the reliability of one measure (the criterion), this form of the formula can be used:

rXY-adjusted = rXY-original / √Ryy  

rXY-original is the original validity correlation, Ryy is the estimated reliability of the criterion variable and rXY-adjusted is the adjusted validity correlation.  
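A minimal sketch of this correction, assuming hypothetical values; the function name correct_for_attenuation and its arguments are illustrative only. Passing only the criterion reliability applies the one-measure formula above; passing both reliabilities applies the full correction.

```python
import math

def correct_for_attenuation(r_xy, rel_y, rel_x=None):
    """Adjust an observed validity correlation for unreliability.

    rel_y -- reliability of the criterion measure (always used)
    rel_x -- reliability of the test itself (optional; include it to
             correct for unreliability in both measures)
    """
    denom = math.sqrt(rel_y) if rel_x is None else math.sqrt(rel_x * rel_y)
    return r_xy / denom

# Hypothetical example: observed validity .35, criterion reliability .70
print(correct_for_attenuation(0.35, rel_y=0.70))              # about .42
print(correct_for_attenuation(0.35, rel_y=0.70, rel_x=0.80))  # about .47
```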

A limited range

A correlation coefficient reflects the covariability between two distributions of scores (we discussed this earlier in Chapter 3). The amount of variability in the distributions can influence the correlation between the two sets of scores. A restricted range in either distribution can therefore limit the correlation, so that it provides relatively weaker evidence of validity.

There are no clear, simple guidelines for identifying the degree of range restriction; it mainly requires careful thought from researchers and knowledge of the relevant tests and variables. For example, a researcher must check whether the observed scores cover the full range that is theoretically possible for the construct. A common way to assess the consequences of range restriction is to look at the convergent and discriminant correlations themselves: where strong correlations are expected, the convergent evidence is examined, keeping in mind that these correlations can be lowered by a restricted range.

The relative proportions

The skew of the distributions of the scores also affects the size of the validity coefficient. If the two variables that correlate with each other have a different skew then the correlation between these variables will be reduced. So if research is being done into a variable with a very skewed distribution, then it is possible that a relatively small validity coefficient will come out. 

The formula for the correlation between a continuous and a dichotomous variable (rCD) is:

rCD = cCD / (sC * sD)

cCD is the covariance between the two variables, sC is the standard deviation of the continuous variable and sD is the standard deviation of the dichotomous variable. The proportions of observations in the two groups of the dichotomous variable directly influence both the covariance and the standard deviation. The covariance is:

cCD = p1p2 (C2avg - C1avg )   

p1 is the proportion of participants in group 1, p2 is the proportion of participants in group 2, C1avg is the average of the continuous variable in group 1 and C2avg is the average of the continuous variable in group 2. The standard deviation of the dichotomous variable is the second term that is influenced by the proportion of observations. The formula for this is:    

sD = √(p1 * p2)

The calculation for the correlation can be converted to show the direct influence of the relative proportions:

rCD = √(p1 * p2) * (C2avg - C1avg) / sC

This formula shows the influence of the group proportions on the validity correlation. When the validity coefficient is based on a continuous variable and a dichotomous variable, it can be affected by differences in the sizes of the groups: the validity coefficient will be lower when the groups differ markedly in size.
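A small illustration of the last formula in plain Python (hypothetical numbers; the function name point_biserial is ours): with the same mean difference, a 90/10 split between the groups yields a noticeably smaller correlation than a 50/50 split.

```python
import math

def point_biserial(p1, mean1, mean2, sd_continuous):
    """Correlation between a dichotomous and a continuous variable, computed
    from the group proportions, the group means, and the standard deviation
    of the continuous variable."""
    p2 = 1 - p1
    return math.sqrt(p1 * p2) * (mean2 - mean1) / sd_continuous

# Same mean difference (half a standard deviation), different group proportions
print(point_biserial(0.50, 0.0, 0.5, 1.0))  # equal groups   -> about .25
print(point_biserial(0.90, 0.0, 0.5, 1.0))  # unequal groups -> about .15
```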

The method variance

This was discussed earlier in the MTMM (multitrait-multimethod) analysis. Correlations between measures that use two different methods are smaller than correlations between measures that share one method. If only one method is used, the correlation is likely to be larger, because it also contains shared method variance.

Time

Validity coefficients based on correlations calculated from measurements at different times are smaller than correlations calculated from measurements at the same times. And longer periods between two moments in time will produce smaller predictive validity correlations.

The predictions of some events

An important factor that can influence the validity coefficient is whether the criterion variable is a single event or an aggregate of multiple events. Single events are more difficult to predict than aggregates of multiple events, so large validity coefficients are more likely when the criterion variable is based on an aggregation of multiple events.

How can you interpret the validity coefficient? 

Once the validity coefficient has been determined, it must be decided whether it is high enough to count as convergent evidence, or low enough to count as evidence of discriminant validity. Although the correlation quantifies the relationship between two measures precisely, its size is not always intuitive to interpret. Especially for inexperienced researchers, evaluating validity can therefore be difficult: they do not know when a correlation should be considered strong or weak.

The explained variance and squared correlations

In psychological research it is common to use squared correlations. A squared correlation shows the proportion of variance in one variable that is explained by the other variable. The explained-variance interpretation is attractive because, as noted earlier, research is largely about measuring and understanding variability: the more variability you can explain, the better the phenomenon is understood. Analysis of variance (ANOVA) is built on this same logic of explained variance.

There are three reasons for criticizing the squared correlation:

  1. In some cases it is technically wrong.
  2. Some experts say that the variance itself is a non-intuitive metric. For a measurement of differences in a set of scores, the variance is based on the squared deviations from the mean.
  3. Squaring the correlation can make the relationship between two variables appear smaller.

The squared correlation approach for interpreting the validity coefficients is widely used, but it can also be misleading. It also has a number of technical and logical problems.

Estimating practical effects

One way to interpret the correlation is to estimate how much effect it has in real life. The greater the correlation between the test and the criterion variable, the more successful it can be used in decisions about the criterion variable.

Four procedures have been developed to interpret correlations in terms of their practical effects:

1. Binomial Effect Size Display (BESD)

This procedure has been developed to show the practical consequences of using correlations to make decisions. With the BESD you can see how many predictions would be successful and how many would be unsuccessful if decisions were based on the correlation. This is displayed in a 2x2 table. The following formula is used to predict the number of people in a cell of the table:

Cell A = 50 + 100 (r / 2)

r is here the correlation between the test and the criterion. The following formula applies to cell B:

Cell B = 50 - 100 (r / 2)

Cell C has the same formula as Cell B and Cell D has the same formula as Cell A.

By putting the validity correlation into such a table and converting the numbers into successful predictions, it is easier to see whether the test has good validity. The criticism is that the procedure is only appropriate when as many people score high as score low on the test, and when half of the sample is 'successful' on the criterion and the other half is not; the BESD assumes these equal proportions.
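A minimal sketch of the BESD computation (the function name besd_table is illustrative); it simply applies the cell formulas above and therefore inherits the 50/50 assumption.

```python
def besd_table(r):
    """Binomial Effect Size Display: convert a validity correlation into a
    2x2 table of predicted successes and failures per 100 people per row,
    assuming the 50/50 split the BESD is built on."""
    a = 50 + 100 * (r / 2)   # high test score, successful outcome
    b = 50 - 100 * (r / 2)   # high test score, unsuccessful outcome
    c = b                    # low test score, successful outcome
    d = a                    # low test score, unsuccessful outcome
    return [[a, b], [c, d]]

print(besd_table(0.30))  # [[65.0, 35.0], [35.0, 65.0]]
```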

2. Taylor-Russell Tables

These tables can be used when the assumption of equal proportions is unfounded.

These tables give the probability that a prediction based on an 'acceptable' test score will lead to a successful outcome on the criterion. Like the BESD, the Taylor-Russell tables treat the test and the outcome as dichotomous variables. The difference from the BESD is that the Taylor-Russell tables allow decisions based on different proportions. To use the Taylor-Russell tables, we need to know the size of the validity coefficient, the selection proportion, and the proportion of selections that would be successful if the selection were made without the test.

3. Utility Analysis

The utility analysis formulates validity in terms of costs versus benefits. Researchers must assign values to various aspects of the testing and decision-making process. First, the benefit of using the test to make decisions, compared with other methods that could be used, must be estimated. The researcher must then estimate the costs (disadvantages) of using the test to make the decision.

4. Sensitivity and specificity

This is especially useful for tests designed to identify a categorical difference. The ability of the test to make the correct identifications with regard to the categorical difference can then be evaluated. An example is a diagnosis where the disorder may be present or absent. There are four possible outcomes:

  1. True positive, the test provides a good identification where the disorder is really present (true positive).
  2. True negative, the test gives a good identification where the disorder is not present (true negative).
  3. False positive, the test indicates that the disorder is present while in reality it is not(false positive).
  4. False negative, the test indicates that the disorder is absent while it is actually present(false negative).

Sensitivity and specificity are values that summarize the proportions of correct identifications. Sensitivity is the probability that someone who has the disorder is correctly identified by the test. Specificity is the probability that someone who does not have the disorder is correctly identified by the test. In reality one can never know with certainty whether someone has the disorder, but a trusted criterion (such as a clinical diagnosis) serves as the standard. We will illustrate both concepts with an example (see the table below).

Table 1.

                                            In reality, the disorder is ...
                                            Present          Absent

According to the test, the disorder is ...
   Present                                     80              120
   Absent                                      20              780

In this example there are 80 true positives, 120 false positives, 20 false negatives and 780 true negatives. The sensitivity can be calculated by:

Sensitivity = true positives / (true positives + false negatives)

In the example this amounts to: 80 / (80 + 20) = 80/100 = .80. The sensitivity (the proportion of individuals with the disorder who were correctly identified by the test) is therefore .80. In other words, 80% of people who actually have the disorder are also identified as such by the test. Although there is a high (ie good) sensitivity here, sensitivity alone is not sufficient to claim that there is a high validity. There must also be a high degree of specificity.  

Specificity = true negatives / (true negatives + false positives)

In the example this amounts to: 780 / (780 + 120) = 780/900 = .87.

The proportion of people without a disorder who are also identified as such by the test (ie as the absence of the disorder) is .87. In this example, there is therefore a high degree of sensitivity and specificity. Based on this, we can state that it is plausible that there is a high validity.   

Finally, on the basis of sensitivity and specificity, you can calculate the correlation between the test results and the clinical diagnosis using the following formula:

r = (TP * TN - FP * FN) / √((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

with TP = true positives (true positive), TN = true negatives (true negatives), FP = false positives (false positives) and FN = false negatives (false negatives).

In our example, the correlation is:

r = (80 * 780 - 120 * 20) / √((80 + 120) * (80 + 20) * (780 + 120) * (780 + 20))

r = 60000/120000

r = .50

The correlation between the test results and the clinical diagnosis is therefore .50. 
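A short check of the example's arithmetic in plain Python, with the counts taken from the table above.

```python
import math

# Counts from the example table
tp, fp, fn, tn = 80, 120, 20, 780

sensitivity = tp / (tp + fn)   # .80
specificity = tn / (tn + fp)   # .87
r = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)                              # .50

print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}, r = {r:.2f}")
```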

Guidelines and standards in the field

Another way to look at correlations is to evaluate them in context. Different requirements apply in one research field than in another; findings in the physical sciences are typically much stronger than findings in the behavioral sciences. According to the guidelines of Cohen (1988), correlations of 0.10 are considered small in psychology, correlations of 0.30 medium and correlations of 0.50 large. More recently, Hemphill (2003) proposed new guidelines: a correlation below 0.20 is small, between 0.20 and 0.30 is medium and above 0.30 is large.

Statistical significance

Statistical significance is an important part of inferential statistics. Inferential statistics are procedures that help us draw conclusions about populations. Most studies have a relatively small number of participants, and researchers use this sample to represent the entire population, assuming that the data are a good reflection of what they would obtain if they examined the whole population. Nevertheless, researchers are aware that statements about the sample cannot simply be transferred to the entire population.

Inferential statistics are used to gain more confidence in statements about a population when only a sample has been studied. When a result is statistically significant, we have more confidence that it reflects a real effect in the population rather than chance. If a result is not statistically significant, the correlation may simply have been obtained by chance and cannot be trusted to represent reality. It is therefore understandable that many researchers consider statistical significance very important.

When evaluating convergent validity, it is expected that the validity coefficients are statistically significant. When evaluating discriminant validity, it is expected that the validity coefficients are not statistically significant. With statistical significance, the questions are: do we believe that the validity correlation in the population from which the sample was taken is non-zero, how sure are we of that, and are we sure enough to draw that conclusion? Two factors influence these questions: the size of the correlation in the sample and the size of the sample. Confidence rises as the sample correlation moves further from zero, but a non-zero correlation in a sample can occur even when the correlation in the population is zero. The second factor is the size of the sample: the larger the number of participants, the greater the confidence in the sample. So larger correlations and larger samples increase the chance that a result is statistically significant.

Are we sure enough that the correlation in the population is not zero? By convention, researchers require 95% confidence before calling a result statistically significant; that is, a result is statistically significant if there is at most a 5% chance of wrongly concluding that the population correlation is non-zero (the alpha level). It is possible for low correlations to be statistically significant and for high correlations not to be, depending on the sample.

A non-significant convergent validity correlation may be due to a small correlation or a small sample. If the correlation is small, this is evidence against the convergent validity of a test. If the correlation is medium to large but the sample is small, it is not necessarily the case that the convergent validity is poor; in this case the study is weak because the sample was too small.

With discriminant validity, a large correlation provides evidence against the discriminant validity of a test. A significant discriminant validity correlation can arise because the correlation is large or because the sample is large. If the correlation is small but the sample is large, this does not necessarily mean that the discriminant validity is poor; in such cases the statistical significance says little and it is better to ignore it.
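As an illustration of how the size of the correlation and the size of the sample jointly determine significance, the sketch below uses the standard t-test for a correlation (t = r√(n−2)/√(1−r²)) with hypothetical numbers; it assumes SciPy is available for the p-value.

```python
import math
from scipy import stats

def correlation_significance(r, n, alpha=0.05):
    """Two-tailed test of the null hypothesis that the population
    correlation is zero, using t = r*sqrt(n-2)/sqrt(1-r^2) with n-2
    degrees of freedom."""
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, p, p < alpha

# Same correlation, different sample sizes (hypothetical numbers)
print(correlation_significance(0.30, n=30))    # small sample: not significant
print(correlation_significance(0.30, n=200))   # large sample: significant
```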


What types of response bias are there? - Chapter 10

What types of response bias are there? - Chapter 10

Chapters 10 and 11 deal with threats to psychometric quality. Two types of threats are central here: reaction bias and test bias. Chapter 10 is about response bias. Test bias is discussed in chapter 11.

What is response bias?

Consciously or unconsciously, cooperatively or not, self-enhancing or self-effacing: response bias plays a constant role in psychological measurement. Response bias means that the way respondents answer (negatively) affects the quality of the psychological measurement. Bias refers to the systematic distortion of responses or outcomes, which are therefore often inaccurate.

What types of response bias are there?

There are different types of response bias, each type being influenced by different factors:

  • Influenced by the content or design of a test
  • Influenced by factors of the test context
  • Influenced by conscious possibilities to react in an invalid way
  • Influenced by unconscious factors

These factors lead to six types of response bias: (1) acquiescence bias (saying yes and saying no); (2) extreme (vs. average) responding; (3) social desirability ("faking good"); (4) malingering ("faking bad"); (5) random or careless responding; (6) guessing.

1. Acquiescence bias (saying yes and saying no)

Acquiescence bias arises when an individual agrees with statements without paying attention to the meaning of those statements. This is common with personality tests, questionnaires about one's own views or opinions, interest questionnaires and clinical assessments.

When someone answers all questions in ''one direction'' (that is, either only positively or only negatively), these responses may reflect a valid set of answers, or they may simply reflect a response bias.

Correlations between tests of the same type (from the same respondent) tend to be strong, because a respondent who responds acquiescently on one test is likely to respond acquiescently on other tests as well.

The causes of this reaction bias are:

  • The items are complex (too difficult) or ambiguous (open to more than one interpretation).
  • The test situation creates distractions.
  • The respondent simply does not understand the material, and so fills in just anything.

Saying no: bias creates low test scores in the same (negative) direction.
Saying yes: bias creates high test scores in the same (positive) direction.
Consequence: correlations appear higher/stronger than they actually are.

2. Extreme (vs. average) responses

Even when two respondents have the same level of the characteristic measured by a test, one respondent may be more inclined to give ''extreme'' answers, while the other prefers to give ''average'' answers.

Example: the statement is 'I am spiritual' and the response options are: not at all, not really, neutral, a bit, completely. An 'extreme' respondent gives one of the most extreme answers: either 'not at all' or 'completely'. An 'average' respondent gives a less extreme answer: either 'not really' or 'a bit'.

Extreme or average responses are not in themselves a bias; they may simply reflect the individual's trait level. People with more extreme trait levels should give more extreme responses and people with more average trait levels should give more average responses.

Problems do arise such as:

  • People with identical character traits use different levels, for example one respondent uses extreme responses and the other respondent uses average responses.
  • People with different character traits use the same level, for example, both respond extremely or both respond on average.

3. Social desirability ("faking good")

Social desirability response bias occurs when the respondent intends to respond in a way that is socially acceptable, regardless of his or her actual characteristics.

This is influenced by:

  • The test content: when the subject of the test links with social desirability.
  • The test context: when the consequences of the given reactions are important.
  • The personality of the respondents: some people are more inclined to respond socially desirable.

Here too, correlations between variables appear higher than they actually are.

Del Paulhus did a lot of research into socially desirable responses as an aspect of personality. According to him there are two processes:

  1. Impression management: intention to appear socially desirable (sometimes called ' faking good ').
  2. Self-deception: unrealistic positive image of yourself. For example, overestimating psychological characteristics.

Another distinction can also be made:

  • State-like: impression management (consciously responding in a way that is appropriate in a certain situation).
  • Trait-like: self-deception (one has more aptitude for self-deception than the other).

4. Malingering ("faking bad")

Although many researchers are concerned about the problem of social desirability, the opposite can also occur. In some situations, respondents may be more inclined to exaggerate the nature and severity of their psychological problems. Or even pretending that something is wrong, while nothing is really wrong. This is the opposite of the social desirability bias. According to some studies, this form of reaction bias occurs in some test contexts in 7.3% to 27% of cases in psychological evaluations, and even up to 35-45% in forensic evaluations.  

5. Random or careless responding

Sometimes respondents give answers that are completely or partially random. This can have various causes: a lack of motivation, fatigue and so on. Random or careless responding leads to meaningless scores. For example, a respondent may give the same answer to every item (for example, "neutral" or "agree"), or may answer each item randomly regardless of its content; either way, the test scores are meaningless. It is estimated that this type of response bias occurs in 1-10% of respondents.

6. Guessing

The last type of response bias is guessing. Guessing occurs with questions that have only one correct answer. The result is a mismatch between observed differences and true differences between respondents, because one respondent is lucky when guessing and another is not.

What methods are there for dealing with response bias? 

There are three broad strategies for dealing with reaction bias: 

  1. Managing the test context.
  2. Managing the test content and / or scores.
  3. Use specially designed 'bias' tests.

In addition, we can distinguish three goals when dealing with reaction bias:

  1. Minimizing the occurrence of reaction bias
  2. Minimizing the effects of reaction bias.
  3. Discovering reaction bias, possibly intervene.

These strategies and goals can be combined to summarize different methods for reaction bias. We will first list these and then discuss them in more detail.  

  1. Strategy 1 + goal 1 = anonymize, minimize frustration, warnings
  2. Strategy 2 + goal 1 = simple items, neutral items, forced choices, minimal choice
  3. Strategy 2 + goal 2 = balanced scales, correction for guessing
  4. Strategy 2 + goal 3 = embedded validity scales
  5. Strategy 3 + goal 3 = social desirability tests, extremity tests , acquiescence tests

1. Minimize the occurrence of the response bias + manage test context 

The occurrence of reaction bias can be minimized by managing the way in which the test is presented to the respondent and by managing the conditions that are set for the respondent within the test situation.

  • Reducing situational factors that can cause socially desirable responses.
  • Tell the respondent that it is being processed anonymously, so the respondents are less inclined to respond socially desirable. The personal responses do not have any consequences for the respondent, so they are more likely to respond honestly.

A disadvantage of anonymity is that respondents may put in less effort or be less motivated, resulting in quick and random responding.

Solution: tell respondents that the validity of their responses to the items is being checked; in other words, that false or invalid responses will be detected and removed. This is an especially good solution for malingering (exaggerating psychological problems).

2. Minimizing response bias + manage test content 

Choosing certain forms of test content to reduce the occurrence of reaction bias.

  • Formulate the items so that they are easy to understand.
  • Use neutral terms in the items. So that no link can be made with socially desirable / acceptable answers by respondents.
  • Forced choice ( forced-choice ): there are only two answers (which clearly differ from each other) from which you can choose. You must indicate which term best fits your personality. For example, timid or argumentative.

This is a good solution for, for example, extreme responding, because the response options are reduced to two (e.g., yes or no).

3. Minimize effects of response bias + manage test content / scores 

Using specialized score procedures to reduce the effect of reaction bias.

One of those specialized scoring procedures is the use of so-called 'balanced scales'. This is used as a solution for acquiescence bias (saying yes and saying no). The problem is that otherwise no distinction can be made between those who really have high or low scores and those who simply answer every item in a positive or negative direction. Balanced scales are tests or questionnaires that deliberately contain both positively worded and negatively worded items, rather than only positively worded or only negatively worded items.

Consequence: in this way, respondents must pay attention to whether an item is worded positively or negatively, and those who do not can be identified. This distinguishes respondents who are genuinely attentive and honest from respondents who answer in one direction only (an invalid way of responding).

Another specialized scoring procedure is to weigh incorrectly answered items differently from unanswered items. This is mainly used as a solution for guessing. For example, a correctly answered item receives one point, an incorrectly answered item results in a ¼-point deduction, and an unanswered item receives zero points. This minimizes the effect of guessing.
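A minimal sketch of such formula scoring in plain Python, with hypothetical counts; the ¼-point penalty matches the example above (it corresponds to the classic correction for items with five response options).

```python
def corrected_score(n_correct, n_wrong, n_blank=0, penalty=0.25):
    """Formula scoring: +1 per correct answer, a fractional penalty per
    wrong answer, and 0 for items left blank."""
    return n_correct * 1.0 - n_wrong * penalty + n_blank * 0.0

# Hypothetical respondent: 30 correct, 8 wrong, 2 blank
print(corrected_score(30, 8, 2))  # 28.0
```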

4. Managing test content to discover reaction bias + intervention

Identifying respondents who show a form of response bias. Validity scales are used that examine the pattern of responses during the test and evaluate the extent to which that pattern reflects things such as random responding, acquiescent responding, 'faking good' and 'faking bad'.

  • L-scale: measuring instrument to detect social desirability of bias.
  • F-scale: measuring instrument for detecting, for example, malingering ('faking bad').
  • K-scale: measuring instrument to detect ' faking good '.
  • VRIN scale: measuring instrument to detect random reactions.
  • TRIN scale: measuring instrument to detect acquiescence bias.

We can detect the acquiescence bias by means of the reverse scale and the balanced scale .

Intervention: after discovering reaction bias, action is taken where it is possible.

  • Do not include test scores of an individual in further analysis.
  • Keeping (suspicious) scores, but handling the scores with care.
  • Retain (invalid) scores, and use statistical control procedures for potential invalid scores.

5. Use special 'bias' test to discover reaction bias + intervention

Use several scales to measure reaction bias.

These scales allow test users to identify and remove potentially invalid responses. And they allow test users to statistically check the effects of reaction bias.

Use of these scales in two ways:

  1. To better understand the reaction bias by studying causes, implications, correlations with other variables, etc.
  2. Use scales to measure the extent of (possible) reaction bias in test scores, which may have influenced the test scores .

There are also scales that measure individual differences in the tendency to respond in a socially desirable way, for example the Marlowe-Crowne Social Desirability Scale. It does this with items that are answered as true or false.

There are also tests that can reveal malingering. One such test is the Dot Counting Test (DCT). In this test, people must count the dots on twelve different cards as quickly as possible. Malingering is suspected when respondents take just as long to count the cards with dots that are grouped as the cards with dots that are randomly scattered.

Terminology

Finally, two terms are briefly clarified: response sets and response styles. Response sets are temporary aspects of the test situation or of the test itself (situation-related). Response styles are stable characteristics of individuals (person-related).


What types of test bias are there? - Chapter 11

What types of test bias are there? - Chapter 11

In the previous chapter we discussed response bias, a common threat to the psychometric quality of tests. In this chapter we discuss the second major threat: test bias. Test bias arises when the relationship between true scores and observed scores differs between two groups, for example men and women. The emphasis of test bias is therefore on systematic differences between groups of respondents. Please note that the identification of differences between groups does not necessarily mean that there is also (test) bias; it may be that these differences actually exist in reality.

What types of test bias are there?

In general, two types of test bias can be distinguished: construct bias and predictive bias. Construct bias concerns the meaning of a test; predictive bias concerns the use of a test for prediction. These two types of test bias are independent of each other. In other words, one type of bias can exist in a test without the other.

1. Construct bias

Construct bias concerns the relationship between true and observed scores. It means that the test has a different meaning for the two groups. If the interpretation differs per group, construct bias is present. This can lead to situations in which two groups have the same average true scores but different average observed scores on a test.

2. Predictive bias

Predictive bias concerns the relationship between scores from two different tests, where the scores on one test (the so-called predictor test) are used to predict the scores on the other test (the so-called outcome test). Predictive bias exists when the relationship between the predictor test and the outcome test differs between two groups. In other words, the predictor test predicts well for one group but poorly for the other group.

What are the ways to identify test bias ?  

There are roughly two categories of procedures to identify test bias: (1) internal methods that identify construct bias; (2) external methods that identify predictive bias.

A difference in test scores between two groups does not necessarily mean that there is test bias; perhaps the difference is based on reality. For example, if a test shows that the weight of men is on average higher than the weight of women, then this reflects reality. But you may have doubts when it comes to, say, math skills: there is no obvious reason why the math skills of men should be better than the math skills of women.

How can you discover construct bias?

Since we never know the true scores of a person, we use procedures that provide an estimate of the existence and extent of a construct bias.

We use internal structures to find out whether there is a construct bias. These contain a pattern of correlations between items and/or correlations between each item and the total score. Evaluation is as follows: we compare the internal structures for a test separately for two groups. If the two groups exhibit the same internal structures in terms of their test responses, we can conclude that the test does not suffer from construct bias. Conversely, if the two groups do differ in internal structures with regard to the test reactions, then there is construct bias.

There are five methods to discover construct bias:

  1. Reliability.
  2. Ranking (rank order).
  3. Item discrimination index.
  4. Factor analysis.
  5. Differential item function analysis.

These five methods are discussed in more detail below.

1. Reliability

An intuitive way to evaluate construct bias is by estimating the reliability for each group separately. One of the ways to estimate reliability is through internal consistency (coefficient alpha). In Chapter 6 we discussed that internal consistency refers to the degree to which the parts of a test are interrelated. Translated to this context, this means that alpha provides insight into the internal structure of a test, that is, whether the test items are consistent with each other or not. Group differences in reliability are an indication that the test does not "work" equally well for different groups.

2. Ranking (rank order)

If the items can be ordered by difficulty, we can use this ranking to make a relatively quick and easy estimate of construct bias. The ranking is made separately for different groups and the rankings are then compared with each other. If the rank order of the items differs between the groups, this is an indication of construct bias.

3. Item discrimination index

The item discrimination index represents the extent to which an item is related to the total test score. It reflects how well the item distinguishes between people with different levels of the construct being measured.

An item discriminates strongly between people with varying levels of the construct when people with a high level of the ability have a high probability of answering the item correctly, while people with a low level of the ability have a small probability of answering it correctly (a high item discrimination index value, e.g., 0.90). This means that the item is a good reflection of the construct that is measured by the test.

An item does not make a good distinction between people with varying levels of the construct being measured, when people with a low capacity give almost as many correct answers as people with a high capacity. An example of a low item discrimination index value is 0.10.

The item discrimination index can be used to estimate construct bias. An item is selected, from which we calculate the item discrimination index separately for each group. We then compare the group indexes per item. Equal indexes = no test bias. Uneven indexes = probably test bias. It is important to know that the item discrimination index is independent of the number of people in a group.
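A rough sketch of this comparison on simulated (hypothetical) data, using the corrected item-total correlation as the discrimination index; in a real analysis the two response matrices would come from the two groups of respondents rather than from a simulation.

```python
import numpy as np

def item_discrimination(responses, item):
    """Discrimination index as the correlation between one item's scores
    and the total score on the remaining items (corrected item-total r)."""
    rest = np.delete(responses, item, axis=1).sum(axis=1)
    return np.corrcoef(responses[:, item], rest)[0, 1]

def simulate_group(n_persons, rng, difficulties):
    """Hypothetical 0/1 item responses driven by a single latent ability."""
    ability = rng.normal(size=(n_persons, 1))
    prob = 1 / (1 + np.exp(-(ability - difficulties)))
    return (rng.random(prob.shape) < prob).astype(int)

rng = np.random.default_rng(0)
difficulties = np.linspace(-1, 1, 10)
group_a = simulate_group(200, rng, difficulties)
group_b = simulate_group(180, rng, difficulties)

# Compare the discrimination indexes of the first three items across groups
for item in range(3):
    d_a = item_discrimination(group_a, item)
    d_b = item_discrimination(group_b, item)
    print(f"item {item}: group A = {d_a:.2f}, group B = {d_b:.2f}")
```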

4. Factor analysis

A fourth way to investigate construct bias is through the use of factor analysis for items, where two or more groups are distinguished. As we discussed earlier, factor analysis is a statistical technique to divide the variance or covariance between test items into clusters or factors. A factor is a set of items that correlate highly with each other and therefore are interrelated. If all items correlate equally with each other, then there is one factor and we speak of a unidimensional structure. When there are several factors, we speak of a multidimensional structure.      

There are two approaches to factor analysis: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). The latter, CFA, is particularly suitable for examining construct bias. CFA is discussed further in the next chapter (chapter 12).

5. Differential item function analysis

Differential item functioning (DIF) analysis makes it possible to estimate respondents' trait levels directly from the test data; these estimates play the role of true scores. We then compare the trait levels (true scores) with the item responses (observed scores) for all people in the two groups and see whether they match. If they do not, the item suffers from bias.

  • Construct bias: two people (a man and a woman) have the same trait level, but the item characteristic curve (ICC) is not the same. In other words, the probability that the two people give a correct answer is not the same.
  • No construct bias: two people (a man and a woman) have the same trait level and the item characteristic curve is the same. In other words, the probability that the two people give a correct answer is equal.
  • Uniform bias: the groups differ in the location of the curve. The two curves do not overlap or cross each other: at the same trait level, people from one group are less likely to answer the item correctly than people from the other group.
  • Non-uniform bias: the groups differ in both the location and the shape of the curve. The two curves overlap or intersect: at some trait levels the item is easier for men and at other levels the item is easier for women.

With uniform and non-uniform bias, the test measures different characteristics for men and women.

What are the ways to identify predictive bias? 

Predictive bias refers to the extent to which test scores are equally predictive for different groups. Ideally, when a test is used to make predictions about people, the test is equally predictive for all groups of people. If this is not the case, and the test is thus not equally predictive for different groups, there is predictive bias.

Scores of two variables / measurements are obtained. Next, it is examined to what extent the scores of the first test can be used to predict the scores of the second test (which is related to the scores of the first test). An external evaluation of the test is required to discover the predictive bias. Two considerations are: (1) Does the test really help you predict the outcome? (2) Does the test predict the outcome evenly for several groups? We can investigate this on the basis of regression analysis. 

Regression analysis

Regression analysis describes the linear relationship between test scores (the predictor) and outcome scores (the criterion):

Ŷ = a + b(X), where X is the test score

  • a = intercept (the predicted value when X = 0);
  • b = the slope (regression coefficient);
  • Ŷ = predicted value for the individual.

The observed scores never exactly end up on the linear regression line. The regression line is formed from predicted scores, and the observed scores do not always match exactly.

''One size fits all'': the regression equation applies to all groups. Different groups share the same regression line, regardless of gender, ethnicity, culture, or other group differences.

You can investigate whether the test contains bias by estimating a regression equation on the basis of the combined data (for example, of both men and women); this is called the common regression line. We then estimate a regression line for each group separately (i.e., for men and for women) and compare it with the common regression line. If these are not the same, there is predictive bias. If they are the same, there is no bias.

Within this method there are different types of bias: intercept bias, slope bias, and intercept + slope bias.

1. Intercept bias

With intercept bias, the slopes of the two group regressions correspond to the common slope, but the intercepts of the two group regressions do not correspond to the common intercept.

'One size does not fit all', so there is predictive bias. In other words, men and women with the same test score have different predicted outcomes. The difference is consistent, because the gap between men and women remains the same as X rises or falls: the two regression lines (for men and for women) are parallel.

2. Slope bias

With slope bias, the intercepts of the two group regressions are the same as the common intercept, but the slopes of the two group regressions are not the same as the common slope.

'One size does not fit all', so there is predictive bias. In other words, here too men and women with the same test score have different predicted outcomes. There is no consistency in the difference, because the gap between men and women changes as X rises or falls: the regression lines (for men and for women) are not parallel.

3. Intercept and slope bias

With intercept and slope bias, the intercepts of the groups are not equal to the common intercept and the slopes of the groups are not equal to the common slope. This is more common than a bias in only one of the two. Here too, 'one size does not fit all'; the regression lines (for men and for women) differ in both level and steepness and will intersect at some point.
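A rough sketch of the regression comparison on simulated (hypothetical) data; the example deliberately builds in intercept bias, so the group-specific lines share a slope but sit at different heights relative to the common line.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(n, intercept, slope):
    """Hypothetical predictor scores X and outcome scores Y for one group."""
    x = rng.normal(50, 10, n)
    y = intercept + slope * x + rng.normal(0, 5, n)
    return x, y

# Built-in intercept bias: same slope, different intercepts
x_m, y_m = simulate(300, intercept=10.0, slope=0.50)   # "men"
x_w, y_w = simulate(300, intercept=14.0, slope=0.50)   # "women"

# Common regression line (pooled data) vs. group-specific lines
slope_c, icept_c = np.polyfit(np.concatenate([x_m, x_w]),
                              np.concatenate([y_m, y_w]), 1)
slope_m, icept_m = np.polyfit(x_m, y_m, 1)
slope_w, icept_w = np.polyfit(x_w, y_w, 1)

print(f"common: intercept = {icept_c:.1f}, slope = {slope_c:.2f}")
print(f"men:    intercept = {icept_m:.1f}, slope = {slope_m:.2f}")
print(f"women:  intercept = {icept_w:.1f}, slope = {slope_w:.2f}")
# Similar slopes but clearly different intercepts suggest intercept bias.
```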

What other statistical procedures are there to detect test bias?

In addition to the methods we discussed in this chapter, there are a number of other statistical procedures to discover test bias. Structural equation modeling, for example, can be used to discover test bias. More complex regression models can also be used. But these methods (and their complexity) go beyond the scope of this book and are therefore not discussed further.         

What is the difference between test bias and test fairness?

Test bias is not the same as test fairness. Test fairness has to do with an appropriate use of test scores, in the field of social and / or legal rules and the like. Test fairness is not a psychometric aspect of a test. Test bias, on the other hand, is a psychometric concept, embedded in theory about test score validity. Test bias is defined by specific statistical and research methods, which enable the researcher to make decisions about the test bias. Both are important for psychological testing.     


What is a confirmatory factor analysis? - Chapter 12

What is a confirmatory factor analysis? - Chapter 12

In chapters 4 and 8 we discussed the internal structure (i.e., dimensionality) of a psychological test. As we briefly introduced there, the internal structure of a test has to do with the number and nature of the psychological constructs that we measure with the items. One way to identify those constructs is through factor analysis. In this chapter, we will discuss factor analysis, and in particular confirmatory factor analysis (CFA).

What are EFA and CFA used for?

There are two types of factor analysis: exploratory factor analysis (EFA) and confirmatory factor analysis (CFA). These two types of factor analysis are most suitable for different phases of test development and evaluation. EFA is most suitable for the first phases of test use (clarifying the construct and the test). CFA is most suitable in later phases of test use, after the initial evaluations of item properties and dimensionality and after major revisions of the test content (i.e., when the test content is virtually fixed). Confirmatory factor analysis (CFA) is used to investigate the dimensionality of a test when there are already hypotheses about the number of underlying factors (dimensions), the connections between items and factors, and the coherence of the factors.

What is the purpose of CFA?

With CFA we evaluate hypotheses about the internal structure or dimensionality of a measurement model. CFA shows the extent to which the assumed measurement models correspond to the actual data of the respondents. Thereafter, if required, the assumed model can be adjusted to better match the actual data.

How do you perform a CFA?

After a specific measurement model has been evaluated, the model is usually adjusted and then the adjusted model is evaluated again using CFA. A model is often adapted and evaluated several times.

Before you perform a CFA, there are three important things you must do. First, you must make clear which psychological construct you are going to measure and develop a number of test items. Secondly, you must find enough people to take the test. Finally, all items must have the same direction, so you must score negatively coded items in reverse. Performing a CFA consists of four steps:

  1. specification of the measurement model;
  2. calculations;
  3. interpret and report the results;
  4. model changes and new analysis (if necessary). 

These four steps are discussed below.

Step 1: specification of the measurement model

Enter the data in a statistical software program. You make a figure of the measurement model and the program then converts it into formulas. First the number of dimensions (also called factors or latent variables) must be determined. It must then be determined which items are linked to which factors. At least one item is associated with each factor. And each item is usually connected to only one latent variable. If a model is multidimensional, then it must also be determined which factors may be associated with other factors. We only need to determine whether or not there are connections, the software will then estimate the precise values ​​of these connections.

Step 2: calculations

After we have entered all the details of the measurement model, we have the program run a CFA. Although these calculations are performed 'behind the scenes', it is still useful to know the statistical process. The basic calculations have four phases:

  1. The data are used to calculate the actual item variances and the covariances between items.
  2. The actual variances and covariances of the items are used to estimate the parameters. There are several important parameters. One is the factor loading(s) of each item: the extent to which an item is associated with a factor. A second parameter is the connections between the different factors. CFA also calculates the significance of each parameter.
  3. The estimated parameter values are used to calculate implied item variances and covariances. So the program calculates item variances and covariances as they are implied by the estimated parameters. If the assumed model is correct, the implied variances and covariances correspond to the actual variances and covariances from the first step.
  4. The software program provides information regarding the general suitability or "fit" of the assumed model. It compares the implied variances/covariances with the actual variances/covariances and it calculates measures of 'model fit' and 'modification indices' (adjustment indexes). These modification indices point to specific ways in which the measurement model could be improved.

Step 3: interpret and report the results

After entering the data and calculating parameters and the 'fit' of the model, the results are interpreted.

First we look at the fit of the model. A good fit indicates that the hypothesized model matches the actual responses to the test, which supports the validity of the model. A poor fit indicates that the assumed number of dimensions does not correspond to the actual responses to the test. The chi-square is a measure of the 'poorness of fit' of the model: large, significant chi-square values indicate a poor fit, and small, non-significant chi-square values indicate a good fit. Sample size influences the chi-square; a large sample produces large chi-square values, which in turn are more likely to be statistically significant. In addition to the chi-square, a CFA provides a number of other fit indexes. These indexes do not involve statistical significance tests, and they each have different scales and norms.

If the fit indexes indicate that the model is not suitable, then the adjustment indexes are viewed and it is looked into how the model could be improved. If the fit indexes indicate that the model is suitable, then the parameter estimates are viewed.

If the hypothesis is that an item is associated with a certain factor, then we expect to find a large, positive, and statistically significant factor loading. If we find that, the item is a good reflection of the underlying psychological dimension and we keep it in the test. If the factor loading is small and/or not significant, the item is not related to the psychological dimension and the item is removed from the test. Then the model is adjusted and all calculations are done again.

Step 4: model change and new analysis (if required) 

If the model is not suitable, then we switch to viewing the adjustment indexes and adjusting the assumed measurement model. An adjustment index indicates the potential influence of adjusting a specific parameter. After modifying the model, it is analyzed again, so all calculations are done again.

How can CFA be used to evaluate reliability?

CFA is also sometimes used as a method to estimate reliability. First we use CFA to evaluate the basic measurement model of the test. Then, if necessary, we adjust the measurement model and re-analyze it. Finally, we use the non-standardized parameter estimates to estimate the reliability of the test:

Reliability = true variance / (true variance + error variance)

So, estimated reliability = (Σλi)² / ((Σλi)² + Σθii + 2Σθij)

λi = factor loading of an item.

θii = error variance of an item.

θij = covariance between the errors of two items.

(Σλi)² = the variance of the true scores.

Σθii + 2Σθij = the random error variance.
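A minimal sketch of this reliability estimate (often called omega), using hypothetical loadings and error variances; with no correlated errors, the covariance term is simply zero.

```python
# Hypothetical unstandardized factor loadings and error (co)variances
# for a five-item test
loadings = [0.80, 0.70, 0.60, 0.75, 0.65]
error_variances = [0.36, 0.51, 0.64, 0.44, 0.58]
error_covariances = []  # e.g., [0.05] if two items had correlated errors

true_variance = sum(loadings) ** 2
error_variance = sum(error_variances) + 2 * sum(error_covariances)
reliability = true_variance / (true_variance + error_variance)

print(f"estimated reliability = {reliability:.2f}")  # about .83 here
```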

How can CFA be used to evaluate validity?

CFA can also evaluate validity in various ways. Firstly, CFA provides insight into the 'internal structure' aspect of validity. Second, if responses to a test are measured together with measurements of related constructs or criteria, then we can evaluate the relationship between the test and those variables. This provides important information about the psychological significance of the test scores. There are two ways we can use CFA to view these validity components. We can use CFA to evaluate convergent and discriminant validity by applying CFA to multitrait-multimethod matrices. In addition, we can evaluate convergent validity by examining a test and one or more criterion variables using CFA.

How can CFA be used to assess 'measurement invariance'?

CFA has also recently been used to evaluate group differences in the psychometric properties of tests. CFA is particularly useful in the conceptualization and detection of construct bias ("measurement invariance"). Construct bias implies that the test has a different internal structure, and therefore a different meaning, for different groups (see Chapter 11 for a more detailed description). Conversely, if the internal structure of a test does not differ between groups, then this is evidence against construct bias, and we speak of invariance of the internal structure.

Measurement invariance can be mapped with CFA by comparing groups in terms of specific parameters (such as the lambda, the theta, etc.) of measurement models. If groups have different values for a parameter, then this is proof of a lack of invariance for the parameter (and therefore proof of a certain degree of construct bias, because the parameters differ between groups). The extent to which there are differences can be summarized in four different levels of measurement invariance:

  1. Configural;
  2. weak/metric;
  3. strong/scalar;
  4. strict.

In short, the greater the differences between groups, the weaker the evidence for measurement invariance (the first level, configural invariance, is the weakest, least robust level).


What is the generalizability theory? - Chapter 13

What is the generalizability theory? - Chapter 13

The Generalizability Theory (G theory) helps us to distinguish the effects of multiple facets and then to use different measurement strategies. It is an ideal framework for complex measurement strategies in which several facets influence the measurement quality. This is a fundamental difference compared to the classical test theory (CTT), where different facets are not assumed.

What is a facet and what role do facets play in the complexity of a measurement strategy?

According to the G theory, measurement errors can be differentiated in different facets.

G theory can be used to investigate what effect the various aspects of a measurement strategy have on the total quality of the measurement. In this way the various items can also be examined. For example, when investigating which items are related to the onset of aggression each combination of items can be investigated separately. Each part of the measurement strategy is called a facet and different measurement strategies are partly defined by the number of facets. The more facets a measurement strategy has, the more complex the strategy is. An example of three facets is: items, observers and situations.

Which two key components play a role in G theory?

The concept of generalizability, as the name suggests, is very important within the G theory. The measurement quality is usually evaluated in terms of the ability to draw conclusions from a limited number of observations to an unlimited number of observations. When a psychological or behavioral variable is observed, only a limited number of observations can be made. The aim of the G theory is to obtain scores that are representative of the scores that would have been obtained if all possible items that could measure the construct were used.

The concept of consistency is also very important within the G theory. It is important to see whether the degree of variability of an individual's test scores is consistent with the variability of universal scores. In the G theory, estimates of generalizability are based on variance components which represent the extent to which differences exist within the 'universe' for each element of the design. A variance component is the variance of universal scores within the population of individuals. The magnitude of the variance component of a facet indicates the extent to which the facet influences observed scores.

Which two phases are distinguished in the G theory?

The G theory can be used for multiple types of analysis, but a basic psychometric analysis consists of a two-phase process: the G study and the D study. The variance components are estimated in the first phase. In such a study, factors are identified that influence the observed variance (and therefore the generalizability). This phase is called a G study, because it is used to identify to what extent the different facets could influence generalizability. In the second phase, the results of phase one are used to estimate the generalizability of the different combinations of facets. This phase is known as a D study, because the phase is used to make decisions about future measurement strategies.

The various steps associated with the two studies are discussed in more detail below.

1. G study

In this phase, variance analysis (ANOVA) is used to generate estimates of variance components for each factor. The purpose of ANOVA is to investigate the variability of a score distribution and to see the extent to which this variability is associated with other factors.

In a design with one facet there are three factors that can influence variability.

  1. The extent to which the targets differ.
  2. The extent to which the items differ.
  3. Measurement errors.

For these three factors there are different formulas to calculate the variance components (see table 13.3 in the book).

In a design with one facet, the ANOVA gives two main effects and a residue (error).

The result that we are most interested in is the target effect. This reflects the extent to which targets have different averages. The target effect is the 'signal' that a researcher is trying to detect. In a design with one facet, the residual effect is the 'noise' that can mask the signal of the target effect. If the measurement is good, participants who score high on one item will also score high on the other items. When the items are inconsistent, there are no clear differences between individuals and the items may not be good reflections of the construct.
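
As a rough illustration of this phase, the sketch below estimates variance components for a hypothetical one-facet (targets × items) design from ANOVA mean squares. It is a minimal sketch, not the book's own procedure: the score matrix is invented, and the component formulas used here are the standard random-effects expectations for a fully crossed design (compare table 13.3 in the book).

    import numpy as np

    # Hypothetical scores: rows = targets (persons), columns = items
    scores = np.array([
        [4., 5., 3., 4.],
        [2., 3., 2., 2.],
        [5., 5., 4., 5.],
        [3., 4., 3., 3.],
    ])
    n_targets, n_items = scores.shape

    grand = scores.mean()
    ss_targets = n_items * np.sum((scores.mean(axis=1) - grand) ** 2)
    ss_items = n_targets * np.sum((scores.mean(axis=0) - grand) ** 2)
    ss_residual = np.sum((scores - grand) ** 2) - ss_targets - ss_items

    ms_targets = ss_targets / (n_targets - 1)
    ms_items = ss_items / (n_items - 1)
    ms_residual = ss_residual / ((n_targets - 1) * (n_items - 1))

    # Variance components: target ('signal'), item, and residual ('noise')
    var_targets = (ms_targets - ms_residual) / n_items
    var_items = (ms_items - ms_residual) / n_targets
    var_residual = ms_residual
    print(var_targets, var_items, var_residual)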

2. D study

During this phase, the psychometric quality of different measurement strategies is estimated, which can help in planning a good measurement strategy for the research in question. In this phase, coefficients of generalizability for different measurement strategies are estimated. These coefficients vary between 0 and 1.0. The following applies:

Generalizability coefficient = Signal / ( Signal + Noise )

For the formulas and a worked-out example for calculating the relative generalizability of the differences between targets, see pages 435-436 of the book.
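
As an illustration of the D-study step, the sketch below applies this signal/noise formula for relative decisions in a one-facet design to hypothetical variance components and shows how the estimated generalizability improves as more items are used. It is a sketch under those assumptions, not the book's worked example.

    # Hypothetical variance components (e.g., from a G study like the one sketched above)
    var_targets = 0.80     # signal: differences between targets
    var_residual = 0.40    # noise: residual / error component

    def generalizability(n_items):
        """Relative generalizability coefficient for a strategy with n_items items."""
        signal = var_targets
        noise = var_residual / n_items   # averaging over more items reduces the noise
        return signal / (signal + noise)

    for n_items in (2, 4, 8, 16):
        print(n_items, round(generalizability(n_items), 3))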

There is an important difference between a design with one facet and a design with multiple facets. This difference lies in the complexity of the components that influence the variability in the data. When a new facet is added, new components are also added. This complexity makes the 'noise' or error element of the generalizability coefficients more complex.

Examples of 'one-facet designs' and 'multiple facet designs' can be seen in the book.

Which other measurement designs are there? 

There are at least four important ways in which one G theory analysis can differ from another:

  1. The number of facets;
  2. random vs. fixed facets;
  3. crossed vs. nested designs;
  4. relative vs. absolute decisions.

1. The number of facets

The more facets there are, the larger and more complex the design and the more effects there are that generate variance components. The basic logic and process of G theory, however, remain the same as in designs with fewer facets.

2. Random vs. fixed facets

If a facet is random, the conditions of that facet (for example, the items) are regarded as randomly sampled from the universe of all possible conditions.

If a facet is fixed, all conditions of interest are included in the analysis; the researcher does not want to generalize beyond the conditions used in the analysis.

The choice between random and fixed facets can have important psychometric consequences: it can influence the estimated psychometric quality of the measurements and the extent to which that quality can be generalized.

3. Crossed vs. nested designs

In a multi-faceted analysis, pairs of facets are either crossed or nested. When a pair is crossed, all possible combinations of the two facets are included in the analysis. If not all possible combinations are included, the design is nested. This distinction is important because it determines which effects can be estimated in a G analysis.

4. Relative vs. absolute decisions

A G theory analysis can be used to make two types of decisions. Relative decisions concern the relative ordering of participants; when tests are used to make relative decisions, they are often referred to as norm-referenced tests. Absolute decisions are based on the absolute level of an individual's score; tests used to make such decisions are called criterion-referenced tests. The distinction between these two types of decisions is important because it influences the way noise or error is defined: it determines how many variance components contribute to error when generalizability coefficients are calculated. In most studies, researchers are more interested in the relative perspective than in the absolute perspective; they want to understand relative differences between participants' scores, that is, why some people score relatively high and others relatively low.
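
The sketch below illustrates this difference for a one-facet (targets × items) design with hypothetical variance components: for relative decisions only the residual counts as noise, whereas for absolute decisions the item component counts as well, so the absolute coefficient can never be larger than the relative one. This is a minimal sketch under those assumptions, not the book's formulas verbatim.

    # Hypothetical variance components for a targets x items design
    var_targets, var_items, var_residual = 0.80, 0.30, 0.40
    n_items = 10

    # Relative (norm-referenced) decisions: only error that affects the ordering counts
    g_relative = var_targets / (var_targets + var_residual / n_items)

    # Absolute (criterion-referenced) decisions: item difficulty differences also count
    g_absolute = var_targets / (var_targets + (var_items + var_residual) / n_items)

    print(round(g_relative, 3), round(g_absolute, 3))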


What is the Item Response Theory (IRT) and which models are there? - Chapter 14

What is the Item Response Theory (IRT) and which models are there? - Chapter 14

What is IRT?

The Item Response Theory (IRT) is an alternative to classical test theory (CTT). IRT is a framework for constructing and analyzing measurements in the behavioral sciences. According to IRT, an individual's response to a particular test item is influenced by characteristics of the individual (the trait level) and properties of the item (such as its difficulty level).

  • For a difficult item/question, someone needs a high trait level to be able to give a correct answer.
  • Conversely, for an easy item/question, a low trait level is enough to give a correct answer.

Example:

Statement 1: I like to chat with my friends. 
Statement 2: I like to speak to a large audience.

A low extraversion level (= trait level) is enough to agree with statement 1.
A high extraversion level (= trait level) is needed to agree with statement 2.

In an IRT analysis, trait levels and item difficulties are expressed on a standardized scale with a mean of 0 and a standard deviation of 1.

So if an item has a difficulty level of 0, then:

  • an individual with an average trait level (i.e., 0) has a 50% chance of a correct answer;
  • an individual with a high trait level (higher than 0) has a greater than 50% chance of a correct answer;
  • an individual with a low trait level (lower than 0) has a smaller than 50% chance of a correct answer.

What is item discrimination?

Item discrimination refers to an item's ability to distinguish between individuals with low and high trait levels. The discrimination value of the item indicates the relevance of the item in relation to the trait level being measured.

  • Positive discrimination (> 0): a relationship between the item and the trait (property) being measured. This means that high trait scores give a greater chance of answering the item correctly and low trait scores give a smaller chance of answering the item correctly.
  • Negative discrimination (< 0): an inconsistency between the item and the trait. This means that high trait scores give a smaller chance of answering the item correctly.
  • Discrimination value = 0: no relationship between the item and the trait (property) measured by the test.

So: the greater the (positive) discrimination value, the more consistent the item is with the trait, and the better the item.

A third component that must be taken into account is guessing. With multiple-choice or true/false questions, people may guess if they do not know the answer. As a result, they sometimes give the correct answer even though they did not actually know it. IRT can include guessing as a component in the analysis.

Which IRT models are there?

According to the IRT perspective, we can identify the components that influence the likelihood that a person will respond to a certain item in a certain way. A measurement model expresses the relationship between the outcome (the response of an individual to a certain item) and the components that influence that outcome (the ability of the person, the properties of the item). There are different measurement models, each expressing this link in its own way. In other words, IRT models show the mathematical link between the observed responses and the components that influence them: the characteristics of the individual and the characteristics of the item. In this section we discuss the most common IRT models.

The one-parameter model (1PL): The Rasch model 

The Rasch model (one-parameter logistic model, 1PL) includes only the trait level of the individual and the difficulty of the item as components that influence the responses.

P(Xis = 1 | θs, βi) = e^(θs − βi) / (1 + e^(θs − βi))

P = the probability of a certain answer to item i by respondent s.

Xis = the response X to item i of respondent s. "Xis = 1" indicates a correct answer to this item.

θs = the trait level of respondent s.

βi = the difficulty of item i.

e = the base of the natural logarithm (Euler's number, approximately 2.718).
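
As a quick illustration, the sketch below evaluates the Rasch formula for a few hypothetical trait levels, reproducing the pattern described earlier: for an item with difficulty 0, an average trait level gives a 50% chance of a correct answer, a higher trait level more than 50%, and a lower trait level less than 50%.

    import math

    def rasch_probability(theta, beta):
        """P(Xis = 1 | theta, beta): chance of a correct answer under the Rasch (1PL) model."""
        return math.exp(theta - beta) / (1 + math.exp(theta - beta))

    print(rasch_probability(theta=0.0, beta=0.0))    # 0.50: average trait level
    print(rasch_probability(theta=1.0, beta=0.0))    # > 0.50: higher trait level
    print(rasch_probability(theta=-1.0, beta=0.0))   # < 0.50: lower trait level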

The two-parameter (2PL) model 

The two-parameter model (2PL) includes three components that influence the responses: the trait level of the individual, the item difficulty, and the item discrimination (the two item parameters give the model its name).

The formula here is:

P(Xis = 1 | θs, βi, αi) = e^(αi(θs − βi)) / (1 + e^(αi(θs − βi)))

αi = the discrimination of item i.
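
A minimal sketch of the 2PL formula with hypothetical parameter values: the discrimination αi stretches or flattens the curve, so a more discriminating item separates trait levels around its difficulty more sharply.

    import math

    def two_pl_probability(theta, beta, alpha):
        """P(Xis = 1 | theta, beta, alpha) under the 2PL model."""
        z = alpha * (theta - beta)
        return math.exp(z) / (1 + math.exp(z))

    # Same trait level and difficulty, different discriminations
    print(two_pl_probability(theta=1.0, beta=0.0, alpha=0.5))   # weakly discriminating item
    print(two_pl_probability(theta=1.0, beta=0.0, alpha=2.0))   # strongly discriminating item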

The three-parameter (3PL) model

The chance of guessing is also included in the three-parameter model (3PL). The 3PL model can be seen as a variation on the 2PL model in which one component has been added: a guessing parameter, which represents the lower bound on the chance of answering the item correctly (the chance that even someone with a very low trait level still answers correctly). According to the 3PL model, the chance of a correct answer is therefore influenced by:

  1. The characteristics of the individual, i.e., the trait level θ;
  2. the item difficulty β;
  3. the item discrimination α;
  4. the guessing parameter.

Graded Response Model

The 1PL and 2PL models are designed for items with binary answer options. The Graded Response Model (GRM) is designed for items with more than two ordered answer options. As with the previous models, this model assumes that a person's response to an item is affected by that person's trait level, the item difficulty, and the item discrimination. But the GRM has several difficulty parameters for a single item.

If there are m answer options or categories, m − 1 distinctions can be made between adjacent answer options. For example, for an item with five answer options (strongly disagree, disagree, neutral, agree, totally agree) there are four such distinctions, such as the difference between 'agree' and 'totally agree'. Each of these distinctions can be represented in the following way:

P(Xis ≥ j | θs, βij, αi) = e^(αi(θs − βij)) / (1 + e^(αi(θs − βij)))

j = the answer option.

βij = the difficulty parameter for answer option j on item i.

The other parameters are the same as in the previous models.

P is the chance that a person with trait level θs will choose answer option j or higher on item i.

There are m - 1 difficulty parameters (βij) for each item.

You can also calculate the chance that someone will choose a specific answer to a certain item:

P(Xis = j | θs, βij, αi) = P(Xis ≥ j | θs, βij, αi) − P(Xis ≥ j + 1 | θs, βij, αi)

j = the answer option in question (e.g., agree).

j + 1 = the next higher answer option (e.g., totally agree).
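
A minimal sketch of these two GRM formulas for a single item with five ordered answer options; the boundary difficulties and the discrimination are hypothetical. The first function gives P(Xis ≥ j), and subtracting adjacent boundary probabilities gives the chance of choosing each specific option (the five category probabilities sum to 1).

    import math

    def boundary_probability(theta, beta_j, alpha):
        """P(Xis >= j): chance of choosing answer option j or higher."""
        z = alpha * (theta - beta_j)
        return math.exp(z) / (1 + math.exp(z))

    def category_probabilities(theta, betas, alpha):
        """P(Xis = j) for each of the m options, given the m - 1 boundary difficulties."""
        cumulative = [1.0] + [boundary_probability(theta, b, alpha) for b in betas] + [0.0]
        return [cumulative[j] - cumulative[j + 1] for j in range(len(betas) + 1)]

    betas = [-2.0, -0.5, 0.5, 2.0]   # four boundaries between the five answer options
    print(category_probabilities(theta=0.3, betas=betas, alpha=1.2))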

Which parameters can you estimate?

  • Proportion of correctly answered items for each respondent: divide the number of correctly answered items by the total number of items answered.

  • Trait level: θs = ln(Ps / (1 − Ps))
    Ps = the proportion of items answered correctly by respondent s.
    ln = the natural logarithm.

  • Proportion of correct responses for each item: divide the number of respondents who answered the item correctly by the total number of respondents who answered it.
  • Item difficulty: βi = ln((1 − Pi) / Pi)
    Pi = the proportion of correct responses to item i.
    ln = the natural logarithm.
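
A minimal sketch of these simple estimates, applied to a hypothetical matrix of right (1) / wrong (0) answers. Note that the logarithm is undefined for respondents or items with a proportion of exactly 0 or 1, so such cases would need special handling.

    import numpy as np

    # Hypothetical responses: rows = respondents, columns = items (1 = correct, 0 = incorrect)
    responses = np.array([
        [1, 1, 0, 1, 0],
        [1, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [1, 1, 1, 0, 1],
    ])

    p_person = responses.mean(axis=1)   # proportion correct per respondent (Ps)
    p_item = responses.mean(axis=0)     # proportion correct per item (Pi)

    theta = np.log(p_person / (1 - p_person))   # trait level estimates
    beta = np.log((1 - p_item) / p_item)        # item difficulty estimates

    print(theta.round(2))
    print(beta.round(2))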

How can you describe the characteristics of the test as a whole?

Item characteristic curve (ICC)

An item characteristic curve gives the chance of a correct answer to an item for a person with a certain trait level.

  • x-axis: trait level (with 0.00 = average)
  • y-axis: chance of correct answer (between 0.00 and 1.00)
  • from left to right: easiest item (furthest left) → hardest item (furthest right)

An example of the item characteristic curves of four items from a test is discussed below (the figure itself can be found in the book).

In this example, the item discrimination parameter is largest for item 1. Suppose a person has a trait level of θ = 6; then the chance of success (i.e., a correct answer) is high for item 1 but low for items 3 and 4 (even close to 0). Suppose a person has a trait level of θ = 5; then the most likely score pattern (in the order item 1, item 2, item 3, item 4, where 1 = correct and 0 = incorrect) is 1, 1, 0, 0.
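
Because the figure itself is not reproduced here, the sketch below generates comparable numbers for three hypothetical 2PL items: for each trait level it prints the chance of a correct answer per item, which is exactly what an item characteristic curve plots.

    import math

    def two_pl(theta, beta, alpha):
        z = alpha * (theta - beta)
        return math.exp(z) / (1 + math.exp(z))

    # Hypothetical items: (difficulty beta, discrimination alpha)
    items = {"item 1": (-1.0, 2.0), "item 2": (0.0, 1.0), "item 3": (1.5, 1.0)}

    for theta in (-2, -1, 0, 1, 2):                      # x-axis: trait level
        row = {name: round(two_pl(theta, b, a), 2) for name, (b, a) in items.items()}
        print(theta, row)                                # y-axis: chance of a correct answer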

Item information and test information

Perspective of the CTT: there is a single reliability for a test.

Perspective of the IRT: there is more than one reliability. The psychometric quality of a test is better for some people than for others, so a test may provide better information at some trait levels than at other trait levels.

For example, suppose there are two difficult questions and four respondents: two of them have a low trait level and the other two have a high trait level. The test then provides more information about the two people with high trait levels. The people with low trait levels both answer the difficult questions incorrectly, so even if their (low) trait levels differ, this difference will not show up on the test. Of the two people with high trait levels, one may answer one item correctly and the other both items correctly. The test therefore provides more information about people with high trait levels, because small differences in trait level are detected in this group.
Item information can be calculated using the following formula:

I(θ) = Pi(θ) × (1 − Pi(θ))

I(θ) is the item information at a certain trait level (θ).

Pi(θ) is the chance that a respondent with that trait level will answer the item correctly.

Higher item information values indicate a better psychometric quality of the item.

If we calculate information values for different trait levels, we can display these in an item information curve. Higher curves indicate better quality. The peak of a curve represents the trait level at which the item provides the most information.

Item information values at a specific trait level can be added together to determine the test information value at that trait level. If we calculate test information values for multiple trait levels, we can display them in a test information curve. From this you can read how much information the test provides at each trait level.
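
A minimal sketch of these calculations, using the item information formula above and four hypothetical item difficulties: summing the item information values at each trait level gives the test information at that level, i.e. the points of a test information curve.

    import math

    def p_correct(theta, beta):
        return math.exp(theta - beta) / (1 + math.exp(theta - beta))

    def item_information(theta, beta):
        p = p_correct(theta, beta)
        return p * (1 - p)              # I(theta) = P(theta) * (1 - P(theta))

    betas = [-1.5, 0.0, 0.0, 1.5]       # hypothetical difficulties of the test's items
    for theta in (-2, -1, 0, 1, 2):
        test_info = sum(item_information(theta, b) for b in betas)
        print(theta, round(test_info, 3))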

For which purposes can IRT be applied? 

IRT is a theoretical perspective that is used for different purposes in psychological measurements. A number of applications of IRT are:

  • Evaluation and improvement of psychometric properties of items and tests.
  • Evaluating the presence of differential item functioning (DIF). DIF occurs when the properties of an item are different in one group than in another group; for example, a man and a woman with the same trait level have different chances of answering the item correctly.
  • Analyzing Person Fit. This is an attempt to identify people whose response pattern does not match the pattern of responses expected on a set of items.
  • Computerized Adaptive Testing (CAT). CAT is a method used to determine someone's trait level accurately and efficiently with a computer-administered test that adapts the questions to the person's trait level: after a correct answer the next question is more difficult, and after an incorrect answer the next question is easier. In this way someone's trait level can be determined more quickly (a minimal sketch of this adaptive logic is given after this list).
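
A minimal sketch of the adaptive logic described in the last point above; it is not a full CAT system (real CAT programs select items from a calibrated item bank and estimate the trait level with an IRT model), and the starting difficulty and step size are hypothetical.

    def run_cat(answers_correct, start_difficulty=0.0, step=0.5):
        """Step the item difficulty up after a correct answer and down after an incorrect one.

        answers_correct: list of True/False outcomes in the order they occur.
        Returns the difficulties administered and a rough final trait estimate.
        """
        difficulty = start_difficulty
        administered = []
        for correct in answers_correct:
            administered.append(difficulty)
            difficulty += step if correct else -step
        return administered, difficulty

    items_given, estimate = run_cat([True, True, False, True])
    print(items_given, estimate)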
