Summary: Neuropsychological Assessment
This summary of Psychological assessment and theory: creating and using psychological tests by Kaplan & Saccuzzo was written in 2016.
Book Structure
R. M. Kaplan & D. P. Saccuzzo. Psychological testing: principles, applications, and issues. 2013, 8th edition.
This book is structured in a way that enables the reader to grasp both the simplest and the most complex issues of testing. It is roughly divided into three sections: principles, issues and applications.
Basic Concepts
We use tests in order to measure certain behaviour and give it a quantitative value; we also gain a better understanding of the behaviour, which further gives us an opportunity to predict behaviour. The measures or test scores we obtain are never perfect, though they greatly help the prediction process. Tests consist of items, which represent stimuli - questions/problems that need to be worked on in the test.
When we want to measure some features of human behaviour we use psychological testing. Make a distinction between different types of behaviours such as overt behaviour which is observable and covert behaviour which is intrinsic and not that obvious (e.g. thoughts).
Be careful when interpreting test scores: their meaning can change depending on how the scoring results are defined. To avoid interpretation problems we use scales, which relate raw scores to distributions that are more specific.
Since tests measure a variety of behaviours, there are many test variations in use. Individual test – only one person at a time receives the test. Group test – more people at a time receive the same test (high school class exams).
Ability tests – speed, accuracy, or both are measured. The three types of ability measured are achievement, which refers to previous learning; aptitude, which concerns the potential to master a skill; and intelligence, which refers to someone’s general capacity to solve problems, adapt to changing circumstances, think abstractly, and benefit from experience. These three constructs often interact with each other.
Personality tests – overt and covert behaviours are measured, more specifically a person’s typical behaviour. A general distinction is made between structured (objective) and unstructured (projective) personality tests. Structured tests ask you, for example, to tick a box or answer true or false, while the Rorschach is a projective test in which an individual is asked to interpret a rather ambiguous stimulus.
The main use of psychological testing is to compare individuals and, where possible, to draw conclusions about the differences between them.
Historical Perspective
The tests we encounter nowadays were mostly developed during the past 100 years, even though the origins of testing can be traced back more than 4 millennia to China, where oral examinations were used to evaluate work and decide on promotions.
Test batteries represent the use of two or more tests at once and they were common during the Han Dynasty. The Western world most likely got familiar with testing via the Chinese.
Charles Darwin’s contribution to testing culture was an indirect one. Sir Francis Galton, Darwin’s relative, used the evolutionary theory proposed by Darwin to study humans. If the fittest ones survive and we all differ from one another, then some people must have certain characteristics that make them fitter than the rest, Galton argued.
His most valuable contribution was demonstrating the existence of individual differences in sensory and motor functioning, which are a cornerstone of modern scientific psychology. Cattell took this work further and introduced mental tests.
Another line of thought laid the groundwork for experimental psychology: the work of Herbart, Fechner, Weber and Wundt was theoretically more relevant and led to an understanding of the great importance of control and standardization in testing.
Modern tests, however, originate from the need to test those who were emotionally and mentally impaired. Providing those individuals with adequate education required further test development. Alfred Binet is the name associated with the emergence of the first general intelligence test. The Binet-Simon scale consisted of 30 items, and the results were compared with a standardization sample. Binet was aware of the significance of standardization, though the sample used for comparison was not necessarily an appropriate one. Take, for example, 100 Asian girls from poor families as a standardization sample and compare them with the test results of an African American adult man from a rich family: the comparison is of no use in this case.
This led to the emergence of the representative sample that is needed to compare the person being tested to people similar to him/her in order to get useful test outcomes.
The Binet-Simon scale was revised several times and the standardization sample grew over time. More importantly, the term mental age was introduced, drawing attention to the importance of measuring a child’s performance relative to his or her own age group. The term conveys the difference between a child’s chronological age (say 8) and mental age (say 6, meaning that this child is 8 but performs like an average 6-year-old). The scale was heavily criticized for its focus on verbal and language skills.
World War I contributed to the growth of testing demands due to the emerging need to evaluate military recruits. Since the Binet scale was an individual test, a need for mass testing arose during this time, leading to the development of two structured group tests: Army Alpha, which required literacy, and Army Beta, which did not.
Another development that followed was the emergence of achievement testing, which consisted of multiple-choice questions and provided a large standardized sample as a norm against which results could be compared. Such tests are easy to administer and less prone to subjective bias.
Furthermore, the Wechsler-Bellevue Intelligence Scale (W-B) brought an innovation to intelligence testing: it made it possible to test multiple abilities, and combinations of them, in one individual. Because it included a non-verbal scale, verbal ability was no longer needed in order to assess performance.
Personality testing is associated with measuring traits. Traits represent (partly) stable dispositions that can be used to differentiate between people. Optimists tend to remain optimistic even during harsh times. The Woodworth Personal Data Sheet, developed during World War I, was the first structured personality test. The test included items such as: “Do you wet the bed?” – “yes” or “no”, and the responses were taken at face value, meaning that dishonesty and personal interpretation of the question were disregarded. Personality tests were harshly criticized and almost disappeared from use by the late 1940s.
Projective tests emerged at around the same time; in addition to presenting an ambiguous stimulus, they leave the expected response only vaguely defined. An example is the Rorschach inkblot test, which presents the subject with an ambiguous inkblot and asks for an interpretation of it. A similar approach was taken in the Thematic Apperception Test (TAT), in which the individual is asked to make up a story based on a presented photograph. In this way the TAT is supposed to assess human needs and motivations.
Projective tests became popular during the time structured personality tests were out of favour. Over time, however, projective tests have failed to demonstrate solid psychometric properties. The need for empirical methods of test construction was growing, and structured personality tests such as the Minnesota Multiphasic Personality Inventory (MMPI) emerged. Its authors argued that the meaning of test responses had to be established using empirical methods. The MMPI is the most widely used test at present.
The Sixteen Personality Factor Questionnaire, introduced by R. B. Cattell, uses factor analysis to find the minimum number of characteristics (dimensions, or factors) needed to represent a large number of variables. This addressed the main issue with earlier personality tests such as the Woodworth: too many assumptions to be investigated. It is still widely used.
With the process of test development many applied areas of psychology developed. Tests remain a controversial issue, nevertheless all psychological areas depend on them greatly.
The need for statistics
Science needs information on how likely it is that certain events happen due to chance alone, and this is why we use statistical methods. More specifically, we use statistics for two purposes: description, because numbers can serve as summaries of observations, and inference, that is, logical deductions about events that cannot be observed directly. As an example of inference, imagine you want to know how many people listen to a certain radio station. You cannot ask everyone, so you take a sample, and by examining the sample you make inferences about the population.
Scales of measurement
We need to define measurements in order to make sense of the results. For this we use different scales. Scales have the following important measurement properties:
Magnitude (“moreness”) means that one instance of an attribute can be described as more than, less than, or equal in amount to another instance. For example, weigh Anna and Hannah: since Anna weighs more, one can say that the scale of weight has the property of magnitude.
Equal intervals means that the difference between two points anywhere on the scale has the same meaning as the difference between any other two points that are the same number of units apart. An IQ scale lacks this property: the difference between scores of 35 and 40 does not mean the same as the difference between scores of 130 and 135, even though the difference is exactly 5 in both cases.
An absolute 0 is the point at which nothing of the property being measured exists. It is hard to obtain in psychology, since defining an absolute 0 point of, for example, friendliness is hard and somewhat meaningless.
These properties are used to determine different types of scales:
Nominal scales have one purpose, and that is to name objects. They are used when information is qualitative. An example is coding a person’s gender as 1 = male, 2 = female.
Ordinal scales allow us to rank individuals but cannot describe the size of the differences between those ranks. Such a scale has magnitude but lacks equal intervals and an absolute zero.
Interval scales have magnitude and equal intervals, but no absolute zero (for example, temperature in degrees Fahrenheit).
A ratio scale has all three properties (for example, speed of travel, where 0 km/h means no movement). Ratios are meaningful on a ratio scale: we can say, for example, that 120 km/h is twice as fast as 60 km/h.
Frequency distributions
In order to get an overview of the score of a group or an individual we use the distribution of scores. Frequency distributions provide information about how frequently each value was acquired. Usually one can find on the X-axis the scores while the Y-axis explains the frequency of the scores. When the distribution is bell-shaped we have the highest frequency towards the centre of the distribution.
A distribution is skewed when it is asymmetrical: the tail extends either to the right along the X-axis, making it a positive skew, or to the left, making it a negative skew. An example of a highly skewed variable is income, because very few people are extremely rich and a large part of the population has a low income.
The class interval is the unit on the X-axis that explains a particular score interval.
Percentile ranks
By calculating the percentile rank, one answers the question how many scores fall below a certain value “Xi”. In order to calculate this we use the following formula:
Pr = B/N * 100 = percentile rank of Xi
Pr stands for percentile rank, Xi for the score of interest, B for the number of scores below Xi and N for the total number of scores. Since B is always less than or equal to N, the fraction B/N lies between 0 and 1; multiplying it by 100 turns it into a percentage. It is useful to know that the percentile rank fully depends on the comparison group.
Percentiles and percentile ranks are closely related. A percentile is the point (score) in a distribution below which a given percentage of cases fall, whereas the percentile rank (Pr) is the percentage of cases that fall below a given score.
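As a minimal sketch (not taken from the book), the percentile rank formula can be written in a few lines of Python. The function name and the exam scores are made up for illustration.

```python
def percentile_rank(scores, xi):
    """Percentage of scores that fall below xi (Pr = B / N * 100)."""
    if not scores:
        raise ValueError("scores must not be empty")
    below = sum(1 for s in scores if s < xi)  # B: number of scores below Xi
    return below * 100 / len(scores)          # N: total number of scores


# Hypothetical exam scores for a comparison group of ten test takers.
exam_scores = [52, 55, 61, 64, 70, 73, 78, 81, 88, 94]
print(percentile_rank(exam_scores, 78))  # six of the ten scores lie below 78
```

The same score of 78 would receive a different percentile rank in a different comparison group, which is the dependence mentioned above.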
Describing distributions
The mean (X bar) represents the arithmetic average score in a distribution of scores and is one of the ways to summarize our data. To calculate the mean we divide the total score (sum of X’s) by the number of cases (N).
The standard deviation (S) represents the average deviation around the distribution’s mean. If the mean is for example 4, and the standard deviation 2, this means that the values between 2 and 6 fall within one standard deviation from the mean.
Variance (S²) is the average squared deviation around the mean. The deviations are squared because the sum of the raw deviations around the mean is always zero, so their plain average would tell us nothing. To get from the variance back to the standard deviation we take its square root. In short, the standard deviation is the square root of the average squared deviation around the mean.
The Z score transforms the data into standardized units, which are easier to interpret. It is calculated by dividing the difference between the individual score and the mean (Xi ‑ X bar) by the standard deviation (S).
The Z score can be positive or negative, depending on whether the score falls above the average (positive) or below it (negative). If there is no difference between the score and the average, then the Z score equals 0.
Formula to obtain Z score:
Z = (Xi ‑ Xbar) / S
A standard normal distribution has a mean of 0 and a variance of 1.0. This follows from the properties of any variable transformed into Z scores. To find the mean of the Z scores you sum them and divide by N: the numerator is the sum of the deviations around the mean divided by S, sum(Xi ‑ Xbar)/S, and the denominator is a constant (N). Since the sum of the deviations around the mean is always 0, the mean of the Z scores is always 0. Likewise, because every deviation is divided by S, the standard deviation of the Z scores is always 1.0.
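The calculations described in this section can be sketched in Python as follows. This is an illustration under one assumption: the variance divides by N, matching the "average squared deviation" wording used here, whereas many textbooks divide by N - 1 when estimating from a sample. The function names and the raw scores are invented for the example.

```python
import math


def mean(scores):
    """Arithmetic average: sum of X divided by N."""
    return sum(scores) / len(scores)


def variance(scores):
    """Average squared deviation around the mean (divides by N; an assumption)."""
    m = mean(scores)
    return sum((x - m) ** 2 for x in scores) / len(scores)


def standard_deviation(scores):
    """Square root of the variance."""
    return math.sqrt(variance(scores))


def z_scores(scores):
    """Z = (Xi - Xbar) / S for every score."""
    m = mean(scores)
    s = standard_deviation(scores)
    return [(x - m) / s for x in scores]


raw = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical raw scores
z = z_scores(raw)
print(mean(raw), standard_deviation(raw))
print(z)
print(mean(z), variance(z))  # mean 0 and variance 1 after the transformation
```

Whatever the shape of the raw distribution, the transformed scores end up with a mean of 0 and a standard deviation of 1.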
There are many ways to transform raw data in order to make more sense of it. McCall’s T is an example: it is a system in which the mean of the distribution is set at 50 and the standard deviation at 10. There is nothing special about these numbers, since the T score is a simple transformation of Z: T = 10Z + 50. You can create any system that suits you by multiplying the Z score by the standard deviation you want and adding the mean you want. This is how scores are standardized (take the SAT as an example). Standardizing is different from normalizing: if you standardize the scores of a skewed distribution, the distribution remains skewed.
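A short Python sketch of this rescaling, with hypothetical function names, might look like this. It only restates T = 10Z + 50 and the general rule of multiplying by the desired standard deviation and adding the desired mean.

```python
def rescale(z, new_mean, new_sd):
    """Turn a Z score into a score on a scale with the chosen mean and SD."""
    return new_sd * z + new_mean


def mccall_t(z):
    """McCall's T: mean 50, standard deviation 10, so T = 10Z + 50."""
    return rescale(z, new_mean=50, new_sd=10)


print(mccall_t(0))          # a score exactly at the mean
print(mccall_t(1.5))        # one and a half standard deviations above the mean
print(rescale(2, 100, 15))  # the same idea on a scale with mean 100 and SD 15
```

Because the transformation is linear, it changes the units but not the shape of the distribution.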
Quartiles are points that divide the distribution into equal fourths. The first quartile (Q1) is the 25th percentile, the second (Q2) is the median, and the third (Q3) is the 75th percentile. The interquartile range is the interval between Q1 and Q3, that is, the middle 50% of the distribution.
Deciles are points that divide the distribution into equal tenths. The nine points D1 to D9 mark off ten segments, each holding 10% of the whole distribution. The stanine system, developed by the U.S. Air Force, converts a set of scores into a scale ranging from 1 to 9.
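Quartiles and deciles can be computed with Python's standard `statistics` module, as in the sketch below. The scores are invented, and the choice of `method="inclusive"` is an assumption: several conventions exist for interpolating between data points, and they give slightly different cut points on small data sets.

```python
import statistics

scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical scores

# Three cut points split the distribution into four equal parts.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="inclusive")
interquartile_range = q3 - q1  # spread of the middle 50% of the scores

# Nine cut points split the distribution into ten equal parts.
deciles = statistics.quantiles(scores, n=10, method="inclusive")

print(q1, q2, q3, interquartile_range)
print(deciles)
```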
Norms
Norms in testing represent the performances on a specific test by defined groups. They provide information about a performance by comparing it with what has been observed in the standardization sample. Take IQ scores and SAT scores as an example: you obtain a score and then compare it with the standardized or normative score. If you score 130 on an IQ test that is known to have a mean of 100 and a standard deviation of 15, your score lies two standard deviations above the mean and indicates above-average intelligence.
Age related norms are found with tests that have several normative groups – intelligence tests, for example, have to take into account whether the test taker is a child or an adult. In order to assess the growth of children, the paediatricians commonly use the age-related norms. What is important to bear in mind is that children of the same age tend to go through different patterns of development. However, children tend to stay at about the same levels as their peers and this is called tracking. Tracking is often controversial, especially in education, where it happens that children are distributed over different classes by the specific performance they show at that moment.
A norm-referenced test works by comparing an individual to a norm. This has been the subject of some criticism, as many young children are exposed to competition in areas where they perform below the average standard.
Criterion-referenced tests are used to assess a specific skill or ability that the test takers can demonstrate (e.g. math skills). The results are not compared with those of any group or individual; they serve a rather diagnostic purpose. They are used to identify issues that can be worked on further.
History and concept of reliability
Psychology as a science faces a difficult measurement task: complex features such as intelligence are not simple to assess. Fortunately, the theory of measurement error is well developed within psychological research. Reliability (the consistency of data across repeated examinations) has a special place in psychological examination, since it provides evidence for the scientific character of psychology. Charles Spearman was a pioneer in the development of reliability assessment. Later on, many reliability coefficients were introduced to the field.
Classical test score theory states that everyone has a true score that we could obtain if no measurement errors were made. The observed score is therefore the true score plus error: X = T + E, where X is the observed score, T the true score and E the error. The error of measurement is then the difference between the observed score and the true score: X - T = E. Classical test theory assumes that the errors of measurement are random.
Sampling theory indicates that the distribution of those random errors is bell-shaped, so the centre of the distribution should correspond to the true score and what lies around the centre is the distribution of sampling errors. This distribution of errors tells us how much error there is in our measurement. Classical theory assumes that the true score of an individual does not change with repeated administrations of the same test. However, because of random error, repeated administrations can yield different observed scores.
Due to the assumption that the error distribution will be the same for everyone, the classical test theory uses the error standard deviation as its essential measure of error - standard error of measurement. Moreover, this measurement tells us, on average, how much the observed score differs from the true score.
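The idea behind X = T + E can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true score of 100, the error standard deviation of 5, normally distributed errors and the fixed random seed are all chosen for the example and do not come from the book.

```python
import random
import statistics


def simulate_observed_scores(true_score, error_sd, n_administrations, seed=0):
    """Observed score X = T + E, with E drawn at random around zero."""
    rng = random.Random(seed)  # fixed seed so the example is repeatable
    return [true_score + rng.gauss(0, error_sd) for _ in range(n_administrations)]


observed = simulate_observed_scores(true_score=100, error_sd=5, n_administrations=10_000)

# The centre of the observed scores approximates the true score ...
print(round(statistics.mean(observed), 2))
# ... and their spread approximates the standard error of measurement.
print(round(statistics.pstdev(observed), 2))
```

In practice a test cannot be given thousands of times to the same person, which is why the standard error of measurement is estimated from reliability instead.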
The Domain Sampling Model
The domain sampling model is related to classical theory and addresses the problems that arise when a limited number of items is used to assess a more complex construct. The model defines reliability as the ratio of the observed score variance on the test (a short one, since there is no time to assess every feature that could reflect, for example, intelligence) to the long-run true score variance. Reliability is not easy to determine, but it can be estimated from the correlation between the observed test score and the true score. This would be a good option if we knew the true score; however, true scores can rarely be assessed. Finding the true score for someone’s ability to spell in German would require that person to spell every existing word. The alternative is to estimate the true score, and the distribution of these estimates should be normal and random. To estimate reliability we can repeatedly create random parallel tests by drawing new random item samples from the same domain. We then correlate the scores on one test with the scores on all the other random parallel tests, average these correlations, and take the square root of the average. Because of the squaring, the estimate of the reliability is always positive.
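The last step of this procedure, averaging the correlations and taking the square root, can be sketched as below. The function name and the three correlations are hypothetical, and the sketch assumes the correlations with the parallel tests have already been computed.

```python
import math


def estimate_from_parallel_tests(correlations):
    """Square root of the average correlation between one test and its
    randomly parallel tests, as described in the text."""
    if not correlations:
        raise ValueError("at least one correlation is needed")
    average = sum(correlations) / len(correlations)
    if average < 0:
        raise ValueError("average correlation is negative; no real square root")
    return math.sqrt(average)


# Hypothetical correlations between test 1 and three other parallel tests.
print(estimate_from_parallel_tests([0.64, 0.60, 0.68]))  # roughly 0.8
```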
Item Response Theory
Item response theory (IRT) is a psychometric approach that is gaining importance as the assessment of test reliability moves away from classical theory. IRT relies on a computer to focus on levels of item difficulty, which helps to gather knowledge about a person’s ability. The computer adapts to the person’s responses, for example by switching to harder items if the person answers several items in a row correctly, and vice versa. With IRT, higher reliability can be achieved with a shorter test that contains fewer items.
Reliability Models
To explain reliability we usually rely on correlation coefficients, though it is possible to use a mathematical ratio instead: the ratio of true score variance to observed score variance. The observed score does not have to resemble the true score, and this can be due to many external influences such as noise or temperature. Reliability is most commonly estimated in one of the following ways: test–retest, parallel forms, or internal consistency.
Test-retest reliability estimates are used to assess the error associated with administering a test at two different points in time. This is of course valuable only if we are measuring something that is not supposed to change over time; tests that measure something that changes are not suited to test-retest estimation. This type of reliability is fairly easy to assess: we administer the same test at two well-planned points in time and then find the correlation between the two sets of scores. A drawback is the possibility of a carryover effect, which occurs when the first testing session influences the second. The practice effect is a well-known type of carryover effect, in which certain skills improve because of practice. Because of these issues, the time interval between the two administrations must be chosen carefully.
To make a test reliable, one needs to be sure that the test scores do not merely reflect the particular subset of items chosen from the domain being studied. Parallel forms reliability compares two equivalent forms of a test that measure the same feature. The items themselves are different but are selected according to the same rules and have the same difficulty. Another term for parallel forms reliability is equivalent forms reliability. If the two forms are administered at different times, the estimate also includes the error related to the time difference.
The split-half method for assessing reliability divides the administered test into halves that are scored independently. The results of the halves are then compared with each other. When the test is long, it is preferable to divide it in half at random, while to keep things simple a split into first and second half is also possible. If the test items become increasingly difficult, then the odd-even system of splitting is most commonly used.
However, each half is less reliable than the whole test, and this is where we can use the Spearman-Brown formula to estimate what the correlation would be for the whole test. The formula increases the reliability estimate, but it is not always appropriate, for example when the halves do not have the same variances. In that case the general reliability coefficient α can be used; α provides the lowest possible estimate of the reliability. It is important to know that alpha can confirm that a test has the needed reliability but cannot tell when a test is unreliable. When the variances of both halves are equal, alpha and the Spearman-Brown coefficient give the same result. The formula to use when assessing the reliability of a test whose items are dichotomous (0 or 1) is the Kuder-Richardson formula, KR20.
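The summary does not spell out the Spearman-Brown formula. In its standard form for a test split into two halves it is corrected r = 2r / (1 + r), where r is the correlation between the halves. A minimal Python sketch, with a hypothetical function name and an invented half-test correlation:

```python
def spearman_brown_split_half(r_half):
    """Estimate whole-test reliability from the correlation between two halves,
    using the standard two-half form: 2r / (1 + r)."""
    if not -1 < r_half <= 1:
        raise ValueError("r_half must be greater than -1 and at most 1")
    return 2 * r_half / (1 + r_half)


# Hypothetical correlation of 0.60 between the odd and even halves of a test.
print(spearman_brown_split_half(0.60))  # about 0.75, higher than the half-test value
```

The corrected value is always at least as large as a positive half-test correlation, which is the increase mentioned above.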
Coefficient alpha is used when there are no right or wrong answers, as in personality tests. Imagine a scale from strongly disagree to strongly agree: no answer is incorrect, but each indicates your position between disagreement and agreement. Alpha is a very general reliability estimate. The formula is: r = α = (N / (N − 1)) * ((S² − sum(Si²)) / S²), where N is the number of items, S² is the variance of the total test scores and Si² is the variance of item i.
Alpha is more general because it can be applied even when no right-wrong scoring is present (in contrast to the Kuder-Richardson formula). Alpha estimates reliability through internal consistency: if the items are not measuring the same feature, then the test lacks internal consistency. If this is the case, factor analysis is the most common way to deal with inconsistent measurements.
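A minimal Python sketch of coefficient alpha follows, reading the formula as alpha = (N / (N - 1)) * (S² - sum of item variances) / S², with N the number of items and S² the variance of the total scores. The function names and the small response table are invented. Because alpha is a ratio of variances, dividing by N or by N - 1 when computing each variance gives the same result, as long as the choice is consistent.

```python
def variance(values):
    """Average squared deviation around the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def coefficient_alpha(responses):
    """responses: one row per respondent, one column per item."""
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [variance([row[i] for row in responses]) for i in range(n_items)]
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (n_items / (n_items - 1)) * (total_variance - sum(item_variances)) / total_variance


# Hypothetical answers of four people to three items rated from 1 to 5.
likert = [
    [1, 2, 1],
    [2, 3, 3],
    [4, 4, 3],
    [5, 5, 5],
]
print(round(coefficient_alpha(likert), 3))  # items rise and fall together, so alpha is high
```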
Sometimes we want to study a type of behaviour or characteristic by taking the difference between two scores and evaluating why it arises. When the two scores measure different attributes, we must compare them as Z scores, since Z is a standardized unit. Difference scores raise a common problem when the scores are used further. When a difference score is created, the error (E) is probably larger than for each of the observed scores separately, because in this case E consists of the errors from both of the scores that create the difference.
Moreover, the true score component (T) is expected to be smaller than E, because whatever the two scores have in common vanishes when the difference is taken. Because of this, the reliability of a difference score is expected to be lower than that of each of the scores it is based on.
Use in Behavioural Observational Studies
It is well known that some psychologists prefer observational studies to tests. Observational studies seem simple and straightforward; however, they have many sources of error, and sampling errors are very common and must be taken into account when evaluating results. When observing behaviour one often finds high unreliability, mostly because of discrepancies between the true scores and those recorded by the observer. Several techniques can be used to estimate this reliability: interrater, interscorer, interobserver, and interjudge reliability all assess how consistent the reports of different judges on the same behaviour are. We can simply record the percentage of times the judges agree; however, this technique has two problems. One is that it does not take into account the level of agreement that would be expected by chance alone, and the other is that percentages cannot properly be averaged.
The Kappa statistic is regarded as the most suitable way of evaluating the level of agreement among observers. Kappa measures agreement between observers who use a nominal scale. It expresses the actual agreement as a proportion of the possible agreement once chance agreement has been taken into account. Kappa varies between -1 (less agreement than expected by chance) and 1 (full agreement).
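The summary gives no formula for Kappa. The sketch below assumes the common two-rater form, Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement). The function name, the category labels and the ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters using nominal categories."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same, non-empty set of cases")
    n = len(rater_a)
    # Proportion of cases on which the two raters gave the same category.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's category frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Two hypothetical observers coding ten intervals as on-task or off-task.
observer_1 = ["on", "on", "on", "on", "on", "off", "off", "off", "off", "off"]
observer_2 = ["on", "on", "on", "on", "off", "off", "off", "off", "off", "on"]
print(round(cohens_kappa(observer_1, observer_2), 2))
```

Here the observers agree on 8 of 10 intervals, but half of that agreement would be expected by chance, so kappa comes out lower than the raw 80% agreement.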
Sources of Error and how to assess them
Sources of error: time sampling error arises when the same test is given at different times, even to the same people. Item sampling error arises because a feature can be assessed with many different possible sets of items. Internal consistency refers to the intercorrelations among the items within the same test.
Important: When assessing reliability, take the possible sources of errors into consideration.
The use of Reliability Information
This section describes the practical side of reliability evaluation. The standard error of measurement indicates how inaccurate a measurement is likely to be. A large standard error indicates less certainty about how accurate a particular measurement is. The standard error can be calculated from the reliability coefficient and the standard deviation: Sm = S * √(1 − r), where Sm is the standard error of measurement, S is the standard deviation and r is the reliability coefficient. Researchers use the standard error to create a confidence interval around a particular observed score. We cannot know whether the observed score is the true one, but by forming a confidence interval around it we can estimate the probability that the true score falls within that interval.
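The formula Sm = S * √(1 − r) and the confidence interval built from it can be sketched in Python. The numbers are invented, and the multiplier 1.96 for a 95% interval is an assumption that errors are normally distributed; the summary itself does not state which multiplier to use.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Sm = S * sqrt(1 - r)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1 - reliability)


def confidence_interval(observed_score, sd, reliability, z=1.96):
    """Interval around an observed score; z = 1.96 assumes a 95% interval
    under normally distributed error."""
    margin = z * standard_error_of_measurement(sd, reliability)
    return observed_score - margin, observed_score + margin


# Hypothetical test with standard deviation 15 and reliability .91.
sem = standard_error_of_measurement(15, 0.91)
low, high = confidence_interval(110, 15, 0.91)
print(round(sem, 2), round(low, 2), round(high, 2))
```

A more reliable test gives a smaller Sm and therefore a narrower interval around the same observed score.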
How high must reliability be before we can call it high? A range of .7 to .8 is good enough for most research purposes, though it depends on the purpose of the test. Others believe that anything under .9 is not worth mentioning. Highly focused tests tend to have high reliability, while tests of complex constructs are usually less reliable.
To increase test reliability, psychometrics suggests two methods: lengthening the test and discarding items with low reliability. Reliability increases as the number of items increases, but adding items can cost the researcher a lot of time and money. The Spearman-Brown formula can help here, since it indicates how many more items are needed to reach a desired reliability. Testing often reveals that some items do not measure the construct in question. If those are left out, reliability increases. To make sure the items measure the same thing, one can use factor analysis or inspect the correlation between each item and the total test score, known as discriminability. A low correlation indicates that the item measures something different from the other items. It can also mean that the item is too easy or too hard, which makes its results difficult to evaluate. Items with a low correlation should be excluded.
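The use of the Spearman-Brown formula for test length can be sketched as follows. The sketch assumes the standard general form, in which a test lengthened by a factor k has reliability k*r / (1 + (k - 1)*r), and solves it for k. The function names and the numbers are hypothetical, and the added items are assumed to be comparable to the existing ones.

```python
import math


def lengthening_factor(current_r, desired_r):
    """How many times longer the test must be: k = rd(1 - ro) / (ro(1 - rd))."""
    if not (0 < current_r < 1 and 0 < desired_r < 1):
        raise ValueError("both reliabilities must lie strictly between 0 and 1")
    return desired_r * (1 - current_r) / (current_r * (1 - desired_r))


def projected_reliability(current_r, k):
    """Reliability of a test made k times as long: k*r / (1 + (k - 1)*r)."""
    return k * current_r / (1 + (k - 1) * current_r)


def items_needed(current_items, current_r, desired_r):
    """Total number of items needed, rounded up to a whole item."""
    return math.ceil(current_items * lengthening_factor(current_r, desired_r))


# Hypothetical 20-item test with reliability .70 that should reach .90.
k = lengthening_factor(0.70, 0.90)
print(round(k, 2), items_needed(20, 0.70, 0.90))
```

In this example the test would have to become almost four times as long, which shows why lengthening can be expensive.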
Measurement error attenuates, that is, diminishes, the potential correlation between two measures. We correct for attenuation by dividing the observed correlation between tests 1 and 2 by the square root of the product of the reliability of test 1 and the reliability of test 2: R12(hat) = r12 / √(r11 * r22). The discrepancy that we obtain indicates that correcting for the attenuation would increase the observed correlation by x (from-to).
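As an illustration, the correction for attenuation can be written as a short Python function. The function name and the three input values are made up for the example.

```python
import math


def correct_for_attenuation(r12, r11, r22):
    """Estimated correlation if both tests were perfectly reliable:
    r12 / sqrt(r11 * r22)."""
    if r11 <= 0 or r22 <= 0:
        raise ValueError("both reliabilities must be positive")
    return r12 / math.sqrt(r11 * r22)


# Hypothetical observed correlation .36 between tests with reliabilities .81 and .64.
print(round(correct_for_attenuation(0.36, 0.81, 0.64), 2))
```

With these invented values the observed correlation of .36 rises to about .50 after correction.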
Definition of Validity
In testing, validity stands for something close to meaning. It is the agreement between a test score (measure) and the quality the test is supposed to measure. In simple words: “Is the test measuring what we want it to measure?” Validity can also be defined as the evidence for inferences made about a test score. This evidence comes in three types: construct-related, criterion-related and content-related. Face validity is officially not a form of validity, but the term is widely used in testing. It stands for the mere appearance that a measure has validity: a brief impression of whether the items seem to be related to the purpose of the test. It has nothing to do with validity proper, because it does not provide any evidence in support of the conclusions drawn from test scores.
The different aspects of Validity
Content-related evidence for the validity of a measure shows how adequately the test represents the domain it is designed to cover. Does what you have on your exam really represent your knowledge of the subject? It is a logical type of evidence, in contrast to the other types, which are rather statistical. Two concepts related to content validity evidence are construct underrepresentation, the failure to capture important components of a construct, and construct-irrelevant variance, which occurs when scores are influenced by factors unrelated to the construct itself.
Criterion-related evidence for validity tries to assess how well a test relates to a specific criterion, providing such evidence when correlation between the test and criterion measure is high. We have a test that stands in for a measure for what we actually want to measure. For example a premarital test serves to predict the marital satisfaction in the future. This predicting feature of the criterion validity evidence is better known as predictive validity evidence. Take the SAT and GPA score as examples, the SAT is a predictive variable while the GPA is the criterion. The test is used to predict the success on the criterion mentioned. Another criterion is the concurrent-related validity evidence; it explains the simultaneous relationship between the criterion and the test. It is possible to assess only when they can be measured at the same time. For example, test for learning disabilities and school performance. Moreover, when a person does not know how to respond to a measure of criterion-say occupation, the SII ‑ Strong Interest Inventory (uses collection of patterns of interest among people satisfied with their jobs) will be a better predictor of perceived career fit than personality would. This stands for vocational interests in general (better predictors).
The test-criterion relationship is mostly expressed with a correlation called validity coefficient. This number expresses how good the test is in making assumptions about the criterion. In general the coefficient between .3 and .4 is often regarded as high. If we square our coefficient, we get the percentage of variation in the criterion.
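A short sketch of the squaring step: the squared validity coefficient gives the proportion of criterion variance accounted for by the test. The function name is an illustrative assumption.

```python
def variance_explained(validity_coefficient: float) -> float:
    """Proportion of variation in the criterion accounted for by the test."""
    return validity_coefficient ** 2


# Coefficients in the range the text calls "high":
for r in (0.30, 0.40):
    print(r, round(variance_explained(r), 2))  # .30 -> 0.09, .40 -> 0.16
```

Even a validity coefficient regarded as high therefore leaves most of the variation in the criterion unexplained (84% when the coefficient is .40).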
Construct-related validity evidence represents a succession of procedures where a researcher concurrently defines constructs and is developing the tools to measure it. By this he is making evidence of what a specific test means. To gather this evidence is a continuous process that takes time as if finding support for a complex theory. In 1959, Campbell and Fiske found a distinction between two essentials for a test to be meaningful. Those are the convergent and discriminant types of evidence. The convergent evidence is present when a certain measure correlates with other tests that are believed to measure the same thing. Thus, the measures of a same construct converge on the same item.
Convergent evidence can be assessed in two ways, one is that we have to provide information that a test measures the same things as other tests used for the same cause. The second way is to show specific interactions that we can expect if the test is measuring what it is supposed to measure.
Discriminant evidence is needed in test validation as a proof that a test is measuring something distinctive. This demonstration of distinctiveness is what we actually call discriminant evidence (same as divergent validation). This type of evidence actually shows that the measure is not able to represent another construct but the one it was designed for. Different categories of validity are no longer supported as constructs, the different categories of evidence are.
In theory, we can have reliability without validity, but it is not possible to demonstrate that a test without reliability is valid. Reliability and validity are certainly related concepts.
Guidelines for Item Writing
Writing items can be difficult and there are many things to consider. DeVellis provided some guidelines to help with item writing:
Make the item very specific - make it clear what you actually want to measure by using the theory.
Pay attention when selecting and developing items, for example avoid unneeded ones.
Avoid items that are too long.
Make sure that the language difficulty is suitable for the test takers - that it is clear to them what is asked.
Avoid bringing up two or more ideas with one item - avoid so called “double barrelled” questions.
Having both positively and negatively worded items is good.
Being cautious about ethnic and cultural differences is necessary since the same item can be interpreted differently across cultures.
The dichotomous format offers two alternatives for each item. The usual form of this format is the true-false test: you are presented with a statement and it is up to you to decide whether it is true or false. The positive aspects of this format are simplicity, fast scoring, and easy test administration. A drawback is that such items rely on the test taker’s ability to learn by heart, not allowing him/her to show understanding of the topic.
Dichotomous tests tend to be less reliable than some other formats. This type of format is widely used in personality tests, where an absolute judgment is required.
Polytomous (polychotomous) format is similar to the dichotomous format, though it provides more than two alternatives for the response. Most commonly found with the multiple-choice examination, this format is easy to score and the ability to provide the correct answer just by guessing is lower than 50% (which is the case with the dichotomous format).
The major advantage is that it takes less time to respond to an item because there is no need to elaborate upon and write an answer. As only one of the alternatives is correct, the rest are labelled distractors. The choice of distractors is essential because too many or too complicated distractors take up too much time and often have a negative effect on test reliability. Research suggests that using three or four good distractors is the best option.
Years of psychometric analysis indicate that three-option multiple-choice items are as good as (or better than) items with any other number of alternatives.
The problem of getting answers right by guessing is often dealt with using the “correction for guessing” formula:
corrected score = R - W/(n - 1), where:
R = the number of right responses
W = the number of wrong responses
n = the number of choices for each item
W/(n - 1) is an estimate of how many items one is expected to get right by chance alone.
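The correction for guessing can be sketched as follows; the function name and the example numbers are illustrative assumptions. Omitted items count as neither right nor wrong.

```python
def corrected_score(right: int, wrong: int, n_choices: int) -> float:
    """Correction for guessing: R - W / (n - 1)."""
    if n_choices < 2:
        raise ValueError("an item needs at least two choices")
    return right - wrong / (n_choices - 1)


# Hypothetical example: 30 right and 12 wrong on four-choice items.
print(corrected_score(30, 12, 4))  # 26.0
# On true-false items (n = 2) every wrong answer cancels one right answer.
print(corrected_score(30, 12, 2))  # 18.0
```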
The essay is another format, very common in classroom use, though its validity and reliability are rarely assessed and analysed.
The Likert format, which is very common in personality and attitude assessment, requires the test taker to specify the degree of agreement with a presented attitudinal statement. For example: “I am afraid of spiders” – strongly disagree, disagree, neutral, agree, and strongly agree. Sometimes the neutral option is omitted.
The category format is a form of the Likert format, though it provides even more choices for an answer. Most common is the 10-point rating scale. For example: “On a scale from 1-10 how much do you find your best friend reliable?” It can have more than 10 points, or fewer at times. Since it is known that people tend to change ratings depending on the environment/context, category formats are criticized for their lack of reliability. This can be avoided by defining the endpoints of the scale very clearly and reminding the test taker to think about the endpoints in this way. Why 10? It depends on the test taker’s involvement and relatedness to the topic in question. If the test taker is greatly involved and motivated to give accurate responses, he/she is able to respond best if there are many points on the scale, since he/she can distinguish many “shades”. With people who are uninterested it makes no difference if you provide them with a 7- or 27-point scale.
The visual analogue scale is a format related to the category format and presents the test taker with a 100-mm line and asks him/her to place a mark as a response to the question somewhere between the end points of the line. Scoring in this case is time-consuming, though these scales are popular with self-rating health.
Checklists and Q-sorts
Adjective checklists are lists of adjectives that require the test taker to indicate which ones characterize him/her. It is also used to characterize others. Only two options are present- either you are something (adventurous) or you are not.
The Q-sort technique is similar but it uses more categories in order to assess personality. For example, you are given statements and asked to place them in 9 piles. People place most of the statements in piles 4, 5 and 6, which are reserved for statements that only mildly characterize the subject in question. Those at the extremes, 1 and 9, usually tell something interesting about an individual.
It is generally advised to avoid “all of the above” as an answer alternative, though this advice is widely ignored.
Analysing Items
Item analysis stands for a group of methods used to assess test items and is considered to be an essential aspect of test construction.
Item difficulty is a measure obtained by the number of people who get a specific item right. As the proportion of correct answers increases among the group of test takers - the difficulty of the item is decreasing. An item that is answered correctly by everyone is obviously a bad item since it provides us with no information about the discrepancies among test takers - which is exactly what we are trying to assess.
The optimal difficulty is considered to be halfway between getting the right answer by chance and everyone getting it right. To be more precise, for a 4-choice item you take the 100% success level (1.0), subtract the chance level (.25) and divide the result by 2, which gives .375. You then add the chance level (.25) to this value, and this way you obtain the optimum item difficulty, which is .625 for a 4-choice item. It is best to have items of different difficulty in order to make several discriminations. For example, a few easy items can help control the anxiety of the test takers, which further increases the reliability.
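The arithmetic above can be sketched for any number of answer choices, assuming that the chance level is 1 divided by the number of choices (the function name is illustrative):

```python
def optimal_difficulty(n_choices: int) -> float:
    """Halfway point between chance-level success and 100% success."""
    chance = 1 / n_choices
    return chance + (1.0 - chance) / 2


print(optimal_difficulty(4))  # four-choice item: 0.625
print(optimal_difficulty(2))  # true-false item: 0.75
```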
Item discriminability is another way to assess the item quality by looking at the relationship between the performances on a particular item with the performance on the test as a whole. In other words test if people who did well on a certain item have also done well on the test in general. There are several ways of assessing the discriminability.
The extreme group method compares those who have performed well with those who have performed rather poorly. Then you have to find the proportion of people from these groups who got each item right and compare it between the two extreme groups. This difference is called the discrimination index. When this index is a positive number somewhat away from 0 – we consider this item a good one, when near 0 - means no discriminability, when negative – bad item.
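A minimal sketch of the extreme group method. The summary does not say how large the extreme groups should be, so the top and bottom thirds used here are an assumption, as are the function name and the sample data.

```python
def discrimination_index(total_scores, item_correct, fraction=1 / 3):
    """Proportion passing the item in the top group minus the bottom group.

    total_scores: total test score per person.
    item_correct: 1 if that person got the item right, else 0.
    fraction: share of test takers in each extreme group (assumed value).
    """
    if len(total_scores) != len(item_correct):
        raise ValueError("inputs must have the same length")
    n_group = max(1, round(len(total_scores) * fraction))
    # Indices of test takers ordered from lowest to highest total score.
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    bottom, top = order[:n_group], order[-n_group:]
    p_top = sum(item_correct[i] for i in top) / n_group
    p_bottom = sum(item_correct[i] for i in bottom) / n_group
    return p_top - p_bottom


# Hypothetical data for nine test takers.
totals = [10, 12, 15, 18, 20, 22, 25, 27, 30]
item = [0, 0, 1, 0, 1, 1, 1, 1, 1]
print(round(discrimination_index(totals, item), 2))  # 0.67
```

A clearly positive value like this one marks a good item; a value near zero or a negative value would flag the item for revision or removal.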
The point biserial method is assessing the correlation between an item and a total test score:
rpbis = [(Y1bar - Ybar)/Sy] × √(Px/(1 - Px)) where:
rpbis = the point biserial correlation or index of discriminability
Y1bar = the mean score on the test for those who got item 1 correct
Ybar = mean score on the test for all persons
Sy = the standard deviation of the exam scores for all persons
Px = the proportion of persons getting the item correct (Allen & Yen, 1979)
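The point biserial formula can be sketched directly from these definitions. The summary does not say whether Sy is the population or the sample standard deviation; the population form (`statistics.pstdev`) is assumed here because it makes the result equal to the ordinary Pearson correlation between item and total score. The function name and data are illustrative.

```python
import math
from statistics import mean, pstdev


def point_biserial(total_scores, item_correct):
    """Point biserial correlation between one item and the total test score."""
    passed = [s for s, c in zip(total_scores, item_correct) if c]
    p = len(passed) / len(total_scores)  # Px: proportion getting the item correct
    if p == 0 or p == 1:
        raise ValueError("item was passed by everyone or by no one")
    s_y = pstdev(total_scores)  # Sy: assumed to be the population SD
    if s_y == 0:
        raise ValueError("total scores do not vary")
    return (mean(passed) - mean(total_scores)) / s_y * math.sqrt(p / (1 - p))


# Hypothetical data: four test takers, the two highest scorers pass the item.
print(round(point_biserial([2, 4, 6, 8], [0, 0, 1, 1]), 3))  # 0.894
```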
Note that it is not wise to use the point biserial correlation for tests with only a few items, since performance on the item necessarily contributes to the total score. In order to better assess the results we use the item characteristic curve. The total score is presented on the X axis while the proportion of test takers who got the item correct is presented on the Y axis. A gradual positive slope, where the proportion of people who pass the item increases steadily as the test scores increase, means that the item is good because it discriminates at all levels of performance.
A flat line indicates that a test taker of any ability was equally likely to answer correctly - this is a consequence of a poor item.
A curve that gradually rises and then starts turning down for people at the highest levels of performance indicates that those with the best overall scores did not have the best chances of getting the item correct. This often happens with the “none of the above” alternatives.
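The points of an empirical item characteristic curve can be computed by grouping test takers by total score and taking the proportion who passed the item in each group. This is a minimal sketch; the function name and sample data are illustrative assumptions, and in practice scores are often grouped into wider bands.

```python
from collections import defaultdict


def item_characteristic_points(total_scores, item_correct):
    """Return (total score, proportion passing the item) pairs, sorted by score."""
    counts = defaultdict(lambda: [0, 0])  # score -> [number correct, number of people]
    for score, correct in zip(total_scores, item_correct):
        counts[score][0] += 1 if correct else 0
        counts[score][1] += 1
    return [(score, n_correct / n) for score, (n_correct, n) in sorted(counts.items())]


# Hypothetical data: the proportion passing rises with the total score (a good item).
totals = [1, 1, 2, 2, 3, 3]
item = [0, 0, 0, 1, 1, 1]
print(item_characteristic_points(totals, item))  # [(1, 0.0), (2, 0.5), (3, 1.0)]
```

Plotting the first element of each pair on the X axis and the second on the Y axis gives the curve described above; a flat or downward-turning sequence of proportions signals a problem item.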
Item response theory is an approach to testing that analyses each item by considering the chances of getting it right or wrong, taking the ability level of each test taker into consideration. The biggest advantage of this approach is that a person’s score is not defined by the total number of correct answers but by the difficulty of the items the person got correct. Another crucial advantage is the ease with which it adapts to computer administration.
Criterion-referenced testing compares performance with some specific criterion for learning. For example, Annie’s score of 77 (out of 100) on a maths test is compared not to the rest of her class but to how much she “should have” learned. Step one in using this type of testing is to clearly define the learning outcome to be achieved. To properly evaluate a criterion-referenced test one should study two groups of students: those exposed to the learning program and those who are not.
The main issue with the item-analyses is that even though statistics help the test maker to assess which item is good and which one is not it still does not contribute to the successful learning of the students.
The difficulty of defining Intelligence
Alfred Binet defines intelligence as: “The tendency to take and maintain a definite direction; the capacity to make adaptations for the purpose of attaining a desired end, and the power of autocriticism”. Others, such as Spearman and Freeman, had different views, which shows that it is hard to define intelligence in only one manner. Moreover, Taylor (1994) identifies three streams of research that study intelligence: the psychometric approach, examining the fundamental structure of a test; the information-processing approach, which emphasizes the underlying processes of how humans solve problems; and the cognitive tradition, focusing on human adaptations to real-world demands. The mentioned view of Binet falls within the psychometric approach. It is and was widely known that people are able to accomplish remarkable things and that they differ in this capability, which indicates the existence of intelligence; the main problem, however, was how to define intelligence. Binet, for example, was undecided about what he actually wanted to measure, and alongside the above-mentioned definition, he and his colleagues developed the first intelligence test.
Binet’s Principles of Test Construction
In Binet’s view intelligence corresponds to a capacity to find and retain a purpose or direction, to adapt if necessary in order to achieve the purpose, and to criticize oneself, which leads to adjusting the strategy towards the goal. Binet and his colleagues worked on developing ways to measure judgment, reasoning and attention. Binet provided a foundation for later tests of human ability and was guided by two principles that are well known today: age differentiation and general mental ability.
Age differentiation is explained by the fact that we can make a difference between younger and older children in their capabilities where the older ones are more capable. Binet decided to use tasks with which he could estimate the mental ability of a child by comparing the result on the task with the one “average” for a child of a specific age. This way one can determine age capabilities of a child independent from the child’s chronological age. This was later called the mental age. Moreover, Binet decided to measure only the total product of several distinct elements of intelligence and named this general mental ability. He chose this most probably to ease himself from having to define every independent element of intelligence, thus to make it more practical.
General Mental Ability - Spearman
Like Binet, Spearman used the notion of general mental ability as the basis of all intelligent behaviour. According to Spearman’s theory, intelligence consists of one general intelligence factor, g, plus a great number of specific factors. Spearman’s idea of general mental ability (which he called the psychometric g, or just g) was grounded in the phenomenon that if one administers many different ability tests to an unbiased sample of the population, almost all the correlations will end up positive. This is also referred to as positive manifold, which, as Spearman explains, results from all the tests being influenced by g.
Factor analysis was introduced by Spearman as a way to statistically support the notion of g. Simply put, factor analysis reduces a set of variable scores into factors – a smaller number of variables. Spearman also claimed that as much as 50% of the variance in mental-ability tests is characterized by the g. This notion is still used in the present day.
However, present theories tend to emphasize the idea of multiple intelligences rather than a single one. As the “gf-gc” theory proposes, we have two basic types of intelligence: fluid and crystallized. Fluid intelligence refers to the abilities that enable us to acquire new knowledge, to reason and to think, while crystallized intelligence is what we have already acquired and understood.
Binet - scale history (including Terman’s Stanford-Binet Intelligence Scale)
Binet scales went through many revisions and new forms. In 1905 we have the Binet-Simon scale that represented a 30-items individual intelligence test with the increasing difficulty of the test items. By this time Binet solved two problems about his previous work and those are: he now knew exactly what he wanted to measure and he further came up with the items to support those measurements. However the Binet-Simon scale lacked several things, some of them were specific measuring units as well as the normative data that could support the validity. Norms in the 1905 scale were based on only 50 children and the children were considered “normal” according to their school performance.
In 1908, Binet and Simon incorporated the idea of age differentiation in their work and made the 1908 scale an age scale. The items in the scale were grouped by age level and no longer simply ordered by increasing difficulty as before. This scale met a few challenges as well, because when items are grouped according to the child’s age level, comparing performance on different kinds of tasks becomes increasingly difficult. Despite these challenges, the 1908 scale was a definite improvement over the 1905 version. The scale focused heavily on verbal and language ability, and this was the main criticism of it. With the introduction of the mental age concept, Binet started working on the problem of unspecific units for evaluating the results. In short, the Binet-Simon scale offered two crucial concepts: the age scale format and the mental age concept.
Terman’s Stanford-Binet scale was developed by L. M. Terman and was the leading intelligence scale from 1916 up to its revisions. The 1916 Stanford-Binet scale retained the principles of age differentiation, general mental ability and the age scale. The mental age concept was retained as well. What made a difference was the increased size of the standardization sample; however, it consisted of only white, native-Californian children, which means that it was not very representative.
Furthermore, the 1916 scale introduced the concept of IQ, or intelligence quotient. The IQ used the person’s mental age together with the chronological age to obtain a ratio score. This ratio was regarded as a reflection of the person’s mental development.
IQ = (MA/CA) × 100, where MA is mental age and CA is chronological age. The ratio is multiplied by 100 in order to avoid fractions. The problem was that the scale had a maximum mental age of 19.5 years, which led to people older than that having unusually low IQs. The maximum chronological age used in the formula was therefore set to 16, because it was believed that mental age stops developing after 16.
The 1937 scale contained several further improvements. The age range was extended down to the 2-year-old level. The maximum mental age was extended to 22 years and 10 months by adding new tasks. The standardization sample was greatly improved; the norms were now based on representatives from 11 US states, although this still did not make it perfect. The inclusion of an alternate equivalent form made it easier to examine the psychometric properties of the scale. However, the 1937 scale had a major problem: its reliability coefficients were higher for older subjects than for younger ones. Also, the reliability was higher at the lower end of the IQ range and lower at the higher end. The scores were and are most unstable for the youngest age groups at the highest IQ levels. Apart from the reliability problems, the fact that different age groups showed significant differences in the standard deviation of the IQs was another problem. Because of this, IQs at one age level were not comparable with IQs at another level (e.g. the standard deviation at age level 6 was 12.5, whereas for ages 2.5 and 12 the standard deviations were 20.6 and 20).
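A short sketch of the ratio IQ and of the problem it runs into with adults. The function name and the example ages are illustrative assumptions; only the formula and the 19.5-year ceiling come from the text.

```python
def ratio_iq(mental_age: float, chronological_age: float) -> float:
    """Ratio IQ: mental age divided by chronological age, times 100."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return 100 * mental_age / chronological_age


print(ratio_iq(10, 8))     # child ahead of age peers: 125.0
print(ratio_iq(8, 10))     # child behind age peers: 80.0
# A 40-year-old who reaches the scale's maximum mental age of 19.5
# still receives a very low ratio IQ, which illustrates the problem.
print(ratio_iq(19.5, 40))  # 48.75
```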
The 1960 Stanford-Binet Revision and deviation IQ (SB-LM) was created as an attempt to develop a scale similar to the one from 1937 by using the best things about the 1937 scale: the fact that with the increase in age the test scores increase and that some tasks correlated highly with the test scores as a whole. To add to that, the IQ tables were extended to age 18 and the scorings as well as the test administration were improved. The major problem of the previous scale - the differential variation in the IQs - was solved by introducing the deviation concept of the IQ. The deviation IQ introduced a standard score with a mean of 100 and standard deviation of 16 (now 15). Furthermore, new tables were made in order to correct for the differences in variability at different age levels. This enabled the comparison of the IQs of a specific age to those of another. Several additional revisions followed, however, return to the 1960 model was present due to its better instructiveness.
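The deviation IQ can be sketched as an ordinary standard score: a person's distance from the mean of their own age group, in standard deviation units, rescaled to a mean of 100 and a standard deviation of 16. The function name is an illustrative assumption, and the age-group mean of 100 used in the example is assumed; the age-group standard deviations (12.5 at age 6, 20 at age 12) are the ones quoted for the 1937 scale.

```python
def deviation_iq(score: float, age_mean: float, age_sd: float, sd: float = 16) -> float:
    """Standard score with mean 100 and the given SD (16 in 1960, later 15)."""
    if age_sd <= 0:
        raise ValueError("age-group standard deviation must be positive")
    z = (score - age_mean) / age_sd  # distance from the age-group mean in SD units
    return 100 + sd * z


# The same 1937-style IQ of 125 means different things at different ages:
print(deviation_iq(125, 100, 12.5))  # age 6, where the SD was 12.5: 132.0
print(deviation_iq(125, 100, 20))    # age 12, where the SD was 20: 120.0
```

Because every age group is rescaled to the same mean and standard deviation, a deviation IQ at one age can be compared directly with a deviation IQ at another.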
Looking at the Modern Binet Scale
The modern Binet scales include the gf-gc theory, which is grounded in the view that we possess multiple intelligences, not only one. It is a hierarchical model with g at the top. Under g we have three group factors: crystallized abilities, which reflect learning, that is, the realization of one’s original capacity through experience; fluid-analytic abilities, the original capacity that one uses to obtain the crystallized abilities; and short-term memory, the information one can keep briefly after only one presentation. Crystallized ability also has two sub-abilities: nonverbal and verbal reasoning. Thurstone’s multidimensional model relies on his argument that, in contrast to Spearman’s idea of intelligence as a single construct, intelligence can best be understood and defined as composed of independent factors or “primary mental abilities”.
The 1986 revision kept much of the previous versions, however, the age scale was completely removed. In replacement for the age scale, we now have items of the same content grouped together in one of the 15 tests to create point scales (e.g. all language items would be grouped in one test).
The 2003 version provided a further hierarchical model containing 5 factors with general intelligence at the top, as in the 1986 version, except that each of the 5 factors is now a “main” factor and each has a verbal and a nonverbal measure. The 2003 fifth edition is an integration of the point scale and the age scale. The nonverbal and verbal scales have equal weight, and testing starts with one of two routing subtests: verbal or nonverbal. Moreover, there is a point scale of items of similar content and increasing difficulty. So in this version of the scale the routing serves to estimate the test taker’s ability, where the nonverbal subtest examines nonverbal ability and the verbal subtest examines verbal ability. Since verbal and nonverbal are equally weighted, it is possible to evaluate the test taker’s score on all items of similar content.
The level of ability we can initially estimate for a person is called the start point. The level where a minimum of correct responses are found is called the basal. The ceiling is the testing point where a specific number of wrong answers indicate that the items are too difficult. The main idea of the fifth edition is to bring back the extremes in intelligence which is a valuable property of the Binet lost in the fourth edition.
The range in age now spans from 2-85+. The score now ranges from 40-160 making it very useful in assessing the extremes in intelligence. The reliability of the fifth edition is regarded as good with coefficients for the full-scale IQ of .97 or .98 for all of the 23 age ranges mentioned in the manual. In addition, the manual reports four forms of evidence that are supporting the validity of the test in question and those are content validity, empirical approach to item analysis, relative criterion-related evidence of validity and construct validity.
Wechsler scales to WAIS-III - scales, subtests and indexes, and interpretation of features
Wechsler indicated that intellectual ability is not the only factor involved in intelligent behaviour; he suggested that other factors play a part as well. Three Wechsler intelligence tests are available: the Wechsler Adult Intelligence Scale, Third Edition (WAIS-III), the Wechsler Intelligence Scale for Children, Fourth Edition (WISC-IV) and the Wechsler Preschool and Primary Scale of Intelligence, Third Edition (WPPSI-III).
Wechsler scales differ greatly from the main concepts of the Binet. Wechsler believed that since Binet’s scales were meant for children, they lacked validity when it came to testing adults. Among others, two big differences between these two approaches are Wechsler’s use of a point scale instead of an age scale and Wechsler’s use of a performance scale. With a point scale we assign points to each item, and the person receives a certain number of points for each item passed. The advantage of this approach is that it makes it simple to group items of a specific content. Wechsler’s test produces scores for each specific area. This approach is nowadays standard. The performance scale is an entirely new construct embedded in the scale and provides a measure of nonverbal intelligence. It consists of tasks that require the person to perform and not only answer questions.
The original Wechsler scale included two different scales: the verbal scale measured verbal intelligence and the performance scale measured nonverbal intelligence. Current versions of the Wechsler include four major scales. The performance scale was not entirely Wechsler’s innovation; it had been used in different forms before as an alternative to the Binet. Wechsler was innovative in offering the possibility to directly compare nonverbal and verbal intelligence, since the verbal and performance scales were standardized on the exact same sample and the results were expressed in comparable units. The idea of the performance scale is to overcome the issues and biases triggered by differences in language, level of education or culture. It took several attempts until the Wechsler scale reached a proper form, since the first version, called the Wechsler-Bellevue, was poorly standardized; its sample consisted of about 1000 eastern US whites (mostly from New York). The WAIS-III was last revised in 1997 and will possibly be revised again soon.
Similar to Binet, Wechsler saw intelligence as a capacity to act towards a goal and adjust to the environment. However, in his opinion the elements that build intelligence are not independent, but interrelated. He uses terms global and aggregate to explain this. Intelligence further consists of many interrelated elements and the general intelligence will be the outcome of the interaction of these constructs. Wechsler was concentrated on several constructs on his way explaining the general mental ability while this was not the case with Binet. WAIS-III has seven verbal subtests and those are: vocabulary, similarities, arithmetic, digit span, information, comprehension and letter-number sequencing.
The vocabulary subtest measures the ability to define a presented word. It is one of the best measures of intelligence as well as one of the most stable, which is one of its crucially important features. If a patient has suffered some sort of brain damage, the vocabulary subtest is the one least likely to be affected.
The similarities subtest presents the person with 15 paired items of increasing difficulty and requires the person to indicate the similarity between the items in each pair. Some items require the subject to think rather abstractly and notice the similarity between constructs that are not obviously similar.
The arithmetic subtest presents the test takers with 15 somewhat easy problems, which do not require complex mathematical knowledge but require the ability to hold the information available for calculation until the answer has been made. This is why memory, motivation, and good concentration are essential for the performance on this subtest.
The digit span subtest requires the person to repeat digits that are presented one after another at a rate of one per second. This measures the capacity of short-term memory. Bear in mind that with Wechsler there are always some nonintellective side factors that could influence the performance. In this case that would be attention. Anxiety is another example.
The information subtest presents the person with both nonintellective and intellective modules including the necessity to understand what is asked, follow rules and give a response. Nonintellective features in this case are some constructs like curiosity and knowledge acquisition. This subtest is also influenced by alertness, more specifically alertness to environment and cultural opportunities.
The comprehension subtest presents the person with three types of questions. The first type requires the person to decide what should be done in a specific situation. The second asks for a logical explanation of some presented phenomenon. The third requires the person to define or explain a certain proverb. In general, this subtest provides information about the understanding of everyday practical situations, or common sense. A problem can arise in this subtest if a person’s emotional involvement affects judgment and leads to an unsuitable response.
The letter-number sequencing subtest is one of the latest WAIS-III subtests. It consists of seven items and the person is required to reorder a list of letters and numbers. This subtest provides information on attention and working memory.
To obtain the Verbal IQ (VIQ), one sums the age-corrected scores from the verbal subtests just mentioned.
WAIS-III has seven performance subtests and those are: picture completion, digit symbol-coding, block design, matrix reasoning, picture arrangement, object assembly and symbol search.
Picture completion subtest is made out of a picture that is missing an important part and the person is asked to spot the missing part. This task is timed.
Digit symbol system-coding subtest asks from the person to copy items paired with numbers from 1 to 9 and assesses the person’s capacity to learn an unknown task, the level of persistence and speed of performance. It also measures the visual and motor agility.
Block design subtest includes nice differently coloured blocks as well as a booklet with the photos of the same blocks arranged in a specific geometric manner. The person is asked to arrange the blocks and make up increasingly difficult patterns. The input is visual and the response required is a motor product. It is a good way of evaluating the abstract thinking in a nonverbal manner.
The matrix reasoning subtest became a part of the WAIS-III as a way of improving the assessment of fluid intelligence, which involves the ability to reason. The person is given nonverbal, figural stimuli and the task is to spot a pattern or a relationship between those stimuli. This subtest is a good way of assessing abstract reasoning and how well a person can process information.
The picture arrangement subtest asks the person to notice relevant details in a set of pictures and to understand cause-and-effect relationships. The person must place the misarranged pictures in the right order so that they tell a story; the subtest thus assesses the capacity to find the logical sequence of events.
The symbol search subtest assesses the speed of information processing. The person is shown two target symbols and must report whether either of them appears in a larger group of symbols.
The performance IQ (PIQ) is obtained by summing the age-corrected scaled scores from the performance subtests and comparing the sum with the standardization sample. The full scale IQ (FSIQ) follows the same principle as the VIQ and PIQ: the age-corrected scaled scores from the verbal and performance scales are summed and compared with the standardization sample.
Index scores are another way of assessing intelligence. There are four index scores: verbal comprehension, perceptual organization, working memory, and processing speed. The verbal comprehension index is thought of as a good measure of crystallized intelligence. It is regarded as better than the VIQ because it excludes the arithmetic subtest, which has more to do with working memory. The perceptual organization index is regarded as a good measure of fluid intelligence. One of the biggest innovations of the WAIS-III is the concept of working memory, which refers to the information that we can hold in mind for a short while in order to work with it. Finally, the processing speed index assesses how quickly the mind works: where one person needs 30 seconds for a certain task, another may need only 5.
Comparing the verbal and performance IQ and providing a measure for the difference is one of the very useful features of the WAIS-III in comparison with the Binet scales. The influence of ethnic background needs to be taken into account. Another useful tool is pattern analysis, in which one assesses and describes relatively large differences between subtest scores. For example, some kinds of emotional problems might affect subtest performance and produce characteristic score patterns. If such a pattern appears in a person’s scores, it might tell us something about the person who took the test. Research on pattern analysis provides contradictory results, so this kind of analysis must be done very carefully.
Psychometric features and evaluation of Wechsler
The standardized sample of WAIS-III consists of 2450 adults classified into 13 age groups from 16-17 up to 85-89. Race, gender, level of education, and geographical placement were taken into account. The reliability of the WAIS-III is quite high and includes internal and external reliability measure of the verbal, performance, and full-scale IQs. The average coefficients among all age levels range from .94 for PIQ to .98 for the FSIQ (VIQ- .97).
The SEM (standard error of measurement) is a number based on the reliability coefficient. It estimates the discrepancy between the score a perfect measuring instrument would provide and the score that is actually obtained. The validity of the WAIS-III rests largely on its correlation with previous versions, especially the earlier revisions. Generally, the correlations are higher for the FSIQ, VIQ, and PIQ, and lower for the subtests. It is important to note that, according to Gardner’s theory of multiple intelligences, we possess at least seven different intelligences that are independent of each other: interpersonal, intrapersonal, linguistic, bodily-kinaesthetic, spatial, musical, and logical-mathematical. The WAIS-III does not support this theory and leaves little room for such an idea.
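As a rough illustration of the standard error of measurement, the usual textbook formula is SEM = SD × sqrt(1 − r), where r is the reliability coefficient. The sketch below applies it to the average reliability coefficients reported above. The standard deviation of 15 IQ points is an assumption added here (it is the conventional Wechsler scale value, but the summary does not state it), and the function name is only illustrative.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Return SEM = SD * sqrt(1 - r) for a scale with standard deviation `sd`."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


# Assumption: IQ scores are scaled to a standard deviation of 15 points.
IQ_SD = 15.0

# Average reliability coefficients quoted in the text.
reliabilities = {"VIQ": 0.97, "PIQ": 0.94, "FSIQ": 0.98}

sems = {
    scale: standard_error_of_measurement(IQ_SD, r)
    for scale, r in reliabilities.items()
}

for scale, sem in sems.items():
    print(f"{scale}: SEM = {sem:.2f} IQ points")
```

Under these assumptions the most reliable score (FSIQ) has the smallest SEM and the least reliable (PIQ) the largest, which is why higher reliability means an obtained score is a tighter estimate of the true score.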
Extensions – the WISC-IV and the WPPSI-III
The extensions of the WAIS-III are the WISC-IV and the WPPSI-III.
The WISC-IV is the most recent version of the scale for children. It measures global intelligence and provides indexes of specific cognitive abilities, including processing speed and working memory. This scale introduced some innovations, such as the assessment of fluid reasoning, with emphasis on the concepts of working memory and processing speed. Furthermore, the WISC-IV uses empirical data to identify item biases. It cannot remove bias completely, but the use of empirical data is a step in that direction. The standardization sample contains 2200 children; age, race, region, parental education, and occupation were taken into account. Interpretation of the WISC-IV is very similar to that of the WAIS-III and involves evaluating the four major indexes in order to identify weaknesses in any area and to further assess validity. Reliability is lowest for the youngest children at the highest levels of ability. As with its competitor, the Binet scale, validity is well supported, and the good standardization contributes to this.
The WPPSI-III is another extension of the WAIS-III, and it was revised several times before this version was published. It contains almost all of the WISC-IV components, such as five composites, but not the PIQ and the VIQ. Reliability is comparable to that of the WISC-IV and the validity is well supported in the manual. This scale is more sensitive to the testing needs of young children, who have less language ability than their older counterparts.
Introduction
During your time spent studying, you have doubtless encountered a standardized test. This may have come in the form of the GRE Revised General Test (GRE), the SAT Reasoning Test (SAT-I), or even a Goodenough-Harris Drawing Test. Many universities handle admissions through the use of standardized group entrance exams. The key factor in these standardized tests is the test criterion, i.e. what the test is trying to predict. Defining it can prove difficult. In the case of the GRE, which is widely used in the admission process to postgraduate programs, the test does not predict clinical skill or the capacity to solve real-world problems.
While the tests discussed in this chapter improve the accuracy of a selection process, it is important to note that they account for a very small amount of variability.
Comparison of Group and Individual Ability Tests
Individual tests and group tests both have their own advantages and disadvantages. Individual tests are carried out with a single examiner assigned to a single subject. The examiner follows instructions which are provided in the manual of the standardized test. What follows is a response–record interaction in which the examiner records the subject’s responses exactly. These responses are then evaluated, a process which can require a high degree of skill. In contrast, a single examiner can administer a group test to multiple individuals at the same time. The examiner reads the instructions to the subjects, time limits are established, subjects record their responses themselves, and the results are calculated as a percentage, which usually requires very little skill.
If a subject is experiencing distress for any reason, be it fear, stress, or an uncooperative nature, the examiner in an individual test takes responsibility for maximizing performance. In other words, the examiner can attempt to elicit maximum performance. In the case of a group test, it must be assumed that a subject is fully motivated and cooperative. For this reason, low scores on group tests can be difficult to interpret. They can be attributed to a wide range of factors, such as low motivation, clerical error, or a poor understanding of the instructions.
Advantages of Individual Tests
Through individual tests, it is possible to learn more about a subject beyond their test score. Over time, examiners develop internal norms. With these internal norms, examiners are able to easily identify unusual reactions to certain tasks or situations. Individual testing thus gives the chance to observe behaviour in a standardized setting and allows the examiner to see beyond the test scores in a unique way.
Advantages of Group Tests
When compared to individual tests, group tests are more cost efficient, require less expensive material, and require less examiner skill. They are commonly more objective as the subject records their own responses, thus making them usually more reliable. Individual tests are mostly applied in clinical settings, whereas group tests are applied in a much broader setting. Group tests are commonly used at various levels of schooling. Areas of military, industry, and research also greatly rely on them.
Overview of Group Tests
Characteristics of Group Tests
For the most part, group tests can be categorized as paper and pencil or booklet and pencil tests, since most of them consist of a printed booklet, test manual, scoring key, answer sheet, and pencil. This is changing, however, as computerized testing is increasingly used in place of paper and pencil. The number of group tests far outweighs the number of individual tests. Generally, group test scores are converted to percentiles or standard scores; a few are converted to ratio or deviation IQs.
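To make the score conversions mentioned above concrete, the following minimal sketch turns a raw score into a z-score (a standard score), a percentile, and a deviation IQ. It is illustrative only: the norm-group scores are made up, the percentile assumes normally distributed scores, and the deviation IQ uses the common convention of mean 100 and standard deviation 15, none of which comes from the summary itself.

```python
from statistics import NormalDist, mean, pstdev


def standard_scores(raw_score, norm_scores):
    """Convert a raw score using the mean and SD of a norm group."""
    mu = mean(norm_scores)
    sigma = pstdev(norm_scores)  # SD of the norm group, treated as a population
    z = (raw_score - mu) / sigma
    return {
        "z": z,
        # Assumption: scores are normally distributed in the norm group.
        "percentile": NormalDist().cdf(z) * 100,
        # Assumption: deviation IQ with mean 100 and SD 15.
        "deviation_iq": 100 + 15 * z,
    }


# Hypothetical norm-group raw scores (made up for illustration).
norm_group = [38, 42, 45, 47, 50, 50, 53, 55, 58, 62]

result = standard_scores(56, norm_group)
print(result)
```

A raw score equal to the norm-group mean maps to z = 0, the 50th percentile, and a deviation IQ of 100; scores above the mean map to values above those anchors.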
Selecting Group Tests
Because of the sheer number of group tests available, the test user is assured a selection of well-documented and psychometrically sound tests. In particular, ability tests in schools are found to be very reliable.
Using Group Tests
The tests which are to be discussed are almost as reliable and soundly standardized as the best individual tests. As is the case with some individual tests, however, validity data for some group tests are weak, meagre, or contradictory – sometimes all three. When working with group test information, the following cautions should be exercised. Use results with caution: avoid overinterpretation, don’t consider scores as being absolute or isolated, and be careful when using results for prediction. Be especially suspicious of low scores: many factors can contribute to a low score, so be aware of them. Consider wide discrepancies as a warning signal: if an individual produces large discrepancies either among test scores or between test scores and other data, this may be a sign that all is not well with the individual. When in doubt, refer: in the case of low scores, wide discrepancies, or reason to doubt validity, the best option is to refer the subject for individual testing.
Group Tests in the Schools: Kindergarten Through 12th grade
The goal of tests aimed at schools is to measure educational achievement in children.
Achievement Tests Versus Aptitude Tests
Achievement tests aim to ascertain what an individual has learned following a specific instruction. These tests measure how much a student has learned after sufficient training has been provided. Validity is determined by the content related evidence. The test is said to be valid if it accurately samples the domain of the construct being assessed.
Aptitude tests, on the other hand, aim to measure how much potential for learning an individual possesses. A wide variety of experiences are evaluated in a multitude of ways. The validity of an aptitude test is determined by its ability to predict future performance. Hence, these tests rely extensively on criterion-oriented evidence.
Group Achievement Tests
The Stanford Achievement Test (SAT) is renowned as being one of the oldest standardized achievement tests still widely used within the education system. The SAT is in its 10th edition and is currently well normed and criterion referenced, with outstanding psychometric documentation. It primarily evaluates achievement in kindergarten to 12th grade in a variety of areas.
The Metropolitan Achievement Test (MAT) is another well standardized and psychometrically sound group measure of achievement. This test measures achievement in reading by assessing word recognition, vocabulary, and reading comprehension. Versions of this test include Braille, large print, and audio formats.
The MAT and the SAT are the pinnacle of modern achievement testing. These tests are psychometrically well documented, reliable, and normed on large samples. Both sample a wide variety of educational factors and cover all grade levels.
Group Tests of Mental Abilities (Intelligence)
Kuhlmann-Anderson Test (KAT) – Eighth Edition
The Kuhlmann-Anderson Test (KAT) is a group intelligence test which is applied from kindergarten through to 12th grade. The test has eight separate levels with a variety of items on each. Unlike most tests, the KAT does not become more verbal the higher the age group being tested; instead it remains primarily non-verbal throughout. This makes the KAT suitable not just for young children but also for individuals who may be handicapped in following verbal procedures. It may even prove to be suitable for non-English-speaking populations, after proper norming. Results from the KAT can be represented as verbal, quantitative, and total scores. Scores can also be expressed as percentile bands. A percentile band provides the range of percentiles which most likely represents a subject’s true score, much like a confidence interval. The KAT is a soundly reliable, valid, sophisticated test, and its non-verbal qualities make it an ideal candidate for testing non-native English speakers.
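The idea of a percentile band can be sketched as follows: take the obtained standard score, step one standard error of measurement below and above it, and convert both ends to percentiles. This is only one plausible way to build such a band; the summary does not say how the KAT manual constructs its bands, and the scale mean of 100, standard deviation of 15, SEM of 5, and normal distribution are all assumptions made for the example.

```python
from statistics import NormalDist


def percentile_band(standard_score, sem, scale_mean=100.0, scale_sd=15.0, n_sem=1.0):
    """Return (low, high) percentiles for the range score +/- n_sem * SEM.

    Assumes scores are normally distributed with the given mean and SD.
    """
    dist = NormalDist(mu=scale_mean, sigma=scale_sd)
    low = dist.cdf(standard_score - n_sem * sem) * 100
    high = dist.cdf(standard_score + n_sem * sem) * 100
    return round(low), round(high)


# Hypothetical examinee: obtained standard score of 100, SEM of 5 points.
band = percentile_band(standard_score=100, sem=5)
print(f"Percentile band: {band[0]} to {band[1]}")
```

Reporting the band rather than a single percentile makes the measurement error visible, which is the same purpose a confidence interval serves.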
Henmon-Nelson Test (H-NT)
The Henmon-Nelson Test of mental abilities is another widely used test applicable to all grade levels. This test produces one score, which is thought to measure general intelligence. This has been and continues to be a source of some controversy. However, it remains a quick predictor of future academic success. Unfortunately, by scoring only general intelligence, the H-NT does not consider multiple intelligences. The H-NT manual also calls for caution when testing individuals from an educationally disadvantaged background. Research has also shown that the H-NT has a tendency to underestimate Wechsler full-scale IQ scores by 10 to 15 points for a number of populations.
Cognitive Abilities Test (COGAT)
When talking about reliability and validity, the COGAT is similar to the H-NT. The COGAT provides three scores for results: verbal, non-verbal, and quantitative. Unlike the H-NT, the COGAT was designed with poor readers, poorly educated individuals, and non-native-English speakers in mind. Additionally, research has shown that the COGAT is a sensitive differentiator for giftedness, a fine predictor of future performance, and a good measure of verbal underachievement. However, the COGAT has been found to be very time consuming, there is uncertainty regarding whether the norms are representative, and minority populations have been found to score lower than white students across the test batteries and grade levels. For these reasons, great care should be taken when scores are used in conjunction with minority populations.
College Entrance Tests
The SAT Reasoning Test (SAT)
Formerly known as the Scholastic Aptitude Test, the SAT Reasoning Test (SAT-I) is still the most widely used university entrance test. Renorming of the SAT occurred in 1994 as an attempt to restore the national average to the 500 point level, as it was in 1941. More recent changes brought the number of scored sections to three, each scored from 200 to 800 points. This will likely lead to fewer interpretation errors, because interpreters can no longer rely on old versions as points of reference. At 3 hours and 45 minutes long, the modern SAT is an endurance race which rewards determination, motivation, stamina, and persistent attention. The SAT is a good predictor of first-year college GPA.
Cooperative School and College Ability Tests (SCAT)
The SCAT, developed in 1955, is second only to the SAT, however it has not been updated since its implementation. It encompasses the college level as well as three precollege levels, starting at 4th grade. Its primary goal is to measure school-learned abilities and an individual’s potential to take on further schooling. In comparison to the SAT, the SCAT’s psychometric documentation is neither as strong nor as extensive. Revisions and extensions of the SCAT are encouraged, as currently it is unable to compete with the SAT.
The American College Test (ACT)
The American College Test is a widely used aptitude test for college entrants. Its biggest strength is that it is particularly useful for non-native-English speakers. The results of the ACT consist of specific content scores and a composite score. In comparison with the SAT, the ACT has similar success in predicting college GPA, alone or in combination with high-school GPA. Despite this, its internal consistency coefficients are not as strong as those of the SAT.
Graduate and Professional School Entrance Tests
Graduate Record Examination Aptitude Test (GRE)
The GRE is among the most widely used tests for graduate-school entrance. The primary measure is general scholastic ability. The test is administered throughout the year at various examination centres across the globe. The test consists of three parts: verbal (GRE-V), quantitative (GRE-Q), and analytical reasoning (GRE-A). Based on Kuder-Richardson and odd-even reliability, the GRE is stable, with coefficients just slightly lower than those of the SAT. False-negative rates are high, and the GRE has been found not to be a significant predictor for a group of Native American students. It has also shown a tendency to over-predict achievement in younger students while under-predicting the performance of older students. Despite this, many schools have developed their own methods of using the GRE, either independently or in combination with other sources of data. The best way of using the GRE score is in conjunction with other data. When the GRE is combined with GPA, graduate success can be predicted with great accuracy. A common problem among colleges is that of grade inflation. This refers to the rise in average college grades in spite of the fact that average SAT scores are declining.
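For readers unfamiliar with the two reliability estimates named above, here is a minimal sketch of both, computed on a small made-up matrix of right/wrong item scores. KR-20 is computed as (k / (k − 1)) × (1 − Σpq / variance of total scores); the odd-even estimate correlates scores on odd- and even-numbered items and then applies the Spearman-Brown correction. The data, the function names, and the use of the population variance are assumptions for illustration and say nothing about how the GRE's published coefficients were actually obtained.

```python
from statistics import mean, pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for rows of 0/1 item scores."""
    k = len(responses[0])
    totals = [sum(person) for person in responses]
    total_var = pvariance(totals)  # assumption: population variance
    pq_sum = 0.0
    for item in range(k):
        p = mean(person[item] for person in responses)  # proportion correct
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / total_var)


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def odd_even_reliability(responses):
    """Split-half reliability (odd vs even items) with Spearman-Brown correction."""
    odd = [sum(person[0::2]) for person in responses]   # items 1, 3, 5, ...
    even = [sum(person[1::2]) for person in responses]  # items 2, 4, 6, ...
    r_half = pearson(odd, even)
    return 2 * r_half / (1 + r_half)


# Hypothetical data: 6 examinees x 6 items, 1 = correct, 0 = incorrect.
responses = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

print(f"KR-20:    {kr20(responses):.3f}")
print(f"Odd-even: {odd_even_reliability(responses):.3f}")
```

Both numbers estimate internal consistency from a single administration, which is why they are convenient for large group tests.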
Miller Analogies Test
Similar to the GRE is the Miller Analogies Test, another measure of scholastic aptitudes for graduate studies. The difference is that this test is strictly verbal. Hence, knowledge of specific content coupled with a proficient vocabulary are very useful tools. In terms of odd-even reliability, the Miller Analogies Test is sufficiently reliable. However, it does lack validity support. Also, this test tends to over-predict the GPAs of younger students and under-predict GPAs of older students, much like the GRE.
The Law School Admission Test (LSAT)
Taken under extreme time pressure, the LSAT is a test which requires almost no specific knowledge, and like the Miller Analogies Test, it contains some of the most difficult problems one can encounter on a standardized test. The three types of problems covered in the LSAT are related to: reading comprehension, logical reasoning, and analytical reasoning. Every single previously administered test since the format changed in 1991 is available for study. The LSAT has been found to be psychometrically sound. Researchers have raised concerns that the test favours whites over blacks and is biased. This and other concerns have led to a 10 million dollar initiative to increase diversity in American law schools.
Nonverbal Group Ability Tests
Raven Progressive Matrices (RPM)
The Raven Progressive Matrices test is among the most widely known and used nonverbal group tests. This test can be used anytime as an estimate of an individual’s intelligence, though it is most commonly used in an educational environment. The RPM instructions are very simple and can be given without the use of language. For this reason the test is used throughout the world. The test consists of 60 matrices, which contain a pattern with a piece missing. The RPM has the advantage of minimizing the effects of language and culture.
Goodenough-Harris Drawing Test (G-HDT)
Originally standardized in 1926 and then re-standardized in 1963, the Goodenough-Harris Drawing Test is one of the simplest, quickest, and most cost-efficient tests of nonverbal intelligence there is. Requiring just a pen and paper, the test asks subjects to draw a whole man and to do the best job possible. Subjects receive credit for each item they include in the drawing. Because it is so easy to administer, the test is commonly used. It gives a quick and rough estimate of a child’s intelligence. However, caution is advised, as results based purely on the G-HDT can be misleading.
The Culture Fair Intelligence Test
One goal of nonverbal tests has always been to restrict cultural influences on scores. The Culture Fair Intelligence Test was designed with this in mind, to provide an estimate of intelligence which is free of cultural and linguistic influences. Research has shown that this test does not succeed any more than any other test, however its popularity reflects the desire for a test which reduces cultural factors. The test has been found to be best applied for measuring intelligence of a Western European or Australian individual. More work is needed if the Culture Fair Intelligence Test is to compete with the RPM.
Standardized Tests Used in the U.S. Civil Service
The General Aptitude Test battery (GATB), which measures aptitude for a number of occupations, is a widely used test for assisting employment decisions. It measures a wide range of aptitudes. The GATB has been the subject of controversy, as it used within-group norming prior to the Civil Rights Act of 1991. For example women would only be compared with other women, men only with other men, Latinos with only other Latinos, etc. The argument was that within-group testing was done on the basis of fairness, however it was outlawed and labelled as reverse discrimination.
Standardized Tests in the U.S. Military: The Armed Services Vocational Aptitude Battery (ASVAB)
The ASVAB is a test designed by the Department of Defense which is administered to over 1.3 million individuals per year. The test consists of 10 subtests, which cover a wide range of factors. The psychometric characteristics of the ASVAB are exemplary. The test has been shown to be reliable and a valid predictor of performance during training for a variety of civilian and military occupations. The ASVAB has been moving away from the pen and paper format in favour of computerized testing. This allows the test to be adapted to the subject’s individual ability.
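The principle behind adapting a computerized test to the subject can be shown with a deliberately simplified up-and-down (staircase) procedure: each new item is the unused one closest in difficulty to the current ability estimate, and the estimate moves up after a correct answer and down after an incorrect one, in shrinking steps. This is not the ASVAB's actual algorithm, which the summary does not describe; the item bank, the difficulty scale, and the simulated examinee are all hypothetical.

```python
def run_adaptive_test(item_difficulties, answers_correctly, n_items=5, start=0.0, step=1.0):
    """Administer up to n_items items, choosing each to match the current estimate.

    item_difficulties: difficulties of the items in the bank (arbitrary scale).
    answers_correctly: callable taking a difficulty and returning True/False.
    Returns the final ability estimate and the list of (difficulty, correct).
    """
    ability = start
    remaining = list(item_difficulties)
    administered = []
    for _ in range(min(n_items, len(remaining))):
        # Pick the unused item whose difficulty is closest to the estimate.
        item = min(remaining, key=lambda d: abs(d - ability))
        remaining.remove(item)
        correct = answers_correctly(item)
        # Move the estimate up or down, then halve the step size.
        ability += step if correct else -step
        step /= 2
        administered.append((item, correct))
    return ability, administered


# Hypothetical item bank and a simulated examinee who passes items up to 1.2.
item_bank = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
ability, administered = run_adaptive_test(item_bank, lambda d: d <= 1.2)

print(f"Final ability estimate: {ability}")
for difficulty, correct in administered:
    print(f"  item difficulty {difficulty:+.1f}: {'correct' if correct else 'incorrect'}")
```

Because items are chosen near the examinee's current estimate, few items are wasted on questions that are far too easy or far too hard, which is the practical appeal of adaptive formats.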
Hypothesis of projection
Projective tests are regarded as very controversial and often misunderstood methods of psychological testing. However, five of the ten most used testing procedures in clinical settings are projective techniques. The projective hypothesis is the basic concept behind projective tests. It suggests that when people try to make sense of a vague stimulus, their interpretation tells us something about their feelings, experiences, thoughts, needs, etc. The issue that arises is that examiners can never make secure assumptions about the responses of the test takers and their evaluation of what they see. Some research, however, supports the use of projective tests and their validity. Findings are contradictory.
Rorschach inkblot
The Rorschach inkblot test has been regarded both as the most powerful psychometric tool available and as little more than a party game; the support and findings are highly ambiguous. However, the test is still widely used. Rorschach’s research on the inkblots started in 1911 and was later published in his famous book Psychodiagnostik. Initially, the material was largely ignored, but over time the test became increasingly popular. Several people studied the Rorschach thoroughly and took it further, though they often disagreed with each other; in effect, each developed their own way of administering and scoring the test. The Rorschach is an individual test. It presents the test taker with 10 cards: five are black and grey, two are black, grey, and red, and three contain various colours. When presented with a card, the subject is asked what it might be, and there are no rules about the answer the subject gives. The lack of clear rules and structure about what is expected from the subject is the primary feature of projective tests. The examiner should be as ambiguous as possible.
In the first phase ‑ free association phase ‑ the examiner presents the cards one at a time and if the subject responds with only one explanation then the examiner might encourage him/her to explain more by saying something like: “Most people see more than one thing” or “Take your time, since people usually see something here”. Secondly, in the inquiry phase, the examiner presents the subject with the cards again and scores the responses in five dimensions: location, determinant, form quality, content, and frequency of occurrence ‑ all in regard to what the subject spotted in the inkblot.
To score the location, a small version of the inkblot card, the location chart, is used. The examiner records whether the subject made use of the whole blot (W), a common detail (D), or an unusual detail (Dd). A confabulatory response (DW) occurs when the subject overgeneralizes from a part to the whole. Normal subjects tend to show a balance of W, D, and Dd responses; otherwise, some problem is suspected. Furthermore, the examiner needs to assess what led the test taker to see that particular feature; this is known as the determinant. Was it movement, colour, shading, or shape that led to the response? If only shape is used, for example, the response is called a pure form response. Movement is a problematic determinant, since it is an ambiguous concept in this situation. Identifying the determinant is regarded as the most difficult aspect of the Rorschach. Scoring content appears to be quite simple: responses are mostly categorized as human (H), animal (A), or nature (N). Populars are the responses most frequently given. Form quality refers to the degree to which the response matches the features of the stimulus in the inkblot, and scoring it is quite hard. Evidently, scoring the Rorschach is a difficult and complex process that requires advanced graduate training.
Research on the psychometric properties of the Rorschach indicates that after the 1960s the test was seen as much less impressive than had been believed. After this, the Comprehensive System for scoring the Rorschach became prominent and has remained widely accepted up to the present. However, this system failed to remedy the Rorschach’s inadequacies. Research suggests that Rorschach results tend to identify more than half of normal people as emotionally disturbed; this is referred to as overpathologizing.
Another problem is that people who give more responses (Rs) to the inkblot tend to evaluate the whole area bordering the inkblot, which is not the original idea of the test. Furthermore, there are no fixed rules on how to administer the Rorschach. Reliability research provides inconsistent results. Even when reliability is established, validity remains questionable.
An alternative to the Rorschach is the Holtzman Inkblot Test, which allows the test taker to give only one response per card and in which administration and scoring are standardized. Despite the apparent advantages of this test, it is still not nearly as popular as the Rorschach.
TAT-Thematic Apperception Test
The Thematic Apperception Test (TAT) is a test that could be compared to the Rorschach and is similar to it on several levels. The TAT became very popular after its appearance and is nowadays used more than any other projective test. It is based on Murray’s theory of needs, whereas the Rorschach is not grounded in any theory. The TAT is not presented as an instrument for diagnosis, but as an instrument for evaluating human personality characteristics. It is regarded as one of the crucial techniques used in personality research. It is more structured and less vague than the Rorschach. There are 30 pictures and one blank card; some of the cards are specifically aimed at male subjects, while others are meant for female ones. Also, some are more appropriate for particular age groups, and some of the cards are appropriate for everyone. However, the standardization of administration and scoring procedures is as poor for the TAT as for the Rorschach, if not worse. When interpreting the TAT, the examiner considers the notions of needs, press (the environment that influences the satisfaction of needs), themes (for example, the frequency of depression), heroes (who you identify with), and outcomes (success/failure). The psychometric properties of the TAT are inconsistent, due to the lack of standardization. The test-retest results also seem to be inconsistent. Findings on validity are even less clear. Content validity has some research support, while support for criterion validity has been hard to find.
Other projective procedures
Projective tests do not have to include pictures; they may instead provide words or phrases as a stimulus. In word association tests, a psychologist says a word out loud and the subject’s task is to say whatever comes to mind first. The use of this test is limited, but still present. Sentence completion tasks present the beginnings of incomplete sentences to be finished (“I am…”, “Men…”). Possibly the best projective test, psychometrics-wise, is the Washington University Sentence Completion Test (WUSCT), which gives insight into ego development, self-acceptance, autonomy, etc. Figure drawing tests use expressive techniques that require the subject to create something, such as a drawing. One of these tests that is regarded as valid and useful in the clinical setting is the Goodenough Draw-a-Man Test. It is simple and practical.
Neuropsychological assessment measures
Clinical neuropsychology is a field of study that puts emphasis on the research of psychological deficiencies of the central nervous system and the treatments for it. It examines the relationships between brain functions and behaviour, covering the areas of cognitive, emotional, and sensory processing. The impairments in the spinal cord are also studied. This field is a mixture of the use of psychiatric, psychometric, and neurological practices. This field is concerned with memory, learning, spatial recognition, language, attention, and similar processes.
Neuroimaging has provided the field of clinical neuropsychology with remarkable opportunities for research and development. As the neuroimaging techniques developed over time, it became clearer that the brains of individuals differ in structure and organization. The cornerstone of the practice of clinical neuropsychology is work of Broca and Wernice who were studying the speech locations in the brain. Neuropsychologists within the field tend to be very specialized for certain areas or age groups. These specialists are usually concerned with brain dysfunctions but others work with brain injuries and similar problems.
Furthermore, memory is one of the most widely researched constructs within the field. Memory dysfunctions are assessed with the Wechsler Memory Scale-Revised (WMS-R), the RANDT Memory Test (RTM), the Memory Assessment Scales (MAS) and the Luria-Nebraska battery. Short-term memory, specifically, is best assessed with the use of verbal tests.
Recent research challenges the view that problems in functioning are tied to problems in a specific location in the brain. The newer view suggests that complex functioning is regulated by neural systems rather than by specific structures in the brain.
One of the more intensively studied areas is the comparison of deficits associated with the left hemisphere and those associated with the right. The findings usually come from studies of brain damage or from stimulation during surgery in patients who needed an operation (e.g. for epilepsy). Many kinds of deficits and impairments are tested within clinical neuropsychology, such as Wernicke’s aphasia, various apraxias, deficits of the information-processing system, and so on.
Developmental neuropsychology focuses on the deficits and complications children experience and on how these change over time. Studying children poses serious challenges. One is that children are still developing, so some deficits are revealed only much later. Another is that children’s brains have a remarkable capacity to reorganize after an injury and actively strive for recovery; this capacity is called plasticity. There is a wide variety of neuropsychological tests for children, for example the Child Development Inventory, the Children’s State-Trait Anxiety Scale, and the Reynolds Depression Scale. These tests focus on measures of adaptation and development.
Another group of tests focuses on attention and executive function in children. An example is the Trail Making Test, which assesses several cognitive skills, including attention, sequencing, and thought processing. It is important to note that attention and executive functioning are not considered the same construct. Executive function embraces volition, such as being able to form a specific goal and to take action in order to achieve it. Self-monitoring and self-control fall within the same notion. Mental processing, in turn, includes four factors that are believed to be related to different regions of the brain: focus-execute, the ability to scan information and respond to it appropriately; sustain, the capacity to remain attentive over a period of time; encode, the capacity to store information and later recall it; and shift, the ability to be flexible.
Learning disabilities refer to neuropsychological problems with reading and speech. Dyslexia is a type of reading disability in which people have difficulty decoding separate words. It may have a genetic basis, or it may result from difficulties in processing phonemes.
The Concussion Resolution Index (CRI) is a product of neuropsychological research that is used to track the recovery of athletes who have experienced a concussion.
Anxiety and stress assessment measures: State-trait Inventory, measures, coping measurements
Stress refers to a response to events and situations that impose constraints, demands, and the like. The concept of stress can be divided into three components: frustration, pressure, and conflict. We are frustrated when the road to achieving our goal is blocked. This can take physical as well as mental forms: you can be stopped at the club entrance or be turned down by a university or an employer. Either way, it is likely that one would feel frustrated. Stress induced by conflict appears when we have to make a decision, such as choosing between two important things. Finally, pressure is present when tasks need to be sped up. It can be external, for example when a boss sets a deadline, or internal, when you put pressure on yourself in order to reach a goal on time.
Reactions to these stressful situations most commonly lead to anxiety, an emotional state manifested by tension and worry. Physical changes occur as well: your heart pounds, your hands sweat, and so on. Two types of anxiety can be distinguished. State anxiety is a reaction that changes from one situation to another. Trait anxiety, on the other hand, is a personality feature that stays stable across situations. The State-Trait Anxiety Inventory (STAI) is based on this theory of anxiety and yields two scores, one for state anxiety (A-State) and one for trait anxiety (A-Trait). Validity and reliability for this inventory are high. The two components of the STAI appear to measure what they are supposed to measure, since they capture different aspects of anxiety. The inventory is available in many languages and is suitable for different age groups.
Test anxiety manifests itself in two ways: task-relevant responses, which direct your energy toward your goal (achieving a good grade), and task-irrelevant responses, behaviour that restricts your performance. When taking a test, students usually respond in the second manner, which interferes with their performance.
The Test Anxiety Questionnaire (TAQ) is one of the first tests used to measure test anxiety. The motivational states present in a test-taking situation are divided into the learned task drive and the learned anxiety drive. The first refers to the motivation to give responses that correspond to the task at hand, while the second consists of task-relevant and task-irrelevant responses. As for the psychometric features of this test, the reliability is known to be high. One criticism of the TAQ is that it deals much more with state anxiety than with trait anxiety.
Some other test anxiety measures suggest that test anxiety actually has two different components: emotionality and worry. Emotionality refers to the physical response we experience when taking a test, such as changes in heart rate and muscle tension. Worry refers to mental preoccupation with possible failure and the personal consequences it would have for the individual. Spielberger’s 20-item Test Anxiety Inventory treats worry as a trait that is consistent over time. Emotionality, on the other hand, is the way in which arousal is expressed in a specific situation. The theory therefore treats the emotional component as a state, or situational aspect.
A 5-item version of this test was recently published.
Another approach to the same topic is the Achievement Anxiety Test (AAT). This is an 18-item scale that measures two different components of anxiety: facilitating and debilitating. Facilitating anxiety is a state that motivates the person to perform, whereas debilitating anxiety interferes with performance. The facilitating component simply gets a person to worry enough to, for example, do the work before the deadline. In this respect, facilitating anxiety is helpful while debilitating anxiety is not.
An important question is how different people cope with anxiety. One measure for assessing this is the Ways of Coping Scale, a 68-item checklist. The scale has seven subscales concerned with problem solving, wishful thinking, advice seeking, growth, support seeking, threat minimizing, and self-blaming. These seven are further classified as either problem-focused or emotion-focused. Problem-focused strategies are active ways of coping that attempt to change the source of stress, while emotion-focused strategies do not try to change the stressor but focus on managing the emotional responses the person is experiencing. This coping scale is widely used, but some research has failed to replicate the findings.
Closely related to this measure is the Coping Inventory, a 33-item measure. It first describes the attitudes that people take to avoid stress, then includes items that describe strategies for dealing with stressful situations, and finally considers how much each of those strategies would help the person cope. It is used with both adolescents and adults.
Life quality assessment: Quality of life measure, measuring methods: SF-36, NHP, Decision theory approaches
The two most common definitions of health rest on two observations. First, most people agree that they would not like to die early, so the avoidance of death is an aspect of health. Second, people value quality of life; disease and disability matter because they affect quality of life as well as the length of life.
Quality-of-life measurement is conceptualized in two different ways: the psychometric approach and the decision theory approach. The psychometric approach tries to provide separate measures for the different dimensions of quality of life. The best-known example is the Sickness Impact Profile (SIP), a 136-item measure. The decision theory approach, in contrast, weighs the different dimensions of health against each other in order to provide a single, unified expression of health status. In the end, quality-of-life measures and views tend to be seen as highly subjective.
Several options exist for measuring quality of life. The SF-36 is a commonly used tool that covers eight health concepts: physical functioning, role-physical, bodily pain, general health perceptions, vitality, social functioning, role-emotional, and mental health. The advantages of the SF-36 are that it is quick to administer and has a substantial level of reliability and validity. However, it does not contain age-specific questions, which is an obvious problem when assessing health and quality of life.
The Nottingham Health Profile (NHP) is another approach and consists of two parts. The first part has 38 items divided into six categories: energy, pain, sleep, physical mobility, social isolation, and emotional reactions. Scores within each category are rescaled so that they range from 0 to 100. The second part consists of seven statements about the areas of life most likely to be affected by health: employment, household activities, social life, home life, sex life, hobbies and interests, and holidays. The NHP has some support from reliability and validity studies.
Structured personality tests
Personality characteristics can be defined as the nonintellective features of human behaviour and are essential in clinical and counselling settings. Personality is generally defined as a set of relatively stable and unique behavioural patterns that characterize the way an individual reacts to the world around him/her. Personality traits are persisting features of the way we act, feel, or think that distinguish one person from another. Personality types are general descriptions of people; for example, very sociable people tend to go out and engage in conversation a lot. Personality states refer to our emotional reactions, which differ from one situation to another. Lastly, self-concept is the way we define ourselves: an organized and relatively consistent set of thoughts a person has about him/herself (Rogers, 1959).
Personality tests did not begin to develop until the First World War. There was a need for a test that could be used to screen out people who were not fit for military service. Psychologists came up with self-report questionnaires, on which people report on themselves by marking each statement as true or false. The difference between the structured and the projective method of assessment is that in the structured case the person responds to a written statement with a fixed choice (yes-no, true-false). In a projective assessment the stimulus itself is vague and there are few rules for how the person may respond.
Structured personality tests are constructed with broadly deductive or empirical strategies. Deductive strategies comprise the logical-content and the theoretical approach, while empirical strategies comprise the criterion-group and the factor analytic approach. Sometimes these procedures are combined.
Deductive strategies use reason and deductive logic to determine the meaning of a test response. The logical-content strategy assumes that the test item describes the subject’s behaviour and personality: if a person answers “false” to a statement that he/she is friendly, the test administrators will take this at face value and assume the person is not friendly. In the theoretical strategy, items must first of all be consistent with the theory. This strategy tries to create a homogeneous scale and may use statistics to analyse the items.
Empirical strategies rest on data collection and statistical procedures to determine the meaning of a test response or the nature of personality and psychopathology. One feature of these strategies is that they use experimental research to determine the meaning of a test response, the major dimensions of personality, or both. The criterion-group strategy starts with a criterion group: a group of individuals who share a specific characteristic, such as depression or leadership. A set of items is then administered to the criterion group as well as to a control group that is representative of the general population. The examiners contrast the two groups and keep the items that distinguish them. The resulting scale is cross-validated by checking how well it distinguishes the two groups in an independent sample. The factor analytic strategy uses factor analysis to derive the basic dimensions of personality empirically. It boils the data down to a small number of descriptive units, with the aim of accounting for as much of the variability in the data as possible with as few factors as possible.
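The item-selection step of the criterion-group strategy can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the response data, the function name `select_items`, and the 0.25 difference threshold are hypothetical, and a real scale would rely on proper significance tests and on cross-validation in a new sample.

```python
def select_items(criterion_group, control_group, min_difference=0.25):
    """Return indices of items whose 'true' rate differs between the groups.

    Each group is a list of respondents; each respondent is a list of
    0/1 answers (1 = 'true'). min_difference is a hypothetical threshold.
    """
    n_items = len(criterion_group[0])
    selected = []
    for item in range(n_items):
        rate_criterion = sum(p[item] for p in criterion_group) / len(criterion_group)
        rate_control = sum(p[item] for p in control_group) / len(control_group)
        if abs(rate_criterion - rate_control) >= min_difference:
            selected.append(item)
    return selected


# Hypothetical answers to three true-false items.
criterion = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 0, 0]]  # e.g. diagnosed group
control = [[0, 1, 0], [0, 0, 0], [1, 1, 0], [0, 0, 1]]    # general population
kept_items = select_items(criterion, control)
print(kept_items)
```

Only the first item separates the two hypothetical groups, so it is the only one kept. Note that nothing is assumed about what the item means; it is retained purely because the groups answer it differently.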
Logical-content strategy
The logical-content strategy produced one of the first personality tests, the Woodworth Personal Data Sheet. It was developed during the First World War to identify people who would not be fit for combat. The items were selected on a logical-content basis, with two additional features. First, items endorsed in the scored direction by 25% or more of a normal sample were excluded from the test. This reduced the number of false positives (people identified as at risk who actually were not). Second, only symptoms that occurred twice as often in a previously diagnosed group as in normal individuals were included. Two other well-known tests of the same category were the Bell Adjustment Inventory and the Bernreuter Personality Inventory. They laid the groundwork for many modern multidimensional tests, which provide multiple scores rather than a single one. The main criticism of the logical-content approach is that subjects may not be able to evaluate their own behaviour objectively, and even when a response is close to accurate, there is a risk that the item in question is misinterpreted. This is likely to lead to biases.
Criterion-group strategy
The criterion-group strategy works on the idea that nothing should be assumed about the meaning of a person’s response to a test item; the meaning should be determined by empirical research. The Minnesota Multiphasic Personality Inventory (MMPI) is a true-false self-report questionnaire. Its core features are its clinical scales, content scales, and validity scales. The clinical scales were designed to assess psychological abnormalities and disorders, the content scales group items that appear to have something in common, and the validity scales provide information about the subject’s approach to testing, such as faking bad or faking good. The main purpose of the MMPI is to help distinguish normal from abnormal groups. It was initially designed to aid in the assessment of major psychiatric and psychological disorders. About 800 patients were originally used to develop the test’s criterion groups, but this number was later drastically reduced. In the end there were eight criterion groups of about fifty patients each: hypochondriacs, depressives, hysterics, psychopathic deviates, paranoids, psychasthenics, schizophrenics, and hypomanics. The major criticism of the MMPI was that the control group consisted of relatives and visitors of the patients (officially excluding mental patients).
The new version, the MMPI-2, has a much better and more representative control sample. To the eight scales, the masculinity-femininity (M-F) scale was added, as well as the social introversion scale, which measures introversion and extraversion.
The validity scales were developed in response to criticism and assess how a subject approaches the test: normally and honestly, or not. The L (lie) scale was developed to detect individuals who try to present themselves more favourably than is realistic. The K scale was built from items that distinguished abnormal from normal groups in cases where both groups produced a normal test pattern. It was believed that pathological groups produce normal patterns when they are being defensive (denying and hiding their problems), and that this defensiveness, which is absent in normal individuals, would help reveal the pathology. The F scale was designed to detect people who are “faking bad”, that is, presenting their situation as worse than it is. A high F score raises a validity concern because it indicates a tendency to exaggerate. An additional “cannot say” scale covers items for which the person fails to mark either true or false. One drawback is that a person with a specific disturbance such as schizophrenia usually does not score high on only one scale, but on two, three, or more. To deal with high scores on multiple scales, pattern analysis is used, as described before. The idea of analysing the two highest scales, the two-point code, highlighted the need for research on individuals who score high on two specific scales.
A further development was the restandardization and improvement of the MMPI, which resulted in the MMPI-2. The aim was to revise problematic items and to increase the number and variety of items, while retaining many features of the original MMPI. A separate form of the MMPI specifically for adolescents was a new idea. Major improvements in the MMPI-2 were the added validity scales: the Variable Response Inconsistency Scale (VRIN), which evaluates random responding, and the True Response Inconsistency Scale (TRIN), which measures acquiescence, the tendency to mark “true” irrespective of content.
The psychometric features of the MMPI and its revision are closely comparable. Reliability of both tests is high, at .90. Intercorrelations between the scales are extremely high, which makes the validity of pattern analysis highly questionable. Another drawback is the imbalance in the way the items are keyed. This is a problem because many people approach tests with a specific response style, which leads them to mark items in a particular way regardless of their meaning. However, the validity of both versions is strongly supported by many studies that evaluate the features of specific profile patterns. For example, many studies on alcoholism and substance abuse using the MMPI suggest that the test can at least help predict who might become an alcoholic later on (people with higher scores on the F scale).
The California Psychological Inventory (CPI) is another structured personality test, based primarily on the criterion-group strategy. The third edition has 36 scales that measure personality features such as introversion-extraversion, self-realization and sense of integration, and conventionality versus unconventionality in following norms. In contrast to the MMPI and MMPI-2, the CPI attempts to measure the personality of normally adjusted individuals, and for this reason it is used more in counselling settings. Many of its items are very similar to MMPI items, and many are identical. Its reliability is also similar to that of the MMPI. Its advantage over the MMPI is that it can be applied to normal individuals.
Factor analytic strategy
The factor analytic strategy, as mentioned before, reduces a large set of items or scores to a minimum number of common factors by combining related components, and thereby summarizes the variability in the data more economically. Nowadays this is done with computers, but the approach predates them: it built on the basic strategy of correlating the scores of a new test with the scores of an older test intended to measure the same thing. Guilford took this one step further by intercorrelating many tests and applying factor analysis to the results, looking for the major dimensions underlying all personality tests. Guilford’s work was overshadowed by the emergence of the MMPI.
Cattell started by collecting all the adjectives that could be applied to human beings. He narrowed a very large number down to 171, and then to 36 dimensions (surface traits). This set was further reduced to 16 distinct variables, the source traits. Out of this project emerged the well-known Sixteen Personality Factor Questionnaire (16PF). The test has a median short-term test-retest correlation (across the 16 traits) of .83. Professionals do not find the 16PF as useful as the MMPI. On the other hand, much research supports the validity of the 16PF, so the overall evaluation is mixed. A general problem with factor analysis is the highly subjective way in which factors are named: the single label that appears in the test or questionnaire stands for many underlying components that were collapsed into it.
Theoretical strategy
The theoretical strategy developed from the idea of using theory to construct personality tests and thereby avoid biases and problems. The Edwards Personal Preference Schedule (EPPS) is one of the first and best-known tests of its kind. It is not actually a test in the strict sense, since there are no right or wrong answers, and it is widely used in counselling. It is theoretical in that it is based on the work of Murray, who proposed a list of human needs such as the need to accomplish, the need for attention, and the need to conform. Edwards selected 15 needs from Murray’s list and constructed items for each of them. When taking the test, the subject is asked to choose one need over another in each pair of statements (thereby rejecting the other). Because of this forced choice, the subject’s score on one scale can be compared with his/her scores on the other scales; the result is called an ipsative score. Ipsative scores give results in relative terms: they compare the individual with him/herself and show the relative strength of each need within that individual. Test-retest reliability coefficients range from .74 to .88, which is high and satisfactory for personality testing. The test is widely used in applied settings and is one of the most intensively researched.
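The forced-choice logic behind ipsative scores can be made concrete with a small Python sketch. It is a simplified illustration, not the actual EPPS scoring procedure: the three need labels, the six item pairs, and the function name `ipsative_scores` are assumptions made for the example.

```python
from collections import Counter


def ipsative_scores(choices, needs):
    """Count how often each need was preferred across forced-choice pairs."""
    counts = Counter(choices)
    unknown = set(counts) - set(needs)
    if unknown:
        raise ValueError(f"choices contain unknown needs: {sorted(unknown)}")
    return {need: counts[need] for need in needs}


# Hypothetical subset of needs; the EPPS itself covers 15.
needs = ["achievement", "affiliation", "autonomy"]

# One entry per item pair: the need whose statement the respondent chose.
person_a = ["achievement", "achievement", "autonomy",
            "achievement", "affiliation", "autonomy"]
person_b = ["affiliation", "affiliation", "autonomy",
            "affiliation", "affiliation", "achievement"]

scores_a = ipsative_scores(person_a, needs)
scores_b = ipsative_scores(person_b, needs)
print(scores_a, sum(scores_a.values()))
print(scores_b, sum(scores_b.values()))
```

Every respondent's scores add up to the same total (here the six item pairs), because choosing one need always means rejecting another. This is why ipsative scores show the relative strength of needs within one person and cannot show that one person has a stronger need than another in absolute terms.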
Combination strategies
Combination strategies reflect a modern trend of using a mixture of the strategies described above to develop personality tests.
Positive Personality Measurement and the NEO Personality Inventory-Revised (NEO-PI-R): research suggests that it may be useful to measure people’s positive characteristics in order to understand the capacities an individual has and how these influence his/her behaviour and life. The concept of hardiness refers to the way one copes with stressful situations: a hardy person sees stressful situations as meaningful and changeable (Kobasa, 1979). Furthermore, Bandura (1986) suggested that people with a strong sense of self-efficacy tend to believe they are in control and face hard times with “hardiness”. Research supports the idea that to lead a satisfying life one needs to focus on positive personal characteristics and not only on the absence of psychopathology.
The NEO-PI-R supports this idea and uses both theory and factor analysis to create its scales. Its three broad domains are N for Neuroticism, E for Extraversion, and O for Openness. Each of these domains has six facets. The six facets of Neuroticism are anxiety, depression, hostility, self-consciousness, vulnerability, and impulsiveness. The six facets of Extraversion are warmth, activity, assertiveness, excitement seeking, positive emotions, and gregariousness. Finally, Openness consists of values, actions (trying out new activities), fantasy, feelings (openness to them), aesthetics, and ideas (intellectual). Responses are given on a 5-point Likert scale. Fourteen items were written for each of the 18 facets, 7 positively worded and 7 negatively worded. The reliability of all three domains is quite high (high .80s to low .90s), for both test-retest reliability and internal consistency. The individual facets have lower reliability.
The NEO-PI-R supports the well-known five-factor model of personality, whose five dimensions are extraversion, neuroticism, conscientiousness, agreeableness, and openness to experience. Conscientiousness has been the most widely researched and consists of two major parts: dependability and achievement. It has been found to be a good, positive predictor of performance in all professions studied so far. It is also positively correlated with effective ways of coping with stress and with life satisfaction. Openness correlates highly with crystallized intelligence, and agreeableness, extraversion, and openness are very useful in predicting success in particular job environments.
The NEO-PI-R is a modern approach to personality test construction that combines logic, theory, and statistical methods such as factor analysis in order to provide sound test results and behavioural insights with fewer biases and problems.
Personnel psychology-selecting employees
Industrial/organizational (I/O) psychology emphasizes structured psychological testing and relies on research and quantitative methods. The main areas of I/O psychology are personnel psychology, which is concerned with job recruitment, employee selection, and performance evaluation, and organizational psychology, which is concerned with the motivation and satisfaction of employees as well as leadership and other factors present in organizations.
The employment interview is used in industry and business to help make selection and promotion decisions. Research supports the structured interview format. Employment interviews are known to involve a greater search for negative than for favourable evidence. Webster noted in his research that the first impression has a great impact on the candidate’s evaluation: if the early impression is negative, the final rejection rate is about 90%, while it drops to 25% if the first impression is positive. Negative factors that usually lead to rejection are low enthusiasm, nervousness, lack of eye contact, lack of confidence, and poor communication skills. Positive factors include poise and self-confidence, the ability to sell oneself, etc.
A good first impression goes with wearing professional attire, appearing competent, showing expertise, conveying friendliness or warmth through non-verbal cues, and not overdoing any of these. One study showed that female participants who both wore perfume and expressed friendly nonverbal cues were evaluated negatively, whereas either wearing perfume or being friendly on its own was evaluated positively.
Base and hit rates
The hit rate is the percentage of cases in which a test accurately predicts success or failure, for example in employment. The base rate is the percentage of cases that succeed when no test is used for prediction. The true value of a test becomes apparent when the hit rate is compared with the base rate. For dichotomous (two-choice) decisions, a cut-off score, or cutting score, is usually used: those who score above the cutting score are, for example, hired, while those below are not. Establishing a cutting score does not guarantee a correct decision.
Misses occur when the test leads to the wrong conclusion. A false negative would be diagnosing someone with a benign tumour when he/she actually has a malignant one. A false positive is the other kind of miss: for example, hiring someone on the basis of the test results who later performs poorly. Hits can likewise be positive or negative: hiring someone who ends up performing well is a true positive, and not hiring someone who would not have performed well is a true negative. False negatives and false positives have different meanings depending on the environment and context. A child who is rated as potentially aggressive but does not become so is a false positive, even though the outcome is a good one. Hits and misses are assessed using cutting scores, and these involve criterion validity.
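A short Python sketch may help to keep the four outcomes apart. The applicants, their scores, and the cutting score of 60 are invented for illustration, and the sketch assumes that the later job performance of rejected applicants is known, which is rarely true in practice.

```python
from collections import Counter


def classify(score, succeeded, cutting_score):
    """Label one decision as a hit or a miss relative to the cutting score."""
    selected = score >= cutting_score
    if selected and succeeded:
        return "true positive"    # hit: selected and performed well
    if selected and not succeeded:
        return "false positive"   # miss: selected but performed poorly
    if not selected and succeeded:
        return "false negative"   # miss: rejected but would have done well
    return "true negative"        # hit: rejected and would have done poorly


def hit_rate(cases, cutting_score):
    """Return the proportion of hits and the count of each outcome."""
    counts = Counter(classify(score, ok, cutting_score) for score, ok in cases)
    hits = counts["true positive"] + counts["true negative"]
    return hits / len(cases), counts


# Hypothetical applicants: (test score, succeeded on the job?)
applicants = [(72, True), (65, True), (58, False), (81, True),
              (49, False), (61, False), (45, True), (38, False)]

rate, counts = hit_rate(applicants, cutting_score=60)
base_rate = sum(ok for _, ok in applicants) / len(applicants)
print(rate, base_rate, dict(counts))
```

With these made-up data the test decides correctly in 6 of 8 cases, while half of the applicants succeed regardless of testing. Moving the cutting score changes the balance between false positives and false negatives, which is why a cutting score on its own does not guarantee correct decisions.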
Taylor-Russell tables
Taylor-Russell tables were developed as a method for evaluating the validity of a test, more specifically for examining whether the test is better than chance and serves its purpose. To use the tables properly, the following information is required:
1. Definition of success: for each situation tested, a successful outcome must be clearly defined (e.g. over 5.5 is a pass, below is a fail).
2. Determination of the base rate: the percentage of people who would count as a success if no testing procedure were used.
3. Definition of the selection ratio: the percentage of applicants who are selected or admitted.
4. Determination of the validity coefficient: the correlation between the test and the criterion.
In short, a Taylor-Russell table gives the likelihood that a person selected on the basis of the test score will succeed. There is a separate table for each base rate. The most useful tests are those with high validity used with a low selection ratio. If validity is low and the selection ratio is high, the test is not very useful. A validity of zero means the test is no better than selecting people by chance, while a high selection ratio indicates that almost everyone is selected. Even though some rejected applicants would have performed well, the percentage of people who succeed is higher among those selected than among those rejected. One drawback of these tables is that they require a dichotomous outcome: success or failure.
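The way the three inputs combine can be approximated numerically. The sketch below assumes that test score and criterion follow a bivariate normal distribution with correlation equal to the validity coefficient, which is the usual modelling assumption for such tables; the function name and the simple midpoint integration are choices made for this example, and the output is not a substitute for the published tables.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def success_rate_among_selected(base_rate, selection_ratio, validity, steps=4000):
    """Approximate P(success | selected) under a bivariate normal model."""
    if not (0 < base_rate < 1 and 0 < selection_ratio < 1 and -1 < validity < 1):
        raise ValueError("rates must lie in (0, 1) and validity in (-1, 1)")
    x_cut = STD_NORMAL.inv_cdf(1 - selection_ratio)  # cutting score on the test
    y_cut = STD_NORMAL.inv_cdf(1 - base_rate)        # criterion level counted as success
    upper = max(x_cut, 0.0) + 8.0                    # practical upper integration limit
    width = (upper - x_cut) / steps
    residual_sd = sqrt(1 - validity ** 2)
    joint = 0.0                                      # P(selected and successful)
    for i in range(steps):
        x = x_cut + (i + 0.5) * width                # midpoint rule
        p_success = 1 - STD_NORMAL.cdf((y_cut - validity * x) / residual_sd)
        joint += STD_NORMAL.pdf(x) * p_success * width
    return joint / selection_ratio


no_validity = success_rate_among_selected(0.5, 0.5, 0.0)
moderate = success_rate_among_selected(0.5, 0.5, 0.5)
selective = success_rate_among_selected(0.5, 0.1, 0.5)
print(round(no_validity, 2), round(moderate, 2), round(selective, 2))
```

With zero validity the success rate among those selected equals the base rate, so the test adds nothing. With a validity of .50 and half of the applicants selected it rises to about two thirds, and it rises further when the selection ratio is lowered, which matches the pattern described above.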
Utility theory was developed as an alternative to the Taylor-Russell tables in order to evaluate outcomes at levels beyond ‘success’ and ‘failure’. It is used in selection procedures, mostly personnel selection, and has more recently found its place in education and some other fields. Research demonstrates financial advantages of using utility theory models to select employees.
Incremental validity
Incremental validity refers to the unique information gained from using a test: how much the test contributes to a prediction beyond what simpler measures could provide. The idea is that validity and reliability alone are not enough; we also need to assess how much value the test adds.
Furthermore, some results indicate that self-reported tests can be as good in predicting traits/responses as some complex personality tests (Hase & Goldberg, 1967). This is not always the case since it is known that supervisors are, for example, bad raters. Take interview validity as another example. Situational interviews have higher validity than job-related interviews, whereas psychologically based ones had the lowest validity of all interviews studied. The general thought is that one should consider using cheaper or simpler methods before involving more complex ones, since the fact that they are complex does not grant higher validity.
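One common way to put a number on incremental validity is the gain in explained criterion variance (R squared) when the new test is added to a simpler measure. The summary itself does not give a formula, so the sketch below is an assumption-laden illustration: it uses the standard two-predictor formula for R squared from correlations, and the function name `incremental_validity` and its parameter names are hypothetical.

```python
def incremental_validity(r_simple, r_test, r_between):
    """Gain in R squared from adding a test to a simpler measure.

    r_simple  -- correlation of the simple measure with the criterion
    r_test    -- correlation of the new test with the criterion
    r_between -- correlation between the simple measure and the new test
    """
    if not (-1 < r_between < 1):
        raise ValueError("r_between must lie strictly between -1 and 1")

    # R squared for a regression of the criterion on both predictors.
    r2_both = (r_simple ** 2 + r_test ** 2
               - 2 * r_simple * r_test * r_between) / (1 - r_between ** 2)

    # R squared for the simple measure alone is just its squared correlation.
    r2_simple = r_simple ** 2

    return r2_both - r2_simple


if __name__ == "__main__":
    # Both measures correlate .50 with the criterion; only their overlap changes.
    print(round(incremental_validity(0.5, 0.5, 0.0), 3))  # no overlap
    print(round(incremental_validity(0.5, 0.5, 0.8), 3))  # heavy overlap
```

When the two measures are unrelated, the new test adds its full squared validity (.25 in the example). When they overlap heavily, the gain shrinks to under .03, which is the situation in which the cheaper measure is the better choice.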
Employee’s perspective – fitting people to jobs
Personnel psychology aims to match people and jobs. Temperament is very often seen as a critical component of job satisfaction. The Myers-Briggs Type Indicator (MBTI) is based on Jung's theory, which introduces four main types (ways we experience the world around us): sensing, intuition, feeling, and thinking. Feeling refers to being attentive to the emotional aspects of experience, and sensing to gaining knowledge through hearing, touching, etc. Jung believed that even though we all strive for some sort of balance among the four types, every person tends to emphasize one of them. Another dimension Jung described was extraversion versus introversion. The Myers-Briggs indicator is used to assess the extraversion/introversion dimension and to identify which of the four types a person emphasizes. The MBTI is widely used, mostly to explore communication styles, leadership skills, and self-efficacy.