Test development consists of several steps:
- The construct of interest
The latent variable (construct) that the test is supposed to measure has to be specified. Latent variables can vary in scope (1), in content (2) and between educational and psychological variables (3).
- The measurement mode
The test has to specify how the latent variable will be measured. There are three modes: self-performance mode (1), self-evaluation mode (2) and other-evaluation mode (3). Modes can be reactive or non-reactive. In a reactive mode, test takers can deliberately distort their construct value; in a non-reactive mode, they cannot.
- The objectives of the test
The objectives of the test have to be specified. Tests can be used for practical and scientific purposes, and objectives can be at the level of an individual test taker or at the level of a group of test takers. Objectives distinguish between description, diagnosis and decision making.
- The population
The target population, the set of persons to whom the test will be applied, has to be specified. This often includes inclusion and exclusion criteria.
- The conceptual framework
This is the theoretical framework on which the test is based. It is used to make conceptual distinctions and organize ideas.
- The item response mode
The test has to specify how test takers respond to the items. This can be a free-response or constructed-response mode, which includes short-answer items and essay items, or a choice or selected-response mode. It can make use of frequency or intensity response scales and of endorsement response scales.
- The administration mode
The test has to specify how it is administered. The administration mode can be oral (1), paper-and-pencil (2), computerized (3) or computerized adaptive (4). Computerized adaptive test administration adjusts the difficulty of the test to the level of the test taker.
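As a rough illustration of the adaptive principle, the sketch below simulates a Rasch-style adaptive test with a hypothetical item bank: it repeatedly selects the unused item whose difficulty is closest to the current ability estimate and nudges the estimate up or down after each response. Operational adaptive tests use more sophisticated IRT-based item selection and ability estimation; the fixed-step update here is only a simplification.

```python
import math
import random

def probability_correct(ability, difficulty):
    """Rasch model: probability of a correct response given ability and item difficulty."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def adaptive_test(item_difficulties, true_ability, n_items=10, step=0.5):
    """Minimal adaptive loop: pick the unused item closest to the current
    ability estimate, then move the estimate after each response."""
    remaining = list(item_difficulties)
    estimate = 0.0  # start from an average ability estimate
    for _ in range(min(n_items, len(remaining))):
        # Select the unused item whose difficulty best matches the estimate.
        item = min(remaining, key=lambda d: abs(d - estimate))
        remaining.remove(item)
        # Simulate the test taker's response (in practice: the observed answer).
        correct = random.random() < probability_correct(true_ability, item)
        # Adjust the estimate: up after a correct answer, down after an error.
        estimate += step if correct else -step
    return estimate

# Hypothetical item bank with difficulties from easy (-2) to hard (+2).
bank = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
print(adaptive_test(bank, true_ability=1.0))
```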
Response scales can be dichotomous (two ordered categories), partially ordinal-polytomous (more than two, partly ordered categories) or ordinal-polytomous (more than two, completely ordered categories). There are several item-writing guidelines:
- Focus on one relevant aspect
This means that items should not test two aspects at the same time.
- Use independent item content
Items should be independent of each other, but this is not necessary if the questions are based on a reading passage.
- Avoid overly specific and overly general content
Overly specific and overly general content leads to ambiguity in the answers.
- Avoid items that deliberately deceive test takers
Items that distract test takers’ attention from the problem that they have to solve should be avoided.
- Keep vocabulary simple for the population of test takers
For native speakers, the items should not require reading skill beyond that of a twelve-year-old.
- Put item options vertically
- Minimize reading time and avoid unnecessary information
Unnecessary information obscures the content and distracts test takers from the problem that they are trying to solve.
- Use correct language
- Use non-sensitive language
An item should not foster stereotypes, should not contain ethnocentric or gender-based underlying assumptions, should not be offensive, should not contain controversial material and should not be elitist or ethnocentric.
- Use a clear stem and include the central idea in the stem
- Word items positively and avoid negatives
- Write three options, unless it is easy to write plausible distractors
- Use one option that is unambiguously the correct or best answer
- Place the options in alphabetical, logical or numerical order
- Vary the location of the correct option across the test (see the sketch after this list)
- Keep the options homogeneous in length, content and grammar
Item writers tend to make the correct option longer than the distractors.
- Avoid ‘all-of-the-above’ as the last option
- Make distractors plausible
Distractors should all seem plausible to test takers who do not know the correct answer.
- Avoid giving clues to the correct answer
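Varying the location of the correct option is easy to automate. The snippet below is a minimal sketch (the item content and the function name are made up for illustration) that shuffles the options of an item while keeping track of where the key ends up, so the correct option lands in a different position from item to item.

```python
import random

def shuffle_options(stem, correct, distractors, rng=random):
    """Shuffle the answer options of one multiple-choice item and
    return the new option order plus the position of the key."""
    options = [correct] + list(distractors)
    rng.shuffle(options)
    key_index = options.index(correct)
    return {"stem": stem, "options": options, "key": key_index}

item = shuffle_options(
    stem="Which coefficient corrects rater agreement for chance?",
    correct="Kappa",
    distractors=["Alpha", "Pearson's r"],
)
print(item["options"], "key at position", item["key"])
```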
The responses to free-response or constructed-response items have to be rated by raters. There are several item-rating guidelines (a few of them are illustrated in the sketch after this list):
- Rate responses anonymously
- Rate the responses to one item at a time
All the responses to one item should be rated before moving on to the next one.
- Provide the rater with a frame of reference
Raters should be given instructions, schemes or ideal responses that they can use as a frame of reference.
- Separate irrelevant aspects from the relevant performance
Only concentrate on relevant performance (e.g. don’t focus on writing in a maths test).
- Use more than one rater
- Re-rate the free responses
The same free responses should be rated on more than one occasion.
- Rate all responses to an item on the same occasion
- Rearrange the order of responses
- Read a sample of responses
A sample of responses should be read before the start of the rating.
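Several of these guidelines (anonymous rating, rating one item at a time, rearranging the order of responses) can be combined in a small preparation step before rating starts. The sketch below assumes a hypothetical data layout in which responses are stored per item as (test taker, text) pairs.

```python
import random

def prepare_for_rating(responses_by_item, rng=random):
    """Arrange free responses so raters work one item at a time,
    anonymously, and in a freshly shuffled order per item."""
    batches = []
    for item_id, responses in responses_by_item.items():
        # Drop test-taker identities so rating is anonymous.
        anonymous = [text for (_taker, text) in responses]
        # Rearrange the order of responses for each item.
        rng.shuffle(anonymous)
        batches.append((item_id, anonymous))
    return batches  # rate each batch completely before the next item

responses = {
    "item1": [("anna", "Photosynthesis ..."), ("ben", "Plants use light ...")],
    "item2": [("anna", "Kappa corrects ..."), ("ben", "Agreement by chance ...")],
}
for item_id, batch in prepare_for_rating(responses):
    print(item_id, batch)
```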
Pilot studies are conducted to test the quality of the draft items. Items are reviewed on content, technical aspects and sensitivity. Test takers are interviewed about their thinking while working on an item: in concurrent interviews they think aloud while working on the item, and in retrospective interviews they recollect their thinking after completing an item.
Coefficient kappa gives the consistency of ratings between different occasions for a rater and uses the following formula:

Kappa = (O − E) / (1 − E), with E < 1

Here O refers to the observed proportion of identical ratings and E refers to the expected proportion of identical ratings. E can be calculated by multiplying the marginal proportions.
|  | Correct | Partly correct | Incorrect | Marg. prop. |
|---|---|---|---|---|
| Correct | .68 | .01 | 0 | .69 |
| Partly correct | .02 | .09 | .01 | .12 |
| Incorrect | .01 | 0 | .18 | .19 |
| Marg. prop. | .71 | .10 | .19 | 1 |
E can be calculated as .69 × .71 + .12 × .10 + .19 × .19 = .538. O can be calculated by adding the diagonal cells (.68 + .09 + .18 = .95). This gives Kappa = (.95 − .538) / (1 − .538) ≈ .89.
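To check the worked example, the snippet below recomputes O, E and kappa from the table above (assuming rows are the ratings on the first occasion and columns the ratings on the second).

```python
# Agreement table from above; rows and columns are the two rating occasions.
table = [
    [0.68, 0.01, 0.00],  # Correct
    [0.02, 0.09, 0.01],  # Partly correct
    [0.01, 0.00, 0.18],  # Incorrect
]

row_margins = [sum(row) for row in table]        # .69, .12, .19
col_margins = [sum(col) for col in zip(*table)]  # .71, .10, .19

# Observed proportion of identical ratings: sum of the diagonal cells.
O = sum(table[i][i] for i in range(len(table)))  # .95
# Expected proportion: sum of the products of the marginal proportions.
E = sum(r * c for r, c in zip(row_margins, col_margins))  # ~.538

kappa = (O - E) / (1 - E)
print(f"O = {O:.2f}, E = {E:.3f}, kappa = {kappa:.2f}")  # kappa ≈ 0.89
```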