Understanding Psychology as a Science - Dienes - 2008 - Article
- Who are Fisher, Neyman and Pearson?
- What is probability?
- What are hypotheses?
- What is sensitivity?
- What are stopping rules?
- What is multiple testing?
- What are points concerning significance tests that are often misunderstood?
- What are confidence intervals?
- What is the criticism of the Neyman-Pearson approach?
- How do you use the Neyman-Pearson approach to critically evaluate a research article?
In this chapter the standard logic of statistical inference is considered. This logic is known as the Neyman-Pearson approach.
Before reading this chapter, make sure you understand the following concepts:
- Standard deviation
- Standard error
- Null hypothesis
- Distribution
- Normal distribution
- Population
- Sample
- Significance
- t-test
Who are Fisher, Neyman and Pearson?
The British genius Sir Ronald Fisher thought of many of the techniques and concepts we use in statistics. He created much of what the users of statistics now recognize as statistical practice.
The Polish mathematician Jerzy Neyman and the British statistician Egon Pearson provided a firm, consistent logical basis for hypothesis testing and statistical inference. Fisher did not appreciate it, but it nonetheless transformed the field of mathematical statistics and defined the logic that journal editors came to demand of the papers they publish.
What is probability?
The meaning of probability we choose determines what we can do with statistics. The Neyman-Pearson approach follows from one particular interpretation of probability.
Interpretations often start from a set of axioms that probabilities must follow; on this formal view, probabilities are simply whatever obeys those axioms.
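For reference, one standard set of such axioms is Kolmogorov's (the chapter itself does not list them, so take this as a sketch):

```latex
% Kolmogorov's axioms of probability (a standard formulation)
P(A) \ge 0 \quad \text{for every event } A
P(\Omega) = 1 \quad \text{where } \Omega \text{ is the sure event}
P(A \cup B) = P(A) + P(B) \quad \text{when } A \text{ and } B \text{ are mutually exclusive}
```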
According to the subjective interpretation of probability, a probability is a degree of conviction in a belief.
According to the objective interpretation of probability, a probability is set in the world instead of in the mind. Objective probabilities exist independently of our states of knowledge. They are to be discovered by examining the world.
The most influential objective interpretation of probability is the long-run relative frequency interpretation of von Mises. A probability is the long-run relative frequency of an event in a hypothetical infinite set of events, called the reference class or collective. Because the long-run relative frequency is a property of all the events in the collective, it follows that a probability applies to a collective, not to any single event. Objective probabilities do not apply to single cases. They also do not apply to the truth of hypotheses: a hypothesis is simply true or false, just as a single event either occurs or does not. A hypothesis is not a collective; it therefore does not have an objective probability.
What are hypotheses?
In what follows, data are symbolized by D and a hypothesis by H. The probability of obtaining the data given the hypothesis is then P(D|H). P(H|D) is the inverse of the conditional probability P(D|H). Inverting conditional probabilities makes a big difference: in general, P(A|B) can have a very different value from P(B|A).
If you know P(D|H), it does not mean you know what P(H|D) is. This is the case for two reasons:
- Inverse conditional probabilities can have very different values.
- It is meaningless to assign an objective probability to a hypothesis.
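A minimal sketch of the first reason, using made-up numbers (the scenario and figures are mine, not the chapter's):

```python
# Toy illustration (hypothetical counts) that P(A|B) and P(B|A) can differ greatly.
# A = "person is a chess grandmaster", B = "person is male".
grandmasters = 1500            # assumed total number of grandmasters
male_grandmasters = 1470       # assumed number of male grandmasters
males = 4_000_000_000          # assumed number of males in the world

p_b_given_a = male_grandmasters / grandmasters   # P(B|A): close to 1
p_a_given_b = male_grandmasters / males          # P(A|B): vanishingly small

print(f"P(male | grandmaster) = {p_b_given_a:.2f}")
print(f"P(grandmaster | male) = {p_a_given_b:.9f}")
```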
Statistics cannot tell us how much to believe a certain hypothesis. According to Neyman and Pearson, we can set up decision rules for accepting or rejecting hypotheses, such that in following those rules in the long run we will not often be wrong. We can work out what the error rates are for certain decision procedures and we can choose procedures that control the long-run error rates at acceptable levels.
These decision rules work by setting up two contrasting hypotheses: the null hypothesis (H0) and the alternative hypothesis (H1). Either hypothesis can specify an exact difference or a band (range) of differences. The only distinction between the two is that the null hypothesis is the one that would be most costly to reject falsely.
Parameters are properties of populations and are symbolized with Greek letters. Statistics are summaries of sample measurements and are symbolized with Roman letters. The null and alternative hypotheses are about population values. We try to use our samples to make inferences about the population.
For a given experiment we can calculate p = P(obtaining a t as extreme as, or more extreme than, the one obtained | H0), which is a form of P(D|H). This p is the 'p-value', or simply 'p', reported in statistical computer output. If p is less than α, the level of significance we have decided on in advance (say 0.05), we reject H0. By following this rule, we know that in the long run, when H0 is actually true, we will conclude it is false only 5% of the time. In this procedure the p-value has no meaning in itself; it is just a convenient mechanical device for accepting or rejecting a hypothesis, given that statistical computer output produces p-values as a matter of course.
α is an objective probability, a long-run relative frequency. It is the proportion of errors of a certain type we will make in the long run if we follow the above procedure and the null hypothesis is in fact true. Neither α nor our calculated p tells us how probable the null hypothesis is.
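A sketch of this long-run claim, under assumptions of my own (normal data, two groups of 20, a two-tailed t-test, and H0 actually true):

```python
# Sketch: when H0 is true, the rule "reject if p < .05" errs in about 5% of decisions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n_experiments, rejections = 0.05, 10_000, 0

for _ in range(n_experiments):
    a = rng.normal(loc=0.0, scale=1.0, size=20)   # H0 true: both groups share the same mean
    b = rng.normal(loc=0.0, scale=1.0, size=20)
    _, p = stats.ttest_ind(a, b)
    rejections += (p < alpha)

print(f"Long-run Type I error rate: {rejections / n_experiments:.3f}")   # close to 0.05
```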
α is the long-run error rate for one type of error: saying the null is false when it is true (a Type I error). But there are two ways of making an error with this decision procedure. A Type II error happens when one accepts H0 as true when in fact it is false; the long-run rate of this error is symbolized β. When the null is true, we will make a Type I error in α proportion of our decisions in the long run. Strictly using a significance level of 5% does not guarantee that only 5% of all published significant results are in error.
Controlling one of these error rates does not mean the other is controlled as well; α and β can be very different from one another. Power is defined as 1 - β: the probability of detecting an effect, given that the effect really exists in the population.
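A companion sketch to the one above, now with a real effect present (again my own assumptions: a true difference of 0.5 standard deviations and 30 participants per group), estimating power as a long-run detection rate:

```python
# Sketch: power (1 - beta) as the long-run rate of detecting an effect that really exists.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n_experiments, detections = 0.05, 10_000, 0

for _ in range(n_experiments):
    a = rng.normal(loc=0.0, scale=1.0, size=30)
    b = rng.normal(loc=0.5, scale=1.0, size=30)   # H1 true: means differ by 0.5 SD
    _, p = stats.ttest_ind(a, b)
    detections += (p < alpha)

print(f"Estimated power: {detections / n_experiments:.2f}")   # beta is 1 minus this value
```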
In the Neyman-Pearson approach, one decides on acceptable α and β levels before an experiment is run. In order to control β, you need to:
1. Estimate the size of effect you think is interesting, given your theory is true.
2. Estimate the amount of noise your data will have.
Having determined (1) and (2) above, you can use standard statistics textbooks to tell you how many participants you need to run to keep β at 0.05 (equivalently, to keep power at 0.95); a rough sketch of such a calculation follows.
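One such calculation, sketched with a normal approximation for a two-group comparison (the effect size of 0.5 standard deviations, two-tailed α of .05 and power of .95 are assumptions for illustration):

```python
# Rough sample-size sketch (normal approximation) for a two-sample comparison.
from scipy import stats

d, alpha, power = 0.5, 0.05, 0.95         # standardized effect size, alpha, desired power
z_alpha = stats.norm.ppf(1 - alpha / 2)   # critical z for a two-tailed test
z_beta = stats.norm.ppf(power)            # z corresponding to the desired power

n_per_group = 2 * ((z_alpha + z_beta) / d) ** 2
print(f"Participants needed per group: about {round(n_per_group)}")   # roughly 104
```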
Strict application of the Neyman-Pearson logic means setting the risks of both Type I and II errors (α and β) in advance. Many researchers are extremely worried about Type I errors, but allow Type II errors to go uncontrolled. Ignoring the systematic control of Type II errors leads to inappropriate judgments about what results mean and what research should be done next.
The process of combining groups of studies together to obtain overall tests of significance (or to estimate values or calculate confidence intervals) is called meta-analysis.
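The chapter does not prescribe a particular method; one common approach is fixed-effect, inverse-variance weighting, sketched here with made-up study results:

```python
# Sketch of combining studies by fixed-effect, inverse-variance weighting.
import numpy as np
from scipy import stats

effects = np.array([0.30, 0.10, 0.45])   # hypothetical effect estimates from three studies
ses = np.array([0.20, 0.15, 0.25])       # their standard errors

weights = 1 / ses**2                      # weight each study by its precision
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1 / np.sum(weights))

z = pooled / pooled_se
p = 2 * stats.norm.sf(abs(z))             # overall two-tailed test of significance
lower, upper = pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se
print(f"Pooled effect = {pooled:.2f}, 95% CI = ({lower:.2f}, {upper:.2f}), p = {p:.3f}")
```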
A set of null results does not mean you should accept the null; taken together, they may even indicate that you should reject it.
To summarize, if your study has low power, getting a null result tells you nothing in itself: you would expect a null result whether or not the null hypothesis was true. In the Neyman-Pearson approach, you set power at a high level in designing the experiment, before you run it. Then you are entitled to accept the null hypothesis when you obtain a null result. In following this procedure you will make errors at a small controlled rate, a rate you have decided in advance is acceptable to you.
What is sensitivity?
Sensitivity can be determined in three ways:
- Power.
- Confidence intervals.
- Finding an effect significantly different from another reference one.
Whenever you find a null result and it is interesting to you that the result is null, you should always indicate the sensitivity of your analysis. Before you could accept the null result, you would need to show that you had adequate power to pick up a minimally interesting effect.
What are stopping rules?
A stopping rule defines the conditions under which you will stop collecting data for a study. The standard Neyman-Pearson stopping rule is to use power calculations in advance of running the study to determine how many participants should be run, so that power is controlled at a predetermined level. Both α and β can then be controlled at known, acceptable levels. Collecting data until a confidence interval reaches a predetermined width is another good stopping rule.
What is multiple testing?
In the Neyman-Pearson approach it is essential to know the collective or reference class for which we are calculating our objective probabilities α and β. The relevant collective is defined by a testing procedure applied an indefinite number of times.
In the Neyman-Pearson approach, in order to control overall Type I error, if we perform a number of tests we need to test each one at a stricter level of significance to keep the overall α at 0.05. There are numerous corrections, but the easiest one to remember is Bonferroni: if you perform k tests, conduct each individual test at the 0.05/k level of significance, and the overall α will be no higher than 0.05.
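A minimal sketch of the Bonferroni rule applied to some hypothetical p-values:

```python
# Sketch: Bonferroni correction keeps the overall Type I error rate at or below alpha.
alpha = 0.05
p_values = [0.003, 0.020, 0.047, 0.300]   # hypothetical p-values from k = 4 tests

k = len(p_values)
corrected_alpha = alpha / k               # each test is judged against the stricter level

for i, p in enumerate(p_values, start=1):
    decision = "reject H0" if p < corrected_alpha else "do not reject H0"
    print(f"Test {i}: p = {p:.3f} vs {corrected_alpha:.4f} -> {decision}")
```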
What are points concerning significance tests that are often misunderstood?
- Significance is not a property of populations.
- Decision rules are laid down before data are collected; we simply make black and white decisions with known risks of error.
- A more significant result does not mean a more important result, or a larger effect size.
What are confidence intervals?
A confidence interval is the set of possible population values with which the data are consistent. The concept was developed by Neyman. To calculate the 95% confidence interval, find the set of all values of the dependent variable that are non-significantly different from your sample value at the 5% level.
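A sketch of that calculation for a single sample mean (the data are made up), using the values that would not differ significantly from the sample mean at the 5% level:

```python
# Sketch: a 95% confidence interval for a mean, built from the t distribution.
import numpy as np
from scipy import stats

sample = np.array([4.1, 5.3, 4.8, 6.0, 5.5, 4.9, 5.2, 5.8])   # hypothetical data
mean = sample.mean()
se = sample.std(ddof=1) / np.sqrt(len(sample))                 # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)                # two-tailed 5% cut-off

lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"Sample mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```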
Using confidence intervals overcomes some of the problems people otherwise have with Neyman-Pearson statistics:
- They tell you the sensitivity of your experiment directly: if the confidence interval includes both the value of the null hypothesis and the interesting values of the alternative hypothesis, the experiment was not sensitive enough to draw definitive conclusions.
- They provide a useful stopping rule: stop collecting data when the interval is of a certain predetermined width.
- They are a very useful way of summarizing what a set of studies as a whole is telling us.
Like all statistics in the Neyman-Pearson approach, the 95% confidence interval is interpreted in terms of an objective probability: if you ran the experiment indefinitely often and calculated a 95% confidence interval each time, 95% of those intervals would contain the true population value. It does not mean there is a 95% probability that the population value lies in this particular interval.
What is the criticism of the Neyman-Pearson approach?
On this approach, inference consists of nothing more than simple acceptance or rejection. Arguably, what a scientist wants to know is either how likely certain hypotheses are in the light of the data, or how strongly the evidence supports one hypothesis rather than another.
Null hypothesis testing encourages weak theorizing. A good theory should specify the size of an effect, not just that it differs from zero.
It is important to know the reference class in the Neyman-Pearson approach – we must know what endless series of trials might have happened but never did. This is important when considering both multiple testing and stopping rules. It strikes some as unreasonable that what never happened should determine what is concluded about what did happen.
How do you use the Neyman-Pearson approach to critically evaluate a research article?
If the article uses significance or hypothesis tests, then two hypotheses need to be specified for each test.
Note from the introduction section of the paper whether any specific comparisons were highlighted as the main point of the experiment. These comparisons, if few in number, can be treated as planned comparisons later. If a direction is strongly predicted at this point, one-tailed tests could be considered later.
The stopping rule should be specified. If very different numbers of subjects are used in different experiments in the paper for no apparent reason, it may be a sign that multiple significance tests were conducted, as each experiment progressed and stopping occurred when the required results were obtained.
Even if minimally interesting effect sizes and power were not stated in advance, a crucial point is how the authors dealt with interesting null results. Given that a null result was obtained, did the authors give some measure of the sensitivity of the test?