WSRt, critical thinking - a summary of all articles needed in the second block of second year psychology at the UvA
Critical thinking
Article: Dienes (2003)
Neyman, Pearson and hypothesis testing
In this article, we will consider the standard logic of statistical inference.
Statistical inference: the logic underlying all the statistics you see in the professional journals of psychology and most other disciplines that regularly use statistics.
The underlying logic of statistics (Neyman-Pearson) is highly controversial, frequently attacked (and defended) by statisticians and philosophers, and even more frequently misunderstood.
The meaning of probability we choose determines what we can do with statistics.
The proper way of interpreting probability remains controversial, so there is still debate over what can be achieved with statistics.
The Neyman-Pearson approach follows from one particular interpretation of probability. The Bayesian approach considered follows from another.
Interpretations often start with a set of axioms that probabilities must follow.
Two interpretations of probability:
The most influential objective interpretation of probability is the long-run relative frequency interpretation. Here, probability is a relative frequency.
Because the long-run relative frequency is a property of all the events in the collective, it follows that a probability applies to a collective, not to any single event.
A single event could be a member of different collectives. So a singular event does not have a probability, only collectives do.
Objective probabilities do not apply to single cases. They also do not apply to the truth of hypotheses.
A hypothesis is simply true or false, just as a single event either occurs or does not.
A hypothesis is not a collective, it therefore does not have an objective probability.
Data = D
Hypothesis = H
P(H|D) is the inverse of the conditional probability P(D|H). Inverting conditional probabilities makes a big difference.
P(A|B) can have a very different value from P(B|A).
Knowing P(D|H) does not mean you know P(H|D).
There are two reasons for this:
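How large the gap between a conditional probability and its inverse can be is easy to show with a small numerical sketch (the disease-testing numbers below are hypothetical, chosen only for illustration):

```python
# Hypothetical numbers illustrating that P(A|B) and P(B|A) can differ widely.
# Suppose 1% of people have a disease (A), and a test (B = positive result)
# detects it 95% of the time but also gives 5% false positives.
p_disease = 0.01
p_pos_given_disease = 0.95       # P(B|A)
p_pos_given_healthy = 0.05       # false-positive rate

# Total probability of a positive test (law of total probability)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos

print(f"P(positive | disease) = {p_pos_given_disease:.2f}")   # 0.95
print(f"P(disease | positive) = {p_disease_given_pos:.2f}")   # about 0.16
```

Here P(positive | disease) is 0.95, yet P(disease | positive) is only about 0.16, because the inversion also depends on how common the disease is.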
Statistics cannot tell us how much to believe a certain hypothesis. What we can do, according to Neyman and Pearson, is set up decision rules for certain behaviours such that in following those rules in the long run we will not often be wrong. We can work out what the error rates are for certain decision procedures and we can choose procedures that control the long-run error rates at acceptable levels.
Decision rules work by setting up two contrasting hypotheses.
For a given experiment we can calculate p = P(obtaining a t as extreme as or more extreme than the one obtained | H0).
If p is less than alpha, the level of significance we have decided on in advance, we reject H0. By following this rule, we know that in the long run, when H0 is actually true, we will conclude it is false only alpha (e.g. 5%) of the time.
In this procedure, the p-value has no meaning in itself. It is just part of a convenient mechanical procedure for accepting or rejecting a hypothesis.
Alpha is an objective probability, a relative long-run frequency.
It is the proportion of errors of a certain type we will make in the long run, if we follow the above procedure and the null hypothesis is in fact true.
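This long-run error-rate claim can be checked by simulation. The sketch below (not from the article; sample size, seed and critical value are illustrative choices) repeatedly runs a one-sample t-test when H0 is actually true and counts how often it is rejected:

```python
import random
import statistics

# Simulate many experiments in which H0 is true (population mean = 0),
# run a one-sample t-test each time, and count rejections at alpha = .05.
random.seed(1)
n, n_experiments = 30, 2000
t_crit = 2.045          # two-tailed .05 critical t for df = 29 (t tables)
rejections = 0

for _ in range(n_experiments):
    sample = [random.gauss(0, 1) for _ in range(n)]
    se = statistics.stdev(sample) / n ** 0.5
    t = statistics.mean(sample) / se
    if abs(t) > t_crit:
        rejections += 1

print(f"Type I error rate: {rejections / n_experiments:.3f}")  # close to 0.05
```

The observed rejection rate hovers around 0.05, as the procedure guarantees for the long run, even though no single experiment's p-value tells us anything about H0's probability.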
Neither alpha nor our calculated p tells us how probable the null hypothesis is.
Alpha: the long-term error rate for one type of error: saying the null is false when it is true.
There are two ways of making an error with the decision procedure.
Both alpha and beta should be controlled at acceptable levels.
Sometimes significance or alpha is defined simply as ‘the probability of a Type I error’. This is wrong.
Alpha is specifically the probability (long-run frequency) of a Type I error when the null hypothesis is true.
Strictly, using a significance level of 5% does not guarantee that only 5% of all published significant results are in error.
Controlling for alpha does not mean you have controlled for beta.
Power = 1 − β
Power is the probability of detecting an effect, given an effect really exists in the population.
In order to control β, you need to estimate the minimal effect size you would care about and choose a sample size that gives adequate power to detect it.
The more participants you run, the greater the power.
Studies should systematically use power calculations to determine the number of participants.
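The relation between sample size and power can be made concrete with a rough simulation (a sketch, not the article's method; the effect size d = 0.5 and the approximate critical t values are illustrative assumptions):

```python
import random
import statistics

# Sketch: estimate power by simulation for a one-sample t-test,
# assuming a true effect of d = 0.5 (population mean 0.5, SD 1).
random.seed(2)

def power(n, n_sims=1000, effect=0.5):
    t_crit = 2.045 if n <= 30 else 2.02   # approx. two-tailed .05 critical t
    hits = 0
    for _ in range(n_sims):
        sample = [random.gauss(effect, 1) for _ in range(n)]
        t = statistics.mean(sample) / (statistics.stdev(sample) / n ** 0.5)
        if abs(t) > t_crit:
            hits += 1
    return hits / n_sims          # proportion of simulated studies significant

for n in (10, 20, 40):
    print(f"n = {n:2d}: power ~ {power(n):.2f}")
```

Running this shows power climbing steeply with n, which is why power calculations should drive the choice of participant numbers rather than the other way around.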
Significance of 5% means that, if the null hypothesis were true, one would expect 5% of studies to be significant.
Meta-analysis: the process of combining groups of studies together to obtain overall tests of significance.
A set of null results does not mean you should accept the null. They may indicate that you should reject the null.
If your study has low power, getting a null result tells you nothing in itself.
You would expect a null result whether or not the null hypothesis was true.
In the Neyman-Pearson approach, you set power at a high level in designing the experiment, before you run it. Then you are entitled to accept the null hypothesis when you obtain a null result. Doing this procedure you will make errors at a small controlled rate, a rate you have decided in advance is acceptable for you.
Statistics never allows absolute proof or disproof.
Sensitivity can be determined in three ways:
Whenever you find a null result and it is interesting to you that the result is null, you should always indicate the sensitivity of your analysis.
The conditions under which you will stop collecting data for a study define the stopping rule you use.
In the Neyman-Pearson approach it is essential to know the collective or reference class for which we are calculating our objective probabilities alpha and beta.
The relevant collective is defined by a testing procedure applied an indefinite number of times.
In the Neyman-Pearson approach, in order to control overall Type I error, if we perform a number of tests we need to test each one at a stricter level of significance in order to keep overall alpha at 0.05. There are numerous corrections.
A researcher might mainly want to look at one particular comparison, but threw in some other conditions out of curiosity while already going to the effort of recruiting, running and paying participants. It might then feel unfair that the required significance threshold becomes stricter just because you collected other conditions you did not strictly need.
The solution is that if you planned one particular comparison in advance then you can test at the 0.05 level, because that one was picked out in advance of seeing the data.
But, the other tests must involve a correction.
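One common such correction is Bonferroni, sketched minimally below (the p-values are hypothetical; the article only says numerous corrections exist, so this is one illustrative choice):

```python
# Bonferroni correction: to keep the overall (family-wise) Type I error
# at .05 across k unplanned tests, test each one at alpha / k.
alpha = 0.05
p_values = [0.003, 0.020, 0.049]       # hypothetical p-values from 3 tests
k = len(p_values)
corrected_alpha = alpha / k            # 0.05 / 3 ~ 0.0167

for p in p_values:
    verdict = "reject H0" if p < corrected_alpha else "do not reject H0"
    print(f"p = {p:.3f} vs corrected alpha = {corrected_alpha:.4f}: {verdict}")
```

Note that two of the three tests that would have been significant at the uncorrected .05 level fail to reach the corrected threshold, which is exactly how the overall alpha is held at .05.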
Alpha is an objective probability and hence a property of a collective and not any individual event, not a particular sample.
In the Neyman-Pearson approach, the relevant probabilities alpha and beta are the long-run error rates you decide are acceptable and so must be set in advance.
If alpha is set at 0.05, the only meaningful claim to make about the p-value of a particular experiment is either it is less than 0.05 or not.
The statistics tell you nothing about how confident you should be in a hypothesis nor what strength of evidence there is for different hypotheses.
It is hard to construct an argument for why p-values should be taken as strength of evidence per se. Conceptually, the strength of evidence for or against a hypothesis is distinct from the probability of obtaining such evidence.
There is no need to force p-values into the role of measuring strength of evidence, a role for which they may often give a reasonable answer, but not always.
Significance is not a property of populations.
Hypotheses are about population properties. Significance is not a property of population means or differences.
Decision rules are laid down before data are collected; we simply make black and white decisions with known risks of error.
A more significant result does not mean a more important result, or a larger effect size.
The Neyman-Pearson approach is not just about null hypothesis testing.
Neyman also developed the concept of confidence interval, a set of possible population values the data are consistent with.
Instead of saying merely we reject one value, one reports the set of values rejected, and the set of possible values remaining.
To calculate the 95% confidence interval, find the set of all values of the dependent variable that are non-significantly different from your sample value at the 5% level.
Use of confidence intervals overcomes some of the problems people otherwise have when using Neyman-Pearson statistics:
Confidence intervals are a very useful way of summarizing what a set of studies as a whole is telling us. You can calculate the confidence interval on the parameter of interest by combining the information provided in all the studies.
The 95% confidence interval is interpreted in terms of an objective probability.
The procedure of calculating 95% confidence intervals will produce intervals that include the true population value 95% of the time.
There is no probability attached to any one calculated interval. That interval either includes the population value or it does not.
There is not a 95% probability that the 95% confidence limits for a particular sample include the true population mean. But if you acted as if the true population value were included in your interval each time you calculated a 95% confidence interval, you would be right 95% of the time.
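This coverage property can also be checked by simulation. The sketch below (illustrative numbers, not from the article) repeatedly draws samples from a known population and counts how often the computed 95% interval contains the true mean:

```python
import random
import statistics

# Sketch: check that the 95% CI procedure covers the true mean about
# 95% of the time in the long run (true population mean = 10, SD = 2).
random.seed(3)
true_mean, n, n_experiments = 10, 30, 2000
t_crit = 2.045            # two-tailed .05 critical t for df = 29
covered = 0

for _ in range(n_experiments):
    sample = [random.gauss(true_mean, 2) for _ in range(n)]
    m = statistics.mean(sample)
    half_width = t_crit * statistics.stdev(sample) / n ** 0.5
    if m - half_width <= true_mean <= m + half_width:
        covered += 1

print(f"Coverage: {covered / n_experiments:.3f}")  # close to 0.95
```

The 95% refers to this long-run behaviour of the procedure, not to any one interval, which either contains the true value or does not.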
Inference consists of simple acceptance or rejection
Null hypothesis testing encourages weak theorizing
In the Neyman-Pearson approach it is important to know the reference class: we must know what endless series of trials might have happened but never did.
If an article uses significance or hypothesis tests, then two hypotheses need to be specified for each test.
Most papers fall down at the first hurdle because the alternative is not well specified.
The stopping rule should be specified in advance (e.g. a fixed number of participants), with significance testing carried out once, at the end of data collection.
Even if minimally interesting effect sizes and power were not stated in advance, a crucial point is how the authors dealt with interesting null results.
Given a null result was obtained, did the authors give some measure of sensitivity of the test?