Effect size, proportion of explained variance and power of tests
Some researchers criticize the process of hypothesis testing. The main critique concerns the interpretation of a significant result. When testing hypotheses, most attention is paid to the data rather than to the hypotheses: when the null hypothesis is rejected, we make statements about the sample data, not about the null hypothesis itself. Based on the sample data, the null hypothesis is rejected or not, but we do not know whether the null hypothesis is actually true or false. Another point of critique is that a significant result does not say anything about the size of the effect. A result is either significant or not; a significant effect is therefore not the same as a large effect. To provide more insight into the size of an effect, Cohen (1988) proposed the so-called effect size. His measure for effect size is called Cohen’s d.
\[Cohen's\: d =\frac{(\bar{x} - \mu)}{\sigma}\]
- Cohen's d: standardised difference between two means
- x̄: sample mean
- µ: population mean
- σ: population standard deviation
The outcome of Cohen’s d is classified as a small effect for d = 0.2, a medium effect for d = 0.5 and a large effect for d = 0.8.
Please note: APA style strongly recommends use of η2 instead of Cohen's d.
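As a minimal sketch, Cohen's d could be computed in Python as follows; the scores and the population values µ = 100 and σ = 15 are made-up illustration numbers, not data from the text.

```python
import numpy as np

def cohens_d(sample, mu, sigma):
    """Cohen's d: standardised difference between the sample mean and mu."""
    return (np.mean(sample) - mu) / sigma

# hypothetical example: IQ-style scores, population mean 100 and SD 15
scores = np.array([104, 110, 98, 107, 112, 101, 109, 105])
d = cohens_d(scores, mu=100, sigma=15)
print(f"Cohen's d = {d:.2f}")  # ~0.38, between a small (0.2) and a medium (0.5) effect
```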
A different way to determine the effect size is to look at how much of the variance between the scores is explained by the effect. The proportion of explained variance is found by squaring the t-statistic and dividing it by that same value plus the degrees of freedom. In formula:
\[r^2 = \frac{t^2}{t^2 + df}\]
- r2: proportion of explained variance
- t: t-statistic
- df: degrees of freedom: n-1
A proportion of explained variance of 0.01 refers to a small effect, a value of 0.09 to a medium effect and a proportion of 0.25 to a large effect. In the literature, r2 is usually reported as a percentage.
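A small sketch of the same calculation in Python; the t-statistic and degrees of freedom used here are hypothetical illustration values.

```python
def explained_variance(t, df):
    """Proportion of explained variance r^2 from a t-statistic and its df."""
    return t**2 / (t**2 + df)

# hypothetical one-sample t-test with n = 25, so df = n - 1 = 24
t, df = 2.7, 24
r2 = explained_variance(t, df)
print(f"r^2 = {r2:.3f} ({r2:.1%} of the variance explained)")  # ~0.23, a medium-to-large effect
```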
Confidence intervals can assist in describing the results of hypothesis tests. When we obtain a specific estimate of a parameter, we call this a point estimation. Next, there are interval estimations, which provide the limits within which the true population parameter (µ) likely lies. These limits are called the confidence limits, and together they form the confidence interval. We want to know how high and how low the µ-value can be for which we do not reject H0. This provides the limits within which we retain the null hypothesis.
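A sketch of such an interval estimation in Python, assuming a small made-up sample and a 95% confidence level (alpha = .05); scipy is used to obtain the critical t-value.

```python
import numpy as np
from scipy import stats

# hypothetical sample; 95% confidence matches alpha = .05
sample = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3, 5.2, 4.9])
n = len(sample)
mean, se = sample.mean(), sample.std(ddof=1) / np.sqrt(n)

# interval estimation: mean +/- t_crit * SE gives the confidence limits
t_crit = stats.t.ppf(0.975, df=n - 1)
lower, upper = mean - t_crit * se, mean + t_crit * se
print(f"95% CI for mu: [{lower:.2f}, {upper:.2f}]")
# any H0 value of mu inside this interval would not be rejected at alpha = .05 (two-sided)
```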
Besides measuring the effect size, it is also possible to measure the power of a statistical test. Power refers to the extent to which a study is capable of detecting effects in the examined variable. A study with high power is able to detect existing effects, while a study with low power will often fail to detect them. Power is influenced by several factors, one of which is the number of participants: in general, the more participants, the higher the power. Strong effects are easier to detect than weak effects; a study with low power will often miss weak effects but may still detect strong ones. To detect a weak effect, a high power is required, which in turn usually means data from many participants. Power can be calculated as:
\[Power = 1 - \beta\]
- β: the chance of a type-II error
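As an illustration, the sketch below computes β and the power 1 − β for a hypothetical one-sided z-test (H0: µ = 100 against a true mean of 105, with σ = 15, n = 36 and α = .05); all numbers are made-up illustration values.

```python
import numpy as np
from scipy import stats

# hypothetical one-sided z-test: H0: mu = 100, true mu = 105, sigma = 15, n = 36, alpha = .05
mu0, mu_true, sigma, n, alpha = 100, 105, 15, 36, 0.05
se = sigma / np.sqrt(n)

# rejection region: sample means above this cut-off lead to rejecting H0
cutoff = mu0 + stats.norm.ppf(1 - alpha) * se

# beta = probability of NOT rejecting H0 when the true mean is mu_true (type-II error)
beta = stats.norm.cdf(cutoff, loc=mu_true, scale=se)
print(f"beta = {beta:.3f}, power = {1 - beta:.3f}")  # power ~ 0.64 in this example
```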
Researchers often require a power of 0.80. The power of a test is influenced by three factors:
First, the sample size is important. The larger the sample size, the higher the chance of rejecting the null hypothesis when the null hypothesis is actually false. This means that the power of the test increases when the sample size increases.
Second, the power of the test decreases when the alpha level is lowered. When alpha is decreased from 5% to 1%, for example, the chance that a true effect is found (that is, that the null hypothesis is correctly rejected) decreases.
Third, the power increases when a two-sided test is changed into a one-sided test (provided the effect lies in the predicted direction).
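A sketch of how these three factors could be explored numerically, assuming the statsmodels package is available; the effect size d = 0.5 and the sample sizes are arbitrary illustration values.

```python
from statsmodels.stats.power import TTestPower

analysis = TTestPower()
d = 0.5  # hypothetical medium effect size

# 1. larger samples give more power
for n in (20, 50, 100):
    print(f"n={n:3d}: power = {analysis.power(effect_size=d, nobs=n, alpha=0.05):.2f}")

# 2. a stricter (lower) alpha reduces the power
for alpha in (0.05, 0.01):
    print(f"alpha={alpha}: power = {analysis.power(effect_size=d, nobs=50, alpha=alpha):.2f}")

# 3. a one-sided test ('larger') has more power than a two-sided test
for alt in ("two-sided", "larger"):
    print(f"{alt}: power = {analysis.power(effect_size=d, nobs=50, alpha=0.05, alternative=alt):.2f}")
```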