HC3: Sample size calculation
Motivation
In medical papers, the statistical analysis paragraph often includes a justification of the study's sample size.
The aim of an RCT is to compare 2 treatments → patients are recruited to the study and randomized to treatment A or B. It also needs to be determined how many patients are included in the RCT → the sample size:
- Too few
- Imprecise results → too little power to detect the effect of treatment
- Too many
- Takes a lot of time, effort and money
- It is unethical to include more patients in a study than necessary
Factors
Factors for deciding sample size are:
- Practical
- Number of eligible patients treated at the center
- Number of patients willing to participate
- Time
- Money
- Statistical
- How big of an effect can be detected with a given number of patients?
Hypothesis testing
Hypothesis testing yields p-values and statements of statistical significance. Hypothesis testing is done as follows:
- Decide on a null hypothesis (H0) about the population
- H0: there is no difference between the 2 groups
- Take a representative sample of the population
- Calculate the observed difference in the sample
- Calculate the p-value
- P-value: the probability of observing a difference at least this large if H0 is true
- This is done by a statistical test
- If the p-value is smaller than the prespecified value α, H0 is rejected
- The value α is called the significance level
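The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the outcome is taken to be normally distributed with a known standard deviation (so a z-test applies), and the observed difference, standard deviation and group size are hypothetical numbers chosen for the example. The function name `two_sided_p_value` is made up for this sketch.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(observed_diff, sd, n_per_group):
    """Two-sided p-value of a z-test for a difference in means (known SD assumed)."""
    se = sd * sqrt(2 / n_per_group)   # standard error of the difference in means
    z = abs(observed_diff) / se       # how many standard errors away from H0 (difference = 0)
    # Probability of a difference at least this large, in either direction, if H0 is true
    return 2 * (1 - NormalDist().cdf(z))


alpha = 0.05  # prespecified significance level
# Hypothetical sample: observed difference of 6, SD 10, 30 patients per group
p = two_sided_p_value(observed_diff=6, sd=10, n_per_group=30)
print(f"p = {p:.3f}, reject H0: {p < alpha}")
```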
Mistakes:
However, mistakes can be made in hypothesis testing. H0 is rejected when the observations would be unlikely if H0 were true, not only when they would be impossible:
- Correct decisions
- H0 is not rejected + H0 is true
- H0 is rejected + H0 is not true
- Incorrect decisions
- H0 is rejected + H0 is true → a type 1 error
- α = the probability of a type 1 error
- H0 is not rejected + H0 is not true → a type 2 error
- β = the probability of a type 2 error
Power:
The power is the probability of finding a significant effect in a sample when the effect is really present in the population. This depends on:
- Relevant difference (effect size)
- Sample size
- If the sample size decreases, the power will also decrease
- Variance/standard deviation
- If there is more variation in a group the power will be smaller
- Significance level α
The aim is to have a study with a high power, typically 80-90%.
Example:
There is an RCT on patients with high blood pressure:
- Intervention: 40 mg of ReDuCe
- Comparator: 25 mg of hydrochlorothiazide
- Outcome: blood pressure after 6 weeks of treatment
In order to calculate the optimal sample size of this trial, some extra information is necessary:
- Standard deviation: 10 mmHg
- A difference of 5 mmHg between the 2 arms of the RCT is relevant
- Significance level α: 0,05
- P values <0,05 are statistically significant
In case the trial is done with 2 groups of 30 patients and H0 is true, in 95% of cases the observed difference in means will lie between -5 and +5:
- 2,5% of cases is <-5
- 2,5% of cases is >+5
H0 is rejected if the observed difference falls outside this range, i.e. is lower than -5 or higher than +5. Now suppose the true difference is 5 instead of 0 → H1 = 5. Because H0 is only rejected if a value slightly higher than 5 is found, and the observed difference will be lower than 5 in about half of the samples, H0 often won't be rejected even though the alternative hypothesis is true:
- In 49% of cases, H0 will be rejected
- In 51% of cases, H0 will not be rejected
In this situation, the power is 49% → the probability that a significant difference is found if the alternative hypothesis is true (if the true difference is 5 instead of 0). There is a probability of 49% to detect a true difference of 5 mmHg.
The power increases as the sample size increases:
- If the group sizes increase to 2 groups of 50 patients, the probability to reject H0 becomes 70% → the power is 70%
- If the groups increase to 70 patients each, the power becomes 84%
- If the groups increase to 90 patients each, the power becomes 92%
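The relation between group size and power in this example can be checked with a short Python sketch. It assumes a two-sided z-test with known standard deviation (normal approximation) and ignores the negligible chance of rejecting H0 in the wrong direction. The function name `power_two_means` is hypothetical. The results agree with the percentages above to within about one percentage point.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(n_per_group, sd, diff, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two means."""
    std_normal = NormalDist()
    se = sd * sqrt(2 / n_per_group)              # standard error of the difference in means
    z_crit = std_normal.inv_cdf(1 - alpha / 2)   # cut-off value, about 1.96 for alpha = 0.05
    # Probability that the observed difference exceeds the upper cut-off
    # when the true difference equals `diff`
    return std_normal.cdf(diff / se - z_crit)


# Blood pressure example: SD 10 mmHg, relevant difference 5 mmHg
for n in (30, 50, 70, 90):
    print(f"{n} per group: power = {power_two_means(n, sd=10, diff=5):.1%}")
```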
Formulas
There are formulas to calculate the optimal sample size for a given power. This can be done in 2 different situations:
- Continuous outcomes
- Number of patients per group: $n = \frac{2\,(z_{\alpha/2} + z_{\beta})^2\, s^2}{d^2}$
- s = standard deviation of the outcome
- d = effect size
- (Minimum) difference in means between the 2 groups
- α = significance level
- β = probability of a type 2 error
- Often β = 0,20 or β = 0,10
- Power = 1 – β
- z-values: link α and β to the calculated number
- Cut-off values of the standard normal distribution
- Every significance level (and every power) has a corresponding z-value
- Will be on a formula sheet during the exam
- Binary outcomes (there is a yes/no value)
- p1= the probability of the outcome in group 1
- p2= the probability of the outcome in group 2 (under H1)
- Number of patients per group: $n = \frac{2\,(z_{\alpha/2} + z_{\beta})^2\, \bar{p}\,(1 - \bar{p})}{d^2}$
- $\bar{p} = \tfrac{1}{2}(p_1 + p_2)$
- $d = p_1 - p_2$
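The z-values in both formulas are cut-off values of the standard normal distribution. As a minimal sketch, they can be looked up with Python's standard library instead of a formula sheet. The helper names `z_alpha_half` and `z_beta` are made up for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal distribution (mean 0, SD 1)


def z_alpha_half(alpha):
    """Cut-off value z_(alpha/2) for a two-sided test at significance level alpha."""
    return std_normal.inv_cdf(1 - alpha / 2)


def z_beta(power):
    """Cut-off value z_beta that belongs to the desired power (power = 1 - beta)."""
    return std_normal.inv_cdf(power)


print(round(z_alpha_half(0.05), 2))  # 1.96
print(round(z_beta(0.80), 2))        # 0.84
print(round(z_beta(0.90), 2))        # 1.28
```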
Examples:
Example with continuous outcomes:
- d = 5 mmHg
- α = 0,05
- s = 10
- Power = 80% (β = 0,20)
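→ n = 2(1,96 + 0,84)² × 10²/5² = 62,7 → 63 patients per group after rounding up (software that uses the t-distribution instead of the normal approximation gives 64).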
Example with binary outcomes:
- p1 = 0,06
- p2 = 0,03
- d = p1 – p2 = 0,03
- $\bar{p}$ = (0,06 + 0,03)/2 = 0,045
- α = 0,05
- Power = 80% (β = 0,20)
→ n = 2(1,96 + 0,84)² × 0,045 × (1 - 0,045)/0,03² = 749 patients per group.
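Both worked examples can be reproduced with a short Python sketch of the two formulas. The function names are hypothetical, and the z-values are computed without rounding, so the outcome can differ by a patient or two from a hand calculation with 1,96 and 0,84 or from software that applies a t-distribution correction. Results are rounded up, since a fraction of a patient cannot be recruited.

```python
from math import ceil
from statistics import NormalDist


def _z_sum_squared(alpha, power):
    """(z_(alpha/2) + z_beta)^2 for a two-sided test."""
    std_normal = NormalDist()
    return (std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)) ** 2


def n_continuous(sd, diff, alpha=0.05, power=0.80):
    """Patients per group for a continuous outcome: 2 (z_a/2 + z_b)^2 s^2 / d^2."""
    return ceil(2 * _z_sum_squared(alpha, power) * sd**2 / diff**2)


def n_binary(p1, p2, alpha=0.05, power=0.80):
    """Patients per group for a binary outcome, using the mean proportion p_bar."""
    p_bar = (p1 + p2) / 2
    return ceil(2 * _z_sum_squared(alpha, power) * p_bar * (1 - p_bar) / (p1 - p2) ** 2)


print(n_continuous(sd=10, diff=5))   # 63
print(n_binary(p1=0.06, p2=0.03))    # 750 (749 by hand with rounded z-values)
```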
Pertinent questions
Important questions to ask are:
- What is the outcome?
- What type of outcome is it?
- Numerical
- Categorical
- Survival
- For numeric outcomes: what is the standard deviation?
- What is the relevant difference (effect size d)?
- What power is desired?
- What is the significance level?
- Is it a one- or two-sided test?
- Usually the test is two-sided
Remarks
Important things to remember are:
- Similar formulas exist for more complex situations
- There is a lot of free and commercial software → before use, check that it yields the correct answers, in particular:
- Whether the results are based on one- or two-sided tests
- In medical situations, tests are (almost) always two-sided
- Whether the software gives the number per group or the total number
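Why these two checks matter can be illustrated with the blood pressure example. The sketch below assumes the usual textbook variant for a one-sided test, which uses z_α instead of z_α/2. That variant is not given in these notes, and the function name `n_per_group` is made up for the illustration.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(sd, diff, alpha=0.05, power=0.80, two_sided=True):
    """Patients per group for a continuous outcome (normal approximation)."""
    std_normal = NormalDist()
    tail = alpha / 2 if two_sided else alpha   # assumption: one-sided test uses z_alpha
    z_a = std_normal.inv_cdf(1 - tail)
    z_b = std_normal.inv_cdf(power)
    return ceil(2 * (z_a + z_b) ** 2 * sd**2 / diff**2)


two_sided = n_per_group(sd=10, diff=5)                   # 63 per group
one_sided = n_per_group(sd=10, diff=5, two_sided=False)  # 50 per group
print("two-sided:", two_sided, "per group,", 2 * two_sided, "in total")
print("one-sided:", one_sided, "per group,", 2 * one_sided, "in total")
```

A program that silently assumes a one-sided test, or that reports the total instead of the number per group, would therefore give a very different number for the same trial.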
Continuous versus binary
Whether the continuous or binary formula needs to be used depends on the situation:
- Mean systolic blood pressure between 2 groups → continuous
- The (mean) number of seizures per patient in group A versus group B → continuous
- Percentage of patients with at least 1 seizure per week → binary
- Per patient, the outcome can only have 2 values (yes or no)
- The proportion of patients with a score between 3 and 7 on a survey → binary
95% confidence interval
The 95% CI indicates that if the study were repeated many times, about 95% of the resulting intervals would contain the true effect.
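This interpretation can be illustrated with a small simulation. The sketch below repeats a hypothetical trial many times and counts how often the 95% CI for the difference in means contains the true difference. All numbers (true difference, standard deviation, group size, number of repetitions) are assumptions chosen for the illustration, and the standard deviation is treated as known so that the interval is simply the observed difference ± 1,96 standard errors.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

random.seed(1)  # fixed seed so the sketch is reproducible

TRUE_DIFF, SD, N = 5.0, 10.0, 64   # hypothetical true effect, SD and patients per group
REPEATS = 2000                     # number of simulated trials

z = NormalDist().inv_cdf(0.975)    # about 1.96
se = SD * sqrt(2 / N)              # standard error of the difference (SD assumed known)


def interval_contains_truth():
    """Simulate one trial and check whether its 95% CI contains the true difference."""
    group_a = [random.gauss(TRUE_DIFF, SD) for _ in range(N)]
    group_b = [random.gauss(0.0, SD) for _ in range(N)]
    diff = mean(group_a) - mean(group_b)
    return diff - z * se <= TRUE_DIFF <= diff + z * se


coverage = sum(interval_contains_truth() for _ in range(REPEATS)) / REPEATS
print(f"{coverage:.1%} of the intervals contain the true difference")
```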