HC10+11: Regression analysis
Mean and standard deviation
Statistics consists of making statements about a population based on data observed from a sample. This is often done using means and standard deviations (σ). The bigger the standard deviation, the bigger the spread in the population.
For example, the lung function (FEV1 in L) of 40 children is measured:
- Mean FEV1 = 3,16 L
- σ = 0,41 L
This means that roughly 95% of the population has an FEV1 between 3,16 – 0,82 L and 3,16 + 0,82 L → approximately 95% of observations lie less than 2σ from the mean:
- 95% reference range = (2,34 L, 3,98 L)
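The mean ± 2σ rule can be checked with a few lines of Python. This is a minimal sketch; the function name `reference_range` is an illustrative choice and not part of the lecture, and the rule assumes the values are roughly normally distributed.

```python
def reference_range(mean, sd, k=2.0):
    """Return (lower, upper) = mean -/+ k * sd.

    With k = 2 this is the approximate range that contains about 95%
    of the observations, assuming a roughly normal distribution.
    """
    return mean - k * sd, mean + k * sd


# Mean and standard deviation of FEV1 (in litres) from the example.
lower, upper = reference_range(3.16, 0.41)
print(f"{lower:.2f} L to {upper:.2f} L")
```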
However, lung function depends on many factors such as age and gender. These factors also need to be taken into account.
Linear regression
Simple linear regression is regression for a continuous outcome. It tries to predict or explain one variable → the outcome or dependent variable (y). This variable is explained by another variable → the explanatory variable (x). A regression line is fitted to a scatter plot and gives the mean value of “y” for each value of “x”:
- y = the dependent variable, also called the outcome or response variable
- x = the independent variable, also called the covariate, risk factor, predictor or explanatory variable
- Mean y = β0 + β1x
- β0 = the intercept (the “constant”)
- The predicted value of “y” if “x” is equal to 0
- Not always clinically meaningful
- β1 = the slope (the “gradient”)
- The expected change in the outcome when the exposure “x” increases by 1 unit: an increase if β1 is positive, a decrease if β1 is negative
For instance, a regression line can describe the mean FEV1 as function of age:
- Mean FEV1 = 2,281 + 0,119 × age
This means that for 2 children with an age difference of 1 year, the expected mean difference in the FEV1 is 0,119 L.
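The interpretation of the slope can be illustrated by evaluating the line at two ages that differ by one year. This is only a sketch: the function name `mean_fev1` and the ages 8 and 9 are chosen for illustration, while the coefficients are the ones from the example.

```python
def mean_fev1(age_years, b0=2.281, b1=0.119):
    """Predicted mean FEV1 (L) at a given age, using the example line."""
    return b0 + b1 * age_years


# Two children whose ages differ by exactly one year.
difference = mean_fev1(9) - mean_fev1(8)
print(f"{difference:.3f} L")  # the difference equals the slope b1
```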
Error/residual:
The observations (x1, y1), (x2, y2), …, (xn, yn) are pairs, each holding the values of 1 person. Individual observations do not lie exactly on the regression line, so an error term is added to the model:
- y = β0 + β1x + e
The deviations of the observations from the regression line are called residuals; they are captured by the error term e. The error is assumed to be normally distributed with mean 0 and standard deviation σ. σ indicates how much the observations vary around the regression line:
- Small σ: all observations are close to the regression line
- Large σ: some observations are far from the regression line
The residual is the distance from a single observation to the regression line → the difference between what is observed and what is predicted:
- ei = yi – (β0 + β1xi)
Least squares method:
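The unknown true regression line in the population is y = β0 + β1x. Using the least squares method, it is estimated by y = b0 + b1x, where b0 and b1 are the values that minimize the sum of squared residuals:
- $\sum_{i=1}^{n} \left(y_i - (b_0 + b_1 x_i)\right)^2$
- $b_1 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
- $b_0 = \bar{y} - b_1 \bar{x}$
- $s = \sqrt{\dfrac{\sum_{i=1}^{n} \left(y_i - (b_0 + b_1 x_i)\right)^2}{n - 2}}$
- s is an estimate for σ, the standard deviation around the regression line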
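As a rough sketch of the least squares method, the code below uses the standard closed-form solution: the slope is the sum of cross-products of the deviations of x and y from their means divided by the sum of squared deviations of x, the intercept follows from the two means, and s is based on the residuals with n – 2 degrees of freedom. The data are made up for illustration and the function name `least_squares` is an assumption, not something from the lecture.

```python
import math


def least_squares(x, y):
    """Fit y = b0 + b1*x by least squares.

    Returns (b0, b1, s), where s estimates the standard deviation
    of the observations around the fitted line (n - 2 degrees of freedom).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual = observed value minus value predicted by the fitted line.
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
    return b0, b1, s


# Made-up ages (years) and FEV1 values (L), for illustration only.
ages = [6, 7, 8, 9, 10]
fev1 = [3.0, 3.1, 3.3, 3.3, 3.5]
b0, b1, s = least_squares(ages, fev1)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, s = {s:.3f}")
```

In practice a statistics package such as SPSS reports these quantities directly; the hand-written version is only meant to show what is being computed.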
95% confidence interval:
Because research is usually based on a sample, b0 and b1 are not exact. The standard error expresses the uncertainty in the estimates b0 and b1 (se(b0) and se(b1)) and is used to make confidence intervals for the true unknown β0 and β1. The approximate 95% CI for β1 can be calculated as follows:
- (b1 – 2 × se(b1), b1 + 2 × se(b1)) → we can be 95% confident that the true β1 lies in this interval
If 0 lies within the 95% CI, the data are compatible with no association. The 95% CI for the slope of age in the FEV1 example is:
- (0.119 – 2×0.011, 0.119 + 2×0.011) = (0.097, 0.141) → a slope of 0, i.e. no association between age and FEV1, is very unlikely
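The interval for the slope can be reproduced with a short calculation. This is a sketch using the approximate multiplier of 2 from the text; the helper name `approx_ci95` is made up for illustration.

```python
def approx_ci95(estimate, se):
    """Approximate 95% CI: estimate -/+ 2 * standard error."""
    return estimate - 2 * se, estimate + 2 * se


# Slope of age and its standard error from the FEV1 example.
low, high = approx_ci95(0.119, 0.011)
contains_zero = low <= 0 <= high
print(f"({low:.3f}, {high:.3f}), contains 0: {contains_zero}")
```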
The 95% confidence interval for the mean y = β0 + β1x at a given value of “x” can be calculated as follows:
- (b0 + b1x – 2 × se(b0 + b1x), b0 + b1x + 2 × se(b0 + b1x))
- se(b0 + b1x) can be calculated in SPSS
If the 95% CI of the regression line is known, the true regression line is likely to lie between these bounds.
Standard deviation versus standard error:
The standard deviation is often mixed up with the standard error:
- Standard deviation: a measure of variability in the population → indicates how much the FEV1 values in children vary
- Standard error: a measure of precision of an estimate (sample mean or estimated slope of the regression line) → used to calculate the 95% CIs
Prediction:
The expected FEV1 of a 6-year-old child according to the formula is:
- 2,281 + 0,119 × 6 = 2,995 L
There are 2 sources of variation:
- Imprecision in the estimated regression line: se(b0+ b1x)
- Spread around regression line σ
Combining these gives the 95% reference or prediction interval for a new observation → the interval within which 95% of the values in the population fall. For a 6-year-old child, FEV1 values between 2,6 and 3,5 L are considered normal.
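The two sources of variation can be made concrete with the standard textbook formulas for simple linear regression, which the lecture itself does not spell out: the standard error of the estimated mean at x0 is s·√(1/n + (x0 – x̄)²/Sxx), and for a new observation a 1 is added under the square root to account for the spread around the line. The sketch below uses made-up data and a hypothetical function name; with so few observations the multiplier 2 is only a crude approximation.

```python
import math


def fit_and_standard_errors(x, y, x0):
    """Return (fitted, se_mean, se_pred) at x0 for a least-squares line.

    se_mean: standard error of the estimated mean b0 + b1*x0.
    se_pred: standard error for a new observation at x0, which adds the
             spread s around the line to the imprecision of the line.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s = math.sqrt(ss_res / (n - 2))
    imprecision = 1 / n + (x0 - x_bar) ** 2 / sxx
    se_mean = s * math.sqrt(imprecision)
    se_pred = s * math.sqrt(1 + imprecision)
    return b0 + b1 * x0, se_mean, se_pred


# Made-up ages (years) and FEV1 values (L), for illustration only.
ages = [6, 7, 8, 9, 10]
fev1 = [3.0, 3.1, 3.3, 3.3, 3.5]
fitted, se_mean, se_pred = fit_and_standard_errors(ages, fev1, x0=6)
ci_mean = (fitted - 2 * se_mean, fitted + 2 * se_mean)
prediction_interval = (fitted - 2 * se_pred, fitted + 2 * se_pred)
print(ci_mean, prediction_interval)
```

The prediction interval is always wider than the confidence interval for the mean, because it includes the spread of individual children around the line.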
Assumptions:
Simple linear regression relies on some assumptions:
- Linearity
- The scatterplot needs to be checked
- It is assumed that the relation between “x” and “y” is linear
- Nearly normal residuals
- Constant variability: homoscedasticity
- σ is constant
- This often isn’t a problem if the sample size is large → the estimates, standard errors, 95% CIs and p-values are still valid
- If the “y” variable is very skewed, it may be log transformed
- Independent observations
- How the data was collected needs to be checked
Residual plot:
The residual plot is the plot of the residuals against the predicted values. It is used to check whether the assumptions hold. A good residual plot shows no clear pattern; a pattern points to a deviation from the model:
- Dots whose spread widens or narrows across the plot (funnel shape) → no constant variability
- Dots taking the shape of a parabola → no linear relation
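To see how a parabola shows up in the residuals, a straight line can be fitted to data that are deliberately curved. This is a small sketch with made-up numbers (y = x²); the function name `line_residuals` is hypothetical.

```python
def line_residuals(x, y):
    """Fit a least-squares line and return (predicted, residuals)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    predicted = [b0 + b1 * xi for xi in x]
    residuals = [yi - pi for yi, pi in zip(y, predicted)]
    return predicted, residuals


# Deliberately non-linear data: y = x squared.
x = [1, 2, 3, 4, 5]
y = [xi ** 2 for xi in x]
predicted, residuals = line_residuals(x, y)
for p, e in zip(predicted, residuals):
    print(f"predicted {p:6.1f}   residual {e:5.1f}")
```

The residuals are positive at both ends and negative in the middle, which is the parabola shape that signals a non-linear relation.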
Categorical variables:
If x is a binary categorical variable, it is coded as 1 or 0, for example if x indicates asthma treatment:
- x = 0 → no treatment
- x = 1 → treatment
In this case, x can be taken as an independent variable in the regression model of the FEV1 of children. The FEV of treated children is on average 0,266 L larger with a p-value of 0,036 → there is a statistically significant difference between treated and untreated children.
The increase in the mean FEV between untreated (x = 0) and treated (x = 1) children is 0,226 → the slope of the regression line. Because the mean of the treated and untreated children is compared, this is equivalent to an unpaired t-test.
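That the slope for a 0/1 variable equals the difference between the two group means can be verified numerically. The sketch below uses made-up FEV values, not the data from the lecture, and the helper names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Made-up data: x = 0 (no treatment) or x = 1 (treatment), FEV in litres.
treated = [0, 0, 0, 1, 1, 1]
fev = [2.9, 3.0, 3.1, 3.2, 3.3, 3.4]

b0, b1 = fit_line(treated, fev)
mean_untreated = mean([f for f, t in zip(fev, treated) if t == 0])
mean_treated = mean([f for f, t in zip(fev, treated) if t == 1])

print(f"slope b1            = {b1:.3f}")
print(f"difference in means = {mean_treated - mean_untreated:.3f}")
```

The intercept b0 is the mean of the untreated group and b0 + b1 is the mean of the treated group.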
Multiple regression
Multiple linear regression is linear regression with more than one explanatory variable. It describes the influence of several explanatory variables on the response:
- How does the average “y” vary as a function of x1, x2, ..., xp?
- Can “y” be predicted if x1, x2, ..., xp are known?
- What is the influence of x1 on “y”, corrected for x2, ..., xp?
- Which combination of x’s is related to “y”?
Multiple regression can be used to:
- Control for confounders
- Build a prediction model
- By adding extra information to the model to make a better guess
- E.g. age
- Increase the precision
- By adding more information, fewer patients are needed to obtain the same precision for the treatment effect
Calculations:
The mean FEV1 is obtained with the formula 2,281 + 0,119 × age. This formula changes if height is added as an explanatory variable to the model:
- Mean FEV1 = 1,711 + (0,058 × age) + (0,008 × height)
- Of 2 children with the same height, the child who is 1 year older has on average 0,058 L more FEV1
- Of 2 children with the same age, the child who is 1 cm taller has on average 0,008 L more FEV1
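The "other variable kept fixed" interpretation can be checked by evaluating the two-variable formula directly. This is a sketch: the function name and the example values (ages 8 and 9, heights 130 and 131 cm) are assumptions chosen for illustration, while the coefficients come from the model above.

```python
def mean_fev1_age_height(age_years, height_cm):
    """Predicted mean FEV1 (L) from age and height, using the example model."""
    return 1.711 + 0.058 * age_years + 0.008 * height_cm


# Same height, one year older: the difference is the age coefficient.
per_year = mean_fev1_age_height(9, 130) - mean_fev1_age_height(8, 130)

# Same age, one centimetre taller: the difference is the height coefficient.
per_cm = mean_fev1_age_height(8, 131) - mean_fev1_age_height(8, 130)

print(f"{per_year:.3f} L per year, {per_cm:.3f} L per cm")
```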
In short, in multiple regression:
- “y” is a numerical outcome
- The model has 2 independent variables (x1 and x2), with error e ~ N(0, σ²)
- The estimated regression equation is y = b0 + b1x1 + b2x2
- If x1 is increased by 1 unit while x2 is kept fixed → y = b0 + b1(x1 + 1) + b2x2
- The difference is b1
- This is the amount by which the mean of y increases if x1 increases by 1 unit and all other x’s are kept fixed
Testing and estimation:
Testing and estimation are done in the same way as in simple linear regression. Coefficients are estimated with the least squares method, and standard errors, confidence intervals and p-values can be calculated. In the FEV example, after correction for height, the relation between age and FEV isn’t significant anymore.
In short, if age is added to the model of FEV on treatment, the following is visible for the treatment effect:
- The direction of the effect changes
- The effect is very small and no longer statistically significant
- Age is a confounder
- Young children have a lower FEV and are less often treated
- Adding age to the model adjusts for age
- Differences between treated and untreated for fixed ages should be considered
Confounding:
One of the main functions of multiple regression is to control for confounding. Confounding should be considered if the regression coefficient for a variable (e.g. treatment) changes if another variable (e.g. age) is added.
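Confounding can be illustrated with a deliberately artificial data set in which FEV depends on age only, while older children are treated more often. The sketch below is not the lecture's data: the numbers, the function name `fit_two_predictors` and the use of the closed-form least squares solution for two predictors are all assumptions made for the example.

```python
def mean(values):
    return sum(values) / len(values)


def fit_two_predictors(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 (closed form, 2 predictors)."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


# Artificial data: FEV is determined by age alone (no true treatment effect),
# and the treated children are on average older.
age = [6, 6, 6, 8, 8, 10, 10, 10]
treated = [0, 0, 0, 0, 1, 1, 1, 1]
fev = [2.281 + 0.119 * a for a in age]

# Crude (unadjusted) treatment effect: difference between the group means.
crude = mean([f for f, t in zip(fev, treated) if t == 1]) - mean(
    [f for f, t in zip(fev, treated) if t == 0]
)

# Treatment effect adjusted for age.
b0, b_treatment, b_age = fit_two_predictors(treated, age, fev)

print(f"crude effect    = {crude:.3f} L")
print(f"adjusted effect = {b_treatment:.3f} L")
```

The crude effect is clearly positive, but it disappears once age is in the model: the regression coefficient for treatment changes when age is added, which is the signal of confounding described above.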
Functions of regression
Linear and multiple regression have different uses:
- Linear regression
- To predict: e.g. what is the mean FEV for a 7-year-old child who is 1,30m tall and doesn’t use any medication?
- To correct for confounders: e.g. what is the effect of treatment on FEV, after adjustment for age?
- Multiple regression
- Increases precision of randomized trials → adjusts for the variability due to important risk variables → the σ around the regression line becomes smaller
Predictions shouldn’t be made outside the range of the sample → in extrapolation, the regression line may be different.
Types of regression models
There are different types of regression models for different types of outcomes:
- Numerical outcomes → linear or non-linear regression
- Binary outcome → logistic regression
- A 0-1, success/failure outcome
- Survival data → proportional hazard model (Cox regression)
Cox proportional hazards:
Cox proportional hazards (Cox PH) regression is a regression method for adjusted analysis of survival data. It assumes that there is a baseline hazard h0(t) in a reference group, and a hazard ratio (HR) which increases or decreases that hazard:
- h1(t) = h0(t) × HR
Here the HR may depend on covariates, but not on time (t) → the ratio between the hazards does not change over time.
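The proportional hazards assumption can be sketched in code: whatever shape the baseline hazard has, the hazard in the other group is a fixed multiple of it. The baseline hazard function and the HR of 1.5 below are hypothetical values chosen only for illustration.

```python
def hazard_exposed(baseline_hazard, hazard_ratio, t):
    """h1(t) = h0(t) * HR under the proportional hazards assumption."""
    return baseline_hazard(t) * hazard_ratio


def h0(t):
    # Hypothetical baseline hazard that increases over time (illustration only).
    return 0.01 * (1 + t)


hr = 1.5
times = [0, 1, 5, 10]
# The hazards themselves change over time, but their ratio stays equal to HR.
ratios = [hazard_exposed(h0, hr, t) / h0(t) for t in times]
print(ratios)
```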
Blok AWV2 2020/2021 UL
This bundle contains all lecture notes from the AWV block in the second year of the Bachelor of Medicine at Leiden University. Notes from the tutorials are also incorporated into the summaries.