Blok AWV HC10+11: Regression analysis

Mean and standard deviation

Statistics consists of making statements about a population based on data observed from a sample. This is often done using means and standard deviations (σ). The bigger the standard deviation, the bigger the spread in the population.

For example, the lung function (FEV1 in L) of 40 children is measured:

Mean FEV1 = 3.16 L
σ = 0.41 L

This means that roughly 95% of the population has an FEV1 between 3.16 – 0.82 and 3.16 + 0.82 L → approximately 95% of observations are less than 2σ from the mean:

95% reference interval = (2.34 L, 3.98 L)
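
The "mean ± 2σ" rule can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name two_sd_range is made up for this illustration, and the inputs are the mean and σ quoted above.

```python
# Values taken from the FEV1 example in the text.
mean_fev1 = 3.16  # sample mean in litres
sd_fev1 = 0.41    # standard deviation in litres


def two_sd_range(mean, sd):
    """Return the interval (mean - 2*sd, mean + 2*sd)."""
    return (mean - 2 * sd, mean + 2 * sd)


lower, upper = two_sd_range(mean_fev1, sd_fev1)
print(f"Roughly 95% of children: {lower:.2f} L to {upper:.2f} L")
```

The factor 2 is the usual rounding of 1.96 and assumes that FEV1 is approximately normally distributed.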

However, lung function depends on many factors such as age and gender. These factors also need to be taken into account.

Linear regression

Simple linear regression is regression for a continuous outcome with a single explanatory variable. It tries to predict or explain one variable → the outcome or dependent variable (y). This variable is explained by another variable → the explanatory variable (x). A regression line is fitted to a scatter plot and gives the mean value of “y” for each value of “x”:

y = the dependent variable, also called the outcome or response variable
x = the independent variable, also called the covariate, risk factor, predictor or explanatory variable
Mean y = β₀ + β₁x
- β₀ = the intercept (constant)
  - The predicted mean value of “y” if “x” is equal to 0
    - Not always clinically meaningful
- β₁ = the slope (gradient)
  - The expected change in the mean outcome when “x” increases by 1 unit
    - An increase if β₁ is positive, a decrease if β₁ is negative

For instance, a regression line can describe the mean FEV1 as function of age:

Mean FEV1 = 2.281 + 0.119 × age

This means that for 2 children with an age difference of 1 year, the expected difference in mean FEV1 is 0.119 L.
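
As a small illustration, the regression line can be written as a Python function. The coefficients are the ones quoted above; the function name mean_fev1 and the ages 8 and 9 are arbitrary choices for this sketch.

```python
B0 = 2.281  # intercept in litres (predicted mean FEV1 at age 0)
B1 = 0.119  # slope in litres per year of age


def mean_fev1(age_years):
    """Mean FEV1 in litres predicted by the fitted regression line."""
    return B0 + B1 * age_years


# Two children who differ by exactly 1 year in age.
diff = mean_fev1(9) - mean_fev1(8)
print(f"Difference per year of age: {diff:.3f} L")
```

Whichever two ages are chosen, a 1-year difference always gives the slope, because the model is a straight line.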

Error/residual:

The observations (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) are pairs in which each pair holds the values of 1 person. Individual observations do not lie exactly on the regression line, so an error term is added to the model:

y = β₀ + β₁x + e

The deviations from the regression line are called residuals and are captured by the error term. The error/residual is assumed to be normally distributed with mean 0 and standard deviation σ. σ indicates how much the observations vary around the regression line:

Small σ: all observations are close to the regression line
Large σ: some observations are far from the regression line

The residual is the distance from a single observation to the regression line → the difference between what is observed and what is predicted:

e_i = y_i – (β₀ + β₁x_i)

Least squares method:

The unknown true regression line in the population is y = β₀ + β₁x. Using the least squares method, this line is estimated by y = b₀ + b₁x. The b₀ and b₁ that minimize the sum of squared residuals are selected, where x̄ and ȳ are the sample means of x and y:

∑(y_i – (b₀ + b₁x_i))²
- b₁ = ∑(x_i – x̄)(y_i – ȳ) / ∑(x_i – x̄)²
- b₀ = ȳ – b₁x̄
- s = √(∑(y_i – (b₀ + b₁x_i))² / (n – 2))
  - s is an estimate for σ, the standard deviation around the regression line
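
A minimal Python sketch of the least squares estimates, using the standard closed-form expressions for b₁, b₀ and s. The data are invented purely for illustration and are not the children from the lecture; the function name least_squares is also an assumption of this sketch.

```python
import math


def least_squares(xs, ys):
    """Return (b0, b1, s) for the line y = b0 + b1*x fitted by least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx                 # slope
    b0 = y_bar - b1 * x_bar        # intercept
    # Sum of squared residuals around the fitted line.
    rss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(rss / (n - 2))   # estimate of sigma
    return b0, b1, s


# HYPOTHETICAL ages (years) and FEV1 values (litres).
ages = [6, 7, 8, 9, 10, 11]
fev1 = [2.9, 3.2, 3.2, 3.4, 3.4, 3.7]

b0, b1, s = least_squares(ages, fev1)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, s = {s:.3f}")
```

The divisor n – 2 is used for s because 2 parameters (b₀ and b₁) are estimated from the data.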

95% confidence interval:

Because research is usually based on a sample, b₀ and b₁ are not exact. The standard error is the uncertainty in the estimates b₀ and b₁ (se(b₀) and se(b₁)), which is used to make confidence intervals for the true unknown β₀ and β₁. The approximate 95% CI for β₁ can be calculated as follows:

(b₁ – 2 × se(b₁), b₁ + 2 × se(b₁)) → it is 95% sure that the true β₁ lies in this interval

If 0 lies in the 95% CI, the data are compatible with no association, so the association is not statistically significant at the 5% level. In the FEV1 example, the 95% CI for the slope of age is:

(0.119 – 2×0.011, 0.119 + 2×0.011) = (0.097, 0.141) → a slope of 0 (no association between age and FEV1) is very unlikely
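
The same calculation as a short Python sketch. The helper name approx_ci95 is made up for this example; the slope and standard error are the values from the FEV1 example.

```python
def approx_ci95(estimate, se):
    """Approximate 95% CI: estimate plus or minus 2 standard errors."""
    return (estimate - 2 * se, estimate + 2 * se)


low, high = approx_ci95(0.119, 0.011)
contains_zero = low <= 0 <= high

print(f"95% CI for the slope: ({low:.3f}, {high:.3f})")
print("0 inside the interval:", contains_zero)
```

Checking whether 0 falls inside the interval is the quick way to judge whether the data are compatible with no association.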

The 95% confidence interval for mean y = β₀ + β₁x for a given value of “x” can be calculated as follows:

(b₀ + b₁x – 2 × se(b₀ + b₁x), b₀ + b₁x + 2 × se(b₀ + b₁x))
- se(b₀ + b₁x) can be calculated in SPSS

If the 95% CI of a regression line is known, the true regression line is likely to lie between these bounds.

Standard deviation versus standard error:

The standard deviation is often mixed up with the standard error:

Standard deviation: a measure of variability in the population → indicates how much the FEV1 values in children vary
Standard error: a measure of the precision of an estimate (e.g. the sample mean or the estimated slope of the regression line) → used to calculate 95% CIs
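
To make the difference concrete, the sketch below computes the standard error of the sample mean from the standard deviation, using the standard relation se = σ / √n. That formula is not stated in these notes and is added here as an assumption; the numbers are the 40 children from the FEV1 example.

```python
import math

sd = 0.41    # standard deviation of FEV1 in litres (spread between children)
n = 40       # number of children in the sample
mean = 3.16  # sample mean FEV1 in litres

# Standard error of the sample mean: precision of the estimated mean.
se_mean = sd / math.sqrt(n)

# Approximate 95% CI for the population mean (not for individual children).
ci_mean = (mean - 2 * se_mean, mean + 2 * se_mean)

print(f"sd = {sd:.2f} L, se of the mean = {se_mean:.3f} L")
print(f"95% CI for the mean: ({ci_mean[0]:.2f}, {ci_mean[1]:.2f}) L")
```

The standard error shrinks as the sample grows, while the standard deviation describes the population and does not.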

Prediction:

The expected FEV1 of a 6-year-old child according to the formula is:

2.281 + 0.119 × 6 = 2.995 L

There are 2 sources of variation:

Imprecision in the estimated regression line: se(b₀ + b₁x)
Spread around the regression line: σ

Combining these gives the 95% reference or prediction interval for a new observation → the interval in which 95% of the values in the population fall. For a 6-year-old child, FEV1 values between 2.6 and 3.5 L are considered normal.
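
One common way to combine the 2 sources of variation is to add them on the variance scale: se(new observation) = √(se(b₀ + b₁x)² + s²). The sketch below assumes that formula. The values of se_line and s are HYPOTHETICAL, because the notes do not report them, so the output is not expected to reproduce the interval quoted above.

```python
import math


def prediction_interval95(predicted, se_line, s):
    """Approximate 95% prediction interval for a single new observation."""
    # Combine the uncertainty of the line and the spread around the line.
    se_new = math.sqrt(se_line ** 2 + s ** 2)
    return (predicted - 2 * se_new, predicted + 2 * se_new)


predicted_fev1 = 2.281 + 0.119 * 6  # expected FEV1 of a 6-year-old (from the text)

se_line = 0.05  # HYPOTHETICAL se(b0 + b1*x) at age 6
s = 0.20        # HYPOTHETICAL spread around the regression line

low, high = prediction_interval95(predicted_fev1, se_line, s)
print(f"Predicted: {predicted_fev1:.3f} L, interval: ({low:.2f}, {high:.2f}) L")
```

The prediction interval is always wider than the confidence interval for the mean at the same age, because the spread s does not shrink with a larger sample.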

Assumptions:

Simple linear regression relies on some assumptions:

Linearity
- The scatterplot needs to be checked
- It is assumed that the relation between “x” and “y” is linear
Nearly normal residuals
Constant variability: homoscedasticity
- σ is constant
- This often isn’t a problem if the sample size is large → the estimate, se, 95% CI and p-value are still valid
- If the “y” variable is very skewed, it may be log-transformed
Independent observations
- How the data was collected needs to be checked

Residual plot:

The residual plot is the plot of the predicted values versus the residuals. It is used to check whether the assumptions hold. A residual plot shouldn’t show a clear pattern and can be used to detect deviations from the model:

Spread of the dots changes with the predicted value (e.g. a funnel shape) → no constant variability
Dots taking the shape of a parabola → no linear relation
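
The parabola pattern can be reproduced with a small Python sketch: a straight line is fitted to deliberately curved, made-up data, and the residuals are listed next to the predicted values. The helper fit_line and the data are assumptions for this illustration only.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1


# HYPOTHETICAL data with a curved (quadratic) relation between x and y.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
predicted = [b0 + b1 * x for x in xs]
residuals = [y - p for y, p in zip(ys, predicted)]

# Text version of the residual plot: predicted value versus residual.
for p, r in zip(predicted, residuals):
    print(f"predicted = {p:6.1f}   residual = {r:5.1f}")
```

The residuals are positive at both ends and negative in the middle, which is the parabola shape that signals a non-linear relation.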

Categorical variables:

If x is a categorical variable with 2 categories, it is coded as 0 or 1, for example if x indicates asthma treatment:

x = 0 → no treatment
x = 1 → treatment

In this case, x can be taken as an independent variable in the regression model of the FEV1 of children. The FEV of treated children is on average 0.266 L larger, with a p-value of 0.036 → there is a statistically significant difference between treated and untreated children.

The increase in the mean FEV between untreated (x = 0) and treated (x = 1) children is 0.226 L → the slope of the regression line. Because the mean of the treated and untreated children is compared, this is equivalent to an unpaired t-test.
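
That the slope equals the difference in group means can be verified with a short sketch. The 8 FEV1 values are HYPOTHETICAL and chosen only to show the equivalence; fit_line is a helper written for this example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least squares intercept and slope for y = b0 + b1*x."""
    x_bar = mean(xs)
    y_bar = mean(ys)
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1


# HYPOTHETICAL data: x = 0 (no treatment) or x = 1 (treatment).
treated = [0, 0, 0, 0, 1, 1, 1, 1]
fev1 = [2.8, 3.0, 3.1, 2.9, 3.2, 3.3, 3.1, 3.4]

b0, b1 = fit_line(treated, fev1)
mean_untreated = mean([y for x, y in zip(treated, fev1) if x == 0])
mean_treated = mean([y for x, y in zip(treated, fev1) if x == 1])

print(f"intercept b0 = {b0:.3f}, mean untreated = {mean_untreated:.3f}")
print(f"slope b1 = {b1:.3f}, difference in means = {mean_treated - mean_untreated:.3f}")
```

With 0/1 coding, the intercept is the mean of the untreated group and the slope is the difference between the 2 group means.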

Multiple regression

Multiple linear regression is linear regression with more than 1 explanatory variable. It describes the influence of several explanatory variables on the response:

How does the average “y” vary as a function of x₁, x₂, ..., xₚ?
Can “y” be predicted if x₁, x₂, ..., xₚ are known?
What is the influence of x₁ on “y”, corrected for x₂, ..., xₚ?
Which combination of x’s is related to “y”?

Multiple regression can be used to:

Control for confounders
Build a prediction model
- By adding extra information to the model to make a better guess
  - E.g. age
Increase the precision
- By adding more information, fewer patients are needed to obtain the same precision for the treatment effect

Calculations:

The mean FEV1 is obtained with the formula 2.281 + 0.119 × age. This formula changes if height is added as an explanatory variable to the model:

Mean (FEV1) = 1.711 + (0.058 × age) + (0.008 × height)
- If 2 children have the same height, the child who is 1 year older has on average a 0.058 L higher FEV
- If 2 children have the same age, the child who is 1 cm taller has on average a 0.008 L higher FEV
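
The interpretation of the 2 coefficients can be checked by evaluating the fitted equation directly. The coefficients come from the text; the function name and the example child (8 years, 130 cm) are arbitrary choices for this sketch.

```python
def mean_fev1(age_years, height_cm):
    """Mean FEV1 in litres from the model with age and height."""
    return 1.711 + 0.058 * age_years + 0.008 * height_cm


# Same height, 1 year older.
age_effect = mean_fev1(9, 130) - mean_fev1(8, 130)

# Same age, 1 cm taller.
height_effect = mean_fev1(8, 131) - mean_fev1(8, 130)

print(f"1 year older at the same height: {age_effect:.3f} L")
print(f"1 cm taller at the same age: {height_effect:.3f} L")
```

Each coefficient is recovered only when the other variable is held fixed, which is what "corrected for" means in multiple regression.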

In short, in multiple regression:

“y” is a numerical outcome
The model has 2 independent variables (x₁ and x₂) with e ~ N(0, σ²)
The estimated regression equation is y = b₀ + b₁x₁ + b₂x₂
- If x₁ is increased by 1 unit while x₂ is kept fixed → y = b₀ + b₁(x₁ + 1) + b₂x₂
  - The difference is b₁
  - b₁ is the amount by which the mean of y increases if x₁ increases by 1 unit and all other x’s are kept fixed

Testing and estimation:

Testing and estimation are done in the same way as in simple linear regression. Coefficients are estimated with the least squares method, and standard errors, confidence intervals and p-values can be calculated for each coefficient. In the FEV example, after correction for height, the relation between age and FEV isn’t significant anymore.

In short, if age is added to the FEV model with treatment, the following is visible:

The direction of the treatment effect changes
- The effect is very small and no longer statistically significant
Age is a confounder
- Young children have a lower FEV and are less often treated
- Adding age to the model adjusts for age
  - Treated and untreated children are then compared at a fixed age

Confounding:

One of the main functions of multiple regression is to control for confounding. Confounding should be considered if the regression coefficient for a variable (e.g. treatment) changes if another variable (e.g. age) is added.
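
The sketch below illustrates confounding with HYPOTHETICAL data in which FEV1 depends only on age, while treated children happen to be older. To avoid external libraries, the age-adjusted coefficient is obtained by regressing the "age-removed" FEV1 on the "age-removed" treatment indicator (the Frisch-Waugh-Lovell approach), which gives the same coefficient as a multiple regression on treatment and age. This is a swapped-in calculation for illustration, not necessarily how the lecture computed it.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for y = b0 + b1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
          / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b1 * x_bar, b1


def residuals(xs, ys):
    """Part of ys that is not explained by a straight line in xs."""
    b0, b1 = fit_line(xs, ys)
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# HYPOTHETICAL data: treated children are older; FEV1 depends on age only.
age = [6, 7, 8, 9, 8, 9, 10, 11]
treated = [0, 0, 0, 0, 1, 1, 1, 1]
fev1 = [2.0 + 0.1 * a for a in age]

# Crude effect: simple regression of FEV1 on treatment.
_, crude_effect = fit_line(treated, fev1)

# Adjusted effect: remove age from both variables, then regress.
_, adjusted_effect = fit_line(residuals(age, treated), residuals(age, fev1))

print(f"crude treatment effect: {crude_effect:.3f} L")
print(f"age-adjusted treatment effect: {adjusted_effect:.3f} L")
```

The crude coefficient suggests a treatment effect that disappears after adjustment, which is the change in coefficient that points to confounding by age.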

Functions of regression

Regression has several uses:

To predict
- E.g. what is the mean FEV for a 7-year-old child who is 1.30 m tall and doesn’t use any medication?
To correct for confounders
- E.g. what is the effect of treatment on FEV, after adjustment for age?
To increase the precision of randomized trials
- Adjusting for important risk variables explains part of the variability → the σ around the regression line becomes smaller

Predictions shouldn’t be made outside the range of the sample → the regression line may be different there (extrapolation).

Types of regression models

There are different types of regression models for different types of outcomes:

Numerical outcomes → linear or non-linear regression
Binary outcome → logistic regression
- A 0-1, success/failure outcome
Survival data → proportional hazard model (Cox regression)

Cox proportional hazards:

Cox proportional hazards (Cox PH) is a regression method for adjusted analysis of survival data. The assumption is that there is a baseline hazard h₀(t) in a group, and a hazard ratio (HR) which increases or decreases the hazard:

h₁(t) = h₀(t) × HR

In this case, the HR may depend on covariates, but not on the time (t) → the hazards are proportional and their ratio does not change over time.
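
A minimal sketch of the proportional hazards idea. Both the baseline hazard function and the HR of 1.5 are HYPOTHETICAL and chosen only to show that the ratio of the 2 hazards stays the same at every time point, even though the hazards themselves change.

```python
def baseline_hazard(t):
    """HYPOTHETICAL baseline hazard h0(t) that rises with time."""
    return 0.01 + 0.002 * t


HR = 1.5  # HYPOTHETICAL hazard ratio, constant over time


def hazard_exposed(t):
    """Hazard in the comparison group: h1(t) = h0(t) * HR."""
    return baseline_hazard(t) * HR


time_points = [1, 5, 10]
ratios = [hazard_exposed(t) / baseline_hazard(t) for t in time_points]

for t, r in zip(time_points, ratios):
    print(f"t = {t:2d}: h0 = {baseline_hazard(t):.3f}, "
          f"h1 = {hazard_exposed(t):.3f}, ratio = {r:.2f}")
```

If the ratio changed with t, the proportional hazards assumption would be violated and a single HR would no longer summarize the effect.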

This content is used in:

Blok AWV2 2020/2021 UL

Author: nathalievlangen
