Blok AWV HC10+11: Regression analysis

Mean and standard deviation

Statistics consists of making statements about a population based on data observed from a sample. This is often done using means and standard deviations (σ). The bigger the standard deviation, the bigger the spread in the population.

For example, the lung function (FEV1 in L) of 40 children is measured:

  • Mean FEV1 = 3,16 L
  • σ = 0,41 L

This means that roughly 95% of the population has an FEV1 between 3,16 – 0,82 L and 3,16 + 0,82 L, because approximately 95% of observations lie less than 2σ from the mean:

  • 95% range = (2,34 L, 3,98 L)
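
The interval above is simply the mean plus or minus two standard deviations. A minimal sketch of that arithmetic in Python, using the values from the example (the function name is only chosen for illustration):

```python
def two_sigma_range(mean, sd):
    """Return the interval that covers roughly 95% of a normal population."""
    return (mean - 2 * sd, mean + 2 * sd)

# Mean FEV1 and standard deviation in litres, taken from the example.
low, high = two_sigma_range(3.16, 0.41)
print(f"{low:.2f} L to {high:.2f} L")  # prints "2.34 L to 3.98 L"
```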

However, lung function depends on many factors such as age and gender. These factors also need to be taken into account.

Linear regression

Simple linear regression is regression for continuous outcomes with a single explanatory variable. Linear regression tries to predict or explain a variable → the outcome or dependent variable (y). This variable is explained by another variable → the explanatory variable (x). A regression line is fitted to the points of a scatter plot and gives the mean value of “y” for each value of “x”:

  • y = the dependent variable, also called outcome or response variable
  • x = the independent variable, also called covariate, risk factor, predictor or explanatory variable
  • Mean y = β0 + β1x
    • β0 = the intercept (“constant”)
      • The predicted value of “y” if “x” is equal to 0
        • Not always clinically meaningful
    • β1 = the slope (“gradient”)
      • The expected change in the outcome when the exposure increases by 1 unit
        • An increase if β1 is positive, a decrease if β1 is negative

For instance, a regression line can describe the mean FEV1 as function of age:

  • Mean FEV1 = 2,281 + 0,119 x age

This means that for 2 children with an age difference of 1 year, the expected mean difference in the FEV1 is 0,119 L.
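
The interpretation of the slope can be checked by evaluating the regression line at two ages that are 1 year apart. A small sketch with the coefficients from the example (the ages 8 and 9 are arbitrary):

```python
def mean_fev1(age):
    """Mean FEV1 in litres for a given age in years (coefficients from the example)."""
    return 2.281 + 0.119 * age

# Two children with an age difference of 1 year.
diff = mean_fev1(9) - mean_fev1(8)
print(round(diff, 3))  # the slope: 0.119
```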

Error/residual:

The observations are pairs (x1, y1), (x2, y2), …, (xn, yn), where each pair holds the values of 1 person. Individual observations do not lie exactly on the regression line, so an error term is added to the model:

  • y = β0 + β1x + e

The deviations of the observations from the regression line are called residuals; they are captured by the error term e. The error is assumed to be normally distributed with mean 0 and standard deviation σ. σ indicates how much the observations vary around the regression line:

  • Small σ: all observations are close to the regression line
  • Large σ: some observations are far from the regression line

The residual is the vertical distance from a single observation to the regression line → the difference between what is observed and what is predicted:

  • yi – (β0 + β1xi)

Least squares method:

The unknown true regression line in the population is y = β0 + β1x. Using the least squares method, this line is estimated by y = b0 + b1x, where b0 and b1 are the values that minimize the sum of squared residuals:

  • ∑(yi – (b0 + b1xi))²

    • b1 = ∑(xi – x̄)(yi – ȳ) / ∑(xi – x̄)²
    • b0 = ȳ – b1x̄
    • s = √( ∑(yi – (b0 + b1xi))² / (n – 2) )
      • s is an estimate for σ, the standard deviation around the regression line
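
The least squares estimates can be computed by hand. The sketch below uses the standard closed-form solution (slope = sum of cross-products divided by the sum of squares of x, intercept = ȳ minus slope times x̄) and divides the residual sum of squares by n – 2 to obtain s. The ages and FEV1 values are made up for illustration and are not the data from the lecture.

```python
from math import sqrt

def least_squares(x, y):
    """Return (b0, b1, s) for the simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                    # slope
    b0 = y_bar - b1 * x_bar           # intercept
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    s = sqrt(sum(r ** 2 for r in residuals) / (n - 2))  # estimate of sigma
    return b0, b1, s

# Hypothetical ages (years) and FEV1 values (L), for illustration only.
ages = [6, 7, 8, 9, 10]
fev1 = [3.0, 3.1, 3.3, 3.3, 3.5]
b0, b1, s = least_squares(ages, fev1)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, s = {s:.3f}")
```

With these made-up numbers the fitted line is close to the one in the example, but that is by construction and says nothing about the real data.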

95% confidence interval:

Because research is usually based on a sample, b0 and b1 are not exact. The standard error expresses the uncertainty in the estimates b0 and b1 (se(b0) and se(b1)) and is used to make confidence intervals for the true unknown β0 and β1. The approximate 95% CI for β1 can be calculated as follows:

  • (b1 – 2 × se(b1), b1 + 2 × se(b1)) → we can be 95% confident that the true β1 lies in this interval

If 0 lies in the 95% CI, the data are compatible with no association (a slope of 0). The 95% CI for the slope of age in the FEV1 example is:

  • (0.119 – 2×0.011, 0.119 + 2×0.011) = (0.097, 0.141) → a slope of 0 between age and FEV1 is very unlikely
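
The same calculation as a short Python sketch, using the approximate multiplier of 2 from the notes (the helper name is hypothetical):

```python
def approx_ci95(estimate, se):
    """Approximate 95% CI: estimate plus or minus 2 standard errors."""
    return (estimate - 2 * se, estimate + 2 * se)

# Slope of age and its standard error from the FEV1 example.
low, high = approx_ci95(0.119, 0.011)
contains_zero = low <= 0 <= high
print(f"({low:.3f}, {high:.3f}), contains 0: {contains_zero}")
# prints "(0.097, 0.141), contains 0: False"
```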

The 95% confidence interval for mean y = β0 + β1x for a given value of “x” can be calculated as follows:

  • (b0 + b1x – 2 × se(b0 + b1x), b0 + b1x + 2 × se(b0 + b1x))

    • se(b0 + b1x) can be calculated in SPSS

If the 95% CI of a regression line is known, the true regression line is likely to lie between these bounds.

Standard deviation versus standard error:

The standard deviation is often mixed up with the standard error:

  • Standard deviation: a measure of variability in the population → indicates how much the FEV1 values in children vary
  • Standard error: a measure of the precision of an estimate (e.g. the sample mean or the estimated slope of the regression line) → used to calculate 95% CIs

Prediction:

The expected FEV1 of a 6-year-old child according to the formula is:

  • 2,281 + 0,119 × 6 = 2,995 L

There are 2 sources of variation:

  • Imprecision in the estimated regression line: se(b0 + b1x)
  • Spread around the regression line: σ

Combining these gives the 95% reference or prediction interval for a new observation → the interval in which 95% of the values in the population fall. For a 6-year-old child, FEV1 values between 2,6 and 3,5 L are considered normal.
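
How the two sources of variation combine can be sketched as follows. The formulas for se(b0 + b1x) and for the prediction standard error are the usual textbook ones and are an assumption here, since the notes leave this calculation to SPSS. The multiplier 2 follows the approximation used in the notes, and the data are made up for illustration.

```python
from math import sqrt

def intervals_at(x, y, x0):
    """Fitted value, approximate 95% CI for the mean and 95% prediction interval at x0."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    s = sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    fit = b0 + b1 * x0
    se_fit = s * sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)  # imprecision of the line
    se_pred = sqrt(s ** 2 + se_fit ** 2)                # plus spread around the line
    ci = (fit - 2 * se_fit, fit + 2 * se_fit)
    pi = (fit - 2 * se_pred, fit + 2 * se_pred)
    return fit, ci, pi

# Hypothetical ages (years) and FEV1 values (L), for illustration only.
ages = [6, 7, 8, 9, 10]
fev1 = [3.0, 3.1, 3.3, 3.3, 3.5]
fit, ci, pi = intervals_at(ages, fev1, 6)
print(f"fit = {fit:.2f}, CI = ({ci[0]:.2f}, {ci[1]:.2f}), PI = ({pi[0]:.2f}, {pi[1]:.2f})")
```

The prediction interval is always wider than the confidence interval for the mean, because it adds the spread of individual children around the line.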

Assumptions:

Simple linear regression relies on some assumptions:

  • Linearity
    • The scatterplot needs to be checked
    • It is assumed that the relation between “x” and “y” is linear
  • Nearly normal residuals
  • Constant variability: homoscedasticity
    • σ is constant
    • This often isn’t a problem if the sample size is large → the estimated se, 95% CI and p-value are still valid
    • If the “y” variable is very skewed, it may be log transformed
  • Independent observations
    • How the data was collected needs to be checked

Residual plot:

The residual plot is the plot of the residuals against the predicted values. It is used to check whether the assumptions hold. A good residual plot shows no clear pattern; patterns point to deviations from the model assumptions:

  • Spread of the dots changes with the predicted value (e.g. a funnel shape) → no constant variability
  • Dots taking the shape of a parabola → no linear relation
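
The parabola pattern can be reproduced with a small sketch: a straight line is fitted to deliberately curved, made-up data, and the residuals are listed next to the predicted values. No plotting library is used; the numbers themselves show the pattern.

```python
def fit_line(x, y):
    """Least squares intercept and slope for y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

# Made-up data with a curved (quadratic) relation, so linearity is violated.
x = [1, 2, 3, 4, 5]
y = [xi ** 2 for xi in x]

b0, b1 = fit_line(x, y)
predicted = [b0 + b1 * xi for xi in x]
residuals = [yi - pi for yi, pi in zip(y, predicted)]

for p, r in zip(predicted, residuals):
    print(f"predicted {p:5.1f}  residual {r:5.1f}")
# Residuals are 2, -1, -2, -1, 2: positive at both ends, negative in the middle.
```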

Categorical variables:

If x is a categorical variable with 2 categories, it can be coded as 1 or 0, for example if x indicates asthma treatment:

  • x = 0 → no treatment
  • x = 1 → treatment

In this case, x can be taken as an independent variable in the regression model of the FEV1 of children. The FEV of treated children is on average 0,266 L larger with a p-value of 0,036 → there is a statistically significant difference between treated and untreated children.

The increase in the mean FEV between untreated (x = 0) and treated (x = 1) children is 0,226 → the slope of the regression line. Because the mean of the treated and untreated children is compared, this is equivalent to an unpaired t-test.
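
That the slope for a 0/1 variable equals the difference between the two group means can be verified with a small sketch. The FEV values below are made up for illustration and do not reproduce the numbers from the lecture.

```python
def fit_line(x, y):
    """Least squares intercept and slope for y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - b1 * x_bar, b1

# Hypothetical data: x = 0 (no treatment) or 1 (treatment), FEV in litres.
treated = [0, 0, 0, 1, 1, 1]
fev = [2.9, 3.0, 3.1, 3.2, 3.3, 3.4]

b0, b1 = fit_line(treated, fev)
mean_untreated = sum(f for t, f in zip(treated, fev) if t == 0) / treated.count(0)
mean_treated = sum(f for t, f in zip(treated, fev) if t == 1) / treated.count(1)

# The intercept is the mean of the untreated group,
# the slope is the difference between the group means.
print(round(b0, 3), round(b1, 3), round(mean_treated - mean_untreated, 3))
```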

Multiple regression

Multiple linear regression is linear regression with several explanatory variables. It describes the influence of these variables on the response:

  • How does the average “y” vary as a function of x1, x2, ..., xp?
  • Can “y” be predicted if x1, x2, ..., xp are known?
  • What is the influence of x1 on “y”, corrected for x2, ..., xp?
  • Which combination of x’s is related to “y”?

Multiple regression can be used to:

  • Control for confounders
  • Build a prediction model
    • By adding extra information to the model to make a better guess
      • E.g. age
  • Increase the precision
    • By adding more information, fewer patients are needed to obtain the same precision for the treatment effect

Calculations:

The mean FEV1 is obtained with the formula 2,281 + 0,119 × age. This formula changes if height is added to the model as an explanatory variable:

  • Mean (FEV1) = 1,711 + (0,058 x age) + (0,008 x height)
    • If the FEV of 2 children who have the same height is measured, a 1-year older child has on average 0,058 L more FEV
    • If 2 children have the same age, a child who is 1 cm taller has on average 0,008 L more FEV
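
Both interpretations follow directly from the formula: changing one variable by 1 unit while holding the other fixed changes the mean by that variable's coefficient. A short sketch with the coefficients from the example (height in cm; the chosen age of 8 and height of 130 cm are arbitrary):

```python
def mean_fev1(age, height_cm):
    """Mean FEV1 in litres from age (years) and height (cm), coefficients from the example."""
    return 1.711 + 0.058 * age + 0.008 * height_cm

# Same height, 1 year older.
age_effect = mean_fev1(9, 130) - mean_fev1(8, 130)
# Same age, 1 cm taller.
height_effect = mean_fev1(8, 131) - mean_fev1(8, 130)

print(round(age_effect, 3), round(height_effect, 3))  # 0.058 0.008
```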

In short, in multiple regression:

  • “y” is a numerical outcome
  • The model has 2 independent variables (x1 and x2) with e ~ N(0, σ²)
  • The estimated regression equation is y = b0 + b1x1 + b2x2
    • If x1 is increased by 1 unit while x2 is kept fixed → y = b0 + b1(x1 + 1) + b2x2
      • The difference is b1
      • b1 is the amount by which the mean of y increases if x1 increases by 1 unit and all other x’s are kept fixed

Testing and estimation:

Testing and estimation are done in the same way as in simple linear regression. Coefficients are estimated with the least squares method, and standard errors, confidence intervals and p-values can be calculated for each of them. In the FEV example, after correction for height, the relation between age and FEV isn’t significant anymore.

In short, if age is added to the FEV model that contains treatment, the following is visible:

  • The direction of the treatment effect changes
    • The effect is very small and no longer statistically significant
  • Age is a confounder
    • Young children have a lower FEV and are less often treated
    • Adding age to the model adjusts for age
      • Differences between treated and untreated for fixed ages should be considered

Confounding:

One of the main functions of multiple regression is to control for confounding. Confounding should be considered if the regression coefficient for a variable (e.g. treatment) changes if another variable (e.g. age) is added.
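
The change in a coefficient after adding a confounder can be illustrated with constructed data. In the sketch below, FEV depends only on age, but older children are treated more often, so a model with treatment alone shows an apparent treatment effect that disappears once age is added. The data are made up, and the small normal-equations solver is just one simple way to fit the model without external libraries.

```python
def fit(columns, y):
    """Least squares coefficients [b0, b1, ...] via the normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for i in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[p] = a[p], a[i]
        a[i] = [v / a[i][i] for v in a[i]]
        for r in range(k):
            if r != i:
                a[r] = [vr - a[r][i] * vi for vr, vi in zip(a[r], a[i])]
    return [row[k] for row in a]

# Hypothetical data: FEV = 2.0 + 0.1 * age, with no true treatment effect.
ages = [6, 7, 8, 9, 10, 11]
treated = [0, 0, 1, 0, 1, 1]  # older children are treated more often
fev = [2.0 + 0.1 * age for age in ages]

crude = fit([treated], fev)           # [intercept, treatment]
adjusted = fit([treated, ages], fev)  # [intercept, treatment, age]

print(f"crude treatment effect:    {crude[1]:.3f}")
print(f"adjusted treatment effect: {adjusted[1]:.3f}")
```

The crude coefficient is about 0,23 L, while the coefficient adjusted for age is essentially 0, which is the kind of change that signals confounding.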

Functions of regression

Both linear and multiple regression have different uses:

  • Linear regression
    • To predict: e.g. what is the mean FEV for a 7-year-old child who is 1,30m tall and doesn’t use any medication?
    • To correct for confounders: e.g. what is the effect of treatment on FEV, after adjustment for age?
  • Multiple regression
    • Increases precision of randomized trials → adjusts for the variability due to important risk variables → the σ around the regression line becomes smaller

Predictions shouldn’t be made outside the range of the sample data → the regression line may be different there (extrapolation).

Types of regression models

There are different types of regression models for different types of outcomes:

  • Numerical outcomes → linear or non-linear regression
  • Binary outcome → logistic regression
    • A 0-1, success/failure outcome
  • Survival data → proportional hazard model (Cox regression)

Cox proportional hazards:

Cox proportional hazards (Cox PH) is a regression method for survival data that allows adjusted analysis. It assumes a baseline hazard h0(t) in a reference group and a hazard ratio (HR) that increases or decreases this hazard:

  • h1(t) = h0(t) × HR

In this case, the HR may depend on covariates, but not on the time (t) → the hazards are proportional, so the hazard ratio does not change over time.
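
The proportionality can be made concrete with a sketch. The baseline hazard function and the HR of 1,5 below are hypothetical; the point is only that the ratio between the two hazards is the same at every time point, even though the hazards themselves change.

```python
def baseline_hazard(t):
    """Hypothetical baseline hazard h0(t); any positive function of time would do."""
    return 0.01 + 0.002 * t

def hazard(t, hr):
    """Hazard in the comparison group: h1(t) = h0(t) * HR."""
    return baseline_hazard(t) * hr

HR = 1.5  # assumed hazard ratio, for illustration only

times = (1, 5, 10)
ratios = [hazard(t, HR) / baseline_hazard(t) for t in times]
for t, ratio in zip(times, ratios):
    print(f"t = {t:2d}: h0 = {baseline_hazard(t):.3f}, h1 = {hazard(t, HR):.3f}, ratio = {ratio:.2f}")
```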
