Applying correlation, regression and linear regression

Correlation, Regression, Linear Regression

Correlation versus regression

Correlation and regression are two types of analysis based on a multivariate distribution, that is, a distribution of multiple variables.

  • Correlation analysis tells us whether there is an association between two variables ‘x’ and ‘y’, or whether such a relationship is absent.
  • Regression analysis predicts the value of the dependent variable from the known value of the independent variable, assuming an average mathematical relationship between two or more variables.

| | Correlation | Regression |
|---|---|---|
| Meaning | Correlation is a statistical measure which determines the co-relationship or association of two variables. | Regression describes how an independent variable is numerically related to the dependent variable. |
| Usage | To represent a linear relationship between two variables. | To fit a best line and estimate one variable on the basis of another variable. |
| Dependent and independent variables | No distinction is made between the two variables. | The two variables play different roles. |
| Indicates | The correlation coefficient indicates the extent to which two variables move together. | Regression indicates the impact of a unit change in the known variable (x) on the estimated variable (y). |
| Objective | To find a numerical value expressing the relationship between variables. | To estimate values of a random variable on the basis of the values of a fixed variable. |

Correlation

A correlation measures three characteristics of the association between X and Y:

  1. The direction of the relation. A positive correlation (+) emerges when two variables move in the same direction: if the value of X increases (for example, a person's height), the value of Y also increases (for example, a person's weight). A negative correlation occurs when two variables move in opposite directions: if X increases, Y decreases (or vice versa).

  2. The form of the association. It can, for example, be linear.

  3. The degree of the association. A perfect correlation has a value of -1 or 1. A correlation of 0 implies that there is no linear association between the two variables. A correlation of 0.8 is therefore stronger than a correlation of, for example, 0.5.

Pearson correlation

The most well-known measure for correlation is the Pearson correlation. This correlation measures the degree and direction of a linear relation between two variables. The Pearson correlation is denoted with r and calculated as follows:

\[r = \frac{covariance\:of\:x\:and\:y}{variability\:of\:x\:and\:y\:separately}\]

or:

\[r = \frac{\sum{(x-\bar{x})(y-\bar{y})}}{\sqrt{\sum{(x-\bar{x})^2} \sum{(y-\bar{y})^2}}}\]

which is the same as:

\[r = \frac{N \sum{xy}-(\sum{x})(\sum{y})}{\sqrt{[N\sum{x^2}-(\sum{x})^2] [N\sum{y^2}-(\sum{y})^2]}}\]

  • N : number of pairs of scores
  • x : x scores
  • y : y scores
  • x̄ : mean of x scores
  • ȳ : mean of y scores
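The two formulas above can be checked against each other with a small sketch. The function names and the scores in `x` and `y` are made up for illustration; this is only one way to write the calculation out, not a reference implementation.

```python
import math


def pearson_deviation(x, y):
    """Pearson r from the deviations around the means (first formula)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of the deviations (numerator)
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Sums of squared deviations of x and y separately (denominator)
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp / math.sqrt(ss_x * ss_y)


def pearson_computational(x, y):
    """Pearson r from raw sums (second formula)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Hypothetical scores for five participants
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_dev = pearson_deviation(x, y)
r_comp = pearson_computational(x, y)
print(round(r_dev, 4), round(r_comp, 4))
```

Both functions return the same value for the same data, which illustrates that the two formulas are algebraically equivalent.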

The Pearson correlation can also be calculated from z-scores with the following formula (the z-scores have to be computed with the population standard deviation, that is, dividing by N):

\[r = \frac{\sum{(z_x\cdot z_y)}}{N}\]

  • zx : z-score of x
  • zy : z-score of y
  • N : number of pairs of scores
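As a rough illustration of the z-score formula, the sketch below standardizes both variables and averages the products of the z-scores. It assumes the z-scores are based on the population standard deviation (dividing by N); the helper names and the data are hypothetical.

```python
import math


def z_scores(values):
    """Standardize values using the population standard deviation (divide by N)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def pearson_from_z(x, y):
    """Pearson r as the mean of the products of paired z-scores."""
    zx = z_scores(x)
    zy = z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


# Hypothetical scores for five participants
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r_z = pearson_from_z(x, y)
print(round(r_z, 4))
```

If the z-scores were based on the sample standard deviation (dividing by N - 1), the sum of products would have to be divided by N - 1 instead of N to obtain the same r.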

The proportion explained variance

The Pearson correlation itself is of limited use in calculations, because it is not ratio scaled. Therefore, you have to square it. The value r² is called the coefficient of determination. This value measures the proportion of variance within one variable that can be explained by the association of this variable with another variable. A correlation of 0.80 (r = 0.80) implies, for example, that 0.64 (r²), that is 64% of the variance of scores on Y, can be explained by variable X.

  • An r² of 0.01 refers to a small correlation;
  • An r² of 0.09 refers to a medium correlation;
  • A large correlation is characterized by an r² of 0.25 or higher.
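The squaring step and the rule of thumb above can be sketched as follows. The function names are hypothetical, and the label "negligible" for values below 0.01 is an assumption added here, since the text does not name that range.

```python
def explained_variance(r):
    """Coefficient of determination: the squared correlation."""
    return r ** 2


def label_effect(r_squared):
    """Label r-squared using the thresholds 0.01, 0.09 and 0.25 from the text."""
    if r_squared >= 0.25:
        return "large"
    if r_squared >= 0.09:
        return "medium"
    if r_squared >= 0.01:
        return "small"
    return "negligible"  # assumed label, not taken from the text


r = 0.80
r_squared = explained_variance(r)
print(f"r = {r:.2f}, r squared = {r_squared:.2f}, {label_effect(r_squared)}")
```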

Spearman correlation

The Pearson correlation quantifies the linear relation between two variables. This correlation measure is used primarily when data are interval or ratio scaled - other correlation measures are developed for non-linear relations and other measurement scales.

The Spearman correlation measures the relation between two variables with an ordinal scale. The Spearman correlation can also be used when data are interval- or ratio scaled and there is no linear relation between X and Y.

The Spearman correlation looks for a consistent relation between X and Y, regardless of its form. The original values first have to be converted to ranks (from small to large). When there are no tied ranks, the Spearman correlation can be calculated as follows:

\[\rho = r_s = 1 - \frac{6\sum{d^2_i}}{n(n^2-1)}\]

  • ρ : Spearman correlation
  • r_s : Spearman correlation
  • d_i : rg(X_i) - rg(Y_i): difference between the two ranks of each observation (for example, one can have the second best score on variable X, but the ninth on variable Y)
  • n : number of scores
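A minimal sketch of this calculation is given below. It assumes there are no tied values, since the formula above does not hold exactly when ranks are tied; the function names and data are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman correlation from the squared rank differences."""
    n = len(x)
    rank_x = ranks(x)
    rank_y = ranks(y)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - (6 * sum_d2) / (n * (n ** 2 - 1))


# Hypothetical data: y rises consistently with x, but not linearly
x = [10, 20, 30, 40, 50]
y = [1, 4, 9, 16, 25]
# Hypothetical data: z mostly falls as x rises
z = [5, 3, 4, 1, 2]

rho_xy = spearman(x, y)
rho_xz = spearman(x, z)
print(rho_xy, rho_xz)
```

Because only the ranks are used, the curved but consistently increasing relation between `x` and `y` still yields a perfect Spearman correlation.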

The point-biserial correlation

A special variant of the Pearson correlation is called the point-biserial correlation. This correlation is used when one variable consists of numerical scores, but the other variable consists of only two categories. A variable with only two categories is called a dichotomous variable. An example is gender.

To calculate the point-biserial correlation, the dichotomous variable first has to be converted to a variable with numerical values. One category (for example women) receives a zero and the other category (for example men) receives a one. Next, the formula for Pearson r is used.
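The coding step can be sketched as follows: the categories are replaced by 0 and 1, after which the ordinary Pearson formula is applied. The group labels, scores and function name are made up for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson r from the deviations around the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp / math.sqrt(ss_x * ss_y)


# Hypothetical data: a dichotomous variable and a numerical score
groups = ["woman", "man", "woman", "man", "man", "woman"]
scores = [4, 7, 5, 8, 9, 6]

# Step 1: convert the two categories to the values 0 and 1
coded = [0 if g == "woman" else 1 for g in groups]

# Step 2: apply the ordinary Pearson formula to the coded variable
r_pb = pearson_r(coded, scores)
print(round(r_pb, 4))
```

Which category receives the 0 and which the 1 is arbitrary; swapping them only reverses the sign of the correlation.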

The phi-coefficient

The phi-coefficient measures the relation between two variables that are both dichotomous. To do so, first the values 0 and 1 have to be given to both variables. Next, the Pearson r formula can be applied.
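As an illustration, the sketch below applies the Pearson formula to two variables that have already been coded as 0 and 1. The variable names and data are hypothetical. The second function uses the well-known 2×2 table formula for phi as a cross-check; that formula is not given in the text above and is only included for comparison.

```python
import math


def pearson_r(x, y):
    """Pearson r from the deviations around the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp / math.sqrt(ss_x * ss_y)


def phi_from_table(x, y):
    """Phi from the four cell counts of the 2x2 table (cross-check only)."""
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))


# Hypothetical dichotomous variables, both coded 0/1
attended = [1, 1, 1, 0, 0, 0, 1, 0]
passed = [1, 1, 0, 0, 0, 1, 1, 0]

phi = pearson_r(attended, passed)
print(phi, phi_from_table(attended, passed))
```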

Strong and weak correlations

For large samples, even a very small correlation can quickly become statistically significant. A significant correlation tells us nothing more than that the chance is very small that the correlation in the population equals zero. Significance therefore does not tell us whether the relation between the variables is strong. The strength of a correlation depends on the size of the correlation and not on its statistical significance. The rule of thumb is that a correlation of .10 is weak, a correlation of .30 is moderate, and a correlation of .50 is strong.

Scatterplot

A useful way to examine the relation between two quantitative variables is a scatterplot. Each participant is displayed as a dot whose coordinates are the values on the variables X and Y. Normally, the predictor variable is presented on the X-axis and the criterion variable is presented on the Y-axis. The criterion variable is predicted from the predictor variable. However, for a correlation coefficient it is not always clear which variable is X and which is Y. In that case, it does not matter how the variables are labelled. In a scatterplot, a line is drawn through the cloud of dots as well as possible. That line is called the regression line of Y predicted by X (that is: Y on X), which gives the best approximation of Yi for a value Xi. If the regression line is straight, the relation between the variables is linear. If the regression line is curved, it is called a curvilinear relation.

The degree to which the dots cluster around the regression line is related to the correlation (r) between X and Y. The closer the dots (the observed results) lie to the regression line (the predicted results), the stronger the correlation. The correlation coefficient ranges from -1 to +1, in which a perfect correlation (all points are on the regression line) has a value of -1 or +1. The plus and minus signs indicate the direction and do not influence the strength of the relation between the variables.

Regression

The general formula for a simple regression is:

\[Y = a + bX + e\]

  • Y : dependent variable
  • X : independent variable
  • a : the intercept (the value of Y when X = 0)
  • b : the slope of the regression line
  • e : error, or the difference between the estimated and observed value of Y

For example, a tennis club charges an entrance fee of 30 euros plus 5 euros per hour, and you stay for 3 hours. In this case, the terms of the regression formula are:

  • Y : dependent variable: how much you should pay
  • X: independent variable: number of hours
  • a : the intercept: entrance fee
  • b : slope of the regression line: euros per hour
  • e : error: additional tips that you would like to give

\[Y = a + bX + e = 30 + 5 \times 3 + tips = 45 + tips\]
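The tennis club example can be written out as a tiny function. The function and parameter names are chosen here for illustration; the default values are the ones from the example.

```python
def predicted_cost(hours, entrance_fee=30.0, rate_per_hour=5.0, tips=0.0):
    """Y = a + bX + e with a = entrance fee, b = euros per hour, e = tips."""
    return entrance_fee + rate_per_hour * hours + tips


cost_without_tips = predicted_cost(3)
cost_with_tips = predicted_cost(3, tips=2.0)  # hypothetical tip of 2 euros
print(cost_without_tips, cost_with_tips)
```

Without tips, three hours cost 30 + 5 × 3 = 45 euros. Any tip is added on top of this amount, which is the role of the error term e in the formula.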

Assumptions for regression

A few assumptions have to be met. First, there has to be homogeneity of variances. That means that the variance of Y is the same for each value of X in the population. In addition, the values of Y that belong to each value of X have to be normally distributed.

When examining the sample correlation, we replace the regression model assumptions with the assumption that we draw a sample from a bivariate normal distribution. The conditional distributions in this distribution are the distributions of Y given a specific value of X, and of X given a specific value of Y. When we look at all Y-values, regardless of X, we call it the marginal distribution of Y. Finally, we assume that the relation between X and Y is linear.

Predicted values

To determine how well a line fits the data, we first have to calculate the distance between the line and each data point. For each X-value, the linear regression line determines the value for the Y variable. This value is called the predicted value (Ŷ). The distance between this predicted value and the actual Y-value is determined by the following steps:

  1. Distance = Y - Ŷ. This distance measures the error between the line and the actual data.

  2. Because some distances are negative, and others are positive, the next step is to square each distance, so that only positive values remain.

  3. Finally, the total distance between the line and the data has to be calculated. To do so, the squared values from step 2 are summed up: ∑(Y - Ŷ)². This is called the total squared error.

\[Total\:squared\:error = \sum{(Y - \hat{Y})^2}\]

  • Y : actual value of Y
  • Ŷ : predicted value of Y
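The three steps above can be sketched as follows. The data and the two candidate lines are hypothetical; the first line (a = 2.2, b = 0.6) happens to be the least-squares line for these data, and the second is an arbitrary alternative for comparison.

```python
def total_squared_error(x, y, a, b):
    """Sum of squared distances between observed Y and predicted Y = a + bX."""
    total = 0.0
    for xi, yi in zip(x, y):
        predicted = a + b * xi       # predicted value for this X
        distance = yi - predicted    # step 1: distance = Y - predicted Y
        total += distance ** 2       # steps 2 and 3: square and sum
    return total


# Hypothetical scores for five participants
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

error_line_1 = total_squared_error(x, y, a=2.2, b=0.6)
error_line_2 = total_squared_error(x, y, a=1.0, b=1.0)
print(round(error_line_1, 2), round(error_line_2, 2))
```

The line with the smaller total squared error fits the data better.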

Standardized regression coefficients

When the data are standardized, a difference of one unit in X refers to a difference of one standard deviation. If the slope is, for example, 0.75 (for standardized data), Y will increase by 0.75 standard deviations for each increase of one standard deviation in X. The slope for standardized data is called the standardized regression coefficient or β.

For standardized data, it holds that s_X = s_Y = s²_X = 1, in which case the slope and the correlation coefficient are equal. A correlation of r = .80 implies that an increase of one standard deviation in X is associated with an increase of 8/10 of a standard deviation in Y. However, because it is a correlational association, we cannot make claims about cause and effect.
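That the slope for standardized data equals the correlation can be checked numerically. The sketch below uses hypothetical data and helper names, and it assumes the usual least-squares slope (sum of products divided by the sum of squares of X), which is not spelled out in the text above.

```python
import math


def standardize(values):
    """Convert values to z-scores using the sample standard deviation (N - 1)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def slope(x, y):
    """Least-squares slope b of the regression of Y on X (assumed estimator)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    return sp / ss_x


def pearson_r(x, y):
    """Pearson r from the deviations around the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return sp / math.sqrt(ss_x * ss_y)


# Hypothetical scores for five participants
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

beta = slope(standardize(x), standardize(y))  # slope for standardized data
r = pearson_r(x, y)                           # correlation for the raw data
print(round(beta, 4), round(r, 4))
```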

Hypothesis tests for regression

The significance of b

If X and Y correlate, and there is a linear relation, the slope of the regression line will not be equal to zero and b will have a value different from zero. This is the case for one predictor variable, but when there are multiple predictor variables, the slope does not have to be significant for each of these variables.

b* is the population parameter that corresponds to b, namely the slope we would obtain if we had measured X and Y for the whole population.

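The standard error of b is:

\[s_b = \frac{s_{Y \cdot X}}{s_X \sqrt{N - 1}}\]

  • s_b : standard error of the slope b
  • s_{Y·X} : standard error of estimate (the standard deviation of the observed Y-values around the regression line)
  • s_X : standard deviation of X
  • N : number of pairs of scores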
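A rough sketch of how the standard error of the slope can be computed is given below. It assumes that the standard error is the standard error of estimate divided by the standard deviation of X times √(N - 1), and that the standard error of estimate is the square root of the total squared error divided by N - 2. The t-value at the end (b divided by its standard error, for the null hypothesis b* = 0) is the usual test statistic and is an addition here, not taken from the text. Data and names are hypothetical.

```python
import math


def slope_and_standard_error(x, y):
    """Return the least-squares slope b and its standard error s_b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    sp = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sp / ss_x
    a = mean_y - b * mean_x
    # Total squared error around the regression line
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(sse / (n - 2))   # standard error of estimate (assumed N - 2)
    s_x = math.sqrt(ss_x / (n - 1))   # sample standard deviation of X
    s_b = s_yx / (s_x * math.sqrt(n - 1))
    return b, s_b


# Hypothetical scores for five participants
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b, s_b = slope_and_standard_error(x, y)
t_value = b / s_b  # assumed test of b* = 0, with N - 2 degrees of freedom
print(round(b, 3), round(s_b, 3), round(t_value, 3))
```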
Stats for students: Simple steps for passing your statistics courses

How to triumph over the theory of statistics (without understanding everything)?

Stats of students

  • In the first years that you take statistics, it is often a matter of taking knowledge for granted and simply trying to pass the courses. Don't worry if you don't understand everything right away: in later years it will fall into place, and you will see the importance of the theory you had to know before.
  • The book you need to study may be difficult to understand at first. Be patient: later in your studies, the effort you put in now will pay off.
  • Be a Gestalt Scientist! In other words, recognize that the whole of statistics is greater than the sum of its parts. It is very easy to get hung up on nit-picking details and fail to see the forest for the trees.
  • Tip: Precise use of language is important in research. Try to reproduce the theory verbatim (i.e. learn by heart) where possible. With that, you don't have to understand it yet, you show that you've been working on it, you can't go wrong by using the wrong word and you practice for later reporting of research.
  • Tip: Keep study material, handouts, sheets, and other publications from your teacher for future reference.

How to score points with formulas of statistics (without learning them all)?

  • The direct relationship between data and results consists of mathematical formulas. These follow their own logic, are written in their own language, and can therefore be complex to comprehend.
  • If you don't understand the math behind statistics, you don't understand statistics. This does not have to be a problem, because statistics is an applied science from which you can also get excellent results without understanding. None of your teachers will understand all the statistical formulas.
  • Please note: you will probably have to know and understand a number of formulas, so that you can demonstrate that you know the principle of how statistics work. Which formulas you need to know differs from subject to subject and lecturer to lecturer, but in general these are relatively simple formulas that occur frequently, and your lecturer will likely tell you (often several times) that you should know this formula.
  • Tip: if you want to recognize statistical symbols, you can use: Recognizing commonly used statistical symbols
  • Tip: have fun with LaTeX! LaTeX code gives us a simple way to write out mathematical formulas and make them look professional. Play with LaTeX. With that, you can include used formulas in your own papers and you learn to understand how a formula is built up – which greatly benefits your understanding and remembering that formula. See also (in Dutch): How to create formulas like a pro on JoHo WorldSupporter?
  • Tip: Are you interested in a career in sciences or programming? Then take your formulas seriously and go through them again after your course.

How to practice your statistics (with minimal effort)?

How to select your data?

  • Your teacher will regularly use a dataset for lessons during the first years of your studying. It is instructive (and can be a lot of fun) to set up your own research for once with real data that is also used by other researchers.
  • Tip: scientific articles often indicate which datasets have been used for the research. There is a good chance that those datasets are valid. Sometimes there are also studies that determine which datasets are more valid for the topic you want to study than others. Make use of datasets other researchers point out.
  • Tip: Do you want an interesting research result? You can use the same method and question, but use an alternative dataset, and/or alternative variables, and/or alternative location, and/or alternative time span. This allows you to validate or falsify the results of earlier research.
  • Tip: for datasets you can look at Discovering datasets for statistical research

How to operationalize clearly and smartly?

  • For the operationalization, it is usually sufficient to indicate the following three things:
    • What is the concept you want to study?
    • Which variable does that concept represent?
    • Which indicators do you select for those variables?
  • It is smart to argue that a variable is valid, or why you choose that indicator.
  • For example, if you want to know whether someone is currently a father or mother (concept), you can search the variables for how many children the respondent has (variable) and then select on the indicators greater than 0, or is not 0 (indicators). Where possible, use the terms 'concept', 'variable', 'indicator' and 'valid' in your communication. For example, as follows: “The variable [variable name] is a valid measure of the concept [concept name] (if applicable: source). The value [description of the value] is an indicator of [what you want to measure].” (For instance: The variable "Number of children" is a valid measure of the concept of parenthood. A value greater than 0 is an indicator of whether someone is currently a father or mother.)

How to run analyses and draw your conclusions?

  • The choice of your analyses depends, among other things, on what your research goal is, which methods are often used in the existing literature, and practical issues and limitations.
  • The more you learn, the more independently you can choose research methods that suit your research goal. In the beginning, follow the lecturer – at the end of your studies you will have a toolbox with which you can vary in your research yourself.
  • Try to link up as much as possible with research methods that are used in the existing literature, because otherwise you could be comparing apples with oranges. Deviating can sometimes lead to interesting results, but discuss this with your teacher first.
  • For as long as you need, keep a step-by-step plan at hand on how you can best run your analysis and achieve results. For every analysis you run, there is a step-by-step explanation of how to perform it; if you do not find it in your study literature, it can often be found quickly on the internet.
  • Tip: Practice a lot with statistics, so that you can show results quickly. You cannot learn statistics by just reading about it.
  • Tip: The measurement level of the variables you use (ratio, interval, ordinal, nominal) largely determines the research method you can use. Show your audience that you recognize this.
  • Tip: conclusions from statistical analyses will never be certain, but at the most likely. There is usually a standard formulation for each research method with which you can express the conclusions from that analysis and at the same time indicate that it is not certain. Use that standard wording when communicating about results from your analysis.
  • Tip: see explanation for various analyses: Introduction to statistics