Statistics, the art and science of learning from data by A. Agresti (fourth edition) – Chapter 3 summary

THE ASSOCIATION BETWEEN TWO CATEGORICAL VARIABLES
When analysing data, the first step is to distinguish between the response variable and the explanatory variable. The response variable is the outcome variable on which comparisons are made. If the explanatory variable is categorical, it defines the groups to be compared with respect to values of the response variable. If the explanatory variable is quantitative, it defines the different numerical values at which the response variable is compared. The explanatory variable should explain the response variable (e.g. survival status is the response variable and smoking status is the explanatory variable).

An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable.

A contingency table is a display for two categorical variables. A conditional proportion is a proportion computed within one category of the other variable, so it is always conditional on something (e.g. the proportion of ‘no’ within one group only). A proportion can also be expressed as a percentage. A proportion computed from the row or column totals (e.g. the percentage of ‘no’ in the whole sample) is called a marginal proportion.

If there is a clear explanatory/response relationship, it dictates which way we compute the conditional proportions: the response categories are compared within each category of the explanatory variable. Conditional proportions are useful in determining whether there is an association. If they differ between the groups, the variables are associated. If they are the same in every group, the variables are independent.
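As a rough illustration of conditional and marginal proportions, the sketch below uses a small contingency table. The counts, the category labels and the function names are all invented for this example and do not come from the book.

```python
# Hypothetical counts, invented for illustration only.
# Outer keys: explanatory variable (smoking status).
# Inner keys: response variable (survival status).
table = {
    "smoker": {"dead": 30, "alive": 70},
    "non-smoker": {"dead": 20, "alive": 180},
}


def conditional_proportions(table):
    """Proportion of each response category within each explanatory group."""
    result = {}
    for group, counts in table.items():
        group_total = sum(counts.values())
        result[group] = {cat: n / group_total for cat, n in counts.items()}
    return result


def marginal_proportions(table):
    """Proportion of each response category over all groups combined."""
    grand_total = sum(sum(counts.values()) for counts in table.values())
    categories = list(next(iter(table.values())).keys())
    return {
        cat: sum(counts[cat] for counts in table.values()) / grand_total
        for cat in categories
    }


cond = conditional_proportions(table)
marg = marginal_proportions(table)

for group, props in cond.items():
    print(group, props)
print("marginal", marg)
```

In this made-up table the proportion of ‘dead’ differs between the two groups, which is the kind of difference that suggests an association.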

THE ASSOCIATION BETWEEN TWO QUANTITATIVE VARIABLES
We examine a scatterplot to study the association between two quantitative variables. An association can be positive or negative. In a positive association, y tends to go up as x goes up. In a negative association, y tends to go down as x goes up.

Correlation (r) summarizes the direction of the association between two quantitative variables and the strength of its linear trend. It can take a value between -1 and +1. A positive value for r indicates a positive association and a negative value for r indicates a negative association. The closer r is to -1 or +1, the closer the data points fall to a straight line and the stronger the linear association is. The closer r is to 0, the weaker the linear association is.

The properties of the correlation:

  • The correlation always falls between -1 and +1.
  • A positive correlation indicates a positive association and a negative correlation indicates a negative association.
  • The value of the correlation does not depend on the variables’ units (e.g. euros or dollars).
  • Two variables have the same correlation no matter which is treated as the response variable and which is treated as the explanatory variable.
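The properties above can be checked numerically. The following is a minimal sketch: the data, the exchange rate and the helper name `correlation` are assumptions made for illustration.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Pearson correlation: sum of z-score products divided by n - 1."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum((xi - mx) / sx * (yi - my) / sy for xi, yi in zip(x, y)) / (n - 1)


# Hypothetical data: price in euros and number of units sold.
price_eur = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [9.0, 8.0, 8.5, 6.0, 5.0]

r = correlation(price_eur, units_sold)

# Changing the units (assumed rate of 1.1 dollars per euro) leaves r unchanged.
price_usd = [1.1 * p for p in price_eur]
r_usd = correlation(price_usd, units_sold)

# Swapping the roles of the two variables leaves r unchanged as well.
r_swapped = correlation(units_sold, price_eur)

print(r, r_usd, r_swapped)
```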
 

 

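The correlation r can be calculated as follows:

$$r = \frac{1}{n-1} \sum z_x z_y = \frac{1}{n-1} \sum \left(\frac{x - \bar{x}}{s_x}\right) \left(\frac{y - \bar{y}}{s_y}\right)$$

Here n is the number of points, x̄ and ȳ are the means, and s_x and s_y are the standard deviations for x and y. The sum is taken over all n observations.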

Divide the scatterplot into four quadrants at the point (x̄, ȳ). The product of the z-scores for any point in the upper-right quadrant is positive. The product is also positive for each point in the lower-left quadrant. Such points contribute to a positive correlation. The product of the z-scores for any point in the upper-left or lower-right quadrant is negative. Such points contribute to a negative correlation.
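To make the quadrant argument concrete, the sketch below computes the z-score product for each point of a small invented data set and then combines the products into r. The data values are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 1.0, 3.0, 5.0]

mx, my = mean(x), mean(y)
sx, sy = stdev(x), stdev(y)

# One z-score product per observation.
products = [((xi - mx) / sx) * ((yi - my) / sy) for xi, yi in zip(x, y)]

# r is the sum of the products divided by n - 1.
r = sum(products) / (len(x) - 1)

for xi, yi, p in zip(x, y, products):
    print(f"({xi}, {yi}): z-score product = {p:+.3f}")
print("r =", r)
```

Here the point (2, 4) lies above ȳ but below x̄ (upper-left quadrant), so its product is negative, while the points (1, 2) and (5, 5) lie in the lower-left and upper-right quadrants and have positive products.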

PREDICTING THE OUTCOME OF A VARIABLE
The regression line predicts the value for the response variable y as a straight-line function of the value x of the explanatory variable. The equation for the regression line has the form:

$$\hat{y} = a + bx$$

‘a’ denotes the y-intercept and ‘b’ denotes the slope. A regression equation is often called a prediction equation. The prediction error is the difference between the actual y and the predicted y (written ŷ). The prediction error can be calculated as follows:

$$\text{prediction error} = y - \hat{y}$$

The outcomes of the prediction error formula are called residuals. The summary measure used to evaluate regression lines is the residual sum of squares:

$$\sum (\text{residual})^2 = \sum (y - \hat{y})^2$$

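Choosing the line that has the minimum residual sum of squares is called the least squares method. This gives us the regression line, with slope b and y-intercept a. The regression formulas for the slope and the y-intercept are:

$$b = r \left(\frac{s_y}{s_x}\right) \quad \text{and} \quad a = \bar{y} - b\bar{x}$$

where r is the correlation, x̄ and ȳ are the means, and s_x and s_y are the standard deviations of x and y.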
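The least squares formulas can be tried out on a small example. This is a sketch under stated assumptions: the data are invented, and the helper names `correlation` and `least_squares_line` are hypothetical.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Pearson correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum((xi - mx) / sx * (yi - my) / sy for xi, yi in zip(x, y)) / (len(x) - 1)


def least_squares_line(x, y):
    """Return (a, b) using b = r * (s_y / s_x) and a = y_bar - b * x_bar."""
    b = correlation(x, y) * stdev(y) / stdev(x)
    a = mean(y) - b * mean(x)
    return a, b


# Hypothetical data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = least_squares_line(x, y)
predicted = [a + b * xi for xi in x]
residuals = [yi - yhat for yi, yhat in zip(y, predicted)]
rss = sum(e ** 2 for e in residuals)  # residual sum of squares

# Any other line, such as this arbitrary one, has a larger residual sum of squares.
other_rss = sum((yi - (2.0 + 0.7 * xi)) ** 2 for xi, yi in zip(x, y))

print("a =", a, "b =", b, "RSS =", rss, "other RSS =", other_rss)
```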

The slope can’t be used to determine the strength of the association, because the slope depends on the units for the variables. A slope using dollars would look different than a slope using euros, thus it is not possible to say something about the strength of the association using the slope. Correlation and regression methods serve different purposes, but there are strong connections between them:

  • They are both appropriate when the relationship between two quantitative variables can be approximated by a straight line.
  • The correlation and the slope of the regression line have the same sign. If one is positive, so is the other one. If one is negative, so is the other one. If one is zero, the other is also zero.

r² (the square of the correlation) is the proportion of the variation in the y-values that is accounted for by the linear relationship of y with x.
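One way to see this interpretation is to compare the squared correlation with the share of variation that the regression line removes. The sketch below uses invented data and computes the slope as the ratio of the sum of cross-products to the sum of squares of x, which is algebraically the same as r times s_y / s_x.

```python
from statistics import mean

# Hypothetical data, invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

mx, my = mean(x), mean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
syy = sum((yi - my) ** 2 for yi in y)  # total variation in y around its mean
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))

r = sxy / (sxx * syy) ** 0.5
b = sxy / sxx            # slope, equal to r * (s_y / s_x)
a = my - b * mx          # y-intercept

# Variation left over after using the regression line to predict y.
rss = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

# Proportion of the variation in y accounted for by the line.
proportion_explained = 1 - rss / syy

print("r squared =", r ** 2, "proportion explained =", proportion_explained)
```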

CAUTIONS IN ANALYZING ASSOCIATIONS
Extrapolation refers to using a regression line to predict y values for x values outside the observed range of data. This is not always a good method to predict future data, because if the trend changes in the future, extrapolation gives poor predictions. Predictions about the future using time series data are called forecasts.

Regression outliers are outliers that are well removed from the trend that the rest of the data follow. An observation is influential if it has a large effect on results of a regression analysis. For an observation to be influential, two conditions must hold:

  • Its x value is relatively low or high compared to the rest of the data
  • The observation is a regression outlier, falling quite far from the trend that the rest of the data follow
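The two conditions can be illustrated by adding one extra point to a small data set and refitting the line. The data and the helper name `slope` are invented for this sketch.

```python
from statistics import mean


def slope(x, y):
    """Least squares slope: sum of cross-products over sum of squares of x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical base data with a rising trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
slope_base = slope(x, y)

# Extra point with a very high x value that also falls far from the trend:
# both conditions hold, so the fitted slope changes a lot.
slope_high_x = slope(x + [15.0], y + [2.0])

# Extra point far from the trend but with x in the middle of the data:
# only one condition holds, and the slope does not move.
slope_mid_x = slope(x + [3.0], y + [9.0])

print(slope_base, slope_high_x, slope_mid_x)
```

In this example the first extra point turns a positive slope into a negative one, while the second leaves the slope where it was.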

Correlation and the regression line are non-resistant: they are prone to distortion by outliers. An association, including a strong correlation, does not imply causation.

A lurking variable is a third variable, usually not measured in a study (and perhaps not even known to the researchers), that influences the association between the response variable and the explanatory variable.

The direction of an association between two variables can change after we include a third variable and analyse the data at separate levels of that variable. This is known as Simpson’s Paradox (e.g. a positive correlation between crime rate and education changed to a negative correlation when the data were considered at separate levels of urbanization).
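A small numerical sketch can show how such a reversal arises. The numbers below are invented and only loosely mirror the education and crime example. They are not real data.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Pearson correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    return sum((xi - mx) / sx * (yi - my) / sy for xi, yi in zip(x, y)) / (len(x) - 1)


# Hypothetical values for two levels of a third variable (urbanization).
# Within each level, the second variable falls as the first one rises.
education_low_urban = [1.0, 2.0, 3.0]
crime_low_urban = [3.0, 2.0, 1.0]
education_high_urban = [6.0, 7.0, 8.0]
crime_high_urban = [8.0, 7.0, 6.0]

r_low = correlation(education_low_urban, crime_low_urban)
r_high = correlation(education_high_urban, crime_high_urban)

# Ignoring the third variable and pooling all observations reverses the sign.
r_pooled = correlation(
    education_low_urban + education_high_urban,
    crime_low_urban + crime_high_urban,
)

print("low:", r_low, "high:", r_high, "pooled:", r_pooled)
```

The pooled correlation is positive only because the high-urbanization group has higher values on both variables, while the association within each group is negative.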

A lurking variable may be a common cause of both the explanatory and the response variable. There could also be multiple causes. Some things are merely associated because they both have a time trend (e.g. if two things both have a rising trend over the course of 10 years, they will be positively associated with each other).

When two explanatory variables are both associated with a response variable but are also associated with each other, confounding occurs. It is difficult to determine which one really causes the response variable. The difference between a confounding variable and a lurking variable is that a lurking variable is not measured. A lurking variable has potential for confounding.

 

 

Follow the author: JesperN