How do you analyze the association between categorical variables? – Chapter 8

8.1 How do you create and interpret a contingency table?
8.2 What is a chi-squared test?
8.3 In which way do residuals help to analyze the association between variables?
8.4 How is the association in a contingency table measured?
8.5 How do you measure the association between ordinal variables?

8.1 How do you create and interpret a contingency table?

A contingency table displays the number of observations for every possible combination of the categories of two categorical variables. A 4x5 contingency table has 4 rows and 5 columns. The cells often show percentages instead of counts; this is called relative data.

A conditional distribution shows the data given a certain condition, as percentages of a subtotal, like the percentage of women that have a cold. A marginal distribution contains the row totals or the column totals, so the distribution of one variable on its own. A simultaneous (joint) distribution shows the percentages with respect to the entire sample.
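The three kinds of distributions can be illustrated with a small sketch. The 2x2 table below is made up for illustration (it is not data from the book), and the variable names are assumptions chosen for this example.

```python
# Hypothetical 2x2 table of counts, invented for illustration:
# rows = group (students, non-students), columns = loves beer (yes, no)
observed = [[30, 20],
            [10, 40]]

# Marginal distributions: the totals of each row and each column
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)  # sample size

# Conditional distributions: each row as proportions of its own row total
conditional = [[count / total for count in row]
               for row, total in zip(observed, row_totals)]

# Simultaneous (joint) distribution: each cell as a proportion of the whole sample
joint = [[count / n for count in row] for row in observed]

print("row totals:", row_totals)
print("column totals:", col_totals)
print("conditional on row:", conditional)
print("joint:", joint)
```

In this made-up table the conditional distributions of the two rows differ (0.6 versus 0.2 for 'yes'), which is what an association in a sample looks like.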

Two categorical variables are statistically independent when the probability distribution of one variable is not influenced by the outcome of the other variable; in other words, the conditional distributions are identical. If the outcome of one variable does influence the distribution of the other, they are statistically dependent.

8.2 What is a chi-squared test?

Independence is a statement about the variables in the population. A sample will probably show a similar pattern, but not necessarily, because the sampling variability can be high. A significance test tells whether it's plausible that the variables really are independent in the population. The hypotheses for this test are:

H₀: the variables are statistically independent

H_a: the variables are statistically dependent

A cell in a contingency table shows the observed frequency (f_o), the number of times that an observation is made. The expected frequency (f_e) is the count that is expected if the null hypothesis is true, so when the variables are independent. The expected frequency of a cell is calculated by multiplying the total of its row by the total of its column and then dividing this number by the sample size.
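As a minimal sketch, the expected frequency of each cell is (row total × column total) / sample size. The function name `expected_frequencies` and the table of counts are hypothetical, chosen only for this illustration.

```python
def expected_frequencies(observed):
    """Expected counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Hypothetical counts, invented for illustration
observed = [[30, 20],
            [10, 40]]

expected = expected_frequencies(observed)
print(expected)  # [[20.0, 30.0], [20.0, 30.0]]
```

Note that the expected frequencies have the same row and column totals as the observed frequencies.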

A significance test for independence uses a special test statistic, X², which summarizes how close the expected frequencies are to the observed frequencies. The test is called the chi-squared test (of independence). The formula for the test statistic is:

$X^2 = \sum \frac{(f_o-f_e)^2}{f_e}$

This method was developed by Karl Pearson. When X² is small, the expected and observed frequencies are close together. The bigger X², the further they are apart. So this test statistic indicates whether the differences could plausibly be due to chance.
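The formula can be sketched in a few lines. The helper name `chi_squared` and the counts are assumptions made for this example, not taken from the book.

```python
def chi_squared(observed):
    """Pearson's X^2: sum over all cells of (f_o - f_e)^2 / f_e."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n  # expected under H0
            x2 += (f_o - f_e) ** 2 / f_e
    return x2

# Hypothetical counts, invented for illustration
observed = [[30, 20],
            [10, 40]]

x2 = chi_squared(observed)
print(round(x2, 4))  # 16.6667
```

A table whose observed counts equal the expected counts exactly gives X² = 0.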

A binomial distribution gives the probabilities of the possible counts for a categorical variable with two categories, like the number of heads when tossing a coin. This is not a distribution of observations or a sample but a distribution of probabilities. A multinomial distribution is the same, except that it has more than two categories.

When the cell counts follow a multinomial distribution and the sample is large, the sampling distribution of X² under H₀ is approximately the chi-squared probability distribution. The symbol χ² of the chi-squared distribution resembles the letter X² of the test statistic.

The most important characteristics of the chi-squared distribution are:

The distribution is concentrated on non-negative values; X² can never be negative.
The distribution is skewed to the right.
The exact shape of the distribution depends on the degrees of freedom (df). For the chi-squared distribution, µ = df and σ = √(2df). The curve gets flatter when df gets bigger.
If r is the number of rows and c the number of columns, df = (r – 1)(c – 1).
When the contingency table becomes bigger, so do the degrees of freedom, and X² tends to become bigger as well.
The larger X², the stronger the evidence is against H₀.
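The characteristics above can be sketched with the standard library only. The function names are hypothetical. The P-value helper covers only the special case df = 1 (a 2x2 table), where X² is the square of a standard normal variable, so the right-tail probability equals erfc(√(X²/2)); for other df, statistical software is needed.

```python
import math

def degrees_of_freedom(rows, cols):
    """df = (r - 1)(c - 1) for an r x c contingency table."""
    return (rows - 1) * (cols - 1)

def chi2_mean_sd(df):
    """Mean and standard deviation of the chi-squared distribution."""
    return df, math.sqrt(2 * df)

def p_value_df1(x2):
    """Right-tail probability of X^2 for df = 1 only."""
    return math.erfc(math.sqrt(x2 / 2))

df = degrees_of_freedom(4, 5)   # the 4x5 table mentioned earlier
mean, sd = chi2_mean_sd(df)
print(df, mean, round(sd, 3))

# For df = 1, X^2 = 3.841 corresponds to a P-value of about 0.05
print(round(p_value_df1(3.841), 3))
```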

The chi-squared test is also used to compare proportions. Research results (such as 'yes' and 'no') can be divided into success and failure. π₁ is the proportion of successes in group 1, π₂ the proportion of successes in group 2. When the response variable is independent of the group, then π₁ = π₂. This is called the homogeneity hypothesis, which is why the chi-squared test is also called a test of homogeneity. For a 2x2 table, the test statistic for comparing the two proportions is:

$z = \frac{{\hat{\pi}_2 - \hat{\pi}_1}}{se_0}$ in which X² = z²
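The relation X² = z² for a 2x2 table can be checked numerically. This is a sketch under stated assumptions: the counts are invented, the function names are hypothetical, and se₀ is taken to be the standard error based on the pooled proportion, as is usual for a test of equal proportions.

```python
import math

def two_proportion_z(success1, n1, success2, n2):
    """z statistic for H0: pi_1 = pi_2, using the pooled standard error se_0."""
    p1 = success1 / n1
    p2 = success2 / n2
    pooled = (success1 + success2) / (n1 + n2)
    se0 = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se0

def chi_squared(observed):
    """Pearson's X^2 for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return sum((f_o - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i, row in enumerate(observed)
               for j, f_o in enumerate(row))

# Hypothetical counts: rows = groups, columns = (success, failure)
observed = [[30, 20],
            [10, 40]]

z = two_proportion_z(30, 50, 10, 50)
x2 = chi_squared(observed)
print(round(z, 4), round(z ** 2, 4), round(x2, 4))
```

The sign of z shows the direction of the difference, which X² cannot show because it is a squared quantity.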

The z statistic and X² are used in different cases. The z statistic is applicable, for instance, for one-sided alternative hypotheses. For a contingency table larger than 2x2, X² is better because it can handle multiple parameters. Df can be interpreted as the number of parameters required to describe the contingency table.

The chi-squared test does have limitations. It only works for large samples, with an expected frequency of at least about 5 in each cell. For small samples Fisher's exact test is better. The chi-squared test works best for nominal scales. For ordinal scales other tests are preferred.

8.3 In which way do residuals help to analyze the association between variables?

When the P-value of a chi-squared test is very small, there is strong evidence of an association between the variables. This says nothing about the way in which the variables are connected or how strong this association is. That's why residuals are important. A residual is the difference between the observed and expected frequency of a cell: f_o – f_e. When a residual is positive, the observed frequency is bigger than the expected frequency. A standardized residual divides the residual by its standard error, which shows how far a cell deviates from what would be expected when H₀ is true and there is independence. The formula for a standardized residual is:

$z = \frac{f_o-f_e}{se} = \frac{f_o-f_e}{\sqrt{f_e(1-\text{row proportion})(1-\text{column proportion})}}$

A large standardized residual is evidence against independence in that cell. When the null hypothesis is true, the probability is only about 5% that a standardized residual has an absolute value greater than 2. So a residual below -3 or above 3 is very convincing evidence. Software gives both the test statistic X² and the residuals. In a 2x2 contingency table the standardized residual is the same as the z test statistic for comparing two proportions.
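A minimal sketch of the standardized residuals, assuming that the row proportion and column proportion in the formula are the marginal proportions (row total / n and column total / n). The function name and the counts are hypothetical.

```python
import math

def standardized_residuals(observed):
    """(f_o - f_e) / sqrt(f_e * (1 - row proportion) * (1 - column proportion))."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(observed):
        out_row = []
        for j, f_o in enumerate(row):
            f_e = row_totals[i] * col_totals[j] / n
            se = math.sqrt(f_e * (1 - row_totals[i] / n) * (1 - col_totals[j] / n))
            out_row.append((f_o - f_e) / se)
        residuals.append(out_row)
    return residuals

# Hypothetical counts, invented for illustration
observed = [[30, 20],
            [10, 40]]

for row in standardized_residuals(observed):
    print([round(value, 2) for value in row])
```

In this made-up 2x2 table all four residuals have the same absolute value (about 4.08), differing only in sign, so every cell deviates clearly from what independence predicts.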

8.4 How is the association in a contingency table measured?

In analyzing a contingency table, researchers hope to find out:

Whether there is an association (measured by the chi-squared test)
How the data differ from independence (measured by standardized residuals)
How strong the association is between the variables

Several measures of association size up the connection between variables. They compare the most extreme form of an association with the complete absence of it and determine where the data are located between these two extremes.

The weakest association is found, for instance, in a sample of 60% students and 40% non-students in which 30% of students and 30% of non-students say they love beer. In this case there is no association at all. The most extreme association would be if 100% of students and 0% of non-students love beer. In reality the percentages lie in between.

In a simple binary 2x2 contingency table it's easy to compare proportions. If the association is strong, the absolute value of the difference between the proportions is large.

The chi-squared test only measures how much evidence there is of an association. It does not measure how strong the association is. For instance, a large sample can provide strong evidence that a weak association exists.

When the outcome of a binary response variable is labelled success or failure, the odds can be calculated: odds of success = probability of success / probability of failure. When the odds are 3, success is three times as likely as failure. The probability of a certain outcome is odds / (odds + 1). The odds ratio of a 2x2 contingency table compares the odds of one group with the odds of another group: odds of row 1 / odds of row 2. The odds ratio is indicated as θ.

The odds ratio has the following characteristics:

The value doesn't depend on which variable is chosen as a response variable.
The odds ratio equals the product of the cells on one diagonal divided by the product of the cells on the other diagonal, and hence it's also called the cross-product ratio.
The odds ratio can take any non-negative value.
When the probability of success is the same for two rows, then the odds ratio is 1.
An odds ratio smaller than 1 means that the odds of success are smaller for row 1 than for row 2.
The further the odds ratio is from 1, the stronger the association.
Two values of the odds ratio represent the same strength of association in opposite directions when one is the inverse of the other, like θ = 4 and θ = 1/4.
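The odds and the odds ratio can be sketched as follows. The function names and the counts are hypothetical, chosen for illustration only.

```python
def odds(p_success):
    """Odds = probability of success / probability of failure."""
    return p_success / (1 - p_success)

def probability_from_odds(o):
    """Probability = odds / (odds + 1)."""
    return o / (o + 1)

def odds_ratio(table):
    """Cross-product ratio of a 2x2 table [[a, b], [c, d]]: (a * d) / (b * c)."""
    (a, b), (c, d) = table
    return (a * d) / (b * c)

print(odds(0.75))                 # 3.0: success is three times as likely as failure
print(probability_from_odds(3))   # 0.75

# Hypothetical counts: rows = groups, columns = (success, failure)
table = [[30, 20],
         [10, 40]]
theta = odds_ratio(table)
print(theta)                      # 6.0

# Swapping the rows gives the inverse: same strength, opposite direction
print(odds_ratio([table[1], table[0]]))
```

The cross-product gives the same value as dividing the odds of row 1 (30/20 = 1.5) by the odds of row 2 (10/40 = 0.25).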

When a contingency table is more complex than 2x2, the table is divided into smaller 2x2 contingency tables, each with its own odds ratio. Sometimes it's possible to capture a complex collection of data in a single number, but it's better to present multiple comparisons instead (like multiple odds ratios), to better reflect the data.

8.5 How do you measure the association between ordinal variables?

An association between ordinal variables can be positive or negative. A positive association means that a higher score on x goes along with a higher score on y. A negative association means that a higher score on x entails a lower score on y.

A pair of observations can be concordant (C) or discordant (D). A pair of observations is concordant when the subject that scores higher on one variable also scores higher on the other variable (evidence of a positive association). A pair is discordant when the subject that scores higher on one variable scores lower on the other (evidence of a negative association).

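Because bigger samples have more pairs, the difference C – D is standardized by dividing it by the total number of concordant and discordant pairs, which gives gamma. The sample value is noted as γ̂ (the Greek letter gamma with a hat, not the letter y). Gamma measures the association between the variables. Its formula is: γ̂ = (C – D) / (C + D).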

The gamma value is between -1 and +1. Its sign indicates whether the association is positive or negative, and its absolute value indicates how strong the association is: the further gamma is from 0, the stronger the association. For instance, a gamma value of 0.17 indicates a positive but weak association. Gamma is a difference between proportions: the proportion of concordant pairs minus the proportion of discordant pairs, C / (C + D) – D / (C + D).
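Counting the pairs by hand is tedious, so a sketch may help. It assumes a contingency table whose rows and columns are both ordered from low to high; the 3x3 counts and the function name are invented for illustration. For each cell, every observation forms a concordant pair with all observations in cells below and to the right, and a discordant pair with all observations in cells below and to the left.

```python
def concordant_discordant(table):
    """Count concordant (C) and discordant (D) pairs in an ordinal table."""
    n_rows, n_cols = len(table), len(table[0])
    C = D = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):      # rows with a higher score
                for l in range(n_cols):
                    if l > j:                   # also higher on the column variable
                        C += table[i][j] * table[k][l]
                    elif l < j:                 # lower on the column variable
                        D += table[i][j] * table[k][l]
    return C, D

# Hypothetical counts, categories ordered low -> high in both directions
table = [[20, 10, 5],
         [10, 15, 10],
         [5, 10, 20]]

C, D = concordant_discordant(table)
gamma_hat = (C - D) / (C + D)
print(C, D, round(gamma_hat, 3))  # 2000 575 0.553
```

Pairs that are tied on either variable are neither concordant nor discordant, so they do not enter gamma.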

Other measures of association are Kendall's tau-b, Spearman's rho-b, and Somers' d. Like gamma, these measures summarize ordinal association in a way that resembles the correlation between quantitative variables.

A confidence interval can also be calculated for gamma. Here γ̂ denotes the sample gamma and γ the population gamma. The confidence interval is γ̂ ± z(se). To test H₀: γ = 0, the test statistic is z = (γ̂ – 0) / se. These formulas work best if C and D are both higher than 50.
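A sketch of both calculations, under loud assumptions: the standard error is normally reported by software, so the value se = 0.05 below is purely hypothetical, as is pairing it with the gamma of 0.17 used as an example earlier. The two-sided P-value uses the standard normal distribution through `math.erfc`.

```python
import math

gamma_hat = 0.17   # sample gamma (example value from the text)
se = 0.05          # hypothetical standard error, normally given by software
z_crit = 1.96      # z-score for a 95% confidence interval

# Confidence interval: gamma_hat +/- z * se
lower = gamma_hat - z_crit * se
upper = gamma_hat + z_crit * se

# Test of H0: gamma = 0
z = (gamma_hat - 0) / se
p_two_sided = math.erfc(abs(z) / math.sqrt(2))

print(f"95% CI: ({lower:.3f}, {upper:.3f})")
print(f"z = {z:.2f}, two-sided P = {p_two_sided:.4f}")
```

With these assumed numbers the interval does not contain 0, which agrees with the small P-value of the test.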

If two variables are ordinal, an ordinal measure is preferable to the chi-squared test, because the chi-squared test ignores rankings.

Other ordinal methods work in ways similar to gamma. An alternative is a test of linear-by-linear association, in which each category of each variable is assigned a score and the correlation is analyzed by a z-test. This is a method to detect a trend.

For a mix of ordinal and nominal variables, especially if the nominal variable has more than two categories, it's better not to use gamma.

This content is related to:

Statistical methods for the social sciences - Agresti - 5th edition, 2018 - Summary (EN)

Follow the author: Annemarie JoHo
