
How do you analyze the association between categorical variables? – Chapter 8

8.1 How do you create and interpret a contingency table?

A contingency table shows the number of observations for every possible combination of the categories of two categorical variables. A 4x5 contingency table has 4 rows and 5 columns. The cells often show percentages rather than counts; these are called relative data.

A conditional distribution shows the data for one condition as percentages of a subtotal, for example the percentage with a cold among women only. A marginal distribution shows the row totals or the column totals, so the distribution of one variable on its own. A joint distribution shows each cell as a percentage of the entire sample.
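The three kinds of distribution can be sketched with a small table. The counts below are invented purely for illustration (they do not come from the chapter), and the variable names are assumptions.

```python
# Hypothetical counts, invented for illustration: rows = gender, columns = cold or not.
table = {
    "women": {"cold": 30, "no cold": 70},
    "men": {"cold": 20, "no cold": 80},
}

n = sum(sum(row.values()) for row in table.values())  # total sample size

# Joint distribution: each cell as a proportion of the entire sample.
joint = {r: {c: count / n for c, count in row.items()} for r, row in table.items()}

# Marginal distributions: row totals and column totals as proportions of n.
row_marginal = {r: sum(row.values()) / n for r, row in table.items()}
columns = list(next(iter(table.values())))
col_marginal = {c: sum(table[r][c] for r in table) / n for c in columns}

# Conditional distribution of the column variable within each row (given gender).
conditional = {
    r: {c: count / sum(row.values()) for c, count in row.items()}
    for r, row in table.items()
}

print(joint["women"]["cold"])        # women with a cold as a share of the whole sample
print(row_marginal["women"])         # share of the sample that is women
print(conditional["women"]["cold"])  # share of women who have a cold
```

Note that the same cell (women with a cold) gives different percentages depending on whether it is divided by the whole sample or by the subtotal for women.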

Two categorical variables are statistically independent when the probability distribution of one variable is the same for every outcome of the other variable. In other words, the distribution of one variable is not influenced by the outcome of the other. When the distributions do differ, the variables are statistically dependent.

8.2 What is a chi-squared test?

Independence and dependence are properties of the variables in the population. When the variables are independent in the population, the sample will probably be distributed similarly, but not necessarily, because the sampling variability can be high. A significance test tells whether it's plausible that the variables really are independent in the population. The hypotheses for this test are:

H0: the variables are statistically independent

Ha: the variables are statistically dependent

A cell in a contingency table shows the observed frequency (fo), the number of observations that fall in that cell. The expected frequency (fe) is the count that is expected if the null hypothesis is true, so when the variables are independent. The expected frequency is calculated by multiplying the total of the row by the total of the column and then dividing this product by the sample size.
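As a minimal sketch, the expected frequencies under independence follow from the standard rule fe = (row total × column total) / n. The counts are invented for illustration and the function name is an assumption.

```python
def expected_frequencies(observed):
    """Expected cell counts under independence: row total * column total / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Hypothetical 2x2 table of observed frequencies (fo), invented for illustration.
observed = [
    [30, 70],
    [20, 80],
]

expected = expected_frequencies(observed)
for row in expected:
    print(row)
```

In this sketch both rows have the same total, so independence implies identical expected counts in both rows.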

A significance test for independence uses a special test statistic. X² indicates how close the expected frequencies are to the observed frequencies. The test that is performed is called the chi-squared test (of independence). The formula for the test statistic is:

X² = Σ (fo – fe)² / fe, summed over all cells of the table

This method was developed by Karl Pearson. When X² is small, the expected and observed frequencies are close together. The bigger X², the further they are apart. So this test statistic indicates whether the differences are larger than chance alone would plausibly produce.
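A short sketch of the Pearson statistic, which sums (fo – fe)² / fe over all cells. The counts are invented for illustration and the function name is an assumption.

```python
def chi_squared_statistic(observed):
    """Pearson X²: sum over all cells of (fo - fe)² / fe."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n  # expected count under H0
            x2 += (fo - fe) ** 2 / fe
    return x2

# Hypothetical observed frequencies, invented for illustration.
observed = [
    [30, 70],
    [20, 80],
]
x2 = chi_squared_statistic(observed)
print(round(x2, 3))
```

When the observed counts match the expected counts exactly, every term is zero and X² = 0.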

A binomial distribution gives the probabilities of the possible counts for a categorical variable with two categories, like the number of heads in a series of coin tosses. This is not a distribution of observations in a sample but a distribution of probabilities. A multinomial distribution is the same, except that it has more than two categories.

The cell counts in a contingency table follow a multinomial distribution. For large samples, the sampling distribution of X² is approximately the chi-squared probability distribution. The symbol χ² of the chi-squared distribution resembles the symbol X² of the test statistic.

The most important characteristics of the chi-squared distribution are:

  • The distribution covers only non-negative values; X² can never be negative.

  • The distribution is skewed to the right.

  • The exact shape of the distribution depends on the degrees of freedom (df). For the chi-squared distribution, µ = df and σ = √(2df). The curve shifts to the right and becomes flatter when df gets bigger.

  • If r is the number of rows and c the columns, df = (r – 1)(c – 1).

  • When contingency tables become bigger, so do the degrees of freedom, and X² tends to be larger as well.

  • The larger X², the stronger the evidence is against H0.
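The characteristics above can be sketched in a few lines. The function names are assumptions, and the P-value helper is deliberately limited to df = 1 (a 2x2 table), where the chi-squared variable is the square of a standard normal variable; for other df, statistical software is needed.

```python
import math

def chi2_shape(rows, cols):
    """Return df, mean and standard deviation of the chi-squared distribution."""
    df = (rows - 1) * (cols - 1)
    return df, df, math.sqrt(2 * df)

def p_value_df1(x2):
    """Right-tail probability P(chi-squared > x2), valid only for df = 1."""
    # With df = 1, chi-squared is the square of a standard normal variable,
    # so the tail probability equals P(|Z| > sqrt(x2)) = erfc(sqrt(x2 / 2)).
    return math.erfc(math.sqrt(x2 / 2))

df, mean, sd = chi2_shape(4, 5)   # a 4x5 table
p_small = p_value_df1(1.0)        # small X²: weak evidence against H0
p_large = p_value_df1(9.0)        # large X²: strong evidence against H0
print(df, mean, round(sd, 3))
print(round(p_small, 4), round(p_large, 4))
```

A larger X² gives a smaller right-tail probability, which is why larger values count as stronger evidence against H0.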

The chi-squared test can also be used to compare proportions. Research results (such as 'yes' and 'no') can be divided into success and failure. π1 is the proportion of successes in group 1, π2 the proportion of successes in group 2. When the response variable is independent of the group, then π1 = π2. This is called the homogeneity hypothesis, and the chi-squared test is then also called a test of homogeneity. The test statistic is:

z = (π̂2 – π̂1) / se0, with se0 = √[π̂(1 – π̂)(1/n1 + 1/n2)]

in which π̂1 and π̂2 are the sample proportions, π̂ is the pooled sample proportion of successes, n1 and n2 are the group sizes, and X² = z².
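The relation X² = z² for a 2x2 table can be checked numerically. The sketch below assumes the usual pooled standard error for the z test of two proportions; the counts are invented for illustration.

```python
import math

# Hypothetical counts, invented for illustration.
successes_1, n1 = 30, 100   # group 1
successes_2, n2 = 20, 100   # group 2

p1 = successes_1 / n1
p2 = successes_2 / n2
pooled = (successes_1 + successes_2) / (n1 + n2)  # common proportion under H0

se0 = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p2 - p1) / se0

# The same data as a 2x2 table: rows = groups, columns = success / failure.
observed = [
    [successes_1, n1 - successes_1],
    [successes_2, n2 - successes_2],
]
n = n1 + n2
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
x2 = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(2)
    for j in range(2)
)

print(round(z, 3), round(z ** 2, 3), round(x2, 3))
```

The sign of z shows which group has the larger proportion; squaring it discards that direction, which is why X² cannot be used for a one-sided alternative.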

The test statistics z and X² are used in different cases. The z statistic is suitable, for instance, for one-sided alternative hypotheses. For a contingency table larger than 2x2, X² is better because it can handle multiple parameters. Df can be interpreted as the number of parameters required to describe the contingency table.

The chi-squared test does have limitations. It only works well for large samples, with an expected frequency of at least 5 per cell. For small samples Fisher's exact test is better. The chi-squared test works best for nominal scales. For ordinal scales other tests are preferred.

8.3 In which way do residuals help to analyze the association between variables?

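When the P-value of a chi-squared test is very small, there is strong evidence of an association between the variables. This says nothing about the way in which the variables are associated or how strong the association is. That's why residuals are important. A residual is the difference between the observed and expected frequency of a cell: fo – fe. When a residual is positive, the observed frequency is bigger than the expected frequency. A standardized residual expresses this difference as the number of standard errors that fo falls from fe when H0 is true, so when there is independence. The formula for a standardized residual is:

z = (fo – fe) / se = (fo – fe) / √[fe(1 – row proportion)(1 – column proportion)]

in which the row proportion is the row total divided by the sample size and the column proportion is the column total divided by the sample size.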

A large standardized residual is evidence against independence in that cell. When the null hypothesis is true, the probability is only about 5% that a standardized residual has an absolute value higher than 2. So a residual below -3 or above 3 is very convincing evidence. Software gives both the test statistic X² and the residuals. In a 2x2 contingency table the standardized residual is the same in absolute value as the z test statistic for comparing two proportions.
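A sketch of standardized residuals, assuming the usual form (fo – fe) / √[fe(1 – row proportion)(1 – column proportion)]. The counts are invented for illustration and the function name is an assumption.

```python
import math

def standardized_residuals(observed):
    """Standardized residual for every cell of a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(observed):
        out_row = []
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n
            row_prop = row_totals[i] / n
            col_prop = col_totals[j] / n
            se = math.sqrt(fe * (1 - row_prop) * (1 - col_prop))
            out_row.append((fo - fe) / se)
        residuals.append(out_row)
    return residuals

# Hypothetical observed frequencies, invented for illustration.
observed = [
    [30, 70],
    [20, 80],
]
residuals = standardized_residuals(observed)
for row in residuals:
    print([round(value, 3) for value in row])
```

In a 2x2 table all four residuals have the same absolute value and differ only in sign, which matches the remark about the z statistic for two proportions.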

8.4 How is the association in a contingency table measured?

In analyzing a contingency table, researchers hope to find out:

  • Whether there is an association (measured by the chi-squared test)

  • How the data differ from independence (measured by standardized residuals)

  • How strong the association is between variables

Several measures of association summarize the strength of the connection between variables. They compare the data with the most extreme form of an association and with the complete absence of one, and indicate where the data are located between these two extremes.

The weakest possible association occurs, for instance, in a sample of 60% students and 40% non-students in which 30% of students and 30% of non-students say they love beer. The most extreme association would be if 100% of students and 0% of non-students love beer. Neither situation is realistic; in reality the percentages lie in between.

In a simple binary 2x2 contingency table it's easy to compare proportions. If the association is strong, the absolute value of the difference between the proportions is large.

The chi-squared test only measures how much evidence there is of an association. It does not measure how strong the association is. For instance, a large sample can provide strong evidence that a weak association exists.

When the outcomes of a binary response variable are labelled success and failure, the odds can be calculated: odds of success = probability of success / probability of failure. When the odds are 3, success is three times as likely as failure. The probability of success is odds / (odds + 1). The odds ratio of a 2x2 contingency table compares the odds of one group with the odds of another group: odds of row 1 / odds of row 2. The odds ratio is denoted by θ.

The odds ratio has the following characteristics:

  • The value doesn't depend on which variable is chosen as a response variable.

  • The odds ratio equals the product of the cells on one diagonal divided by the product of the cells on the other diagonal; hence it's also called the cross-product ratio.

  • The odds ratio can take any non-negative value.

  • When the probability of success is the same for two rows, then the odds ratio is 1.

  • An odds ratio smaller than 1 means that the odds of success are smaller for row 1 than for row 2.

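  • The further the odds ratio is from 1 in either direction, the stronger the association.

  • Two values of the odds ratio that are each other's reciprocal, such as 4 and 0.25, represent the same strength of association in opposite directions.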
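The odds and the odds ratio can be sketched as follows, using the standard definition odds = P(success) / P(failure). The counts are invented for illustration and the variable names are assumptions.

```python
# Hypothetical 2x2 table, invented for illustration: columns = success, failure.
row1_success, row1_failure = 30, 70
row2_success, row2_failure = 20, 80

# Odds of success within each row.
odds_row1 = row1_success / row1_failure
odds_row2 = row2_success / row2_failure

# Converting odds back to a probability: odds / (odds + 1).
prob_row1 = odds_row1 / (odds_row1 + 1)

# Odds ratio as a ratio of odds, and as the cross-product ratio.
odds_ratio = odds_row1 / odds_row2
cross_product = (row1_success * row2_failure) / (row1_failure * row2_success)

# Swapping the rows gives the reciprocal: same strength, opposite direction.
odds_ratio_swapped = odds_row2 / odds_row1

print(round(odds_row1, 4), round(odds_row2, 4), round(prob_row1, 4))
print(round(odds_ratio, 4), round(cross_product, 4), round(odds_ratio_swapped, 4))
```

Both ways of computing the odds ratio agree, which is why the cross-product form is a convenient shortcut for 2x2 tables.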

When a contingency table is larger than 2x2, it can be divided into smaller 2x2 contingency tables, each with its own odds ratio. Sometimes it's possible to capture a complex collection of data in a single number, but it's better to present multiple comparisons instead (like multiple odds ratios), to better reflect the data.

8.5 How do you measure the association between ordinal variables?

An association between ordinal variables can be positive or negative. A positive association means that a higher score on x goes along with a higher score on y. A negative association means that a higher score on x entails a lower score on y.

A pair of observations can be concordant (C) or discordant (D). A pair of observations is concordant when the subject that scores higher on one variable also scores higher on the other variable (evidence of a positive association). A pair is discordant when the subject that scores higher on one variable scores lower on the other (evidence of a negative association).

Because bigger samples have more pairs, the difference C – D is standardized, which gives gamma, denoted by γ̂ for a sample (the Greek letter gamma with a hat; this is different from ŷ or y-bar!). Gamma measures the association between the variables. Its formula is: γ̂ = (C – D) / (C + D).

The gamma value is between -1 and +1. Its sign indicates whether the association is positive or negative, and its absolute value indicates how strong the association is. The further gamma is from 0, the stronger the association. For instance, a gamma value of 0.17 indicates a positive but weak association. Gamma is a difference between ordinal proportions: the proportion of concordant pairs minus the proportion of discordant pairs.
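A sketch of how concordant and discordant pairs can be counted from a table with ordered rows and columns, and how gamma = (C – D) / (C + D) follows from them. The counts are invented for illustration and the function name is an assumption.

```python
def concordant_discordant(table):
    """Count concordant (C) and discordant (D) pairs in an ordered r x c table."""
    n_rows, n_cols = len(table), len(table[0])
    c_pairs = d_pairs = 0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):          # only rows that rank higher
                for m in range(n_cols):
                    if m > j:                        # higher on both variables
                        c_pairs += table[i][j] * table[k][m]
                    elif m < j:                      # higher on one, lower on the other
                        d_pairs += table[i][j] * table[k][m]
    return c_pairs, d_pairs

# Hypothetical 3x3 table with ordered categories (low, middle, high), invented for illustration.
table = [
    [20, 10, 5],
    [10, 15, 10],
    [5, 10, 20],
]

C, D = concordant_discordant(table)
gamma = (C - D) / (C + D)
print(C, D, round(gamma, 3))
```

Pairs that are tied on either variable (same row or same column) are neither concordant nor discordant and do not enter gamma.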

Other measures of association are Kendall's tau-b, Spearman's rho and Somers' d. Like gamma, they are ordinal measures; they resemble the correlation that is used for quantitative variables.

A confidence interval can also be calculated for gamma. In this case γ̂ denotes the sample gamma and γ the population gamma; the confidence interval is γ̂ ± z(se). To test H0: γ = 0, the test statistic is z = (γ̂ – 0) / se. These formulas work best if C and D are both higher than 50.

If two variables are ordinal, then an ordinal measure is preferable to the chi-squared test, because the chi-squared test ignores the ordering of the categories.

Other ordinal methods work in a similar way to gamma. An alternative is a test of linear-by-linear association, in which each category of each variable is assigned a score and the correlation between the scores is tested with a z-test. This is a method to detect a trend.

For a mix of ordinal and nominal variables, especially if the nominal variable has more than two categories, it's better not to use gamma.

Follow the author: Annemarie JoHo