
Categorical outcomes: chi-square and loglinear analysis - summary of chapter 19 of Statistics by A. Field


Analysing categorical data

Sometimes we want to predict a categorical outcome variable: we want to predict into which category an entity falls.

Associations between two categorical variables

With categorical variables we can’t use the mean or any similar statistic because the mean of a categorical variable is meaningless: the numeric values you attach to different categories are arbitrary, and the mean of those numeric values will depend on how many members each category has.

When we’ve measured only categorical variables, we analyse the number of things that fall into each combination of categories (the frequencies).

Pearson’s chi-square test

To see whether there’s a relationship between two categorical variables we can use Pearson’s chi-square test.
This statistic is based on the simple idea of comparing the frequencies you observe in certain categories to the frequencies you might expect to get in those categories by chance.

χ² = Σ (observed_ij − model_ij)² / model_ij

i represents the rows in the contingency table
j represents the columns in the contingency table.

As the model we use the ‘expected frequencies’.

To adjust for inequalities, we calculate frequencies for each cell in the table using the column and row totals for that cell.
By doing so we factor in the total number of observations that could have contributed to that cell.

Model_ij = E_ij = (row total_i × column total_j) / n

χ² has a sampling distribution with known properties, called the chi-square distribution. Its shape is determined by the degrees of freedom: (r − 1)(c − 1), where:

r = the number of rows

c = the number of columns
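
As a concrete illustration, here is a minimal sketch in Python that computes the expected frequencies, the chi-square statistic, its degrees of freedom and its p-value by hand. The counts are illustrative (loosely modelled on Field’s cat-training example), not taken from the book’s output.

```python
import numpy as np
from scipy import stats

# Illustrative 2x2 contingency table: rows = training method, columns = outcome
observed = np.array([[28, 10],
                     [48, 114]])

n = observed.sum()
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)

# Expected (model) frequencies: E_ij = row total_i * column total_j / n
expected = np.outer(row_totals, col_totals) / n

# Pearson chi-square: sum over all cells of (observed - expected)^2 / expected
chi2 = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom: (r - 1)(c - 1)
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p = stats.chi2.sf(chi2, df)

print(f"chi2({df}) = {chi2:.2f}, p = {p:.4f}")
```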

Fisher’s exact test

The chi-square statistic has a sampling distribution that is only approximately a chi-square distribution.
The larger the sample is, the better this approximation becomes. In large samples the approximation is good enough not to worry about the fact that it is an approximation.
In small samples, the approximation is not good enough, making significance tests of the chi-square statistic inaccurate.

Fisher’s exact test: a way to compute the exact probability of the chi-square statistic in small samples.
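
A minimal sketch, assuming a small hypothetical 2×2 table where the chi-square approximation would be unreliable; scipy implements the test for 2×2 tables.

```python
from scipy import stats

# Hypothetical small-sample 2x2 table
table = [[4, 1],
         [2, 6]]

# fisher_exact computes the exact p-value (and the sample odds ratio)
odds_ratio, p_exact = stats.fisher_exact(table, alternative='two-sided')
print(f"odds ratio = {odds_ratio:.2f}, exact p = {p_exact:.4f}")
```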

The likelihood ratio

An alternative to Pearson’s chi-square, based on maximum-likelihood theory.

General idea: you collect some data and create a model for which the probability of obtaining the observed set of data is maximized, then you compare this model to the probability of obtaining those data under the null hypothesis.
The resulting statistic is based on comparing observed frequencies with those predicted by the model.

Lχ² = 2 Σ observed_ij ln(observed_ij / model_ij)

ln = the natural logarithm.

This statistic has a chi-square distribution with the same degrees of freedom as Pearson’s chi-square.
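
A minimal sketch of the likelihood ratio statistic, reusing the same illustrative counts as above:

```python
import numpy as np
from scipy import stats

observed = np.array([[28, 10],
                     [48, 114]], dtype=float)
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Likelihood ratio statistic: 2 * sum of O * ln(O / E)
l_chi2 = 2 * (observed * np.log(observed / expected)).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(f"L chi2({df}) = {l_chi2:.2f}, p = {stats.chi2.sf(l_chi2, df):.4f}")
```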

Yates’s correction

A correction to the Pearson formula.
Basic idea: when you calculate the deviation from the model you subtract 0.5 from the absolute value of this deviation before you square it.
So:

χ² = Σ (|observed_ij − model_ij| − 0.5)² / model_ij

The correction lowers the value of the chi-square statistic and therefore makes it less significant.
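
A minimal sketch: scipy applies Yates’s continuity correction to 2×2 tables when `correction=True` is passed (the same illustrative counts as before).

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[28, 10],
                     [48, 114]])

# correction=True applies Yates's continuity correction for 2x2 tables
chi2_corrected, p, df, expected = chi2_contingency(observed, correction=True)
print(f"Yates-corrected chi2({df}) = {chi2_corrected:.2f}, p = {p:.4f}")
```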

Other measures of association

There are measures of the strength of association that modify the chi-square statistic to take account of sample size and degrees of freedom, and that try to restrict the range of the test statistic to between 0 and 1.
Three such measures are:

  • Phi: accurate for 2×2 contingency tables, but not for bigger ones.
  • Contingency coefficient: ensures a value between 0 and 1, but seldom reaches its upper limit of 1.
  • Cramér’s V: when both variables have only two categories, phi and Cramér’s V are identical; when variables have more than two categories, Cramér’s statistic can attain its maximum of 1, which makes it the most useful of the three (see the sketch after this list).
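
A minimal sketch computing phi and Cramér’s V from the chi-square statistic, using the standard formulas phi = √(χ²/n) and V = √(χ² / (n(k − 1))), where k is the smaller of the number of rows and columns; the counts are again illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[28, 10],
                     [48, 114]])
chi2, p, df, expected = chi2_contingency(observed, correction=False)

n = observed.sum()
k = min(observed.shape)               # smaller of rows/columns

phi = np.sqrt(chi2 / n)               # only interpretable for 2x2 tables
cramers_v = np.sqrt(chi2 / (n * (k - 1)))
print(f"phi = {phi:.3f}, Cramer's V = {cramers_v:.3f}")
```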

The chi-square test as a linear model

The chi-square test can be conceptualized as a general linear model if we use log values.

Y_i = b0 + b1X_1i + b2X_2i + …

Everything is the same as in a factorial design except that we deal with log-transformed values.
Saturated model: there is no error, because the various combinations of coding variables completely explain the observed values.

The chi-square test looks at whether two variables are independent; it therefore has no interest in their combined effect, only in their main effects.

Chi-square can be thought of as a linear model in which the beta values tell us something about the relative differences in frequencies across categories of our two variables.

Associations between several categorical variables: loglinear analysis

Often we want to analyse more complex contingency tables involving three or more variables.
These are analysed with loglinear analysis.

ln(O_ijk) = b0 + b1A_i + b2B_j + b3C_k + b4AB_ij + b5AC_ik + b6BC_jk + b7ABC_ijk + ln(ε_ijk)

When our outcome is categorical and we include all the available terms (main effects and interactions) we get no error: our predictors perfectly predict the outcome (the expected values). The model is saturated.
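
One common way to estimate such a model is as a Poisson GLM on the cell counts. A minimal sketch using statsmodels, with a hypothetical 2×2×2 frequency table in long format (the variables A, B, C and the counts are invented for illustration):

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical 2x2x2 frequency table in long format
df = pd.DataFrame({
    'A': ['a1','a1','a1','a1','a2','a2','a2','a2'],
    'B': ['b1','b1','b2','b2','b1','b1','b2','b2'],
    'C': ['c1','c2','c1','c2','c1','c2','c1','c2'],
    'count': [28, 10, 48, 114, 12, 30, 50, 25],
})

# Saturated model: all main effects and interactions -> fits the counts exactly
saturated = smf.glm('count ~ A * B * C', data=df,
                    family=sm.families.Poisson()).fit()

# Reduced model: drop the three-way interaction, keep all lower-order terms
reduced = smf.glm('count ~ (A + B + C) ** 2', data=df,
                  family=sm.families.Poisson()).fit()

# The deviance difference is the likelihood ratio change for removing A:B:C
print(f"L chi2 change = {reduced.deviance - saturated.deviance:.2f}")
```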

Loglinear analysis typically works on a principle of backward elimination.
We begin with the saturated model, remove a predictor from the model, re-estimate the model, and use it to predict our outcome to see how well it fits the data.
If the fit is not substantially worse, we assume the term we removed was not having a significant impact on the ability of our model to predict the observed outcome.

We don’t remove terms at random; we do it hierarchically.
We start with the saturated model, remove the highest-order interaction, and assess the effect this has. If removing the highest-order interaction has no substantial impact on the model, we get rid of it and move on to remove the next highest-order interactions.
We carry on until we find an effect that does affect the fit of the model when it is removed.

The likelihood ratio statistic is used to assess each model.
This equation can be adapted to fit any model: the observed values are the same throughout, and the model frequencies are the expected frequencies from the model being tested.
For the saturated model, this statistic will always be 0 (because the observed and model frequencies are the same, the ratio of observed to model frequencies is 1, and ln(1) = 0).
In other situations it provides a measure of how well the model fits the observed frequencies.

To test whether a new model has changed the likelihood ratio, we take the likelihood ratio statistic for the current model and subtract from it the likelihood ratio statistic for the previous model (provided the models are hierarchically structured):

Lχ²_change = Lχ²_current model − Lχ²_previous model
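
A minimal sketch of the change test, assuming hypothetical likelihood ratio statistics and degrees of freedom for two hierarchically nested models:

```python
from scipy.stats import chi2

# Hypothetical likelihood ratio statistics for two nested models
l_chi2_current, df_current = 24.94, 4    # model without the removed term
l_chi2_previous, df_previous = 0.0, 0    # saturated model: L chi2 is always 0

change = l_chi2_current - l_chi2_previous
df_change = df_current - df_previous     # df also change between the models
p = chi2.sf(change, df_change)
print(f"L chi2 change({df_change}) = {change:.2f}, p = {p:.4f}")
```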

Assumptions when analysing categorical data

The chi-square test has two important assumptions, relating to:

  • independence
  • expected frequencies

Independence

The general linear model makes an assumption about the independence of residuals, and the chi-square test, being a linear model of sorts, is no exception.
For the chi-square test to be meaningful each person, item, or entity must contribute to only one cell of the contingency table.
You cannot use a chi-square test on a repeated-measures design.

Expected frequencies

With 2x2 contingency tables, no expected values should be below 5.
In larger tables, and when looking at associations between three or more categorical variables, the rule is that all expected counts should be greater than 1 and no more than 20% of expected counts should be less than 5.
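
A minimal sketch of checking these rules on a two-way table of counts (for three or more variables the expected counts would come from the fitted model instead):

```python
import numpy as np

def expected_counts_ok(observed):
    """Check the expected-frequency rules for a two-way table of counts."""
    observed = np.asarray(observed, dtype=float)
    expected = (np.outer(observed.sum(axis=1), observed.sum(axis=0))
                / observed.sum())
    if observed.shape == (2, 2):
        return (expected >= 5).all()          # 2x2 rule: no cell below 5
    prop_small = (expected < 5).mean()        # proportion of cells below 5
    return (expected > 1).all() and prop_small <= 0.20

print(expected_counts_ok([[28, 10], [48, 114]]))  # True for these counts
```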

If this assumption is broken, the result is a radical reduction in test power.

In terms of remedies, if you’re looking at associations between only two variables then consider using Fisher’s exact test.
With three or more variables your options are to:

  • collapse data across one or more variables (preferably the one you least expect to have an effect)
  • collapse levels of one of the variables
  • collect more data
  • accept the loss of power

If you want to collapse data across one of the variables then:

  • the highest-order interaction should be non-significant
  • at least one of the lower-order interaction terms involving the variable to be deleted should be non-significant

More doom and gloom

This is not an assumption, but a caution.
Proportionately small differences in cell frequencies can result in statistically significant associations between variables if the sample is large enough.
Therefore, we must look at row and column percentages to interpret the significant effects that we get. These percentages will reflect the patterns of data far better than the frequencies themselves.

Interpreting the chi-square test

The contingency table contains the number of cases that fall into each combination of categories.

The test compares the proportions, not the counts themselves.
If columns have different subscripts (such as a and b), the columns differ significantly.

Using standardized residuals

In a 2×2 contingency table the nature of a significant association can be clear from just the cell percentages or counts. In larger contingency tables this may not be the case, and you need a finer-grained investigation of the contingency table.
For this you can look at the standardized residuals.

Standardized residual: (observed_ij − model_ij) / √(model_ij)

Two important things about standardized residuals (see the sketch after this list):

  • given that the chi-square statistic is the sum of these standardized residuals squared, looking at the individual standardized residuals is a good way to decompose what contributes to the overall association that the chi-square statistic measures, because they have a direct relationship with the test statistic.
  • these standardized residuals behave like any other: each one is a z-score. This is very useful because by looking at a standardized residual we can assess its significance.
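
A minimal sketch computing the standardized residual for each cell and flagging those outside ±1.96 (significant at p < .05, since each residual is a z-score); the counts are the same illustrative ones as above.

```python
import numpy as np

observed = np.array([[28, 10],
                     [48, 114]], dtype=float)
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Standardized residual per cell: (O - E) / sqrt(E); each one is a z-score
std_resid = (observed - expected) / np.sqrt(expected)
print(std_resid.round(2))

# Cells outside +/-1.96 contribute significantly to the overall association
print(np.abs(std_resid) > 1.96)
```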

Reporting the results of a chi-square test

When reporting Pearson’s chi-square we report the value of the test statistic with its associated degrees of freedom and the significance value.
The test statistic is denoted χ².

For example:

χ²(1) = 25.36, p < .001

SPSS

  • To test the relationship between two categorical variables use Pearson’s chi-square test or the likelihood ratio statistic.
  • Look at the table labelled Chi-square tests; if the Exact Sig. value is less than 0.05 for the row labelled Pearson chi-square then there is a significant relationship between your two variables.
  • Check underneath this table to make sure that no expected frequencies are less than 5.
  • Look at the contingency table to work out what the relationship between the variables is: look out for significant standardized residuals (values outside ±1.96), and columns that have different letters as subscripts (this indicates a significant difference).
  • Calculate the odds ratio (see the sketch after this list).
  • The Bayes factor reported by SPSS Statistics tells you the probability of the data under the null hypothesis relative to the alternative. Divide 1 by this value to see the probability of the data under the alternative hypothesis relative to the null. Values greater than 1 indicate that your belief should change towards the alternative hypothesis, with values greater than 3 starting to indicate a change in beliefs that has substance.
  • Report the χ² statistic, the degrees of freedom, the significance value and the odds ratio. Also report the contingency table.
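
A minimal sketch of the odds ratio for a 2×2 table, using the same illustrative counts as in the earlier sketches: the odds of the outcome in row 1 divided by the odds of the outcome in row 2.

```python
observed = [[28, 10],
            [48, 114]]

odds_row1 = observed[0][0] / observed[0][1]   # 28 / 10  = 2.80
odds_row2 = observed[1][0] / observed[1][1]   # 48 / 114 ~ 0.42
odds_ratio = odds_row1 / odds_row2
print(f"odds ratio = {odds_ratio:.2f}")       # ~ 6.65
```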

Interpreting loglinear analysis in SPSS

The output contains three tables.

  • the first table tells us how many cases we have.
  • the second table gives us the observed and expected counts for each of the combinations of categories in our model. These values should be the same as the original contingency table, except that each cell has 0.5 added to it.
  • the final table contains two goodness-of-fit statistics
    • Pearson’s chi-square
    • the likelihood ratio.

These statistics are testing the hypothesis that the frequencies predicted by the model are significantly different from the observed frequencies in the data.
If our model is a good fit of the data then the observed and expected frequencies should be very similar.
A significant result means that our model predictions are significantly different from our data.

The second output tells us about the effects of removing parts of the model. It is labelled K-Way or Higher-Order effects.
It shows the likelihood ratio and Pearson’s chi-square statistics when K= 1, 2 and 3.

  • the first row (K = 1) tells us whether removing the one-way effects and higher-order effects will significantly affect the fit of the model.
  • the next row (K=2) tells us whether removing the two-way interactions and any higher-order effects will affect the model.
  • the final row (K=3) tests whether removing the three-way effect and higher-order effects will significantly affect the model.

The parameter estimates output
This output tests each effect in the model with a z-score, and gives us confidence intervals.

Reporting the results of loglinear analysis

For loglinear analysis report the likelihood ratio statistic for the final model, usually denoted just by χ².
For any terms that are significant, you should report the chi-square change, or you could consider reporting the z-score for the effect and its associated confidence interval.
If you break down any higher-order interactions in subsequent analyses then you need to report the relevant chi-square statistics (and odds ratios).

SPSS

  • Test the relationship between more than two categorical variables with loglinear analysis.
  • loglinear analysis is hierarchical: the initial model contains all main effects and interactions. Starting with the highest-order interaction, terms are removed to see whether their removal significantly affects the fit of the model. If it does then this term is not removed and all lower-order effects are ignored.
  • Look at the table labelled K-way and higher-order effects to see which effects have been retained in the final model. Then look at the table labelled Partial associations to see the individual significance of the retained effects (look at the column labelled Sig.: values less than 0.05 indicate significance).
  • Look at the Goodness-of-fit tests for the final model: if this model is a good fit of the data then this statistic should be non-significant (Sig. should be bigger than 0.05).
  • Look at the contingency table to interpret any significant effects (percentage of total for cells is the best thing to look at).
