Categorical outcomes: chi-square and loglinear analysis - summary of chapter 19 of Statistics by A. Field

Statistics
Chapter 19
Categorical outcomes: chi-square and loglinear analysis

Analysing categorical data

Sometimes we want to predict categorical outcome variables. We want to predict into which category an entity falls.

Associations between two categorical variables
Associations between several categorical variables: loglinear analysis
Assumptions when analysing categorical data
Interpreting the chi-square test
SPSS
Interpreting loglinear analysis in SPSS
Reporting the results of loglinear analysis

Associations between two categorical variables

With categorical variables we can’t use the mean or any similar statistic because the mean of a categorical variable is meaningless: the numeric values you attach to different categories are arbitrary, and the mean of those numeric values will depend on how many members each category has.

When we’ve measured only categorical variables, we analyse the number of things that fall into each combination of categories (the frequencies).

Pearson’s chi-square test

To see whether there’s a relationship between two categorical variables we can use the Pearson’s chi-square test.
This statistic is based on the simple idea of comparing the frequencies you observe in certain categories to the frequencies you might expect to get in those categories by chance.

X² = Σ(observed_ij-model_ij)²/ model_ij

i represents the rows in the contingency table
j represents the columns in the contingency table.

As model we use ‘expected frequencies’.

To adjust for inequalities, we calculate frequencies for each cell in the table using the column and row totals for that cell.
By doing so we factor in the total number of observations that could have contributed to that cell.

Model_ij = E_ij = (row total_i x column total_j) / n

X² has a distribution with known properties called the chi-square distribution. This has a shape determined by the degrees of freedom: (r-1)(c-1)

r = the number of rows

c = the number of columns

Fischer’s exact test

The chi-square statistic has a sampling distribution that is only approximately a chi-square distribution.
The larger the sample is, the better this approximation becomes. In large samples the approximation is good enough not to worry about the fact that it is an approximation.
In small samples, the approximation is not good enough, making significance tests of the chi-square statistic inaccurate.

Fischer’s exact tests: a way to compute the exact probability of the chi-square statistic in small samples.

The likelihood ratio

An alternative to Pearson’s chi-square.
Based on maximum-likelihood theory.

General idea: you collect some data and create a model for which the probability of obtaining the observed set of data is maximized, then you compare this model to the probability of obtaining those data under the null hypothesis.
The resulting statistic is based on comparing observed frequencies with those predicted by the model.

LX²= sΣobserved_ij In( Observed_ij / model_ij)

In = the natural logarithm.

This statistic has a chi-square distribution with the same degrees of freedom.

Yates’s correction

A correction to the Pearson formula.
Basic idea: when you calculate the deviation from the model you substract 0.5 from the absolute value of this deviation before you square it.
So:

X² = Σ(|observed_ij – model_ij| - 0.5)² / model_ij

The correction lowers the value of the chi-square statistic and therefore makes it less significant.

Other measures of association

There are measures of the strength of association that modify the chi-square statistic to take account of sample size and degrees of freedom and try to restrict the range of the test statistic from 0 to 1.
Three measures are:

Phi: accurate for 2x2 contingency tables, but not bigger
Contingency coefficient: ensures a value between 0 and 1, but seldom reaches its upper limit of 1.
Cramér’s V: when both variables have only two categories, phi and Cramér’s V are identical, but when variables have more than two categories, Cramér’s statistic can attain its maximum of 1 and so it is the most useful

The chi-square test as a linear model

The chi-square test can be conceptualized as a general linear model if we use log values.

Y_i = b₀+ b₁X_1i+ b₂X_{2i ……….}etcetera

Everything is the same as in factorial design except that we deal with log-transformed values.
Saturated model: a lack of error because of the various combinations of coding variables completely explain the observed values.

The chi-square test looks at whether two variables are independent, therefore, it has not interest in their combined effect, only their main effect.

Chi-square can be thought of as a linear model in which the beta values tell us something about the relative differences in frequencies across categories of our two variables.

Associations between several categorical variables: loglinear analysis

Often we want to analyse more complex contingency tables in which there are three or more variables.
This has to be analysed with a loglinear analysis.

In(O_ijk) = b₀ + b₁A_i + b₂B_j+b₃C_k+b₄AB_ij+b₅AC_jk+b₇ABC_ijk+In(Ɛ_ij)

When our outcome is categorical and we include all the available terms (main effects and interactions) we get no error: our predictors perfectly predict the outcome (the expected values). The model is saturated.

Loglinear analysis typicall works on a princple of backward elimination.
We begin with the saturated model, remove a predictor from the model, re-estimate the model and use it to predict our outcome and see how well it fits the data.
We assume the term we removed was not having a significant impact on the ability of our model to predict the observed outcome.

We don’t remove terms randomly, we do it hierarchically.
We start with the saturated model, remove the highest-order interaction, and assess the effect it has. If removing the highest-order interaction term has not substantial impact on the model we get rid of it and move on to remove the next highest-order interactions.
We carry on until we find and effect that does affect the fit of the model when it is removed.

The likelihood ratio statistic is used to assess each model.
This equation can be adapted to fit any model: the observed values are the same throughout, and the model frequencies are the expected frequencies from he model being tested.
For the saturated model, this statistic will always be 0 (because the observed and model frequencies are the same, so the ratio of observed to model frequencies will be 1, and Ln(1) =0).
In other situations it will provide a measure of how well the model first the observed frequencies.

To test whether a new model has changed the likelihood ratio, we take the likelihood ratio for a model and subtract from it the likelihood statistic for the previous model (provided the models are hierarchically structured):

Lx²_change= Lx²_{current model} – Lx²_{previous model}

Assumptions when analysing categorical data

The chi-square has two important assumptions relating to

independence
expected frequencies

Independence

The general linear model makes an assumption about the independence of residuals, and the chi-square test, being a linear model of sorts, is no exception.
For the chi-square test to be meaningful each person, item, or entity must contribute to only one cell of the contingency table.
You cannot use a chi-square test on a repeated-measures design.

Expected frequencies

With 2x2 contingency tables, no expected values should be below 5.
In larger tables, and when looking at associations between three or more categorical variables, the rule is that all expected counts should be greater than 1 and no more than 20% of expected counts should be less than 5.

If this assumption is broken, the result is a radical reduction in test power.

In terms of remedies, if you’re looking at associations between only two variables then consider using Fischerś exact test.
With three or more variables your options are to:

collapse data across one or more variables (prefrably the one you least expect to have an effect)
collapse levels of one of the variables
collect more data
accept the loss of power

If you want to collapse data across one of the variables then:

the highest-order interaction should be non-significant
at least one of the lower-order interaction terms involving the variable to be deleted should be non-significant

More doom and gloom

Not an assumption.
Proportionately small differences in cell frequencies can result in statistically significant associations between variables if the sample is large enough.
Therefore, we must look at row and column percentages to interpret the significant effects that we get. These percentages will reflect the patterns of data far better than the frequencies themselves.

Interpreting the chi-square test

The contingency table contains the number of cases that fall into each combination of categories.

The test compares the proportion and not the counts themselves.
If columns have different subscripts (like a and b), that means that they are significantly different.

Using standardized residuals

In a 2x2 contingency table the nature of a significant association can be clear from just the cell percentages or counts. In larger contingency tables, this may not be the cause and you need a finer-grained investigation of the contingency table.
You can look at the standardized residual.

Standardized residual: (observed_ij-model_ij)/ wortel(model_ij)

Two important things about standardized residuals:

given that the chi-square statistic is the sum of these standardized residuals (sort of), we want to decompose what contributes to the overall association that the chi-square statistic measures, then looking at the individual standardized residuals is a good idea because they have a direct relationship with the test statistic.
these standardized residuals behave like any other: each one is a z-score. This is very useful because by looking at a standardized residual we can assess its significance.

Reporting the results of a chi-square test

When reporting Pearson’s chi-square we report the value of the test statistic with its associated degrees of freedom and the significance value.
The test statistic is X².

For example:

X²(1) = 25,36, p<0.001

SPSS

To test the relationship between two categorical variables use Pearson’s chi-square test or the likelihood ratio statistic.
Look at the table labelled chi-square tests, if the Exact sig. Value is less than 0.05 for the row labelled Pearson chi-square then there is a significant relationship between your two variables.
Check underneath this table to make sure that no expected frequencies are less than 5
Look at the contingency table to work out what the relationship between the variables is: look out for significant standardized residuals (values outside +/- 1.96), and columns that have different letters as subscripts (this indicates a significant difference).
Calculate odds ratio
The Bayes factor reported by SPSS statistics tells you the probability of the data under the null hypothesis relative to the alternative. Divide 1 by this value to see the probability of the data under the alternative hypothesis relative to the null. Values greater than 1 indicate that your belief should change towards the alternative hypothesis, whit values greater than 3 stating to indicate a change in beliefs that has substance.
Report the X² statistic, the degrees of freedom, the significance value and odds ratio. Also report the contingency table.

Interpreting loglinear analysis in SPSS

The output contains three tables.

the first table tells us how many cases we have.
the second table gives use the observed and expected counts for each of the combinations of categories in our model. These values should be the same as the original contingency table, except that each cell has 0.5 added to it.
the final table contains two goodness-of-fit statistics
- Pearson’s chi-square
- the likelihood ratio.

These statistics are testing the hypothesis that the frequencies predicted by the model are significantly different from the observed frequencies in the data.
If our model is a good fit of the data then the observed and expected frequencies should be very similar.
A significant result means that our model predictions are significantly different from our data.

The second output tells us about the effects of removing parts of the model. It is labelled K-Way or Higher-Order effects.
It shows the likelihood ratio and Pearson’s chi-square statistics when K= 1, 2 and 3.

the first row (K =1) tells us whether removing the one-way effect and higher-order effects well significantly affect the fit of the model.
the next row (K=2) tells us whether removing the two-way interactions and any higher-order effects will affect the model.
the final row (K=3) tests whether removing the three-way effect and higher-order effects will significantly affect the model.

The parameter estimates output
Tests each effect in the model with a z-score, and gives us confidence intervals.

Reporting the results of loglinear analysis

For loglinear analysis report the likelihood ratio statistic for the final model, usually denoted just be X².
For any terms that are significant, you should report the chi-square change, or you could consider reporting the z-score for the effect and its associated confidence interval.
If you break down any higher-order interactions in subsequent analyses then you need to report the relevant chi-square statistics (and odds ratios).

SPSS

Test the relationship between more than two categorical variables with loglinear analysis.
loglinear analysis is hierarchical: the initial model contains all main effects and interactions. Starting with the highest-order interaction, terms are removed to see whether their removal significantly affects the fit of the model. If it does then this term is not removed and all lower-order effects are ignored.
Look at the table labelled K-way and higher-order effects to see which effect have been retained in the final model. Then look at the table labelled partial associations to see the individual significance of the retained effects (look at the column labelled sig. - values less than 0.05 indicate significance)
Look at the Goodness-of-fit tests for the final model: if this model is a good fit of the data then this statistic should be non-significant (Sig. Should be bigger than 0.05)
Look at the contingency table to interpret any significant effects (percentage of total for cells is the best thing to look at).

Access:

Public

Join WorldSupporter!

Join with a free account for more service, or become a member for full access to exclusives and extra support of WorldSupporter >>

This content is related to:

Summary of Discovering statistics using IBM SPSS statistics by Field - 5th edition

Check more of topic:

Samenvattingen voor psychologie en gedrag

Universiteit Amsterdam: UVA

This content is used in:

Summary of Discovering statistics using IBM SPSS statistics by Field - 5th edition

Going abroad?

Insure your way around the world

International expat insurances

Travel & Worldsupporter insurances (NL)

Study with summaries

Contributions: posts

Help other WorldSupporters with additions, improvements and tips

Spotlight: topics

Check the related and most recent topics and summaries:

Activities abroad, study fields and working areas:

Samenvattingen voor psychologie en gedrag

Research, science and statistics

Countries and regions:

The Netherlands

WorldSupporter and development goals:

Development Goal 04: Quality Education

Institutions, jobs and organizations:

Universiteit Amsterdam: UVA

This content is also used in .....

Summary of Discovering statistics using IBM SPSS statistics by Field - 5th edition

This is a summary of the book "Discovering statistics using IBM SPSS statistics" by A. Field. In this summary, everything students at the second year of psychology at the Uva will need is present. The content needed in the thirst three blocks are already online, and the rest

...

analysis-2958826_960_720.jpg

Why is my evil lecturer forcing me to learn statisics? - summary of chapter 1 of statistics by A. Field (5th edition)

The spine of statistics - summary of chapter 2 of Statistics by A. Field (5th edition)

The beast of bias - summary of chapter 6 of Statistics by A. Field (5th edition)

Non-parametric models - summary of chapter 7 of Statistics by A. Field (5h edition)

Correlation - summary of chapter 8 of Statistics by A. Field (5th edition)

The linear model - summary of Chapter 9 by A. Field 5th edition

Comparing two means - summary of chapter 10 of Statistics by A. Field (5th edition)

Moderation, mediation, and multi-category predictors - summary of chapter 11 of Statistics by A. Field (5th edition),

Comparing several independent means - summary of chapter 12 of Statistics by A. Field (5th edition)

Analysis of covariance - summary of chapter 13 of Statistics by A. Field (5th edition)

Factorial designs - summary of chapter 14 of statistics by A. Field (5th edition)

Repeated measures designs - summary of chapter 15 of Statistics by A. Field (5th edition)

Mixed designs - summary of chapter 16 of Statistics by A. Field (5th edition)

Multivariate analysis of variance (MANOVA) - summary of chapter 17 of Statistics by A. Field (5th edition)

Exploratory factor analysis - summary of chapter 18 of Statistics by A. Field (5th edition)

Categorical outcomes: chi-square and loglinear analysis - summary of chapter 19 of Statistics by A. Field

WSRt using SPSS, manual for tests in the third block of the second year of psychology at the uva

Everything you need for the course WSRt of the second year of Psychology at the Uva

Categorical outcomes: logistic regression - summary of (part of) chapter 20 of Statistics by A. Field

Check how to use summaries on WorldSupporter.org

Submenu: Summaries & Activities

Follow the author: SanneA

Work for WorldSupporter

JoHo can really use your help! Check out the various student jobs here that match your studies, improve your competencies, strengthen your CV and contribute to a more tolerant world

Working for JoHo as a student in Leyden

Parttime werken voor JoHo

Statistics

Search a summary, study help or student organization

Select any filter and click on Search to see results

Categorical outcomes: chi-square and loglinear analysis - summary of chapter 19 of Statistics by A. Field

Associations between two categorical variables

Associations between several categorical variables: loglinear analysis

Assumptions when analysing categorical data

Interpreting the chi-square test

SPSS

Interpreting loglinear analysis in SPSS

Reporting the results of loglinear analysis

Summary of Discovering statistics using IBM SPSS statistics by Field - 5th edition

Samenvattingen voor psychologie en gedrag

Universiteit Amsterdam: UVA

Summary of Discovering statistics using IBM SPSS statistics by Field - 5th edition

Contributions: posts

Add new contribution

Spotlight: topics

Samenvattingen voor psychologie en gedrag

Research, science and statistics

The Netherlands

Development Goal 04: Quality Education

Universiteit Amsterdam: UVA

Summary of Discovering statistics using IBM SPSS statistics by Field - 5th edition

analysis-2958826_960_720.jpg

Online access to all summaries, study notes en practice exams

How and why use WorldSupporter.org for your summaries and study assistance?

Using and finding summaries, notes and practice exams on JoHo WorldSupporter

Quicklinks to fields of study for summaries and study assistance