Summary of Discovering statistics using IBM SPSS statistics by Field - 5th edition
Chapter 19
Categorical outcomes: chi-square and loglinear analysis
Analysing categorical data
Sometimes we want to predict categorical outcome variables. We want to predict into which category an entity falls.
With categorical variables we can’t use the mean or any similar statistic because the mean of a categorical variable is meaningless: the numeric values you attach to different categories are arbitrary, and the mean of those numeric values will depend on how many members each category has.
When we’ve measured only categorical variables, we analyse the number of things that fall into each combination of categories (the frequencies).
Pearson’s chi-square test
To see whether there’s a relationship between two categorical variables we can use the Pearson’s chi-square test.
This statistic is based on the simple idea of comparing the frequencies you observe in certain categories to the frequencies you might expect to get in those categories by chance.
\chi^2 = \sum_{ij} \frac{(\text{observed}_{ij} - \text{model}_{ij})^2}{\text{model}_{ij}}
i represents the rows in the contingency table
j represents the columns in the contingency table.
As the model we use the 'expected frequencies'.
To adjust for unequal row and column totals, we calculate the expected frequency for each cell in the table using the row and column totals for that cell.
By doing so we factor in the total number of observations that could have contributed to that cell.
\text{model}_{ij} = E_{ij} = \frac{\text{row total}_i \times \text{column total}_j}{n}
X2 has a distribution with known properties called the chi-square distribution. This has a shape determined by the degrees of freedom: (r-1)(c-1)
r = the number of rows
c = the number of columns
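As a minimal sketch (not from the book, which works in SPSS), the statistic can be computed with made-up counts; scipy's chi2_contingency gives the same result in one call:

```python
# A minimal sketch of Pearson's chi-square on a hypothetical 2x2 table
# (the counts are made up for illustration; the book itself uses SPSS).
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[28, 48],
                     [10, 114]])

# Expected (model) frequencies: E_ij = row total_i * column total_j / n
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()
expected = row_totals * col_totals / n

# Chi-square statistic and degrees of freedom (r - 1)(c - 1)
chi_sq = ((observed - expected) ** 2 / expected).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(chi_sq, df)

# scipy does the same in one call (correction=False gives the plain Pearson statistic)
stat, p, dof, exp = chi2_contingency(observed, correction=False)
print(chi_sq, p_value, stat, p)
```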
Fisher's exact test
The chi-square statistic has a sampling distribution that is only approximately a chi-square distribution.
The larger the sample is, the better this approximation becomes. In large samples the approximation is good enough not to worry about the fact that it is an approximation.
In small samples, the approximation is not good enough, making significance tests of the chi-square statistic inaccurate.
Fisher's exact test: a way to compute the exact probability of the chi-square statistic in small samples.
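A minimal sketch of Fisher's exact test with hypothetical counts (again outside SPSS); scipy supports the exact test for 2x2 tables only:

```python
# A minimal sketch of Fisher's exact test on a small hypothetical 2x2 table.
from scipy.stats import fisher_exact

small_table = [[4, 1],
               [2, 5]]   # made-up counts, too small for the chi-square approximation

odds_ratio, p_value = fisher_exact(small_table)
print(odds_ratio, p_value)
```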
The likelihood ratio
An alternative to Pearson’s chi-square.
Based on maximum-likelihood theory.
General idea: you collect some data and create a model for which the probability of obtaining the observed set of data is maximized, then you compare this model to the probability of obtaining those data under the null hypothesis.
The resulting statistic is based on comparing observed frequencies with those predicted by the model.
L\chi^2 = 2 \sum_{ij} \text{observed}_{ij} \, \ln\!\left(\frac{\text{observed}_{ij}}{\text{model}_{ij}}\right)
ln = the natural logarithm.
This statistic has a chi-square distribution with the same degrees of freedom.
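The statistic can be sketched the same way, reusing the hypothetical table from the Pearson example; scipy reproduces it via the lambda_="log-likelihood" option:

```python
# A minimal sketch of the likelihood ratio statistic on the same made-up table.
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[28, 48],
                     [10, 114]], dtype=float)
expected = (observed.sum(axis=1, keepdims=True) *
            observed.sum(axis=0, keepdims=True)) / observed.sum()

# L chi-square = 2 * sum of observed * ln(observed / model)
l_chi_sq = 2 * (observed * np.log(observed / expected)).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(l_chi_sq, chi2.sf(l_chi_sq, df))

# scipy computes the same statistic when lambda_="log-likelihood"
print(chi2_contingency(observed, correction=False, lambda_="log-likelihood")[0])
```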
Yates’s correction
A correction to the Pearson formula.
Basic idea: when you calculate the deviation from the model you subtract 0.5 from the absolute value of this deviation before you square it.
So:
\chi^2 = \sum_{ij} \frac{(|\text{observed}_{ij} - \text{model}_{ij}| - 0.5)^2}{\text{model}_{ij}}
The correction lowers the value of the chi-square statistic and therefore makes it less significant.
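A sketch of the corrected statistic on the same hypothetical table; for 2x2 tables scipy applies the equivalent correction when correction=True:

```python
# A minimal sketch of Yates's correction: subtract 0.5 from each
# absolute deviation before squaring (same made-up table as before).
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[28, 48],
                     [10, 114]], dtype=float)
expected = (observed.sum(axis=1, keepdims=True) *
            observed.sum(axis=0, keepdims=True)) / observed.sum()

corrected = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

# For 2x2 tables scipy applies the same correction when correction=True (the default)
print(corrected, chi2_contingency(observed, correction=True)[0])
```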
Other measures of association
There are measures of the strength of association that modify the chi-square statistic to take account of sample size and degrees of freedom, and that restrict the range of the statistic to between 0 and 1.
Three such measures are the phi coefficient, the contingency coefficient, and Cramér's V.
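A minimal sketch (using scipy, which is not part of the book's SPSS workflow) of two such measures, phi and Cramér's V, computed from the chi-square statistic and the sample size with the same made-up table:

```python
# Phi and Cramer's V, both derived from the chi-square statistic and sample size.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[28, 48],
                     [10, 114]])
chi_sq = chi2_contingency(observed, correction=False)[0]
n = observed.sum()
r, c = observed.shape

phi = np.sqrt(chi_sq / n)                             # intended for 2x2 tables
cramers_v = np.sqrt(chi_sq / (n * (min(r, c) - 1)))   # ranges from 0 to 1 for any table
print(phi, cramers_v)
```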
The chi-square test as a linear model
The chi-square test can be conceptualized as a general linear model if we use log values.
Y_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + \ldots
Everything is the same as in factorial design except that we deal with log-transformed values.
Saturated model: there is no error because the various combinations of coding variables completely explain the observed values.
The chi-square test looks at whether two variables are independent; therefore it has no interest in their combined effect, only in their main effects.
Chi-square can be thought of as a linear model in which the beta values tell us something about the relative differences in frequencies across categories of our two variables.
Often we want to analyse more complex contingency tables in which there are three or more variables.
This has to be analysed with a loglinear analysis.
\ln(O_{ijk}) = b_0 + b_1 A_i + b_2 B_j + b_3 C_k + b_4 AB_{ij} + b_5 AC_{ik} + b_6 BC_{jk} + b_7 ABC_{ijk} + \ln(\varepsilon_{ijk})
When our outcome is categorical and we include all the available terms (main effects and interactions) we get no error: our predictors perfectly predict the outcome (the expected values). The model is saturated.
Loglinear analysis typically works on a principle of backward elimination.
We begin with the saturated model, remove a predictor from the model, re-estimate the model and use it to predict our outcome and see how well it fits the data.
If the fit is barely affected, we conclude that the term we removed was not having a significant impact on the ability of our model to predict the observed outcome.
We don’t remove terms randomly, we do it hierarchically.
We start with the saturated model, remove the highest-order interaction, and assess the effect this has. If removing the highest-order interaction has no substantial impact on the model, we get rid of it and move on to remove the next highest-order interactions.
We carry on until we find an effect whose removal does affect the fit of the model.
The likelihood ratio statistic is used to assess each model.
This equation can be adapted to fit any model: the observed values are the same throughout, and the model frequencies are the expected frequencies from the model being tested.
For the saturated model, this statistic will always be 0 (because the observed and model frequencies are the same, so the ratio of observed to model frequencies is 1, and ln(1) = 0).
In other situations it provides a measure of how well the model fits the observed frequencies.
To test whether a new model has changed the likelihood ratio, we take the likelihood ratio for a model and subtract from it the likelihood statistic for the previous model (provided the models are hierarchically structured):
L\chi^2_{\text{change}} = L\chi^2_{\text{current model}} - L\chi^2_{\text{previous model}}
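The book runs this analysis in SPSS; purely as an illustrative sketch, a loglinear model can be fitted in Python as a Poisson GLM on the cell counts of a made-up 2x2x2 table (the deviance of such a model is its likelihood ratio statistic), using statsmodels rather than anything from the original text:

```python
# Backward elimination sketch: compare the saturated loglinear model with one
# that drops the three-way interaction, and test the change in deviance.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scipy.stats import chi2

# One row per cell of a hypothetical three-way contingency table
cells = pd.DataFrame({
    "A":     ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2"],
    "B":     ["b1", "b1", "b2", "b2", "b1", "b1", "b2", "b2"],
    "C":     ["c1", "c2", "c1", "c2", "c1", "c2", "c1", "c2"],
    "count": [28,    48,   10,  114,   35,   19,   50,   23],
})

saturated = smf.glm("count ~ A * B * C", data=cells,
                    family=sm.families.Poisson()).fit()
reduced = smf.glm("count ~ (A + B + C) ** 2", data=cells,   # drop only the A:B:C term
                  family=sm.families.Poisson()).fit()

# L chi-square change = deviance of the reduced model minus that of the
# saturated model (which is 0), tested on the difference in residual df
l_chi_change = reduced.deviance - saturated.deviance
df_change = reduced.df_resid - saturated.df_resid
print(l_chi_change, chi2.sf(l_chi_change, df_change))
```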
The chi-square test has two important assumptions:
Independence
The general linear model makes an assumption about the independence of residuals, and the chi-square test, being a linear model of sorts, is no exception.
For the chi-square test to be meaningful each person, item, or entity must contribute to only one cell of the contingency table.
You cannot use a chi-square test on a repeated-measures design.
Expected frequencies
With 2x2 contingency tables, no expected values should be below 5.
In larger tables, and when looking at associations between three or more categorical variables, the rule is that all expected counts should be greater than 1 and no more than 20% of expected counts should be less than 5.
If this assumption is broken, the result is a radical reduction in test power.
In terms of remedies, if you're looking at associations between only two variables then consider using Fisher's exact test.
With three or more variables your options are to: collapse the data across one of the variables, collapse levels of one of the variables, collect more data, or accept the loss of power.
If you want to collapse data across one of the variables then the highest-order interaction should be non-significant, and at least one of the lower-order interaction terms involving the variable to be deleted should also be non-significant.
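As a sketch of checking the expected-frequency rule with made-up counts (not part of the book's SPSS workflow):

```python
# Check the rule for larger tables: every expected count above 1,
# and no more than 20% of expected counts below 5.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[12,  3,  5],
                     [ 9,  4, 11],
                     [ 2,  6,  7]])
expected = chi2_contingency(observed, correction=False)[3]

print((expected > 1).all())              # every expected count above 1?
print((expected < 5).mean() <= 0.20)     # at most 20% of expected counts below 5?
```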
More doom and gloom
Not an assumption.
Proportionately small differences in cell frequencies can result in statistically significant associations between variables if the sample is large enough.
Therefore, we must look at row and column percentages to interpret the significant effects that we get. These percentages will reflect the patterns of data far better than the frequencies themselves.
The contingency table contains the number of cases that fall into each combination of categories.
The test compares the proportion and not the counts themselves.
If columns have different subscripts (like a and b), that means that they are significantly different.
Using standardized residuals
In a 2x2 contingency table the nature of a significant association can be clear from just the cell percentages or counts. In larger contingency tables this may not be the case, and you need a finer-grained investigation of the contingency table.
You can look at the standardized residual.
Standardized residual: \frac{\text{observed}_{ij} - \text{model}_{ij}}{\sqrt{\text{model}_{ij}}}
Two important things about standardized residuals: (1) they are z-scores, so an absolute value greater than 1.96 is significant at p < .05 and greater than 2.58 at p < .01; (2) the squared standardized residuals sum to the chi-square statistic, so each one shows how much a cell contributes to the overall association.
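A minimal sketch of the standardized residuals for the earlier hypothetical table, showing that their squares sum back to the chi-square statistic:

```python
# Standardized residuals per cell; values beyond roughly +/-1.96 flag the
# cells driving a significant association.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[28, 48],
                     [10, 114]], dtype=float)
expected = chi2_contingency(observed, correction=False)[3]

std_resid = (observed - expected) / np.sqrt(expected)
print(std_resid)
print((std_resid ** 2).sum())   # the squared residuals sum to the chi-square statistic
```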
Reporting the results of a chi-square test
When reporting Pearson’s chi-square we report the value of the test statistic with its associated degrees of freedom and the significance value.
The test statistic is X2.
For example:
χ²(1) = 25.36, p < .001
The output contains three tables.
These statistics are testing the hypothesis that the frequencies predicted by the model are significantly different from the observed frequencies in the data.
If our model is a good fit of the data then the observed and expected frequencies should be very similar.
A significant result means that our model predictions are significantly different from our data.
The second output tells us about the effects of removing parts of the model. It is labelled K-Way or Higher-Order effects.
It shows the likelihood ratio and Pearson’s chi-square statistics when K= 1, 2 and 3.
The parameter estimates output
Tests each effect in the model with a z-score, and gives us confidence intervals.
For loglinear analysis, report the likelihood ratio statistic for the final model, usually denoted just as χ².
For any terms that are significant, you should report the chi-square change, or you could consider reporting the z-score for the effect and its associated confidence interval.
If you break down any higher-order interactions in subsequent analyses then you need to report the relevant chi-square statistics (and odds ratios).