Chi-squared test for the relationship between two categorical variables - overview

This page offers structured overviews of one or more selected methods.

Chi-squared test for the relationship between two categorical variables
$z$ test for a single proportion
Independent/column variable | Independent variable
One categorical with $I$ independent groups ($I \geqslant 2$) | None
Dependent/row variable | Dependent variable
One categorical with $J$ independent groups ($J \geqslant 2$) | One categorical with 2 independent groups
Null hypothesis | Null hypothesis
  • There is no association between the row and column variables
    More precise statement:
    • If there are $I$ independent random samples of size $n_i$ from each of $I$ populations, defined by the independent variable:
      The distribution of the dependent variable is the same in each of the $I$ populations
    • If there is one random sample of size $N$ from the total population:
      The row and column variables are independent
$\pi = \pi_0$
$\pi$ is the population proportion of "successes"; $\pi_0$ is the population proportion of successes according to the null hypothesis
Alternative hypothesis | Alternative hypothesis
  • There is an association between the row and column variables
    More precise statement:
    • If there are $I$ independent random samples of size $n_i$ from each of $I$ populations, defined by the independent variable:
      The distribution of the dependent variable is not the same in all of the $I$ populations
    • If there is one random sample of size $N$ from the total population:
      The row and column variables are dependent
Two sided: $\pi \neq \pi_0$
Right sided: $\pi > \pi_0$
Left sided: $\pi < \pi_0$
Assumptions | Assumptions
  • Sample size is large enough for $X^2$ to be approximately chi-squared distributed under the null hypothesis. Rule of thumb:
    • 2 $\times$ 2 table: all four expected cell counts are 5 or more
    • Larger than 2 $\times$ 2 tables: average of the expected cell counts is 5 or more, smallest expected cell count is 1 or more
  • There are $I$ independent simple random samples from each of $I$ populations defined by the independent variable, or there is one simple random sample from the total population
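The rule of thumb for the expected cell counts can be checked mechanically. The following is a minimal sketch using only the Python standard library; the function names and the example table are hypothetical and serve only as illustration.

```python
def expected_counts(observed):
    """Expected cell counts under H0: row total * column total / total sample size."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def rule_of_thumb_ok(observed):
    """Return True if the expected counts satisfy the rule of thumb."""
    cells = [e for row in expected_counts(observed) for e in row]
    if len(observed) == 2 and len(observed[0]) == 2:
        # 2 x 2 table: all four expected cell counts are 5 or more
        return all(e >= 5 for e in cells)
    # Larger tables: average expected count is 5 or more, smallest is 1 or more
    return sum(cells) / len(cells) >= 5 and min(cells) >= 1


# Hypothetical 3 x 2 table of observed counts (made-up data)
table = [[20, 30], [40, 35], [15, 10]]
print(rule_of_thumb_ok(table))
```

Note that the check uses the expected counts, not the observed counts, so a table with a small observed cell can still satisfy the rule.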
  • Sample size is large enough for $z$ to be approximately normally distributed. Rule of thumb:
    • Significance test: $N \times \pi_0$ and $N \times (1 - \pi_0)$ are each larger than 10
    • Regular (large sample) 90%, 95%, or 99% confidence interval: number of successes and number of failures in sample are each 15 or more
    • Plus four 90%, 95%, or 99% confidence interval: total sample size is 10 or more
  • Sample is a simple random sample from the population. That is, observations are independent of one another
If the sample size is too small for $z$ to be approximately normally distributed, the binomial test for a single proportion should be used.
Test statistic | Test statistic
$X^2 = \sum{\frac{(\mbox{observed cell count} - \mbox{expected cell count})^2}{\mbox{expected cell count}}}$
where for each cell, the expected cell count = $\dfrac{\mbox{row total} \times \mbox{column total}}{\mbox{total sample size}}$, the observed cell count is the observed sample count in that same cell, and the sum is over all $I \times J$ cells
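As an illustration of the formula for $X^2$, here is a small sketch in plain Python. The function name and the table of observed counts are assumptions made for the example, not part of the method description.

```python
def chi_squared_statistic(observed):
    """Return (X^2, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    x2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # expected cell count = row total * column total / total sample size
            expected = row_totals[i] * col_totals[j] / total
            x2 += (obs - expected) ** 2 / expected

    # (number of rows - 1) * (number of columns - 1)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df


# Hypothetical 3 x 2 table of observed counts (made-up data)
table = [[20, 30], [40, 35], [15, 10]]
x2, df = chi_squared_statistic(table)
print(f"X^2 = {x2:.3f}, df = {df}")
```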
$z = \dfrac{p - \pi_0}{\sqrt{\dfrac{\pi_0(1 - \pi_0)}{N}}}$
$p$ is the sample proportion of successes: $\dfrac{X}{N}$, $N$ is the sample size
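The $z$ statistic and its $p$ values can be sketched with the Python standard library (`statistics.NormalDist`). The function name and the data (30 successes in a sample of 100, $\pi_0 = .2$) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def z_test_single_proportion(successes, n, pi0):
    """Return the z statistic and the two sided, right sided and left sided p values."""
    p = successes / n  # sample proportion of successes
    z = (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

    p_left = NormalDist().cdf(z)  # area under the standard normal curve left of z
    p_right = 1 - p_left
    p_two = 2 * min(p_left, p_right)
    return z, {"two sided": p_two, "right sided": p_right, "left sided": p_left}


# Hypothetical data: 30 successes in a sample of 100, tested against pi0 = .2
z, p_values = z_test_single_proportion(30, 100, 0.2)
print(f"z = {z:.3f}")
for side, value in p_values.items():
    print(f"{side}: p = {value:.4f}")
```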
Sampling distribution of $X^2$ if H0 were true | Sampling distribution of $z$ if H0 were true
Approximately a chi-squared distribution with $(I - 1) \times (J - 1)$ degrees of freedom | Approximately standard normal
Significant? | Significant?
  • Check if $X^2$ observed in sample is equal to or larger than critical value $X^{2*}$ or
  • Find $p$ value corresponding to observed $X^2$ and check if it is equal to or smaller than $\alpha$
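The $p$ value route can be sketched without a statistics package. The helper below is an assumption of this example: it computes the upper-tail chi-squared probability from the closed forms for 1 and 2 degrees of freedom plus a standard recurrence, and it only handles positive integer degrees of freedom. In practice a statistics library would normally be used instead. The observed $X^2$, degrees of freedom and $\alpha$ are hypothetical values.

```python
from math import erfc, exp, gamma, sqrt


def chi2_upper_tail(x, df):
    """P(chi-squared with df degrees of freedom >= x), for positive integer df."""
    if x <= 0:
        return 1.0
    half = x / 2
    # Closed forms: df = 1 gives erfc(sqrt(x/2)), df = 2 gives exp(-x/2)
    k = 1 if df % 2 else 2
    q = erfc(sqrt(half)) if k == 1 else exp(-half)
    # Recurrence: Q(x | k + 2) = Q(x | k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += half ** (k / 2) * exp(-half) / gamma(k / 2 + 1)
        k += 2
    return q


# Hypothetical values: observed X^2, degrees of freedom and significance level
observed_x2, df, alpha = 10 / 3, 2, 0.05
p_value = chi2_upper_tail(observed_x2, df)
significant = p_value <= alpha
print(f"p = {p_value:.4f}, significant: {significant}")
```

Only the upper tail is used, because large values of $X^2$ are the ones that indicate a departure from the null hypothesis.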
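Two sided:
  • Check if $z$ observed in sample is at least as extreme as critical value $z^*$ or
  • Find two sided $p$ value corresponding to observed $z$ and check if it is equal to or smaller than $\alpha$
Right sided:
  • Check if $z$ observed in sample is equal to or larger than critical value $z^*$ or
  • Find right sided $p$ value corresponding to observed $z$ and check if it is equal to or smaller than $\alpha$
Left sided:
  • Check if $z$ observed in sample is equal to or smaller than critical value $z^*$ or
  • Find left sided $p$ value corresponding to observed $z$ and check if it is equal to or smaller than $\alpha$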
n.a. | Approximate $C\%$ confidence interval for $\pi$
-
Regular (large sample):
  • $p \pm z^* \times \sqrt{\dfrac{p(1 - p)}{N}}$
    where $z^*$ is the value under the normal curve with the area $C / 100$ between $-z^*$ and $z^*$ (e.g. $z^*$ = 1.96 for a 95% confidence interval)
With plus four method:
  • $p_{plus} \pm z^* \times \sqrt{\dfrac{p_{plus}(1 - p_{plus})}{N + 4}}$
    where $p_{plus} = \dfrac{X + 2}{N + 4}$ and $z^*$ is the value under the normal curve with the area $C / 100$ between $-z^*$ and $z^*$ (e.g. $z^*$ = 1.96 for a 95% confidence interval)
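Both interval formulas can be sketched as follows, again with the Python standard library only. The function names and the data (30 successes in a sample of 100) are hypothetical; $z^*$ is obtained from the inverse normal cumulative distribution function.

```python
from math import sqrt
from statistics import NormalDist


def z_star(confidence):
    """Value z* with area `confidence` under the normal curve between -z* and z*."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def regular_ci(successes, n, confidence=0.95):
    """Regular (large sample) confidence interval for the population proportion."""
    p = successes / n
    margin = z_star(confidence) * sqrt(p * (1 - p) / n)
    return p - margin, p + margin


def plus_four_ci(successes, n, confidence=0.95):
    """Plus four interval: add 2 successes and 2 failures before computing."""
    p_plus = (successes + 2) / (n + 4)
    margin = z_star(confidence) * sqrt(p_plus * (1 - p_plus) / (n + 4))
    return p_plus - margin, p_plus + margin


# Hypothetical data: 30 successes in a sample of 100
print("regular:  ", regular_ci(30, 100))
print("plus four:", plus_four_ci(30, 100))
```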
n.a. | Equivalent to
-
  • When testing two sided: goodness of fit test, with categorical variable with 2 levels
  • When $N$ is large, the $p$ value from the $z$ test for a single proportion approaches the $p$ value from the binomial test for a single proportion. The $z$ test for a single proportion is just a large sample approximation of the binomial test for a single proportion.
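Both statements can be explored numerically. The sketch below, with hypothetical function names and data, shows that $z^2$ equals the goodness of fit $X^2$ for a variable with 2 levels, and computes an exact right sided binomial $p$ value to set beside the normal approximation.

```python
from math import comb, sqrt
from statistics import NormalDist


def z_statistic(successes, n, pi0):
    return (successes / n - pi0) / sqrt(pi0 * (1 - pi0) / n)


def goodness_of_fit_x2(successes, n, pi0):
    """X^2 for a categorical variable with 2 levels (successes and failures)."""
    observed = [successes, n - successes]
    expected = [n * pi0, n * (1 - pi0)]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def binomial_right_sided_p(successes, n, pi0):
    """Exact P(X >= successes) when X is binomial with parameters n and pi0."""
    return sum(comb(n, k) * pi0**k * (1 - pi0) ** (n - k)
               for k in range(successes, n + 1))


# Hypothetical data: 30 successes in a sample of 100, pi0 = .2
z = z_statistic(30, 100, 0.2)
x2 = goodness_of_fit_x2(30, 100, 0.2)
print(f"z^2 = {z**2:.4f}, X^2 = {x2:.4f}")

approx_p = 1 - NormalDist().cdf(z)             # right sided p value from the z test
exact_p = binomial_right_sided_p(30, 100, 0.2)  # right sided p value from the binomial test
print(f"normal approximation: {approx_p:.4f}, exact binomial: {exact_p:.4f}")
```

Because $z^2 = X^2$ and a squared standard normal variable follows a chi-squared distribution with 1 degree of freedom, the two sided $z$ test and the goodness of fit test give the same $p$ value.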
Example context | Example context
Is there an association between economic class and gender? Is the distribution of economic class different between men and women? | Is the proportion of smokers amongst office workers different from $\pi_0 = .2$? Use the normal approximation for the sampling distribution of the test statistic.
SPSS | SPSS
Analyze > Descriptive Statistics > Crosstabs...
  • Put one of your two categorical variables in the box below Row(s), and the other categorical variable in the box below Column(s)
  • Click the Statistics... button, and click on the square in front of Chi-square
  • Continue and click OK
Analyze > Nonparametric Tests > Legacy Dialogs > Binomial...
  • Put your dichotomous variable in the box below Test Variable List
  • Fill in the value for $\pi_0$ in the box next to Test Proportion
If computation time allows, SPSS will give you the exact $p$ value based on the binomial distribution, rather than the approximate $p$ value based on the normal distribution
Jamovi | Jamovi
Frequencies > Independent Samples - $\chi^2$ test of association
  • Put one of your two categorical variables in the box below Rows, and the other categorical variable in the box below Columns
Frequencies > 2 Outcomes - Binomial test
  • Put your dichotomous variable in the white box at the right
  • Fill in the value for $\pi_0$ in the box next to Test value
  • Under Hypothesis, select your alternative hypothesis
Jamovi will give you the exact $p$ value based on the binomial distribution, rather than the approximate $p$ value based on the normal distribution