##### Chi-squared test for the relationship between two categorical variables: sampling distribution of the chi-squared statistic

Definition of the sampling distribution of the chi-squared statistic

##### Sampling distribution of $X^2$:

When we perform a chi-squared test for the relationship between two categorical variables, we compute the chi-squared statistic $$X^2 = \sum{\frac{(\mbox{observed cell count} - \mbox{expected cell count})^2}{\mbox{expected cell count}}}$$ from our sample data, where the sum runs over all cells of the contingency table. Now suppose that we drew many more samples from the same population. Specifically, suppose that we drew an infinite number of samples, each of the same size. For each sample, we could compute $X^2$ in the same way. Different samples would give different values of $X^2$. The distribution of all these $X^2$ values is the sampling distribution of the chi-squared statistic $X^2$. Note that this sampling distribution is purely hypothetical: we would never really draw an infinite number of samples, but hypothetically, we could.
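The idea of repeated sampling can be imitated on a computer by drawing a large (finite) number of samples instead of an infinite number. The sketch below is only an illustration: the population cell probabilities in `cell_probs`, the sample size of 200, and the 2000 repetitions are all made-up values, and the helper names `chi_squared_statistic` and `draw_table` are hypothetical.

```python
import random


def chi_squared_statistic(observed):
    """Return X^2 for a two-way table of observed counts (a list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells in an empty row or column
                x2 += (obs - expected) ** 2 / expected
    return x2


def draw_table(cell_probs, n, rng):
    """Draw one sample of size n from the population and cross-tabulate it."""
    n_rows, n_cols = len(cell_probs), len(cell_probs[0])
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    weights = [cell_probs[i][j] for i, j in cells]
    table = [[0] * n_cols for _ in range(n_rows)]
    for i, j in rng.choices(cells, weights=weights, k=n):
        table[i][j] += 1
    return table


# Hypothetical population: joint probabilities for a 2 x 3 table (assumed values).
cell_probs = [[0.20, 0.15, 0.15],
              [0.10, 0.15, 0.25]]

rng = random.Random(1)  # fixed seed so the run is repeatable
x2_values = [chi_squared_statistic(draw_table(cell_probs, 200, rng))
             for _ in range(2000)]

# Every sample gives its own X^2 value; together they approximate
# the sampling distribution of X^2 for this population and sample size.
print("first five X^2 values:", [round(v, 2) for v in x2_values[:5]])
print("smallest and largest:", round(min(x2_values), 2), round(max(x2_values), 2))
```

A histogram of `x2_values` would give a rough picture of the sampling distribution; the more samples we draw, the closer that picture comes to the hypothetical distribution described above.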

##### Sampling distribution of $X^2$ if H0 were true:

Suppose that the assumptions of the chi-squared test hold, and that the null hypothesis of no association between the two variables in the population is true. Then the sampling distribution of $X^2$ is approximately the chi-squared distribution with $(I - 1) \times (J - 1)$ degrees of freedom, where $I$ is the number of rows and $J$ the number of columns of the contingency table (that is, the number of categories of each of the two variables). Under this distribution, we would find relatively small $X^2$ values most of the time, and large $X^2$ values only occasionally. If the $X^2$ value in our actual sample is very large, that would be a rare event if the null hypothesis were true, and it is therefore considered evidence against the null hypothesis ($X^2$ value in the rejection region, small $p$ value).
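The approximation can be checked with a small simulation in which the null hypothesis is true by construction. The sketch below is an illustration under assumed values: the marginal probabilities `row_probs` and `col_probs`, the sample size of 200, and the 5000 repetitions are invented, and the helper names are hypothetical. It uses a $2 \times 3$ table, so $(I - 1) \times (J - 1) = 2$. For 2 degrees of freedom the chi-squared distribution has the simple tail probability $P(X^2 > x) = e^{-x/2}$, which gives the critical value without any statistics library.

```python
import math
import random


def chi_squared_statistic(observed):
    """Return X^2 for a two-way table of observed counts (a list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip cells in an empty row or column
                x2 += (obs - expected) ** 2 / expected
    return x2


def draw_table_under_h0(row_probs, col_probs, n, rng):
    """Draw n units whose two variables are independent, then cross-tabulate."""
    rows = rng.choices(range(len(row_probs)), weights=row_probs, k=n)
    cols = rng.choices(range(len(col_probs)), weights=col_probs, k=n)
    table = [[0] * len(col_probs) for _ in row_probs]
    for i, j in zip(rows, cols):
        table[i][j] += 1
    return table


# Assumed population margins; independence means H0 holds in this population.
row_probs = [0.4, 0.6]
col_probs = [0.2, 0.3, 0.5]
df = (len(row_probs) - 1) * (len(col_probs) - 1)  # (I - 1) x (J - 1) = 2

rng = random.Random(2024)  # fixed seed so the run is repeatable
n_samples = 5000
x2_values = [chi_squared_statistic(draw_table_under_h0(row_probs, col_probs, 200, rng))
             for _ in range(n_samples)]

# With 2 df, P(X^2 > x) = exp(-x / 2), so the critical value is -2 * ln(alpha).
alpha = 0.05
critical_value = -2 * math.log(alpha)

mean_x2 = sum(x2_values) / n_samples
rejection_rate = sum(x2 > critical_value for x2 in x2_values) / n_samples

# A chi-squared distribution has mean equal to its degrees of freedom.
print("mean of simulated X^2:", round(mean_x2, 3), "versus df =", df)
print("share above critical value:", round(rejection_rate, 3), "versus alpha =", alpha)
```

If the approximation holds, the mean of the simulated $X^2$ values should be close to the degrees of freedom, and roughly 5% of the samples should land in the rejection region even though the null hypothesis is true. Those are exactly the rare large values described above.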