# Two sample $t$ test: sampling distribution of the difference between two sample means, and its standard error

Definition of the sampling distribution of the difference between two sample means $\bar{y}_1 - \bar{y}_2$, and its standard error

## Sampling distribution of the difference between two sample means $\bar{y}_1 - \bar{y}_2$:

When we draw a sample of size $n_1$ from population 1, and a sample of size $n_2$ from population 2, we can compute the mean of a variable $y$ in sample 1 and in sample 2, and then compute the difference between the two sample means: $\bar{y}_1 - \bar{y}_2$. Now suppose that we would repeat these steps many times. Specifically, suppose that we would draw an infinite number of group 1 and group 2 samples, each time of size $n_1$ and $n_2$. Each time we have a group 1 and group 2 sample, we could compute the difference between the two sample means: $\bar{y}_1 - \bar{y}_2$. Different samples will give different sample means and differences. The distribution of all these differences $\bar{y}_1 - \bar{y}_2$ is the sampling distribution of $\bar{y}_1 - \bar{y}_2$. Note that this sampling distribution is purely hypothetical. We will never really draw an infinite number of group 1 and group 2 samples, but hypothetically, we could.
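The repeated-sampling idea can be approximated in a short simulation, with a large finite number of repetitions standing in for the hypothetical infinite number. The population parameters ($\mu_1 = 10$, $\mu_2 = 8$, $\sigma = 3$) and sample sizes below are made-up illustrative values:

```python
import random
import statistics

random.seed(1)
mu1, mu2, sigma = 10.0, 8.0, 3.0  # assumed (made-up) population parameters
n1, n2 = 25, 40                   # sample sizes
reps = 20000                      # stand-in for "infinitely many" samples

diffs = []
for _ in range(reps):
    # Draw one group 1 sample and one group 2 sample, both from
    # normal populations, and record the difference in sample means
    sample1 = [random.gauss(mu1, sigma) for _ in range(n1)]
    sample2 = [random.gauss(mu2, sigma) for _ in range(n2)]
    diffs.append(statistics.mean(sample1) - statistics.mean(sample2))

# The collection `diffs` approximates the sampling distribution of
# ybar1 - ybar2: it centers near mu1 - mu2 = 2, with spread close to
# sigma * sqrt(1/n1 + 1/n2)
print(statistics.mean(diffs))
print(statistics.stdev(diffs))
```

Plotting a histogram of `diffs` would show the bell shape of this sampling distribution.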

## Standard error:

Suppose that the assumptions of the two sample $t$ test (assuming equal population variances) hold:
• Within population 1, the variable $y$ is normally distributed with mean $\mu_1$ and standard deviation $\sigma_1$; within population 2, the variable $y$ is normally distributed with mean $\mu_2$ and standard deviation $\sigma_2$
• The population standard deviations $\sigma_1$ and $\sigma_2$ are the same: $\sigma_1 = \sigma_2 = \sigma$
• Group 1 sample is a simple random sample (SRS) from population 1, group 2 sample is an independent SRS from population 2. That is, within and between groups, observations are independent of one another
Then the sampling distribution of $\bar{y}_1 - \bar{y}_2$ is normal with mean $\mu_1 - \mu_2$ and standard deviation $\sqrt{\frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}} = \sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$. Since the two sample $t$ test does not make the assumption that the value of $\sigma$ is known (like the $z$ test does), we need to:
• estimate $\sigma$ with $s_p$: the pooled standard deviation, computed from the sample standard deviations $s_1$ and $s_2$ as $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$. That is, the two sample standard deviations are combined into a single estimate $s_p$ of $\sigma$
• estimate $\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$ with $s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
That is, we estimate the standard deviation of the sampling distribution of $\bar{y}_1 - \bar{y}_2$, $\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$, with $s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$. We call this estimated standard deviation of the sampling distribution of $\bar{y}_1 - \bar{y}_2$ the standard error of $\bar{y}_1 - \bar{y}_2$.
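The two estimation steps above can be carried out with a few lines of code. The sample summaries below are made-up illustrative numbers, and the pooled standard deviation uses the standard formula $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$:

```python
import math

# Made-up sample summaries for illustration
n1, n2 = 25, 40
s1, s2 = 2.8, 3.1

# Step 1: pool the two sample standard deviations into a single
# estimate of sigma, weighting each variance by its degrees of freedom
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

# Step 2: the standard error of ybar1 - ybar2 replaces sigma with sp
se = sp * math.sqrt(1 / n1 + 1 / n2)

print(sp)  # pooled standard deviation
print(se)  # standard error of ybar1 - ybar2
```

Note that $s_p$ always lies between $s_1$ and $s_2$, closer to the standard deviation of the larger sample.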

Note that the $t$ statistic $t = \frac{(\bar{y}_1 - \bar{y}_2) - 0}{s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$ thus indicates how many standard errors the observed difference $\bar{y}_1 - \bar{y}_2$ lies from 0: the value of $\mu_1 - \mu_2$ under $H_0$.
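Putting the pieces together, a worked numeric example of the $t$ statistic (all sample summaries below are made-up illustrative values):

```python
import math

# Made-up sample summaries for illustration
n1, n2 = 25, 40
ybar1, ybar2 = 10.4, 8.9
s1, s2 = 2.8, 3.1

# Pooled standard deviation and standard error of ybar1 - ybar2
sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
se = sp * math.sqrt(1 / n1 + 1 / n2)

# t counts how many standard errors the observed difference lies
# from 0, the value of mu1 - mu2 under H0
t = ((ybar1 - ybar2) - 0) / se
print(t)
```

With these numbers the observed difference of $1.5$ is about two standard errors away from 0; the resulting $t$ would be compared against a $t$ distribution with $n_1 + n_2 - 2$ degrees of freedom.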