Session 12 — Chi-Square Test
Decision Making Statistics — S04
This session presents the Chi-square test of independence, used to test whether two categorical variables observed on the same sample are linked.
1 Introduction
2 Chi-square test formulation
2.1 Setup
- Variable \(X\) has \(p\) categories (modalities): \(A_1,A_2,\ldots,A_p\)
- Variable \(Y\) has \(q\) categories (modalities): \(B_1,B_2,\ldots,B_q\)
- A sample of size \(n\) is organized into a contingency table of dimension \(p \times q\)
2.2 Hypotheses
\[ H_0: X \text{ and } Y \text{ are independent} \qquad \text{vs.} \qquad H_1: X \text{ and } Y \text{ are dependent} \]
2.3 Notations for the contingency table
- \(n_{i,j}\): number of individuals simultaneously in category \(A_i\) and category \(B_j\),
- \(n_{i,.}=\sum_{j=1}^{q} n_{i,j}\): row total of row \(i\),
- \(n_{.,j}=\sum_{i=1}^{p} n_{i,j}\): column total of column \(j\),
- \(n=\sum_{i=1}^{p}\sum_{j=1}^{q} n_{i,j}\): grand total.
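To make the notation concrete, the marginal totals can be computed directly from a table stored as a list of rows. The following is a minimal Python sketch; the variable names are illustrative choices, and the counts are those of the coffee exercise (Section 5.1).

```python
# Observed counts n_{i,j}: one inner list per row A_i, one entry per column B_j.
# Counts taken from the coffee exercise (Section 5.1).
observed = [
    [30, 40, 50, 20],   # Single
    [50, 60, 80, 30],   # Married
    [20, 30, 40, 15],   # Other
]

row_totals = [sum(row) for row in observed]        # n_{i,.}
col_totals = [sum(col) for col in zip(*observed)]  # n_{.,j}
n = sum(row_totals)                                # grand total

print(row_totals)  # [140, 220, 105]
print(col_totals)  # [100, 130, 170, 65]
print(n)           # 465
```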
3 Expected counts and test statistic
3.1 Expected counts under independence
If \(H_0\) is true, the expected (theoretical) count in cell \((i,j)\) is:
\[ E_{i,j}=\frac{n_{i,.}\times n_{.,j}}{n} \]
The standard Chi-square test is considered valid when all expected counts satisfy:
\[ E_{i,j} \geq 5 \]
If this condition is not met, some categories may need to be grouped.
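The formula and the validity condition can be sketched in a few lines of Python. The function names `expected_counts` and `is_valid` are hypothetical helpers introduced here for illustration; the counts come from the education and job satisfaction exercise (Section 5.3).

```python
def expected_counts(observed):
    """Return the table of expected counts E_{i,j} = n_{i,.} * n_{.,j} / n."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def is_valid(expected, threshold=5):
    """Check the validity condition: every expected count is >= threshold."""
    return all(e >= threshold for row in expected for e in row)


# Counts from the education / job satisfaction exercise (Section 5.3).
observed = [
    [25, 30, 10],   # No degree
    [15, 45, 40],   # Bachelor
    [5, 20, 55],    # Master or above
]

expected = expected_counts(observed)
for row in expected:
    print([round(e, 2) for e in row])
print("validity condition met:", is_valid(expected))
```

Each row of expected counts sums to the corresponding row total, which is a convenient check when the computation is done by hand.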
3.2 Test statistic
\[ U_{obs}=\sum_{i=1}^{p}\sum_{j=1}^{q}\frac{(n_{i,j}-E_{i,j})^2}{E_{i,j}} \]
Under \(H_0\), the statistic approximately follows a Chi-square distribution with:
\[ \nu=(p-1)(q-1) \]
degrees of freedom.
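As an illustration, the statistic and its degrees of freedom can be computed as follows. This is a sketch under the assumption that the table is given as a list of rows; `chi_square_statistic` is a hypothetical helper name, and the data are the counts of the coffee exercise (Section 5.1).

```python
def chi_square_statistic(observed):
    """Return (U_obs, nu) for a p x q table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    u_obs = 0.0
    for i, row in enumerate(observed):
        for j, n_ij in enumerate(row):
            e_ij = row_totals[i] * col_totals[j] / n   # expected count under H0
            u_obs += (n_ij - e_ij) ** 2 / e_ij

    nu = (len(row_totals) - 1) * (len(col_totals) - 1)  # (p - 1)(q - 1)
    return u_obs, nu


# Counts from the coffee exercise (Section 5.1).
observed = [
    [30, 40, 50, 20],
    [50, 60, 80, 30],
    [20, 30, 40, 15],
]

u_obs, nu = chi_square_statistic(observed)
print(f"U_obs = {u_obs:.3f}, nu = {nu}")  # U_obs = 0.650, nu = 6
```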
3.3 Decision rule
Reject \(H_0\) at risk \(\alpha\) if:
\[ U_{obs}>k \]
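where \(k=\chi^2_{\alpha}(\nu)\) is the critical value read in the Chi-square table, that is, the value exceeded with probability \(\alpha\) by a Chi-square variable with \(\nu\) degrees of freedom.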
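The critical value \(k\) is the point that a Chi-square variable with \(\nu\) degrees of freedom exceeds with probability \(\alpha\). When \(\nu\) is even, this tail probability has a standard closed form, \(e^{-x/2}\sum_{m=0}^{\nu/2-1}(x/2)^m/m!\), which is not part of the session but allows a quick check of the table values used in the exercises (all of them have even \(\nu\)). The sketch below assumes even \(\nu\) only; for odd \(\nu\), the table or a statistics library is needed. The function name is a hypothetical choice.

```python
import math


def chi2_upper_tail_even(x, nu):
    """P(Chi-square(nu) > x), using a closed form valid only for even nu."""
    if nu <= 0 or nu % 2 != 0:
        raise ValueError("this closed form requires a positive even nu")
    half = x / 2
    partial_sum = sum(half ** m / math.factorial(m) for m in range(nu // 2))
    return math.exp(-half) * partial_sum


# Critical values used in this session: (alpha, nu, k read in the table).
table_values = [
    (0.05, 6, 12.592),
    (0.05, 2, 5.991),
    (0.01, 2, 9.210),
    (0.05, 4, 9.488),
]

for alpha, nu, k in table_values:
    tail = chi2_upper_tail_even(k, nu)
    print(f"alpha={alpha}, nu={nu}, k={k}: tail probability = {tail:.4f}")
```

Each tail probability should come out within rounding distance of the stated \(\alpha\), which confirms that the table values were read on the correct line and column.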
In a Chi-square exercise, always proceed in this order:
- identify the two variables and their numbers of categories,
- compute row totals, column totals, and the grand total,
- compute the expected counts,
- check the validity condition,
- compute \(U_{obs}\) and compare it with the critical value.
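The steps above can be sketched as a single Python function. This is an illustrative outline rather than a reference implementation: `chi_square_test` and its parameter names are assumptions, and the critical value is still read in the Chi-square table and passed in by the user. The data are the counts of Exercise 2 (Section 5.2).

```python
def chi_square_test(observed, critical_value, min_expected=5):
    """Run the steps of the procedure on a table of counts (list of rows).

    critical_value is k, read in the Chi-square table for the chosen
    risk alpha and nu = (p - 1)(q - 1) degrees of freedom.
    """
    # Numbers of categories and marginal totals
    p, q = len(observed), len(observed[0])
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Expected counts under H0
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Validity condition: all expected counts >= min_expected
    valid = all(e >= min_expected for row in expected for e in row)

    # Statistic and decision
    u_obs = sum(
        (observed[i][j] - expected[i][j]) ** 2 / expected[i][j]
        for i in range(p)
        for j in range(q)
    )
    return {
        "nu": (p - 1) * (q - 1),
        "valid": valid,
        "u_obs": u_obs,
        "reject_h0": u_obs > critical_value,
    }


# Counts from Exercise 2 (Section 5.2), tested at the 5% and 1% risk levels.
observed = [
    [30, 50, 70],   # Under 35
    [45, 60, 45],   # 35 and over
]
at_5 = chi_square_test(observed, critical_value=5.991)
at_1 = chi_square_test(observed, critical_value=9.210)
print(f"U_obs = {at_5['u_obs']:.3f}, nu = {at_5['nu']}, valid = {at_5['valid']}")
print("reject H0 at 5%:", at_5["reject_h0"])
print("reject H0 at 1%:", at_1["reject_h0"])
```

Returning the validity flag alongside the decision keeps the check visible: a rejection obtained with `valid` equal to `False` should not be trusted without grouping categories first.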
4 Conclusion
- \(H_0\) rejected: the data confirm a link between the two variables at the chosen risk level \(\alpha\).
- \(H_0\) not rejected: the data do not confirm a link at the chosen risk level; this is not proof that the two variables are independent.
5 Application exercise
5.1 Exercise 1 — Coffee consumption and marital status
A marketing company selected a random sample of housewives (women under 50, assumed to be the main shopper in the household) to study the link between marital status and daily coffee consumption.
| Marital status | Less than 1 cup/day | 1–2 cups/day | 2–3 cups/day | More than 3 cups/day |
|---|---|---|---|---|
| Single | 30 | 40 | 50 | 20 |
| Married | 50 | 60 | 80 | 30 |
| Other | 20 | 30 | 40 | 15 |
Can we say, with a \(5\%\) risk, that there is a link between coffee consumption level and marital status?
Step 1 — Formulation
- \(X\) = marital status, so \(p=3\)
- \(Y\) = coffee consumption level, so \(q=4\)
- \(H_0\): independence
- \(H_1\): dependence
Step 2 — Marginal totals
Row totals:
\[ 140,\quad 220,\quad 105 \]
Column totals:
\[ 100,\quad 130,\quad 170,\quad 65 \]
Grand total:
\[ n=465 \]
Step 3 — Expected counts
Using
\[ E_{i,j}=\frac{n_{i,.}n_{.,j}}{n} \]
we obtain approximately:
| Marital status | Less than 1 | 1–2 | 2–3 | More than 3 |
|---|---|---|---|---|
| Single | 30.108 | 39.140 | 51.183 | 19.570 |
| Married | 47.312 | 61.505 | 80.430 | 30.753 |
| Other | 22.581 | 29.355 | 38.387 | 14.677 |
All expected counts are at least \(5\) (the smallest is \(14.677\)), so the validity condition is met.
Step 4 — Computed statistic
\[ U_{obs}=\sum \frac{(n_{i,j}-E_{i,j})^2}{E_{i,j}} \approx 0.650 \]
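For reference, the twelve cell contributions \((n_{i,j}-E_{i,j})^2/E_{i,j}\), computed from the unrounded expected counts and grouped by row (Single, Married, Other), are approximately:

\[ U_{obs}\approx(0.0004+0.0189+0.0273+0.0095)+(0.1527+0.0368+0.0023+0.0184)+(0.2949+0.0142+0.0678+0.0071)\approx 0.650 \]

Every contribution is small because each observed count is close to its expected count.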
Step 5 — Critical value
The number of degrees of freedom is:
\[ \nu=(3-1)(4-1)=6 \]
At the \(5\%\) level:
\[ k=\chi^2_{5\%}(6)=12.592 \]
Since:
\[ 0.650<12.592 \]
we do not reject \(H_0\).
Conclusion: with the counts given in the table above, the data do not confirm a link between coffee consumption level and marital status at the \(5\%\) risk level.
5.2 Exercise 2 — Purchase frequency by age group
A retail chain surveys a random sample of \(300\) loyalty-card holders to study the link between age group and purchase frequency:
| Age group | Never | Occasionally | Regularly |
|---|---|---|---|
| Under 35 | 30 | 50 | 70 |
| 35 and over | 45 | 60 | 45 |
Can we say, with a \(5\%\) risk, that there is a link between age group and purchase frequency? And at a \(1\%\) risk?
Step 1 — Formulation
- \(X\) = age group, so \(p=2\)
- \(Y\) = purchase frequency, so \(q=3\)
- \(H_0\): independence
- \(H_1\): dependence
- \(\nu=(2-1)(3-1)=2\)
Step 2 — Marginal totals
Row totals: \(150\), \(150\). Column totals: \(75\), \(110\), \(115\). Grand total: \(n=300\).
Step 3 — Expected counts
Since both row totals equal \(150\), each expected count equals \(n_{i,.}\times n_{.,j}/300 = 150\times n_{.,j}/300 = n_{.,j}/2\):
| Age group | Never | Occasionally | Regularly |
|---|---|---|---|
| Under 35 | 37.5 | 55.0 | 57.5 |
| 35 and over | 37.5 | 55.0 | 57.5 |
All expected counts are \(\geq 5\) ✓.
Step 4 — Computed statistic
\[ U_{obs}=\frac{(30-37.5)^2}{37.5}+\frac{(50-55)^2}{55}+\frac{(70-57.5)^2}{57.5}+\frac{(45-37.5)^2}{37.5}+\frac{(60-55)^2}{55}+\frac{(45-57.5)^2}{57.5} \]
\[ =1.500+0.455+2.717+1.500+0.455+2.717\approx 9.344 \]
Step 5 — Critical values
\[ \chi^2_{5\%}(2)=5.991 \qquad \text{and} \qquad \chi^2_{1\%}(2)=9.210 \]
Since \(9.344>5.991\) and \(9.344>9.210\), we reject \(H_0\) at both risk levels.
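Conclusion: at both the \(5\%\) and \(1\%\) risk levels, the data confirm a link between age group and purchase frequency. Customers under 35 purchase regularly more often than expected under independence (\(70\) observed against \(57.5\) expected), whereas customers aged 35 and over do so less often (\(45\) against \(57.5\)).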
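The test only indicates that the two age groups differ; the direction of the difference is read from the data. One simple descriptive way to do this, offered here as a complement and not as part of the test, is to compare the share of each purchase frequency within each age group. The sketch below uses the counts of this exercise; the variable names are illustrative.

```python
# Counts from Exercise 2 (Section 5.2).
labels = ["Never", "Occasionally", "Regularly"]
observed = {
    "Under 35": [30, 50, 70],
    "35 and over": [45, 60, 45],
}

# Share of each purchase frequency within each age group (row profile).
profiles = {
    group: [count / sum(counts) for count in counts]
    for group, counts in observed.items()
}

for group, shares in profiles.items():
    summary = ", ".join(f"{label}: {share:.1%}" for label, share in zip(labels, shares))
    print(f"{group}: {summary}")
```

In this sample, 46.7% of customers under 35 purchase regularly, against 30.0% of customers aged 35 and over.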
5.3 Exercise 3 — Education level and job satisfaction
An HR consultancy surveys \(245\) randomly selected employees to study the link between education level and job satisfaction:
| Education level | Low satisfaction | Medium satisfaction | High satisfaction |
|---|---|---|---|
| No degree | 25 | 30 | 10 |
| Bachelor | 15 | 45 | 40 |
| Master or above | 5 | 20 | 55 |
Can we say, with a \(5\%\) risk, that education level and job satisfaction are linked?
Step 1 — Formulation
- \(X\) = education level (\(p=3\)), \(Y\) = satisfaction level (\(q=3\))
- \(H_0\): independence, \(H_1\): dependence
- \(\nu=(3-1)(3-1)=4\)
Step 2 — Marginal totals
Row totals: \(65\), \(100\), \(80\). Column totals: \(45\), \(95\), \(105\). Grand total: \(n=245\).
Step 3 — Expected counts
\[ E_{i,j}=\frac{n_{i,.}\times n_{.,j}}{245} \]
| Education level | Low | Medium | High |
|---|---|---|---|
| No degree | 11.94 | 25.20 | 27.86 |
| Bachelor | 18.37 | 38.78 | 42.86 |
| Master+ | 14.69 | 31.02 | 34.29 |
All expected counts are \(\geq 5\) ✓.
Step 4 — Computed statistic
\[ U_{obs}=\frac{(25-11.94)^2}{11.94}+\frac{(30-25.20)^2}{25.20}+\frac{(10-27.86)^2}{27.86} +\frac{(15-18.37)^2}{18.37}+\frac{(45-38.78)^2}{38.78}+\frac{(40-42.86)^2}{42.86} \]
\[ +\frac{(5-14.69)^2}{14.69}+\frac{(20-31.02)^2}{31.02}+\frac{(55-34.29)^2}{34.29} \]
\[ \approx 14.29+0.91+11.45+0.62+1.00+0.19+6.40+3.91+12.51\approx 51.28 \]
Step 5 — Critical value
\[ \chi^2_{5\%}(4)=9.488 \]
Since \(51.28>9.488\), we reject \(H_0\).
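Conclusion: with a \(5\%\) risk, the data confirm a link between education level and job satisfaction; \(U_{obs}\) is far above the critical value. Employees with a Master's degree or above report high satisfaction much more often than expected under independence (\(55\) observed against \(34.29\) expected).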
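To see which cells drive this result, one can list, for each cell, the signed gap \(n_{i,j}-E_{i,j}\) and its contribution to \(U_{obs}\). The following Python sketch does this for the table of this exercise; the variable names are illustrative, and sorting by contribution is a presentation choice rather than part of the test.

```python
# Counts from Exercise 3 (Section 5.3).
row_labels = ["No degree", "Bachelor", "Master or above"]
col_labels = ["Low", "Medium", "High"]
observed = [
    [25, 30, 10],
    [15, 45, 40],
    [5, 20, 55],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# For each cell: contribution to U_obs and signed gap n_{i,j} - E_{i,j}.
cells = []
for i, row in enumerate(observed):
    for j, n_ij in enumerate(row):
        e_ij = row_totals[i] * col_totals[j] / n
        gap = n_ij - e_ij
        cells.append((gap ** 2 / e_ij, gap, row_labels[i], col_labels[j]))

# Largest contributions first.
for contribution, gap, row_label, col_label in sorted(cells, reverse=True):
    print(f"{row_label:16} {col_label:7} gap={gap:+7.2f} contribution={contribution:6.2f}")

print(f"U_obs = {sum(cell[0] for cell in cells):.2f}")
```

The three largest contributions come from the cells No degree / Low (more employees than expected), Master or above / High (more than expected), and No degree / High (fewer than expected).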