Commit 11ba1a6

Copilot and d-morrison committed
Complete and polish classification section introduction
Co-authored-by: d-morrison <[email protected]>
1 parent bf43dee commit 11ba1a6

1 file changed, +87 -25 lines changed
classification.qmd

Lines changed: 87 additions & 25 deletions
@@ -2,50 +2,112 @@
22

33
## Introduction to classification {#sec-classification}
44

5-
### Positive predictive value
5+
Classification is a fundamental concept in epidemiology and diagnostic medicine, where we need to determine whether an individual has a particular disease or condition based on test results or other indicators. Understanding how to interpret diagnostic tests requires knowledge of key statistical concepts including sensitivity, specificity, and predictive values.
66

7-
Suppose a test is 99 sensitive, 99 specific;
7+
In this section, we explore how Bayes' theorem allows us to calculate the probability that a person has a disease given a positive test result. This is particularly important in public health decision-making, where we must understand not just how accurate a test is in general, but how to interpret test results for individuals in specific populations.
88

9-
99% Sensitive means if the person has disease, the test is positive, 99% of
10-
time:
9+
### Diagnostic test characteristics
1110

12-
$$\pmf{ + | D} = .99$$
11+
When evaluating a diagnostic test, we consider several key performance measures:
1312

14-
99% specific means if they don't have covid, the test says no covid, 99%
15-
time
13+
- **Sensitivity**: The probability that the test is positive given that the person has the disease, denoted $\Pr(\text{positive} \mid \text{disease})$
14+
- **Specificity**: The probability that the test is negative given that the person does not have the disease, denoted $\Pr(\text{negative} \mid \text{no disease})$
15+
- **Positive Predictive Value (PPV)**: The probability that a person has the disease given that their test is positive, denoted $\Pr(\text{disease} \mid \text{positive})$
16+
- **Negative Predictive Value (NPV)**: The probability that a person does not have the disease given that their test is negative, denoted $\Pr(\text{no disease} \mid \text{negative})$
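
The four measures defined above can all be read off a 2x2 table of test result by true disease status. The following is a minimal sketch, not part of the original text: the class name `ConfusionCounts`, the function name `diagnostic_measures`, and the counts themselves are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical counts from a 2x2 table of test result by disease status."""
    true_pos: int   # has disease, test positive
    false_neg: int  # has disease, test negative
    false_pos: int  # no disease, test positive
    true_neg: int   # no disease, test negative


def diagnostic_measures(c: ConfusionCounts) -> dict:
    """Estimate the four measures as simple proportions of the counts."""
    return {
        # Pr(positive | disease)
        "sensitivity": c.true_pos / (c.true_pos + c.false_neg),
        # Pr(negative | no disease)
        "specificity": c.true_neg / (c.true_neg + c.false_pos),
        # Pr(disease | positive)
        "ppv": c.true_pos / (c.true_pos + c.false_pos),
        # Pr(no disease | negative)
        "npv": c.true_neg / (c.true_neg + c.false_neg),
    }


# Illustrative counts only (assumed, not real data).
counts = ConfusionCounts(true_pos=90, false_neg=10, false_pos=30, true_neg=870)
measures = diagnostic_measures(counts)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity condition on true disease status (the columns of the table), while PPV and NPV condition on the test result (the rows), which is why the latter two change with prevalence.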
1617

17-
7% of people actually have covid:
18+
### Example: COVID-19 testing
1819

19-
$$\mass(A) = 0.07$$
20+
Suppose we have a COVID-19 test with the following characteristics:
2021

21-
$$\mass(\neg A) = .93$$
22+
- **99% sensitive**: If a person has COVID-19, the test will be positive 99% of the time
23+
- **99% specific**: If a person does not have COVID-19, the test will be negative 99% of the time
2224

25+
Let's define our events:
2326

27+
- Let $D$ denote the event "person has COVID-19"
28+
- Let $+$ denote the event "test is positive"
2429

25-
$p\left( negative \middle| no\ covid \right) = .99$:
26-
$p\left( B \middle| !A \right)$
30+
Then our test characteristics can be written as:
2731

28-
$$p\left( Covid \middle| positive \right) = ?$$
32+
$$
33+
\Pr(+ \mid D) = 0.99 \quad \text{(sensitivity)}
34+
$$
2935

30-
$$p\left( A \middle| B \right) = \frac{p\left( B \middle| A \right)p(A)}{p(B)}$$
36+
$$
37+
\Pr(- \mid \neg D) = 0.99 \quad \text{(specificity)}
38+
$$
3139

32-
$$p(B) = p\left( B \middle| A \right)p(A) + p\left( B \middle| !A \right)p(!A)$$
40+
Note that if specificity is 0.99, then the false positive rate is:
41+
$$
42+
\Pr(+ \mid \neg D) = 1 - 0.99 = 0.01
43+
$$
3344

34-
$$p\left( B \middle| A \right)p(A) = .99*\ .07 = .0693$$
45+
Suppose the **prevalence** of COVID-19 in the population is 7%:
3546

36-
$$\ p\left( B \middle| !A \right)p(!A) = .01*.93 = .0093$$
47+
$$
48+
\Pr(D) = 0.07
49+
$$
3750

38-
$$p(B) = .0693 + .0093 = .0786$$
51+
$$
52+
\Pr(\neg D) = 0.93
53+
$$
3954

40-
$$p\left( A \middle| B \right) = .0693/.0786$$
55+
### Calculating positive predictive value
4156

42-
$$= .88$$
57+
The key question we want to answer is: **If someone tests positive, what is the probability they actually have COVID-19?**
4358

44-
$${p\left( A \middle| B \right) = \frac{p\left( B \middle| A \right)p(A)}{p(B)}
45-
}{= p\left( B \middle| A \right)\frac{p(A)}{p(B)}
46-
}{= p\left( B \middle| A \right)\frac{p(A)}{p\left( B \middle| A \right)p(A) + p\left( B \middle| !A \right)p(!A)}}$$
59+
This is the positive predictive value:
60+
$$
61+
\Pr(D \mid +) = \, ?
62+
$$
4763

48-
$$= \frac{p(A)}{p(A) + \frac{p\left( B \middle| !A \right)}{p\left( B \middle| A \right)}p(!A)}$$
64+
We can use **Bayes' theorem** to calculate this:
4965

50-
$$= \frac{1}{1 + \frac{p\left( B \middle| !A \right)}{p\left( B \middle| A \right)}\frac{p(!A)}{p(A)}}
5166
$$
67+
\Pr(D \mid +) = \frac{\Pr(+ \mid D) \cdot \Pr(D)}{\Pr(+)}
68+
$$
69+
70+
To find $\Pr(+)$, we use the **law of total probability**:
71+
72+
$$
73+
\Pr(+) = \Pr(+ \mid D) \cdot \Pr(D) + \Pr(+ \mid \neg D) \cdot \Pr(\neg D)
74+
$$
75+
76+
Now we can calculate each component:
77+
78+
**Joint probability of having the disease and testing positive (true positive):**
79+
$$
80+
\Pr(+ \mid D) \cdot \Pr(D) = 0.99 \times 0.07 = 0.0693
81+
$$
82+
83+
**Joint probability of not having the disease and testing positive (false positive):**
84+
$$
85+
\Pr(+ \mid \neg D) \cdot \Pr(\neg D) = 0.01 \times 0.93 = 0.0093
86+
$$
87+
88+
**Total probability of positive test:**
89+
$$
90+
\Pr(+) = 0.0693 + 0.0093 = 0.0786
91+
$$
92+
93+
**Positive predictive value:**
94+
$$
95+
\Pr(D \mid +) = \frac{0.0693}{0.0786} \approx 0.88
96+
$$
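
The same arithmetic can be checked numerically. This is a minimal sketch under the example's stated values (sensitivity 0.99, specificity 0.99, prevalence 0.07); the function name `positive_predictive_value` is an illustrative choice, not something defined in the source.

```python
def positive_predictive_value(sensitivity: float,
                              specificity: float,
                              prevalence: float) -> float:
    """Apply Bayes' theorem: Pr(D | +) = Pr(+ | D) Pr(D) / Pr(+)."""
    # Joint probability of disease and a positive test.
    true_pos = sensitivity * prevalence
    # Joint probability of no disease and a positive test.
    false_pos = (1 - specificity) * (1 - prevalence)
    # Law of total probability gives Pr(+) as the sum of the two.
    return true_pos / (true_pos + false_pos)


ppv = positive_predictive_value(sensitivity=0.99, specificity=0.99, prevalence=0.07)
print(f"PPV = {ppv:.4f}")  # PPV = 0.8817
```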
97+
98+
Therefore, even with a highly accurate test (99% sensitive and 99% specific), only about 88% of people who test positive actually have COVID-19. This is because the disease prevalence is relatively low (7%), so false positives make up a meaningful fraction of all positive tests.
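
One way to see where the false positives come from is to count expected outcomes in a hypothetical cohort. The sketch below assumes 10,000 people tested (an arbitrary size chosen for round numbers) and uses the example's sensitivity, specificity, and prevalence.

```python
# Hypothetical cohort size, chosen only to give round expected counts.
population = 10_000
prevalence, sensitivity, specificity = 0.07, 0.99, 0.99

diseased = population * prevalence                # about 700 people with COVID-19
not_diseased = population - diseased              # about 9,300 people without it
true_positives = diseased * sensitivity           # about 693 correct positives
false_positives = not_diseased * (1 - specificity)  # about 93 false positives
all_positives = true_positives + false_positives  # about 786 positive tests

# PPV is the share of positive tests that are true positives.
ppv_from_counts = true_positives / all_positives
print(f"{true_positives:.0f} true positives out of {all_positives:.0f} positive tests")
print(f"PPV = {ppv_from_counts:.3f}")
```

Of the roughly 786 expected positive results, about 93 come from the much larger group of people without the disease, which is why the PPV falls short of 99%.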
99+
100+
### Alternative formulation
101+
102+
We can rearrange Bayes' theorem to express the positive predictive value in terms of the sensitivity, specificity, and disease prevalence:
103+
104+
$$
105+
\begin{aligned}
106+
\Pr(D \mid +) &= \frac{\Pr(+ \mid D) \cdot \Pr(D)}{\Pr(+)} \\
107+
&= \frac{\Pr(+ \mid D) \cdot \Pr(D)}{\Pr(+ \mid D) \cdot \Pr(D) + \Pr(+ \mid \neg D) \cdot \Pr(\neg D)} \\
108+
&= \frac{\Pr(D)}{\Pr(D) + \frac{\Pr(+ \mid \neg D)}{\Pr(+ \mid D)} \cdot \Pr(\neg D)} \\
109+
&= \frac{1}{1 + \frac{\Pr(+ \mid \neg D)}{\Pr(+ \mid D)} \cdot \frac{\Pr(\neg D)}{\Pr(D)}}
110+
\end{aligned}
111+
$$
112+
113+
This final form emphasizes the ratio of the false positive rate to the sensitivity, weighted by the ratio of non-diseased to diseased individuals in the population. It shows that even with a very high sensitivity and specificity, the positive predictive value depends strongly on disease prevalence.
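
To illustrate the dependence on prevalence, the final form can be evaluated over a range of prevalence values. This is a sketch only: the function name `ppv_odds_form` is hypothetical, and every prevalence other than 0.07 is an assumed value for comparison, not a figure from the example.

```python
def ppv_odds_form(sensitivity: float,
                  specificity: float,
                  prevalence: float) -> float:
    """Evaluate PPV = 1 / (1 + [Pr(+|not D) / Pr(+|D)] * [Pr(not D) / Pr(D)])."""
    # Ratio of the false positive rate to the sensitivity.
    fpr_to_sensitivity = (1 - specificity) / sensitivity
    # Ratio of non-diseased to diseased individuals in the population.
    nondiseased_to_diseased = (1 - prevalence) / prevalence
    return 1 / (1 + fpr_to_sensitivity * nondiseased_to_diseased)


# 0.07 is the example's prevalence; the other values are assumed for illustration.
for prevalence in (0.001, 0.01, 0.07, 0.5):
    ppv = ppv_odds_form(sensitivity=0.99, specificity=0.99, prevalence=prevalence)
    print(f"prevalence = {prevalence:.3f}  PPV = {ppv:.3f}")
```

With the test held fixed at 99% sensitivity and 99% specificity, the PPV is about 0.09 at a prevalence of 0.1%, 0.50 at 1%, 0.88 at 7%, and 0.99 at 50%.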
