|
2 | 2 |
|
3 | 3 | ## Introduction to classification {#sec-classification} |
4 | 4 |
|
5 | | -### Positive predictive value |
| 5 | +Classification is a fundamental concept in epidemiology and diagnostic medicine, where we need to determine whether an individual has a particular disease or condition based on test results or other indicators. Understanding how to interpret diagnostic tests requires knowledge of key statistical concepts including sensitivity, specificity, and predictive values. |
6 | 6 |
|
7 | | -Suppose a test is 99 sensitive, 99 specific; |
| 7 | +In this section, we explore how Bayes' theorem allows us to calculate the probability that a person has a disease given a positive test result. This is particularly important in public health decision-making, where we must understand not just how accurate a test is in general, but how to interpret test results for individuals in specific populations. |
8 | 8 |
|
9 | | -99% Sensitive means if the person has disease, the test is positive, 99% of |
10 | | -time: |
| 9 | +### Diagnostic test characteristics |
11 | 10 |
|
12 | | -$$\pmf{ + | D} = .99$$ |
| 11 | +When evaluating a diagnostic test, we consider several key performance measures: |
13 | 12 |
|
14 | | -99% specific means if they don't have covid, the test says no covid, 99% |
15 | | -time |
| 13 | +- **Sensitivity**: The probability that the test is positive given that the person has the disease, denoted $\Pr(\text{positive} \mid \text{disease})$ |
| 14 | +- **Specificity**: The probability that the test is negative given that the person does not have the disease, denoted $\Pr(\text{negative} \mid \text{no disease})$ |
| 15 | +- **Positive Predictive Value (PPV)**: The probability that a person has the disease given that their test is positive, denoted $\Pr(\text{disease} \mid \text{positive})$ |
| 16 | +- **Negative Predictive Value (NPV)**: The probability that a person does not have the disease given that their test is negative, denoted $\Pr(\text{no disease} \mid \text{negative})$ |
16 | 17 |
|
17 | | -7% of people actually have covid: |
| 18 | +### Example: COVID-19 testing |
18 | 19 |
|
19 | | -$$\mass(A) = 0.07$$ |
| 20 | +Suppose we have a COVID-19 test with the following characteristics: |
20 | 21 |
|
21 | | -$$\mass(\neg A) = .93$$ |
| 22 | +- **99% sensitive**: If a person has COVID-19, the test will be positive 99% of the time |
| 23 | +- **99% specific**: If a person does not have COVID-19, the test will be negative 99% of the time |
22 | 24 |
|
| 25 | +Let's define our events: |
23 | 26 |
|
| 27 | +- Let $D$ denote the event "person has COVID-19" |
| 28 | +- Let $+$ denote the event "test is positive" |
24 | 29 |
|
25 | | -$p\left( negative \middle| no\ covid \right) = .99$: |
26 | | -$p\left( B \middle| !A \right)$ |
| 30 | +Then our test characteristics can be written as: |
27 | 31 |
|
28 | | -$$p\left( Covid \middle| positive \right) = ?$$ |
| 32 | +$$ |
| 33 | +\Pr(+ \mid D) = 0.99 \quad \text{(sensitivity)} |
| 34 | +$$ |
29 | 35 |
|
30 | | -$$p\left( A \middle| B \right) = \frac{p\left( B \middle| A \right)p(A)}{p(B)}$$ |
| 36 | +$$ |
| 37 | +\Pr(- \mid \neg D) = 0.99 \quad \text{(specificity)} |
| 38 | +$$ |
31 | 39 |
|
32 | | -$$p(B) = p\left( B \middle| A \right)p(A) + p\left( B \middle| !A \right)p(!A)$$ |
| 40 | +Note that if specificity is 0.99, then the false positive rate is: |
| 41 | +$$ |
| 42 | +\Pr(+ \mid \neg D) = 1 - 0.99 = 0.01 |
| 43 | +$$ |
33 | 44 |
|
34 | | -$$p\left( B \middle| A \right)p(A) = .99*\ .07 = .0693$$ |
| 45 | +Suppose the **prevalence** of COVID-19 in the population is 7%: |
35 | 46 |
|
36 | | -$$\ p\left( B \middle| !A \right)p(!A) = .01*.93 = .0093$$ |
| 47 | +$$ |
| 48 | +\Pr(D) = 0.07 |
| 49 | +$$ |
37 | 50 |
|
38 | | -$$p(B) = .0693 + .0093 = .0786$$ |
| 51 | +$$ |
| 52 | +\Pr(\neg D) = 0.93 |
| 53 | +$$ |
39 | 54 |
|
40 | | -$$p\left( A \middle| B \right) = .0693/.0786$$ |
| 55 | +### Calculating positive predictive value |
41 | 56 |
|
42 | | -$$= .88$$ |
| 57 | +The key question we want to answer is: **If someone tests positive, what is the probability they actually have COVID-19?** |
43 | 58 |
|
44 | | -$${p\left( A \middle| B \right) = \frac{p\left( B \middle| A \right)p(A)}{p(B)} |
45 | | -}{= p\left( B \middle| A \right)\frac{p(A)}{p(B)} |
46 | | -}{= p\left( B \middle| A \right)\frac{p(A)}{p\left( B \middle| A \right)p(A) + p\left( B \middle| !A \right)p(!A)}}$$ |
| 59 | +This is the positive predictive value: |
| 60 | +$$ |
| 61 | +\Pr(D \mid +) = \, ? |
| 62 | +$$ |
47 | 63 |
|
48 | | -$$= \frac{p(A)}{p(A) + \frac{p\left( B \middle| !A \right)}{p\left( B \middle| A \right)}p(!A)}$$ |
| 64 | +We can use **Bayes' theorem** to calculate this: |
49 | 65 |
|
50 | | -$$= \frac{1}{1 + \frac{p\left( B \middle| !A \right)}{p\left( B \middle| A \right)}\frac{p(!A)}{p(A)}} |
51 | 66 | $$ |
| 67 | +\Pr(D \mid +) = \frac{\Pr(+ \mid D) \cdot \Pr(D)}{\Pr(+)} |
| 68 | +$$ |
| 69 | + |
| 70 | +To find $\Pr(+)$, we use the **law of total probability**: |
| 71 | + |
| 72 | +$$ |
| 73 | +\Pr(+) = \Pr(+ \mid D) \cdot \Pr(D) + \Pr(+ \mid \neg D) \cdot \Pr(\neg D) |
| 74 | +$$ |
| 75 | + |
| 76 | +Now we can calculate each component: |
| 77 | + |
| 78 | +**Joint probability of having the disease and testing positive (true positive):** |
| 79 | +$$ |
| 80 | +\Pr(+ \mid D) \cdot \Pr(D) = 0.99 \times 0.07 = 0.0693 |
| 81 | +$$ |
| 82 | + |
| 83 | +**Joint probability of not having the disease and testing positive (false positive):** |
| 84 | +$$ |
| 85 | +\Pr(+ \mid \neg D) \cdot \Pr(\neg D) = 0.01 \times 0.93 = 0.0093 |
| 86 | +$$ |
| 87 | + |
| 88 | +**Total probability of positive test:** |
| 89 | +$$ |
| 90 | +\Pr(+) = 0.0693 + 0.0093 = 0.0786 |
| 91 | +$$ |
| 92 | + |
| 93 | +**Positive predictive value:** |
| 94 | +$$ |
| 95 | +\Pr(D \mid +) = \frac{0.0693}{0.0786} \approx 0.88 |
| 96 | +$$ |
| 97 | + |
| 98 | +Therefore, even with a highly accurate test (99% sensitive and 99% specific), only about 88% of people who test positive actually have COVID-19. Because the disease prevalence is relatively low (7%), false positives account for the remaining 12% or so of all positive tests ($0.0093 / 0.0786 \approx 0.12$). |
| 99 | + |
| 100 | +### Alternative formulation |
| 101 | + |
| 102 | +We can rearrange Bayes' theorem to express the positive predictive value in terms of the sensitivity, specificity, and disease prevalence: |
| 103 | + |
| 104 | +$$ |
| 105 | +\begin{aligned} |
| 106 | +\Pr(D \mid +) &= \frac{\Pr(+ \mid D) \cdot \Pr(D)}{\Pr(+)} \\ |
| 107 | +&= \frac{\Pr(+ \mid D) \cdot \Pr(D)}{\Pr(+ \mid D) \cdot \Pr(D) + \Pr(+ \mid \neg D) \cdot \Pr(\neg D)} \\ |
| 108 | +&= \frac{\Pr(D)}{\Pr(D) + \frac{\Pr(+ \mid \neg D)}{\Pr(+ \mid D)} \cdot \Pr(\neg D)} \\ |
| 109 | +&= \frac{1}{1 + \frac{\Pr(+ \mid \neg D)}{\Pr(+ \mid D)} \cdot \frac{\Pr(\neg D)}{\Pr(D)}} |
| 110 | +\end{aligned} |
| 111 | +$$ |
| 112 | + |
| 113 | +This final form emphasizes the ratio of the false positive rate to the sensitivity, weighted by the ratio of non-diseased to diseased individuals in the population. It shows that even with a very high sensitivity and specificity, the positive predictive value depends strongly on disease prevalence. |
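The arithmetic in the added section can be checked numerically. The following is a minimal Python sketch, not part of the diff itself; the function names `positive_predictive_value` and `ppv_odds_form` are hypothetical and chosen only for illustration. The first function follows the Bayes' theorem calculation with the law of total probability, and the second follows the final rearranged form, so the two should agree.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Pr(D | +) via Bayes' theorem and the law of total probability."""
    true_pos = sensitivity * prevalence                # Pr(+ | D) * Pr(D)
    false_pos = (1 - specificity) * (1 - prevalence)   # Pr(+ | not D) * Pr(not D)
    return true_pos / (true_pos + false_pos)


def ppv_odds_form(sensitivity, specificity, prevalence):
    """Pr(D | +) via the rearranged form 1 / (1 + ratio * odds against disease)."""
    fpr_over_sensitivity = (1 - specificity) / sensitivity
    odds_against_disease = (1 - prevalence) / prevalence
    return 1 / (1 + fpr_over_sensitivity * odds_against_disease)


# Same 99% sensitive, 99% specific test, evaluated at several prevalences
# (the prevalences other than 0.07 are illustrative, not from the text).
for prevalence in (0.001, 0.01, 0.07, 0.5):
    ppv = positive_predictive_value(0.99, 0.99, prevalence)
    print(f"prevalence = {prevalence:.3f}  PPV = {ppv:.3f}")
```

With the section's values (prevalence 0.07) this reproduces the PPV of roughly 0.88. At a prevalence of 1%, the same test gives a PPV of 0.5, which illustrates the dependence on prevalence described above.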