Statistics IM Board Review

Sensitivity and Specificity

  • Sensitivity (True Positive Rate): Proportion of those with the disease who test positive. Calculated as TP / (TP + FN) (true positives divided by all diseased patients). High sensitivity means the test catches most cases (few false negatives). In practice, a highly sensitive test helps ensure you don’t miss the disease.

    • Example: If 95 out of 100 diseased patients test positive, sensitivity = 95%.
  • Specificity (True Negative Rate): Proportion of those without the disease who test negative. Calculated as TN / (TN + FP) (true negatives divided by all healthy patients). High specificity means the test is good at ruling out disease (few false positives).

    • Example: If 95 out of 100 healthy people test negative, specificity = 95%.
  • Memory Aids: SnNout: a highly Sensitive test, when Negative, rules out disease. SpPin: a highly Specific test, when Positive, rules in disease.
  • Using Sensitive vs Specific Tests: In clinical strategy, a high-sensitivity test is ideal for initial screening (to catch all possible cases). A high-specificity test is often used for confirmation.

    • Examples: D-dimer (for DVT/PE) is very sensitive (negative result effectively rules out clot in low-risk patients, but positive is not definitive), whereas a confirmatory imaging test (ultrasound or CT angiography) is more specific to rule in the diagnosis. For HIV, an ELISA (high sensitivity) is used first, and positives are confirmed with a Western blot (high specificity).

 

Test Result   | Disease Present     | Disease Absent
Test Positive | True Positive (TP)  | False Positive (FP)
Test Negative | False Negative (FN) | True Negative (TN)

 

Tip: Organize data in a 2×2 table (as above) when solving test questions. Sensitivity = TP/(TP+FN) (use the “Disease Present” column); Specificity = TN/(TN+FP) (use the “Disease Absent” column).
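
A minimal Python sketch of these two formulas, using hypothetical counts chosen to match the 95% examples above:

```python
# Minimal sketch: sensitivity and specificity from a 2x2 table.
# The counts are hypothetical, matching the 95% examples above.
tp, fn = 95, 5    # diseased patients: positive / negative tests
fp, tn = 5, 95    # healthy patients: positive / negative tests

sensitivity = tp / (tp + fn)  # TP / all diseased ("Disease Present" column)
specificity = tn / (tn + fp)  # TN / all healthy ("Disease Absent" column)

print(f"Sensitivity: {sensitivity:.0%}")  # 95%
print(f"Specificity: {specificity:.0%}")  # 95%
```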

 

Positive and Negative Predictive Values

  • Positive Predictive Value (PPV): Probability that a patient actually has the disease given a positive test. Calculated as TP / (TP + FP): true positives out of all positive test results. In other words, PPV answers: “If the test is positive, what is the chance the disease is truly present?”
  • Negative Predictive Value (NPV): Probability that a patient truly does not have the disease given a negative test. Calculated as TN / (TN + FN): true negatives out of all negative results. This answers: “If the test is negative, what is the chance the person is disease-free?”
  • Prevalence Effect: PPV and NPV are not fixed test characteristics: they depend on disease prevalence in the population being tested.

    • When disease is common (high pre-test prevalence), PPV rises (a positive is more likely true) and NPV falls (a negative is more likely false).
    • When disease is rare (low prevalence), PPV drops (most positives are false alarms) and NPV becomes very high (most negatives are true).
    • Illustration: A test with 95% sensitivity and 95% specificity at 20% disease prevalence might have PPV ~83% and NPV ~98%. If the disease prevalence is only 0.2%, the PPV plummets to ~4% (almost every positive is a false positive) while NPV approaches ~100%. (These numbers are reproduced in the sketch after this list.)
  • Extreme example: A positive pregnancy test in a male patient has an extremely low PPV for pregnancy: essentially no chance of true pregnancy despite the test’s high accuracy in females. This highlights that predictive values are meaningful only in the appropriate clinical context (they must be interpreted with prevalence in mind).
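
To see the prevalence effect numerically, here is a minimal Python sketch that applies the 2×2 bookkeeping to a hypothetical test; the 95%/95% performance and the two prevalences come from the illustration above:

```python
# Minimal sketch: PPV and NPV as a function of prevalence for a test
# with 95% sensitivity and 95% specificity (the illustration above).
def predictive_values(sens, spec, prev):
    """Return (PPV, NPV) for a given sensitivity, specificity, prevalence."""
    tp = sens * prev              # true positives per unit of population
    fp = (1 - spec) * (1 - prev)  # false positives
    tn = spec * (1 - prev)        # true negatives
    fn = (1 - sens) * prev        # false negatives
    return tp / (tp + fp), tn / (tn + fn)

for prev in (0.20, 0.002):
    ppv, npv = predictive_values(0.95, 0.95, prev)
    print(f"prevalence {prev:.1%}: PPV {ppv:.0%}, NPV {npv:.1%}")
# prevalence 20.0%: PPV 83%, NPV 98.7%
# prevalence 0.2%: PPV 4%, NPV 100.0%
```

Note that the test's sensitivity and specificity never change in this calculation; only the prevalence does, which is exactly why PPV swings so widely.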

 

Likelihood Ratios (LR)

  • Likelihood Ratio for a Positive Test (LR+): Ratio of the probability of a positive test in a person with disease to the probability of a positive test in a person without disease. Formula: LR+ = sensitivity / (1 − specificity). A high LR+ means a positive result is much more likely in the diseased vs. non-diseased: so it greatly increases the post-test probability of disease.
  • Likelihood Ratio for a Negative Test (LR-): Ratio of the probability of a negative test in someone with disease to the probability of a negative test in someone without disease. Formula: LR- = (1 − sensitivity) / specificity. A low LR- (closer to 0) means a negative result is far less likely in someone with the disease: so a negative greatly decreases the probability of disease.
  • Interpretation of LR Values: An LR = 1 indicates the test provides no discriminatory information. The further from 1, the more useful the test:

    • LR+ > 10 (or >>1) provides strong evidence to rule in disease (large increase in likelihood if test is positive).
    • LR+ ~5 is moderately useful; LR+ ~2 is weak.
    • LR- < 0.1 provides strong evidence to rule out disease (large decrease in likelihood if test is negative).
    • LR- ~0.2 is moderately useful; LR- ~0.5 is weak.
  • Applying LRs to Pre-Test Probability: Likelihood ratios allow you to revise an initial (pre-test) probability into a post-test probability. Technically, you convert pre-test probability to odds, multiply by the LR, then convert back to probability. In practice, certain LR magnitudes have an intuitive effect:

    • An LR+ around 10 produces a large increase in post-test probability, on the order of 45 percentage points in the mid-range (e.g., a 50% pre-test probability becomes ~91% post-test by the exact odds calculation).
    • An LR- around 0.1 produces a comparably large decrease (e.g., from 50% down to ~9%). Both conversions are worked in the sketch after this list.
    • Note: LRs (like sensitivity & specificity) are independent of prevalence. They are intrinsic to the test, and you can use them for any patient by incorporating the patient’s specific pre-test likelihood.
  • Example: Procalcitonin for diagnosing pneumonia has about LR+ ≈ 8 and LR- ≈ 0.7. This means a positive procalcitonin markedly increases the likelihood of pneumonia (LR+ is high, good for ruling in), but a negative result doesn’t lower the probability much (LR- is not low enough to confidently rule out disease). In general, the ideal diagnostic test has both a very high LR+ and a very low LR-.
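
A minimal Python sketch of the odds conversion described above (probability → odds, multiply by the LR, convert back); the 50% pre-test examples match the bullets above:

```python
# Minimal sketch: pre-test probability -> post-test probability via an LR.
def post_test_probability(pre_test_prob, lr):
    pre_odds = pre_test_prob / (1 - pre_test_prob)  # probability -> odds
    post_odds = pre_odds * lr                       # apply the likelihood ratio
    return post_odds / (1 + post_odds)              # odds -> probability

print(f"{post_test_probability(0.50, 10):.0%}")   # LR+ = 10:  50% -> 91%
print(f"{post_test_probability(0.50, 0.1):.0%}")  # LR- = 0.1: 50% -> 9%
```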

 

Absolute vs. Relative Risk

  • Risk (Event Rate): In any study group, “risk” refers to the probability (percentage) of the outcome/event occurring. For example, if 10 out of 200 patients on a new drug have heart attacks, the risk in that treatment group = 5%.
  • Absolute Risk Reduction (ARR): The absolute difference in outcome rates between two groups. Calculated as Risk_control − Risk_treatment for a therapy that reduces risk. It tells us how much the treatment actually lowers the chance of the outcome in absolute terms.

    • If ARR is a positive number, the treatment prevented bad outcomes (reduced risk). If ARR is negative (i.e. treatment risk > control risk), it means the treatment increased risk (harm) by that amount. This difference is also called the absolute risk difference.
  • Relative Risk (RR): The ratio of outcome risk with treatment to the risk without treatment. Formula: RR = Risk_treatment / Risk_control. It indicates the proportion of baseline risk remaining under the treatment.

    • RR < 1.0 means the outcome is less likely in the treatment group than control (risk reduced). For example, RR = 0.5 means the treatment group’s risk is half the control’s risk (a 50% relative risk reduction).
    • RR > 1.0 means the outcome is more likely with the treatment (risk increased). For example, RR = 2.0 means the outcome rate is twice as high in the treatment group (100% relative increase in risk).
  • Relative vs Absolute Impact: Be cautious interpreting relative risk alone: a large relative risk reduction can correspond to a very small absolute benefit if the baseline risk is low.

    • Example: Reducing risk from 10% to 5% is a 50% relative risk reduction (RR 0.5) and a 5% absolute reduction. But reducing risk from 0.0002% to 0.0001% is also ~50% relative (RR ~0.5) while the absolute reduction is only 0.0001%!
    • Key Point: Always consider the absolute risk reduction (ARR) to understand clinical significance. Authors or drug reps may highlight a “25% relative risk reduction” because it sounds impressive, but the actual improvement might be just “5% less risk” in absolute terms.
  • Hazard Ratios: In clinical trials you may see a hazard ratio (HR) reported for outcomes. This is similar to a relative risk, but derived from time-to-event (survival) analysis. An HR of 1.59 (for example) means at any point in time, the hazard (event rate) in the treatment group is 59% higher than in the control group. HR < 1 indicates benefit (lower hazard with treatment), and HR > 1 indicates harm, analogous to RR < 1 or > 1.

 

Outcome/Metric                | Treatment (n=1000) | Control (n=1000)
Deaths                        | 150                | 200
Survivors                     | 850                | 800
Risk of death                 | 15%                | 20%
Absolute Risk Reduction (ARR) | 5% (absolute decrease in death rate)
Relative Risk (RR)            | 0.75 (treatment death risk is 75% of control’s)
Relative Risk Reduction (RRR) | 25% reduction in relative terms (100% − 75%)

 

Example: In a trial of 1,000 patients per arm, suppose 20% of control patients die versus 15% of treated patients. The ARR is 20% − 15% = 5% (5% fewer deaths with treatment). The RR = 15%/20% = 0.75, which can be described as a 25% relative risk reduction in mortality. This means you would prevent one death for every 20 patients treated (see NNT below). While “25% reduction” sounds large, the absolute benefit is 5%: both perspectives are important.
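
A minimal Python sketch of these calculations using the trial numbers above:

```python
# Minimal sketch: absolute vs relative risk, from the trial table above
# (150/1000 deaths with treatment vs 200/1000 with control).
risk_treatment = 150 / 1000   # 15%
risk_control = 200 / 1000     # 20%

arr = risk_control - risk_treatment   # absolute risk reduction
rr = risk_treatment / risk_control    # relative risk
rrr = 1 - rr                          # relative risk reduction

print(f"ARR: {arr:.0%}, RR: {rr:.2f}, RRR: {rrr:.0%}")
# ARR: 5%, RR: 0.75, RRR: 25%
```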

 

Number Needed to Treat (NNT)

  • Number Needed to Treat: The number of patients who must receive the therapy for one patient to benefit (prevent one adverse outcome). Formula: NNT = 1 / ARR (using ARR in decimal form). Always round up to the next whole person.

    • Example: If a drug reduces absolute risk by 10% (0.10), then NNT = 1/0.10 = 10. You must treat 10 patients to prevent 1 outcome. If ARR were 5% (0.05), NNT = 20.
  • Interpretation: A smaller NNT indicates a more effective intervention (fewer patients need treatment for one to benefit). An NNT of 1 would mean every patient treated benefits (100% efficacy). By contrast, very large NNTs (hundreds or thousands) may call into question a therapy’s practical value.
  • Number Needed to Harm (NNH): Similarly, for adverse outcomes or side effects, NNH = 1 / Absolute Risk Increase. This tells how many patients, on average, would need to be exposed to a risk factor or treatment for one additional harmful outcome to occur. A small NNH (close to 1) indicates a very dangerous exposure; a large NNH means harms are relatively rare.
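
A minimal Python sketch of the NNT/NNH arithmetic; the 2% absolute risk increase used for NNH is a hypothetical value:

```python
import math

# Minimal sketch: NNT from an absolute risk reduction (ARR), rounding up
# to the next whole person as described above.
def number_needed(absolute_risk_change):
    return math.ceil(1 / absolute_risk_change)

print(number_needed(0.10))  # ARR 10% -> NNT 10
print(number_needed(0.05))  # ARR 5%  -> NNT 20

# NNH uses the same arithmetic on an absolute risk increase
# (2% here is a hypothetical value):
print(number_needed(0.02))  # NNH 50: expose 50 for 1 additional harm
```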

 

P-Values (Statistical Significance)

  • P-Value Definition: The p-value is the probability of observing your study results (or something more extreme) if there truly is no difference (i.e., if the null hypothesis is true). Loosely speaking, it gauges how easily the findings could have arisen from random chance alone. A low p-value suggests the observed effect is unlikely to be purely chance.
  • Significance Threshold: By convention, p < 0.05 is considered “statistically significant.” This means that if no real effect exists, there is less than a 5% chance of obtaining the observed results randomly. (Informally, this is often described as >95% confidence that the result is not just luck.) This 0.05 cutoff is arbitrary but commonly used.
  • Interpreting p ≈ 0.05: A p-value of 0.05 means that, if there were no true effect, a result this extreme would still turn up about 1 time in 20. It’s a borderline value: not overwhelmingly strong evidence. In fact, getting something with ~5% probability isn’t that rare.

    • Example: The probability of flipping a coin and getting heads 5 times in a row is ~3% (p = 0.03). If a study reports p ≈0.03–0.05, think of it like that coin-flip streak: it could happen by chance, even though it’s somewhat unlikely. (This streak probability is simulated in the sketch after this list.)
    • Therefore, results with p just under 0.05 should be viewed with some caution. One study in 20 might show a “significant” finding just by chance alone.
  • “Statistically Significant” vs. “True” Effect: A smaller p-value (e.g. p < 0.001) gives stronger confidence that the effect is real (chance of a fluke <0.1%). Conversely, a result that just misses the cutoff (e.g. p = 0.06) may still suggest a real trend: it’s only slightly less significant than p = 0.05. Always consider the study context, effect size, and confidence intervals rather than focusing blindly on the 0.05 threshold.
  • Bottom line: The p-value measures statistical evidence, but it doesn’t tell you about clinical importance or the magnitude of effect. Also, “not significant” (p > 0.05) doesn’t always mean “no effect”: it might mean the study was underpowered or the effect is small.
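
The coin-flip analogy lends itself to a quick simulation. A minimal Python sketch (the 100,000-trial count is arbitrary):

```python
import random

# Minimal sketch: how often does "5 heads in a row" occur by chance alone?
# A result with p ~0.03 is unlikely on any single try, but not rare
# across many tries.
trials = 100_000
streaks = sum(
    all(random.random() < 0.5 for _ in range(5))  # one run of 5 fair flips
    for _ in range(trials)
)
print(f"Exact probability: {0.5 ** 5:.1%}")            # 3.1%
print(f"Simulated frequency: {streaks / trials:.1%}")  # ~3.1%
```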

 

Confidence Intervals (CI)

  • 95% Confidence Interval: A confidence interval provides a range of values within which the true effect is likely to lie (with a given level of confidence, typically 95%). For example, a 95% CI of 2–8 for a treatment’s effect means we are 95% confident that the true effect is between 2 and 8. In essence, if the study were repeated many times, 95% of the time the true effect would fall inside that interval.
  • More Information than p-value: The CI conveys both the estimated effect size (the midpoint) and the precision of the estimate (the width of the interval). A narrow CI indicates a more precise estimate (usually from a large study with lots of data), whereas a wide CI indicates less precision (small sample or high variability). The p-value relates to whether the interval excludes the “no effect” value, but the CI tells you the range of plausible effects.
  • Null Hypothesis Reference: To interpret a CI, know what null result it’s testing:

    • For differences (e.g., risk difference, mean difference), 0 represents no difference.
    • For ratios (e.g., relative risk, odds ratio, hazard ratio), 1.0 represents no difference.
  • Statistical Significance via CI: If the 95% CI excludes the null value (does not contain 0 for differences, or 1.0 for ratios), the result is statistically significant at the 0.05 level. If the CI includes the null value, then p ≥ 0.05 (not statistically significant).

    • Example: A study finds a risk difference of –5% (drug lowers risk by 5%) with 95% CI –10% to +1%. Because the interval spans 0 (goes from a 10% reduction up to a 1% increase), we cannot be sure there’s a true effect: it’s not statistically significant. (This check is coded in the sketch after this list.)
    • Similarly, if a study reports an odds ratio = 0.8 with 95% CI 0.5–1.3, the interval overlaps 1.0: we can’t rule out no effect (not significant). If instead the 95% CI was 0.5–0.9 (all below 1.0), that would indicate a statistically significant reduction in odds.
  • Interpretation Tip: Always check the CI range. A result might be “statistically significant” but have a very wide CI (e.g., “significant” but ranging from a small to a very large effect), which limits confidence in the precision or clinical relevance of the finding.
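
A minimal Python sketch of the CI-based significance check, using a Wald-type 95% confidence interval for a risk difference; the trial counts are hypothetical, chosen so the interval spans 0 much like the example above:

```python
import math

# Minimal sketch: Wald 95% CI for a risk difference, then a significance
# check against the null value of 0.
def risk_difference_ci(events_a, n_a, events_b, n_b, z=1.96):
    p_a, p_b = events_a / n_a, events_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return diff, diff - z * se, diff + z * se

# Hypothetical trial: 15/200 events on drug vs 25/200 on placebo.
diff, lo, hi = risk_difference_ci(15, 200, 25, 200)
print(f"Risk difference: {diff:+.1%} (95% CI {lo:+.1%} to {hi:+.1%})")
# Risk difference: -5.0% (95% CI -10.9% to +0.9%)
print("Significant at 0.05" if lo > 0 or hi < 0 else "Not significant (CI spans 0)")
```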

 

Forest Plots and Meta-Analysis

  • Forest Plot Basics: A forest plot is a graphical summary of results from multiple studies (often used in meta-analyses). Each study is represented by a point (usually a square or dot) for the study’s effect estimate and a horizontal line for its confidence interval. There is a vertical line down the middle representing the line of no effect (0 if an absolute difference, 1 if a ratio measure).

    • If a study’s CI line crosses the no-effect line, that study’s result is not statistically significant. You can visually see which studies had significant findings (their entire CI is on one side of the line) versus nonsignificant (CI overlaps the line).
    • Studies to the left of the line might favor treatment (if outcome is bad, left = fewer events with treatment), and to the right might favor control or harm (depending on how the plot is labeled). The plot usually has labels or an x-axis indicating the measure (e.g., an odds ratio scale).
  • Meta-Analysis (Combined Results): At the bottom of a forest plot, a diamond often represents the pooled result of all studies combined. The center of the diamond is the overall effect estimate, and its width is the confidence interval for the combined data. Because combining studies increases the sample size, the pooled estimate typically has a narrower CI than those of individual studies. (A minimal plotting sketch follows this list.)

    • Interpretation: If the diamond (overall CI) does not cross the line of no effect, the meta-analysis finds a statistically significant overall result. If the diamond touches or crosses the line, the overall effect is not significant (even if some small studies suggested differences).
    • Example: Imagine a meta-analysis of several trials examining whether ordering extra diagnostic tests reduces patient anxiety. In a forest plot, a few studies show slight anxiety reduction and a few show slight increase, all with wide CIs overlapping no effect. The combined result (diamond) also crosses the no-effect line, indicating no significant difference in anxiety with test-ordering. This tells us that, overall, the intervention didn’t meaningfully change the outcome.
  • Use of Meta-Analysis: Meta-analyses are performed to get a clearer picture when individual studies are inconclusive or too small. By pooling data, a meta-analysis can detect small but real effects or confirm that an apparent effect is inconsistent. Always examine heterogeneity (variation) between studies as well; in general, though, the meta-analysis result provides the highest-level evidence by synthesizing all available data.
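
A minimal Python/matplotlib sketch of a forest plot, with a fixed-effect (inverse-variance) pooled estimate shown as a diamond; all study names, odds ratios, and CIs are made up for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical study results: odds ratios with 95% CIs.
studies = ["Study A", "Study B", "Study C", "Study D"]
or_est = np.array([0.70, 1.10, 0.85, 0.95])
ci_low = np.array([0.45, 0.70, 0.55, 0.60])
ci_high = np.array([1.10, 1.72, 1.31, 1.50])

# Fixed-effect pooling: weight each log-OR by 1/SE^2, recovering SE from
# the width of the 95% CI on the log scale.
log_or = np.log(or_est)
se = (np.log(ci_high) - np.log(ci_low)) / (2 * 1.96)
w = 1 / se**2
pooled = np.sum(w * log_or) / np.sum(w)
pooled_se = 1 / np.sqrt(np.sum(w))
ci_pooled = (np.exp(pooled - 1.96 * pooled_se), np.exp(pooled + 1.96 * pooled_se))

fig, ax = plt.subplots()
y = np.arange(len(studies), 0, -1)
ax.errorbar(or_est, y, xerr=[or_est - ci_low, ci_high - or_est],
            fmt="s", color="black", capsize=3)  # squares with CI whiskers
ax.plot(np.exp(pooled), 0, "D", markersize=10)  # pooled estimate (diamond)
ax.hlines(0, ci_pooled[0], ci_pooled[1])        # pooled CI
ax.axvline(1.0, linestyle="--")                 # line of no effect (OR = 1)
ax.set_xscale("log")
ax.set_yticks(list(y) + [0])
ax.set_yticklabels(studies + ["Pooled"])
ax.set_xlabel("Odds ratio (log scale)")
plt.show()
```

With these made-up numbers the pooled diamond lands just below 1 with a CI crossing 1.0: the “nonsignificant overall” pattern described above.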

 

Study Designs: Observational vs. Randomized

  • Identifying Study Type: First determine if a study is observational or experimental. In an experimental study, investigators assign an intervention (e.g., a drug vs placebo): if participants are randomized to groups, it’s a Randomized Controlled Trial (RCT). In an observational study, researchers do not assign treatments; they simply observe people in different circumstances. Observational studies can be:

    • Cross-Sectional (single time-point, looking at prevalence),
    • Cohort (following people forward in time), or
    • Case-Control (looking backward after selecting outcomes).
  To distinguish: If they identified an exposure and tracked outcomes forward in time, it’s cohort. If they started with an outcome and looked back for exposures, it’s case-control. If there was no time element (simultaneous assessment), it’s cross-sectional.
  • Cross-Sectional Studies: An observational design where data are collected at one point in time (a “snapshot”). These studies measure prevalence of outcomes or characteristics in a population. They can show associations but cannot establish causality or temporal sequence (you don’t know which came first, exposure or outcome).

    • Often used for hypothesis generation or public health surveys.
    • Example: Survey a population today and find that women have a lower prevalence of heart disease than men. This suggests a correlation (possibly hormonal factors), but since it’s just a snapshot, we can’t determine cause-effect or the timing of exposure vs outcome.
  • Cohort Studies: An observational study where a group (cohort) is defined by exposure status and followed forward in time to observe outcomes. Participants are free of the outcome at start; one group has a certain exposure/risk factor and the other does not, and the study tracks who develops the outcome. Cohort studies can be prospective (planned and followed going forward) or retrospective (using historical data/records to follow outcomes from a point in the past).

    • Cohort studies measure incidence (new occurrence of outcome) and can establish a temporal relationship (exposure precedes outcome), so they provide stronger evidence for causation than cross-sectional studies (though not as strong as RCTs). They often report results as relative risk or hazard ratios for the outcome in exposed vs unexposed.
    • Strengths: Good for studying the effect of a risk factor on multiple outcomes, and for examining outcomes of rare exposures (e.g., a cohort of people exposed to a chemical vs not).
    • Weaknesses: Time-consuming and possibly expensive (especially prospective cohorts that may take years). Confounding is a major concern: exposed and unexposed groups might differ in other ways that affect outcome. Investigators try to measure and adjust for confounders, but unmeasured factors can bias results.

      • Example: The Framingham Heart Study is a classic prospective cohort: it followed a community over decades to link risk factors (like smoking, cholesterol) with outcomes (heart disease).
      • Confounding example: Early observational cohort studies suggested hormone replacement therapy (HRT) in postmenopausal women reduced heart disease. However, HRT users were generally healthier and of higher socioeconomic status (confounding variables). It wasn’t the estrogen itself providing benefit, but other factors. A later RCT (Women’s Health Initiative) revealed HRT actually increased heart risk, overturning the confounded observational findings.
  • Case-Control Studies: An observational study that works backwards. It starts by identifying subjects with the outcome of interest (cases, e.g. a disease) and a comparable group without the outcome (controls). Then it looks retrospectively for prior exposure to potential risk factors in each group. This design is efficient for studying rare diseases or outcomes that take a long time to appear, since it deliberately collects cases of the outcome instead of waiting for them to occur.

    • Investigators calculate an Odds Ratio (OR) to estimate the association between exposure and outcome (because we start with cases/controls, we cannot directly get incidence or RR). The OR approximates the relative risk when the outcome is rare. (A worked OR calculation appears at the end of this section.)
    • Strengths: Faster and cheaper for initial studies of rare conditions. Can examine many different potential risk factors for the outcome.
    • Weaknesses: Prone to recall bias, since people who develop a disease might remember or report past exposures differently than those without the disease. Also, selecting appropriate controls can be challenging (they should be similar to cases except for the disease). Case-control studies cannot directly establish incidence or risk; they only compare odds of exposure.
    • Example: A case-control study of endometrial cancer might find women with endometrial cancer (cases) and similar women without it (controls), then look back at their histories. If a significantly higher proportion of cancer cases had past unopposed estrogen exposure compared to controls, that suggests an association (odds ratio would be >1). This design efficiently identified the link between estrogen use and endometrial cancer risk.
  • Randomized Controlled Trials (RCTs): An experimental study where participants are randomly assigned to an intervention/treatment group or a control group (e.g. placebo or standard care). RCTs are conducted prospectively and are considered the gold standard for evaluating cause-and-effect relationships for interventions.

    • Randomization: The random assignment balances both known and unknown confounding factors between groups on average. This means that aside from the treatment, the groups are similar, allowing a fair comparison of outcomes. If randomization is done properly (especially with a large sample), differences in outcome can be attributed to the intervention with greater confidence.
    • RCTs often use blinding (single-blind, double-blind) so that participants and/or investigators don’t know who is receiving the treatment vs placebo. Blinding minimizes bias in outcome assessment and patient care.
    • Strengths: Provides the most robust evidence for causality. Can directly measure absolute and relative risk reduction, NNT, etc., for the intervention. Baseline characteristics of the groups should be similar; any residual differences are due to chance (and chance imbalances are more likely in small samples).
    • Weaknesses: Can be expensive and time-intensive. May raise ethical issues (you can’t assign harmful exposures, and sometimes you must stop a trial early if one group clearly fares much worse or better). Some questions aren’t easily studied by RCT due to feasibility or ethics (e.g., you can’t randomize people to smoke or not, or to a toxic exposure).
    • Example: The Women’s Health Initiative trial randomly assigned ~16,000 women to Hormone Replacement Therapy vs placebo. This RCT was able to conclusively show that HRT increased the risk of heart disease and breast cancer, contrary to earlier observational studies. The randomization eliminated the healthy-user bias that plagued the observational data.
  • Summary of Study Designs: For exam purposes, focus on identifying the study type from descriptions:

    • If randomized intervention -> RCT (experimental study).
    • If observational:

      • Prospective, exposure to outcome -> Cohort study (following a group forward in time).
      • Retrospective, outcome back to exposure -> Case-Control study (starting with cases and controls, looking back).
      • No time sequence (single time-point) -> Cross-sectional study (snapshot survey or prevalence study).
  • Each design has its role: RCTs best for questions of therapy (when feasible), cohort and case-control for studying risk factors and harms (especially when RCTs would be unethical or impractical), and cross-sectional for quick assessments and generating hypotheses. Understanding these designs and their limitations (confounding, bias, etc.) is crucial for evidence-based practice and board exams.
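
Finally, a minimal Python sketch of the odds-ratio calculation from a case-control 2×2 table; the counts are hypothetical, loosely modeled on the endometrial cancer example above:

```python
# Minimal sketch: odds ratio from a case-control study (hypothetical counts).
exposed_cases, unexposed_cases = 60, 40        # cases with/without prior exposure
exposed_controls, unexposed_controls = 25, 75  # controls with/without exposure

odds_cases = exposed_cases / unexposed_cases        # odds of exposure among cases
odds_controls = exposed_controls / unexposed_controls
odds_ratio = odds_cases / odds_controls             # (60*75)/(40*25) = 4.5

print(f"OR = {odds_ratio:.1f}")  # OR > 1: exposure associated with the disease
```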