Beyond the Average

Understanding Heterogeneous Treatment Effects in Experimental Research

From One Answer to Many

Experimental research often starts by asking "Does it work on average?" This is the Average Treatment Effect (ATE). However, this single number can hide a more complex truth. Heterogeneous Treatment Effects (HTE) analysis allows us to ask a more nuanced question: "For whom does it work, and why?" This section illustrates the fundamental shift from a single average effect to multiple, conditional effects.

Average Treatment Effect (ATE)

A single measure representing the mean impact for the entire population. It's powerful but can be misleading.

Example: a +5% average effect for the whole sample.

Conditional ATE (CATE)

Estimates the average effect for specific subgroups, revealing who benefits most, or who might even be harmed.

  • Group A: +15%
  • Group B: +2%
  • Group C: -3%
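To make the contrast concrete, the short sketch below simulates an experiment in which the true effect differs by group, then computes the single ATE and the group-level CATEs as simple differences in means. The group labels and effect sizes mirror the illustrative numbers above; they are not real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 30_000

# Simulate three subgroups whose true treatment effects differ (illustrative values).
group = rng.choice(["A", "B", "C"], size=n)
treated = rng.integers(0, 2, size=n)
true_effect = np.select([group == "A", group == "B"], [0.15, 0.02], default=-0.03)
outcome = rng.normal(0.0, 0.5, size=n) + treated * true_effect

df = pd.DataFrame({"group": group, "treated": treated, "outcome": outcome})

# ATE: one difference in means for the whole sample.
ate = df.loc[df.treated == 1, "outcome"].mean() - df.loc[df.treated == 0, "outcome"].mean()
print(f"ATE: {ate:+.3f}")

# CATE: the same difference in means, computed within each subgroup.
for name, g in df.groupby("group"):
    cate = g.loc[g.treated == 1, "outcome"].mean() - g.loc[g.treated == 0, "outcome"].mean()
    print(f"CATE for group {name}: {cate:+.3f}")
```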

Case Study: The Moving to Opportunity Experiment

The MTO experiment is a powerful example of why HTE matters. The program offered housing vouchers to families living in high-poverty public housing so they could move to lower-poverty neighborhoods. Initial analyses found an ATE on adult earnings of essentially zero, suggesting the program had failed. However, a later HTE analysis, summarized below, revealed a dramatically different story based on how old the children were when their families moved.

Long-Term Income Gains: A Tale of Two Childhoods

Children who moved before age 13 saw roughly a 31% increase in adult income, while those who moved as teenagers saw a slightly negative effect. This single finding, missed by the ATE analysis, completely reframed the policy lessons and demonstrated the program's hidden success.

The HTE Methodological Toolkit

The methods for finding HTE have evolved. The traditional approach is confirmatory, testing specific, pre-stated hypotheses. Modern methods are more exploratory, using machine learning to discover unexpected patterns in the data. Both approaches are outlined below.
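On the exploratory side, one common machine-learning strategy is a meta-learner such as the T-learner: fit separate outcome models for treated and control units and take the difference of their predictions as a per-unit CATE estimate. The sketch below is a minimal illustration with scikit-learn on simulated data; the variable names and data-generating process are assumptions for the example, and real applications typically add honest sample splitting or purpose-built methods such as causal forests.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 10_000

# Simulated covariates, random assignment, and an outcome whose true
# treatment effect depends on the first covariate (illustrative only).
X = rng.normal(size=(n, 5))
T = rng.integers(0, 2, size=n)
tau = 0.5 * (X[:, 0] > 0)                       # heterogeneous true effect
Y = X @ rng.normal(size=5) + tau * T + rng.normal(size=n)

# T-learner: one outcome model per treatment arm.
model_treated = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[T == 1], Y[T == 1])
model_control = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[T == 0], Y[T == 0])

# Estimated CATE for each unit = predicted outcome under treatment minus under control.
cate_hat = model_treated.predict(X) - model_control.predict(X)

print(f"Mean estimated CATE where X1 > 0:  {cate_hat[X[:, 0] > 0].mean():+.2f}")
print(f"Mean estimated CATE where X1 <= 0: {cate_hat[X[:, 0] <= 0].mean():+.2f}")
```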

The Confirmatory Approach: Test a Theory

This approach starts with a pre-specified hypothesis about which subgroup should have a different effect. It is the gold standard for testing a theory, typically using interaction terms in a regression model. To be credible, the hypothesis must be declared *before* seeing the data.

Y = β₀ + β₁T + β₂S + β₃(T × S) + ε

Here, a significant β₃ coefficient provides evidence of HTE for subgroup S.
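As a minimal sketch of this specification, the code below simulates a randomized experiment in which the effect is larger in subgroup S and fits the interaction model with statsmodels; the coefficient and p-value on T:S correspond to β₃. All names and numbers here are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 10_000

# Simulated randomized experiment: the treatment effect is larger when S = 1.
df = pd.DataFrame({
    "T": rng.integers(0, 2, size=n),   # randomized treatment indicator
    "S": rng.integers(0, 2, size=n),   # pre-specified subgroup indicator
})
df["Y"] = 1.0 + 0.05 * df["T"] + 0.20 * df["S"] + 0.10 * df["T"] * df["S"] + rng.normal(size=n)

# Y = b0 + b1*T + b2*S + b3*(T*S) + error; b3 is the HTE term.
model = smf.ols("Y ~ T + S + T:S", data=df).fit()
print(model.params[["T", "S", "T:S"]].round(3))
print("p-value for the interaction term (b3):", round(model.pvalues["T:S"], 4))
```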

Perils and Best Practices

With great analytical power comes great responsibility. HTE analysis, especially when exploratory, requires rigor to avoid common pitfalls that can lead to false discoveries and undermine the credibility of research.

⚠️ The Peril of "P-Hacking"

Testing many subgroups dramatically increases the chance of finding a "significant" result purely by luck. This practice, known as p-hacking or specification searching, is a major threat to scientific credibility.

For example, testing 10 independent subgroups at a standard 5% significance level creates a roughly 40% chance of at least one false positive result.
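That figure comes straight from the family-wise error calculation 1 − (1 − α)^k for k independent tests when no true subgroup differences exist; a quick check:

```python
# Chance of at least one false positive across k independent tests at level alpha,
# assuming no true subgroup differences exist.
alpha = 0.05
for k in (1, 5, 10, 20):
    p_any_false_positive = 1 - (1 - alpha) ** k
    print(f"{k:>2} subgroup tests -> {p_any_false_positive:.0%} chance of at least one spurious result")
```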

Best Practice: Pre-Analysis Plans

The strongest defense against p-hacking is to create a Pre-Analysis Plan (PAP). This document, registered publicly before data analysis begins, creates a clear, auditable line between planned, confirmatory tests and later exploratory findings.

  • Hypotheses: Clearly state which subgroups are being tested and why.
  • Model: Specify the exact statistical model and control variables.
  • Multiple Testing: Define the correction procedure (e.g., FDR) to be used; see the sketch after this list.
  • Power: Report power calculations for detecting a meaningful effect difference.
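For the multiple-testing item above, a common concrete choice is the Benjamini-Hochberg false discovery rate procedure, available in statsmodels. The sketch below applies it to a set of made-up subgroup p-values; the numbers are purely illustrative.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# p-values from, say, ten pre-specified subgroup interaction tests (made-up numbers).
pvals = np.array([0.001, 0.012, 0.030, 0.041, 0.049, 0.120, 0.250, 0.440, 0.610, 0.880])

# The Benjamini-Hochberg procedure controls the false discovery rate (FDR) at 5%.
reject, p_adjusted, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

for p, p_adj, keep in zip(pvals, p_adjusted, reject):
    print(f"raw p = {p:.3f} -> BH-adjusted p = {p_adj:.3f} -> {'discovery' if keep else 'not significant'}")
```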