
Bootstrapping: When the Assumptions of \( t \)-testing Fail

By Alexander Eriksson · February 14, 2026 (Updated February 14, 2026)

In an introductory econometrics course, one is always introduced to the basics of hypothesis testing, or, as in my case, overwhelmed by information on the topic. Most of the time devoted to hypothesis testing is spent on so-called "\( t \)-tests", a type of test in which we compare two statistics, most commonly an estimate against a hypothesized true value, to determine how likely it is that the estimate we observe arose by chance. This is by far the most important area of hypothesis testing for an econometrician to understand fully, yet students in bachelor's and master's programs (myself included) often take the assumptions of \( t \)-testing for granted; when those assumptions fail, the outcome of the test is likely to be inaccurate and the inference we draw from it useless.

Revisiting the Basics of \( t \)-testing

To understand why \( t \)-testing works, we first need to recall the fundamental assumptions of the underlying model:

  • Independence: The data \( \mathbf{x} \) feeding the model must be independent. If observations are correlated (common in time-series), the standard errors used in the \( t \)-statistic formula will be underestimated, leading to "false positive" significance.
  • Normality of Errors: We formally assume that the population errors (residuals) follow a normal distribution, \( \epsilon \sim N(0, \sigma^2) \). While \( t \)-tests are often robust to non-normality for sample sizes larger than 30 (\( n > 30 \)) thanks to the Central Limit Theorem, this "bridge" can fail if the underlying distribution is heavily skewed or has extreme outliers.
  • Homogeneity of Variances: We assume that the spread (variance) of our data is roughly equal in both groups (homoscedasticity).
  • Random Sampling: We assume the data was gathered using random sampling, making it a representative snapshot of the broader population.

The \( t \)-test is a parametric method because its mathematical validity relies on the assumption that the test statistic follows a specific, fully parameterized distribution: under the null hypothesis, the standardized sample mean follows a \( t \)-distribution, whose shape is determined entirely by its degrees of freedom.

However, in fields like finance, this assumption often fails. Stock market returns are famously non-normally distributed; they have "fat tails" and volatility clusters. When the normality assumption for \( \varepsilon \) fails, the standard error (\( SE \)) in our \( t \)-statistic formula will likely become a biased or inefficient measure of true uncertainty:

\[ t = \frac{\hat\beta - \beta_0}{SE(\hat\beta)} \]

If we cannot trust that the errors follow a bell curve, we cannot trust the \( SE \) derived from these parametric formulas. To solve this, we turn to non-parametric methods; for this post, specifically bootstrapping, which allows us to estimate \( SE(\hat\beta) \) empirically from the data itself, without forcing it into a pre-defined Gaussian shape.
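To see what "fat tails" do to the normality assumption, here is a small sketch of my own, using simulated data rather than real returns: it compares the excess kurtosis of a Gaussian sample with that of a heavy-tailed Student-\( t \) sample. A normal distribution has excess kurtosis of zero; fat-tailed data pushes it well above that.

```python
import numpy as np

def excess_kurtosis(x):
    # Sample excess kurtosis: roughly 0 for a normal sample, positive for fat tails
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

rng = np.random.default_rng(42)
normal_sample = rng.standard_normal(100_000)            # thin-tailed "textbook" errors
fat_tailed_sample = rng.standard_t(df=5, size=100_000)  # heavy tails, like many return series

print(f"Normal excess kurtosis:       {excess_kurtosis(normal_sample):.2f}")
print(f"Student-t(5) excess kurtosis: {excess_kurtosis(fat_tailed_sample):.2f}")
```

A parametric formula that implicitly assumes the first sample's shape will misjudge the uncertainty in the second.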

The Bootstrap Algorithm

Before we show the implementation with real data, it is important to understand why we are deviating from the standard formula. In the parametric world, we assume the population follows a perfect bell curve and use the derived formula \( \frac{\sigma}{\sqrt{n}} \) to estimate our uncertainty.

Bootstrapping takes a different path. We treat our specific sample as a "mini-population" and resample from it with replacement thousands of times. This allows us to build an empirical distribution of our estimator. Instead of relying on a theoretical distribution, we observe how our estimate actually behaves across these simulated scenarios to find a non-parametric standard error.
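Concretely, "resampling with replacement" just means drawing \( n \) values where the same observation may appear more than once. A minimal sketch with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.array([2.0, 4.0, 6.0, 8.0, 10.0])  # made-up "original sample"

# One bootstrap resample: same size n as the original, drawn with replacement
resample = rng.choice(data, size=len(data), replace=True)
print(resample)  # same length as data, but individual values may repeat
```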

To understand the mechanics, we can break the implementation down into a simple loop. The logic follows three repetitive steps:

  1. Resample: Pick \( n \) values from our original data at random, allowing the same value to be picked more than once (sampling with replacement).

  2. Estimate: Calculate our chosen statistic—whether it be a mean, a median, or a regression coefficient—for that specific resample.

  3. Store: Save that estimate and repeat the process thousands of times to see how much that estimate varies.

import numpy as np

def bootstrap(data, estimator, B=5000):
    n = len(data)
    boot_estimates = []

    for _ in range(B):
        # 1. Create a "resample" by picking n values from our data with replacement
        resample = np.random.choice(data, size=n, replace=True)

        # 2. Calculate our chosen statistic (the estimator) for this resample
        stat = estimator(resample)

        # 3. Store the result in our list
        boot_estimates.append(stat)

    # The Standard Error is the Standard Deviation of our bootstrapped estimates
    boot_se = np.std(boot_estimates, ddof=1)

    return np.mean(boot_estimates), boot_se, np.array(boot_estimates)
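As a quick sanity check on synthetic data (the sample size and seed here are my own illustrative choices), the bootstrap standard error of a mean should land very close to the textbook \( s/\sqrt{n} \). The function is restated in compact form so the snippet runs on its own:

```python
import numpy as np

def bootstrap(data, estimator, B=5000):
    # Compact restatement of the function above, so this snippet is self-contained
    n = len(data)
    boot_estimates = [estimator(np.random.choice(data, size=n, replace=True))
                      for _ in range(B)]
    return np.mean(boot_estimates), np.std(boot_estimates, ddof=1), np.array(boot_estimates)

np.random.seed(0)
data = np.random.standard_normal(50)  # synthetic sample, n = 50

boot_mean, boot_se, _ = bootstrap(data, np.mean)
parametric_se = np.std(data, ddof=1) / np.sqrt(len(data))

print(f"Parametric SE: {parametric_se:.4f}")
print(f"Bootstrap SE:  {boot_se:.4f}")  # the two should nearly coincide for clean Gaussian data
```

For well-behaved data the two agree; the payoff of the bootstrap comes precisely when the data is not well-behaved.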

Bootstrapping AAPL Stock Returns

Let us now turn to some real-world data, specifically monthly stock returns for Apple Inc. ($AAPL) between 2025-03 and 2026-02, and see whether the returns fulfill our assumption of being normally distributed:

Month | Adj. Close Price (USD)
March 2025 | 221.17
April 2025 | 211.58
May 2025 | 199.98
June 2025 | 204.55
July 2025 | 206.94
August 2025 | 231.44
September 2025 | 254.15
October 2025 | 269.86
November 2025 | 278.32
December 2025 | 271.61
January 2026 | 259.24
February 2026 | 255.54

Model Specification

Let's now assume that the stock returns of Apple can be explained by a simple Random Walk with Drift: \[ r_t = \mu + \varepsilon_t \] where \( r_t \) represents the percentage return (or log-return) of AAPL in month \( t \), \( \mu \) is a drift parameter representing the expected return \( \mathbb{E}[r_t] \), and \( \varepsilon_t \) is the random error term (the "shock").

Taking expectations shows that the quantity our point estimator (\( \hat\mu \)) targets is simply the drift: \[ \mathbb{E}[r_t] = \mathbb{E}[\mu + \varepsilon_t] = \mu + \mathbb{E}[\varepsilon_t] = \mu \]

So the estimator in this case is simply the average of the monthly returns (our twelve monthly prices yield \( n = 11 \) returns): \[ \hat\mu = \bar r = \frac{1}{n} \sum_{t=1}^n r_t \]

And the parametric standard error is derived as: \[ SE(\hat{\mu}) = \sqrt{\text{Var}(\hat{\mu})} = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}} \]

The Normality Gap

Using this definition, we can calculate the parametric standard error directly from our sample:

\[ SE(\hat{\mu}) = \frac{s}{\sqrt{n}} \approx 0.0177 \]
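This figure can be reproduced from the price table above. The model allows either percentage or log-returns; the sketch below assumes simple percentage returns \( r_t = P_t/P_{t-1} - 1 \), which is my reading of the setup:

```python
import numpy as np

# Adjusted closing prices, March 2025 - February 2026 (from the table above)
prices = np.array([221.17, 211.58, 199.98, 204.55, 206.94, 231.44,
                   254.15, 269.86, 278.32, 271.61, 259.24, 255.54])

returns = prices[1:] / prices[:-1] - 1   # simple monthly returns; 12 prices -> n = 11 returns
n = len(returns)

mu_hat = returns.mean()                           # point estimate of the drift
parametric_se = returns.std(ddof=1) / np.sqrt(n)  # s / sqrt(n)

print(f"mu_hat = {mu_hat:.4f}, SE = {parametric_se:.4f}")  # SE comes out around 0.0177
```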

Then, using the bootstrap algorithm shown earlier (with \( B=5000 \) and seed 42), we find our non-parametric standard error to be:

\[ SE_{boot} \approx 0.0169 \]
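Replicating this with the bootstrap loop on the same returns is straightforward. My run below uses NumPy's `default_rng(42)`; the exact digits depend on how the generator is seeded and consumed, so only the magnitude should be compared:

```python
import numpy as np

prices = np.array([221.17, 211.58, 199.98, 204.55, 206.94, 231.44,
                   254.15, 269.86, 278.32, 271.61, 259.24, 255.54])
returns = prices[1:] / prices[:-1] - 1  # simple monthly returns, n = 11
n = len(returns)

rng = np.random.default_rng(42)
B = 5000

# Bootstrap distribution of the mean return
boot_means = np.array([rng.choice(returns, size=n, replace=True).mean()
                       for _ in range(B)])
boot_se = boot_means.std(ddof=1)

print(f"Bootstrap SE = {boot_se:.4f}")  # lands close to the ~0.0169 quoted above
```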

In this case the two numbers look very similar. Let's see how this impacts our \( t \)-statistic when testing whether Apple's monthly drift (\( \mu \)) is significantly different from zero:

Method | Standard Error (\( SE \)) | \( t \)-statistic (\( t = \frac{\hat{\mu}}{SE} \))
Parametric | 0.0177 | 0.83
Bootstrap | 0.0169 | 0.87

Even though both \( t \)-stats are far below the typical critical value of 1.96 (meaning we can't claim Apple had a statistically significant "drift" in this period), the bootstrap gives us a (somewhat) more precise estimate of our own uncertainty. It doesn't just guess based on a bell curve; it instead measures the stability of the mean using the actual history of the stock.

