
Deconstructing the Decision Tree: A Manual Implementation of Recursive Binary Splitting

By Alexander Eriksson · February 15, 2026 (Updated February 15, 2026)

In a previous post, we looked at how bootstrapping allows us to bypass the rigid assumptions of t-tests. This post takes that logic a step further. While traditional econometrics often relies on global linear models, tree-based methods allow us to "carve" the data into local, non-parametric regions, turning a simple sequence of binary choices into a powerful tool for modern real estate valuation.

Introduction: Tree-Based Methods

Decision trees were long regarded as niche machine learning tools and were rarely used in econometrics. With the recent progress in artificial intelligence and machine learning, that has changed: tree-based methods are now an essential part of the econometrician's toolkit.

The use cases for classification and regression trees (CART) span a wide spectrum, but in this post we will use a real estate valuation dataset provided by the UC Irvine Machine Learning Repository. The dataset contains the following variables:

Variable | Type | Description
Transaction date | Numeric | The year and month of the transaction (e.g., 2013.250).
House age | Numeric | The age of the house in years.
Distance to MRT | Numeric | Distance to the nearest Mass Rapid Transit station (meters).
Convenience stores | Integer | Number of convenience stores within walking distance.
Latitude | Numeric | The geographic coordinate (North).
Longitude | Numeric | The geographic coordinate (East).
House price | Numeric | House price per unit area (10,000 New Taiwan Dollar/Ping).

Since our goal is to predict a continuous variable (house price), we are not interested in a classification tree, but rather a regression tree.
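
To make the later code concrete, here is a minimal sketch of how the data could be loaded with pandas. The file name real_estate.csv and the shortened column names are assumptions made for this post; the original UCI file ships as a spreadsheet with longer headers.

import pandas as pd

# Assumed local copy of the UCI real estate valuation data, exported to CSV
# with shortened column names (both the file name and the column names are
# assumptions for this post).
df = pd.read_csv("real_estate.csv")

features = ["transaction_date", "house_age", "mrt_distance",
            "convenience_stores", "latitude", "longitude"]

X = df[features]   # predictors
y = df["price"]    # house price per unit area (the response)

print(X.shape, y.shape)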

The Mechanics: How Trees "Think"

To follow how the tree actually works, we first need to agree on a few basic terms:

Term | Definition
Root & Internal Nodes | The "decision gates" where the data is filtered based on a feature (e.g., \( X_j < s \)).
Leaf Node | The terminal point of a branch. It contains the local mean of the observations in that region.
Depth | The number of splits from the root to the furthest leaf; a proxy for model complexity.
Greedy Search | The algorithm's strategy to find the best split at the current moment without looking ahead.
Recursive Binary Splitting | The process of repeatedly dividing the data into two until a stopping criterion is met.

These components work together to build a model that looks very different from the ones we typically see in econometrics. Instead of calculating a single effect for the entire dataset, we use these nodes to carve the data into smaller, more manageable pieces.

From Global Slopes to Local Partitions

In a standard linear regression (OLS), we assume a global functional form: \[ y = \beta_0 + \beta_1 X_1 + \varepsilon. \]

Decision trees throw out this model assumption. The method is instead non-parametric and uses Recursive Binary Splitting to partition the feature space into distinct, non-overlapping regions \( R_1, R_2,\dotsc, R_m \).

For any regression tree, the goal is to find a predictor \( j \) and a split point \( s \) that minimize the Residual Sum of Squares (RSS). We define the two resulting half-planes as: \[ R_1(j,s) = \{ X \mid X_j < s \} \quad \text{and} \quad R_2(j,s) = \{ X \mid X_j \geq s \} \]

The algorithm then searches for the \( j \) and \( s \) that minimize the following objective function \[ RSS = \sum_{i: x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i: x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2 \]

where \( \hat{y}_{R_1} \) and \( \hat{y}_{R_2} \) are simply the mean responses of the training observations in their respective regions. By minimizing the RSS at each gate, the tree ensures that the houses within each region are as similar in price as possible.
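
To make the objective tangible, here is a tiny worked sketch with made-up prices: we place the first three houses in \( R_1 \), the last three in \( R_2 \), and compute the RSS by hand with NumPy.

import numpy as np

# Made-up prices (10,000 NTD/Ping) for six houses, already sorted by one
# feature (say, distance to MRT).
y_toy = np.array([52.0, 48.0, 45.0, 30.0, 28.0, 25.0])

# Candidate split: first three houses in R1, last three in R2.
y_r1, y_r2 = y_toy[:3], y_toy[3:]
rss = np.sum((y_r1 - y_r1.mean())**2) + np.sum((y_r2 - y_r2.mean())**2)

print(round(rss, 1))  # 37.3 -- compare with ~442.8 if we split after the first house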

Implementation: Real Estate Data

Before we dive into the code, let’s look at what we are trying to build. A regression tree is essentially a series of binary decisions. For our real estate data, a simple tree might look like this:

Split: distance to the nearest MRT station < 826.83
├── [Yes] Split: house age < 11.70
│   ├── (LEAF) Region Prediction: 52.25
│   └── (LEAF) Region Prediction: 41.48
└── [No] Split: latitude < 24.98
    ├── (LEAF) Region Prediction: 23.37
    └── (LEAF) Region Prediction: 36.14

Let's walk through this prediction with an imaginary house, given these specs:

  • MRT Distance: 500
  • House Age: 5
  • Latitude: 24.90

Using these specs, the computer navigates the tree like a Choose Your Own Adventure book (a small code sketch of this walk follows the list):

  • Gate 1: Is MRT < 826.83? YES. (We move into the "Age" branch and ignore the "Latitude" branch entirely).
  • Gate 2: Is Age < 11.70? YES. (We hit a leaf).
  • Result: The final prediction is 52.25.
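
In code, that walk is nothing more than nested if statements. Here is a sketch that hard-codes the toy tree above; the thresholds and leaf values come straight from the diagram, the function name is purely illustrative, and the mapping of the two latitude leaves to the Yes/No branches is assumed from the order they appear in.

def predict_toy_tree(mrt_distance, house_age, latitude):
    """Hard-coded walk through the small example tree above."""
    if mrt_distance < 826.83:       # Gate 1: the MRT split
        if house_age < 11.70:       # Gate 2: the age split (left branch)
            return 52.25
        return 41.48
    if latitude < 24.98:            # Gate 2: the latitude split (right branch)
        return 23.37
    return 36.14

print(predict_toy_tree(mrt_distance=500, house_age=5, latitude=24.90))  # 52.25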

Why these specific numbers? The algorithm performs what we call a "greedy search" and finds the split that most effectively reduces the RSS; in other words, it looks for the cut that makes the houses within each resulting group as similar in price as possible.

As we increase the max_depth, the tree becomes more granular. While a shallow tree looks at broad categories (like "is it near a train?"), a deep tree will start looking at much more specific details (like "was this house sold after mid-2013?").

If we let the tree grow too deep, it starts memorizing its own training data and will not generalize to unseen data (overfitting). If, on the other hand, we do not let the tree grow complex enough, it fails to capture the structure that is actually in the data and will predict poorly (underfitting).

In a production environment we would not set max_depth this bluntly. Instead, we would fit the model many times over a grid of hyperparameter values and pick the best one using a method like k-fold cross validation. That tuning step is not the focus of this post, but a rough sketch of it follows below.
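
Such a search could look roughly like this with scikit-learn's DecisionTreeRegressor and GridSearchCV (rather than our manual implementation). The depth grid and the number of folds are arbitrary choices, and X, y are assumed to be the features and prices loaded earlier.

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Candidate depths to try; the grid itself is an arbitrary choice.
param_grid = {"max_depth": [2, 3, 4, 5, 6, 8, 10]}

search = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=0),
    param_grid=param_grid,
    cv=5,                              # 5-fold cross-validation
    scoring="neg_mean_squared_error",  # lower MSE is better
)
search.fit(X, y)

print(search.best_params_)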

Carving the Feature Space: Recursive Binary Splitting

To grow the tree, the computer will not just cut anywhere. Instead, it tries every possible way of cutting the data. This process is what we call recursive binary splitting.

For every node in our tree, the algorithm goes through a very rigorous three-step search:

  1. Iterative Scanning: Look at every feature \( \mathbf{x}_1, \mathbf{x}_2,\dotsc, \mathbf{x}_p \) and every possible threshold between the data points. For a feature like "House Age", it will try to split at 1 year, 1.5 years, 3 years and so on.
  2. Binary Partitioning: It then creates two temporary groups: a left group (values below the threshold) and a right group (values at or above it).
  3. Optimization: For each temporary split, the algorithm calculates the Residual Sum of Squares (RSS). This is done by finding the average price (mean) of the left group and the right group, then measuring the total squared error within those regions, exactly as in the RSS formula defined above.

The algorithm compares the RSS of the current split to the best_rss found so far. If the new RSS is lower, it "greedily" updates the split_info with the current feature and threshold, as this represents the most accurate division of the data found in the current scan.

Implementing this in Python, we have:

import numpy as np

def find_best_split(X_subset, y_subset):
    """Greedy search strategy for regression tree splitting."""
    best_rss = float("inf")
    split_info = None

    # For every feature in { x_1, x_2, ..., x_p }
    for feature_idx in range(X_subset.shape[1]):

        # 1. Iterative Scanning: Look at every possible threshold between data points.
        # For a feature like "House Age", we try every midpoint (1 year, 1.5 years, etc.)
        unique_vals = np.unique(X_subset[:, feature_idx]) 
        thresholds = (unique_vals[:-1] + unique_vals[1:]) / 2 

        for s in thresholds:

            # 2. Binary Partitioning: Create a temporary left group (below threshold)
            # and a right group (above threshold).
            left_mask = X_subset[:, feature_idx] < s
            right_mask = ~left_mask

            y_l, y_r = y_subset[left_mask], y_subset[right_mask]

            # 3. Optimization: Calculate the Residual Sum of Squares (RSS).
            # We find the mean of the left and right groups and measure the 
            # total squared error within those regions.
            m_l, m_r = np.mean(y_l), np.mean(y_r)
            rss = np.sum((y_l - m_l)**2) + np.sum((y_r - m_r)**2)

            # If this split has the lowest RSS seen so far, store it.
            if rss < best_rss:
                best_rss = rss
                split_info = {
                    'feature_idx': feature_idx,
                    'threshold': s,
                    'left_mask': left_mask,
                    'right_mask': right_mask,
                    'rss': rss,
                    'means': (m_l, m_r)
                }

    return split_info

Now that we know how to split a region, we can use our newly created function to actually grow the tree.

Growing the Tree

Since find_best_split only helps us carve the feature space into regions \( R_i \), the next function, build_tree, is required to actually build the tree using the information from the best splits. Growing the tree relies on logic known as recursion; if you're not familiar with recursion, it simply means that the function calls itself until a pre-defined condition is fulfilled.

In this case, we keep growing new branches until it is no longer possible to make a valid split. Once that condition is reached, the current node becomes a leaf, i.e., the terminal node on that path.

The first thing to declare in our recursive function build_tree is a break condition that stops the recursion if either of the following happens:

  1. The current_depth of our tree has reached the maximum we have pre-defined to accept (max_depth).
  2. All remaining observations in the node share the same response value, so there is nothing left to split.

def build_tree(X, y, current_depth, max_depth, features):
    if current_depth >= max_depth or len(np.unique(y)) <= 1:
        return {"is_leaf": True, "prediction": np.mean(y)}

Next, we use the previously defined find_best_split function to carve the feature space. We also need to check whether any valid split remains; if not, we simply return the final prediction at this terminal node:

def build_tree(X, y, current_depth, max_depth, features):
    if current_depth >= max_depth or len(np.unique(y)) <= 1:
        return {"is_leaf": True, "prediction": np.mean(y)}

    split = find_best_split(X.values, y.values)

    if not split:
        return {"is_leaf": True, "prediction": np.mean(y)}

And lastly, we implement the recursion itself. We do this by calling build_tree inside the function to build the left and right child nodes of the current node:

def build_tree(X, y, current_depth, max_depth, features):
    if current_depth >= max_depth or len(np.unique(y)) <= 1:
        return {"is_leaf": True, "prediction": np.mean(y)}

    split = find_best_split(X.values, y.values)

    if not split:
        return {"is_leaf": True, "prediction": np.mean(y)}

    feature = features[split['feature_idx']] # simply for identifying the feature name

    return { 
        "is_leaf": False, # This is a Decision Node, not a leaf
        "feature": feature,
        "threshold": split['threshold'],
        "left": build_tree(X[split['left_mask']], y[split['left_mask']], 
                           current_depth + 1, max_depth, features),
        "right": build_tree(X[split['right_mask']], y[split['right_mask']], 
                            current_depth + 1, max_depth, features)
    }
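
The dictionary that build_tree returns is only half the story: to value a house we also need to walk it from the root down to a leaf. Here is a minimal traversal sketch; the helper name predict_tree and the usage example are my own additions, assuming X, y and features from the data-loading sketch earlier.

def predict_tree(node, x_row):
    """Follow the decision gates from the root until a leaf is reached."""
    while not node["is_leaf"]:
        # Go left if the observation falls below the node's threshold.
        if x_row[node["feature"]] < node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["prediction"]

# Example usage (assuming X, y, features from earlier):
# tree = build_tree(X, y, current_depth=0, max_depth=3, features=features)
# print(predict_tree(tree, X.iloc[0]))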

Conclusion: From Single Trees to Ensembles

While the single tree we built is highly interpretable, it has a significant drawback: high variance. In this context, variance doesn't mean the algorithm is random—it is perfectly deterministic. Instead, it means the model is extremely sensitive to the specific training data it receives. A small difference in the dataset we feed to the model (like removing just a few houses) can make the splits, and therefore the predictions, entirely different.
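
One way to make this sensitivity visible is to refit the same tree on slightly thinned-out copies of the data and inspect the splits. A quick sketch, reusing build_tree and assuming X, y and features from earlier; the 5% drop rate and the seed are arbitrary choices.

import numpy as np

rng = np.random.default_rng(seed=42)

for run in range(2):
    # Drop roughly 5% of the rows at random and refit the tree.
    keep = rng.random(len(X)) > 0.05
    tree = build_tree(X[keep], y[keep], current_depth=0,
                      max_depth=3, features=features)
    print(f"Run {run}: root split on {tree['feature']} < {tree['threshold']:.2f}")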

To solve this, we move from a single tree to an Ensemble using a technique called Bootstrap Aggregation, or "Bagging".

The Logic of Bagging

In traditional econometrics, we might use a bootstrap to find the standard error of a mean—a concept I explored in a previous post regarding t-tests. In machine learning, we apply that same logic but treat the entire tree-building algorithm as the statistic we want to bootstrap.

The process follows three clear steps:

  1. Bootstrapping: We create \( B \) different datasets by taking random samples from our original data with replacement.
  2. Training: For each bootstrap sample, we grow a deep, unpruned regression tree. Because of the high variance we discussed, each sample produces a slightly different tree estimate, \( \hat{f}^{*b}(x) \).
  3. Aggregation: To get our final prediction, we average the results of all \( B \) trees:

\[ \hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^B \hat{f}^{*b}(x) \]
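
Putting the three steps together with the functions we already have, a bare-bones bagging sketch could look like this. It reuses build_tree and the predict_tree helper sketched above, assumes X, y and features from earlier, and the choices of B, max_depth and the seed are arbitrary.

import numpy as np

def bagged_predict(X, y, x_new, features, B=100, max_depth=6, seed=0):
    """Average the predictions of B trees, each grown on a bootstrap sample."""
    rng = np.random.default_rng(seed)
    n = len(X)
    preds = []

    for _ in range(B):
        # 1. Bootstrapping: draw n row indices with replacement.
        idx = rng.integers(0, n, size=n)
        X_b, y_b = X.iloc[idx], y.iloc[idx]

        # 2. Training: grow a deep tree on the bootstrap sample.
        tree = build_tree(X_b, y_b, current_depth=0,
                          max_depth=max_depth, features=features)

        # 3. Aggregation: collect this tree's prediction for the new house.
        preds.append(predict_tree(tree, x_new))

    return np.mean(preds)

# Example usage (commented out):
# print(bagged_predict(X, y, X.iloc[0], features, B=50))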

Why this works

By averaging many (high-variance) trees, we smooth out the idiosyncratic errors of any single tree while keeping the model's ability to find complex, non-linear patterns.

By shifting from one "best" tree to a "forest" of diverse, bagged trees, we trade a bit of simplicity for a massive leap in predictive stability. While more advanced algorithms like Random Forests take this a step further by de-correlating the trees (sampling features as well as observations), the foundational power comes from the Bagging process we've explored here.

Final Thoughts

We started this post by looking at how Distance to MRT and House Age impact property values. While a single tree provides a transparent, "Choose Your Own Adventure" look into those prices, its high sensitivity to the training data makes it a liability for stable inference.

By implementing Bagging, we move from relying on a single "lucky" split to a robust, averaged prediction.

If one understands the recursive logic of a single tree and the variance-reduction of bagging, then one has already conquered the steep part of the learning curve. More advanced methods are simply iterative refinements of these two concepts:

  • Random Forests: Bagging, but with an added layer of de-correlation by only allowing the tree to see a random subset of features at each split.
  • Boosting: Instead of growing trees in parallel (like bagging), trees are grown sequentially, with each new tree attempting to correct the errors of the previous one.

When you don't want to (or can't) impose a rigid parametric form like OLS, tree-based methods are no longer just "niche" tools—they are essential for capturing the non-linear realities of modern economic data.
