How to Effectively Split Data for Trading Backtesting and Machine Learning: A Step-by-Step Guide

The Data Split: A Trader’s Secret Weapon

A trader sits, eyes fixed on screens alive with price charts and indicators. What distinguishes the successful from the rest? Increasingly, it's not gut instinct, but the clever application of algorithms and machine learning. Yet these digital oracles are only as potent as the data they digest.

At the core of this digital alchemy lies a deceptively simple technique: splitting data into in-sample (IS) and out-of-sample (OOS) segments. It's akin to a chef's pre-service taste test—essential, not merely academic.

The IS segment serves as the model's training ground, the OOS its proving field. This approach wards off a common pitfall: models that recite history flawlessly but stumble when faced with the unknown.

Sloppy splitting carries risks. A model schooled only in bull markets might falter at the first whiff of bearish trends. Worse, overlapping IS and OOS data can spawn a mirage of accuracy—one that vanishes when real money is at stake.
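One way to guard against the overlap problem is to check for it before any model is fitted. The following is a minimal sketch, assuming both segments carry a pandas DatetimeIndex; the helper name `assert_no_leakage` and the synthetic timestamps are illustrative, not part of the original code.

```python
import pandas as pd


def assert_no_leakage(is_index: pd.DatetimeIndex, oos_index: pd.DatetimeIndex) -> None:
    """Raise ValueError if IS and OOS share timestamps or are out of order."""
    if len(is_index.intersection(oos_index)) > 0:
        raise ValueError("IS and OOS share timestamps")
    if is_index.max() >= oos_index.min():
        raise ValueError("IS must end before OOS begins")


# Synthetic 30-minute timestamps, for illustration only
idx = pd.date_range("2024-01-02 09:30", periods=10, freq="30min")
assert_no_leakage(idx[:8], idx[8:])  # clean split: no exception raised
```

Running a check like this on every IS/OOS pair is cheap, and it fails loudly instead of letting an inflated backtest slip through.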

One effective method is the rolling window. Imagine a clear frame that moves across your dataset, revealing only a portion at a time. At each position, the earlier part of the frame is treated as "past" (in-sample) and the later part as "future" (out-of-sample). Sliding the frame forward creates multiple IS/OOS pairs, each a snapshot of different market conditions. In the code below, the frame advances by its full length, so consecutive frames do not overlap.

This approach stress-tests a model across several market periods, which shows whether its performance holds up or depends on one favorable stretch of history. It's akin to training an athlete on different terrains (flat tracks, hills, and rough ground) to prepare them for any race condition.
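The sliding frame can be sketched with plain integer positions. This is an illustration under stated assumptions, not the article's implementation: the function name `rolling_windows` and the parameters `window` and `step` are hypothetical. When `step` equals `window`, the frames tile the data without overlap; a smaller `step` makes consecutive frames overlap.

```python
from typing import Iterator, Tuple


def rolling_windows(
    n_rows: int, window: int, is_fraction: float, step: int
) -> Iterator[Tuple[range, range]]:
    """Yield (IS positions, OOS positions) for each placement of the frame."""
    is_size = int(window * is_fraction)
    start = 0
    # Stop as soon as a full frame no longer fits in the data
    while start + window <= n_rows:
        yield (
            range(start, start + is_size),           # earlier part: in-sample
            range(start + is_size, start + window),  # later part: out-of-sample
        )
        start += step


for is_pos, oos_pos in rolling_windows(n_rows=100, window=50, is_fraction=0.8, step=25):
    print(f"IS {is_pos.start}-{is_pos.stop - 1}, OOS {oos_pos.start}-{oos_pos.stop - 1}")
```

Within every frame, the IS positions strictly precede the OOS positions, so each model is only ever evaluated on data that comes after its training period.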

Implementing this technique needn't be daunting. Python, that Swiss Army knife of data science, offers the necessary tools. The process distills to four steps:

1. Clean and focus the data on relevant trading hours.
2. Create splits using the sliding frame method.
3. Visualize the splits to spot any blind spots.
4. Generate a summary for quick reference.

Here's a snippet of Python code to illustrate the data loading and processing:

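import pandas as pd

class DataLoader:
    @staticmethod
    def load_and_process(file_path):
        # Simplified: assumes a single 'datetime' column and a 'close' column
        data = pd.read_csv(file_path, parse_dates=['datetime'], index_col='datetime')
        # Keep only bars between 09:30 and 14:00
        data = data.between_time('09:30', '14:00')
        # '30min' replaces the deprecated '30T' alias; dropna removes empty overnight bins
        data_30min = data.resample('30min').agg({'close': 'last'}).dropna()
        return data_30min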

And here's how you might create the splits:

import pandas as pd

class SplitCreator:
    @staticmethod
    def create_splits(data, num_splits, is_fraction, exclude_years):
        # Hold out the most recent `exclude_years` as the final OOS period
        cutoff = data.index[-1] - pd.DateOffset(years=exclude_years)
        split_data = data[data.index <= cutoff]
        final_oos = data[data.index > cutoff]

        split_size = len(split_data) // num_splits
        is_size = int(split_size * is_fraction)
        splits = []
        for i in range(num_splits):
            is_start = i * split_size
            oos_start = is_start + is_size
            oos_end = is_start + split_size
            # iloc slices exclude the end position, so IS and OOS never overlap
            splits.append((split_data.iloc[is_start:oos_start],
                           split_data.iloc[oos_start:oos_end]))
        return splits, final_oos

Data Overview

The data used in this example comes from the ES futures continuous contract, recorded in 1-minute bars over a substantial date range. An ideal period is at least 10 years.

To make this data more manageable, it has been filtered to the 09:30 to 14:00 window (exchange time) and resampled into 30-minute bars. Restricting the data to these hours keeps the overnight session, which typically trades at lower volume, out of model training and testing.
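To see what the filter and resample steps do, here is a small sketch on synthetic one-minute bars. The prices are a made-up random walk, not ES data, and only a `Close` column is modeled.

```python
import numpy as np
import pandas as pd

# Two synthetic days of 1-minute closes (random walk; not real ES data)
idx = pd.date_range("2024-01-02 00:00", "2024-01-03 23:59", freq="min")
rng = np.random.default_rng(0)
bars = pd.DataFrame({"Close": 5000 + rng.normal(0, 1, len(idx)).cumsum()}, index=idx)

# Keep 09:30-14:00 (between_time includes both endpoints by default)
session = bars.between_time("09:30", "14:00")

# Build right-labeled, right-closed 30-minute bars
bars_30min = (
    session.resample("30min", label="right", closed="right")
    .agg({"Close": "last"})
    .dropna()  # resample also emits empty overnight bins; drop them
)
print(bars_30min.head())
print(len(bars_30min), "bars over 2 days")
```

With right-closed bins, the bar labeled 09:30 contains only the single 09:30 minute, so each session produces ten bars labeled 09:30 through 14:00. Whether that first short bar is wanted depends on how the source file timestamps its bars.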

The resulting visualizations—a full data window showing the splits and a timeline of their distribution—serve as a map, highlighting potential biases or gaps in coverage.
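Gaps in coverage can also be checked numerically. The sketch below counts how many rows land in IS, in OOS, in the final OOS, and in none of them. The helper name `coverage_summary` and the synthetic data are illustrative, and the count assumes the splits do not overlap one another.

```python
import pandas as pd


def coverage_summary(data, splits, final_oos):
    """Count rows in IS, OOS, final OOS, and rows assigned to no segment.

    Assumes the (IS, OOS) pairs do not overlap each other.
    """
    n_is = sum(len(is_df) for is_df, _ in splits)
    n_oos = sum(len(oos_df) for _, oos_df in splits)
    used = n_is + n_oos + len(final_oos)
    return {
        "IS": n_is,
        "OOS": n_oos,
        "Final OOS": len(final_oos),
        "Unassigned": len(data) - used,
    }


# Synthetic example: 103 rows, three blocks of 30 (24 IS + 6 OOS), last 10 rows held out
idx = pd.date_range("2024-01-01", periods=103, freq="D")
data = pd.DataFrame({"Close": range(103)}, index=idx)
body, final_oos = data.iloc[:93], data.iloc[93:]
splits = [(body.iloc[s:s + 24], body.iloc[s + 24:s + 30]) for s in (0, 30, 60)]

print(coverage_summary(data, splits, final_oos))
```

Here three rows fall between the last block and the held-out period and are used by nothing. Equal-sized blocks produce such a remainder whenever the row count is not an exact multiple of the number of splits.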

Proper data splitting forms the foundation of robust trading strategies. It does two things: it guards against overfitting and overlapping data during model development, and it forces each model to prove itself on data it has never seen, across more than one market period. Getting the split right means thinking about which historical regimes the data covers, which future scenarios the model may face, and how volatile the market is.

As financial markets grow increasingly complex, those who master the intricacies of data preparation may find themselves at a distinct advantage. For algorithmic traders, success hinges on meticulous attention to detail, from data collection and preprocessing to model validation and continuous refinement.

Full Code

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.dates as mdates
from tabulate import tabulate
from typing import Tuple, List


class DataLoader:
    """Handles loading and preprocessing of financial data."""

    @staticmethod
    def load_and_process(file_path: str) -> pd.DataFrame:
        """
        Load data from file and process it into 30-minute bars.

        Args:
            file_path (str): Path to the data file.

        Returns:
            pd.DataFrame: Processed data in 30-minute bars.
        """
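        # Load data and combine the separate Date and Time columns into one timestamp
        # (passing a nested list to parse_dates is deprecated in recent pandas versions)
        data = pd.read_csv(file_path)
        data["Date_Time"] = pd.to_datetime(data["Date"] + " " + data["Time"])
        data = data.drop(columns=["Date", "Time"]).set_index("Date_Time")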

        # Filter for trading hours and resample to 30-minute bars
        data = data.between_time("09:30", "14:00")  # window stated in the Data Overview
        data_30min = (
            data.resample("30min", label="right", closed="right")  # "30T" alias is deprecated
            .agg(
                {
                    "Open": "first",
                    "High": "max",
                    "Low": "min",
                    "Close": "last",
                    "Up": "sum",
                    "Down": "sum",
                }
            )
            .dropna()
        )

        return data_30min


class SplitCreator:
    """Creates IS/OOS splits from financial data."""

    @staticmethod
    def create_splits(
        data: pd.DataFrame,
        num_splits: int = 10,
        is_fraction: float = 0.8,
        exclude_years: int = 2,
    ) -> Tuple[List[Tuple[pd.DataFrame, pd.DataFrame]], pd.DataFrame]:
        """
        Create IS/OOS splits from the data.

        Args:
            data (pd.DataFrame): Input data.
            num_splits (int): Number of splits to create.
            is_fraction (float): Fraction of each split to use as IS.
            exclude_years (int): Number of years to exclude for final OOS.

        Returns:
            Tuple[List[Tuple[pd.DataFrame, pd.DataFrame]], pd.DataFrame]:
                List of (IS, OOS) splits and final OOS data.
        """
        # Exclude the final OOS period
        cutoff_date = data.index[-1] - pd.DateOffset(years=exclude_years)
        split_data = data[data.index <= cutoff_date]

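        # Calculate sizes. Integer division means up to num_splits - 1 rows just
        # before the cutoff are not assigned to any split.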
        split_size = len(split_data) // num_splits
        is_size = int(split_size * is_fraction)
        oos_size = split_size - is_size

        # Create splits
        splits = []
        for i in range(num_splits):
            start_idx = i * split_size
            is_end_idx = start_idx + is_size
            oos_end_idx = min(is_end_idx + oos_size, len(split_data))

            is_data = split_data.iloc[start_idx:is_end_idx]
            oos_data = split_data.iloc[is_end_idx:oos_end_idx]
            splits.append((is_data, oos_data))

        return splits, data[data.index > cutoff_date]


class Visualizer:
    """Handles visualization of financial data and splits."""

    @staticmethod
    def plot_full_data_with_splits(
        data: pd.DataFrame,
        splits: List[Tuple[pd.DataFrame, pd.DataFrame]],
        final_oos: pd.DataFrame,
        is_fraction: float,
    ):
        """
        Plot full data window with IS/OOS splits.

        Args:
            data (pd.DataFrame): Full dataset.
            splits (List[Tuple[pd.DataFrame, pd.DataFrame]]): List of IS/OOS splits.
            final_oos (pd.DataFrame): Final OOS data.
            is_fraction (float): Fraction of each split used as IS.
        """
        plt.figure(figsize=(20, 10))

        # Plot full dataset
        plt.plot(
            data.index.to_numpy(),
            data["Close"].to_numpy(),
            color="lightgray",
            alpha=0.5,
            label="Full Dataset",
        )

        colors = plt.cm.rainbow(np.linspace(0, 1, len(splits)))

        for i, (is_data, oos_data) in enumerate(splits):
            color = colors[i]
            # Plot IS data with color
            plt.plot(
                is_data.index.to_numpy(),
                is_data["Close"].to_numpy(),
                color=color,
                label=f"IS Split {i+1}",
            )
            # Plot OOS data in gray
            plt.plot(
                oos_data.index.to_numpy(),
                oos_data["Close"].to_numpy(),
                color="gray",
                alpha=0.7,
                linestyle="--",
            )

        # Plot final OOS data in black
        plt.plot(
            final_oos.index.to_numpy(),
            final_oos["Close"].to_numpy(),
            color="black",
            alpha=0.7,
            label="Final OOS",
        )

        plt.xlabel("Date")
        plt.ylabel("Close Price")
        plt.title(
            f"Full Data Window with Rolling IS/OOS Splits (30-min bars, {len(splits)} splits, {is_fraction:.0%} IS)"
        )

        # Customize legend. The OOS lines carry no label, so they are absent from
        # get_legend_handles_labels(); take the first OOS line from the axes instead.
        ax = plt.gca()
        handles, _ = ax.get_legend_handles_labels()
        is_handles = handles[1 : len(splits) + 1]  # skip "Full Dataset"
        oos_handle = ax.get_lines()[2]  # draw order: full dataset, IS 1, OOS 1, ...
        final_oos_handle = handles[-1]  # Final OOS handle
        custom_handles = is_handles + [oos_handle, final_oos_handle]
        custom_labels = [f"IS Split {i+1}" for i in range(len(splits))] + [
            "OOS Splits",
            "Final OOS",
        ]
        plt.legend(
            custom_handles, custom_labels, loc="center left", bbox_to_anchor=(1, 0.5)
        )

        plt.tight_layout()
        plt.show()

    @staticmethod
    def plot_splits_timeline(
        splits: List[Tuple[pd.DataFrame, pd.DataFrame]], final_oos: pd.DataFrame
    ):
        """
        Plot timeline of IS/OOS splits.

        Args:
            splits (List[Tuple[pd.DataFrame, pd.DataFrame]]): List of IS/OOS splits.
            final_oos (pd.DataFrame): Final OOS data.
        """
        fig, ax = plt.subplots(figsize=(15, 8))

        colors = plt.cm.rainbow(np.linspace(0, 1, len(splits)))

        for i, (is_data, oos_data) in enumerate(splits):
            # Plot IS period
            ax.barh(
                i,
                (is_data.index[-1] - is_data.index[0]).days,
                left=mdates.date2num(is_data.index[0]),
                height=0.5,
                color=colors[i],
                alpha=0.7,
                label=f"IS Split {i+1}",
            )

            # Plot OOS period
            ax.barh(
                i,
                (oos_data.index[-1] - oos_data.index[0]).days,
                left=mdates.date2num(oos_data.index[0]),
                height=0.5,
                color="gray",
                alpha=0.5,
            )

        # Plot Final OOS period
        ax.barh(
            len(splits),
            (final_oos.index[-1] - final_oos.index[0]).days,
            left=mdates.date2num(final_oos.index[0]),
            height=0.5,
            color="black",
            alpha=0.7,
            label="Final OOS",
        )

        # Customize the plot
        ax.set_yticks(range(len(splits) + 1))
        ax.set_yticklabels([f"Split {i+1}" for i in range(len(splits))] + ["Final OOS"])
        ax.invert_yaxis()  # To have Split 1 at the top

        ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))
        ax.xaxis.set_major_locator(mdates.YearLocator(2))  # Show every 2 years

        plt.title("IS/OOS Splits Timeline")
        plt.xlabel("Date")
        plt.legend(loc="center left", bbox_to_anchor=(1, 0.5))
        plt.tight_layout()
        plt.show()


class TableGenerator:
    """Generates tabular representation of splits."""

    @staticmethod
    def generate_splits_table(
        splits: List[Tuple[pd.DataFrame, pd.DataFrame]], final_oos: pd.DataFrame
    ) -> str:
        """
        Generate a table of IS/OOS splits.

        Args:
            splits (List[Tuple[pd.DataFrame, pd.DataFrame]]): List of IS/OOS splits.
            final_oos (pd.DataFrame): Final OOS data.

        Returns:
            str: Formatted table as a string.
        """
        table_data = []
        for i, (is_data, oos_data) in enumerate(splits):
            table_data.append(
                [
                    f"Split {i+1}",
                    is_data.index[0].strftime("%Y-%m-%d %H:%M"),
                    is_data.index[-1].strftime("%Y-%m-%d %H:%M"),
                    oos_data.index[0].strftime("%Y-%m-%d %H:%M"),
                    oos_data.index[-1].strftime("%Y-%m-%d %H:%M"),
                ]
            )

        # Add final OOS as the last row
        table_data.append(
            [
                "Final OOS",
                "-",
                "-",
                final_oos.index[0].strftime("%Y-%m-%d %H:%M"),
                final_oos.index[-1].strftime("%Y-%m-%d %H:%M"),
            ]
        )

        headers = ["Split", "IS Start", "IS End", "OOS Start", "OOS End"]
        return tabulate(table_data, headers=headers, tablefmt="grid")


def main():
    """Main function to orchestrate the data processing and visualization."""
    # Configuration
    file_path = (
        r"..\ES\es.1.minute.24h-11970101-20240712.txt"
    )
    num_splits = 10
    is_fraction = 0.8
    exclude_years = 2

    # Load and process data
    data_30min = DataLoader.load_and_process(file_path)

    # Create splits
    splits, final_oos = SplitCreator.create_splits(
        data_30min, num_splits, is_fraction, exclude_years
    )

    # Visualize data
    Visualizer.plot_full_data_with_splits(data_30min, splits, final_oos, is_fraction)
    Visualizer.plot_splits_timeline(splits, final_oos)

    # Generate and print table
    table = TableGenerator.generate_splits_table(splits, final_oos)
    print("\nIS/OOS Splits Table:")
    print(table)


if __name__ == "__main__":
    main()
