Reusing data for ML? Hash your data before you create the train-test split

The best way to make sure the training and test sets are never mixed while updating the data set.

Recently, I was reading Aurélien Géron’s Hands-On Machine Learning with Scikit-Learn, Keras and TensorFlow (2nd edition) and it made me realize that there might be an issue with the way we approach the train-test split while preparing data for machine learning models. In this article, I quickly demonstrate what the issue is and show an example of how to fix it.

Illustrating the issue

I want to say upfront that the issue I mentioned is not always a problem per se; it all depends on the use case. While preparing data for training and evaluation, we normally split it using a function such as Scikit-Learn’s train_test_split. To make sure that the results are reproducible, we pass the random_state argument, so no matter how many times we split the same data set, we will always get the very same train-test split. And in that sentence lies the potential issue I mentioned before, particularly in the phrase the same data set.

Imagine a case in which you build a model predicting customer churn. You received satisfactory results, your model is already in production and generating added value for the company. Great work! However, after some time there might be new patterns among the customers (for example, a global pandemic changed user behavior), or you might simply have gathered much more data as new customers joined the company. Whatever the reason, you might want to retrain the model and use the new data for both training and validation.

And this is exactly when the issue appears. When you use the good old train_test_split on the new data set (all of the old observations plus the new ones gathered since training), there is no guarantee that the observations you trained on in the past will still be used for training, and the same goes for the test set. I will illustrate this with an example in Python:

# import the libraries 
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from zlib import crc32

# generate the first DataFrame
X_1 = pd.DataFrame(data={"variable": np.random.normal(size=1000)})

# apply the train-test split
X_1_train, X_1_test = train_test_split(X_1, test_size=0.2, random_state=42)

# add new observations to the DataFrame
X_2 = pd.concat([X_1, pd.DataFrame(data={"variable": np.random.normal(size=500)})]).reset_index(drop=True)

# again, apply the train-test split to the updated DataFrame
X_2_train, X_2_test = train_test_split(X_2, test_size=0.2, random_state=42)

# see what is the overlap of indices
print(f"Train set: {len(set(X_1_train.index).intersection(set(X_2_train.index)))}")
print(f"Test set: {len(set(X_1_test.index).intersection(set(X_2_test.index)))}")

# Train set: 669
# Test set: 59

First, I generated a DataFrame with 1000 random observations. I applied the 80-20 train-test split using a random_state to ensure the results are reproducible. Then, I created a new DataFrame by adding 500 observations to the end of the initial one (resetting the index is important for keeping track of the observations in this case!). Once again, I applied the train-test split and then investigated how many observations from the initial sets actually appear in the second ones. For that, I used the handy intersection method of a Python set. The answer is 669 out of 800 and 59 out of 200. This clearly shows that the data was reshuffled.

What are the potential dangers of such an issue? It all depends on the volume of data, but in an unfortunate random draw, all the new observations could end up in only one of the sets and not contribute much to proper model fitting. Even though such a case is unlikely, the more probable cases of uneven distribution between the sets are not desirable either. On top of that, observations the model has already seen during training can end up in the new test set, which means the evaluation scores may be overly optimistic. Hence, it would be better to distribute the new data evenly across both sets, while keeping the original observations assigned to their respective sets.

Solving the issue

So how can we solve this issue? One possibility would be to allocate the observations to the training and test sets based on a certain unique identifier. We can calculate the hash of each observation’s identifier using some kind of hashing function, and if the value is smaller than x% of the maximum possible hash value, we put that observation into the test set. Otherwise, it belongs to the training set.

You can see an example solution (based on the one presented by Aurélien Géron in his book) in the following function, which uses the CRC32 algorithm. I will not go into the details of the algorithm here. There are good explanations available of why CRC32 can serve well as a hashing function and what drawbacks it has (mostly in terms of security, which is not a problem for our use case). The function follows the logic described in the paragraph above, where 2³² is the maximum value of this hashing function:

def hashed_train_test_split(df, index_col, test_size=0.2):
    """
    Train-test split based on the hash of the unique identifier.
    """
    # hash each identifier with CRC32 and flag the observation for the
    # test set whenever the hash falls below test_size * 2**32
    hashes = df[index_col].apply(lambda x: crc32(np.int64(x)))
    in_test_set = hashes < test_size * 2**32

    return df.loc[~in_test_set], df.loc[in_test_set]

Note: The function above will work for Python 3. To adjust it for Python 2, we should follow crc32’s documentation and use it as follows: crc32(data) & 0xffffffff.
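For completeness, here is a minimal sketch of how that adjustment might look inside the function (this snippet is my own illustration of the note above, not code from the article):

# Python 2 compatible version of the hashing step:
# masking with 0xffffffff keeps the CRC32 result unsigned
hashes = df[index_col].apply(lambda x: crc32(np.int64(x)) & 0xffffffff)
in_test_set = hashes < test_size * 2**32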

Before testing the function in practice, it is really important to mention that the identifier you hash should be unique and immutable. For this particular implementation, it also needs to be numeric (though the function can be extended to handle strings relatively easily, as sketched below).
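For instance, one possible way to cover string identifiers (this variant is my own sketch, not code from the article or the book) is to hash the UTF-8 encoded string instead of the 64-bit integer:

def hashed_train_test_split_str(df, index_col, test_size=0.2):
    """
    Variant of the split for string identifiers (e.g. e-mail addresses).
    CRC32 accepts any bytes-like object, so we hash the encoded string.
    """
    hashes = df[index_col].apply(lambda x: crc32(str(x).encode("utf-8")))
    in_test_set = hashes < test_size * 2**32

    return df.loc[~in_test_set], df.loc[in_test_set]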

In our toy example, we can safely use the row ID as a unique identifier, because we only append new observations at the very end of the initial DataFrame and never delete any rows. However, this is something to be aware of when using the approach in more complex cases. In the churn scenario, a good identifier might be the customer’s unique number, as by design those numbers only increase and are never reused.
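To make that concrete, here is a quick sketch of applying the same function to a customer table (the customer_id column and the feature are made up purely for illustration):

# hypothetical customer table with an ever-increasing, unique customer_id
customers = pd.DataFrame({
    "customer_id": np.arange(10_000, 11_000),
    "monthly_spend": np.random.gamma(shape=2.0, scale=50.0, size=1000),
})

# split on the hashed customer_id instead of the row index
cust_train, cust_test = hashed_train_test_split(customers, "customer_id")
print(len(cust_train), len(cust_test))  # roughly an 80-20 split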

To confirm that the function is doing what we want it to do, we once again run the test scenario as shown above. This time, for both DataFrames we use the hashed_train_test_split function:

# create an index column (should be immutable and unique)
X_1 = X_1.reset_index(drop=False)
X_2 = X_2.reset_index(drop=False)

# apply the improved train-test split
X_1_train_hashed, X_1_test_hashed = hashed_train_test_split(X_1, "index")
X_2_train_hashed, X_2_test_hashed = hashed_train_test_split(X_2, "index")

# see what is the overlap of indices
print(f"Train set: {len(set(X_1_train_hashed.index).intersection(set(X_2_train_hashed.index)))}")
print(f"Test set: {len(set(X_1_test_hashed.index).intersection(set(X_2_test_hashed.index)))}")

# Train set: 800
# Test set: 200

By using the hashed unique identifier for the allocation, we achieved perfect overlap for both the training and test sets: every observation from the initial split stayed in its respective set after the data set was extended.
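As an extra sanity check (this snippet is my addition, not part of the original walkthrough), we can also verify that the 500 new observations were spread across both sets rather than all landing in one of them:

# indices of the 500 observations that were appended in X_2
new_idx = set(range(1000, 1500))

print(f"New observations in the train set: {len(new_idx.intersection(X_2_train_hashed.index))}")
print(f"New observations in the test set: {len(new_idx.intersection(X_2_test_hashed.index))}")

# the two counts sum to 500, with roughly 80% of the new data in the train set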

Conclusions

In this article, I showed how to use hashing functions to improve the default behavior of the train-test split. The described issue is not very apparent to many data scientists, as it mostly occurs when retraining ML models on new, updated data sets. It is therefore rarely mentioned in textbooks, and one does not come across it while playing with example data sets, even the ones from Kaggle competitions. And as I mentioned before, it might not even be an issue for you, as it really depends on the use case. However, I do believe that one should be aware of it and know how to fix it if the need arises.

Author: Eryk Lewinson

Source: Towards Data Science