Why you should not use randomSplit in PySpark to split data into train and test.

Sergei Ivanov
3 min read · Aug 3, 2022

If you work with large-scale data and want to prepare a dataset for your TensorFlow/PyTorch model, don't use the randomSplit function to split the data into train and test.

The problem

You have a PySpark dataframe and you would like to split it into two dataframes, train and test. Obviously, you want both parts to be shuffled. The behavior you expect is similar to train_test_split in sklearn, which assigns rows to train or test in random order.
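For reference, this is the kind of behavior sklearn gives you (a minimal sketch; the dataframe, column name and seed here are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# A small pandas dataframe, just for illustration
df = pd.DataFrame({"x": range(100)})

# Rows are assigned to train/test in random order; neither split comes back sorted
train, test = train_test_split(df, test_size=0.2, random_state=42)
print(train.head())
```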

However, PySpark's randomSplit first sorts each partition and only then samples the splits. Spark does this to make the row ordering deterministic, since otherwise re-materializing the dataframe could produce overlapping splits; this is explained in a comment in the Spark source code for randomSplit.

Example

In this example we create a dataframe with a column x and then shuffle this column.
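Something along these lines (a sketch rather than the exact code from the post; the column name x comes from the text, while the size and seed are arbitrary):

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A dataframe with a single column x, shuffled by ordering on a random value
df = spark.range(100).select(F.col("id").alias("x"))
df = df.orderBy(F.rand(seed=42))

df.show(5)  # x appears in random order, as expected after the shuffle
```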

We would expect the rows to still be in random order after randomSplit. However, that's not the case: the values of x come back in sorted order (within each partition).
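Continuing the sketch above (the split weights and seed are again arbitrary):

```python
# Split the shuffled dataframe into train and test
train, test = df.randomSplit([0.8, 0.2], seed=42)

# Despite the earlier shuffle, x now shows up in ascending order,
# because randomSplit sorted each partition before sampling the rows
train.show(5)
test.show(5)
```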


Sergei Ivanov

Machine Learning research scientist with a focus on Graph Machine Learning and recommendations. t.me/graphML