Why you should not use randomSplit in PySpark to split data into train and test.
If you work with large-scale data and want to prepare a dataset for your TensorFlow/PyTorch model, don’t use the randomSplit function to split the data into train and test sets.
The problem
You have a PySpark dataframe and you would like to split it into two dataframes, train and test. Naturally, you want both parts to be shuffled. You expect behavior similar to train_test_split in sklearn, which assigns rows to train or test in random order.
However, PySpark’s randomSplit first sorts the rows within each partition and only then samples the splits. This behavior is documented in a comment in the Spark source code.
Example
In this example we create a dataframe with a column x and then shuffle this column.
We would expect that after randomSplit we would also see a random order of this column. However, that is not the case: in the resulting splits, the column x comes back in sorted order.