7.6.4. mclearn.preprocessing.balanced_train_test_split

mclearn.preprocessing.balanced_train_test_split(X, y, test_size=None, train_size=None, bootstrap=False, random_state=None)

Split the data into a balanced training set and test set of some given size.

For a dataset with an unequal number of samples in each class, one useful procedure is to split the data into a training set and a test set in such a way that the classes are balanced.

Parameters:
  • X (array, shape = [n_samples, n_features]) – Feature matrix.
  • y (array, shape = [n_samples]) – Target vector.
  • test_size (float, int or None (default=None)) – If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of train_size. If train_size is also None, test_size is set to 0.3.
  • train_size (float, int or None (default=None)) – If float, should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is set to the complement of test_size.
  • random_state (int, optional (default=None)) – Pseudo-random number generator state used for random sampling.
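
The size rules described for test_size and train_size can be sketched as follows. This is a minimal illustration, not the library's code: resolve_split_sizes is a hypothetical helper, the rounding of float proportions is an assumption, and the bootstrap option is not modelled.

```python
def resolve_split_sizes(n_samples, test_size=None, train_size=None):
    """Hypothetical helper: turn test_size/train_size into absolute counts."""

    def to_count(size):
        # Floats are read as a proportion of the dataset, ints as a count.
        # Rounding to the nearest integer is an assumption of this sketch.
        if isinstance(size, float):
            return int(round(size * n_samples))
        return size

    # Documented fallback: if neither size is given, test_size becomes 0.3.
    if test_size is None and train_size is None:
        test_size = 0.3

    n_test = to_count(test_size) if test_size is not None else None
    n_train = to_count(train_size) if train_size is not None else None

    # A missing size is the complement of the one that was given.
    if n_test is None:
        n_test = n_samples - n_train
    if n_train is None:
        n_train = n_samples - n_test
    return n_train, n_test


sizes_default = resolve_split_sizes(1000)                  # neither size given
sizes_int_test = resolve_split_sizes(1000, test_size=200)  # absolute test count
sizes_both = resolve_split_sizes(1000, test_size=0.25, train_size=500)
```

When both sizes are given they need not add up to the full dataset, in which case the remaining samples are simply left out of both sets.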
Returns:
  • X_train (array) – The feature vectors (stored as columns) in the training set.
  • X_test (array) – The feature vectors (stored as columns) in the test set.
  • y_train (array) – The target vector in the training set.
  • y_test (array) – The target vector in the test set.
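
To make the idea of a balanced split concrete, the sketch below draws the same number of samples from every class for each set. It is an illustration under stated assumptions and not the implementation in mclearn: balanced_split_sketch and its per-class count arguments are hypothetical, plain Python lists stand in for arrays, and sampling is without replacement (the bootstrap option is not modelled).

```python
import random
from collections import defaultdict


def balanced_split_sketch(X, y, n_train_per_class, n_test_per_class, seed=None):
    """Illustrative only: draw the same number of samples from every class."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(y):
        by_class[label].append(idx)

    needed = n_train_per_class + n_test_per_class
    train_idx, test_idx = [], []
    for label, indices in by_class.items():
        if len(indices) < needed:
            raise ValueError(f"class {label!r} has only {len(indices)} samples")
        # Sample without replacement so train and test never overlap.
        chosen = rng.sample(indices, needed)
        train_idx.extend(chosen[:n_train_per_class])
        test_idx.extend(chosen[n_train_per_class:])

    # Shuffle so the classes are not grouped together in the output.
    rng.shuffle(train_idx)
    rng.shuffle(test_idx)

    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


# Imbalanced toy data: 90 samples of class "a" and 10 of class "b".
X = [[float(i), float(i) * 2.0] for i in range(100)]
y = ["a"] * 90 + ["b"] * 10
X_train, X_test, y_train, y_test = balanced_split_sketch(X, y, 6, 4, seed=0)
```

Although class "a" outnumbers class "b" nine to one in the input, both sets returned here contain equal numbers of each class. The smallest class limits how large a balanced split can be, which is why the sketch raises an error when a class has too few samples.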