openfl.utilities.data_splitters

openfl.utilities.data_splitters package.

class openfl.utilities.data_splitters.DataSplitter

Base class for data splitting.

This class should be subclassed when creating specific data splitter classes.

abstract split(data: Iterable[T], num_collaborators: int) → List[Iterable[T]]

Split the data into a specified number of parts.

Parameters:
  • data (Iterable[T]) – The data to be split.

  • num_collaborators (int) – The number of parts to split the data into.

Returns:

List[Iterable[T]] – The split data.

Raises:

NotImplementedError – This is an abstract method and must be overridden in a subclass.
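The subclassing contract described above can be sketched as follows. To stay runnable without OpenFL installed, the example defines a stand-in abstract class that mirrors the documented signature. The stand-in and the `RoundRobinSplitter` subclass are hypothetical illustrations, not part of the library.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, TypeVar

T = TypeVar("T")


class DataSplitter(ABC):
    """Stand-in mirroring the documented interface (not the OpenFL class)."""

    @abstractmethod
    def split(self, data: Iterable[T], num_collaborators: int) -> List[Iterable[T]]:
        raise NotImplementedError


class RoundRobinSplitter(DataSplitter):
    """Hypothetical subclass: deals items to collaborators in turn."""

    def split(self, data: Iterable[T], num_collaborators: int) -> List[List[T]]:
        parts: List[List[T]] = [[] for _ in range(num_collaborators)]
        for i, item in enumerate(data):
            parts[i % num_collaborators].append(item)
        return parts


parts = RoundRobinSplitter().split(range(7), 3)
print(parts)  # [[0, 3, 6], [1, 4], [2, 5]]
```

In real code the subclass would presumably derive from `openfl.utilities.data_splitters.DataSplitter` instead of the stand-in.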

class openfl.utilities.data_splitters.DirichletNumPyDataSplitter(alpha=0.5, min_samples_per_col=10, seed=0)

Class for splitting numpy arrays of data according to a Dirichlet distribution.

Repeatedly draws random split sizes from a Dirichlet distribution until the smallest resulting subset reaches the specified threshold (min_samples_per_col). This is a parametrized version of the non-i.i.d. split used in the FedMA algorithm. Original source: https://github.com/IBM/FedMA/blob/master/utils.py#L96

Parameters:
  • alpha (float, optional) – Dirichlet distribution parameter. Defaults to 0.5.

  • min_samples_per_col (int, optional) – Minimum number of samples per collaborator. Defaults to 10.

  • seed (int, optional) – Random number generator seed. Defaults to 0.

split(data, num_collaborators)

Split the data.
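A minimal sketch of the Dirichlet-based split described above, assuming `data` is a one-dimensional array of integer class labels and that the result is one list of sample indices per collaborator. The function name `dirichlet_split` is hypothetical, and the capping step follows the referenced FedMA-style approach. This is an illustration, not the library's implementation.

```python
import numpy as np


def dirichlet_split(labels, num_collaborators, alpha=0.5,
                    min_samples_per_col=10, seed=0):
    """Illustrative sketch: returns index lists, one per collaborator."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n = len(labels)
    min_size = 0
    # Redraw the whole split until every collaborator has enough samples.
    while min_size < min_samples_per_col:
        parts = [[] for _ in range(num_collaborators)]
        for cls in np.unique(labels):
            idx = np.flatnonzero(labels == cls)
            rng.shuffle(idx)
            # Share of this class that each collaborator receives.
            props = rng.dirichlet(np.full(num_collaborators, alpha))
            # Collaborators already holding a fair share (n / num_collaborators)
            # receive nothing from this class.
            props = props * np.array([len(p) < n / num_collaborators for p in parts])
            props = props / props.sum()
            # Convert cumulative shares into cut positions within idx.
            cuts = (np.cumsum(props) * len(idx)).astype(int)[:-1]
            for part, chunk in zip(parts, np.split(idx, cuts)):
                part.extend(chunk.tolist())
        min_size = min(len(p) for p in parts)
    return parts


labels = np.repeat(np.arange(4), 50)  # 4 classes, 50 samples each
parts = dirichlet_split(labels, num_collaborators=4)
print([len(p) for p in parts])
```

Smaller values of `alpha` concentrate each class on fewer collaborators. Note that the loop may run for a long time if `min_samples_per_col` is large relative to the dataset size.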

class openfl.utilities.data_splitters.EqualNumPyDataSplitter(shuffle=True, seed=0)

Class for splitting numpy arrays of data evenly.

Parameters:
  • shuffle (bool, optional) – Flag determining whether to shuffle the dataset before splitting. Defaults to True.

  • seed (int, optional) – Random number generator seed. Defaults to 0.

split(data, num_collaborators)

Split the data.
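As a rough illustration of an even split, the sketch below optionally shuffles the sample indices and then cuts them into near-equal parts with `np.array_split`. The function name `equal_split` and the choice to return index lists are assumptions made for this example.

```python
import numpy as np


def equal_split(data, num_collaborators, shuffle=True, seed=0):
    """Illustrative sketch: near-equal index lists, one per collaborator."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(data))
    if shuffle:
        idx = rng.permutation(idx)
    # array_split tolerates lengths that are not evenly divisible.
    return [part.tolist() for part in np.array_split(idx, num_collaborators)]


parts = equal_split(np.zeros(10), 3, shuffle=False)
print(parts)  # [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

When the dataset size is not divisible by the number of collaborators, the first parts receive one extra item each.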

class openfl.utilities.data_splitters.LogNormalNumPyDataSplitter(mu, sigma, num_classes, classes_per_col, min_samples_per_class, seed=0)

Class for splitting numpy arrays of data according to a log-normal distribution.

Unbalanced (log-normal) dataset split. This split assumes that only a few classes are assigned to each collaborator.

First, it assigns classes_per_col * min_samples_per_class dataset items to each collaborator, so that every collaborator has some data after the split. Then it draws positive integers from a log-normal distribution. Each number is the count of additional items taken from the dataset and assigned to a collaborator. The draw is repeated for each class assigned to a collaborator.

This is a parametrized version of the non-i.i.d. data split used in the FedProx algorithm. Original source: https://github.com/litian96/FedProx/blob/master/data/mnist/generate_niid.py#L30

Parameters:
  • mu (float) – Distribution hyperparameter.

  • sigma (float) – Distribution hyperparameter.

  • num_classes (int) – Number of classes.

  • classes_per_col (int) – Number of classes assigned to each collaborator.

  • min_samples_per_class (int) – Minimum number of samples of each assigned class per collaborator.

  • seed (int, optional) – Random number generator seed. Defaults to 0.

Note

This split always drops part of the dataset: only a randomly selected subset of each class's items is assigned to collaborators.

split(data, num_collaborators)

Split the data.

Parameters:
  • data (np.ndarray) – NumPy-like label array.

  • num_collaborators (int) – Number of collaborators to split the data across. Should be divisible by the number of classes in data.
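The two-stage procedure described for this class can be sketched as below. This is a simplified illustration under stated assumptions, not the library's code: the function name `lognormal_split` is hypothetical, classes are assigned to collaborator `col` by the rule `(col + k) % num_classes`, and each log-normal draw is truncated to an integer count of extra items.

```python
import numpy as np


def lognormal_split(labels, num_collaborators, mu, sigma, num_classes,
                    classes_per_col, min_samples_per_class, seed=0):
    """Illustrative sketch: returns index lists, one per collaborator."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Per-class pool of sample indices and a cursor of how many are used.
    pools = [np.flatnonzero(labels == c) for c in range(num_classes)]
    used = [0] * num_classes
    parts = [[] for _ in range(num_collaborators)]

    def take(cls, count):
        """Take up to `count` unused indices of class `cls`."""
        start = used[cls]
        chunk = pools[cls][start:start + count]
        used[cls] = start + len(chunk)
        return chunk.tolist()

    # Stage 1: guaranteed minimum for every class assigned to a collaborator.
    for col in range(num_collaborators):
        for k in range(classes_per_col):
            parts[col] += take((col + k) % num_classes, min_samples_per_class)

    # Stage 2: one log-normally sized extra chunk per (collaborator, class).
    for col in range(num_collaborators):
        for k in range(classes_per_col):
            extra = int(rng.lognormal(mu, sigma))
            parts[col] += take((col + k) % num_classes, extra)
    return parts


labels = np.repeat(np.arange(4), 100)  # 4 classes, 100 samples each
parts = lognormal_split(labels, num_collaborators=4, mu=1.0, sigma=0.5,
                        num_classes=4, classes_per_col=2,
                        min_samples_per_class=5)
print([len(p) for p in parts])
```

Items left in the per-class pools after both stages are never assigned, which is why a split of this kind drops part of the dataset.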

class openfl.utilities.data_splitters.NumPyDataSplitter

Base class for splitting numpy arrays of data.

This class should be subclassed when creating specific data splitter classes.

abstract split(data: ndarray, num_collaborators: int) → List[List[int]]

Split the data.

class openfl.utilities.data_splitters.RandomNumPyDataSplitter(shuffle=True, seed=0)

Class for splitting numpy arrays of data randomly.

Parameters:
  • shuffle (bool, optional) – Flag determining whether to shuffle the dataset before splitting. Defaults to True.

  • seed (int, optional) – Random number generator seed. Defaults to 0.

split(data, num_collaborators)

Split the data.
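One plausible reading of a random split is to cut the (optionally shuffled) index sequence at randomly chosen positions, which yields parts of unequal size. The sketch below assumes that interpretation. The function name `random_split` is hypothetical and the code is not taken from the library.

```python
import numpy as np


def random_split(data, num_collaborators, shuffle=True, seed=0):
    """Illustrative sketch: index lists of random sizes, one per collaborator."""
    rng = np.random.default_rng(seed)
    idx = np.arange(len(data))
    if shuffle:
        idx = rng.permutation(idx)
    # Pick num_collaborators - 1 distinct cut positions and sort them.
    cuts = np.sort(rng.choice(len(data), num_collaborators - 1, replace=False))
    return [part.tolist() for part in np.split(idx, cuts)]


parts = random_split(np.zeros(20), 4)
print([len(p) for p in parts])
```

Because position 0 is a valid cut point in this sketch, the first part can be empty. A caller that needs every collaborator to hold data would have to check for that case.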

data_splitter

openfl.utilities.data_splitters.data_splitter module.

numpy

UnbalancedFederatedDataset module.