openfl.utilities.data_splitters.numpy.LogNormalNumPyDataSplitter

class openfl.utilities.data_splitters.numpy.LogNormalNumPyDataSplitter(mu, sigma, num_classes, classes_per_col, min_samples_per_class, seed=0)

Bases: NumPyDataSplitter

Class for splitting numpy arrays of data according to a LogNormal distribution.

Unbalanced (log-normal) dataset split. This split assumes that each collaborator is assigned only a few classes. First, it assigns classes_per_col * min_samples_per_class dataset items to each collaborator, so that every collaborator has some data after the split. Then it draws positive integers from a log-normal distribution. Each integer is the number of dataset items picked from the dataset and assigned to a collaborator, and the draw is repeated for each class assigned to that collaborator. This is a parametrized version of the non-i.i.d. data split used in the FedProx algorithm. Original source: litian96/FedProx
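The two phases described above can be sketched in plain NumPy. This is a simplified illustration, not the OpenFL implementation: the function name lognormal_split_sketch, the round-robin rule for assigning classes to collaborators, and the rounding of each draw are all assumptions made for the example.

```python
import numpy as np


def lognormal_split_sketch(labels, num_collaborators, mu, sigma,
                           classes_per_col, min_samples_per_class, seed=0):
    """Toy two-phase split; returns one array of sample indices per collaborator."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    # Shuffled pool of not-yet-assigned sample indices for every class.
    pools = {c: list(rng.permutation(np.flatnonzero(labels == c))) for c in classes}
    shards = [[] for _ in range(num_collaborators)]

    def take(cls, count):
        # Remove up to `count` indices from the class pool and return them.
        taken, pools[cls] = pools[cls][:count], pools[cls][count:]
        return taken

    # Assumed rule: collaborator i is assigned classes i, i+1, ... (wrapping around).
    assigned = [[classes[(i + j) % len(classes)] for j in range(classes_per_col)]
                for i in range(num_collaborators)]

    # Phase 1: every collaborator gets min_samples_per_class items of each assigned class.
    for i, cls_list in enumerate(assigned):
        for cls in cls_list:
            shards[i] += take(cls, min_samples_per_class)

    # Phase 2: top up each (collaborator, class) pair with a log-normal sized batch.
    for i, cls_list in enumerate(assigned):
        for cls in cls_list:
            extra = int(np.ceil(rng.lognormal(mu, sigma)))  # rounded up, so at least 1
            shards[i] += take(cls, extra)

    return [np.array(s, dtype=int) for s in shards]


# Hypothetical label array: 4 classes with 100 samples each.
labels = np.repeat(np.arange(4), 100)
shards = lognormal_split_sketch(labels, num_collaborators=4, mu=1.0, sigma=1.0,
                                classes_per_col=2, min_samples_per_class=5)
print([len(s) for s in shards])
```

In this sketch, any indices still left in the pools after phase 2 are never assigned to a collaborator, which mirrors the documented behavior that part of the dataset is dropped.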

Parameters:
  • mu (float) – Distribution hyperparameter.

  • sigma (float) – Distribution hyperparameter.

  • num_classes (int) – Number of classes.

  • classes_per_col (int) – Number of classes assigned to each collaborator.

  • min_samples_per_class (int) – Minimum number of collaborator samples of each class.

  • seed (int, optional) – Random number generator seed. Defaults to 0.
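To get a feel for the two distribution hyperparameters, the following sketch draws batch sizes directly with NumPy. It assumes that mu and sigma are the mean and standard deviation of the underlying normal distribution, as in numpy.random.lognormal; the helper draw_counts and the rounding step are hypothetical and exist only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def draw_counts(mu, sigma, size=10_000):
    # Round log-normal draws up so that every count is a positive integer.
    return np.ceil(rng.lognormal(mean=mu, sigma=sigma, size=size)).astype(int)


narrow = draw_counts(mu=3.0, sigma=0.1)
wide = draw_counts(mu=3.0, sigma=1.5)

# The median of a log-normal variable is exp(mu); sigma controls how far
# the right tail extends, and therefore how unbalanced the draws are.
print("narrow:", np.median(narrow), narrow.max())
print("wide:  ", np.median(wide), wide.max())
```

Under this assumption, a larger sigma gives a few collaborators much larger shares than the rest, while mu shifts the typical batch size.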

Note

This split always drops part of the dataset! Only a randomly chosen subset of each class's items is assigned to collaborators.

__init__(mu, sigma, num_classes, classes_per_col, min_samples_per_class, seed=0)

Initialize the generator.

Parameters:
  • mu (float) – Distribution hyperparameter.

  • sigma (float) – Distribution hyperparameter.

  • num_classes (int) – Number of classes.

  • classes_per_col (int) – Number of classes assigned to each collaborator.

  • min_samples_per_class (int) – Minimum number of collaborator samples of each class.

  • seed (int) – Random number generator seed. Defaults to 0. To get different splits on different envoys, try setting a different value for this parameter in each shard descriptor.

Methods

__init__(mu, sigma, num_classes, ...[, seed])

Initialize the generator.

split(data, num_collaborators)

Split the data.

split(data, num_collaborators)

Split the data.

Parameters:
  • data (np.ndarray) – NumPy-like array of labels.

  • num_collaborators (int) – Number of collaborators to split the data across. Should be divisible by the number of classes in data.
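The divisibility requirement on num_collaborators can be checked before calling split. The helper below is a hypothetical pre-flight check, not part of OpenFL, and it assumes the number of classes can be inferred from the distinct values in the label array.

```python
import numpy as np


def check_num_collaborators(labels, num_collaborators):
    """Return True when num_collaborators is a multiple of the class count."""
    num_classes = np.unique(labels).size
    return num_collaborators % num_classes == 0


# Hypothetical label array with three classes.
labels = np.array([0, 1, 2, 0, 1, 2])
print(check_num_collaborators(labels, 6))
print(check_num_collaborators(labels, 4))
```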