Verifiable Datasets and Data Sources

To accommodate the proliferation of data sources and the need for trusted datasets, OpenFL provides a hierarchy of utility classes for building and verifying datasets. The hierarchy is extensible and supports creating datasets from various data sources, such as the local file system and object storage.

The central abstraction is the VerifiableDatasetInfo class that encapsulates the dataset’s metadata and provides a method for verifying the integrity of the dataset. A dataset can be built from multiple data sources (not necessarily of the same type):

%% Copyright 2025 Intel Corporation
%% SPDX-License-Identifier: Apache-2.0

classDiagram
    class VerifiableDatasetInfo {
        +label: str
        +data_sources: DataSource[] 
        +metadata: dict[str, str] 
        +root_hash: HASH 

        +verify_dataset(root_hash: HASH)
        +verify_single_file(file_path: str, file_hash: HASH)
        +to_json() str
        +from_json(json_str: str) VerifiableDatasetInfo
    }

    class DataSource {
        <<abstract>>
        +name: str 
        +type: DataSourceType 
        +compute_file_hash(path: str) str
        +enumerate_files() Generator~str~
        +read_blob(path: str) bytes
        +from_dict(ds_dict: dict) DataSource
        +is_valid_hash_function(func) bool
        +to_dict() dict
    }

    class LocalDataSource {
        +base_path: str
        ...
    }

    class S3DataSource {
        +uri: str 
        +endpoint: str
        ...
    }

    class AzureBlobDataSource {
        +name: str 
        +container_string: str 
        +folder_prefix: str
        ...
    }

    VerifiableDatasetInfo "1" o-- "*" DataSource
    DataSource <|-- LocalDataSource
    DataSource <|-- S3DataSource
    DataSource <|-- AzureBlobDataSource

    style VerifiableDatasetInfo fill:#FFFFE0,stroke:#000,stroke-width:1px
    style DataSource fill:#FFFFE0,stroke:#000,stroke-width:1px
    style LocalDataSource fill:#FFFFE0,stroke:#000,stroke-width:1px
    style S3DataSource fill:#FFFFE0,stroke:#000,stroke-width:1px
    style AzureBlobDataSource fill:#FFFFE0,stroke:#000,stroke-width:1px

    

Verifiable Dataset with Multiple Data Sources
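The relationship in the diagram can be illustrated with a small, self-contained sketch. It is not OpenFL's implementation: `InMemoryDataSource`, `compute_root_hash`, the choice of SHA-256, and the way per-file hashes are folded into a root hash are all assumptions made for illustration. Only the method names `enumerate_files`, `read_blob`, and `compute_file_hash` are taken from the diagram.

```python
import hashlib
from abc import ABC, abstractmethod
from typing import Dict, Iterator, List


class DataSource(ABC):
    """Hypothetical stand-in for the abstract data source in the diagram."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def enumerate_files(self) -> Iterator[str]:
        """Yield the relative path of every file in this source."""

    @abstractmethod
    def read_blob(self, path: str) -> bytes:
        """Return the raw bytes stored at `path`."""

    def compute_file_hash(self, path: str) -> str:
        # Assumption: a SHA-256 hex digest of the file content.
        return hashlib.sha256(self.read_blob(path)).hexdigest()


class InMemoryDataSource(DataSource):
    """Toy source backed by a dict; stands in for local, S3, or Azure storage."""

    def __init__(self, name: str, files: Dict[str, bytes]) -> None:
        super().__init__(name)
        self._files = files

    def enumerate_files(self) -> Iterator[str]:
        # Sorted order keeps the root hash deterministic.
        yield from sorted(self._files)

    def read_blob(self, path: str) -> bytes:
        return self._files[path]


def compute_root_hash(sources: List[DataSource]) -> str:
    """Fold every (source name, path, file hash) record into one digest."""
    root = hashlib.sha256()
    for source in sources:
        for path in source.enumerate_files():
            record = f"{source.name}:{path}:{source.compute_file_hash(path)}\n"
            root.update(record.encode("utf-8"))
    return root.hexdigest()


def verify_dataset(sources: List[DataSource], expected_root_hash: str) -> bool:
    """Return True if the sources still hash to the trusted root hash."""
    return compute_root_hash(sources) == expected_root_hash
```

In this sketch a dataset owner would compute the root hash once, distribute it through a trusted channel, and have each consumer call `verify_dataset` before use. Changing a single byte in any file of any source changes the root hash.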

The VerifiableDatasetInfo class can then be used to create higher-order dataset classes that iterate through multiple data sources and, if required, verify integrity along the way. The root_hash serves as the integrity reference when loading items from the data sources in the VerifiableDatasetInfo object.
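One way to picture verification at load time is the sketch below. It assumes a manifest that maps each file path to its expected hash and that is itself checked against the trusted root hash; the source does not say how OpenFL derives or stores these values, so `manifest_root_hash`, `VerifyingLoader`, `IntegrityError`, and the use of SHA-256 are hypothetical.

```python
import hashlib
from typing import Callable, Dict


class IntegrityError(Exception):
    """Raised when data does not match its recorded hash."""


def manifest_root_hash(manifest: Dict[str, str]) -> str:
    """Digest a {path: file hash} manifest in a stable (sorted) order."""
    digest = hashlib.sha256()
    for path in sorted(manifest):
        digest.update(f"{path}:{manifest[path]}\n".encode("utf-8"))
    return digest.hexdigest()


class VerifyingLoader:
    """Loads blobs and, if requested, checks each one against the manifest."""

    def __init__(
        self,
        read_blob: Callable[[str], bytes],
        manifest: Dict[str, str],
        root_hash: str,
        verify: bool = True,
    ) -> None:
        # The manifest is only trusted if it reproduces the root hash.
        if verify and manifest_root_hash(manifest) != root_hash:
            raise IntegrityError("manifest does not match the trusted root hash")
        self._read_blob = read_blob
        self._manifest = manifest
        self._verify = verify

    def load(self, path: str) -> bytes:
        data = self._read_blob(path)
        if self._verify:
            actual = hashlib.sha256(data).hexdigest()
            if actual != self._manifest.get(path):
                raise IntegrityError(f"hash mismatch for {path!r}")
        return data
```

Checking each item as it is loaded avoids hashing the whole dataset up front, at the cost of one hash computation per access. Passing `verify=False` in this sketch skips both checks, mirroring the "if required" wording above.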

OpenFL comes with a toolbox of dataset layout classes per ML framework. For PyTorch’s torch.utils.data.Dataset, OpenFL currently provides:

  • FolderDataset - represents an iterable folder-layout dataset from a single data source, by implementing the __getitem__ method.

  • ImageFolder - a specialization of the FolderDataset that is able to load binary images from a folder-like structure.

  • VerifiableMapStyleDataset - a base class for map-style datasets that can be built from multiple data sources (as specified by a VerifiableDatasetInfo object), including integrity checks.

  • VerifiableImageFolder - a specialization of the VerifiableMapStyleDataset encapsulating a collection of ImageFolder datasets.
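The split between a folder-layout base class and a specialization that knows how to decode a file can be sketched as follows. This is plain Python with no PyTorch dependency, and it is not OpenFL's code: `FolderDatasetSketch` and `RawBytesFolder` are hypothetical names, and deriving the label from the parent folder is an assumption borrowed from common image-folder layouts.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, List, Tuple


class FolderDatasetSketch(ABC):
    """Map-style dataset over the files of a single data source."""

    def __init__(self, file_paths: List[str], read_blob: Callable[[str], bytes]) -> None:
        # Sorting gives every index a stable meaning across runs.
        self._paths = sorted(file_paths)
        self._read_blob = read_blob

    def __len__(self) -> int:
        return len(self._paths)

    def __getitem__(self, index: int) -> Any:
        # Indexing is generic; decoding is delegated to the subclass.
        return self.load_file(self._paths[index])

    @abstractmethod
    def load_file(self, file_path: str) -> Any:
        """Turn one file into a sample."""


class RawBytesFolder(FolderDatasetSketch):
    """Returns (raw bytes, label), with the label taken from the parent folder."""

    def load_file(self, file_path: str) -> Tuple[bytes, str]:
        label = file_path.split("/")[0]
        return self._read_blob(file_path), label
```

An image-oriented subclass would decode the bytes into an image inside `load_file` instead of returning them unchanged; the indexing logic in the base class would stay the same.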

Note that all of these classes (directly or indirectly) extend torch.utils.data.Dataset, and are therefore compatible with the PyTorch utilities for loading and pre-processing datasets, such as torch.utils.data.DataLoader. A similar class hierarchy can be created for other ML frameworks that offer dataset utilities, such as TensorFlow.

%% Copyright 2025 Intel Corporation
%% SPDX-License-Identifier: Apache-2.0

classDiagram
    class torch_utils_data_Dataset {
        +__len__() int
        +__getitem__(index: int) Any
    }

    class VerifiableDatasetInfo {
        +verify_dataset(root_hash: HASH)
        +verify_single_file(file_path: str, file_hash: HASH)
        +from_json(json_str: str) VerifiableDatasetInfo
    }

    class VerifiableMapStyleDataset {
        <<abstract>>
        +__len__() int
        +__getitem__(index: int) Any
        +create_datasets() void*
    }

    class VerifiableImageFolder {
        +__len__() int
        +__getitem__(index: int) Any
        +create_datasets() void
    }

    class FolderDataset {
        <<abstract>>
        +__len__() int
        +__getitem__(index: int) Any
        +load_file(file_path: str) void*
    }

    class ImageFolder {
        +__len__() int
        +__getitem__(index: int) Any
        +load_file(file_path: str) void
    }

    torch_utils_data_Dataset <|.. VerifiableMapStyleDataset
    torch_utils_data_Dataset <|.. FolderDataset
    VerifiableMapStyleDataset o-- VerifiableDatasetInfo
    VerifiableMapStyleDataset <|-- VerifiableImageFolder
    VerifiableMapStyleDataset o-- FolderDataset
    FolderDataset <|-- ImageFolder

    style torch_utils_data_Dataset fill:#D3D3D3,stroke:#000,stroke-width:1px
    style VerifiableDatasetInfo fill:#FFFFE0,stroke:#000,stroke-width:1px

    

Dataset hierarchy
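The aggregation in the diagram, where one map-style dataset wraps a collection of per-source datasets, can be sketched by mapping a global index onto the right inner dataset. The class below is a hypothetical illustration in plain Python; it implements only the `__len__` and `__getitem__` protocol shown in the diagram and omits the integrity checks. In a PyTorch setting it would subclass torch.utils.data.Dataset.

```python
from bisect import bisect_right
from itertools import accumulate
from typing import Any, List, Sequence


class MultiSourceMapDataset:
    """Presents several per-source datasets as one indexable dataset."""

    def __init__(self, datasets: List[Sequence[Any]]) -> None:
        self._datasets = datasets
        # Cumulative sizes, e.g. lengths [2, 1, 2] give offsets [2, 3, 5].
        self._offsets = list(accumulate(len(d) for d in datasets))

    def __len__(self) -> int:
        return self._offsets[-1] if self._offsets else 0

    def __getitem__(self, index: int) -> Any:
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        # First dataset whose cumulative size exceeds the global index.
        which = bisect_right(self._offsets, index)
        start = self._offsets[which - 1] if which else 0
        return self._datasets[which][index - start]
```

Binary search over the cumulative sizes keeps lookups at O(log k) for k data sources, and empty sources are skipped naturally because they add no range of indices.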

A practical example of a VerifiableImageFolder backed by an S3DataSource is provided in the s3_histology workspace template.