OpenFL Troubleshooting

The following is a list of commonly reported issues in Open Federated Learning (OpenFL). If you don’t see your issue reported here, please submit a GitHub issue or contact us directly on Slack.

  1. I see the error Cannot import name TensorFlowDataLoader from openfl.federated

    OpenFL currently uses conditional imports to remain framework agnostic. If your task runner is derived from KerasTaskRunner or TensorflowTaskRunner, this error can come up when TensorFlow* is not installed in the collaborator’s virtual environment. For multi-node experiments, we recommend using the fx workspace export and fx workspace import commands, as this ensures consistent modules between the aggregator and collaborators.
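
    A minimal sketch of that workflow is below (the archive name is a placeholder; fx workspace export prints the name of the file it actually creates, and the --archive flag may differ between OpenFL versions):

        # On the node where the workspace was built (e.g., the aggregator)
        fx workspace export

        # Copy the generated .zip to each collaborator node, then import it there
        fx workspace import --archive WORKSPACE_NAME.zip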

  2. None of the collaborators can connect to my aggregator node

    There are a few reasons this can happen, but the most common is that the aggregator node’s FQDN (fully qualified domain name) was incorrectly specified in the plan. By default, fx plan initialize attempts to resolve the FQDN for you (it should look something like hostname.domain.com), but this resolution can sometimes produce an incorrect domain name.

    If you face this issue, look at agg_addr in plan/plan.yaml and verify that you can access this address externally. If the address is externally accessible and you are running OpenFL in an enterprise environment, verify that the aggregator’s listening port is not blocked. In such cases, agg_port should be manually specified in the FL plan and then redistributed to all participants.
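
    A quick way to check both name resolution and port reachability from a collaborator node is shown below (the hostname and port are placeholders; substitute the agg_addr and agg_port values from your plan/plan.yaml):

        # Confirm the aggregator FQDN resolves from the collaborator node
        nslookup aggregator.example.com

        # Confirm the aggregator's listening port is reachable (requires netcat)
        nc -vz aggregator.example.com 55555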

  3. After starting the collaborator, I see the error Handshake failed with fatal error SSL_ERROR_SSL

    This error likely results from a bad certificate presented by the collaborator. Steps for regenerating the collaborator certificate can be found here.
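
    An illustrative sketch of the typical regeneration flow is below (the flags and archive names are written from memory and may differ between OpenFL versions; treat the linked steps as authoritative):

        # On the collaborator: generate a new certificate signing request
        fx collaborator generate-cert-request -n COLLABORATOR_NAME

        # Copy the request package to the aggregator and sign it there
        fx collaborator certify --request-pkg col_COLLABORATOR_NAME_to_agg_cert_request.zip

        # Copy the signed certificate package back to the collaborator and import it
        fx collaborator certify --import agg_to_col_COLLABORATOR_NAME_signed_cert.zip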

  4. I am seeing some other error while running the experiment. Is there more verbose logging available so I can investigate this on my own?

    Yes! You can turn on verbose logging with fx -l DEBUG collaborator start or fx -l DEBUG aggregator start. This produces detailed output about gRPC, bidirectional tensor transfer, and compression.
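
    For example, to capture the debug output for later inspection (the tee redirection and log file names are just one option, not an OpenFL requirement):

        # The -l flag precedes the subcommand
        fx -l DEBUG aggregator start 2>&1 | tee aggregator_debug.log
        fx -l DEBUG collaborator start 2>&1 | tee collaborator_debug.log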

  5. Silent failures resulting from out-of-memory (OOM) errors

    Observations:
    • The fx envoy command terminates abruptly during the training or validation loop because the kernel issues a SIGKILL (typically from the OOM killer).

    • The OOM error is captured in the kernel trace but not at the user-program level, so the process exits silently.

    • The failure is likely due to suboptimal memory utilization in older PyTorch versions (1.3.1 and 1.4.0).

    Solution:
    • Recent versions of PyTorch handle runtime memory utilization better. Upgrade PyTorch to version 1.11.0 or later.
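
    To confirm that the kernel's OOM killer is the culprit and to upgrade PyTorch, something like the following can be used (the grep pattern and package specifier are illustrative):

        # Check the kernel log for OOM-killer activity around the time of the failure
        dmesg | grep -i -E "out of memory|killed process"

        # Upgrade PyTorch inside the affected virtual environment
        pip install --upgrade "torch>=1.11.0"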