I see the error `Cannot import name TensorFlowDataLoader from openfl.federated`
OpenFL currently uses conditional imports to remain framework agnostic. If your task runner is derived from `KerasTaskRunner` or `TensorflowTaskRunner`, this error can come up when TensorFlow* was not installed in your collaborator's virtual environment. If you are running a multi-node experiment, we recommend using the `fx workspace export` and `fx workspace import` commands, as this ensures consistent modules between the aggregator and collaborators.
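A quick way to confirm the root cause on a collaborator machine is to check whether TensorFlow is importable there at all. The snippet below is a minimal, standard-library-only sketch; the guarded import mirrors the general conditional-import pattern described above, and the names (`tensorflow_available`, `TF_INSTALLED`) are illustrative, not part of OpenFL's internal layout:

```python
import importlib.util

def tensorflow_available() -> bool:
    """Return True if TensorFlow can be located in this environment."""
    return importlib.util.find_spec("tensorflow") is not None

# The same guarded-import pattern a framework-agnostic package can use:
try:
    import tensorflow  # noqa: F401
    TF_INSTALLED = True
except ImportError:
    TF_INSTALLED = False
```

Running this in the collaborator's virtual environment tells you immediately whether the missing-module error is an environment problem rather than a code problem.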
None of the collaborators can connect to my aggregator node
There are a few reasons this can happen, but the most common is that the aggregator node's FQDN (fully qualified domain name) was incorrectly specified in the plan. By default, `fx plan initialize` will attempt to resolve the FQDN for you (it should look something like `hostname.domain.com`), but it can sometimes resolve an incorrect domain name.
If you face this issue, look at `agg_addr` in plan/plan.yaml and verify that you can access this address externally. If the address is externally accessible and you are running OpenFL in an enterprise environment, verify that the aggregator's listening port is not blocked. In that case, `agg_port` should be specified manually in the FL plan and then redistributed to all participants.
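Both checks (that `agg_addr` resolves from a collaborator machine, and that a TCP connection to `agg_port` succeeds) can be run with a small stand-alone script. This is a sketch using only the standard library; the helper name and timeout are ours, not part of OpenFL:

```python
import socket

def aggregator_reachable(agg_addr: str, agg_port: int, timeout: float = 5.0) -> bool:
    """Return True if agg_addr resolves and a TCP connection to agg_port succeeds."""
    try:
        socket.getaddrinfo(agg_addr, agg_port)
    except socket.gaierror:
        return False  # the FQDN does not resolve from this machine
    try:
        with socket.create_connection((agg_addr, agg_port), timeout=timeout):
            return True
    except OSError:
        return False  # resolved, but the port is blocked or nothing is listening

# What `fx plan initialize` would typically resolve on this machine:
print(socket.getfqdn())
```

If resolution fails, fix `agg_addr`; if resolution succeeds but the connection fails, suspect a firewall and set `agg_port` explicitly as described above.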
After starting the collaborator, I see the error `Handshake failed with fatal error SSL_ERROR_SSL`
This error likely results from a bad certificate presented by the collaborator. Steps for regenerating the collaborator certificate can be found here.
I am seeing some other error while running the experiment. Is there more verbose logging available so I can investigate this on my own?
Yes! You can turn on verbose logging with `fx -l DEBUG collaborator start` or `fx -l DEBUG aggregator start`. This gives verbose information related to gRPC, bidirectional tensor transfer, and compression.
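In addition to the `fx` log level, gRPC itself honors documented runtime environment variables that surface low-level transport detail; export them in the shell before starting the node. The final `fx` invocation is shown commented out because it assumes an initialized OpenFL workspace:

```shell
# Raise gRPC's own logging via its documented environment variables.
export GRPC_VERBOSITY=debug
export GRPC_TRACE=tcp,http,api

# Then start the node with OpenFL's verbose flag, e.g.:
# fx -l DEBUG collaborator start
echo "GRPC_VERBOSITY=$GRPC_VERBOSITY GRPC_TRACE=$GRPC_TRACE"
```

This is often the fastest way to distinguish a transport-level failure (TLS, TCP) from an application-level one.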
Silent failures resulting from Out of Memory errors
The `fx envoy` command terminates abruptly during the training or validation loop due to a SIGKILL issued by the kernel. The OOM error is captured in the kernel trace but not at the user-program level. The failure is likely due to non-optimal memory utilization in older PyTorch versions (1.3.1 and 1.4.0); recent versions of PyTorch handle memory utilization at runtime much better. Upgrade PyTorch to version 1.11.0 or later.
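A quick way to confirm whether an environment meets the >= 1.11.0 recommendation is to compare the installed version string against a minimum. The helper below is a simple sketch that assumes release-style version strings (e.g. `1.13.1` or `1.13.1+cu117`); the function name is ours:

```python
def version_at_least(version_str: str, minimum=(1, 11, 0)) -> bool:
    """Compare a PyTorch-style version string against a minimum (major, minor, patch)."""
    core = version_str.split("+")[0]                 # drop local build suffix, e.g. "+cu117"
    parts = tuple(int(p) for p in core.split(".")[:3])
    return parts >= minimum

# Example usage; the torch import is guarded so this also runs where PyTorch is absent:
try:
    import torch
    print("PyTorch >= 1.11.0:", version_at_least(torch.__version__))
except ImportError:
    print("PyTorch is not installed in this environment")
```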