.. # Copyright (C) 2020-2023 Intel Corporation
.. # SPDX-License-Identifier: Apache-2.0

.. _troubleshooting:

*******************************************************
|productName| Troubleshooting
*******************************************************

The following is a list of commonly reported issues in Open Federated Learning (|productName|). If you don't see your issue reported here, please submit a `GitHub issue `_ or contact us directly on `Slack `_.

1. **I see the error** :code:`Cannot import name TensorFlowDataLoader from openfl.federated`

   |productName| uses conditional imports to stay framework agnostic. If your task runner is derived from :code:`KerasTaskRunner` or :code:`TensorflowTaskRunner`, this error can appear when TensorFlow\* is not installed in your collaborator's virtual environment. If you are running a multi-node experiment, we recommend using the :code:`fx workspace export` and :code:`fx workspace import` commands, which ensure that the aggregator and the collaborators use consistent modules.

2. **None of the collaborators can connect to my aggregator node**

   There are a few reasons this can happen, but the most common is that the aggregator node's fully qualified domain name (FQDN) was incorrectly specified in the plan. By default, :code:`fx plan initialize` attempts to resolve the FQDN for you (it should look something like :code:`hostname.domain.com`), but this can sometimes yield an incorrect domain name. If you face this issue, look at :code:`agg_addr` in **plan/plan.yaml** and verify that you can access this address externally. If the address is externally accessible and you are running |productName| in an enterprise environment, verify that the aggregator's listening port is not blocked. In such cases, :code:`agg_port` should be manually specified in the FL plan, and the plan should then be redistributed to all participants.

3. **After starting the collaborator, I see the error** :code:`Handshake failed with fatal error SSL_ERROR_SSL`

   This error likely results from a bad certificate presented by the collaborator. Steps for regenerating the collaborator certificate can be found :ref:`here `.

4. **I am seeing some other error while running the experiment. Is there more verbose logging available so I can investigate this on my own?**

   Yes! You can turn on verbose logging with :code:`fx -l DEBUG collaborator start` or :code:`fx -l DEBUG aggregator start`. This produces detailed output about gRPC, bidirectional tensor transfer, and compression.

5. **Silent failures resulting from Out of Memory errors**

   Observations:

   * The :code:`fx envoy` command terminates abruptly during the training or validation loop because the kernel sends it a SIGKILL signal.
   * The OOM error is captured in the kernel trace but not at the user program level.
   * The failure is likely due to non-optimal memory utilization in the older PyTorch versions 1.3.1 and 1.4.0.

   Solution:

   * Recent versions of PyTorch handle memory utilization better at runtime. Upgrade PyTorch to version 1.11.0 or later.
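
Some of the checks described in issues 1, 2, and 5 can be scripted and run in a participant's virtual environment before an experiment starts. The following is a minimal sketch using only the Python standard library. It is not part of |productName|: the function names are hypothetical, and the host name and port in the usage section are placeholders to be replaced with the :code:`agg_addr` and :code:`agg_port` values from **plan/plan.yaml**.

.. code-block:: python

   import importlib.util
   import re
   import socket


   def module_available(name):
       """Return True if the top-level module `name` can be found for import."""
       return importlib.util.find_spec(name) is not None


   def can_reach(host, port, timeout=3.0):
       """Return True if a plain TCP connection to host:port succeeds.

       DNS failures, refused connections, and timeouts all count as
       unreachable. TLS certificates are not checked here.
       """
       try:
           with socket.create_connection((host, port), timeout=timeout):
               return True
       except OSError:
           return False


   def version_at_least(version, minimum):
       """Compare the leading numeric parts of `version` with the tuple `minimum`."""
       # Drop a local build suffix such as "+cu117" before extracting numbers.
       numbers = re.findall(r"\d+", version.split("+")[0])[:len(minimum)]
       parsed = tuple(int(n) for n in numbers)
       # Pad short versions such as "1.11" with zeros so tuples compare evenly.
       parsed += (0,) * (len(minimum) - len(parsed))
       return parsed >= tuple(minimum)


   if __name__ == "__main__":
       # Placeholder values (assumptions): use agg_addr and agg_port from your plan.
       agg_addr, agg_port = "aggregator.example.com", 50051

       print("tensorflow importable:", module_available("tensorflow"))
       print("aggregator reachable:", can_reach(agg_addr, agg_port))

       if module_available("torch"):
           from importlib.metadata import version
           print("torch >= 1.11.0:", version_at_least(version("torch"), (1, 11, 0)))

A successful :code:`can_reach` result only shows that the port accepts TCP connections from the machine running the script. It does not validate certificates, so it cannot rule out the handshake failure described in issue 3.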