Exporting MLflow Experiments from Restricted HPC Systems
Many High-Performance Computing (HPC) environments, especially in research and educational institutions, restrict outbound TCP connections. A quick ping or curl of the MLflow tracking URL from the HPC login shell may succeed, but the same communication fails and times out once jobs are running on the compute nodes.
This makes it impossible to track and manage experiments on a remote MLflow server. I faced this issue and built a workaround that avoids direct communication altogether. We will focus on:
Setting up a local MLflow server on the HPC, on a free port, with local directory storage.
Using the local tracking URL while running machine learning experiments.
Exporting the experiment data to a local temporary folder.
Transferring the experiment data from the local temp folder on the HPC to the remote MLflow server.
Importing the experiment data into the remote MLflow server's databases.
I have deployed Charmed MLflow (MLflow server, MySQL, MinIO) using Juju, with the whole thing hosted locally on MicroK8s. You can find the installation guide from Canonical here.
Prerequisites
Make sure you have Python loaded on your HPC and installed on your MLflow server. Throughout this article, I assume Python 3.12; adjust the commands for your version.
On HPC:
1) Create a virtual environment
python3 -m venv mlflow
source mlflow/bin/activate
2) Install MLflow
pip install mlflow
On both HPC and MLflow Server:
1) Install mlflow-export-import
pip install git+https://github.com/mlflow/mlflow-export-import/#egg=mlflow-export-import
On HPC:
1) Decide on a port where you want the local MLflow server to run. You can use the command below to check that the port is free (the output should not contain any process IDs):
lsof -i :<port-number>
2) Set the environment variable for applications that want to use MLflow:
export MLFLOW_TRACKING_URI=http://localhost:<port-number>
3) Start the MLflow server using the below command:
mlflow server \
--backend-store-uri file:///path/to/local/storage/mlruns \
--default-artifact-root file:///path/to/local/storage/mlruns \
--host 0.0.0.0 \
--port <port-number>
Here, we point both the backend store and the artifact root at a local folder called mlruns. Metadata (experiments, runs, parameters, metrics, tags) and artifacts (model files, loss curves, and other images) are all stored inside the mlruns directory. The host can be 0.0.0.0 or 127.0.0.1 (more secure); since the whole process is short-lived, I went with 0.0.0.0. Finally, assign a port number that is not used by any other application.
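With the local server running and MLFLOW_TRACKING_URI pointing at it, your training code logs experiments exactly as it would against a remote server. Below is a minimal sketch; the experiment name, parameter, and metric values are placeholders, and you should replace <port-number> with the port you chose:
import mlflow

# Redundant if MLFLOW_TRACKING_URI is already exported, but explicit here for clarity.
mlflow.set_tracking_uri("http://localhost:<port-number>")
mlflow.set_experiment("my-hpc-experiment")  # placeholder experiment name

with mlflow.start_run(run_name="baseline"):
    mlflow.log_param("learning_rate", 1e-3)
    mlflow.log_metric("val_loss", 0.42, step=1)
    # mlflow.log_artifact("loss_curve.png")  # log any file your job produced
Everything logged this way lands in the local mlruns directory, ready to be exported in the next step.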
(Optional) Sometimes, your HPC might not find libpython3.12, the shared library that Python needs to run. You can follow the steps below to locate it and add it to your library path.
Search for libpython3.12:
find /hpc/packages -name "libpython3.12*.so*" 2>/dev/null
Returns something like: /path/to/python/3.12/lib/libpython3.12.so.1.0
Add its directory to LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=/path/to/python/3.12/lib:$LD_LIBRARY_PATH
4) We will export the experiment data from the mlruns local storage directory to a temp folder:
python3 -m mlflow_export_import.experiment.export_experiment --experiment "<experiment-name>" --output-dir /tmp/exported_runs
(Optional) Running the export_experiment function on the HPC bash shell may cause thread utilisation errors like:
OpenBLAS blas_thread_init: pthread_create failed for thread X of 64: Resource temporarily unavailable
This happens because MLflow internally uses SciPy for artifact and metadata handling, and SciPy requests threads through OpenBLAS, exceeding the limit allowed by your HPC. If you hit this issue, limit the number of threads by setting the following environment variables:
export OPENBLAS_NUM_THREADS=4
export OMP_NUM_THREADS=4
export MKL_NUM_THREADS=4
If the issue persists, try reducing the thread limit to 2.
5) Transfer the experiment runs to the MLflow server:
Move the exported runs from the temporary folder on the HPC to the temporary folder on the MLflow server.
rsync -avz /tmp/exported_runs <mlflow-server-username>@<host-address>:/tmp
6) Stop the local MLflow server and free up the port:
lsof -i :<port-number>
kill -9 <pid>
On MLflow Server:
Our goal is to move the experiment data from the tmp folder into MySQL and MinIO.
1) Since MinIO is Amazon S3 compatible, MLflow communicates with it through boto3 (the AWS Python SDK). So, we will set up AWS-style credentials and use them to talk to MinIO through boto3.
juju config mlflow-minio access-key=<access-key> secret-key=<secret-access-key>
2) Below are the commands to transfer the data.
First, set the MLflow server and MinIO addresses in the environment. To avoid repeating this, you can add these lines to your .bashrc file.
export MLFLOW_TRACKING_URI="http://<cluster-ip_or_nodeport_or_load-balancer>:port"
export MLFLOW_S3_ENDPOINT_URL="http://<cluster-ip_or_nodeport_or_load-balancer>:port"
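Before importing, it is worth checking that the server machine can actually reach MinIO with the credentials from step 1. The sketch below assumes you have also exported those keys as AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, the standard variables boto3 reads from the environment:
import os
import boto3

# boto3 picks up AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY from the environment;
# we only need to point it at the MinIO endpoint exported above.
s3 = boto3.client("s3", endpoint_url=os.environ["MLFLOW_S3_ENDPOINT_URL"])

# Listing buckets confirms both connectivity and credentials.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])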
All the experiment files can be found under the exported_runs folder in the tmp directory. The import_experiment tool finishes the job:
python3 -m mlflow_export_import.experiment.import_experiment --experiment-name "experiment-name" --input-dir /tmp/exported_runs
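Once the import finishes, a quick sanity check from the same shell (which still has MLFLOW_TRACKING_URI set) confirms that the experiment and its runs arrived. The experiment name below is the same placeholder used during export:
from mlflow.tracking import MlflowClient

client = MlflowClient()  # uses MLFLOW_TRACKING_URI from the environment

# Look up the imported experiment and list its runs.
experiment = client.get_experiment_by_name("<experiment-name>")
runs = client.search_runs([experiment.experiment_id])
for run in runs:
    print(run.info.run_id, run.data.metrics)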
Conclusion
This workaround let me track experiments even when communications and data transfers were restricted on my HPC cluster. Spinning up a local MLflow server, exporting experiments, and then importing them into my remote MLflow server gave me flexibility without forcing me to change my workflow.
However, if you are dealing with sensitive data, make sure your transfer method is secure. Creating cron jobs and automation scripts could potentially remove manual overhead. Also, be mindful of your local storage, as it is easy to fill up.
In the end, if you are working in a similar environment, this article offers a quick solution that does not require any admin privileges. Hopefully, it helps teams stuck with the same issue. Thanks for reading!
You can connect with me on LinkedIn.