ClearML Agent - ML-Ops made easy
ML-Ops scheduler & orchestration solution supporting Linux, macOS and Windows


ClearML-Agent

Formerly known as Trains Agent

It is a zero configuration fire-and-forget execution agent, providing a full ML/DL cluster solution.

Full Automation in 5 steps

  1. ClearML Server self-hosted or free tier hosting
  2. pip install clearml-agent (install the ClearML Agent on any GPU machine: on-premises / cloud / ...)
  3. Create a job or add ClearML to your code with just 2 lines of code
  4. Change the parameters in the UI & schedule for execution (or automate with an AutoML pipeline)
  5. 📉 📈 👀 🍺

"All the Deep/Machine-Learning DevOps your research needs, and then some... Because ain't nobody got time for that"

Try ClearML now Self Hosted or Free tier Hosting

Simple, Flexible Experiment Orchestration

The ClearML Agent was built to address the DL/ML R&D DevOps needs:

  • Easily add & remove machines from the cluster
  • Reuse machines without the need for any dedicated containers or images
  • Combine GPU resources across any cloud and on-prem
  • No need for yaml / json / template configuration of any kind
  • User friendly UI
  • Manageable resource allocation that can be used by researchers and engineers
  • Flexible and controllable scheduler with priority support
  • Automatic instance spinning in the cloud

Using the ClearML Agent, you can now set up a dynamic cluster with *epsilon DevOps

*epsilon - Because we are 📐 and nothing is really zero work

Kubernetes Integration (Optional)

We think Kubernetes is awesome, but it should be a choice. We designed clearml-agent so you can run bare-metal or inside a pod with any mix that fits your environment.

Find Dockerfiles in the docker dir and a helm Chart in https://github.com/allegroai/clearml-helm-charts

Benefits of integrating existing K8s with ClearML-Agent

  • ClearML-Agent adds the missing scheduling capabilities to K8s
  • Allows for more flexible automation from code
  • A programmatic interface for an easier learning curve (and debugging)
  • Seamless integration with ML/DL experiment manager
  • Web UI for customization, scheduling & prioritization of jobs

Two K8s integration flavours

  • Spin ClearML-Agent as a long-lasting service pod:
    • Use clearml-agent docker image
    • Map the docker socket into the pod (soon to be replaced by podman)
    • Allow the clearml-agent to manage sibling dockers
    • Benefits: full use of the ClearML scheduling, no need to worry about wrong container images / lost pods etc.
    • Downside: sibling containers
  • Kubernetes Glue, map ClearML jobs directly to K8s jobs:
    • Run the clearml-k8s glue on a K8s cpu node
    • The clearml-k8s glue pulls jobs from the ClearML job execution queue and prepares a K8s job (based on provided yaml template)
    • Inside the pod itself the clearml-agent will install the job (experiment) environment and spin and monitor the experiment's process
    • Benefits: Kubernetes full view of all running jobs in the system
    • Downside: No real scheduling (k8s scheduler), no docker image verification (post-mortem only)
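
The Kubernetes Glue flow above (pull a job from the queue, prepare a K8s job from a template) can be sketched roughly as follows. This is only an illustration, not the glue's actual code: the template is written as a Python dict instead of YAML, and the template fields, the `clearml-id-` naming scheme, the helper names and the container command are all assumptions made for the example.

```python
import copy

# Hypothetical base template standing in for the user-provided YAML template
# (a plain dict, so the sketch needs no YAML parser).
BASE_JOB_TEMPLATE = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "placeholder"},
    "spec": {
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [{"name": "clearml-task", "image": "ubuntu:18.04"}],
            }
        }
    },
}


def render_k8s_job(template, task_id):
    """Return a copy of the template specialised for one ClearML task."""
    job = copy.deepcopy(template)  # never mutate the shared template
    job["metadata"]["name"] = "clearml-id-" + task_id  # assumed naming scheme
    container = job["spec"]["template"]["spec"]["containers"][0]
    # Assumed command: inside the pod, the agent sets up and runs the task.
    container["command"] = ["clearml-agent", "execute", "--id", task_id]
    return job


def glue_step(pending_task_ids, template):
    """Drain the pending task IDs and prepare one job manifest per task."""
    jobs = []
    while pending_task_ids:
        task_id = pending_task_ids.pop(0)
        jobs.append(render_k8s_job(template, task_id))
    return jobs
```

In this sketch the manifests are only built, not submitted; scheduling the resulting jobs is left to the K8s scheduler, which matches the "no real scheduling" downside listed above.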

Using the ClearML Agent

Full scale HPC with a click of a button

The ClearML Agent is a job scheduler that listens on job queue(s), pulls jobs, sets the job environments, executes the job and monitors its progress.

Any 'Draft' experiment can be scheduled for execution by a ClearML agent.

A previously run experiment can be put into 'Draft' state by either of two methods:

  • Using the 'Reset' action from the experiment right-click context menu in the ClearML UI - This will clear any results and artifacts the previous run had created.
  • Using the 'Clone' action from the experiment right-click context menu in the ClearML UI - This will create a new 'Draft' experiment with the same configuration as the original experiment.

An experiment is scheduled for execution using the 'Enqueue' action from the experiment right-click context menu in the ClearML UI and selecting the execution queue.

See "How do I create an experiment on the ClearML Server?" below for creating an experiment and enqueuing it for execution.

Once an experiment is enqueued, it will be picked up and executed by a ClearML Agent monitoring this queue.
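
To make the difference between 'Reset', 'Clone' and 'Enqueue' concrete, here is a toy model of the experiment states described above. It does not use the ClearML API; the `Experiment` class, the status strings and the helper functions are hypothetical names chosen only for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Experiment:
    name: str
    config: dict
    status: str = "draft"  # hypothetical status names: draft / queued / completed
    results: list = field(default_factory=list)


def reset(exp):
    """'Reset': return the same experiment to Draft, discarding its results."""
    exp.status = "draft"
    exp.results.clear()
    return exp


def clone(exp):
    """'Clone': create a new Draft with the same configuration."""
    # The original experiment and its results are left untouched.
    return Experiment(name="Clone of " + exp.name, config=dict(exp.config))


def enqueue(exp, queue):
    """'Enqueue': only a Draft experiment can be scheduled for execution."""
    if exp.status != "draft":
        raise ValueError("only a Draft experiment can be enqueued")
    exp.status = "queued"
    queue.append(exp)
```

The point of the sketch is the trade-off: reset reuses the experiment and loses its previous results, while clone keeps the original intact and schedules a copy.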

The ClearML UI Workers & Queues page provides ongoing execution information:

  • Workers Tab: Monitor your cluster
    • Review available resources
    • Monitor machines statistics (CPU / GPU / Disk / Network)
  • Queues Tab:
    • Control the scheduling order of jobs
    • Cancel or abort job execution
    • Move jobs between execution queues

What The ClearML Agent Actually Does

The ClearML Agent executes experiments using the following process:

  • Create a new virtual environment (or launch the selected docker image)
  • Clone the code into the virtual-environment (or inside the docker)
  • Install python packages based on the package requirements listed for the experiment
    • Special note for PyTorch: The ClearML Agent will automatically select the torch packages based on the CUDA_VERSION environment variable of the machine
  • Execute the code, while monitoring the process
  • Log all stdout/stderr in the ClearML UI, including the cloning and installation process, for easy debugging
  • Monitor the execution and allow you to manually abort the job using the ClearML UI (or, in the unfortunate case of a code crash, catch the error and signal the experiment has failed)
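
The last three steps (execute, log stdout/stderr, detect a crash) can be illustrated with a small standard-library sketch. This is not how the agent is implemented; `run_and_monitor` and the status strings are hypothetical, and the environment setup, cloning and package installation steps are left out.

```python
import subprocess
import sys


def run_and_monitor(command):
    """Run a command, collect its combined stdout/stderr, report the outcome."""
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout as one log stream
        text=True,
    )
    log_lines = []
    for line in process.stdout:  # read lines while the process is running
        log_lines.append(line.rstrip("\n"))
    return_code = process.wait()
    # A non-zero exit code is treated as a failed experiment.
    status = "completed" if return_code == 0 else "failed"
    return status, log_lines


if __name__ == "__main__":
    status, log = run_and_monitor([sys.executable, "-c", "print('training...')"])
    print(status, log)
```

A real agent would forward each log line to the server as it arrives instead of collecting them in a list; the list keeps the example self-contained.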

System Design & Flow

[Figure: ClearML architecture diagram]

Installing the ClearML Agent

pip install clearml-agent

ClearML Agent Usage Examples

Full Interface and capabilities are available with

clearml-agent --help
clearml-agent daemon --help

Configuring the ClearML Agent

clearml-agent init

Note: The ClearML Agent uses a cache folder to cache pip packages, apt packages and cloned repositories. The default ClearML Agent cache folder is ~/.clearml.

See full details in your configuration file at ~/clearml.conf.

Note: The ClearML Agent extends the ClearML configuration file ~/clearml.conf. They are designed to share the same configuration file, see example here

Running the ClearML Agent

For debug and experimentation, start the ClearML agent in foreground mode, where all the output is printed to screen:

clearml-agent daemon --queue default --foreground

For actual service mode, all stdout is stored automatically in a temporary file (no need to pipe). Note: with the --detached flag, the clearml-agent runs in the background.

clearml-agent daemon --detached --queue default

GPU allocation is controlled via the standard NVIDIA_VISIBLE_DEVICES environment variable or the --gpus flag (or disabled with --cpu-only).

If no flag is set and the NVIDIA_VISIBLE_DEVICES variable doesn't exist, all GPUs are allocated to the clearml-agent.
If the --cpu-only flag is set, or NVIDIA_VISIBLE_DEVICES="none", no GPU is allocated to the clearml-agent.
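
The allocation rules above can be written down as a small decision function. This is a sketch of the documented behaviour, not the agent's code: `resolve_gpu_allocation` is a hypothetical helper, and the precedence of --gpus over NVIDIA_VISIBLE_DEVICES when both are set is an assumption, since the text does not state it.

```python
def resolve_gpu_allocation(gpus_flag=None, cpu_only=False, env=None):
    """Return "all", "none", or the device string the agent would be limited to."""
    env = {} if env is None else env
    if cpu_only:
        return "none"  # --cpu-only disables GPU allocation
    if gpus_flag is not None:
        return gpus_flag  # assumption: the flag wins over the environment variable
    visible = env.get("NVIDIA_VISIBLE_DEVICES")
    if visible is None:
        return "all"  # nothing set: every GPU is allocated
    return visible  # includes the literal value "none"
```

For example, `resolve_gpu_allocation(env={"NVIDIA_VISIBLE_DEVICES": "none"})` and `resolve_gpu_allocation(cpu_only=True)` both yield "none" in this sketch.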

Example: spin two agents, one per GPU on the same machine:

Notice: with --detached flag, the clearml-agent will run in the background

clearml-agent daemon --detached --gpus 0 --queue default
clearml-agent daemon --detached --gpus 1 --queue default

Example: spin two agents, pulling from dedicated dual_gpu queue, two GPUs per agent

clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu
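
On machines with many GPUs, the per-agent command lines above can be generated instead of typed by hand. The helper below is a hypothetical convenience sketch that only assembles strings from the flags already shown (--detached, --gpus, --queue); it does not launch anything.

```python
import shlex


def daemon_commands(gpu_groups, queue):
    """Build one detached `clearml-agent daemon` command per GPU group."""
    commands = []
    for group in gpu_groups:
        argv = [
            "clearml-agent", "daemon", "--detached",
            "--gpus", ",".join(str(gpu) for gpu in group),
            "--queue", queue,
        ]
        commands.append(shlex.join(argv))  # shell-safe string (Python 3.8+)
    return commands


if __name__ == "__main__":
    # Two agents, two GPUs each, pulling from the dual_gpu queue.
    for command in daemon_commands([[0, 1], [2, 3]], "dual_gpu"):
        print(command)
```

Passing `[[0], [1]]` and `"default"` reproduces the one-agent-per-GPU example instead.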

Starting the ClearML Agent in docker mode

For debug and experimentation, start the ClearML agent in foreground mode, where all the output is printed to screen

clearml-agent daemon --queue default --docker --foreground

For actual service mode, all stdout is stored automatically in a file (no need to pipe). Note: with the --detached flag, the clearml-agent runs in the background.

clearml-agent daemon --detached --queue default --docker

Example: spin two agents, one per gpu on the same machine, with default nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 docker:

clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 1 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04

Example: spin two agents, pulling from dedicated dual_gpu queue, two GPUs per agent, with default nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 docker:

clearml-agent daemon --detached --gpus 0,1 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04
clearml-agent daemon --detached --gpus 2,3 --queue dual_gpu --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04

Starting the ClearML Agent - Priority Queues

Priority Queues are also supported, example use case:

High priority queue: important_jobs, low priority queue: default

clearml-agent daemon --queue important_jobs default

The ClearML Agent will first try to pull jobs from the important_jobs queue, and only if it is empty, the agent will try to pull from the default queue.
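
The pull order described above amounts to scanning the queues in the order given on the command line and taking a job from the first one that is not empty. A minimal sketch, using in-memory deques in place of real server-side queues (`pull_next_job` is a hypothetical name):

```python
from collections import deque


def pull_next_job(queues, priority_order):
    """Return (queue_name, job) from the first non-empty queue, else None."""
    for name in priority_order:
        if queues[name]:
            return name, queues[name].popleft()
    return None


if __name__ == "__main__":
    queues = {
        "important_jobs": deque(["urgent-fix"]),
        "default": deque(["sweep-1", "sweep-2"]),
    }
    order = ["important_jobs", "default"]  # same order as on the command line
    while (pulled := pull_next_job(queues, order)) is not None:
        print(pulled)
```

Because the scan restarts from the highest-priority queue on every pull, a job added to important_jobs later is still served before anything remaining in default.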

Adding queues, managing job order within a queue, and moving jobs between queues, is available using the Web UI, see example on our free server

Stopping the ClearML Agent

To stop a ClearML Agent running in the background, run the same command line used to start the agent with --stop appended. For example, to stop the first of the two single-GPU agents started on the same machine above:

clearml-agent daemon --detached --gpus 0 --queue default --docker nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 --stop
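
If the start commands are kept in a script, the matching stop command can be derived from them rather than retyped. The helper below is a hypothetical sketch that only applies the "same command with --stop appended" rule to a string.

```python
import shlex


def stop_command(start_command):
    """Return the command that stops an agent started with start_command."""
    argv = shlex.split(start_command)
    if "--stop" not in argv:  # avoid appending the flag twice
        argv.append("--stop")
    return shlex.join(argv)
```

Deriving the stop command this way keeps the flags identical to the start command, which is what the rule above requires.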

How do I create an experiment on the ClearML Server?

  • Integrate ClearML with your code

  • Execute the code on your machine (Manually / PyCharm / Jupyter Notebook)

  • As your code is running, ClearML creates an experiment logging all the necessary execution information:

    • Git repository link and commit ID (or an entire jupyter notebook)
    • Git diff (we're not saying you never commit and push, but still...)
    • Python packages used by your code (including specific versions used)
    • Hyperparameters
    • Input artifacts

    You now have a 'template' of your experiment with everything required for automated execution

  • In the ClearML UI, right-click on the experiment and select 'clone'. A copy of your experiment will be created.

  • You now have a new draft experiment cloned from your original experiment, feel free to edit it

    • Change the hyperparameters
    • Switch to the latest code base of the repository
    • Update package versions
    • Select a specific docker image to run in (see docker execution mode section)
    • Or simply change nothing to run the same experiment again...
  • Schedule the newly created experiment for execution: right-click the experiment and select 'enqueue'

ClearML-Agent Services Mode

ClearML-Agent Services is a special mode of ClearML-Agent that provides the ability to launch long-lasting jobs that previously had to be executed on local / dedicated machines. It allows a single agent to launch multiple dockers (Tasks) for different use cases:

  • Auto-scaler service (spinning instances when the need arises and the budget allows)
  • Controllers (Implementing pipelines and more sophisticated DevOps logic)
  • Optimizer (such as Hyperparameter Optimization or sweeping)
  • Application (such as interactive Bokeh apps for increased data transparency)

ClearML-Agent Services mode will spin any task enqueued into the specified queue. Every task launched by ClearML-Agent Services will be registered as a new node in the system, providing tracking and transparency capabilities. Currently, clearml-agent in services-mode supports CPU only configuration. ClearML-Agent services mode can be launched alongside GPU agents.

clearml-agent daemon --services-mode --detached --queue services --create-queue --docker ubuntu:18.04 --cpu-only

Note: It is the user's responsibility to make sure the proper tasks are pushed into the specified queue.

AutoML and Orchestration Pipelines

The ClearML Agent can also be used to implement AutoML orchestration and Experiment Pipelines in conjunction with the ClearML package.

Sample AutoML & Orchestration examples can be found in the ClearML example/automation folder.

AutoML examples:

Experiment Pipeline examples:

  • First step experiment
    • This example will "process data", and once done, will launch a copy of the 'second step' experiment-template
  • Second step experiment
    • In order to create an experiment-template in the system, this code must be executed once manually

License

Apache License, Version 2.0 (see the LICENSE for more information)
