DLabs.AI

Deploying MLflow in GCP Using Terraform: A Step-by-Step Guide

 

This article provides a detailed guide to automating the deployment of MLflow on the Google Cloud Platform (GCP) using Terraform, a powerful Infrastructure as Code (IaC) tool. With Terraform, the management and configuration of ML infrastructure can be significantly simplified, improving efficiency and scalability. The guide covers everything from an introduction to Terraform and its key advantages to 13 concrete steps, including installing and configuring Terraform, activating GCP APIs, setting project variables, creating an Artifact Registry, deploying a Cloud SQL database and managing its users, creating service accounts, configuring a dedicated MLflow storage bucket, and setting IAM policies. Detailed code examples and commands are provided throughout to help you efficiently build and manage your machine learning experimentation platform.

💡 **Terraform's core value**: Terraform is an open-source Infrastructure as Code (IaC) tool that lets users define and provision data center infrastructure through declarative configuration files. Its core strengths lie in automation, code reusability, simplified management, reduced provisioning time, and easy drift management, all of which greatly improve the efficiency and reliability of deploying and managing ML infrastructure.

🛠️ **Key GCP resources**: The guide details the key resources needed to deploy MLflow on GCP with Terraform, including a Cloud Storage bucket for MLflow experiment data, an Artifact Registry for Docker images, Cloud Run for running the MLflow service, and Secret Manager for storing database credentials. It also covers creating, configuring, and securing a Cloud SQL database instance.

🔑 **Service accounts and IAM configuration**: For secure, fine-grained access control, the guide shows how to create dedicated service accounts, one for the Cloud Run deployment and one for MLflow's access to the storage bucket. IAM policies then grant these accounts only the permissions they need, such as connecting to the Cloud SQL database, reading sensitive values from Secret Manager, and creating objects in the bucket, keeping the entire deployment secure and compliant.

🚀 **End-to-end automation**: The entire deployment is automated end to end with Terraform. From initializing Terraform, configuring backend storage, and supplying project variables, to creating each GCP resource (Artifact Registry, Cloud SQL instance, service accounts, storage buckets) and configuring IAM permissions, Terraform executes every required step from the declared configuration, greatly reducing manual effort and the likelihood of error.

The complexities of managing and deploying ML infrastructure continue to grow and can indeed be daunting. Some time ago, we shared “A Step-by-step Guide To Setting Up MLflow On The Google Cloud Platform,” which was well-received. However, as time has passed, advancements in technology have offered us the potential to simplify and automate this process further.

And this is where Terraform comes into play. As a robust Infrastructure as Code (IaC) tool, Terraform enables automation, significantly streamlining your ML infrastructure management. 

In this step-by-step guide, we’ll demonstrate how to seamlessly set up MLflow on the Google Cloud Platform using Terraform, further optimizing the process and incorporating additional automation and scalability into your machine learning infrastructure.

Whether you are already leveraging MLflow’s capabilities to optimize machine learning experiments or just starting to navigate its potential, this guide will deliver a step-by-step tutorial to augment your work.

What is Terraform?

Before delving into the specifics of utilizing Terraform for our ML infrastructure management, let’s start with an explanation of what Terraform is. 

Terraform is an open-source Infrastructure as Code (IaC) software tool created by HashiCorp. It allows developers to define and provide data center infrastructure using a declarative configuration language. Simply articulate your infrastructure’s desired state, and Terraform devises the strategy to achieve that state. 

This capability proves especially beneficial in our focus: setting up MLflow on the Google Cloud Platform. It automates and streamlines the provisioning and management process. 

Moreover, the platform-independent nature of Terraform allows it to work with any cloud provider, ensuring its utility in managing resources across diverse environments.

5 Key Advantages of Terraform

In any technology-driven process, tool selection is a key determinant of efficiency and success. After thoroughly evaluating a wide array of options, we concluded that Terraform indeed stands out. But what exactly drove us towards Terraform? Let’s investigate the specific benefits this tool provides:

    Code Reusability: Terraform’s code can be effortlessly repurposed for future needs, streamlining your work processes.
    Accountability: The code is readily reviewable, promoting transparency and facilitating error detection and problem-solving.
    Simplified Management: The unique Infrastructure as Code (IaC) approach eliminates complicated manual processes, making infrastructure management significantly easier.
    Reduced Provisioning Time: Terraform speeds up the setup process, enabling faster deployment and boosting overall productivity.
    Easier Drift Management: With Terraform, deviations from your desired infrastructure state can be easily detected and rectified, simplifying the often complex task of drift management.

Link to Repository

For your convenience and reference, we have prepared a dedicated repository that houses all the necessary resources related to this guide. 

If you encounter any issues or require additional clarity, we recommend using this repository.

Resources

13 Key Steps to Setting Up MLflow on Google Cloud Platform using Terraform

Step 1: Setting Up Terraform

    Install Terraform – link to the official documentation
    Install terraform-docs for automatic documentation (optional) – terraform-docs
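To confirm the installation, you can run a quick version check (the reported version will depend on what you installed):

    # Verify Terraform is on your PATH and print the installed version
    terraform version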

Step 2: Activating Needed APIs

To kickstart this project, it’s essential to activate the following APIs. Please be aware that the activation process for some of them might take up to 24 hours.

    Secret Manager
    Artifact Registry
    Cloud Run
    Cloud SQL
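If you prefer the command line to the console, the same APIs can be enabled with gcloud (assuming gcloud is authenticated and pointed at your project; these are the standard GCP service identifiers):

    # Enable the APIs this guide relies on
    gcloud services enable \
      secretmanager.googleapis.com \
      artifactregistry.googleapis.com \
      run.googleapis.com \
      sqladmin.googleapis.com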

Step 3: Setting Up GCP

    Install gcloud CLI: https://cloud.google.com/sdk/docs/install
    Log into gcloud using gcloud auth application-default login
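It also helps to point gcloud at the project you will be deploying into (the project ID below is the sample one used throughout this guide; substitute your own):

    # Set the default project for subsequent gcloud commands
    gcloud config set project test-ml-flow-terraform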

Step 4: Creating a Bucket for Terraform State

A bucket must be created to store Terraform’s state. While it’s feasible to keep the state locally, it’s not advised due to potential issues with data integrity and reliability.

    From the Google Cloud menu, choose Cloud Storage -> Bucket -> Create Bucket.
    Fill in the bucket data. Consider choosing the ‘Multi-region’ location type if your project needs higher availability. You can adjust other options as required.
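The same bucket can also be created from the command line (the bucket name and location below are placeholders; versioning is an optional safeguard for state history):

    # Create the bucket that will hold the Terraform state
    gcloud storage buckets create gs://your-terraform-state-bucket \
      --location=europe-west2 \
      --uniform-bucket-level-access

    # Optional: enable object versioning to keep a history of state files
    gcloud storage buckets update gs://your-terraform-state-bucket --versioning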

Step 5: Start Working with Terraform

If you need more detailed information about specific resources, the Terraform documentation serves as an excellent resource.

    Create a main.tf file, which will hold the basic configuration. Make sure to insert the latest version of the Google provider.
    terraform {
      required_providers {
        google = {
          source  = "hashicorp/google"
          version = "4.58.0"
        }
      }

      backend "gcs" {
        bucket = "your bucket name"
        prefix = "terraform/state"
      }
    }

    provider "google" {
      region  = "your region"
      project = "your project name"
    }
    Navigate to your Terraform directory and run terraform init. This command initializes Terraform, downloads all required providers, and establishes a connection with your state bucket.

    You should see an output similar to the following:

    /projects/mlflow-gcp/terraform$ terraform init
    Initializing the backend...

    Successfully configured the backend "gcs"! Terraform will automatically
    use this backend unless the backend configuration changes.

    Initializing provider plugins...
    - Finding hashicorp/google versions matching "4.58.0"...
    - Finding latest version of hashicorp/random...
    - Installing hashicorp/google v4.58.0...
    - Installed hashicorp/google v4.58.0 (signed by HashiCorp)
    - Installing hashicorp/random v3.5.1...
    - Installed hashicorp/random v3.5.1 (signed by HashiCorp)

    Terraform has created a lock file .terraform.lock.hcl to record the provider
    selections it made above. Include this file in your version control repository
    so that Terraform can guarantee to make the same selections by default when
    you run "terraform init" in the future.
    You should now be able to see a terraform folder in your bucket.

Step 6: Inputting Variables for the Project 

Before we can commence the creation of resources, we must input specific project variables. These should be created in the variables.tf file.

    Env – the environment the project will be used in: test/staging/prod.
    variable "env" {  default     = "prod"  description = "Name of the environment"}
    The full name of the project (you can find it in the GCP console):

    variable "project_name" {  default     = "test-ml-flow-terraform"  description = "Full name of the project"}
    VPN for database – if you’re using a VPN, it would be a good idea to input it in this variable, as it adds an additional layer of security.

    variable "vpn_to_access_db" {
      default     = "0.0.0.0/0"
      description = "VPN that will be used to connect to the DB. With 0.0.0.0/0 the database will be reachable from any IP (it will be accessible from the internet)."
    }
    Region that will be used throughout your project.
    variable "region" {  default     = "europe-west2"  description = "GCP region that will be used for the project"}
    Docker image – after pushing your image to the Artifact Registry, the name of the image needs to be put here.

    variable "image_name" {
      default     = "mlflow-imagine"
      description = "Name of the image that will be used."
    }
    Bucket users list – input all the users that need to be able to see objects in the bucket. You need to put “user:” before the email if the email belongs to a user, and “serviceAccount:” if it belongs to a service account.

    variable "bucket_users_list" {
      default     = ["user:user_email1@gmail.com", "user:user_email2@gmail.com"]
      description = "List of users that can view objects in the bucket"
    }

Step 7: Creating Artifact Registry file

Create a registry using Terraform. For this purpose, you’ll need to establish a file titled artifact_registry.tf:

resource "google_artifact_registry_repository" "ar_mlflow" {  location      = var.region  repository_id = "${var.project_name}-repo"  description   = "Docker repository for MlFlow"  format        = "DOCKER"}

Create the repository using the command:

terraform apply -target="google_artifact_registry_repository.ar_mlflow"

To use it to spin up a Cloud Run deployment, you need to build, tag, and push a Docker image to the Artifact Registry.

Ensure you have a recent version of Docker installed. You can follow the installation instructions from the official Docker documentation.

The Docker images are set to be pushed to the Artifact Registry within your GCP project. 

Ensure you change the project ID and region in the commands to match the ones you are using. If you encounter any issues with the Artifact Registry, please refer to the official documentation.

Set up Docker authentication (you only need to do this once):

gcloud auth configure-docker europe-west2-docker.pkg.dev

Build the image:

docker build -t mlflow-gcp .

Tag it:

docker tag mlflow-gcp europe-west2-docker.pkg.dev/test-ml-flow-terraform/test-ml-flow-terraform-repo/mlflow-imagine:latest

And push it:

docker push europe-west2-docker.pkg.dev/test-ml-flow-terraform/test-ml-flow-terraform-repo/mlflow-imagine:latest
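To confirm the push succeeded, you can list the images in the repository (the project, region, and repository names match the examples above):

    # List images in the repository to verify the push
    gcloud artifacts docker images list \
      europe-west2-docker.pkg.dev/test-ml-flow-terraform/test-ml-flow-terraform-repo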

Step 8: Creating a Database Instance

    First, we need to create an instance that will then hold our database (a new .tf file for the database should be created).

        The version used is PostgreSQL 14 (the latest available at the time of writing); the minimal configuration for this version is 1 CPU and 3.75 GB RAM, so a custom tier needs to be created (read more about creating custom tiers here).
        The instance needs to have some IP, so either use ipv4_enabled or create a private network if that is a better option for you.
        data "google_project" "project" {  project_id = var.project_name}resource "google_sql_database_instance" "main" {  name             = "${var.project_name}-${var.env}-db"  database_version = "POSTGRES_14"  project          = var.project_name  settings {    tier            = "db-custom-1-3840" # vCPUs: 1 Memory: 3.75 GB    disk_size       = "10"               # HDD storage: 10 GB    disk_type       = "PD_HDD"    disk_autoresize = false    ip_configuration {      ipv4_enabled  = true      authorized_networks {        name  = "Your VPN"        value = var.vpn_to_access_db      }    }  }}

        Run terraform plan. If there are no issues, execute terraform apply. Be aware that creating database instances may take anywhere from 5 to 20 minutes.


      Once the instance is up and ready, we can create the database. It will attach to the instance we created previously.
      resource "google_sql_database" "ml_flow_db" {  name     = "mlflow-${var.project_name}-${var.env}"  instance = google_sql_database_instance.main.name  project  = var.project_name}
      The next step is to create a user. The password is generated for you by the random_password resource. The login, password, and database URL are added to secrets, as they will be needed later.
      resource "google_sql_user" "ml_flow_user" {  name     = "mlflow-${var.project_name}-${var.env}"  instance = google_sql_database_instance.main.name  password = random_password.db_password_ml_flow.result  project  = var.project_name}

      Creating a secret that will hold the login:

      resource "google_secret_manager_secret" "db_login_ml_flow" {  secret_id = "mlflow_tracking_username"  project   = var.project_name  replication {    automatic = true  }}resource "google_secret_manager_secret_version" "ml-flow-username-version-basic" {  secret      = google_secret_manager_secret.db_login_ml_flow.id  secret_data = google_sql_user.ml_flow_user.name}

      Creating a password and secret that will hold the password:

      resource "random_password" "db_password_ml_flow" {  length           = 16  special          = true  override_special = "!#$%&*()-_=+[]{}<>:?"}resource "google_secret_manager_secret" "db_password_ml_flow" {  secret_id = "mlflow_tracking_password"  project   = var.project_name  replication {    automatic = true  }}resource "google_secret_manager_secret_version" "ml-flow-pass-version-basic" {  secret      = google_secret_manager_secret.db_password_ml_flow.id  secret_data = random_password.db_password_ml_flow.result}

      Creating a secret with a database connection link that we will use later:

      resource "google_secret_manager_secret" "mlflow_database_url" {  secret_id = "mlflow_database_url"  project   = var.project_name  replication {    automatic = true  }}resource "google_secret_manager_secret_version" "mlflow_database_url-version-basic" {  secret      = google_secret_manager_secret.mlflow_database_url.id  secret_data = "postgresql+psycopg2://${google_sql_user.ml_flow_user.name}:${random_password.db_password_ml_flow.result}@/${google_sql_database.ml_flow_db.name}?host=/cloudsql/${google_sql_database_instance.main.connection_name}"}

      Run terraform plan and then terraform apply.
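After the apply, you can confirm the secrets were populated (reading a secret prints its value to the terminal, so treat the output as sensitive):

      # Read back the generated database password to verify the secret exists
      gcloud secrets versions access latest --secret=mlflow_tracking_password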

Step 9: Creating Service Accounts for Cloud Run and MLflow

Next, we have two service accounts to create: one for Cloud Run and a second for connecting to MLflow. We will assign their rights in a separate file (a new .tf file for service accounts should be created).

      For Cloud Run – it will only be used by Cloud Run:

      resource "google_service_account" "cloudrun-mlflow" {
        account_id   = "cloudrun-mlflow-${var.env}"
        display_name = "Service Account for Cloud Run running ML Flow"
      }
      For connecting to MLflow and creating objects in the bucket – you will need this account’s key to connect to MLflow from a script:

      resource "google_service_account" "bucket_object_creator" {
        account_id   = "mlflow-connect-${var.env}"
        display_name = "Service Account for creating objects in the ML Flow bucket."
      }

      resource "google_service_account_key" "key_to_bucket_object_creator" {
        service_account_id = google_service_account.bucket_object_creator.name
      }

      resource "google_secret_manager_secret" "bucket_object_creator_account_key" {
        secret_id = "mlflow_service_account_key"
        project   = var.project_name

        replication {
          automatic = true
        }
      }

      resource "google_secret_manager_secret_version" "mlflow_key-version-basic" {
        secret      = google_secret_manager_secret.bucket_object_creator_account_key.id
        secret_data = base64decode(google_service_account_key.key_to_bucket_object_creator.private_key)
      }

    After terraform plan and terraform apply, you should be able to see two new service accounts; one should have a key.

Step 10: Creating Storage Bucket for MLflow Artifacts

At this point, we will create a storage bucket for our MLflow artifacts. The bucket is created as regional, but if you need higher availability, you can use other location types (a new .tf file for the bucket should be created).

    resource "google_storage_bucket" "mlflow_artifacts_bucket" {  name                        = "${var.project_name}-mlflow-${var.env}-${var.region}"  project                     = var.project_name  location                    = "EUROPE-WEST2"  storage_class               = "REGIONAL"  uniform_bucket_level_access = true  public_access_prevention    = "enforced"  force_destroy               = false}

    The bucket URL needs to be inserted into a secret, as we will need it later.

    resource "google_secret_manager_secret" "mlflow_artifact_url" {  secret_id = "mlflow_artifact_url"  project   = var.project_name  replication {    automatic = true  }}resource "google_secret_manager_secret_version" "ml-flow-artifact-url-version-basic" {  secret      = google_secret_manager_secret.mlflow_artifact_url.id  secret_data = google_storage_bucket.mlflow_artifacts_bucket.url}

Step 11: IAM Configuration

      Add rights to the Cloud Run service account (created earlier) so it can connect to the database (a new .tf file for IAM policies should be created).
      resource "google_project_iam_member" "cloudsql_connect" {  project = var.project_name  role    = "roles/cloudsql.client"  member  = "serviceAccount:${google_service_account.cloudrun-mlflow.email}"}
      Add rights to the Cloud Run service account so it can read secrets.
      resource "google_secret_manager_secret_iam_member" "db_login_ml_flow_accessor" {  secret_id = google_secret_manager_secret.db_login_ml_flow.id  role      = "roles/secretmanager.secretAccessor"  member    = "serviceAccount:${google_service_account.cloudrun-mlflow.email}"}resource "google_secret_manager_secret_iam_member" "db_password_ml_flow_accessor" {  secret_id = google_secret_manager_secret.db_password_ml_flow.id  role      = "roles/secretmanager.secretAccessor"  member    = "serviceAccount:${google_service_account.cloudrun-mlflow.email}"}resource "google_secret_manager_secret_iam_member" "mlflow_artifact_url_accessor" {  secret_id = google_secret_manager_secret.mlflow_artifact_url.id  role      = "roles/secretmanager.secretAccessor"  member    = "serviceAccount:${google_service_account.cloudrun-mlflow.email}"}resource "google_secret_manager_secret_iam_member" "mlflow_database_url_accessor" {  secret_id = google_secret_manager_secret.mlflow_database_url.id  role      = "roles/secretmanager.secretAccessor"  member    = "serviceAccount:${google_service_account.cloudrun-mlflow.email}"}

      Create a policy for the bucket so that your team can view the objects and the service account can create them.

    data "google_iam_policy" "mlflow-bucket-policy" { binding { role = "roles/storage.objectViewer" members = concat(var.bucket_users_list, ["serviceAccount:${google_service_account.cloudrun-mlflow.email}"]) } binding { role = "roles/storage.objectCreator" members = [ "serviceAccount:${google_service_account.bucket_object_creator.email}", ] }}resource "google_storage_bucket_iam_policy" "policy_bucket_object_create" { bucket = google_storage_bucket.mlflow_artifacts_bucket.name policy_data = data.google_iam_policy.mlflow-bucket-policy.policy_data depends_on = [google_service_account.cloudrun-mlflow, google_storage_bucket.mlflow_artifacts_bucket]}You should be able to see these permissions in the bucket:

Step 12: Accessing Secret Manager

If everything was done correctly, you should see all the secrets created in the previous steps listed in Secret Manager.
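A quick way to check from the terminal (assuming the default project is set):

    # List all secrets in the project
    gcloud secrets list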

Step 13: Cloud Run Configuration

This resource will be the most complex of all. The full resource definition is available in the linked repository (a new .tf file for Cloud Run should be created); a minimal sketch is shown below.
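The sketch assumes the names from the earlier steps; the image path, the environment-variable wiring, and the MLFLOW_DATABASE_URL variable name are illustrative, not the repository’s exact configuration:

    # Minimal illustrative sketch – see the repository for the full resource
    resource "google_cloud_run_service" "mlflow" {
      name     = "mlflow-${var.env}"
      location = var.region
      project  = var.project_name

      template {
        spec {
          # Run as the dedicated Cloud Run service account from Step 9
          service_account_name = google_service_account.cloudrun-mlflow.email

          containers {
            # Image pushed to the Artifact Registry in Step 7
            image = "${var.region}-docker.pkg.dev/${var.project_name}/${var.project_name}-repo/${var.image_name}:latest"

            # Expose the database connection string from Secret Manager
            env {
              name = "MLFLOW_DATABASE_URL" # hypothetical variable name
              value_from {
                secret_key_ref {
                  name = google_secret_manager_secret.mlflow_database_url.secret_id
                  key  = "latest"
                }
              }
            }
          }
        }

        metadata {
          annotations = {
            # Attach the Cloud SQL instance via the built-in connector
            "run.googleapis.com/cloudsql-instances" = google_sql_database_instance.main.connection_name
          }
        }
      }

      traffic {
        percent         = 100
        latest_revision = true
      }
    }

Run terraform plan and terraform apply once your file matches the repository version.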

If you plan on uploading sensitive data to MLflow, I suggest adding another security layer.

Conclusion

And with that, we have reached the end of this guide. I hope that you found this tutorial insightful and beneficial.

As you can see, setting up MLflow on GCP using Terraform brings about enhanced automation and streamlining of ML infrastructure. But Terraform offers more than just automation – it also provides key benefits like code reusability, simplified management, and reduced provisioning time, all of which are vital in a fast-paced ML environment.

As machine learning advances, staying informed and adaptable to emerging trends and technologies is crucial. If you found this guide helpful, consider subscribing to our newsletter. We plan to publish more insightful tutorials that could significantly enhance the efficiency of your work.


The article Deploying MLflow in GCP Using Terraform: A Step-by-Step Guide originally appeared on DLabs.AI.
