Cost optimization at scale in MLOps - Commands & Configuration

Introduction

Running machine learning workloads in the cloud can become expensive quickly. Cost optimization at scale helps you reduce cloud spending by managing resources efficiently and automating cost-saving actions.

Use this pattern in the following situations:

When you want to automatically stop idle or underused cloud compute instances to save money.

When you need to track and alert on unexpected spikes in cloud resource usage.

When you want to schedule training jobs during cheaper off-peak hours.

When you want to use cheaper spot instances for non-critical workloads.

When you want to monitor and optimize storage costs for large datasets.
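As one illustration of the spike-alert use case above, here is a minimal sketch that flags any day whose spend exceeds a multiple of the trailing average. The function name find_cost_spikes, the 7-day window, and the 1.5x threshold are assumptions chosen for illustration, and the cost figures are made up. In practice the numbers would come from your cloud provider's billing export.

```python
from statistics import mean


def find_cost_spikes(daily_costs, window=7, threshold=1.5):
    """Return indices of days whose cost exceeds threshold x the trailing mean.

    daily_costs: list of daily spend figures (hypothetical data).
    window: number of preceding days used as the baseline (assumed value).
    threshold: multiplier over the baseline that counts as a spike (assumed value).
    """
    spikes = []
    for day in range(window, len(daily_costs)):
        baseline = mean(daily_costs[day - window:day])
        if baseline > 0 and daily_costs[day] > threshold * baseline:
            spikes.append(day)
    return spikes


if __name__ == "__main__":
    # Made-up daily spend in dollars; the value at index 8 is an obvious outlier.
    costs = [100, 105, 98, 102, 101, 99, 103, 104, 250, 102]
    for day in find_cost_spikes(costs):
        print(f"Alert: spend on day {day} was ${costs[day]}")
```

A trailing-average baseline is only one possible choice. It is simple to reason about, but it reacts slowly to gradual cost growth, so the window and threshold usually need tuning per team.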

Config File - cost_optimization_pipeline.py

import mlflow


def check_idle_resources():
    """Simulate a check for idle resources and return their IDs."""
    print("Checking for idle resources...")
    return ["instance-123", "instance-456"]


def stop_resources(instances):
    """Simulate stopping each instance; no cloud API is called."""
    for instance in instances:
        print(f"Stopping {instance} to save cost.")


def main():
    # The context manager ends the run even if an exception is raised.
    with mlflow.start_run(run_name="cost_optimization"):
        idle_instances = check_idle_resources()
        if idle_instances:
            stop_resources(idle_instances)
        else:
            print("No idle resources found.")
        # len() is 0 for an empty list, so one call covers both branches.
        mlflow.log_metric("stopped_instances", len(idle_instances))


if __name__ == "__main__":
    main()

This Python script uses MLflow to track a cost optimization run.

The check_idle_resources function simulates finding idle cloud instances.

The stop_resources function simulates stopping those instances to save money.

Metrics about stopped instances are logged to MLflow for monitoring.
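The simulated check above always returns the same two instances. A slightly more realistic sketch, still without any cloud API, could decide idleness from utilization samples. The sample data, the 5% CPU threshold, and the function name find_idle_instances are assumptions for illustration; a real pipeline would read these metrics from the provider's monitoring service.

```python
def find_idle_instances(cpu_samples, max_avg_cpu=5.0):
    """Return sorted IDs of instances whose average CPU percent is at or below max_avg_cpu.

    cpu_samples: dict mapping instance ID to a list of CPU utilization
    percentages (hypothetical data). Instances with no samples are skipped,
    because missing data is not evidence of idleness.
    """
    idle = []
    for instance_id, samples in cpu_samples.items():
        if not samples:
            continue
        if sum(samples) / len(samples) <= max_avg_cpu:
            idle.append(instance_id)
    return sorted(idle)


if __name__ == "__main__":
    # Made-up utilization readings for illustration only.
    samples = {
        "instance-123": [1.0, 2.0, 0.5],
        "instance-456": [4.0, 3.0],
        "instance-789": [60.0, 75.0],
    }
    print(find_idle_instances(samples))
```

CPU alone can be misleading for ML workloads, since a GPU-bound job may show low CPU usage, so a production check would likely combine several signals.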

Commands

Run the cost optimization script to check for idle resources and stop them, logging metrics to MLflow.

Terminal

python cost_optimization_pipeline.py

Expected Output

Checking for idle resources...
Stopping instance-123 to save cost.
Stopping instance-456 to save cost.

Start the MLflow tracking UI to view logged metrics and runs for cost optimization.

Terminal

mlflow ui

Expected Output

The exact startup log lines depend on the MLflow version. By default the server listens on http://127.0.0.1:5000, where the UI can be opened in a browser.

--host - Specify the network interface to listen on

--port - Specify the port for the UI

Key Concept

If you remember nothing else from this pattern, remember: automate detection and shutdown of idle resources to save cloud costs effectively.

Common Mistakes

Not logging cost-saving metrics to MLflow.

Without metrics, you cannot track or prove cost savings over time.

Always log key metrics like number of stopped instances or hours saved.
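A minimal sketch of what logging richer savings metrics might look like is shown here. The hourly rates, the helper names build_savings_metrics and log_savings, and the metric names hours_saved and estimated_savings_usd are assumptions for illustration, not part of the original script. The calculation is kept separate from the MLflow call so it can be checked on its own.

```python
def build_savings_metrics(stopped_rates, hours_avoided):
    """Summarize savings for a set of stopped instances.

    stopped_rates: dict mapping instance ID to its hourly price in dollars
    (hypothetical values).
    hours_avoided: hours each instance would otherwise have kept running
    (an assumed estimate).
    """
    total_hourly_rate = sum(stopped_rates.values())
    return {
        "stopped_instances": len(stopped_rates),
        "hours_saved": hours_avoided * len(stopped_rates),
        "estimated_savings_usd": round(total_hourly_rate * hours_avoided, 2),
    }


def log_savings(metrics):
    """Log the metrics dict to a new MLflow run."""
    # Imported here so the calculation above works without MLflow installed.
    import mlflow

    with mlflow.start_run(run_name="cost_optimization"):
        mlflow.log_metrics(metrics)


if __name__ == "__main__":
    rates = {"instance-123": 0.50, "instance-456": 1.25}  # made-up prices
    metrics = build_savings_metrics(rates, hours_avoided=8)
    print(metrics)
    # Call log_savings(metrics) to record the values in MLflow.
```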

Stopping critical resources by mistake.

This causes downtime and disrupts production workloads.

Implement safeguards to identify only truly idle or non-critical resources.
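One possible shape for such a safeguard is a filter that runs before anything is stopped. In this sketch the record layout (id, tags, idle_minutes), the env tag convention, the 60-minute minimum, and the function name select_safe_to_stop are all assumptions for illustration.

```python
def select_safe_to_stop(instances, protected_ids=frozenset(), min_idle_minutes=60):
    """Return IDs of instances that pass every safeguard.

    An instance is kept running if it is explicitly protected, tagged as
    production, or has not been idle for at least min_idle_minutes.
    """
    safe = []
    for inst in instances:
        if inst["id"] in protected_ids:
            continue  # explicit allow-list of instances never to stop
        if inst.get("tags", {}).get("env") == "production":
            continue  # never stop production workloads automatically
        if inst["idle_minutes"] < min_idle_minutes:
            continue  # idle too briefly to be sure
        safe.append(inst["id"])
    return safe


if __name__ == "__main__":
    # Made-up inventory for illustration only.
    inventory = [
        {"id": "instance-123", "tags": {"env": "dev"}, "idle_minutes": 180},
        {"id": "instance-456", "tags": {"env": "production"}, "idle_minutes": 240},
        {"id": "instance-789", "tags": {"env": "dev"}, "idle_minutes": 10},
    ]
    print(select_safe_to_stop(inventory))
```

Defaulting to "keep running" whenever a check is inconclusive trades some savings for a lower risk of downtime.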

Running cost optimization scripts manually and irregularly.

Manual runs miss opportunities to save money continuously.

Schedule scripts to run automatically at regular intervals.
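As a rough sketch of regular, automatic runs using only the Python standard library, the loop below calls a job at a fixed interval. The function name run_periodically and its parameters are assumptions for illustration. In production, a cron job or a workflow scheduler is usually a better fit than a long-running loop, because it survives process restarts.

```python
import time


def run_periodically(job, interval_seconds, max_runs=None):
    """Call job() every interval_seconds and return the number of runs.

    max_runs limits the number of calls; None means run until interrupted.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        started = time.monotonic()
        job()
        runs += 1
        if max_runs is not None and runs >= max_runs:
            break
        # Subtract the job's own duration so runs stay on schedule.
        elapsed = time.monotonic() - started
        time.sleep(max(0.0, interval_seconds - elapsed))
    return runs


if __name__ == "__main__":
    # A placeholder job; replace it with the main() function of the pipeline script.
    run_periodically(lambda: print("Running cost optimization check..."),
                     interval_seconds=0.1, max_runs=2)
```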

Summary

Run a script to detect and stop idle cloud resources to save costs.

Log cost-saving metrics to MLflow for tracking and analysis.

Use the MLflow UI to monitor cost optimization runs and results.

Practice

1. What is the main goal of cost optimization at scale in MLOps?

easy

A. To increase the number of servers regardless of workload

B. To avoid monitoring costs after deployment

C. To use only the most expensive cloud resources

D. To save money by matching resource use to workload needs

Solution

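Step 1: Understand cost optimization purpose. Cost optimization aims to reduce cloud spending without starving workloads of the resources they need.

Step 2: Match resources to workload needs. Options A, B, and C each increase or ignore costs; only option D sizes resources to the actual workload.

Final Answer: D. To save money by matching resource use to workload needs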
