MLOps · DevOps · ~10 mins

Cost optimization at scale in MLOps - Commands & Configuration

Introduction
Running machine learning workloads in the cloud can become expensive quickly. Cost optimization at scale helps you reduce cloud spending by managing resources efficiently and automating cost-saving actions.
Use this pattern when you want to:
Automatically stop idle or underused cloud compute instances to save money.
Track and alert on unexpected spikes in cloud resource usage.
Schedule training jobs during cheaper off-peak hours.
Use cheaper spot instances for non-critical workloads.
Monitor and optimize storage costs for large datasets.
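The first scenario, detecting idle instances, usually comes down to comparing recent utilization against a threshold. Here is a minimal sketch of that decision logic, assuming utilization samples have already been collected from your monitoring system; the threshold and instance IDs are illustrative, not real values.

```python
# Hypothetical idle-detection logic: an instance counts as "idle" if its
# average CPU utilization over the lookback window stays below a threshold.
IDLE_CPU_THRESHOLD = 5.0  # percent; illustrative value

def is_idle(cpu_samples, threshold=IDLE_CPU_THRESHOLD):
    """Return True if average CPU utilization is below the threshold."""
    if not cpu_samples:
        return False  # no data: assume the instance is busy, to be safe
    return sum(cpu_samples) / len(cpu_samples) < threshold

def find_idle_instances(utilization_by_instance):
    """Filter a {instance_id: [cpu%]} mapping down to idle instance IDs."""
    return [iid for iid, samples in utilization_by_instance.items()
            if is_idle(samples)]

# Example usage with made-up utilization data:
samples = {
    "instance-123": [1.2, 0.8, 2.1],    # mostly idle
    "instance-456": [55.0, 61.3, 48.9]  # busy
}
print(find_idle_instances(samples))  # ['instance-123']
```

In production the samples would come from your provider's monitoring API (CloudWatch, Cloud Monitoring, etc.); the decision logic stays the same.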
Config File - cost_optimization_pipeline.py
import mlflow

def check_idle_resources():
    # Simulate checking for idle resources
    print("Checking for idle resources...")
    return ["instance-123", "instance-456"]

def stop_resources(instances):
    # Simulate stopping each idle instance
    for instance in instances:
        print(f"Stopping {instance} to save cost.")

def main():
    # Use a context manager so the run is closed even if an error occurs
    with mlflow.start_run(run_name="cost_optimization"):
        idle_instances = check_idle_resources()
        if idle_instances:
            stop_resources(idle_instances)
            mlflow.log_metric("stopped_instances", len(idle_instances))
        else:
            print("No idle resources found.")
            mlflow.log_metric("stopped_instances", 0)

if __name__ == "__main__":
    main()

This Python script uses MLflow to track a cost optimization run.

The check_idle_resources function simulates finding idle cloud instances.

The stop_resources function simulates stopping those instances to save money.

The number of stopped instances is logged to MLflow as a metric, so cost savings can be monitored across runs.
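Counting stopped instances is a start, but a dollar figure is easier to report. As a sketch, you could estimate hourly savings from per-instance rates and log that too; the rates below are made-up assumptions, and in the real script you would pass the result to mlflow.log_metric.

```python
# Illustrative cost accounting: hourly rates are made-up assumptions,
# not real cloud pricing.
HOURLY_RATE = {"instance-123": 0.45, "instance-456": 1.20}  # USD/hour

def estimated_hourly_savings(stopped_instances, rates=HOURLY_RATE):
    """Sum the hourly cost of every instance that was stopped."""
    return sum(rates.get(i, 0.0) for i in stopped_instances)

savings = estimated_hourly_savings(["instance-123", "instance-456"])
print(f"Estimated savings: ${savings:.2f}/hour")  # Estimated savings: $1.65/hour
```

Logging a monetary metric alongside the instance count makes the savings trend legible to non-engineers reviewing the MLflow dashboard.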

Commands
Run the cost optimization script to check for idle resources and stop them, logging metrics to MLflow.
Terminal
python cost_optimization_pipeline.py
Expected Output
Checking for idle resources...
Stopping instance-123 to save cost.
Stopping instance-456 to save cost.
Start the MLflow tracking UI to view logged metrics and runs for cost optimization.
Terminal
mlflow ui
Expected Output
2024/06/01 12:00:00 INFO mlflow.server: Starting MLflow tracking UI at http://127.0.0.1:5000
--host - Specify the network interface to listen on
--port - Specify the port for the UI
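For example, to make the UI reachable from other machines on a non-default port (the values here are illustrative):

```shell
# Bind to all interfaces on port 8080 (defaults are 127.0.0.1 and 5000)
mlflow ui --host 0.0.0.0 --port 8080
```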
Key Concept

If you remember nothing else from this pattern, remember: automate detection and shutdown of idle resources to save cloud costs effectively.

Common Mistakes
Not logging cost-saving metrics to MLflow.
Without metrics, you cannot track or prove cost savings over time.
Always log key metrics like number of stopped instances or hours saved.
Stopping critical resources by mistake.
This causes downtime and disrupts production workloads.
Implement safeguards to identify only truly idle or non-critical resources.
Running cost optimization scripts manually and irregularly.
Manual runs miss opportunities to save money continuously.
Schedule scripts to run automatically at regular intervals.
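To guard against the second mistake, stopping critical resources, one common approach is to require an explicit non-critical tag and keep a protected allowlist. The sketch below is a hypothetical safeguard; the tag name, tag value, and instance IDs are all illustrative.

```python
# Hypothetical safeguard: only stop instances explicitly tagged as
# non-critical AND not on a protected allowlist.
PROTECTED = {"instance-prod-db"}  # never stop these; made-up ID

def safe_to_stop(instance_id, tags):
    """Allow stopping only if tagged non-critical and not protected."""
    if instance_id in PROTECTED:
        return False
    return tags.get("criticality") == "non-critical"

candidates = {
    "instance-123": {"criticality": "non-critical"},
    "instance-prod-db": {"criticality": "non-critical"},  # protected anyway
    "instance-789": {"criticality": "critical"},
}
to_stop = [i for i, t in candidates.items() if safe_to_stop(i, t)]
print(to_stop)  # ['instance-123']
```

Requiring an opt-in tag means an untagged instance is never stopped by default, which fails safe when teams forget to label their resources.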
Summary
Run a script to detect and stop idle cloud resources to save costs.
Log cost-saving metrics to MLflow for tracking and analysis.
Use MLflow UI to monitor cost optimization runs and results.