
How to Monitor Spark Job in PySpark: Simple Guide

To monitor a Spark job in PySpark, use the Spark UI, the web interface served by the Spark driver, which shows job stages, tasks, and executors. You can also check the driver logs, or track job progress and performance programmatically through the SparkContext's status tracker and, for event-level detail, Spark's JVM-side SparkListener interface.

Syntax

Monitoring Spark jobs in PySpark mainly involves accessing the Spark UI and using SparkContext methods.

  • spark.sparkContext.uiWebUrl: property holding the URL of the Spark UI for this application.
  • spark.sparkContext.statusTracker(): method returning a StatusTracker for programmatic access to job and stage status.
  • SparkListener: JVM-side interface for listening to job events in code.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MonitorExample").getOrCreate()

# Get Spark UI URL
ui_url = spark.sparkContext.uiWebUrl
print(f"Spark UI available at: {ui_url}")

# Access the status tracker (note: statusTracker() is a method)
status_tracker = spark.sparkContext.statusTracker()

# Get active jobs (PySpark spells this getActiveJobsIds)
active_jobs = status_tracker.getActiveJobsIds()
print(f"Active job IDs: {active_jobs}")
```
Output
```
Spark UI available at: http://<driver-node>:4040
Active job IDs: []
```

Example

This example shows how to start a Spark job, access the Spark UI URL, and check active jobs programmatically.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MonitorExample").getOrCreate()

# Create a simple DataFrame and run an action to trigger a job
data = [(1, "apple"), (2, "banana"), (3, "cherry")]
df = spark.createDataFrame(data, ["id", "fruit"])

# Trigger an action
count = df.count()
print(f"Row count: {count}")

# Get Spark UI URL
ui_url = spark.sparkContext.uiWebUrl
print(f"Spark UI available at: {ui_url}")

# Check active jobs (empty here because count() has already finished)
status_tracker = spark.sparkContext.statusTracker()
active_jobs = status_tracker.getActiveJobsIds()
print(f"Active job IDs: {active_jobs}")
```
Output
```
Row count: 3
Spark UI available at: http://<driver-node>:4040
Active job IDs: []
```

Common Pitfalls

  • Not accessing the correct Spark UI port (the default is 4040; a second application on the same host binds 4041, then 4042, and so on).
  • Trying to monitor jobs after the Spark session is stopped, which closes the UI.
  • Ignoring that Spark UI is only available while the application runs.
  • Not triggering any action, so no jobs appear in the UI.
```python
from pyspark.sql import SparkSession

# Wrong: Spark session stopped before checking the UI
spark = SparkSession.builder.appName("PitfallExample").getOrCreate()
spark.stop()

# Trying to get the UI URL after stop
ui_url = spark.sparkContext.uiWebUrl  # None (or an error) once the context is stopped
print(f"Spark UI: {ui_url}")

# Right way: check the UI before stopping
spark = SparkSession.builder.appName("PitfallExample").getOrCreate()
ui_url = spark.sparkContext.uiWebUrl
print(f"Spark UI: {ui_url}")
# Then stop when done
spark.stop()
```
Output
```
Spark UI: None
Spark UI: http://<driver-node>:4040
```

Quick Reference

Summary tips for monitoring Spark jobs in PySpark:

  • Open the Spark UI at http://<driver-node>:4040 while the application is running.
  • Use spark.sparkContext.statusTracker() to get job and stage info programmatically.
  • Trigger actions like count() to start jobs that appear in the UI.
  • Check driver logs for detailed job progress and errors.
  • Use Spark listeners for advanced monitoring in code.

Key Takeaways

  • Use the Spark UI at port 4040 to visually monitor job stages and tasks during execution.
  • Access job status programmatically with sparkContext.statusTracker() for custom monitoring.
  • Always trigger an action in PySpark to start a job that can be monitored.
  • The Spark UI is only available while the Spark session is active; check it before stopping.
  • Driver logs and Spark listeners provide deeper insights into job progress and issues.