How to Monitor a Spark Job in PySpark: Simple Guide
To monitor a Spark job in PySpark, use the Spark UI, accessible via the Spark application's web interface, which shows job stages, tasks, and executors. Additionally, you can check logs or use the SparkListener interface programmatically to track job progress and performance.
Syntax
Monitoring Spark jobs in PySpark mainly involves accessing the Spark UI and using SparkContext methods.
- spark.sparkContext.uiWebUrl: URL to open the Spark UI in a browser.
- spark.sparkContext.statusTracker(): programmatic access to job and stage status.
- SparkListener: interface to listen to job events in code.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MonitorExample").getOrCreate()

# Get the Spark UI URL
ui_url = spark.sparkContext.uiWebUrl
print(f"Spark UI available at: {ui_url}")

# Access the status tracker (statusTracker() is a method in PySpark)
status_tracker = spark.sparkContext.statusTracker()

# Get active jobs
active_jobs = status_tracker.getActiveJobIds()
print(f"Active job IDs: {active_jobs}")
```
Output
Spark UI available at: http://<driver-node>:4040
Active job IDs: []
Example
This example shows how to start a Spark job, access the Spark UI URL, and check active jobs programmatically.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MonitorExample").getOrCreate()

# Create a simple DataFrame and run an action to trigger a job
data = [(1, "apple"), (2, "banana"), (3, "cherry")]
df = spark.createDataFrame(data, ["id", "fruit"])

# Trigger an action
count = df.count()
print(f"Row count: {count}")

# Get the Spark UI URL
ui_url = spark.sparkContext.uiWebUrl
print(f"Spark UI available at: {ui_url}")

# Check active jobs
status_tracker = spark.sparkContext.statusTracker()
active_jobs = status_tracker.getActiveJobIds()
print(f"Active job IDs: {active_jobs}")
```
Output
Row count: 3
Spark UI available at: http://<driver-node>:4040
Active job IDs: []
Common Pitfalls
- Not accessing the correct Spark UI port (default is 4040, but it may change if multiple apps run).
- Trying to monitor jobs after the Spark session is stopped, which closes the UI.
- Ignoring that Spark UI is only available while the application runs.
- Not triggering any action, so no jobs appear in the UI.
```python
from pyspark.sql import SparkSession

# Wrong: Spark session stopped before checking the UI
spark = SparkSession.builder.appName("PitfallExample").getOrCreate()
spark.stop()

# Trying to get the UI URL after stop
ui_url = spark.sparkContext.uiWebUrl  # This will be None or raise an error
print(f"Spark UI: {ui_url}")

# Right way: check the UI before stopping
spark = SparkSession.builder.appName("PitfallExample").getOrCreate()
ui_url = spark.sparkContext.uiWebUrl
print(f"Spark UI: {ui_url}")

# Then stop when done
spark.stop()
```
Output
Spark UI: None
Spark UI: http://<driver-node>:4040
Quick Reference
Summary tips for monitoring Spark jobs in PySpark:
- Open the Spark UI at http://driver-node:4040 during job execution.
- Use spark.sparkContext.statusTracker() to get job and stage info programmatically.
- Trigger actions like count() to start jobs visible in the UI.
- Check driver logs for detailed job progress and errors.
- Use Spark listeners for advanced monitoring in code.
Key Takeaways
- Use the Spark UI at port 4040 to visually monitor job stages and tasks during execution.
- Access job status programmatically with sparkContext.statusTracker() for custom monitoring.
- Always trigger an action in PySpark to start a job that can be monitored.
- The Spark UI is only available while the Spark session is active; check it before stopping.
- Driver logs and Spark listeners provide deeper insights into job progress and issues.