SparkSession vs SparkContext in PySpark: Key Differences and Usage
SparkSession is the newer, unified entry point for working with Spark, combining the SQL, DataFrame, and streaming contexts. SparkContext is the older core object that manages the connection to the Spark cluster and is now mostly accessed through SparkSession.
Quick Comparison
This table summarizes the main differences between SparkSession and SparkContext in PySpark.
| Aspect | SparkSession | SparkContext |
|---|---|---|
| Introduced in | Spark 2.0 (2016) | Spark 1.0 (2014) |
| Purpose | Unified entry point for DataFrame, SQL, streaming, and SparkContext | Core object to connect to Spark cluster and manage resources |
| Access | Directly created by user; contains SparkContext inside | Created internally by SparkSession or standalone |
| API Support | Supports DataFrame, Dataset, SQL, streaming, and RDD | Supports only RDD and basic Spark operations |
| Recommended Usage | Preferred for all new Spark applications | Legacy usage; mainly accessed via SparkSession.sparkContext |
| Ease of Use | Simplifies Spark programming with high-level APIs | Lower-level, requires more setup and management |
Key Differences
SparkContext is the original object that connects your application to the Spark cluster. It manages the cluster resources and is responsible for creating RDDs (Resilient Distributed Datasets), the low-level data abstraction in Spark. However, it does not support newer APIs like DataFrames or SQL directly.
SparkSession was introduced in Spark 2.0 to unify all Spark functionalities under one object. It internally creates and manages a SparkContext but also provides easy access to DataFrame and SQL APIs, making it simpler to write Spark code. This means you can use SparkSession to do everything SparkContext can do, plus more.
In PySpark, you typically create a SparkSession at the start of your program. If you need to access the lower-level RDD API, you can get the SparkContext from SparkSession.sparkContext. This design encourages using the higher-level APIs for better performance and easier coding.
Code Comparison
Here is how you create a SparkContext directly and use it to count elements in an RDD.
```python
from pyspark import SparkContext

# Create a SparkContext directly: master URL, then application name
sc = SparkContext('local', 'SparkContextExample')

rdd = sc.parallelize([1, 2, 3, 4, 5])
count = rdd.count()
print(f"Count using SparkContext: {count}")

sc.stop()
```
SparkSession Equivalent
Here is how you do the same task using SparkSession, which internally manages SparkContext.
```python
from pyspark.sql import SparkSession

# Build a SparkSession; it creates and manages a SparkContext internally
spark = SparkSession.builder.master('local').appName('SparkSessionExample').getOrCreate()

# Reach the RDD API through the session's embedded SparkContext
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
count = rdd.count()
print(f"Count using SparkSession: {count}")

spark.stop()
```
When to Use Which
Choose SparkSession for all new PySpark projects because it provides a simple, unified interface to Spark's powerful features like DataFrames, SQL, and streaming. It reduces complexity and improves productivity.
Use SparkContext only if you need to work with legacy code or require direct access to low-level RDD APIs without the overhead of DataFrames or SQL. Even then, access it through SparkSession.sparkContext to keep your code modern and consistent.
Key Takeaways
- SparkSession is the modern, unified entry point for Spark applications in PySpark.
- SparkContext is the older core object, mainly for managing the cluster connection and RDDs.
- Use SparkSession for DataFrame, SQL, and streaming tasks for simpler, more efficient code.
- Access SparkContext via SparkSession.sparkContext when low-level RDD operations are needed.
- Avoid creating SparkContext directly in new applications; prefer SparkSession.