
SparkSession vs SparkContext in PySpark: Key Differences and Usage

In PySpark, SparkSession is the newer, unified entry point for working with Spark, combining SQL, DataFrame, and streaming contexts. SparkContext is the older core object that manages the connection to the Spark cluster but is now mostly accessed through SparkSession.

Quick Comparison

This table summarizes the main differences between SparkSession and SparkContext in PySpark.

| Aspect | SparkSession | SparkContext |
| --- | --- | --- |
| Introduced in | Spark 2.0 (2016) | Spark 1.0 (2014) |
| Purpose | Unified entry point for DataFrame, SQL, streaming, and SparkContext | Core object to connect to the Spark cluster and manage resources |
| Access | Created directly by the user; contains a SparkContext inside | Created internally by SparkSession, or standalone |
| API Support | Supports DataFrame, Dataset, SQL, streaming, and RDD | Supports only RDD and basic Spark operations |
| Recommended Usage | Preferred for all new Spark applications | Legacy usage; mainly accessed via SparkSession.sparkContext |
| Ease of Use | Simplifies Spark programming with high-level APIs | Lower-level; requires more setup and management |

Key Differences

SparkContext is the original object that connects your application to the Spark cluster. It manages the cluster resources and is responsible for creating RDDs (Resilient Distributed Datasets), the low-level data abstraction in Spark. However, it does not support newer APIs like DataFrames or SQL directly.

SparkSession was introduced in Spark 2.0 to unify all Spark functionalities under one object. It internally creates and manages a SparkContext but also provides easy access to DataFrame and SQL APIs, making it simpler to write Spark code. This means you can use SparkSession to do everything SparkContext can do, plus more.
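
For example, here is a minimal sketch of how a single SparkSession covers the DataFrame, SQL, and RDD entry points; the app name, column names, and sample rows are arbitrary choices for illustration.

```python
from pyspark.sql import SparkSession

# One SparkSession gives access to DataFrames, SQL, and the underlying SparkContext
spark = SparkSession.builder.master('local').appName('UnifiedExample').getOrCreate()

# DataFrame API (sample rows and column names are made up for illustration)
df = spark.createDataFrame([(1, 'alice'), (2, 'bob')], ['id', 'name'])
df.show()

# SQL API on the same session
df.createOrReplaceTempView('people')
spark.sql('SELECT name FROM people WHERE id = 2').show()

# The underlying SparkContext is still available for low-level RDD work
print(spark.sparkContext.appName)

spark.stop()
```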

In PySpark, you typically create a SparkSession at the start of your program. If you need to access the lower-level RDD API, you can get the SparkContext from SparkSession.sparkContext. This design encourages using the higher-level APIs for better performance and easier coding.


Code Comparison

Here is how you create a SparkContext directly and use it to count elements in an RDD.

```python
from pyspark import SparkContext

# Connect to Spark directly through a SparkContext ('local' master, app name 'SparkContextExample')
sc = SparkContext('local', 'SparkContextExample')

# Build an RDD from a Python list and count its elements
rdd = sc.parallelize([1, 2, 3, 4, 5])
count = rdd.count()
print(f"Count using SparkContext: {count}")

sc.stop()
```

Output:

```
Count using SparkContext: 5
```

SparkSession Equivalent

Here is how you do the same task using SparkSession, which internally manages SparkContext.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the SparkContext is created and managed internally
spark = SparkSession.builder.master('local').appName('SparkSessionExample').getOrCreate()

# Access the underlying SparkContext for the low-level RDD API
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
count = rdd.count()
print(f"Count using SparkSession: {count}")

spark.stop()
```

Output:

```
Count using SparkSession: 5
```

When to Use Which

Choose SparkSession for all new PySpark projects because it provides a simple, unified interface to Spark's powerful features like DataFrames, SQL, and streaming. It reduces complexity and improves productivity.
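
As a rough sketch of what a new project's setup might look like (the master URL and app name below are placeholders), everything flows from the SparkSession builder:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Build the session once at the start of the application
spark = (SparkSession.builder
         .master('local[*]')            # placeholder: run locally with all cores
         .appName('NewProjectExample')  # placeholder app name
         .getOrCreate())

# spark.range produces a DataFrame with an 'id' column; aggregate it with the DataFrame API
totals = spark.range(1, 6).agg(F.sum('id').alias('total'))
totals.show()

spark.stop()
```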

Use SparkContext only if you need to work with legacy code or require direct access to low-level RDD APIs without the overhead of DataFrames or SQL. Even then, access it through SparkSession.sparkContext to keep your code modern and consistent.

Key Takeaways

- SparkSession is the modern, unified entry point for Spark applications in PySpark.
- SparkContext is the older core object, mainly responsible for the cluster connection and RDDs.
- Use SparkSession for DataFrame, SQL, and streaming tasks to write simpler, more efficient code.
- Access SparkContext via SparkSession.sparkContext when low-level RDD operations are needed.
- Avoid creating SparkContext directly in new applications; prefer SparkSession.