Apache-spark · How-To · Beginner · 3 min read

How to Use Spark Shell in PySpark: Quick Guide

To use the Spark shell in PySpark, run the command pyspark in your terminal. This opens an interactive shell where you can write Python code to work with Spark's APIs directly.
📝

Syntax

The basic command to start the Spark shell with PySpark is pyspark. This launches an interactive Python shell with a SparkContext already available as sc and a SparkSession already available as spark.

  • pyspark: Starts the PySpark interactive shell.
  • sc: SparkContext object for low-level Spark operations.
  • spark: SparkSession object for DataFrame and SQL operations.
bash
pyspark
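
Once the shell starts, both objects are ready to use without any imports. Below is a minimal sketch of a first session, assuming a default local installation; the sample values are purely illustrative.

python
# 'sc' and 'spark' are pre-created by the shell; no imports needed
rdd = sc.parallelize([1, 2, 3, 4])  # low-level RDD API via the SparkContext
print(rdd.sum())

print(spark.range(3).count())  # DataFrame API via the SparkSession
Output
10
3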
💻

Example

This example shows how to start the PySpark shell, create a simple DataFrame, and display its contents.

python
from pyspark.sql import SparkSession

# In the shell, 'spark' already exists; getOrCreate() simply returns it
spark = SparkSession.builder.appName('Example').getOrCreate()

# Create a simple DataFrame
data = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')]
columns = ['id', 'name']
df = spark.createDataFrame(data, columns)

# Show the DataFrame
print('DataFrame content:')
df.show()
Output
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  3|Cathy|
+---+-----+
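
Because spark is a SparkSession, the same DataFrame can also be queried with SQL. Here is a short follow-up sketch, continuing the session above; the view name people is an illustrative choice, not part of the original example.

python
# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView('people')
spark.sql('SELECT name FROM people WHERE id > 1').show()
Output
+-----+
| name|
+-----+
|  Bob|
|Cathy|
+-----+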
⚠️

Common Pitfalls

Some common mistakes when using the PySpark shell include:

  • Not having Spark installed or configured properly, causing the pyspark command to fail.
  • Running PySpark code outside the shell without initializing a SparkSession or SparkContext.
  • Confusing the sc (SparkContext) and spark (SparkSession) objects; see the sketch after the example below.

Always start the shell with pyspark and use the provided spark session for DataFrame operations.

python
# Wrong way: using 'spark' outside the shell without creating it first
df = spark.createDataFrame([(1, 'Test')], ['id', 'name'])  # NameError: 'spark' is not defined

# Right way: start the shell, or create the SparkSession yourself
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'Test')], ['id', 'name'])
df.show()
Output
NameError: name 'spark' is not defined
+---+----+
| id|name|
+---+----+
|  1|Test|
+---+----+
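
The third pitfall, mixing up sc and spark, usually surfaces as a missing-attribute error. A minimal sketch of the difference, run one statement at a time in the shell:

python
# 'sc' produces RDDs, which have no DataFrame methods
sc.parallelize([(1, 'Test')]).show()  # AttributeError: 'RDD' object has no attribute 'show'

# 'spark' is the entry point for DataFrames
spark.createDataFrame([(1, 'Test')], ['id', 'name']).show()
Output
AttributeError: 'RDD' object has no attribute 'show'
+---+----+
| id|name|
+---+----+
|  1|Test|
+---+----+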
📊

Quick Reference

Here is a quick summary of commands and objects when using PySpark shell:

  • pyspark: Starts the PySpark interactive shell.
  • sc: SparkContext for RDD operations.
  • spark: SparkSession for DataFrame and SQL operations.
  • df.show(): Displays DataFrame content in tabular form.
  • spark.stop(): Stops the Spark session.
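
When you are finished, spark.stop() releases the session's resources; quitting the shell with exit() or Ctrl+D does this for you automatically. A small sketch, in case you stop the session manually and later need a new one:

python
# Stop the current session and release its resources
spark.stop()

# Create a fresh session later if needed
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()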
✅

Key Takeaways

  • Start the PySpark shell by running pyspark in your terminal.
  • Use the spark object in the shell to create and manipulate DataFrames.
  • Ensure Spark is installed and configured before launching the shell.
  • Avoid running PySpark code without first initializing a SparkSession or SparkContext.
  • Use df.show() to quickly view DataFrame contents in the shell.