Apache-spark · How-To · Beginner · 3 min read

How to Use Spark Shell in PySpark: Quick Guide

To use the Spark shell in PySpark, run the command pyspark in your terminal. This opens an interactive shell where you can write Python code to work with Spark's APIs directly.
📝

Syntax

The basic command to start the Spark shell with PySpark is pyspark. This launches an interactive Python shell with a SparkContext already available as sc and a SparkSession already available as spark.

  • pyspark: Starts the PySpark interactive shell.
  • sc: SparkContext object for low-level Spark operations.
  • spark: SparkSession object for DataFrame and SQL operations.
bash
pyspark
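
Once the shell starts, both objects are ready to use without any imports. Below is a minimal sketch of a first session, assuming a default local installation; the sample values are purely illustrative.

python
# 'sc' and 'spark' are pre-created by the shell; no imports needed
rdd = sc.parallelize([1, 2, 3, 4])  # low-level RDD API via the SparkContext
print(rdd.sum())

print(spark.range(3).count())  # DataFrame API via the SparkSession
Output
10
3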
💻

Example

This example shows how to start the PySpark shell, create a simple DataFrame, and display its contents.

python
from pyspark.sql import SparkSession

# In the shell, 'spark' already exists; getOrCreate() simply returns it
spark = SparkSession.builder.appName('Example').getOrCreate()

# Create a simple DataFrame
data = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')]
columns = ['id', 'name']
df = spark.createDataFrame(data, columns)

# Show the DataFrame
print('DataFrame content:')
df.show()
Output
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  3|Cathy|
+---+-----+
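
Because spark is a SparkSession, the same DataFrame can also be queried with SQL. Here is a short follow-up sketch, continuing the session above; the view name people is an illustrative choice, not part of the original example.

python
# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView('people')
spark.sql('SELECT name FROM people WHERE id > 1').show()
Output
+-----+
| name|
+-----+
|  Bob|
|Cathy|
+-----+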
⚠️

Common Pitfalls

Some common mistakes when using the PySpark shell include:

  • Not having Spark installed or configured properly, causing the pyspark command to fail.
  • Running PySpark code outside the shell without initializing a SparkSession or SparkContext.
  • Confusing the sc (SparkContext) and spark (SparkSession) objects; see the sketch after the example below.

Always start the shell with pyspark and use the provided spark session for DataFrame operations.

python
# Wrong way: using 'spark' outside the shell without creating it first
df = spark.createDataFrame([(1, 'Test')], ['id', 'name'])  # NameError: 'spark' is not defined

# Right way: start the shell, or create the SparkSession yourself
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'Test')], ['id', 'name'])
df.show()
Output
NameError: name 'spark' is not defined
+---+----+
| id|name|
+---+----+
|  1|Test|
+---+----+
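
The third pitfall, mixing up sc and spark, usually surfaces as a missing-attribute error. A minimal sketch of the difference, run one statement at a time in the shell:

python
# 'sc' produces RDDs, which have no DataFrame methods
sc.parallelize([(1, 'Test')]).show()  # AttributeError: 'RDD' object has no attribute 'show'

# 'spark' is the entry point for DataFrames
spark.createDataFrame([(1, 'Test')], ['id', 'name']).show()
Output
AttributeError: 'RDD' object has no attribute 'show'
+---+----+
| id|name|
+---+----+
|  1|Test|
+---+----+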
📊

Quick Reference

Here is a quick summary of commands and objects when using PySpark shell:

  • pyspark: Starts the PySpark interactive shell.
  • sc: SparkContext for RDD operations.
  • spark: SparkSession for DataFrame and SQL operations.
  • df.show(): Displays DataFrame content in tabular form.
  • spark.stop(): Stops the Spark session.
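
When you are finished, spark.stop() releases the session's resources; quitting the shell with exit() or Ctrl+D does this for you automatically. A small sketch, in case you stop the session manually and later need a new one:

python
# Stop the current session and release its resources
spark.stop()

# Create a fresh session later if needed
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()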
✅

Key Takeaways

  • Start the PySpark shell by running pyspark in your terminal.
  • Use the spark object in the shell to create and manipulate DataFrames.
  • Ensure Spark is installed and configured before launching the shell.
  • Avoid running PySpark code without first initializing a SparkSession or SparkContext.
  • Use df.show() to quickly view DataFrame contents in the shell.