How to Use Spark Shell in PySpark: Quick Guide
To use the Spark shell in PySpark, run the command pyspark in your terminal. This opens an interactive shell where you can write Python code against Spark's APIs directly.
Syntax
The basic command to start the Spark shell with PySpark is pyspark. It launches an interactive Python shell with a SparkContext already available as sc and a SparkSession as spark.
- pyspark: Starts the PySpark interactive shell.
- sc: SparkContext object for low-level RDD operations.
- spark: SparkSession object for DataFrame and SQL operations.
```bash
pyspark
```
Example
This example shows how to start the PySpark shell, create a simple DataFrame, and display its contents.
```python
from pyspark.sql import SparkSession

# Create a SparkSession (already available as 'spark' in the shell)
spark = SparkSession.builder.appName('Example').getOrCreate()

# Create a simple DataFrame
data = [(1, 'Alice'), (2, 'Bob'), (3, 'Cathy')]
columns = ['id', 'name']
df = spark.createDataFrame(data, columns)

# Show the DataFrame
print('DataFrame content:')
df.show()
```
Output
+---+-----+
| id| name|
+---+-----+
| 1|Alice|
| 2| Bob|
| 3|Cathy|
+---+-----+
Common Pitfalls
Some common mistakes when using the PySpark shell include:
- Not having Spark installed or configured properly, causing the pyspark command to fail.
- Trying to run PySpark code outside the shell without initializing a SparkSession or SparkContext.
- Confusing the sc (SparkContext) and spark (SparkSession) objects.
Always start the shell with pyspark and use the provided spark session for DataFrame operations.
```python
# Wrong way: running PySpark code without a SparkSession
from pyspark.sql import SparkSession

# This raises an error because 'spark' is not defined yet
df = spark.createDataFrame([(1, 'Test')], ['id', 'name'])  # NameError: name 'spark' is not defined

# Right way: start the shell or create a SparkSession first
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 'Test')], ['id', 'name'])
df.show()
```
Output
NameError: name 'spark' is not defined
+---+----+
| id|name|
+---+----+
| 1|Test|
+---+----+
Quick Reference
Here is a quick summary of commands and objects when using PySpark shell:
| Command/Object | Description |
|---|---|
| pyspark | Starts the PySpark interactive shell |
| sc | SparkContext for RDD operations |
| spark | SparkSession for DataFrame and SQL operations |
| df.show() | Displays DataFrame content in tabular form |
| spark.stop() | Stops the Spark session |
Key Takeaways
- Start the PySpark shell by running the command 'pyspark' in your terminal.
- Use the 'spark' object in the shell to create and manipulate DataFrames.
- Always ensure Spark is installed and configured before running the shell.
- Avoid running PySpark code without initializing a SparkSession or SparkContext.
- Use 'df.show()' to quickly view DataFrame contents in the shell.