
How to Use SQL in PySpark: Syntax, Example, and Tips

In PySpark, you can use SQL by creating a temporary view of a DataFrame with createOrReplaceTempView and then running SQL queries using spark.sql(). This lets you write SQL queries directly on your data within the Spark environment.
📝 Syntax

To use SQL in PySpark, first create a temporary view from a DataFrame using createOrReplaceTempView('view_name'). Then run SQL queries with spark.sql('SQL query'), which returns a new DataFrame.

  • createOrReplaceTempView: Registers the DataFrame as a session-scoped temporary view.
  • spark.sql(): Executes the SQL query string on the registered view.
  • The result is a DataFrame you can use for further processing.
```python
df.createOrReplaceTempView('my_table')
result_df = spark.sql('SELECT * FROM my_table WHERE age > 30')
```
💻 Example

This example shows how to create a DataFrame, register it as a temporary view, and run a SQL query to filter data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SQLExample').getOrCreate()

# Create sample data
data = [(1, 'Alice', 29), (2, 'Bob', 35), (3, 'Cathy', 23)]
columns = ['id', 'name', 'age']

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Register DataFrame as a temporary view
df.createOrReplaceTempView('people')

# Run SQL query
result_df = spark.sql('SELECT name, age FROM people WHERE age > 25')

# Show results
result_df.show()
```
Output

```text
+-----+---+
| name|age|
+-----+---+
|Alice| 29|
|  Bob| 35|
+-----+---+
```
⚠️ Common Pitfalls

Common mistakes when using SQL in PySpark include:

  • Not creating a temporary view before running SQL queries, which causes errors.
  • Using incorrect view names or SQL syntax errors.
  • Expecting spark.sql() to return a list or pandas DataFrame instead of a Spark DataFrame.

Always remember to register your DataFrame as a temporary view before querying.

```python
try:
    # This will fail because no view is created
    spark.sql('SELECT * FROM missing_view').show()
except Exception as e:
    print(f'Error: {e}')

# Correct way
# df.createOrReplaceTempView('missing_view')
# spark.sql('SELECT * FROM missing_view').show()
```
Output

```text
Error: Table or view not found: missing_view;
```
📊 Quick Reference

| Step | Command | Description |
| --- | --- | --- |
| 1 | `df.createOrReplaceTempView('view_name')` | Register DataFrame as a temporary SQL view |
| 2 | `spark.sql('SELECT * FROM view_name')` | Run SQL query on the registered view |
| 3 | `result_df.show()` | Display the query results |
| 4 | `result_df.collect()` | Collect results to the driver as a list (use carefully) |
✅ Key Takeaways

  • Register your DataFrame as a temporary view before running SQL queries.
  • Use spark.sql() to run queries; it returns a Spark DataFrame, not a pandas DataFrame or a list.
  • Check your SQL syntax and view names carefully to avoid errors.
  • Use show() to quickly display query results in PySpark.