How to Use SQL in PySpark: Syntax, Example, and Tips
In PySpark, you can use SQL by creating a temporary view of a DataFrame with
createOrReplaceTempView and then running SQL queries using spark.sql(). This lets you write SQL queries directly on your data within the Spark environment.Syntax
To use SQL in PySpark, first create a temporary view from a DataFrame using createOrReplaceTempView('view_name'). Then run SQL queries with spark.sql('SQL query'), which returns a new DataFrame.
createOrReplaceTempView: Registers the DataFrame as a temporary table.spark.sql(): Executes the SQL query string on the registered view.- The result is a DataFrame you can use for further processing.
python
df.createOrReplaceTempView('my_table') result_df = spark.sql('SELECT * FROM my_table WHERE age > 30')
Example
This example shows how to create a DataFrame, register it as a temporary view, and run a SQL query to filter data.
python
from pyspark.sql import SparkSession spark = SparkSession.builder.appName('SQLExample').getOrCreate() # Create sample data data = [(1, 'Alice', 29), (2, 'Bob', 35), (3, 'Cathy', 23)] columns = ['id', 'name', 'age'] # Create DataFrame df = spark.createDataFrame(data, columns) # Register DataFrame as a temporary view df.createOrReplaceTempView('people') # Run SQL query result_df = spark.sql('SELECT name, age FROM people WHERE age > 25') # Show results result_df.show()
Output
+-----+---+
| name|age|
+-----+---+
|Alice| 29|
| Bob| 35|
+-----+---+
Common Pitfalls
Common mistakes when using SQL in PySpark include:
- Not creating a temporary view before running SQL queries, which causes errors.
- Using incorrect view names or SQL syntax errors.
- Expecting
spark.sql()to return a list or pandas DataFrame instead of a Spark DataFrame.
Always remember to register your DataFrame as a temporary view before querying.
python
try: # This will fail because no view is created spark.sql('SELECT * FROM missing_view').show() except Exception as e: print(f'Error: {e}') # Correct way # df.createOrReplaceTempView('missing_view') # spark.sql('SELECT * FROM missing_view').show()
Output
Error: Table or view not found: missing_view;
Quick Reference
| Step | Command | Description |
|---|---|---|
| 1 | df.createOrReplaceTempView('view_name') | Register DataFrame as a temporary SQL view |
| 2 | spark.sql('SELECT * FROM view_name') | Run SQL query on the registered view |
| 3 | result_df.show() | Display the query results |
| 4 | result_df.collect() | Collect results to driver as list (use carefully) |
Key Takeaways
Register your DataFrame as a temporary view before running SQL queries.
Use spark.sql() to run SQL queries and get a Spark DataFrame as result.
Check your SQL syntax and view names carefully to avoid errors.
The result of spark.sql() is a Spark DataFrame, not a pandas DataFrame.
Use show() to quickly display query results in PySpark.